<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <link href="https://friendlybit.com/feed/" rel="self" type="application/atom+xml" />
    <link href="https://friendlybit.com/" rel="alternate" type="text/html" />
    <updated>2012-01-08T22:07:29+01:00</updated>
    <id>https://friendlybit.com</id>
    <title type="html">Friendly Bit - Web development blog</title>
    <subtitle>Friendly Bit is a blog by Emil Stenström, a Swedish web developer who occasionally gets ideas on how to improve the internet.</subtitle>
    
        <entry>
            <title type="html">What movies on Piratebay will you like the most?</title>
            <link href="http://friendlybit.com/python/what-movies-on-piratebay-will-you-like-the-most/" rel="alternate" type="text/html" title="What movies on Piratebay will you like the most?" />
            <published>2012-01-08T22:07:29+01:00</published>
            <updated>2012-01-08T22:07:29+01:00</updated>
            <id>http://friendlybit.com/python/what-movies-on-piratebay-will-you-like-the-most/</id>
            <author>
                <name>Emil Stenström</name>
            </author>
            <summary type="text">Christmas, and the weeks thereafter, are times for coding. And I&#39;ve been playing around with piratebay and filmtipset (a Swedish movie recommendation) a...</summary>
            <content type="html" xml:base="http://friendlybit.com/python/what-movies-on-piratebay-will-you-like-the-most/">
                &lt;p&gt;Christmas, and the weeks thereafter, are times for coding. I&#39;ve been playing around with &lt;a href=&#34;http://thepiratebay.org&#34;&gt;piratebay&lt;/a&gt; and &lt;a href=&#34;http://filmtipset.se&#34;&gt;filmtipset&lt;/a&gt; (a Swedish movie recommendation site) a little bit. I just pushed the result to the &lt;a href=&#34;https://github.com/EmilStenstrom/filmtipset-piratebay&#34;&gt;filmtipset-piratebay project on GitHub&lt;/a&gt;, if you want to take a look.&lt;/p&gt;
&lt;h2 id=&#34;css-for-screen-scraping&#34;&gt;CSS for screen scraping&lt;a href=&#34;#css-for-screen-scraping&#34;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The script uses &lt;strong&gt;CSS selectors for screen scraping&lt;/strong&gt;, something that works extremely well:&lt;/p&gt;
&lt;div class=&#34;highlight&#34; data-language=&#34;PYTHON&#34;&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;nn&#34;&gt;lxml&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;html&lt;/span&gt;
&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;nn&#34;&gt;requests&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;response&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;requests&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;quot;http://thepiratebay.org/browse/207/0/7&amp;quot;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;document&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;html&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;document_fromstring&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;response&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;content&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;n&#34;&gt;links&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;document&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;cssselect&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;quot;.detLink&amp;quot;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;link&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text_content&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;link&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;links&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Note: You need &lt;a href=&#34;http://lxml.de/&#34;&gt;lxml&lt;/a&gt; and &lt;a href=&#34;http://docs.python-requests.org&#34;&gt;requests&lt;/a&gt; to run the above example. Newer lxml versions also need the separate cssselect package for the cssselect() call.&lt;/p&gt;
&lt;p&gt;Saving the above snippet to a .py file and running it will give you a list of all torrents on the given URL. Play around with the CSS selector to get some other data from the page.&lt;/p&gt;
&lt;h2 id=&#34;extracting-movie-titles-from-torrent-names&#34;&gt;Extracting movie titles from torrent names&lt;a href=&#34;#extracting-movie-titles-from-torrent-names&#34;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;It&#39;s surprisingly easy to convert torrent names to movie titles. Just follow this simple algorithm:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Split the torrent name into words by treating all non-alphanumeric characters as spaces.&lt;/li&gt;
&lt;li&gt;Loop over the remaining words, and look for a predefined set of &amp;quot;torrent endings&amp;quot;.&lt;/li&gt;
&lt;li&gt;When you find an ending, cut the name from there.&lt;/li&gt;
&lt;li&gt;(Optional) Remove the year if there is one at the end of the remaining string.&lt;/li&gt;
&lt;li&gt;(Optional) Remove all movies which really are bundles of movies, and not single movies. This is easily done by looking for a set of common strings such as &amp;quot;trilogy&amp;quot; and &amp;quot;series&amp;quot;.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Result: &amp;quot;Real.Steel.2011.720p.BluRay.x264-REFiNED&amp;quot; -&amp;gt; &amp;quot;Real Steel&amp;quot;&lt;/p&gt;
&lt;p&gt;You can find my movie title finder implementation in parse.py on GitHub.&lt;/p&gt;
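&lt;p&gt;To make the five steps above concrete, here is a minimal sketch of them. It is not the project&#39;s parse.py: the function name title_from_torrent and the two word sets are made up for illustration, and the real list of endings is surely longer.&lt;/p&gt;

```python
import re

# Hypothetical "torrent endings"; the project's real list is not shown in this post.
ENDINGS = {"720p", "1080p", "dvdrip", "brrip", "bluray", "xvid", "x264", "hdtv"}
# Hypothetical markers for bundles of several movies.
BUNDLE_WORDS = {"trilogy", "series", "collection"}


def title_from_torrent(name):
    """Return a movie title guessed from a torrent name, or None for bundles."""
    # Step 1: treat every non-alphanumeric character as a space.
    words = [w for w in re.split(r"[^0-9A-Za-z]+", name) if w]

    # Step 5 (optional): checked up front so bundles are skipped early.
    if any(w.lower() in BUNDLE_WORDS for w in words):
        return None

    # Steps 2 and 3: cut the name at the first known ending.
    for i, word in enumerate(words):
        if word.lower() in ENDINGS:
            words = words[:i]
            break

    # Step 4 (optional): drop a trailing four-digit year, but never the only word.
    if len(words) > 1 and re.fullmatch(r"(19|20)\d\d", words[-1]):
        words = words[:-1]

    return " ".join(words)


print(title_from_torrent("Real.Steel.2011.720p.BluRay.x264-REFiNED"))
```

&lt;p&gt;Running it on the example name above prints &amp;quot;Real Steel&amp;quot;. Keeping the endings in a plain set makes it easy to add new ones as you run into torrents the sketch cuts in the wrong place.&lt;/p&gt;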
&lt;h2 id=&#34;cache-all-http-request&#34;&gt;Cache all HTTP requests&lt;a href=&#34;#cache-all-http-request&#34;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;To save time and to be nice to the services we&#39;re querying, the script caches all HTTP requests for a number of days. I do this by simply saving the returned HTML/JSON to a file, and checking the file system for that file before making a new request. Saving the raw HTML/JSON, rather than the processed result, makes it possible to experiment with the parsing without having to wait for new responses from the server.&lt;/p&gt;
&lt;p&gt;My caching implementation is of course also on GitHub.&lt;/p&gt;
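&lt;p&gt;As a rough illustration of the idea, and not a copy of the code on GitHub, a file-based cache could look something like the sketch below. The names cached_get, cache_path, fetch_uncached, the cache directory and the seven-day limit are all assumptions, and it uses the standard library&#39;s urllib instead of requests to stay self-contained.&lt;/p&gt;

```python
import hashlib
import os
import time
import urllib.request

CACHE_DIR = "cache"  # hypothetical directory name
MAX_AGE_DAYS = 7     # hypothetical; the post only says "a number of days"


def cache_path(url, cache_dir=CACHE_DIR):
    """Map a URL to a file name that is safe on any file system."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, digest)


def fetch_uncached(url):
    """Download the raw response body (HTML or JSON) as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def cached_get(url, fetch=fetch_uncached, cache_dir=CACHE_DIR,
               max_age_days=MAX_AGE_DAYS):
    """Return the raw body for url, reusing a saved copy while it is fresh."""
    path = cache_path(url, cache_dir)
    max_age_seconds = max_age_days * 24 * 60 * 60

    # Reuse the file only if it exists and is younger than the age limit.
    if os.path.exists(path):
        age_seconds = time.time() - os.path.getmtime(path)
        if max_age_seconds > age_seconds:
            with open(path, "rb") as f:
                return f.read()

    # Otherwise make the request and save the unprocessed body for next time.
    body = fetch(url)
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "wb") as f:
        f.write(body)
    return body
```

&lt;p&gt;Calling cached_get with the same URL twice only hits the network the first time. Because the file holds the untouched response body, you can change the parsing code and rerun it against the saved copy as often as you like.&lt;/p&gt;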
&lt;p&gt;***&lt;/p&gt;
&lt;p&gt;All in all, this has been a fun little project, and I&#39;ve learned a lot. But I&#39;m sure we can make this even better. Feel free to send pull requests!&lt;/p&gt;

            </content>
        </entry>
    
</feed>