Building A Concurrent Web Scraper With Python

by zen master

Another week, and during my internet travels I stumbled upon a blog post by Aditya Bhargava titled “Building a concurrent web scraper with haskell”. I’m not a Haskell programmer and my experience with it is extremely limited. Most of the post read like a cryptic magic spell!

Anyway, it was still an interesting read and it inspired me to try my hand at writing something similar. So I reached out, and the quickest thing to hand was Python. On Linux, just fire up a terminal, drop into your favorite text editor, and away you go.

The Process

It is usually a good idea to think through and plan the basics of the solution design first. Our target is a web page, and our end result should be all of the images it links to, downloaded to disk. The target web page contains links to the images we want to scrape. Once we have a list of image links, the goal is to download them “concurrently” instead of one at a time.

A Concurrent Web Scraper Process


The entire process can be broken down into the following four distinct steps:

  • Get contents of page
  • Parse and create a list of image links
  • For each link start a new thread
  • Download a single image

Now that we have a description of each step (or “function”) we can think about each function’s name and its input and output:

  • “from page” url -> html text
  • “all links” html text -> list of links
  • “in parallel” function, list of links -> start thread
  • “download image” link -> file saved to disk

You may have noticed that the output of one function is the input of another. This has been designed so that each function can be fed into the next one. Thus we can (at least in pseudo code) describe the entire program call:

in parallel ( download image, all links ( from page ( url ) ) )
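
To make the shape of that call concrete, here is a small sketch in which every step is a stand-in stub: nothing is fetched or saved, and the return values are hard-coded assumptions. It only shows how the output of one function feeds the next. It is written for Python 3.

```python
# Stand-in stubs: each one only mimics the input/output shape of a step.
def from_page(url):
    # url -> html text (hard-coded here instead of fetched)
    return '<a href="http://example.com/a.jpg">a</a>'

def all_links(html):
    # html text -> list of links (hard-coded here instead of parsed)
    return ["http://example.com/a.jpg"]

def download_image(link):
    # link -> file saved to disk (here just a description of the action)
    return "saved " + link

def in_parallel(fn, links):
    # function, list of links -> one call per link (sequential in this stub)
    return [fn(link) for link in links]

result = in_parallel(download_image, all_links(from_page("http://example.com")))
print(result)
```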

Voila, we are all done! Erm, OK, perhaps not quite. We still need to write the code for each function.

Let's take the functions one at a time. First up is the parallel call: this function simply takes a “function” to invoke and a list of links. We iterate through the list and start a new thread for each link, passing the link to the function as a parameter.



from threading import Thread

def in_parallel(fn, l):
    # Start one thread per item; each thread calls fn(item).
    for i in l:
        Thread(target=fn, args=(i,)).start()
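
The function above starts the threads and returns immediately. If the caller needs to know when every thread has finished, one possible variation keeps the Thread objects and joins them. This is a sketch under that assumption, not part of the original program, and the name in_parallel_and_wait is hypothetical.

```python
from threading import Thread

def in_parallel_and_wait(fn, items):
    # Create one thread per item, as in_parallel does.
    threads = [Thread(target=fn, args=(item,)) for item in items]
    for t in threads:
        t.start()
    # Block until every thread has finished.
    for t in threads:
        t.join()

# Usage: each thread appends its item to a shared list.
seen = []
in_parallel_and_wait(seen.append, ["a", "b", "c"])
print(sorted(seen))
```

Because the threads may finish in any order, the example sorts the list before printing it.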


To get the HTML contents for a given URL we can use Python’s urllib module:


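import urllib  # Python 2 module; Python 3 moved urlopen to urllib.request

def from_page(u):
    # Fetch the URL and return the raw response body.
    return urllib.urlopen(u).read()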
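
For readers on Python 3, a rough equivalent might look like the sketch below. There urlopen lives in urllib.request and read() returns bytes, so a separate decoding step is needed before the text can be searched with a string regex. The helper text_from_page and its default UTF-8 encoding are assumptions for illustration.

```python
from urllib.request import urlopen

def from_page(url):
    # Returns the raw response body as bytes.
    with urlopen(url) as response:
        return response.read()

def text_from_page(url, encoding="utf-8"):
    # Hypothetical helper: decode the bytes so a str regex can search them.
    # Undecodable bytes are replaced rather than raising an error.
    return from_page(url).decode(encoding, errors="replace")
```

The bytes version is what an image download wants, while the decoded version is what the link-parsing step wants.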


Let’s now create the function that downloads each image. Each download needs a filename. We could parse one out of the link that is passed in, but being a lazy programmer I’ve decided to just create a GUID instead :)



from uuid import uuid4

def download_image(i):
    print("saving -> " + i)
    # A random GUID as the filename; the with block closes the file for us.
    with open(str(uuid4()) + ".jpg", 'wb') as f:
        f.write(from_page(i))
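
For the less lazy, parsing a filename out of the link could look something like this Python 3 sketch. The function name filename_for is hypothetical, and falling back to a GUID when the URL has no file name is an assumption about what would be sensible.

```python
import posixpath
from urllib.parse import urlparse
from uuid import uuid4

def filename_for(link):
    # Take the last path segment of the URL, ignoring any query string.
    name = posixpath.basename(urlparse(link).path)
    # Fall back to a random GUID when the URL has no usable file name.
    return name if name else str(uuid4()) + ".jpg"

print(filename_for("http://example.com/pics/cat.jpg?size=large"))
```

One trade-off: two different links can end in the same file name and overwrite each other, which the GUID approach avoids.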


Now all that is left is really the meat of the program: the task of parsing out the image links. To do this I have decided to use regex (shudder; I know, I know, you can’t parse HTML with regex). Being naughty sometimes is OK, right?



import re

def all_links(p):
    links = re.findall(r'href="([^"]+)"', p)
    # Check both substrings explicitly: `"http" and "jpg" in link` only tests "jpg".
    return [link for link in links if "http" in link and "jpg" in link]
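
For anyone who would rather not be naughty, the standard library's HTML parser can collect the href values instead. The following is a Python 3 sketch, not the approach this post uses. The names LinkCollector and all_links_parsed are hypothetical, and the filter is the same substring check as above.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)

def all_links_parsed(html):
    collector = LinkCollector()
    collector.feed(html)
    return [l for l in collector.links if "http" in l and "jpg" in l]

sample = ('<a href="http://example.com/a.jpg">a</a>'
          "<a href='http://example.com/b.jpg'>b</a>"
          '<a href="/about">c</a>')
print(all_links_parsed(sample))
```

Unlike the regex, which only matches double-quoted attributes, the parser also picks up the single-quoted href in the sample.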


Finally, now that all the functions have been created, we can call them and enjoy the images being downloaded concurrently, and what better web page than reddit’s /r/pics?



in_parallel(download_image, all_links(from_page("http://www.reddit.com/r/pics")))


Conclusion

Programming in Python is fun because the code almost flows in the way you imagine it! Naturally, this simple example is most certainly not production ready. However, the basics are the same. Extending this example would include limiting the number of concurrent threads, adding error handling and perhaps recursively following links on the website. Something for the reader to try out :)
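
As a head start on two of those extensions, capping the number of threads and handling errors, here is one possible Python 3 sketch using the standard library's thread pool. The name in_parallel_limited, the default of 4 workers, and the flaky stand-in function are all assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def in_parallel_limited(fn, items, max_workers=4):
    # Run fn over items with at most max_workers threads at a time.
    results, errors = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fn, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                results.append((item, future.result()))
            except Exception as exc:
                # One failed item should not stop the others.
                errors.append((item, exc))
    return results, errors

def flaky(n):
    # Stand-in for download_image: fails on one input.
    if n == 2:
        raise ValueError("bad item")
    return n * 10

results, errors = in_parallel_limited(flaky, [1, 2, 3])
print(sorted(results), [item for item, _ in errors])
```

Swapping flaky for a real download function would leave the failed links in errors, ready to retry or log.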

The full 23 lines of source code are included below. Until next time have fun Pythoning!


from threading import Thread
from uuid import uuid4
import urllib  # Python 2 module; Python 3 moved urlopen to urllib.request
import re

def in_parallel(fn, l):
    for i in l:
        Thread(target=fn, args=(i,)).start()  # one thread per link

def download_image(i):
    print("saving -> " + i)
    # A random GUID serves as the filename.
    with open(str(uuid4()) + ".jpg", 'wb') as f:
        f.write(from_page(i))

def all_links(p):
    links = re.findall(r'href="([^"]+)"', p)
    return [link for link in links if "http" in link and "jpg" in link]

def from_page(u):
    return urllib.urlopen(u).read()

in_parallel(download_image, all_links(from_page("http://www.reddit.com/r/pics")))