
Developing a distributed web scraper using Celery

At nam.R, we gather publicly available, non-personal data and use our AI tools to build a Digital Twin of France. This requires a robust and scalable scraping infrastructure. Until recently we were using Scrapy, but after reaching its limits we started developing our own solution, based on Celery.


Distributed web scraper: Scraper, Scraping, Scrapy

A scraper is a program whose goal is to extract, by automated means, data from a format not intended to be machine-readable, such as a screenshot or a formatted web page. Scraping is mostly done with scripting languages, and in Python the best-known library is Scrapy.

Scrapy describes itself as “An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.” If your goal is exactly that, extracting data from websites and nothing more, Scrapy may be your best option: in a few lines of code, you can have a working scraper.

 

Figure: a working example to scrape our blog
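A spider along those lines might look like the following sketch; the start URL and CSS selectors are illustrative assumptions, not the exact markup of our blog:

```python
import scrapy


class BlogSpider(scrapy.Spider):
    """Minimal spider that crawls a blog listing and yields post titles and links."""

    name = "namr_blog"
    start_urls = ["https://namr.com/blog/"]  # assumed listing URL

    def parse(self, response):
        # Extract each post's title and URL; the CSS selectors are placeholders.
        for post in response.css("article"):
            yield {
                "title": post.css("h2 a::text").get(),
                "url": post.css("h2 a::attr(href)").get(),
            }
        # Follow pagination if the page exposes a "next" link.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Running it with scrapy runspider blog_spider.py -o posts.json is enough to get a JSON file of results.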

You will reach its limits if you try to do something more complex, but there is no need to reinvent the wheel: if Scrapy works for you, you can even look at Frontera to scale your scrapers.

Let’s use an example: you need to retrieve the nearest and best-rated sushi restaurants around millions of addresses, but the website from which you want to scrape this information only lets you list all the restaurants in a city. You will have to:

  1. Scrape the location and rating of every restaurant in the cities present in your list of addresses
  2. Compute the nearest and best-rated restaurants
  3. Download the complete information for the selected restaurants only (downloading everything in the first step would take too much time and disk space).

The first step can easily be achieved with Scrapy. You will hit the first obstacle when trying to do the computation. You could try to hack it into the middle of a Scrapy spider, but matching millions of addresses against thousands of restaurants with geospatial computations in Python is going to take a lot of time, and any error means you lose everything, as scraped data do not persist between Scrapy runs. Worse, if you choose to use Frontera, you also need to synchronize scraped data across multiple instances.
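To give an idea of what step 2 can look like, here is a minimal sketch of the nearest-neighbour matching once addresses and restaurants are geocoded; the scikit-learn BallTree and the sample coordinates are illustrative choices, not a prescribed implementation:

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371.0

# Illustrative coordinates: replace with your geocoded addresses and the
# (lat, lon) of the restaurants scraped in step 1.
rng = np.random.default_rng(0)
address_coords = rng.uniform([48.8, 2.2], [48.9, 2.4], size=(100_000, 2))
restaurant_coords = rng.uniform([48.8, 2.2], [48.9, 2.4], size=(5_000, 2))

# A haversine BallTree answers "k nearest restaurants" for every address
# without building the full addresses x restaurants distance matrix.
tree = BallTree(np.radians(restaurant_coords), metric="haversine")
distances, indices = tree.query(np.radians(address_coords), k=5)
distances_km = distances * EARTH_RADIUS_KM  # tree distances are in radians
```

Even with an index like this, the point stands: the computation lives outside the spider, and the scraped data has to survive between runs for the approach to be practical.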

At this point, it seems easier to build a custom distributed web scraper from scratch than to hack your needs into existing products.

We need our scraper to be resilient: webpages are always changing, servers can crash or be overloaded, and you don’t want a random error to halt the whole scraping run. After the first scraping, we want to reuse the scraped data to speed up the following runs. Obviously, it needs to be scalable and easy to monitor, and finally we want its code base to be maintainable and fully tested.
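As a rough sketch of how this can be expressed with Celery, a per-city fetch task with automatic retries could look like the following; the broker URL and the listing endpoint are placeholders, not our actual setup:

```python
import requests
from celery import Celery

# Assumed broker and result backend; swap in your own Redis or RabbitMQ URLs.
app = Celery(
    "scraper",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)


@app.task(
    autoretry_for=(requests.RequestException,),
    retry_backoff=True,   # exponential backoff between attempts
    max_retries=5,
)
def fetch_city_listing(city):
    """Fetch the restaurant listing for one city.

    Transient network errors are retried by Celery, so a flaky server
    does not halt the whole scraping run.
    """
    response = requests.get(
        "https://example.com/restaurants",  # hypothetical listing endpoint
        params={"city": city},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```

Fanning out over thousands of cities then becomes group(fetch_city_listing.s(city) for city in cities).apply_async(), results are persisted in the backend rather than in the spider’s memory, and monitoring comes almost for free with tools such as Flower.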

 


about nam.R

nam.R is a data & deep-tech company specialized in geolocated intelligence, building a digital representation of the physical world. nam.R aims to provide easy-to-use and actionable data to public and private organisations to scale up and optimise their actions, investments and projects.
