At namR, we gather publicly available, non-personal data and use our AI tools to build a Digital Twin of France. In that regard, we need a robust and scalable scraping infrastructure. Until recently, we were using Scrapy but, reaching its limits, we started developing our own solution, based on Celery.
Distributed web scraper : Scraper, Scraping, Scrapy
A scraper is a program whose goal is to extract data by automated means from a format not intended to be machine-readable, such as a screenshot or a formatted web page. Scripting languages are mostly used for scraping, and in Python the most well-known library is Scrapy.
Scrapy describes itself as “An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.” and if your goal is exactly that “extracting data from website” and nothing more, Scrapy may be your best option. In a few lines of codes, you can have a working scraper.
You will reach its limits if you try to do something more complex, but you should not reinvent the wheel and if Scrapy works for you, you can even look at Frontera to scale your scrapers.
Why develop a custom solution?
Let’s use an example, you need to retrieve the nearest and best-rated sushi restaurants around millions of addresses, but the website from where you want to scrap this information only let you list every restaurants in a city. You will have to:
Scrap the location and rating of every restaurants in the cities present in your list of addresses
Compute the nearest and best-rated restaurants
Download the complete information for the selected restaurants (we don’t want to download everything on the first step as it would take too much time, disk space).
The first step can be easily achieved with Scrapy. You’ll hit the first obstacle when trying to do the computation. You could try to hack it in the middle of a Scrapy spider but trying to match millions of addresses with thousand of restaurants based on geospatial computations in Python is going to take a lot of time and any error means you’ll lose everything as scraped data do not persist between Scrapy runs. Worse, if you chose to use Frontera, you also need to synchronize scraped data across multiple instances.
At this point, it seems easier to build ”from scratch” a custom distributed web scraper from scratch than trying to hack your needs into existing products.
We need our scraper to be resilient because webpages are always changing, servers can crash, be overloaded and you don’t want a random error to halt the whole scraping. After the first scraping, we want to reuse the scraped data to speed up the following run. Obviously, it needs to be scalable, easy to monitor and finally we want its code base to be maintainable and fully tested.
nam.R is a data & deep-tech company, specialized in geolocated intelligence, building a digital representation of the physical world. nam.R aims at providing easy-to-use and actionable data to public and private organisations to massify and optimise their actions, investments and projects.
Strictly Necessary Cookies
Strictly Necessary Cookie should be enabled at all times so that we can save your preferences for cookie settings.
If you disable this cookie, we will not be able to save your preferences. This means that every time you visit this website you will need to enable or disable cookies again.