Scrapy state between job runs

You can find the source code here. It is available as part of the scrapy-state project.

You can install scrapy-state from PyPI:

python3 -m pip install -U scrapy-state

You should also have Scrapy installed.

This is inspired by scrapy.extensions.spiderstate.SpiderState. I will quote the relevant documentation at the end. First, I want to focus on my use case.

I am scraping a big site, and it takes many dozens of hours to finish a logical chunk of scraping. There are also sporadic failures. I have a retry mechanism, but sometimes it takes anywhere from a couple of hours to a couple of days until the site is fixed (there are also some SSL-related and other issues that retrying does solve).

So, my requirements are:

  • I’m scraping dozens of GB of data (1 unit gives only 2.5 GB of temporary storage and only 1 GB of RAM), so I have to stream the items while scraping. More on this in a separate story.
  • Because of sporadic failures in the long-running job, I want to know at which point it failed, and I want to be able to resume from the point of failure.

In my case, I’m actually calling a REST API that has a datetime parameter. To reiterate: I make one call with some t with a granularity of seconds, retrieve the data (this doesn’t take a lot of time, but it may fail), send it as an item to my Pipeline, and then make another call with t+1 second (all other parameters are the same).
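The call pattern can be sketched as a generator that walks forward one second at a time. This is only a minimal illustration, not the spider's actual code: the name iter_timestamps and the ISO 8601 format of the starting value are assumptions.

```python
from datetime import datetime, timedelta
from itertools import islice


def iter_timestamps(start, step=timedelta(seconds=1)):
    """Yield start, start + step, start + 2 * step, ... indefinitely."""
    current = start
    while True:
        yield current
        current += step


# Hypothetical starting point; this value plays the role of t.
start = datetime.fromisoformat("2020-01-01T00:00:00")
first_three = [t.isoformat() for t in islice(iter_timestamps(start), 3)]
print(first_three)
```

Each yielded value would be used for one API call; everything else about the request stays the same.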

So, my solution is the following.

  • I’m logging this datetime parameter constantly. I also use a special attribute on my spider, called state, that holds it. In effect, t is stored as the state.
  • On spider start, the initial value of state is taken from the ‘STATE’ setting (or from -s STATE=… on the command line).
  • On spider close, the state attribute is printed to the log in any case.
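Logging the state so that it is easy to find later could look roughly like this, using only the standard logging module. The logger name spider.state and the STATE= prefix are assumptions made for the example, and the in-memory stream is there only to keep the sketch self-contained.

```python
import io
import logging

# Hypothetical logger name; any name that is easy to search for works.
state_logger = logging.getLogger("spider.state")
state_logger.setLevel(logging.INFO)
state_logger.propagate = False

# An in-memory stream keeps this sketch self-contained;
# a real job would write to the job log instead.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(name)s %(message)s"))
state_logger.addHandler(handler)


def log_state(state):
    """Write the current state with a fixed, searchable prefix."""
    state_logger.info("STATE=%s", state)


log_state("2020-01-01T00:00:00")
log_state("2020-01-01T00:00:01")
print(stream.getvalue())
```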

So, on the first run, I put the initial value in the settings and run the job. If there is a failure, I go to the log file to figure out what happened (usually, this means looking at the exception) and at which point in time it happened.

If it looks like a problem on the scraped site itself (for example, I’m receiving code 500 while the endpoint and all its parameters look fine), I create a new job with -s STATE=... and run it again. If it fails again with the same exception, but everything works when I run with a different ‘STATE’ value, I just wait for a few hours/days. Then I rerun it and, hopefully, the issue is fixed and I can continue scraping.
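Finding the point of failure in the log can be automated with a small helper. The sketch below assumes that every state entry is logged as STATE=<value>; the log lines are made-up samples, and the function name last_logged_state is hypothetical.

```python
import re

# Assumed shape of a state entry: "... STATE=<value>".
STATE_PATTERN = re.compile(r"STATE=(\S+)")


def last_logged_state(log_text):
    """Return the value of the last STATE=... entry, or None if there is none."""
    matches = STATE_PATTERN.findall(log_text)
    return matches[-1] if matches else None


# Made-up sample log, for illustration only.
sample_log = """\
[spider.state] INFO: STATE=2020-01-01T00:00:00
[spider.state] INFO: STATE=2020-01-01T00:00:01
[myspider] ERROR: request failed
"""
resume_from = last_logged_state(sample_log)
print(resume_from)
```

The returned value is what would be passed as -s STATE=... to the next job.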

***

In order to make all of this work, you should:

  1. In the settings, set STATE=… to some initial value and/or use -s STATE=… from the command line.
  2. In the settings, add alexber.spiderstate.state.SpiderSettingsState to the EXTENSIONS dict (with some priority). This class will put the value from step 1 into the spider’s STATE attribute.
  3. Add logs for the state immediately before and/or after the call to the API/scraping the site. I’m using a dedicated logger for this in order to easily find the output, but you may just add some unique prefix/suffix instead.
  4. Add a spider_closed() hook on the spider that will log the STATE. This is helpful if you have an exception in some unexpected place and want to know the last known STATE value.
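Steps 1 and 2 might translate into a settings fragment like the one below. The STATE value, its format, and the priority 500 are assumptions chosen for the example; only the extension path comes from the steps above.

```python
# settings.py (fragment)

# Step 1: the initial state. The value and its format are hypothetical;
# it can also be overridden from the command line with -s STATE=...
STATE = "2020-01-01T00:00:00"

# Step 2: register the extension. The priority 500 is an arbitrary choice.
EXTENSIONS = {
    "alexber.spiderstate.state.SpiderSettingsState": 500,
}
```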

Note: alexber.spiderstate.state.SpiderSettingsState is highly customizable. For example, you can automate the whole process described above and automatically evaluate the new STATE based on your last failure. You can also retrieve your last item from the persistent storage you’re using (S3, for example).

What are the supported solutions, and why am I not using them?

  1. scrapy.extensions.spiderstate.SpiderState
  2. Syncing your .scrapy folder to an S3 bucket using DotScrapy Persistence
  3. DeltaFetch in Scrapy Cloud

Let’s start with scrapy.extensions.spiderstate.SpiderState

  1. scrapy.extensions.spiderstate.SpiderState

Quote from the documentation:

Jobs: pausing and resuming crawls

Sometimes, for big sites, it’s desirable to pause crawls and be able to resume them later.

Scrapy supports this functionality out of the box by providing the following facilities:

* a scheduler that persists scheduled requests on disk

* a duplicates filter that persists visited requests on disk

* an extension that keeps some spider state (key/value pairs) persistent between batches

Job directory

To enable persistence support you just need to define a job directory through the JOBDIR setting. This directory will be for storing all required data to keep the state of a single job (i.e. a spider run). It’s important to note that this directory must not be shared by different spiders, or even different jobs/runs of the same spider, as it’s meant to be used for storing the state of a single job.

https://docs.scrapy.org/en/latest/topics/jobs.html?highlight=state#jobs-pausing-and-resuming-crawls

As you can see, scrapy.extensions.spiderstate.SpiderState is suitable if you want to pause crawls and be able to resume them later. Moreover, it is explicitly stated that “this [JOBDIR] directory must not be shared by different spiders, or even different jobs/runs of the same spider, as it’s meant to be used for storing the state of a single job.”

I deliberately want to reuse the state in different jobs/runs of the same spider, so this approach simply doesn’t work for me.

***

2. Syncing your .scrapy folder to an S3 bucket using DotScrapy Persistence

Quote from the documentation:

… the content of the .scrapy directory in a persistent store, which is loaded when the spider starts and saved when the spider finishes. It allows spiders to share data between different runs, keeping a state or any kind of data that needs to be persisted. For more details on the middleware, you can check the github repository: scrapy-dotpersistence.

https://support.scrapinghub.com/support/solutions/articles/22000225188-syncing-your-scrapy-folder-to-an-s3-bucket-using-dotscrapy-persistence

It is actually instructive to look at the scrapy-dotpersistence source code. You can see that it uses the AWS CLI to sync a local directory with one on S3. You can also see that the sync happens on Scrapy start-up and when it stops. The main disadvantages are:

  • The sync happens too rarely.
  • We’re using the same path on S3 per Spider both to restore the state and to save it.

I’ll expand on these points. You don’t have visibility into your state. While your spider runs, you don’t know its state until it finishes. Also, it occasionally happens that the closing callback is not called (maybe the process crashed? I don’t know, I just see that the log output stops in the middle of the run), which means that the sync to S3 does not happen in this case. Because of how long a run takes, it is not acceptable to just lose the state. If you want to sync more often, it is more convenient to use the Feed Exporter.

Starting from Scrapy 2.1, you can have multiple feed destinations: you can use one to save your scraped items and another to save the state. You will also have control over when you want to do it. Of course, you have to write some code to restore the state (just use the last file from the folder on S3), and you will waste some space on S3. These are the main disadvantages of this alternative.
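For reference, multiple destinations are configured through the FEEDS setting. The fragment below is only a sketch: the bucket names and paths are hypothetical, and how state records are kept apart from regular items is left open.

```python
# settings.py (fragment), for Scrapy 2.1 or later.
# Bucket names and paths are hypothetical.
FEEDS = {
    # Destination for the scraped items.
    "s3://example-items-bucket/%(name)s/%(time)s.jsonl": {
        "format": "jsonlines",
    },
    # Second destination, intended for the state.
    "s3://example-state-bucket/%(name)s/%(time)s.jsonl": {
        "format": "jsonlines",
    },
}
```

By default, each destination receives every exported item, so separating the state from the items needs extra work in the spider or, in newer Scrapy versions, per-feed item filtering.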

The second point is a bit complicated. As a reminder, my spider runs for a long time. Then, in production, it occasionally crashes at some point. I often rerun it in development (using the STATE to start from the last good point), make a couple of iterations and exit from the spider in a clean way (all close hooks are called in order to perform all necessary cleanup, for example, to close tmp files). I want to emphasize that I’m using the same Spider; it just has an if statement that exits the scraping loop in development (it looks at a setting that is different in development). Now, I do want to use the same path as in production to restore the STATE, but I don’t want to save it to the same location. I don’t want to save the STATE at all in development mode. Moreover, sometimes I want to start a bit earlier than the saved STATE time point, and I don’t have a clean way to do that.

One way to handle it is to use 2 different buckets: one for development and one for production. But this means that you will have to copy the STATE across the buckets. You can write some utility code to do it, but it is just a mess. I prefer not to save the STATE, but to reconstruct it either from the logs or from the items that are scraped to S3, based on the last created file. I also have the ability to explicitly specify a different STATE value.
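With an explicit STATE value, the development-mode behaviour described here might look roughly like the sketch below. A plain dict stands in for the spider's settings, and the keys DEV_MODE, DEV_MAX_ITERATIONS and REWIND_SECONDS are hypothetical names invented for this example.

```python
from datetime import datetime, timedelta


def start_point(settings):
    """Return the datetime to start from, optionally rewound by a few seconds."""
    state = datetime.fromisoformat(settings["STATE"])
    return state - timedelta(seconds=settings.get("REWIND_SECONDS", 0))


def should_stop(settings, iterations_done):
    """In development, stop after a small number of iterations."""
    if not settings.get("DEV_MODE", False):
        return False
    return iterations_done >= settings.get("DEV_MAX_ITERATIONS", 2)


# Hypothetical development settings: start 5 seconds before the saved state.
dev_settings = {"STATE": "2020-01-01T00:00:10", "DEV_MODE": True, "REWIND_SECONDS": 5}

current = start_point(dev_settings)
visited = []
while not should_stop(dev_settings, len(visited)):
    visited.append(current.isoformat())
    current += timedelta(seconds=1)
print(visited)
```

Because the loop ends normally instead of being killed, the usual close hooks still get a chance to run.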
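Picking the last created file can be sketched as a pure function over object listings. The dictionaries below mimic the Key and LastModified fields that an S3 listing returns; the keys and timestamps are made up, and how the STATE is derived from the chosen file is left out because it depends on the item format.

```python
from datetime import datetime


def newest_key(objects):
    """Return the key of the most recently modified object, or None."""
    if not objects:
        return None
    return max(objects, key=lambda obj: obj["LastModified"])["Key"]


# Made-up listing, for illustration only.
sample = [
    {"Key": "items/part-1.jsonl", "LastModified": datetime(2020, 1, 1, 0, 0, 0)},
    {"Key": "items/part-2.jsonl", "LastModified": datetime(2020, 1, 1, 0, 5, 0)},
]
latest = newest_key(sample)
print(latest)
```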

3. DeltaFetch in Scrapy Cloud

Quote from the documentation:

The purpose of this is to avoid requesting pages that have already scraped items in previous crawls of the same spider, thus producing a delta crawl containing only new items. For more details on the middleware, you can check the github repository: scrapy-deltafetch.

NOTE 1: DeltaFetch only avoids sending requests to pages that have generated scraped items before, and only if these requests were not generated from the spider’s start_urls or start_requests. Pages from where no items were directly scraped will still be crawled every time you run your spiders, so DeltaFetch addon is great for detecting new records in directories.

…To use DeltaFetch in Scrapy Cloud you’ll also need to enable scrapy-dotpersistence extension in your project’s settings.py…

…When adding the settings through Scrapinghub, please set them on Spider level. Setting it on Project level or in settings.py won’t work because Scrapinghub’s default settings are propagated on Organization level and have higher priority, but lower than Spider level settings…

…If you want to re-scrape pages, you can reset the DeltaFetch cache by adding the following setting when running a job:

DELTAFETCH_RESET = 1 (or True)

Make sure to disable it for the following runs…

https://support.scrapinghub.com/support/solutions/articles/22000221912-incremental-crawls-with-scrapy-and-deltafetch-in-scrapy-cloud

Another quote:

This is a Scrapy spider middleware to ignore requests to pages containing items seen in previous crawls of the same spider, thus producing a “delta crawl” containing only new items.

This also speeds up the crawl, by reducing the number of requests that need to be crawled, and processed (typically, item requests are the most CPU intensive).

…DeltaFetch middleware depends on Python’s bsddb3 package.

On Ubuntu/Debian, you may need to install libdb-dev if it's not installed already.

…Supported Scrapy request meta keys

deltafetch_key — used to define the lookup key for that request. by default it's Scrapy's default Request fingerprint function, but it can be changed to contain an item id, for example. This requires support from the spider, but makes the extension more efficient for sites that many URLs for the same item.

https://github.com/scrapy-plugins/scrapy-deltafetch

First of all, its setup is very cumbersome. There is no easy way to debug it if it was done wrong. The second point is that it actually stores every unique request in an embedded database system. I have many requests throughout my scraping life-cycle, and I don’t know whether I will hit the limit on RAM/disk (1 unit gives only 2.5 GB of temporary storage and only 1 GB of RAM). And there are some limitations in the mechanism itself:

NOTE 1: DeltaFetch only avoids sending requests to pages that have generated scraped items before, and only if these requests were not generated from the spider’s start_urls or start_requests. Pages from where no items were directly scraped will still be crawled every time you run your spiders, so DeltaFetch addon is great for detecting new records in directories.

https://support.scrapinghub.com/support/solutions/articles/22000221912-incremental-crawls-with-scrapy-and-deltafetch-in-scrapy-cloud

All of this makes it just too complicated to test whether it actually works for me or not (maybe the request_fingerprint that is used by default is not sufficient for me? Maybe I have to include some request headers?).

And the last point: this has the same severe limitation as scrapy-dotpersistence, it doesn’t support re-runs in a development environment (see the “I often rerun it in development (using the STATE to start from the last good point)” point above for the details).

Public API methods

  • get_extension() is a utility method to get a Spider extension. Example of intended usage:
  • get_settings_priority_name() is a utility method to get a human-readable representation of a settings priority; basically, it converts priority:int to priority:str.
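To illustrate the idea behind converting a priority from int to str, here is a standalone sketch. It is not the library's implementation: the function name priority_name is hypothetical, and the mapping copies the values Scrapy defines in scrapy.settings.SETTINGS_PRIORITIES (newer versions may define additional levels).

```python
# Values as defined in scrapy.settings.SETTINGS_PRIORITIES;
# copied here so that the sketch runs without Scrapy installed.
SETTINGS_PRIORITIES = {
    "default": 0,
    "command": 10,
    "project": 20,
    "spider": 30,
    "cmdline": 40,
}


def priority_name(priority):
    """Map a numeric priority to its name, or fall back to the number as text."""
    for name, value in SETTINGS_PRIORITIES.items():
        if value == priority:
            return name
    return str(priority)


print(priority_name(40))
print(priority_name(20))
```

A name like cmdline tells you at a glance that a STATE value came from -s STATE=… rather than from the project settings.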
