Scrapy as a Library in a Long-Running Process

There’s no denying that Scrapy is one of the best crawler frameworks for Python, if not the best. Scrapy’s developers designed it to run as a standalone process, i.e. in a separate instance from our main program, with a database used for communication between the two.

Sometimes, we want to integrate Scrapy into our own framework. It’s a little bit tricky because that’s not what Scrapy was designed for. The documentation has some information on how to run Scrapy from our own script: http://doc.scrapy.org/en/latest/topics/practices.html.

That’s sufficient if our goal is to use Scrapy from our own script for a one-time crawl, i.e. execute the script, retrieve the result, then terminate. However, it won’t work if our goal is to run Scrapy from our own long-running worker instance. After the first execution, it will raise a twisted.internet.error.ReactorNotRestartable error, because Twisted’s reactor cannot be restarted once it has been stopped.
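For concreteness, here’s a minimal sketch of the failure mode, assuming we have a spider class called WebSpider (the same placeholder used later in this post):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
process.crawl(WebSpider)  # WebSpider is a placeholder for one of our own spiders
process.start()           # first crawl: works, blocks until finished

process.crawl(WebSpider)
process.start()           # second crawl: raises twisted.internet.error.ReactorNotRestartable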

There’s a workaround for this, and in my experience it’s quite robust: I haven’t encountered a single failure since I implemented it. The workaround is to use the multiprocessing module from Python’s standard library, so that each crawl runs in a fresh process with its own reactor.

Let’s define our crawler module. I’ll assume that we already have our spiders defined.

from scrapy import signals
from scrapy.crawler import CrawlerProcess, Crawler
from scrapy.settings import Settings


class CustomCrawler(object):

    def __init__(self, settings=None):
        # Use the provided Scrapy settings, or fall back to the defaults
        self.settings = settings or Settings()

    def crawl(self, spider):
        crawled_items = []

        def add_item(item):
            crawled_items.append(item)

        process = CrawlerProcess()

        crawler = Crawler(spider, self.settings)
        # Each time an item is scraped, the item_scraped signal fires
        # and add_item appends the item to our list
        crawler.signals.connect(add_item, signals.item_scraped)
        process.crawl(crawler)

        # Blocks until the crawl is finished
        process.start()

        return crawled_items

Here, we’re creating the CustomCrawler just as advised by the official Scrapy documentation. We use Scrapy’s signals to collect the crawled items: each time an item is scraped, Scrapy sends an item_scraped signal; our handler catches it and appends the item to a list. By the end of the method, that list contains every item we’ve crawled.
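For a one-shot script, using it would look something like this (WebSpider again being a placeholder for one of our own spider classes):

# One-shot usage sketch: crawl once, print the result, then exit
if __name__ == '__main__':
    crawler = CustomCrawler()
    items = crawler.crawl(WebSpider)
    print('Scraped {} items'.format(len(items)))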

This alone will suffice if our goal is a one-time, crawl-and-terminate job. To make it work inside a long-running worker, however, we have to wrap the CustomCrawler class with Python’s multiprocessing library.

import multiprocessing as mp


def crawl():
    def _crawl(queue):
        crawler = CustomCrawler()
        # Assume we have a spider class called: WebSpider
        res = crawler.crawl(WebSpider)
        queue.put(res)

    q = mp.Queue()
    # Run the crawl in a separate process, so the Twisted reactor is
    # created and torn down inside that process, not in our worker
    p = mp.Process(target=_crawl, args=(q,))
    p.start()
    # Read the result before joining; see the note below
    res = q.get()
    p.join()

    return res

One thing to note though: we have to call res = q.get() before p.join(). Otherwise the join can hang, because a child process that has put items on a multiprocessing.Queue won’t exit until all of that data has been flushed to the underlying pipe (see the “Joining processes that use queues” note in the multiprocessing docs). There’s more discussion in this Stackoverflow thread: http://stackoverflow.com/questions/35810024/python-multiprocessing-queue-behavior
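To see the behavior in isolation, here’s a small self-contained sketch (no Scrapy involved; the function name is just for illustration). If we swapped the q.get() and p.join() lines, the join could block forever once the payload is big enough to fill the pipe buffer:

import multiprocessing as mp


def _producer(queue):
    # A payload too large to fit in the pipe buffer in one go
    queue.put(list(range(100000)))


if __name__ == '__main__':
    q = mp.Queue()
    p = mp.Process(target=_producer, args=(q,))
    p.start()
    data = q.get()    # drain the queue first...
    p.join()          # ...then joining is safe
    print(len(data))  # 100000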

Having wrapped our CustomCrawler class, we can then use it inside our long-running worker.

import time


while True:
    items = crawl()
    # do something with crawled items ...
    time.sleep(3600)  # wait an hour before the next crawl

And, that’s it! Now our worker will do the crawling job periodically without ever terminating!