python - Scrapy: crawl multiple spiders sharing same items, pipeline, and settings but with separate outputs


I am trying to run multiple spiders from a Python script, based on the code provided in the official documentation. My Scrapy project contains multiple spiders (spider1, spider2, etc.) that crawl different websites and save the content of each website in a different JSON file (output1.json, output2.json, etc.).

The items collected on the different websites share the same structure, so the spiders use the same item, pipeline, and settings classes. The output is generated by a custom JSON class in the pipeline.

When I run the spiders separately they work as expected, but when I use the script below to run the spiders through the Scrapy API, the items get mixed in the pipeline. output1.json should contain only the items crawled by spider1, but it also contains items from spider2. How can I crawl multiple spiders with the Scrapy API using the same items, pipeline, and settings, but generating separate outputs?

Here is the code I used to run multiple spiders:

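import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from web_crawler.spiders.spider1 import Spider1
from web_crawler.spiders.spider2 import Spider2

settings = get_project_settings()
process = CrawlerProcess(settings)

# Both crawls are scheduled in the same process and run concurrently
# once start() is called.
process.crawl(Spider1)
process.crawl(Spider2)
process.start()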

Example of output1.json:

{"name": "thomas", "source": "spider1"}
{"name": "paul", "source": "spider2"}
{"name": "nina", "source": "spider1"}

Example of output2.json:

{"name": "sergio", "source": "spider1"}
{"name": "david", "source": "spider1"}
{"name": "james", "source": "spider2"}

Normally, the names crawled by spider1 ("source": "spider1") should be in output1.json, and the names crawled by spider2 ("source": "spider2") should be in output2.json.

Thank you for your help!
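One way to check whether an output file is mixed is to count the items per "source" value. The sketch below is only an illustration: the helper name count_sources is hypothetical, and it assumes each output file holds one valid JSON object per line.

```python
import json
from collections import Counter


def count_sources(path):
    """Count items per "source" value in a file with one JSON object per line.

    Hypothetical helper, not part of Scrapy.
    """
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                counts[json.loads(line)["source"]] += 1
    return counts
```

For a correctly separated file, count_sources("output1.json") should return a counter with a single key.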

According to the docs, to run spiders sequentially in the same process you must chain the deferreds.

Try this:

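import scrapy
from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from web_crawler.spiders.spider1 import Spider1
from web_crawler.spiders.spider2 import Spider2

settings = get_project_settings()
runner = CrawlerRunner(settings)

@defer.inlineCallbacks
def crawl():
    # Each yield waits for that crawl to finish before the next one starts.
    yield runner.crawl(Spider1)
    yield runner.crawl(Spider2)
    reactor.stop()

crawl()
reactor.run()  # blocks until reactor.stop() is called after the last crawl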
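If the mixing comes from an output file that is shared between spiders inside the pipeline, another option might be to key the output on the spider itself. The following is a minimal sketch, not the asker's actual pipeline: the class name PerSpiderJsonPipeline and the file naming scheme (spider name plus ".jsonl") are assumptions made for illustration. It relies only on the open_spider, close_spider, and process_item pipeline hooks, each of which receives the spider.

```python
import json


class PerSpiderJsonPipeline:
    """Hypothetical pipeline: writes one JSON Lines file per spider."""

    def __init__(self):
        # Maps spider name -> open file handle, so items never share a file.
        self.files = {}

    def open_spider(self, spider):
        # Assumption: the output file is named after the spider.
        path = f"{spider.name}.jsonl"
        self.files[spider.name] = open(path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Close and forget the handle that belongs to this spider only.
        self.files.pop(spider.name).close()

    def process_item(self, item, spider):
        # Pick the file by the spider that produced the item.
        line = json.dumps(dict(item)) + "\n"
        self.files[spider.name].write(line)
        return item
```

Because the file is chosen by spider.name on every item, this should keep the outputs separate even when both spiders run at the same time with the same settings.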
