Scrapy not only provides the scrapy crawl <spider> command to start a spider, it also offers an API for starting spiders from a script.
Because Scrapy is built on the Twisted asynchronous networking library, the crawl has to run inside the Twisted reactor.
Two APIs can run spiders: scrapy.crawler.CrawlerProcess and scrapy.crawler.CrawlerRunner.
scrapy.crawler.CrawlerProcess
This class starts the Twisted reactor internally, configures logging, and shuts the reactor down automatically when the crawl finishes; it is the class used by all Scrapy commands.
Example: running a single spider
import scrapy

class QiushispiderSpider(scrapy.Spider):
    name = 'qiushiSpider'
    # allowed_domains = ['qiushibaike.com']
    start_urls = ['https://tianqi.2345.com/']

    def start_requests(self):
        return [scrapy.Request(url=self.start_urls[0], callback=self.parse)]

    def parse(self, response):
        print('proxy simida')

if __name__ == '__main__':
    from scrapy.crawler import CrawlerProcess
    process = CrawlerProcess()
    process.crawl(QiushispiderSpider)  # or the spider name 'qiushiSpider'
    process.start()
The argument to process.crawl() can be either the spider name 'qiushiSpider' or the spider class QiushispiderSpider (resolving a name requires the project settings, as shown below).
A crawl started this way does not use the project's settings file, which the log confirms:
2019-05-27 14:39:57 [scrapy.crawler] INFO: Overridden settings: {}
Loading the project settings
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl(QiushispiderSpider)  # 'qiushiSpider'
process.start()
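With the project settings loaded, the spider-name form of process.crawl() mentioned above also works, because the spider loader can now resolve 'qiushiSpider' through the project's SPIDER_MODULES setting. A minimal sketch, assuming the script runs inside the Scrapy project:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('qiushiSpider')  # name resolved by the project's spider loader
process.start()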
Running multiple spiders
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    ...

class MySpider2(scrapy.Spider):
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()
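CrawlerProcess also accepts a plain dict of settings, which is convenient for scripts that live outside a Scrapy project. A minimal sketch; the keys are standard Scrapy settings, but the values here are only illustrative:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    'LOG_LEVEL': 'INFO',       # illustrative value
    'CONCURRENT_REQUESTS': 8,  # illustrative value
})
process.crawl(MySpider1)  # MySpider1 as defined above
process.start()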
scrapy.crawler.CrawlerRunner
- Finer-grained control over how the crawl runs
- The Twisted reactor is started and stopped explicitly by your script
- Callbacks must be attached to the Deferred returned by CrawlerRunner.crawl
Example: running a single spider
import scrapy

class QiushispiderSpider(scrapy.Spider):
    name = 'qiushiSpider'
    # allowed_domains = ['qiushibaike.com']
    start_urls = ['https://tianqi.2345.com/']

    def start_requests(self):
        return [scrapy.Request(url=self.start_urls[0], callback=self.parse)]

    def parse(self, response):
        print('proxy simida')

if __name__ == '__main__':
    # test CrawlerRunner
    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from scrapy.utils.project import get_project_settings

    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    runner = CrawlerRunner(get_project_settings())
    d = runner.crawl(QiushispiderSpider)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()  # the script will block here until the crawling is finished
configure_logging sets the log output format; addBoth attaches a callback that stops the reactor once the crawl's Deferred fires, whether it succeeded or failed.
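To react differently to success and failure before stopping the reactor, Twisted's addCallbacks can be used ahead of the addBoth. A hedged sketch continuing the example above; the handler names are illustrative, not part of the Scrapy API:

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner

runner = CrawlerRunner()
d = runner.crawl(QiushispiderSpider)  # the spider class defined above

def on_success(result):
    print('crawl finished cleanly')

def on_failure(failure):
    print('crawl failed:', failure)

d.addCallbacks(on_success, on_failure)
d.addBoth(lambda _: reactor.stop())  # stop the reactor in either case
reactor.run()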
Running multiple spiders
import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    ...

class MySpider2(scrapy.Spider):
    ...

configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()  # Deferred that fires once every scheduled crawl has finished
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until all crawling jobs are finished
The crawls can also be chained in coroutine style with inlineCallbacks, so the spiders run one after another:
import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    ...

class MySpider2(scrapy.Spider):
    ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl call is finished
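Each yield waits for the Deferred returned by runner.crawl() to fire, so MySpider2 only starts after MySpider1 has finished. The runner.join() version above, by contrast, starts both spiders at once and waits for all of them to complete.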