Create a CrawlSpider spider file:
scrapy genspider -t crawl <spider_name> <domain_to_crawl>
scrapy genspider -t crawl read https://www.dushu.com/book/1206.html
LinkExtractor is the link extractor: through it the spider knows which links to extract from a crawled page, and each extracted link automatically generates a Request object.
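As an illustration (not shown in the original post), these are the LinkExtractor arguments most commonly used to narrow what gets extracted; the XPath/CSS selectors here are hypothetical placeholders:

from scrapy.linkextractors import LinkExtractor

# allow: regular expression(s) the link URL must match
link = LinkExtractor(allow=r"/book/1206_\d+\.html")
# restrict_xpaths / restrict_css: only look for links inside these page regions (example selectors)
link = LinkExtractor(restrict_xpaths='//div[@class="pages"]')
link = LinkExtractor(restrict_css=".pages")

In the spider below, only the allow pattern is needed.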
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# The import path depends on your project package name; adjust it if yours differs.
from scarpy_readbook_41.items import ScarpyReadbook41Item


class ReadSpider(CrawlSpider):
    name = "read"
    allowed_domains = ["www.dushu.com"]
    start_urls = ["https://www.dushu.com/book/1206_1.html"]

    # The LinkExtractor tells the spider which links to extract from a crawled page;
    # each extracted link automatically becomes a Request object.
    rules = (Rule(LinkExtractor(allow=r"/book/1206_\d+\.html"), callback="parse_item", follow=False),)

    def parse_item(self, response):
        name_list = response.xpath('//div[@class="book-info"]//img/@alt')
        src_list = response.xpath('//div[@class="book-info"]//img/@data-original')
        for i in range(len(name_list)):
            name = name_list[i].extract()
            src = src_list[i].extract()
            book = ScarpyReadbook41Item(name=name, src=src)
            yield book
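The item class used above is not shown in the post; a minimal items.py matching the two fields passed in the constructor would look like this:

import scrapy


class ScarpyReadbook41Item(scrapy.Item):
    name = scrapy.Field()  # book title, taken from the img alt attribute
    src = scrapy.Field()   # cover image URL, taken from the data-original attribute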
Enable the pipeline
Write the items to a file:
class ScarpyReadbook41Pipeline:
    def open_spider(self, spider):
        # Open the output file once, when the spider starts
        self.fp = open('books.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(str(item))
        return item

    def close_spider(self, spider):
        self.fp.close()
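"Enable the pipeline" means registering it under ITEM_PIPELINES in settings.py. A sketch, assuming the project package is named scarpy_readbook_41 (adjust to your actual project name):

# settings.py
ITEM_PIPELINES = {
    "scarpy_readbook_41.pipelines.ScarpyReadbook41Pipeline": 300,
}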
After running the spider, the first page's data is missing.
You need the _1 suffix in start_urls, otherwise the first page's data is never read. The likely reason is that the Rule's allow pattern only matches URLs of the form /book/1206_\d+\.html, which the plain first-page URL https://www.dushu.com/book/1206.html does not fit.
start_urls = ["https://www.dushu.com/book/1206_1.html"]
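With the spider, item, and pipeline in place, run the crawl from the project root (the argument is the spider's name attribute):

scrapy crawl read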