A brief introduction to talonspider, a Python crawler framework

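1. Why write this?

Some simple pages do not justify a heavyweight framework, yet writing every crawler entirely by hand is tedious.

talonspider was written to fill that gap. It provides:

• 1. Item extraction for a single page (see 2.1)
• 2. A spider module (see 2.2)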

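2. Introduction and usage

2.1. item

This module can be used on its own. For sites whose requests are simple (for example, sites that only need GET requests), this module alone is enough to write a crawler quickly. The examples below use Python 3; for Python 2, see the examples directory.

2.1.1. Single page, single target

For example, to fetch the book details, cover, and other information from http://book.qidian.com/info/1004608738, you can write: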

```python
from talonspider import Item, TextField, AttrField
from pprint import pprint


class TestSpider(Item):
    # Each field pairs a CSS selector with the kind of value to extract.
    title = TextField(css_select='.book-info>h1>em')
    author = TextField(css_select='a.writer')
    cover = AttrField(css_select='a#bookImg>img', attr='src')

    def tal_title(self, title):
        return title

    def tal_cover(self, cover):
        return 'http:' + cover


if __name__ == '__main__':
    item_data = TestSpider.get_item(url='http://book.qidian.com/info/1004608738')
    pprint(item_data)
```

See qidian_details_by_item.py for details.
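The `tal_title` and `tal_cover` methods in the example appear to act as per-field post-processing hooks, matched to fields by name. The sketch below is not talonspider's actual implementation. It is a standard-library illustration of how such a `tal_<field>` naming convention could be wired up with `getattr`. The names `BookHooks` and `apply_hooks`, and all sample values, are hypothetical.

```python
class BookHooks:
    """Hypothetical stand-in for an Item subclass; only the hooks are modelled."""

    def tal_title(self, title):
        # Pass the title through unchanged.
        return title

    def tal_cover(self, cover):
        # Assumes the extracted src is protocol-relative (starts with //).
        return 'http:' + cover


def apply_hooks(hooks, raw):
    """Run each raw value through a matching tal_<field> method, if one exists."""
    cleaned = {}
    for name, value in raw.items():
        hook = getattr(hooks, 'tal_' + name, None)
        cleaned[name] = hook(value) if callable(hook) else value
    return cleaned


# Made-up values standing in for what the CSS selectors would extract.
raw_fields = {
    'title': 'Example Book',
    'author': 'Example Author',
    'cover': '//example.com/cover.jpg',
}
result = apply_hooks(BookHooks(), raw_fields)
```

Fields without a matching hook, such as `author` here, are returned as extracted.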

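2.1.2. Single page, multiple targets

For example, to fetch the 25 movies shown on the first page of the Douban Top 250, where one page contains 25 targets, you can write: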

```python
from talonspider import Item, TextField, AttrField
from pprint import pprint


# Define a crawler class that inherits from Item
class DoubanSpider(Item):
    target_item = TextField(css_select='div.item')
    title = TextField(css_select='span.title')
    cover = AttrField(css_select='div.pic>a>img', attr='src')
    abstract = TextField(css_select='span.inq')

    def tal_title(self, title):
        if isinstance(title, str):
            return title
        else:
            # Several elements matched: join their text and drop non-breaking spaces
            return ''.join([i.text.strip().replace('\xa0', '') for i in title])


if __name__ == '__main__':
    items_data = DoubanSpider.get_items(url='https://movie.douban.com/top250')
    result = []
    for item in items_data:
        result.append({
            'title': item.title,
            'cover': item.cover,
            'abstract': item.abstract,
        })
    pprint(result)
```

See douban_page_by_item.py for details.
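The `else` branch of `tal_title` covers the case where `span.title` matches more than one element, so the hook receives a list of elements rather than a string. The sketch below reproduces only that cleaning step with the standard library. `FakeElement` is a hypothetical stand-in for an lxml element, and the sample titles are invented.

```python
from collections import namedtuple

# Hypothetical stand-in for an lxml element: only the .text attribute is used.
FakeElement = namedtuple('FakeElement', ['text'])


def clean_title(title):
    """Mirror the tal_title logic: accept a string or a list of elements."""
    if isinstance(title, str):
        return title
    # Strip surrounding whitespace, then drop any non-breaking spaces (U+00A0).
    return ''.join(el.text.strip().replace('\xa0', '') for el in title)


parts = [FakeElement('Movie Title'), FakeElement('\xa0/\xa0Original Title ')]
joined = clean_title(parts)
```

Note that the escape must be written as `'\xa0'`; a plain `'xa0'` would only remove the literal three-character sequence.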

2.2. spider

When you need to crawl pages that have a hierarchy, such as every movie in the Douban Top 250, the spider module comes in handy:

```python
#!/usr/bin/env python
from talonspider import Spider, Item, TextField, AttrField, Request
from talonspider.utils import get_random_user_agent


# Define a crawler class that inherits from Item
class DoubanItem(Item):
    target_item = TextField(css_select='div.item')
    title = TextField(css_select='span.title')
    cover = AttrField(css_select='div.pic>a>img', attr='src')
    abstract = TextField(css_select='span.inq')

    def tal_title(self, title):
        if isinstance(title, str):
            return title
        else:
            return ''.join([i.text.strip().replace('\xa0', '') for i in title])


class DoubanSpider(Spider):
    # Start URLs (required)
    start_urls = ['https://movie.douban.com/top250']
    # Request configuration
    request_config = {
        'RETRIES': 3,
        'DELAY': 0,
        'TIMEOUT': 20
    }

    # Parse function (required)
    def parse(self, html):
        # Convert the html into an etree
        etree = self.e_html(html)
        # Extract the target values to build new URLs
        pages = [i.get('href') for i in etree.cssselect('.paginator>a')]
        pages.insert(0, '?start=0&filter=')
        headers = {
            "User-Agent": get_random_user_agent()
        }
        for page in pages:
            url = self.start_urls[0] + page
            yield Request(url, request_config=self.request_config, headers=headers, callback=self.parse_item)

    def parse_item(self, html):
        items_data = DoubanItem.get_items(html=html)
        # result = []
        for item in items_data:
            # result.append({
            #     'title': item.title,
            #     'cover': item.cover,
            #     'abstract': item.abstract,
            # })
            # Save each title on its own line
            with open('douban250.txt', 'a+') as f:
                f.write(item.title + '\n')


if __name__ == '__main__':
    DoubanSpider.start()
```

Console output:

```
/Users/howie/anaconda3/envs/work3/bin/python /Users/howie/Documents/programming/python/git/talonspider/examples/douban_page_by_spider.py
2017-06-07 23:17:30,346 - talonspider - INFO: talonspider started
2017-06-07 23:17:30,693 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250
2017-06-07 23:17:31,074 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=25&filter=
2017-06-07 23:17:31,416 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=50&filter=
2017-06-07 23:17:31,853 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=75&filter=
2017-06-07 23:17:32,523 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=100&filter=
2017-06-07 23:17:33,032 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=125&filter=
2017-06-07 23:17:33,537 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=150&filter=
2017-06-07 23:17:33,990 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=175&filter=
2017-06-07 23:17:34,406 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=200&filter=
2017-06-07 23:17:34,787 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=225&filter=
2017-06-07 23:17:34,809 - talonspider - INFO: Time usage:0:00:04.462108

Process finished with exit code 0
```

At this point douban250.txt has been created in the current directory. See douban_page_by_spider.py for details.
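In `parse`, the page links are read from the paginator and appended to `start_urls[0]`, and the logged URLs step through `start` offsets in increments of 25. As a rough offline illustration, the sketch below computes the same kind of URL list without touching the network. The function name `build_page_urls` and its `total` and `page_size` defaults are assumptions inferred from the log, not part of talonspider.

```python
from urllib.parse import urlencode


def build_page_urls(base_url, total=250, page_size=25):
    """Return one URL per result page, using start offsets 0, 25, 50, ..."""
    urls = []
    for start in range(0, total, page_size):
        # urlencode keeps an empty value as "filter=", matching the logged URLs.
        query = urlencode({'start': start, 'filter': ''})
        urls.append(base_url + '?' + query)
    return urls


page_urls = build_page_urls('https://movie.douban.com/top250')
```

Reading the links from the page, as the spider does, is the more robust choice when the number of pages is not known in advance; computing them is only reasonable when the pagination scheme is fixed.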

3. Notes

This is a learning project and there is still plenty to improve. Suggestions are welcome. Project: talonspider.

Original article: http://www.jianshu.com/p/c20b7f3d5a78