Scrapy Basics: CrawlSpider in Detail

Published: 2016-08-10 18:44

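Preface

In "Scrapy Basics: Spider", I gave a brief introduction to the Spider class. Spider can already handle a lot, but if you want to crawl an entire site such as Zhihu or Jianshu, you may need a more powerful tool.
CrawlSpider is built on Spider, and it is fair to say it was made for whole-site crawling.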

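Overview

CrawlSpider is the spider commonly used for sites whose links follow regular patterns. It is based on Spider and adds a few attributes and methods of its own:

· rules: a collection of Rule objects that match the links you want on the target site and filter out the noise.

· parse_start_url: called for the responses to the start URLs; it must return an Item, a Request, or an iterable containing either.

Because rules is a collection of Rule objects, Rule needs an introduction too. It takes these parameters: link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None.
The link_extractor can be one you define yourself or the built-in LinkExtractor class, whose main parameters are: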

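· allow: a regular expression (or list of them); only URLs that match are extracted. If empty, every link matches.

· deny: a regular expression (or list of them); URLs that match are never extracted. It takes precedence over allow.

· allow_domains: the domains whose links will be extracted.

· deny_domains: the domains whose links will never be extracted.

· restrict_xpaths: an XPath expression (or list of them) selecting the regions of the page from which links are extracted; it works together with allow to filter links. There is a similar restrict_css.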
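To make the interplay of allow and deny concrete, here is a minimal sketch that uses only Python's re module. It is a simplified model of the filtering described above, not Scrapy's implementation; the function name link_allowed and the sample URLs are made up for illustration.

```python
import re


def link_allowed(url, allow=(), deny=()):
    """Simplified model of the allow/deny check (not Scrapy's actual code)."""
    allow_res = [re.compile(p) for p in allow]
    deny_res = [re.compile(p) for p in deny]
    # An empty allow list accepts every URL.
    if allow_res and not any(r.search(url) for r in allow_res):
        return False
    # deny wins over allow: any deny match rejects the URL.
    if any(r.search(url) for r in deny_res):
        return False
    return True


urls = [
    "http://www.example.com/category.php?id=1",
    "http://www.example.com/category.php?page=subsection.php",
    "http://www.example.com/about.html",
]
kept = [
    u for u in urls
    if link_allowed(u, allow=(r"category\.php",), deny=(r"subsection\.php",))
]
print(kept)
```

Only the first URL survives: the second matches allow but is rejected by deny, and the third does not match allow at all.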

Below is the example from the official documentation. Starting from the source code, I will walk through some common questions:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):

    name = 'example.com'

    allowed_domains = ['example.com']

    start_urls = ['http://www.example.com']

 

    rules = (

        # Extract links matching 'category.php' (but not matching 'subsection.php')

        # and follow links from them (since no callback means follow=True by default).

        Rule(LinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),

 

        # Extract links matching 'item.php' and parse them with the spider's method parse_item

        Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),

    )

 

    def parse_item(self, response):

        self.logger.info('Hi, this is an item page! %s', response.url)

        item = {}  # plain dict; a bare scrapy.Item() declares no fields, so item['id'] = ... would raise KeyError

        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')

        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()

        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()

        return item

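Question: How does CrawlSpider work?

Because CrawlSpider inherits from Spider, it has all of Spider's methods.
First, start_requests issues a request for every URL in start_urls (via make_requests_from_url), and parse receives the response. In a plain Spider we have to define parse ourselves, but CrawlSpider defines parse itself to handle the response (self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)). This is why a CrawlSpider subclass should not override parse.
_parse_response does different things depending on whether a callback was given and on the values of follow and self._follow_links.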

    def _parse_response(self, response, callback, cb_kwargs, follow=True):

        # If a callback was passed in, use it to parse the page and yield the requests or items it produces

        if callback:

            cb_res = callback(response, **cb_kwargs) or ()

            cb_res = self.process_results(response, cb_res)

            for requests_or_item in iterate_spider_output(cb_res):

                yield requests_or_item

        # Then check follow; if set, use _requests_to_follow to find links in the response that match the rules

        if follow and self._follow_links:

            for request_or_item in self._requests_to_follow(response):

                yield request_or_item

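_requests_to_follow, in turn, takes the links that link_extractor (the LinkExtractor we passed in) extracts from the page via link_extractor.extract_links(response), post-processes them with process_links (which you define yourself if you need it), and issues a Request for each link that qualifies. Each request is then passed through process_request (also user-defined) before it is scheduled.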
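One consequence of this loop is that rule order matters: a link claimed by an earlier rule is not handed to a later one. The sketch below models that with plain regular expressions. The RULES list, requests_to_follow, and the sample links are hypothetical stand-ins, and the real method deduplicates Link objects rather than bare URL strings.

```python
import re

# Hypothetical stand-ins: each "rule" is just (name, regex).
# Real rules hold a LinkExtractor instead of a bare pattern.
RULES = [
    ("category", re.compile(r"category\.php")),
    ("item", re.compile(r"item\.php")),
    ("any_php", re.compile(r"\.php")),
]


def requests_to_follow(page_links, rules=RULES):
    """Yield (rule_index, url) pairs, mimicking how each request is tagged."""
    seen = set()
    for n, (_name, pattern) in enumerate(rules):
        # Skip links that an earlier rule has already claimed.
        links = [u for u in page_links if pattern.search(u) and u not in seen]
        for url in links:
            seen.add(url)
            yield n, url


page_links = ["/category.php?id=3", "/item.php?id=7", "/contact.php", "/logo.png"]
planned = list(requests_to_follow(page_links))
print(planned)
```

Although the third pattern also matches the category and item links, they were already claimed by rules 0 and 1, so rule 2 only receives /contact.php.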

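Question: How does CrawlSpider get its rules?

In its __init__ method, the CrawlSpider class calls _compile_rules, which makes a shallow copy of each Rule in rules and resolves its callback (callback), its link-processing hook (process_links), and its request-processing hook (process_request).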

    def _compile_rules(self):

        def get_method(method):

            if callable(method):

                return method

            elif isinstance(method, six.string_types):

                return getattr(self, method, None)

 

        self._rules = [copy.copy(r) for r in self.rules]

        for rule in self._rules:

            rule.callback = get_method(rule.callback)

            rule.process_links = get_method(rule.process_links)

            rule.process_request = get_method(rule.process_request)
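The lookup in get_method explains why a callback can be given as a string such as 'parse_item'. The following sketch reproduces the idea outside Scrapy. ToySpider and resolve are hypothetical names, and str stands in for six.string_types on the assumption that the code runs on Python 3.

```python
class ToySpider:
    """Hypothetical spider-like class, used only to illustrate name resolution."""

    def parse_item(self, response):
        return {"url": response}

    def resolve(self, method):
        # Same idea as get_method: callables pass through unchanged,
        # strings are looked up as attributes on the spider instance.
        if callable(method):
            return method
        if isinstance(method, str):
            return getattr(self, method, None)
        return None


spider = ToySpider()
by_name = spider.resolve("parse_item")   # bound method found by name
by_ref = spider.resolve(len)             # already callable, returned as-is
missing = spider.resolve("parse_itme")   # misspelled name resolves to None

print(by_name("http://www.example.com/item.php?id=1"))
print(missing)
```

Note the last case: a misspelled callback name resolves to None without raising an error, and since _parse_response only runs the callback when it is truthy, such a rule would simply never call your parsing method.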

So how is Rule defined?

    class Rule(object):

 

        def __init__(self, link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=identity):

            self.link_extractor = link_extractor

            self.callback = callback

            self.cb_kwargs = cb_kwargs or {}

            self.process_links = process_links

            self.process_request = process_request

            if follow is None:

                self.follow = False if callback else True

            else:

                self.follow = follow

So the LinkExtractor we pass in is stored as link_extractor.
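The default for follow in the Rule constructor above is easy to overlook, so here is the same branch isolated as a small function. The name effective_follow is made up for this sketch; the logic is copied from the __init__ shown above.

```python
def effective_follow(callback=None, follow=None):
    """Mirror of the follow logic in Rule.__init__."""
    if follow is None:
        # No explicit value: follow links only when there is no callback.
        return False if callback else True
    return follow


print(effective_follow())                                      # no callback
print(effective_follow(callback="parse_item"))                 # callback only
print(effective_follow(callback="parse_item", follow=True))    # both given
```

In other words, a rule with a callback stops following links from the pages it parses unless you also pass follow=True.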

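Rules with a callback are handled by the specified function. Which function handles the ones without a callback?

As explained above, _parse_response handles responses that have a callback:
cb_res = callback(response, **cb_kwargs) or ()
_requests_to_follow, on the other hand, passes self._response_downloaded as the callback of the request it issues for every matching URL on the page:
r = Request(url=link.url, callback=self._response_downloaded)
So every followed page is first handled by _response_downloaded, which looks up the rule and calls _parse_response. When the rule has no callback, the callback step is skipped and only link following happens.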
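The round trip through meta can be sketched without Scrapy. In this toy model, dicts stand in for Request and Response objects and tuples stand in for compiled Rule objects; all names here are hypothetical.

```python
def parse_item(url):
    return "item:" + url


# Hypothetical compiled rules: (callback, follow) pairs standing in for Rule objects.
RULES = [(None, True), (parse_item, False)]


def make_request(url, rule_index):
    # Like _requests_to_follow: remember the rule's index in the request's meta.
    return {"url": url, "meta": {"rule": rule_index}}


def response_downloaded(response):
    # Like _response_downloaded: recover the rule from meta, then apply its callback.
    callback, follow = RULES[response["meta"]["rule"]]
    results = [callback(response["url"])] if callback else []
    return results, follow


print(response_downloaded(make_request("/item.php?id=7", 1)))
print(response_downloaded(make_request("/category.php", 0)))
```

The rule without a callback produces no results of its own; its only effect is that follow is True, so links on that page keep being extracted.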

How to simulate a login in CrawlSpider

CrawlSpider, like Spider, issues its first requests through start_requests. The code below, borrowed from Andrew_liu, shows how to simulate a login:

# These methods go inside your CrawlSpider subclass and need:
#   from scrapy import Selector
#   from scrapy.http import Request, FormRequest
# self.headers must also be defined on the spider.

# Replace the default start_requests; the callback is post_login
def start_requests(self):
    return [Request("http://www.zhihu.com/#signin", meta={'cookiejar': 1}, callback=self.post_login)]

def post_login(self, response):
    print('Preparing login')
    # Grab the _xsrf field from the returned page; it is needed to submit the form successfully
    xsrf = Selector(response).xpath('//input[@name="_xsrf"]/@value').extract()[0]
    print(xsrf)
    # FormRequest.from_response is a Scrapy helper for posting forms
    # After a successful login, the after_login callback is invoked
    return [FormRequest.from_response(response,   # "http://www.zhihu.com/login",
                                      meta={'cookiejar': response.meta['cookiejar']},
                                      headers=self.headers,
                                      formdata={
                                          '_xsrf': xsrf,
                                          'email': 'your_email@example.com',  # placeholder
                                          'password': 'your_password'         # placeholder
                                      },
                                      callback=self.after_login,
                                      dont_filter=True)]

# Requests built by make_requests_from_url are handled by parse, which hooks into CrawlSpider's parse
def after_login(self, response):
    for url in self.start_urls:
        yield self.make_requests_from_url(url)

 

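That is the theory. If anything is missing or unclear, feel free to tell me in the comments.
Next, I will write a spider that crawls all users on Jianshu to show how to use CrawlSpider in practice.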

 

Finally, here is the source code of scrapy.spiders.CrawlSpider for reference:

"""

This modules implements the CrawlSpider which is the recommended spider to use

for scraping typical web sites that requires crawling pages.

 

See documentation in docs/topics/spiders.rst

"""

import copy
import six

from scrapy.http import Request, HtmlResponse
from scrapy.utils.spider import iterate_spider_output
from scrapy.spiders import Spider

 

def identity(x):

    return x

 

class Rule(object):

 

    def __init__(self, link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=identity):

        self.link_extractor = link_extractor

        self.callback = callback

        self.cb_kwargs = cb_kwargs or {}

        self.process_links = process_links

        self.process_request = process_request

        if follow is None:

            self.follow = False if callback else True

        else:

            self.follow = follow

 

class CrawlSpider(Spider):

 

    rules = ()

 

    def __init__(self, *a, **kw):

        super(CrawlSpider, self).__init__(*a, **kw)

        self._compile_rules()

 

    def parse(self, response):

        return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)

 

    def parse_start_url(self, response):

        return []

 

    def process_results(self, response, results):

        return results

 

    def _requests_to_follow(self, response):

        if not isinstance(response, HtmlResponse):

            return

        seen = set()

        for n, rule in enumerate(self._rules):

            links = [lnk for lnk in rule.link_extractor.extract_links(response)

                     if lnk not in seen]

            if links and rule.process_links:

                links = rule.process_links(links)

            for link in links:

                seen.add(link)

                r = Request(url=link.url, callback=self._response_downloaded)

                r.meta.update(rule=n, link_text=link.text)

                yield rule.process_request(r)

 

    def _response_downloaded(self, response):

        rule = self._rules[response.meta['rule']]

        return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)

 

    def _parse_response(self, response, callback, cb_kwargs, follow=True):

        if callback:

            cb_res = callback(response, **cb_kwargs) or ()

            cb_res = self.process_results(response, cb_res)

            for requests_or_item in iterate_spider_output(cb_res):

                yield requests_or_item

 

        if follow and self._follow_links:

            for request_or_item in self._requests_to_follow(response):

                yield request_or_item

 

    def _compile_rules(self):

        def get_method(method):

            if callable(method):

                return method

            elif isinstance(method, six.string_types):

                return getattr(self, method, None)

 

        self._rules = [copy.copy(r) for r in self.rules]

        for rule in self._rules:

            rule.callback = get_method(rule.callback)

            rule.process_links = get_method(rule.process_links)

            rule.process_request = get_method(rule.process_request)

    @classmethod

    def from_crawler(cls, crawler, *args, **kwargs):

        spider = super(CrawlSpider, cls).from_crawler(crawler, *args, **kwargs)

        spider._follow_links = crawler.settings.getbool(

            'CRAWLSPIDER_FOLLOW_LINKS', True)

        return spider

 

    def set_crawler(self, crawler):

        super(CrawlSpider, self).set_crawler(crawler)

        self._follow_links = crawler.settings.getbool('CRAWLSPIDER_FOLLOW_LINKS', True)

 

 


Original source: Jianshu / hoptop

 
