How do I handle cookies correctly in Scrapy?
Date: 2013-11-16 | Source: 开源中国 (OSChina)
I'm trying to write a spider to crawl a forum (Discuz 7.2). The target board requires login, and the site uses cookies to verify identity. The simulated login succeeds, but when I then request the target board, the site keeps telling me I need to log in. I can't figure out what's going on, and I'm not sure whether the cookies are being handled properly, so I'm asking here for advice. Below are the runtime log and the spider code.
2013-11-16 14:50:57+0000 [scrapy] INFO: Scrapy 0.18.4 started (bot: magnetbot)
2013-11-16 14:50:57+0000 [scrapy] DEBUG: Optional features available: ssl, http11, django
2013-11-16 14:50:57+0000 [scrapy] DEBUG: Overridden settings: {'COOKIES_DEBUG': True, 'NEWSPIDER_MODULE': 'magnetbot.spiders', 'ITEM_PIPELINES': ['magnetbot.pipelines.MagnetbotPipeline'], 'DUPEFILTER_CLASS': 'magnetbot.middleware.filter.DuplicateFilter', 'SPIDER_MODULES': ['magnetbot.spiders'], 'RETRY_HTTP_CODES': [500, 502, 503, 504, 400, 408, 403], 'BOT_NAME': 'magnetbot', 'DOWNLOAD_TIMEOUT': 30, 'DOWNLOAD_DELAY': 3}
2013-11-16 14:50:57+0000 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-11-16 14:50:58+0000 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, RandomUserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-11-16 14:50:58+0000 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, IgnoreVisitedLinkMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-11-16 22:50:59+0800 [scrapy] DEBUG: Enabled item pipelines: MagnetbotPipeline
2013-11-16 22:50:59+0800 [sisbot] INFO: Spider opened
2013-11-16 22:50:59+0800 [sisbot] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-11-16 22:50:59+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-11-16 22:50:59+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-11-16 22:50:59+0800 [scrapy] DEBUG: using user agent Mozilla/5.0 (Windows NT 5.1) Gecko/20100101 Firefox/14.0 Opera/12.0
2013-11-16 22:51:00+0800 [sisbot] DEBUG: Received cookies from: <200 http://www.xxx.com/forum/logging.php?action=login>
    Set-Cookie: cdb2_sid=XbbV5z; expires=Sat, 23-Nov-2013 14:51:00 GMT; path=/
2013-11-16 22:51:00+0800 [sisbot] DEBUG: Crawled (200) <GET http://www.xxx.com/forum/logging.php?action=login> (referer: None)
2013-11-16 22:51:00+0800 [sisbot] DEBUG: {"questionid": "0", "loginfield": "username", "referer": "index.php", "formhash": "55be5d02", "loginsubmit": "true", "c95b1308bda0a3589f68f75d23b15938": "xxxxxx", "62838ebfea47071969cead9d87a2f1f7": "username", "cookietime": "315360000"}
2013-11-16 22:51:00+0800 [scrapy] DEBUG: using user agent Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; chromeframe/11.0.696.57)
2013-11-16 22:51:00+0800 [sisbot] DEBUG: Sending cookies to: <POST http://www.xxx.com/forum/logging.php?action=login&>
    Cookie: cdb2_sid=XbbV5z
2013-11-16 22:51:03+0800 [sisbot] DEBUG: Received cookies from: <200 http://www.xxx.com/forum/logging.php?action=login&>
    Set-Cookie: cdb2_sid=SLZhIh; expires=Sat, 23-Nov-2013 14:51:03 GMT; path=/
    Set-Cookie: cdb2_cookietime=315360000; expires=Sun, 16-Nov-2014 14:51:03 GMT; path=/
    Set-Cookie: cdb2_auth=RIaGRiRAHl5qHkk9TkSVg%2FYaF43pSkY87as6B0L87WyrTi4FXQtxgCmChtXG%2BoYptQ; expires=Tue, 14-Nov-2023 14:51:03 GMT; path=/
    Set-Cookie: cdb2_isShowPWNotice=0; expires=Tue, 14-Nov-2023 14:51:03 GMT; path=/
2013-11-16 22:51:03+0800 [sisbot] DEBUG: Crawled (200) <POST http://www.xxx.com/forum/logging.php?action=login&> (referer: http://www.xxx.com/forum/logging.php?action=login)
2013-11-16 22:51:03+0800 [sisbot] DEBUG: login....
2013-11-16 22:51:03+0800 [sisbot] DEBUG: login success
2013-11-16 22:51:03+0800 [scrapy] DEBUG: using user agent Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36
2013-11-16 22:51:03+0800 [sisbot] DEBUG: Sending cookies to: <GET http://www.xxx.com/forum/forum-270-1.html>
    Cookie: cdb2_isShowPWNotice=0; cdb2_cookietime=315360000; cdb2_auth=RIaGRiRAHl5qHkk9TkSVg%2FYaF43pSkY87as6B0L87WyrTi4FXQtxgCmChtXG%2BoYptQ; cdb2_sid=SLZhIh
2013-11-16 22:51:07+0800 [sisbot] DEBUG: Received cookies from: <200 http://www.xxx.com/forum/forum-270-1.html>
    Set-Cookie: cdb2_sid=wTx2ho; expires=Sat, 23-Nov-2013 14:51:07 GMT; path=/
2013-11-16 22:51:07+0800 [sisbot] DEBUG: Crawled (200) <GET http://www.xxx.com/forum/forum-270-1.html> (referer: http://www.xxx.com/forum/logging.php?action=login&)
2013-11-16 22:51:07+0800 [sisbot] DEBUG: login error !!!
2013-11-16 22:51:07+0800 [sisbot] INFO: Closing spider (finished)
2013-11-16 22:51:07+0800 [sisbot] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 1481,
     'downloader/request_count': 3,
     'downloader/request_method_count/GET': 2,
     'downloader/request_method_count/POST': 1,
     'downloader/response_bytes': 18632,
     'downloader/response_count': 3,
     'downloader/response_status_count/200': 3,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2013, 11, 16, 14, 51, 7, 302541),
     'log_count/DEBUG': 22,
     'log_count/INFO': 3,
     'request_depth_max': 2,
     'response_received_count': 3,
     'scheduler/dequeued': 3,
     'scheduler/dequeued/disk': 3,
     'scheduler/enqueued': 3,
     'scheduler/enqueued/disk': 3,
     'start_time': datetime.datetime(2013, 11, 16, 14, 50, 59, 849339)}

The spider code:

import re
import json

from scrapy import log
from scrapy.conf import settings
from scrapy.http import Request, FormRequest
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class SisBot(BaseSpider):
    name = "sisbot"
    settings.overrides['COOKIES_DEBUG'] = True  # log Cookie/Set-Cookie headers
    allowed_domains = ["www.xxx.com"]
    start_urls = ['http://www.xxx.com/forum/logging.php?action=login']
    urls = ['http://www.xxx.com/forum/forum-270-%s.html']
    link_reg = re.compile(r'<span id="thread_\d+"><a href="(thread-\d+-1-1.html)"[^>]*>[^<]*</a></span>')
    title_reg = re.compile(r'.*<h1><a[^>]*>\[(.*)\]</a>([^<]+)</h1>(.*)', re.DOTALL)
    ed2k_reg = re.compile(r'.*(ed2k://\|file\|.*/).*')

    def parse(self, response):
        # The username/password inputs have randomized names, so read
        # them off the login page before building the form data.
        hxs = HtmlXPathSelector(response)
        name = "".join(hxs.select('//*[@id="username"]/@name').extract())
        password = "".join(hxs.select('//*[@id="password"]/@name').extract())
        formhash = "".join(hxs.select('//*[@name="formhash"]/@value').extract())
        formdata = {name: 'myusername',
                    password: 'secret',
                    'formhash': formhash,
                    'referer': 'index.php',
                    'cookietime': "315360000",
                    'loginfield': 'username',
                    'loginsubmit': 'true',
                    'questionid': "0"}
        self.log(json.dumps(formdata))
        return [FormRequest.from_response(response, formdata=formdata,
                                          callback=self.after_login)]

    def after_login(self, response):
        self.log("login....")
        if "myusername" not in response.body:
            self.log("login failed", level=log.ERROR)
            return
        self.log("login success")
        # Build the board page URLs (only page 1 here) and crawl them.
        us = [u % page for u in self.urls for page in xrange(1, 2)]
        self.log(us)
        for url in us:
            yield Request(url=url, callback=self.parse_link)

    def parse_link(self, response):
        self.log(response.body)
        if "myusername" not in response.body:
            self.log("login error !!!")
        for sub_url in self.link_reg.findall(response.body):
            yield Request("http://www.xxx.com/%s" % sub_url,
                          callback=self.parse_item)  # parse_item not shown in the post
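One thing the log makes visible: RandomUserAgentMiddleware rotates the User-Agent on every request (Firefox for the login GET, MSIE for the login POST, Chrome for the board GET), and the board response immediately issues a fresh cdb2_sid even though cdb2_auth was sent. If the forum binds its session to the client's User-Agent (some Discuz deployments do; this is an assumption worth testing, not a confirmed diagnosis), rotating it would invalidate the session on every hop. A quick experiment is to pin one User-Agent in settings.py by disabling the rotation middleware; the dotted path below is a placeholder for wherever this project actually registers it:

# Hypothetical settings.py tweak: use a single User-Agent for the whole run.
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0'
DOWNLOADER_MIDDLEWARES = {
    # Setting a middleware to None disables it; replace the dotted path
    # with the one used in this project's DOWNLOADER_MIDDLEWARES.
    'magnetbot.middleware.useragent.RandomUserAgentMiddleware': None,
}

If the "login error !!!" disappears with a fixed User-Agent, the session/User-Agent binding was the culprit rather than the cookie handling.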
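Independent of that, the documented way to manage cookie state explicitly in Scrapy (including the 0.18 line shown in the log) is the cookiejar Request meta key, which keeps separate cookie sessions per jar. The sketch below is a minimal illustration of the pattern, not the original spider; the spider name, form field names, and credentials are placeholders. The key detail is that cookiejar is not "sticky": it must be passed again on every follow-up request, otherwise the request falls back to the default jar.

from scrapy.http import Request, FormRequest
from scrapy.spider import BaseSpider


class CookieJarDemo(BaseSpider):
    # Hypothetical spider showing one explicit cookie session end to end.
    name = "cookiejar_demo"
    start_urls = ['http://www.xxx.com/forum/logging.php?action=login']

    def parse(self, response):
        # Bind the login POST to cookie jar 1.
        return FormRequest.from_response(
            response,
            formdata={'username': 'myusername', 'password': 'secret'},
            meta={'cookiejar': 1},
            callback=self.after_login)

    def after_login(self, response):
        # Re-pass the same jar so the auth cookies received at login
        # are sent with the board-page request.
        return Request('http://www.xxx.com/forum/forum-270-1.html',
                       meta={'cookiejar': response.meta['cookiejar']},
                       callback=self.parse_board)

    def parse_board(self, response):
        self.log("board page fetched with session cookies (%d bytes)"
                 % len(response.body))

With this pattern, every request carrying the same cookiejar value shares one jar, so the cdb2_auth cookie obtained at login rides along on the board-page GET, and several identities can be crawled in parallel by using distinct jar keys.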
