Python - unable to log in with Scrapy
I'm trying to scrape a page that requires logging in first, but for some reason Scrapy crawls the page as if the login had no effect, even after using FormRequest. See the code below:
```python
# coding: utf-8
import scrapy
from scrapy.http import Request, FormRequest

usuario = 'myemail'
senha = 'mypassword'
urllogin = 'https://ludopedia.com.br/login'
urlnotificacoes = 'https://ludopedia.com.br/notificacoes'


class Notificacao(scrapy.Item):
    """Holds the data of the Ludopedia listings."""
    jogo = scrapy.Field()
    colecao = scrapy.Field()
    tipo = scrapy.Field()
    link = scrapy.Field()


class LoginSpider(scrapy.Spider):
    name = 'ludopedia'
    custom_settings = {
        'CONCURRENT_REQUESTS': 1,
        'LOG_LEVEL': 'DEBUG',
    }
    start_urls = [urllogin]

    def parse(self, response):
        return FormRequest.from_response(
            response,
            formname='form',
            formid='form',
            formdata={'email': usuario, 'pass': senha},
            callback=self.after_login,
            dont_filter=True
        )

    def after_login(self, response):
        # check whether the login succeeded before going on
        if "minha conta" in response.body:
            self.logger.error("login falhou")
            return

        yield Request(urlnotificacoes)
        self.logger.info("visitei %s", response.url)
        msg = response.selector.xpath(
            '//*[@id="page-content"]/div/div/table/tbody/tr[2]/td/a/div[2]/div')
        ...
```
The output of the script is:
```
2017-07-25 12:02:55 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2017-07-25 12:02:55 [scrapy.utils.log] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True}
2017-07-25 12:02:56 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-07-25 12:02:56 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-07-25 12:02:56 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-07-25 12:02:56 [scrapy.middleware] INFO: Enabled item pipelines: []
2017-07-25 12:02:56 [scrapy.core.engine] INFO: Spider opened
2017-07-25 12:02:56 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-07-25 12:02:56 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-07-25 12:02:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ludopedia.com.br/login> (referer: None)
2017-07-25 12:02:59 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://ludopedia.com.br/login> (referer: https://ludopedia.com.br/login)
2017-07-25 12:02:59 [ludopedia] INFO: visitei https://ludopedia.com.br/login <200 https://ludopedia.com.br/login>
2017-07-25 12:03:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ludopedia.com.br/notificacoes> (referer: https://ludopedia.com.br/login)
2017-07-25 12:03:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ludopedia.com.br/search?search=&email=myemail&pass=mypassword> (referer: https://ludopedia.com.br/notificacoes)
2017-07-25 12:03:01 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://ludopedia.com.br/notificacoes> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2017-07-25 12:03:01 [ludopedia] INFO: visitei https://ludopedia.com.br/search?search=&email=myemail&pass=mypassword <200 https://ludopedia.com.br/search?search=&email=myemail&pass=mypassword>
2017-07-25 12:03:01 [scrapy.core.engine] INFO: Closing spider (finished)
2017-07-25 12:03:01 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1357,
 'downloader/request_count': 4,
 'downloader/request_method_count/GET': 3,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 134813,
 'downloader/response_count': 4,
 'downloader/response_status_count/200': 4,
 'dupefilter/filtered': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 7, 25, 15, 3, 1, 355077),
 'log_count/DEBUG': 6,
 'log_count/INFO': 9,
 'memusage/max': 51732480,
 'memusage/startup': 51732480,
 'request_depth_max': 4,
 'response_received_count': 4,
 'scheduler/dequeued': 4,
 'scheduler/dequeued/memory': 4,
 'scheduler/enqueued': 4,
 'scheduler/enqueued/memory': 4,
 'start_time': datetime.datetime(2017, 7, 25, 15, 2, 56, 35121)}
2017-07-25 12:03:01 [scrapy.core.engine] INFO: Spider closed (finished)
```
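The telling line in the log is the GET to /search with the credentials in its query string. Parsing that URL with the standard library (this snippet is mine, not part of the original post) makes it obvious that the login formdata was merged into the parameters of a GET form:

```python
from urllib.parse import urlsplit, parse_qs

# The URL that appears in the crawl log.
url = "https://ludopedia.com.br/search?search=&email=myemail&pass=mypassword"

# keep_blank_values=True preserves the empty 'search' parameter.
params = parse_qs(urlsplit(url).query, keep_blank_values=True)
print(params)
# {'search': [''], 'email': ['myemail'], 'pass': ['mypassword']}
```

The 'search' field belongs to the site's search form; 'email' and 'pass' are the login formdata, which should never have ended up here.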
So the problem is that, for some reason, I end up requesting ludopedia.com.br/search?search=&email=myemail&pass=mypassword, and I don't know why.
What I'm trying to do is: visit ludopedia.com.br/login, fill in the form with my e-mail and password, then visit ludopedia.com.br/notificacoes and parse the HTML there.
How can I avoid the request to ludopedia.com.br/search?search=&email=myemail&pass=mypassword?
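A likely explanation, which the self-answer below is consistent with: in Scrapy, a Request built without callback= is dispatched back to the spider's default parse() method, so the notificacoes response re-entered parse(), which called FormRequest.from_response again and this time submitted the notificacoes page's search form (a GET form) with the login formdata merged in. A toy model of that fallback, not Scrapy's actual code:

```python
# Toy model of Scrapy's callback dispatch (an illustration, not Scrapy's source):
# a Request built without callback= falls back to the spider's parse().

class Request:
    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback

class Spider:
    def parse(self, url):
        # in the original spider this re-submits a form on whatever page arrives
        return f"parse() handled {url}"

    def parse_notificacoes(self, url):
        return f"parse_notificacoes() handled {url}"

def dispatch(spider, request):
    # Scrapy uses spider.parse when request.callback is None
    cb = request.callback or spider.parse
    return cb(request.url)

spider = Spider()
print(dispatch(spider, Request("https://ludopedia.com.br/notificacoes")))
# parse() handled https://ludopedia.com.br/notificacoes
print(dispatch(spider, Request("https://ludopedia.com.br/notificacoes",
                               callback=spider.parse_notificacoes)))
# parse_notificacoes() handled https://ludopedia.com.br/notificacoes
```

With an explicit callback, the response never reaches parse(), so no form is resubmitted.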
I've made it! I think it was a logic problem; here is the working code:
```python
# coding: utf-8
import scrapy
from scrapy.http import Request, FormRequest

usuario = 'myemail'
senha = 'mypassword'
urllogin = 'https://ludopedia.com.br/login'
urlnotificacoes = 'https://ludopedia.com.br/notificacoes'


class Notificacao(scrapy.Item):
    """Holds the data of the Ludopedia listings."""
    jogo = scrapy.Field()
    colecao = scrapy.Field()
    tipo = scrapy.Field()
    link = scrapy.Field()


class LoginSpider(scrapy.Spider):
    name = 'ludopedia'
    custom_settings = {
        'CONCURRENT_REQUESTS': 1,
        'LOG_LEVEL': 'DEBUG',
    }
    start_urls = [urllogin]

    def parse(self, response):
        return FormRequest.from_response(
            response,
            formname='form',
            formid='form',
            formdata={'email': usuario, 'pass': senha},
            callback=self.after_login,
            dont_filter=True
        )

    def after_login(self, response):
        # check whether the login succeeded before going on
        if "minha conta" in response.body:
            self.logger.error("login falhou")
            return

        # explicit callback: the new response goes to parse_notificacoes,
        # not to the default parse() method
        request = Request(urlnotificacoes, callback=self.parse_notificacoes)
        yield request

    def parse_notificacoes(self, response):
        msg = response.selector.xpath(
            '//*[@id="page-content"]/div/div/table/tbody/tr[2]/td/a/div[2]/div')
        ...
```
The difference is that in "after_login" I added a request for the page I want to scrape and gave it a callback, then "yield request", so the new response (now logged in) is parsed by the "parse_notificacoes" function instead of the default one.
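One caveat I'd add, an observation of mine rather than something from the original post: under Python 3, Scrapy's response.body is bytes, so a str membership test like "minha conta" in response.body raises a TypeError; use a bytes pattern or the decoded response.text instead. A quick illustration with plain bytes/str values standing in for the response:

```python
# Stand-ins for a Scrapy response: body is bytes, text is the decoded str.
body = "<a href='/conta'>minha conta</a>".encode("utf-8")
text = body.decode("utf-8")

assert b"minha conta" in body   # bytes pattern against the bytes body: OK
assert "minha conta" in text    # str pattern against the decoded text: OK

# A str pattern against a bytes body fails under Python 3:
try:
    "minha conta" in body
except TypeError:
    print("str-in-bytes raises TypeError")
```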