python - How to restrain duplicate links from getting parsed?

February 15, 2015

I've written a script in Python to scrape the next-page links available on a webpage, and it is running fine at the moment. The only issue is that the scraper can't shake off duplicate links. I hope somebody can help me accomplish this. I've tried with:

    import requests
    from lxml import html  # tree.cssselect() also needs the cssselect package

    page_link = "https://yts.ag/browse-movies"

    def nextpage_links(main_link):
        response = requests.get(main_link).text
        tree = html.fromstring(response)
        # Walk every anchor in the pagination bar
        for item in tree.cssselect('ul.tsc_pagination a'):
            if "page" in item.attrib["href"]:
                print(item.attrib["href"])

    nextpage_links(page_link)

This is a partial image of what I'm getting:

You can use a set for this purpose:

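    import requests
    from lxml import html  # tree.cssselect() also needs the cssselect package

    page_link = "https://yts.ag/browse-movies"

    def nextpage_links(main_link):
        links = set()  # a set silently ignores values it already holds
        response = requests.get(main_link).text
        tree = html.fromstring(response)
        for item in tree.cssselect('ul.tsc_pagination a'):
            if "page" in item.attrib["href"]:
                links.add(item.attrib["href"])

        return links

    # Print the returned set so the unique links are visible
    print(nextpage_links(page_link))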
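One thing to keep in mind is that a set does not keep the links in the order they appear on the page. If that order matters, something like the following might work instead. It is only a minimal sketch: `unique_links` is a hypothetical helper and the sample hrefs are made up to stand in for what the pagination bar returns.

```python
def unique_links(hrefs):
    """Return hrefs with duplicates removed, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(hrefs))


# Made-up sample data standing in for the hrefs collected by the scraper
sample = [
    "/browse-movies?page=2",
    "/browse-movies?page=3",
    "/browse-movies?page=2",
    "/browse-movies?page=3",
]

for link in unique_links(sample):
    print(link)
```

In the scraper itself, the hrefs could be collected into a plain list inside the loop and passed through this helper before returning.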

You can also use Scrapy, which by default filters out duplicate requests.
