python - How to restrain duplicate links from getting parsed?


I've written a script in Python to scrape the next-page links available in a webpage, and it is running fine at the moment. The only issue is that the scraper can't shake off duplicate links. I hope somebody can help me accomplish this. I've tried with:

    import requests
    from lxml import html

    page_link = "https://yts.ag/browse-movies"

    def nextpage_links(main_link):
        response = requests.get(main_link).text
        tree = html.fromstring(response)
        for item in tree.cssselect('ul.tsc_pagination a'):
            if "page" in item.attrib["href"]:
                print(item.attrib["href"])

    nextpage_links(page_link)

This is a partial image of what I'm getting:

[image not preserved]

You can use a set for this purpose:

    import requests
    from lxml import html

    page_link = "https://yts.ag/browse-movies"

    def nextpage_links(main_link):
        links = set()
        response = requests.get(main_link).text
        tree = html.fromstring(response)
        for item in tree.cssselect('ul.tsc_pagination a'):
            if "page" in item.attrib["href"]:
                links.add(item.attrib["href"])

        return links

    nextpage_links(page_link)
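One thing to keep in mind is that a set does not remember the order in which the links were found. If the page order matters, a possible variation is to deduplicate with `dict.fromkeys`, which keeps the first occurrence of each item. The sketch below is only an illustration: `unique_links` is a hypothetical helper name, and the sample hrefs are made up to stand in for what the pagination selector might return.

```python
def unique_links(hrefs):
    """Return hrefs with duplicates removed, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(hrefs))


# Made-up sample data; a pagination bar often repeats the same links
# at the top and bottom of the page.
sample_hrefs = [
    "/browse-movies?page=2",
    "/browse-movies?page=3",
    "/browse-movies?page=2",
    "/browse-movies?page=3",
]

for href in unique_links(sample_hrefs):
    print(href)
```

In the function above, this would mean collecting the hrefs in a list inside the loop and returning `unique_links(...)` of that list instead of a set.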

Alternatively, you can use Scrapy, which by default filters out duplicate requests.
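To give a rough idea of what such a duplicate filter does, here is a plain-Python sketch of the general technique: reduce each URL to a normalized fingerprint and remember the fingerprints already seen. This is not Scrapy's actual code; `fingerprint` and `SeenFilter` are hypothetical names, and the normalization rules (sorting query parameters, dropping the fragment) are assumptions chosen for illustration.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def fingerprint(url):
    """Hash a normalized form of the URL (assumed normalization rules)."""
    parts = urlsplit(url)
    # Sort query parameters so ?a=1&b=2 and ?b=2&a=1 compare equal
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Lowercase the host and drop the #fragment
    canonical = urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, query, "")
    )
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


class SeenFilter:
    """Remember fingerprints of URLs that have already been accepted."""

    def __init__(self):
        self._seen = set()

    def is_new(self, url):
        fp = fingerprint(url)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        return True


seen = SeenFilter()
for url in [
    "https://yts.ag/browse-movies?page=2",
    "https://yts.ag/browse-movies?page=2#top",  # same page, extra fragment
    "https://yts.ag/browse-movies?page=3",
]:
    if seen.is_new(url):
        print(url)
```

Compared with a bare set of href strings, fingerprinting also catches links that differ only in ways that do not change the page being requested.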

