python - How to restrain duplicate links from getting parsed?
I've written a script in Python to scrape the next-page links available on a webpage, and it is running fine at the moment. The only issue is that the scraper can't shake off duplicate links. I hope somebody can help me accomplish this. I've tried with:
    import requests
    from lxml import html

    page_link = "https://yts.ag/browse-movies"

    def nextpage_links(main_link):
        response = requests.get(main_link).text
        tree = html.fromstring(response)
        for item in tree.cssselect('ul.tsc_pagination a'):
            if "page" in item.attrib["href"]:
                print(item.attrib["href"])

    nextpage_links(page_link)

This is a partial image of what I'm getting (image not preserved):
You can use a set for this purpose:
    import requests
    from lxml import html

    page_link = "https://yts.ag/browse-movies"

    def nextpage_links(main_link):
        links = set()  # a set silently ignores values it already holds
        response = requests.get(main_link).text
        tree = html.fromstring(response)
        for item in tree.cssselect('ul.tsc_pagination a'):
            if "page" in item.attrib["href"]:
                links.add(item.attrib["href"])
        return links

    nextpage_links(page_link)

You can also use Scrapy, which by default filters out duplicate requests.
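One thing to keep in mind with the set approach above: a set does not keep the links in the order they appeared on the page. If page order matters, a minimal sketch along these lines may help. The helper name `unique_links` and the sample hrefs are made up for illustration and are not taken from the site; in the real script you would pass in the `href` values collected from the selector.

```python
def unique_links(hrefs):
    """Return hrefs with duplicates removed, keeping first-seen order."""
    # dict keys are unique and keep insertion order (guaranteed since Python 3.7)
    return list(dict.fromkeys(hrefs))


# Hypothetical sample of what the pagination selector might yield
sample = [
    "/browse-movies?page=2",
    "/browse-movies?page=3",
    "/browse-movies?page=2",
    "/browse-movies?page=3",
    "/browse-movies?page=4",
]

for link in unique_links(sample):
    print(link)
```

This is only one way to do it; keeping a `seen` set next to a result list works just as well if you prefer to filter inside the loop.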