python - How to make my crawler parse data from the start page
I've written some code in Python to grab details from a torrent site. When I run it, the results are mostly what I expected. The problem is that the crawler skips the content of the first page [as the pagination URLs start from 2], and I can't fix that. Any help on this would be highly appreciated.
import requests
from lxml import html

page_link = "https://yts.ag/browse-movies"
b_link = "https://yts.ag"

def get_links(main_link):
    response = requests.get(main_link).text
    tree = html.fromstring(response)
    # Follow every pagination link; these start at page 2
    for item in tree.cssselect('ul.tsc_pagination a'):
        if "page" in item.attrib["href"]:
            movie_details(b_link + item.attrib["href"])

def movie_details(link):
    response = requests.get(link).text
    tree = html.fromstring(response)
    # One div.browse-movie-wrap per movie on the listing page
    for titles in tree.cssselect("div.browse-movie-wrap"):
        title = titles.cssselect('div.browse-movie-bottom a.browse-movie-title')[0].text
        link = titles.cssselect('div.browse-movie-year')[0].text
        rating = titles.cssselect('figcaption.hidden-xs h4.rating')[0].text
        genre = titles.cssselect('figcaption.hidden-xs h4')[0].text
        genre1 = titles.cssselect('figcaption.hidden-xs h4')[1].text
        print(title, link, rating, genre, genre1)

get_links(page_link)
Why not call the movie_details() function on main_link before the loop?
def get_links(main_link):
    response = requests.get(main_link).text
    tree = html.fromstring(response)
    # Parse the start page itself before following the pagination links
    movie_details(main_link)
    for item in tree.cssselect('ul.tsc_pagination a'):
        if "page" in item.attrib["href"]:
            movie_details(b_link + item.attrib["href"])
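If it helps, the same idea can be separated from the scraping itself: build the list of pages first, with the start page at the front, and then loop over that list. The sketch below is only an illustration. The helper name pages_to_crawl is hypothetical (it is not in the original code), and skipping repeated links is an assumption in case the pagination links appear more than once on a page; the sample hrefs are made up.

```python
from urllib.parse import urljoin

def pages_to_crawl(start_url, hrefs):
    """Return start_url followed by each distinct pagination URL, in order.

    hrefs is the list of href values taken from the pagination links.
    (Hypothetical helper, not part of the original code.)
    """
    ordered = [start_url]      # the start page always comes first
    seen = {start_url}
    for href in hrefs:
        if "page" not in href:
            continue           # same filter as in the original loop
        url = urljoin(start_url, href)
        if url not in seen:    # skip links that were already queued
            seen.add(url)
            ordered.append(url)
    return ordered

# Made-up sample of hrefs, including a repeated link and an unrelated one
sample = ["/browse-movies?page=2", "/browse-movies?page=3",
          "/browse-movies?page=2", "/login"]
for url in pages_to_crawl("https://yts.ag/browse-movies", sample):
    print(url)
```

Inside get_links, the hrefs would come from tree.cssselect('ul.tsc_pagination a'), and each returned URL would be passed to movie_details().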