Store scrape results and search in results with Python and Pandas?

As part of my Ph.D. research, I am scraping numerous webpages and searching for keywords within the scrape results.
This is how far I have got:

    import pandas as pd
    import requests

    # load data into a pandas data frame with column df.url
    df = pd.read_excel('sample.xls', header=0)

    # define keyword search function
    def contains_keywords(link, keywords):
        try:
            output = requests.get(link).text
            return int(any(x in output for x in keywords))
        except Exception:
            return "wrong/missing url"

    # define relevant keywords
    mykeywords = ('for', 'bar')

    # store search results in new column 'results'
    df['results'] = df.url.apply(lambda l: contains_keywords(l, mykeywords))

This works fine. I have one problem, though: the list of relevant keywords, mykeywords, changes frequently, whilst the webpages stay the same. Running the code takes a long time, since I request the same pages over and over.
I have two questions:
(1) Is there a way to store the results of requests.get(link).text?
(2) And if so, how do I search within the saved file(s), producing the same result as with the current script?
As always, thank you for your time and help! /r

You can download the content of the URLs and save each one in a separate file in a directory (e.g. 'links'):

    import os
    import requests

    def get_link(url):
        # build a file name from the URL by replacing '/' and ':' with '_'
        file_name = os.path.join('/path/to/links', url.replace('/', '_').replace(':', '_'))
        try:
            r = requests.get(url)
        except Exception as e:
            print("failed " + url)
        else:
            with open(file_name, 'w') as f:
                f.write(r.text)

Then modify the contains_keywords function to read the local files, so you won't have to use requests every time you run the script.

    def contains_keywords(link, keywords):
        # same URL-to-file-name mapping as in get_link
        file_name = os.path.join('/path/to/links', link.replace('/', '_').replace(':', '_'))
        try:
            with open(file_name) as f:
                output = f.read()
            return int(any(x in output for x in keywords))
        except Exception as e:
            print("can't access file: {}\n{}".format(file_name, e))
            return "wrong/missing url"

Edit: I added a try-except block in get_link and used an absolute path for file_name.
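
To tie the two steps together, here is a minimal sketch of one way to download only the pages that are not on disk yet and then search the saved copies. It is not part of the answer above: the names cache_path, fetch_with_requests, download_missing and search_cached, the cache_dir parameter, the 30-second timeout and the UTF-8 encoding are all assumptions made for illustration. The file-naming scheme is the same as in get_link.

```python
import os


def cache_path(url, cache_dir):
    """Map a URL to a file name inside cache_dir (same scheme as get_link)."""
    return os.path.join(cache_dir, url.replace('/', '_').replace(':', '_'))


def fetch_with_requests(url):
    """Default fetcher (hypothetical helper); requests is imported lazily."""
    import requests
    return requests.get(url, timeout=30).text


def download_missing(urls, cache_dir, fetch=fetch_with_requests):
    """Download only the URLs that have no saved file yet.

    Returns the list of URLs that were fetched and saved in this call.
    """
    os.makedirs(cache_dir, exist_ok=True)
    fetched = []
    for url in urls:
        path = cache_path(url, cache_dir)
        if os.path.exists(path):
            continue  # already on disk, no request needed
        try:
            text = fetch(url)
        except Exception as e:
            print("failed {}: {}".format(url, e))
            continue
        with open(path, 'w', encoding='utf-8') as f:
            f.write(text)
        fetched.append(url)
    return fetched


def search_cached(url, keywords, cache_dir):
    """Return 1 or 0 for a keyword hit, or the error string from the question."""
    try:
        with open(cache_path(url, cache_dir), encoding='utf-8') as f:
            output = f.read()
    except OSError:
        return "wrong/missing url"
    return int(any(k in output for k in keywords))
```

With the data frame from the question, this could be used by calling download_missing(df.url, '/path/to/links') once, and then df['results'] = df.url.apply(lambda l: search_cached(l, mykeywords, '/path/to/links')) each time mykeywords changes. Only the second step needs to be re-run, and it reads from disk rather than from the network.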