Store scrape results and search in results with Python and Pandas?

May 15, 2012

As part of my Ph.D. research, I am scraping numerous webpages and searching for keywords within the scrape results.

This is how far I have got:

    import pandas as pd
    import requests

    # Load the data into a pandas DataFrame; the URLs are in column df.url
    df = pd.read_excel('sample.xls', header=0)

    # Define the keyword search function
    def contains_keywords(link, keywords):
        try:
            output = requests.get(link).text
            # 1 if any keyword occurs in the page text, otherwise 0
            return int(any(x in output for x in keywords))
        except Exception:
            return "wrong/missing url"

    # Define the relevant keywords
    mykeywords = ('for', 'bar')

    # Store the search results in a new column 'results'
    df['results'] = df.url.apply(lambda l: contains_keywords(l, mykeywords))

This works fine. I have one problem: the list of relevant keywords, mykeywords, changes frequently, whilst the webpages stay the same. Running the code takes a long time, since I request the same pages over and over.

I have two questions:

(1) Is there a way to store the results of requests.get(link).text?

(2) And if so, how do I search within the saved file(s) so that it produces the same result as the current script?

As always, thank you for your time and help! /R

You can download the content of the URLs and save each page in a separate file in a directory (e.g. 'links'):

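    import os
    import requests

    def get_link(url):
        # Build a file name from the URL by replacing '/' and ':' with '_'
        file_name = os.path.join('/path/to/links', url.replace('/', '_').replace(':', '_'))
        try:
            r = requests.get(url)
        except Exception as e:
            print("failed " + url)
        else:
            # Only write the file if the request succeeded
            with open(file_name, 'w') as f:
                f.write(r.text)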

Then modify the contains_keywords function to read the local files, so you won't have to use requests every time you run the script:

    def contains_keywords(link, keywords):
        # Same naming scheme as in get_link, so the cached file is found again
        file_name = os.path.join('/path/to/links', link.replace('/', '_').replace(':', '_'))
        try:
            with open(file_name) as f:
                output = f.read()
            return int(any(x in output for x in keywords))
        except Exception as e:
            print("can't access file: {}\n{}".format(file_name, e))
            return "wrong/missing url"

Edit: added a try-except block in get_link and used an absolute path for file_name.
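
To show how the two steps might fit together, here is a minimal sketch that downloads each page only once and then searches the saved files. It is not part of the original answer: the names cache_path, fetch_with_requests, download_missing and search_cache are hypothetical, the fetch parameter and the 30-second timeout are assumptions added for illustration, and the file naming simply reuses the replace('/', '_') scheme from above.

```python
import os


def cache_path(cache_dir, url):
    # Same naming scheme as above; str() lets non-string cells (e.g. NaN)
    # fall through to the "wrong/missing url" branch instead of raising.
    return os.path.join(cache_dir, str(url).replace('/', '_').replace(':', '_'))


def fetch_with_requests(url):
    # Imported lazily so that searching the cache does not need requests.
    import requests
    return requests.get(url, timeout=30).text


def download_missing(urls, cache_dir, fetch=fetch_with_requests):
    """Download every URL that has no cached file yet; return the failures."""
    os.makedirs(cache_dir, exist_ok=True)
    failed = []
    for url in urls:
        path = cache_path(cache_dir, url)
        if os.path.exists(path):
            continue  # already downloaded on an earlier run
        try:
            text = fetch(url)
        except Exception as e:
            print("failed {}: {}".format(url, e))
            failed.append(url)
        else:
            with open(path, 'w', encoding='utf-8') as f:
                f.write(text)
    return failed


def search_cache(urls, cache_dir, keywords):
    """Map each URL to 1, 0 or "wrong/missing url", reading only local files."""
    results = {}
    for url in urls:
        try:
            with open(cache_path(cache_dir, url), encoding='utf-8') as f:
                output = f.read()
            results[url] = int(any(x in output for x in keywords))
        except OSError:
            results[url] = "wrong/missing url"
    return results
```

With the DataFrame from the question, this could be used by calling download_missing(df.url, '/path/to/links') once, and then df['results'] = df.url.map(search_cache(df.url, '/path/to/links', mykeywords)) each time mykeywords changes. Only the second call needs to be repeated, and it does not touch the network.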
