Python - pandas and NLTK: get most common phrases
I'm fairly new to Python, and I'm working with a pandas data frame that has a column full of text. I'm trying to take that column and use NLTK to find common phrases (three or four words).
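    # imports and bigram_measures are assumed; they were not shown in the original post
    from nltk import word_tokenize
    from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

    bigram_measures = BigramAssocMeasures()

    # strip punctuation, lowercase, then tokenize each description
    dat["text_clean"] = dat["description"].str.replace(r'[^\w\s]', '').str.lower()
    dat["text_clean2"] = dat["text_clean"].apply(word_tokenize)

    finder = BigramCollocationFinder.from_words(dat["text_clean2"])
    # keep only bigrams that appear 3+ times
    finder.apply_freq_filter(3)
    # return the 10 n-grams with the highest PMI
    print(finder.nbest(bigram_measures.pmi, 10))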
The first two commands seem to work fine. However, when I attempt to use BigramCollocationFinder, it throws the following error.
    In [437]: finder = BigramCollocationFinder.from_words(dat["text_clean2"])
    finder
    Traceback (most recent call last):
      File "<ipython-input-437-635c3b3afaf4>", line 1, in <module>
        finder = BigramCollocationFinder.from_words(dat["text_clean2"])
      File "/users/abrahammathew/anaconda/lib/python2.7/site-packages/nltk/collocations.py", line 168, in from_words
        wfd[w1] += 1
    TypeError: unhashable type: 'list'
Any idea what this refers to, or is there a workaround?
I get the same error with the following commands as well.
    gg = dat["text_clean2"].tolist()
    finder = BigramCollocationFinder.from_words(gg)
    finder = BigramCollocationFinder.from_words(dat["text_clean2"].values.reshape(-1, ))
The following runs without error, but reports that there are no common phrases.
    gg = dat["description"].str.replace(r'[^\w\s]', '').str.lower()
    finder = BigramCollocationFinder.from_words(gg)
    # keep only bigrams that appear 2+ times
    finder.apply_freq_filter(2)
    # return the 10 n-grams with the highest PMI
    print(finder.nbest(bigram_measures.pmi, 10))
It seems the BigramCollocationFinder class wants a list of words, not a list of lists. Try this:
    finder = BigramCollocationFinder.from_words(dat["text_clean2"].values.reshape(-1, ))
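As the question notes, the reshape call still fails with the same error, because each element of the reshaped array is still a list of tokens. Below is a minimal sketch of two ways to hand the finder what it expects, assuming `dat["text_clean2"]` holds one token list per row. The names `token_lists`, `flat_words`, `flat_finder`, and `doc_finder` are illustrative only, and the sample data is a made-up stand-in for `dat["text_clean2"].tolist()`. `from_documents` is available in NLTK 3; it may not exist in older releases.

```python
from itertools import chain

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

bigram_measures = BigramAssocMeasures()

# Hypothetical stand-in for dat["text_clean2"].tolist(): one token list per row.
token_lists = [
    ["free", "shipping", "on", "all", "orders"],
    ["free", "shipping", "this", "week"],
    ["get", "free", "shipping", "today"],
]

# Option 1: flatten everything into a single word sequence.
# Note that this also counts bigrams that straddle two rows.
flat_words = list(chain.from_iterable(token_lists))
flat_finder = BigramCollocationFinder.from_words(flat_words)
flat_finder.apply_freq_filter(3)  # keep bigrams seen at least 3 times
top_flat = flat_finder.nbest(bigram_measures.pmi, 10)

# Option 2: let NLTK treat each row as its own document,
# so no bigram crosses a row boundary.
doc_finder = BigramCollocationFinder.from_documents(token_lists)
doc_finder.apply_freq_filter(3)
top_docs = doc_finder.nbest(bigram_measures.pmi, 10)

print(top_flat)
print(top_docs)
```

This would also explain why passing the cleaned strings directly runs but finds nothing: each whole description is then treated as a single "word", so almost no pair repeats. For three- or four-word phrases, `TrigramCollocationFinder` and `QuadgramCollocationFinder` in `nltk.collocations` follow the same pattern, paired with `TrigramAssocMeasures` and `QuadgramAssocMeasures`.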