python 2.7 - How to refine text data?
I have built many spiders that scrape news articles from different websites, and I have an API that converts text into audio clips. I need a framework or Python tools to refine the articles' text, for example by:
- removing mentions of the related source
- removing dates in their various formats
- removing URLs
- expanding acronyms, for example changing CEO to chief executive officer
- removing special characters and typos
- making sure each sentence is still written correctly after the edits
- using already edited articles as a reference for new articles
I am using Python, NLTK and re, but it is exhausting: each time I think I have covered all the cases, I find new cases to add, and I think I am stuck in an infinite loop.
Any suggestions?
First of all, expanding acronyms to their full form is non-trivial and should not be considered part of scraping, but rather part of a second processing step (cf. IBM's The Art of Tokenization).
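To illustrate that split, here is a minimal sketch using only the standard re module. It keeps mechanical cleanup (URLs, numeric dates, special characters) in one function and acronym expansion in a separate, table-driven step. The patterns, the ACRONYMS table and the function names are assumptions made up for this example, not a complete solution; real articles will need more cases.

```python
import re

# Assumed, deliberately incomplete patterns for illustration only.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
# Numeric dates such as 12/03/2015 or 2015-03-12; written-out dates are not handled.
DATE_RE = re.compile(r"\b(?:\d{1,2}[/-]\d{1,2}[/-]\d{2,4}|\d{4}-\d{2}-\d{2})\b")
# Anything that is not a word character, whitespace or basic punctuation.
SPECIAL_RE = re.compile(r"[^\w\s.,;:!?'\"()-]", re.UNICODE)
SPACE_BEFORE_PUNCT_RE = re.compile(r"\s+([.,;:!?])")
SPACE_RE = re.compile(r"\s+")

# Hypothetical lookup table; in practice this would be maintained by hand.
ACRONYMS = {"CEO": "chief executive officer"}


def clean(text):
    """Step 1: mechanical cleanup that does not depend on meaning."""
    text = URL_RE.sub(" ", text)
    text = DATE_RE.sub(" ", text)
    text = SPECIAL_RE.sub(" ", text)
    text = SPACE_BEFORE_PUNCT_RE.sub(r"\1", text)
    return SPACE_RE.sub(" ", text).strip()


def expand_acronyms(text, table=ACRONYMS):
    """Step 2: replace whole-word acronyms found in the lookup table."""
    if not table:
        return text
    # Longest keys first so that a longer acronym wins over a shorter prefix.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: table[m.group(1)], text)


def refine(text):
    """Run the two steps in order: cleanup first, then expansion."""
    return expand_acronyms(clean(text))
```

Keeping the two steps apart means the regular expressions can grow as new cases appear without touching the acronym table, and the other way round. Note that removing a date or URL can still leave an ungrammatical sentence behind, which this sketch does not try to repair.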
Cleaning scraped data is tedious, unfortunately: there is no magical solution, because what each person is interested in scraping is different; you might be interested only in URLs, for example. Nevertheless, have you tried using BeautifulSoup? It is a Python library that offers a nice API for handling many common scraping-related tasks.
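As a rough sketch of what that can look like, the function below uses BeautifulSoup (the bs4 package, which has to be installed separately) to pull paragraph text out of a page while discarding script and style elements. The function name article_text and the assumption that the article body lives in p tags are mine; each site will likely need its own selectors.

```python
from bs4 import BeautifulSoup


def article_text(html):
    """Return the text of all <p> elements, one paragraph per line."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements that never contain article prose.
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()
    paragraphs = []
    for p in soup.find_all("p"):
        # Collapse runs of whitespace left over from the markup.
        text = " ".join(p.get_text().split())
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)
```

For example, article_text("<p>Hello <b>world</b>.</p>") gives "Hello world." with the markup removed, and the result can then be passed on to whatever text cleanup you apply afterwards.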