python 2.7 - How to refine text data?


I have built many spiders that scrape news articles from different websites, and I have an API that converts text into audio clips. I need a framework or Python tools to refine the articles' text, such as:

Removing anything related to the source. Removing dates in their various formats. Removing URLs. Expanding acronyms, for example CEO to Chief Executive Officer. Removing special characters and typos.

Making sure each sentence is still written correctly after the edits. Using the edited articles as a reference for new articles.
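Several of the mechanical tasks listed above (URLs, dates, special characters) can be chained as an ordered list of regular-expression passes. The following is only a minimal sketch using the standard `re` module: the function name `clean_article` and every pattern are illustrative assumptions, not an exhaustive rule set, and it is written for Python 3 although the same `re` calls exist in 2.7.

```python
import re

# Illustrative patterns only; real articles will need more cases.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)

# Numeric dates such as 12/05/2015 or 2015-05-12.
NUMERIC_DATE_RE = re.compile(r"\b\d{1,4}[/-]\d{1,2}[/-]\d{1,4}\b")

# Written dates such as "May 12, 2015" or "12 May 2015".
MONTHS = (
    r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|"
    r"Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)"
)
WRITTEN_DATE_RE = re.compile(
    r"\b(?:\d{1,2}\s+" + MONTHS + r"|" + MONTHS + r"\s+\d{1,2}),?\s+\d{4}\b"
)

# Anything that is not a word character, whitespace, or common punctuation.
SPECIAL_CHARS_RE = re.compile(r"[^\w\s.,;:!?'\"()-]")

# Used to tidy up the gaps left behind by the removals.
SPACE_BEFORE_PUNCT_RE = re.compile(r"\s+([.,;:!?])")
WHITESPACE_RE = re.compile(r"\s+")


def clean_article(text):
    """Apply each removal pass in order, then normalize spacing."""
    text = URL_RE.sub(" ", text)
    text = NUMERIC_DATE_RE.sub(" ", text)
    text = WRITTEN_DATE_RE.sub(" ", text)
    text = SPECIAL_CHARS_RE.sub(" ", text)
    text = SPACE_BEFORE_PUNCT_RE.sub(r"\1", text)
    return WHITESPACE_RE.sub(" ", text).strip()


print(clean_article("Updated 12 May 2015. Read more at http://example.com/news today."))
```

Keeping each rule as a separately named pattern makes it easier to add a new case when one turns up, and to test each rule on its own. Note that a removal can leave a sentence grammatically incomplete, so this does not address checking sentence correctness after the edits.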

I am using Python, NLTK, and re, but it's exhausting: each time I think I have covered all the cases, I find new cases to add, and I think I am stuck in an infinite loop.

Any suggestions?

First of all, expanding acronyms to their full form is non-trivial and should not be considered part of scraping, but rather part of a second processing step (cf. IBM's The Art of Tokenization).
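To illustrate keeping acronym expansion as its own step, run after scraping and cleaning, here is a minimal dictionary-based sketch. The `ACRONYMS` table and the `expand_acronyms` name are hypothetical, and a plain lookup like this cannot resolve ambiguous acronyms, which is part of what makes the problem non-trivial.

```python
import re

# Hypothetical lookup table; a real one would be larger and domain-specific.
ACRONYMS = {
    "CEO": "Chief Executive Officer",
    "CFO": "Chief Financial Officer",
    "GDP": "gross domestic product",
}

# Match whole words only; longer keys come first so overlapping keys
# resolve predictably.
_ACRONYM_RE = re.compile(
    r"\b("
    + "|".join(sorted(map(re.escape, ACRONYMS), key=len, reverse=True))
    + r")\b"
)


def expand_acronyms(text):
    """Replace each known acronym (case-sensitive) with its full form."""
    return _ACRONYM_RE.sub(lambda m: ACRONYMS[m.group(1)], text)


print(expand_acronyms("The CEO met the CFO."))
```

The match is case-sensitive and bounded by `\b`, so a lowercase "ceo" or a plural such as "CEOs" is left untouched; whether those should be expanded is a decision for the second processing step.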

Cleaning scraped data is tedious, unfortunately: there is no magical solution, because what people are interested in scraping differs. One might be interested only in URLs, for example. Nevertheless, have you tried using BeautifulSoup? It's a Python library that offers a nice API for handling many common scraping-related tasks.
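As a rough idea of what that looks like, the sketch below uses BeautifulSoup to keep only paragraph text and discard script and style elements. It assumes the third-party `beautifulsoup4` package is installed and that article text lives in `<p>` tags, which will not hold for every site; `extract_paragraphs` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4


def extract_paragraphs(html):
    """Return the visible text of each <p> element in the page."""
    soup = BeautifulSoup(html, "html.parser")

    # Drop elements that never contain article prose.
    for tag in soup(["script", "style"]):
        tag.decompose()

    # get_text(" ", strip=True) strips each text node and joins them with a space.
    return [p.get_text(" ", strip=True) for p in soup.find_all("p")]


page = (
    "<html><body><script>var x = 1;</script>"
    "<p>Shares <b>rose</b> sharply.</p><p>Analysts agreed.</p>"
    "</body></html>"
)
print(extract_paragraphs(page))
```

Extracting only the elements that hold the article body at parse time tends to remove a lot of noise before any regular expressions are applied.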

