azure - Is there any way to use Cognitive Services to detect whether a string contains words vs. just junk shift chars/gibberish?

August 15, 2013

I'm trying to find a way to use Cognitive Services to detect whether a string contains a piece of coherent text or just junk. For example:

sdf#%# asfsds b

Hi my name is Sam.

This seems impossible to do. I had the idea of running the text through the keyword text analysis (which would give me a keyword of asfsds (how useful!)) and then running that keyword through Bing Spell Check. I'm not sure what is going on in the USA, but it seems that asfsds is English. Quite... erm... dumb.

I've tried running similar text through a bunch of the services (like language detection), and they all seem convinced that the gibberish samples are 100% coherent English.

I'm going to quiz an MS rep on Friday, but I was wondering if anyone has achieved this using Cognitive Services?

Rather than asking a binary is-word-or-not question, you might instead consider the probability of a word being gibberish. You can then choose whatever threshold you like.

For computing word probabilities, you might try the Web Language Model API. Look at the joint probability, for example. For your set of words, the response looks as follows (values are for the body corpus):

{
  "results": [
    { "words": "sdf#%#", "probability": -12.215 },
    { "words": "asfsds", "probability": -12.215 },
    { "words": "b",      "probability": -3.127 },
    { "words": "hi",     "probability": -3.905 },
    { "words": "my",     "probability": -2.528 },
    { "words": "name",   "probability": -3.128 },
    { "words": "is",     "probability": -2.201 },
    { "words": "sam.",   "probability": -12.215 },
    { "words": "sam",    "probability": -4.431 }
  ]
}

You will notice a couple of idiosyncrasies:

The probabilities are negative. This is because they are logarithmic.
All terms are case-folded. This means the corpus won't distinguish between, say, GOAT and goat.
The caller must perform a certain amount of normalization (note the probability of sam. vs sam).
Corpora are only available for the en-us market. This could be problematic depending on your use case.
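
The thresholding idea can be sketched against the response above. This is only a minimal sketch, not a client for the service: the scores are copied from the sample response, while the names `normalize`, `is_gibberish`, `gibberish_ratio`, and the cutoff of -10.0 are hypothetical choices made for illustration. Treating tokens with no score as gibberish is also an assumption; in practice you would query the service with the normalized tokens instead.

```python
import string

# Log-probabilities copied from the sample response above (body corpus).
SAMPLE_LOG_PROBS = {
    "sdf#%#": -12.215,
    "asfsds": -12.215,
    "b": -3.127,
    "hi": -3.905,
    "my": -2.528,
    "name": -3.128,
    "is": -2.201,
    "sam.": -12.215,
    "sam": -4.431,
}

# Hypothetical cutoff: scores below this are treated as gibberish.
GIBBERISH_THRESHOLD = -10.0


def normalize(token):
    """Lower-case a token and strip leading/trailing punctuation."""
    return token.lower().strip(string.punctuation)


def is_gibberish(token, log_probs, threshold=GIBBERISH_THRESHOLD):
    """Return True if the normalized token scores below the threshold.

    Tokens with no known score are treated as gibberish (an assumption).
    """
    score = log_probs.get(normalize(token))
    return score is None or score < threshold


def gibberish_ratio(text, log_probs):
    """Return the fraction of whitespace-separated tokens flagged as gibberish."""
    tokens = text.split()
    if not tokens:
        return 0.0
    flagged = sum(is_gibberish(token, log_probs) for token in tokens)
    return flagged / len(tokens)
```

With the sample scores, `gibberish_ratio("Hi my name is Sam.", SAMPLE_LOG_PROBS)` flags nothing, because stripping the trailing full stop turns "Sam." into "sam", while two of the three tokens in "sdf#%# asfsds b" are flagged. Where to place the cutoff on the ratio is again up to you.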

A more advanced use case would be computing conditional probabilities, i.e. the probability of a word in the context of the words preceding it.
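
How a conditional probability relates to the joint probabilities can be sketched as follows. Because the values are logarithms, dividing the joint probability of a phrase by the joint probability of its prefix becomes a subtraction. The function name is hypothetical, and this shows only the arithmetic, not how the service computes its own conditional scores.

```python
def conditional_log_prob(joint_with_word, joint_of_context):
    """Return log P(word | context) from two joint log-probabilities.

    joint_with_word:  log P(context followed by word)
    joint_of_context: log P(context)
    Both values must use the same logarithm base.
    """
    return joint_with_word - joint_of_context
```

For example, with made-up scores of -5.0 for "my name" and -5.5 for "my name is", the conditional log-probability of "is" following "my name" would be -0.5. A word that is very unlikely given its context would show a much larger drop.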
