azure - Is there any way to use Cognitive Services to detect if a string contains words vs. just junk shift chars/gibberish?
I'm trying to find a way to use Cognitive Services to detect whether a string contains a piece of coherent text or junk. For example:
sdf#%# asfsds b
vs
hi my name is sam.
This seems impossible to do. I had the idea of running the text through the keywords text analysis (which gives me a keyword of asdsds (how useful!)), then running the keyword through Bing Spell Check. I'm not sure what's going on in the USA, but it seems asfsds is English. Quite... erm... dumb.
I've tried running similar text through a bunch of services (like language detection), and they all seem convinced that my gibberish samples are 100% coherent English.
I'm going to quiz an MS rep on Friday, but I was wondering if anyone has achieved this using Cognitive Services?
Rather than asking a binary is-word-or-not question, you might instead consider the probability of a word being gibberish. You can then choose whatever threshold you like.
For computing word probabilities, you might try the Web Language Model API. Look at joint probability, for example. For your set of words, the response looks as follows (values are from the body corpus):
{ "results": [ { "words": "sdf#%#", "probability": -12.215 }, { "words": "asfsds", "probability": -12.215 }, { "words": "b", "probability": -3.127 }, { "words": "hi", "probability": -3.905 }, { "words": "my", "probability": -2.528 }, { "words": "name", "probability": -3.128 }, { "words": "is", "probability": -2.201 }, { "words": "sam.", "probability": -12.215 }, { "words": "sam", "probability": -4.431 } ] }
You will notice a couple of idiosyncrasies:
- The probabilities are negative. This is because they are logarithmic.
- All terms are case-folded. This means the corpus won't distinguish between, say, GOAT and goat.
- The caller must perform a certain amount of normalization (note the probability of sam. vs sam).
- Corpora are only available for the en-us market. This could be problematic depending on your use case.
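The thresholding idea can be sketched in Python against the sample response above. This is only a rough illustration: the SCORES table stands in for a real API call, and the threshold of -8.0, the 0.5 ratio, and the function names are all assumptions of mine, not anything the service prescribes. The value -12.215 recurs for every unrecognized token in the sample, so I treat it as a floor for unknown words, but that is an inference from this one response.

```python
import string

# Log-probabilities copied from the sample response above (body corpus).
# In real use these would come from the API; this table is a stand-in.
SCORES = {
    "sdf#%#": -12.215, "asfsds": -12.215, "b": -3.127,
    "hi": -3.905, "my": -2.528, "name": -3.128,
    "is": -2.201, "sam.": -12.215, "sam": -4.431,
}

# Assumed floor for tokens missing from the table (inferred from the sample).
UNKNOWN_FLOOR = -12.215


def normalize(token):
    """Lowercase the token and strip leading/trailing punctuation."""
    return token.lower().strip(string.punctuation)


def token_score(token, scores=SCORES):
    """Look up the normalized token first, then the raw token, then the floor."""
    normalized = normalize(token)
    if normalized in scores:
        return scores[normalized]
    return scores.get(token, UNKNOWN_FLOOR)


def gibberish_ratio(text, threshold=-8.0, scores=SCORES):
    """Fraction of whitespace-separated tokens scoring below the threshold."""
    tokens = text.split()
    if not tokens:
        return 0.0
    low = sum(1 for t in tokens if token_score(t, scores) < threshold)
    return low / len(tokens)


def is_gibberish(text, threshold=-8.0, max_ratio=0.5, scores=SCORES):
    """Flag the text when more than max_ratio of its tokens look improbable."""
    return gibberish_ratio(text, threshold, scores) > max_ratio


print(is_gibberish("sdf#%# asfsds b"))     # True: 2 of 3 tokens fall below -8.0
print(is_gibberish("hi my name is sam."))  # False: "sam." is normalized to "sam"
```

Both the threshold and the ratio would need tuning on real samples. Short strings are especially sensitive, since a single unknown token can swing the ratio.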
A more advanced use case would be computing conditional probabilities, i.e. the probability of a word in the context of the words preceding it.
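To illustrate how conditional scores could be combined, here is a small sketch that sums per-word conditional log-probabilities into a score for the whole string (the chain rule in log space). The cond_logprob callable is a hypothetical placeholder for whatever returns the log-probability of a word given its preceding words, and max_context is likewise an assumed parameter; neither is taken from the API itself.

```python
def sequence_log_probability(words, cond_logprob, max_context=4):
    """Sum log P(word | preceding words) over the sequence.

    cond_logprob(context, word) is a caller-supplied function (hypothetical
    here) that returns a log-probability; context is a tuple holding at most
    max_context preceding words.
    """
    total = 0.0
    for i, word in enumerate(words):
        context = tuple(words[max(0, i - max_context):i])
        total += cond_logprob(context, word)
    return total


def toy_scorer(context, word):
    """Placeholder scorer with made-up values, for demonstration only."""
    return -2.0 if not context else -1.0


score = sequence_log_probability(["hi", "my", "name"], toy_scorer)
print(score)  # -4.0 with the toy scorer: -2.0 + -1.0 + -1.0
```

Dividing the total by the number of words gives a per-word average, which is easier to compare across strings of different lengths.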