Noise is an unavoidable fact of life. It can manifest itself
at the earliest stages of processing in the form of degraded
inputs that our systems must be prepared to handle. People
are adept at pattern recognition tasks involving
typeset or handwritten documents or recorded speech;
machines are less so. From the perspective of downstream
processes that take as their inputs the outputs of recognition
systems, including document analysis and OCR, noise can
be viewed as the errors made by earlier stages of processing,
which are rarely perfect and sometimes quite brittle.  


Noisy unstructured text data is also found in informal settings such as online chat, SMS, email, message board and newsgroup postings, blogs, wikis and web pages. In addition to the aforementioned recognition errors, such text may contain spelling errors, abbreviations, non-standard terminology, missing punctuation, and misleading case information, as well as, in the case of speech, false starts, repetitions, and pause-filling sounds such as “um” and “uh”.


