Noisy unstructured text data is ubiquitous in real-world communication. Natural language and the creative ways that humans use it can create problems for computational techniques. Electronic text from the Internet (emails, message boards, newsgroups, blogs, wikis, chat logs and web pages), contact centers (complaints, emails, call transcriptions, message summaries), and mobile phones (SMS) is often noisy: it contains spelling errors, abbreviations, non-standard words, false starts, repetitions, missing punctuation, missing case information and special characters.
Informal communications are not the only source of noisy text. Text produced by processing signals intended for human use, such as printed or handwritten documents, spontaneous speech, and camera-captured scene images, is another prime example. Recognition errors made by Optical Character Recognition (OCR) and Automatic Speech Recognition (ASR) systems can result in imperfect transcriptions. The ongoing mass digitization of the world’s written cultural heritage is producing an increasing stream of imperfect OCR results.
Such noise in text has raised new challenges for Information Retrieval (IR) and Knowledge Management (KM). Both special handling of noise and noise-robust IR and KM techniques are essential to overcome those challenges.