1. Natural Language Processing (NLP)
Natural language processing (NLP) concerns itself with the interaction between natural human languages and computing devices. NLP is a major aspect of computational linguistics, and also falls within the realms of computer science and artificial intelligence.
2. Tokenization
Tokenization is, generally, an early step in the NLP process, a step that splits longer strings of text into smaller pieces, or tokens. Larger chunks of text can be tokenized into sentences, sentences can be tokenized into words, etc. Further processing is generally performed after a piece of text has been appropriately tokenized.
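Both levels of tokenization can be sketched in a few lines of Python with the standard re module. This is a simplification: practical tokenizers, such as those in NLTK or spaCy, handle abbreviations, quotation marks, and many other edge cases.

```python
import re

def tokenize_sentences(text):
    # Split on sentence-ending punctuation followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", text.strip())

def tokenize_words(sentence):
    # Keep alphabetic tokens (with internal apostrophes), drop punctuation.
    return re.findall(r"[A-Za-z']+", sentence)

text = "Larger chunks become sentences. Sentences become words!"
for sentence in tokenize_sentences(text):
    print(tokenize_words(sentence))
```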
3. Normalization
Before further processing, the text must be normalized. Normalization generally refers to a series of related tasks meant to put all text on a level playing field: converting all text to the same case (upper or lower), removing punctuation, expanding contractions, converting numbers to their word equivalents, and so on. Normalization puts all words on equal footing and allows processing to proceed uniformly.
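A minimal sketch of a few of these steps follows; it expands only one contraction pattern and omits tasks like number-to-word conversion, which real pipelines would also handle.

```python
import re

def normalize(text):
    text = text.lower()                    # convert to a single case
    text = re.sub(r"n't\b", " not", text)  # expand one common contraction
    text = re.sub(r"[^\w\s]", "", text)    # strip punctuation
    return text

print(normalize("Don't SHOUT, please!"))
```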
4. Stemming
Stemming is the process of eliminating affixes (suffixes, prefixes, infixes, circumfixes) from a word in order to obtain a word stem.
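The idea can be illustrated with a deliberately naive suffix stripper; production stemmers such as the Porter algorithm apply ordered rules with many extra conditions.

```python
def crude_stem(word):
    # Strip the first matching suffix, as long as a stem of at
    # least three characters remains.
    for suffix in ("ing", "edly", "ed", "ly", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word

print(crude_stem("jumping"), crude_stem("cats"))
```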
5. Lemmatization
Lemmatization is related to stemming, differing in that lemmatization is able to capture canonical forms based on a word's lemma. For example, stemming the word "better" would fail to return its canonical form, while lemmatization would return "good."
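At its simplest, lemmatization can be sketched as a lookup; real lemmatizers (e.g. WordNet-based ones) combine large dictionaries with part-of-speech information to resolve lemmas.

```python
# A toy lemma table for illustration only.
LEMMAS = {"better": "good", "ran": "run", "mice": "mouse", "is": "be"}

def lemmatize(word):
    # Fall back to the word itself when no lemma is known.
    return LEMMAS.get(word.lower(), word)

print(lemmatize("better"))
```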
6. Corpus
In linguistics and NLP, corpus refers to a collection of texts. Such collections may be formed of a single language of texts, or can span multiple languages -- there are numerous reasons that multilingual corpora (the plural of corpus) may be useful. Corpora may also consist of themed texts (historical, Biblical, etc.). Corpora are generally used solely for statistical linguistic analysis and hypothesis testing.
7. Stop Words
Stop words are those words which are filtered out before further processing of text, since these words contribute little to overall meaning, given that they are generally the most common words in a language. For instance, "the," "and," and "a," while all required words in a particular passage, don't generally contribute greatly to one's understanding of content. As a simple example, the pangram "The quick brown fox jumps over the lazy dog" is just as legible if the stop words are removed.
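Stop word filtering amounts to a set-membership test. The list below is a tiny illustrative one; libraries such as NLTK ship much longer stop word lists.

```python
STOP_WORDS = {"the", "a", "an", "and", "over"}

def remove_stop_words(text):
    # Drop any token found in the stop word set.
    return " ".join(w for w in text.lower().split() if w not in STOP_WORDS)

print(remove_stop_words("The quick brown fox jumps over the lazy dog"))
```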
8. Parts-of-speech (POS) Tagging
POS tagging consists of assigning a category tag to the tokenized parts of a sentence. The most popular POS tagging scheme identifies words as nouns, verbs, adjectives, etc.
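A toy lookup tagger shows the shape of the task; practical taggers are statistical or neural, and use surrounding context to disambiguate words like "book" (noun vs. verb).

```python
# An illustrative tag dictionary, not a real tagset resource.
TAGS = {"the": "DET", "dog": "NOUN", "barks": "VERB", "loudly": "ADV"}

def pos_tag(tokens):
    # Tag each token, marking unknown words as UNK.
    return [(t, TAGS.get(t.lower(), "UNK")) for t in tokens]

print(pos_tag(["The", "dog", "barks"]))
```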
9. Statistical Language Modeling
Statistical language modeling is the process of building a statistical language model, which is meant to provide a probability estimate of a natural language. For a sequence of input words, the model would assign a probability to the entire sequence, which contributes to the estimated likelihood of various possible sequences. This can be especially useful for NLP applications which generate text.
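A maximum-likelihood bigram model is about the simplest instance: the probability of a word given the previous word is estimated from counts. This sketch ignores smoothing, which real models need for unseen sequences.

```python
from collections import Counter

def bigram_probs(tokens):
    # P(w2 | w1) = count(w1 w2) / count(w1 as a bigram start).
    starts = Counter(tokens[:-1])
    pairs = Counter(zip(tokens, tokens[1:]))
    return {pair: c / starts[pair[0]] for pair, c in pairs.items()}

probs = bigram_probs("i like tea i like coffee".split())
print(probs[("i", "like")])
```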
10. Bag of Words
A bag of words is a particular representation model used to simplify the contents of a selection of text. The bag of words model omits grammar and word order, but is interested in the number of occurrences of words within the text. The ultimate representation of the text selection is that of a bag of words (bag referring to the mathematical concept of multisets, which differ from simple sets).
Actual storage mechanisms for the bag of words representation can vary, but the following is a simple example using a dictionary for intuitiveness.
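One such dictionary-based sketch, built on Python's collections.Counter:

```python
from collections import Counter

def bag_of_words(text):
    # Count token occurrences, discarding grammar and word order.
    return dict(Counter(text.lower().split()))

print(bag_of_words("the dog chased the cat"))
```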
11. n-grams
n-grams is another representation model for simplifying text selection contents. As opposed to the orderless representation of bag of words, n-grams modeling is interested in preserving contiguous sequences of n items from the text selection.
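Extracting n-grams from a token list is a one-liner over sliding windows:

```python
def ngrams(tokens, n):
    # All contiguous, order-preserving sequences of n tokens.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["the", "quick", "brown", "fox"], 2))
```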
12. Regular Expressions
Regular expressions, often abbreviated regexp or regex, are a tried and true method of concisely describing patterns of text. A regular expression is represented as a special text string itself, and is meant for developing search patterns on selections of text. Regular expressions can be thought of as an expanded set of rules beyond the wildcard characters ? and *. Though often cited as frustrating to learn, regular expressions are incredibly powerful text searching tools.
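As a small demonstration, the pattern below matches email-like strings. It is nowhere near RFC-complete, but shows how a regular expression describes a text pattern that plain wildcards cannot.

```python
import re

# Word characters and dots, an @, a domain, and a final dotted suffix.
EMAIL = re.compile(r"\b[\w.]+@[\w.]+\.\w+\b")

def find_emails(text):
    return EMAIL.findall(text)

print(find_emails("Write to ada@example.com or visit the office."))
```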
13. Zipf's Law
Zipf's Law is used to describe the relationship between word frequencies in document collections. If a document collection's words are ordered by frequency, and y is used to describe the number of times that the xth word appears, Zipf's observation is concisely captured as y = cx^(-1) (item frequency is inversely proportional to item rank).
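The relationship can be sketched directly: rank words by frequency, then compare against the c/x prediction. Real corpora only approximate the law, so this is a rough model rather than an exact fit.

```python
from collections import Counter

def rank_frequencies(tokens):
    # (word, frequency) pairs ordered from most to least frequent.
    return Counter(tokens).most_common()

def zipf_prediction(c, rank):
    # Zipf's law predicts the rank-x word appears about c / x times.
    return c / rank

print(rank_frequencies("a a a b b c".split()))
print(zipf_prediction(600, 2))
```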
14. Similarity Measures
There are numerous similarity measures which can be applied to NLP. What are we measuring the similarity of? Generally, strings.
Levenshtein - the number of characters that must be deleted, inserted, or substituted in order to make a pair of strings equal
Jaccard - the measure of overlap between two sets; in the case of NLP, generally, documents are sets of words
Smith-Waterman - similar to Levenshtein, but with costs assigned to substitution, insertion, and deletion
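Two of these measures are small enough to sketch directly. The Levenshtein version below is the classic dynamic-programming formulation; Smith-Waterman builds on the same idea with configurable costs and local alignment.

```python
def levenshtein(a, b):
    # Minimum number of insertions, deletions, and substitutions
    # needed to turn string a into string b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def jaccard(doc_a, doc_b):
    # Overlap between two documents treated as sets of words.
    a, b = set(doc_a.split()), set(doc_b.split())
    return len(a & b) / len(a | b)

print(levenshtein("kitten", "sitting"), jaccard("the cat sat", "the cat ran"))
```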
15. Syntactic Analysis
Also referred to as parsing, syntactic analysis is the task of analyzing strings as symbols, and ensuring their conformance to an established set of grammatical rules. This step must, out of necessity, precede any further analysis which attempts to extract insight from the text -- semantic, sentiment, etc. -- treating it as something beyond symbols.
16. Semantic Analysis
Also referred to as meaning generation, semantic analysis is interested in determining the meaning of text selections (either character or word sequences). After an input selection of text is read and parsed (analyzed syntactically), the text selection can then be interpreted for meaning. Simply put, syntactic analysis is concerned with what words a text selection is made of, while semantic analysis wants to know what the collection of words actually means. The topic of semantic analysis is both broad and deep, with a wide variety of tools and techniques at the researcher's disposal.
17. Sentiment Analysis
Sentiment analysis is the process of evaluating and determining the sentiment captured in a selection of text, with sentiment defined as feeling or emotion. This sentiment can be simply positive (happy), negative (sad or angry), or neutral, or it can be some more precise measurement along a scale, with neutral in the middle, and positive and negative increasing in either direction.
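The simplest approach is lexicon-based scoring, sketched below with a toy lexicon; real sentiment systems use large lexicons or trained classifiers, and handle negation, intensity, sarcasm, and more.

```python
# An illustrative four-word lexicon, not a real resource.
LEXICON = {"happy": 1, "great": 1, "sad": -1, "terrible": -1}

def sentiment(text):
    # Sum word scores; the sign of the total decides the label.
    score = sum(LEXICON.get(w, 0) for w in text.lower().split())
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("a great and happy day"))
```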
18. Information Retrieval
Information retrieval is the process of accessing and retrieving the most appropriate information from text based on a particular query, using context-based indexing or metadata. One of the most famous examples of information retrieval is Google Search.