Sentence segmentation, tokenization (normalization, americanize, etc.)
A lemma is a word that stands at the head of a definition in a dictionary. ... A lexeme is a unit of meaning, and can be more than one word: it is the set of all forms that share the same meaning, while the lemma is the particular form chosen by convention to represent the lexeme.
Word segmentation is needed for languages such as Chinese and Arabic, where tokens are not simply whitespace-delimited.
Part-of-speech (POS) tagger, assigns parts of speech and other token labels to each word (noun, verb, plural noun, etc.)
Named entity recognizer (NER), extracts named entities (PERSON, ORGANIZATION, LOCATION)
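A minimal sketch of these stages, assuming the Stanford CoreNLP Java API (which matches the annotators described here); the class name and sample sentence are illustrative:

```java
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.CoreSentence;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

import java.util.Properties;

public class PipelineDemo {
  public static void main(String[] args) {
    // Enable the annotators discussed above: tokenizer, sentence
    // splitter, POS tagger, lemmatizer, and named entity recognizer.
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    CoreDocument doc = new CoreDocument("Joe Smith moved to Paris. He works at IBM.");
    pipeline.annotate(doc);

    for (CoreSentence sentence : doc.sentences()) {
      System.out.println("Sentence: " + sentence.text());
      for (CoreLabel tok : sentence.tokens()) {
        // word, POS tag (e.g. NNP), lemma (e.g. "move" for "moved"),
        // and NER label (e.g. PERSON, LOCATION, ORGANIZATION, or O)
        System.out.printf("%-8s %-5s %-8s %s%n",
            tok.word(), tok.tag(), tok.lemma(), tok.ner());
      }
    }
  }
}
```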
Looks like a grammatical parser: it works out the grammatical structure of sentences (constituency and dependency parses).
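A minimal sketch, again assuming CoreNLP's parse annotator; it prints both views of the structure for an illustrative sentence:

```java
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.CoreSentence;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

import java.util.Properties;

public class ParseDemo {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,parse");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    CoreDocument doc = new CoreDocument("The quick brown fox jumps over the lazy dog.");
    pipeline.annotate(doc);

    CoreSentence sentence = doc.sentences().get(0);
    // Phrase-structure tree, e.g. (ROOT (S (NP ...) (VP ...)))
    System.out.println(sentence.constituencyParse());
    // Typed dependencies, e.g. nsubj(jumps, fox), det(fox, The)
    System.out.println(sentence.dependencyParse());
  }
}
```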
Coreference resolution: resolves "he", "she", "it", "his", etc. to the entities they refer to.
Deterministic: fast, rule-based; available for English and Chinese
Statistical: machine-learning based; English only
Neural: the most accurate but slow; English and Chinese
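A sketch of switching between these three systems, assuming CoreNLP's coref.algorithm property; the sample text is illustrative:

```java
import edu.stanford.nlp.coref.data.CorefChain;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

import java.util.Properties;

public class CorefDemo {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,coref");
    // Choose among the three systems above:
    // "deterministic", "statistical", or "neural".
    props.setProperty("coref.algorithm", "neural");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    CoreDocument doc = new CoreDocument(
        "Barack Obama was born in Hawaii. He was elected president in 2008.");
    pipeline.annotate(doc);

    // Each chain groups the mentions that refer to the same entity,
    // e.g. {"Barack Obama", "He"}.
    for (CorefChain chain : doc.corefChains().values()) {
      System.out.println(chain);
    }
  }
}
```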
Classification, e.g. email -> spam/normal.
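To make the classification idea concrete, here is a toy multinomial Naive Bayes sketch; it is hand-rolled for illustration, not any particular library's API, and the tiny training examples are invented:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy spam/normal classifier: count word frequencies per class, then
// pick the class with the highest log-probability. A real system would
// use a proper library and a large corpus.
public class SpamDemo {
  static Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
  static Map<String, Integer> docCounts = new HashMap<>();
  static Map<String, Integer> totalWords = new HashMap<>();
  static Set<String> vocab = new HashSet<>();

  static void train(String label, String text) {
    docCounts.merge(label, 1, Integer::sum);
    Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
    for (String w : text.toLowerCase().split("\\s+")) {
      counts.merge(w, 1, Integer::sum);
      totalWords.merge(label, 1, Integer::sum);
      vocab.add(w);
    }
  }

  static String classify(String text) {
    String best = null;
    double bestScore = Double.NEGATIVE_INFINITY;
    int totalDocs = docCounts.values().stream().mapToInt(Integer::intValue).sum();
    for (String label : docCounts.keySet()) {
      double score = Math.log((double) docCounts.get(label) / totalDocs);
      for (String w : text.toLowerCase().split("\\s+")) {
        int count = wordCounts.get(label).getOrDefault(w, 0);
        // Laplace smoothing so unseen words do not zero out a class.
        score += Math.log((count + 1.0) / (totalWords.get(label) + vocab.size()));
      }
      if (score > bestScore) { bestScore = score; best = label; }
    }
    return best;
  }

  public static void main(String[] args) {
    train("spam",   "win money now claim your free prize");
    train("normal", "meeting notes attached see you tomorrow");
    System.out.println(classify("claim your prize now")); // -> spam
  }
}
```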
Input: seed sets (dictionaries) of entities for each class, plus unlabeled text
Output: more entities belonging to those classes, extracted from the text
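A toy single-round bootstrapping sketch of this seed-driven extraction; it is hand-rolled for illustration (not the actual pattern-learning API), and the seeds and text are invented:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Learn contexts around seed entities, then harvest new words that
// occur in the same contexts in the unlabeled text.
public class BootstrapDemo {
  public static void main(String[] args) {
    Set<String> seeds = new HashSet<>(Arrays.asList("aspirin", "ibuprofen"));
    String text = "The doctor prescribed aspirin today. The doctor prescribed ibuprofen today. "
                + "The doctor prescribed paracetamol today.";

    // 1. Learn patterns: the word immediately before and after each seed.
    Set<String> patterns = new HashSet<>();
    String[] tokens = text.replaceAll("[.]", "").split("\\s+");
    for (int i = 1; i < tokens.length - 1; i++) {
      if (seeds.contains(tokens[i])) {
        patterns.add(tokens[i - 1] + " _ " + tokens[i + 1]);
      }
    }

    // 2. Apply patterns: any new word in the same slot becomes a candidate.
    Set<String> extracted = new HashSet<>();
    for (int i = 1; i < tokens.length - 1; i++) {
      if (patterns.contains(tokens[i - 1] + " _ " + tokens[i + 1]) && !seeds.contains(tokens[i])) {
        extracted.add(tokens[i]);
      }
    }
    System.out.println(extracted); // -> [paracetamol]
  }
}
```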
Extraction of relation tuples (typically binary relations) from plain text, with no need to specify a schema in advance.
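A minimal sketch, assuming CoreNLP's OpenIE annotator; the sentence is illustrative:

```java
import edu.stanford.nlp.ie.util.RelationTriple;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.naturalli.NaturalLogicAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

import java.util.Properties;

public class OpenIEDemo {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,depparse,natlog,openie");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    Annotation doc = new Annotation("Obama was born in Hawaii.");
    pipeline.annotate(doc);

    // Each triple is (subject, relation, object); no schema was given.
    for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
      for (RelationTriple triple : sentence.get(NaturalLogicAnnotations.RelationTriplesAnnotation.class)) {
        System.out.println(triple.subjectGloss() + "\t"
            + triple.relationGloss() + "\t" + triple.objectGloss());
      }
    }
  }
}
```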
https://drops.dagstuhl.de/opus/volltexte/2016/6008/pdf/OASIcs-SLATE-2016-3.pdf
OpenNLP appears to have the best results among NLTK, OpenNLP, CoreNLP, and Pattern.
However, OpenNLP is language-agnostic: it is a tool for training NLP models, so we would need to train our own model, and it is not necessarily easy to get a good one for Chinese. Training appears to require tagged data (see the sketch below).
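A sketch of what that training looks like, assuming OpenNLP's name-finder API; person.train is a hypothetical corpus file in OpenNLP's tagged format:

```java
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

import java.io.File;
import java.nio.charset.StandardCharsets;

public class TrainNerDemo {
  public static void main(String[] args) throws Exception {
    // person.train holds tagged sentences, e.g.:
    //   <START:person> Pierre Vinken <END> , 61 years old , will join ...
    InputStreamFactory in = new MarkableFileInputStreamFactory(new File("person.train"));
    ObjectStream<NameSample> samples =
        new NameSampleDataStream(new PlainTextByLineStream(in, StandardCharsets.UTF_8));

    // Training needs this tagged data; there is no pre-built model here.
    TokenNameFinderModel model = NameFinderME.train(
        "en", "person", samples, TrainingParameters.defaultParams(),
        new TokenNameFinderFactory());

    model.serialize(new File("person.bin"));
  }
}
```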
Second place seems to be AWS Comprehend:
https://aws.amazon.com/comprehend/features/
It lacks some desired features, such as a parser.
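A minimal sketch of calling Comprehend, assuming the AWS SDK for Java v2; credentials come from the default provider chain, and the sample text is a placeholder:

```java
import software.amazon.awssdk.services.comprehend.ComprehendClient;
import software.amazon.awssdk.services.comprehend.model.DetectEntitiesRequest;
import software.amazon.awssdk.services.comprehend.model.DetectEntitiesResponse;

public class ComprehendDemo {
  public static void main(String[] args) {
    try (ComprehendClient client = ComprehendClient.create()) {
      DetectEntitiesRequest request = DetectEntitiesRequest.builder()
          .text("Jeff moved to Seattle in 2010.")
          .languageCode("en")
          .build();
      DetectEntitiesResponse response = client.detectEntities(request);
      // Entities, key phrases, sentiment, etc. are supported;
      // full syntactic parsing is not.
      response.entities().forEach(e ->
          System.out.println(e.text() + " -> " + e.typeAsString()));
    }
  }
}
```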
https://cloud.ibm.com/apidocs/natural-language-understanding#text-analytics-features
Here too, the desired features are not complete.
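For comparison, a sketch of calling the NLU analyze endpoint, assuming the IBM Watson Java SDK; the API key, service URL, and text are placeholders:

```java
import com.ibm.cloud.sdk.core.security.IamAuthenticator;
import com.ibm.watson.natural_language_understanding.v1.NaturalLanguageUnderstanding;
import com.ibm.watson.natural_language_understanding.v1.model.AnalysisResults;
import com.ibm.watson.natural_language_understanding.v1.model.AnalyzeOptions;
import com.ibm.watson.natural_language_understanding.v1.model.EntitiesOptions;
import com.ibm.watson.natural_language_understanding.v1.model.Features;

public class WatsonNluDemo {
  public static void main(String[] args) {
    IamAuthenticator authenticator = new IamAuthenticator("apikey");
    NaturalLanguageUnderstanding service =
        new NaturalLanguageUnderstanding("2021-08-01", authenticator);
    service.setServiceUrl(
        "https://api.us-south.natural-language-understanding.watson.cloud.ibm.com");

    // Request only entity extraction; other features (keywords,
    // sentiment, syntax, ...) are toggled the same way.
    Features features = new Features.Builder()
        .entities(new EntitiesOptions.Builder().limit(10).build())
        .build();
    AnalyzeOptions options = new AnalyzeOptions.Builder()
        .text("IBM is headquartered in Armonk, New York.")
        .features(features)
        .build();

    AnalysisResults results = service.analyze(options).execute().getResult();
    System.out.println(results);
  }
}
```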