Check program committee websites, student pages, and best papers; follow new faculty on Twitter
MongoDB, stream processing databases, Truviso
NLP Tools
Stanford NLP
CMU Twitter NLP - POS tagging
We just released a major update of the parallel subtitle corpus in OPUS:
2.8 million subtitle files in 60 languages with a total of over 17 billion tokens in 2.6 billion sentences and sentence fragments.
As usual in OPUS, all languages are sentence-aligned, creating a total of 1,689 bitexts.
The data sets are provided in three formats: standalone XML with standoff sentence alignment, TMX, and aligned plain text (the format often used for training SMT models).
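As a rough illustration of the aligned plain text format, here is a minimal Python sketch. It assumes the common convention of two files with one sentence per line, where line i of the source file is aligned with line i of the target file. The function name `read_aligned_pairs` and the sample sentences are made up for this example and are not part of OPUS.

```python
from itertools import zip_longest


def read_aligned_pairs(src_lines, tgt_lines):
    """Yield (source, target) pairs from two line-aligned iterables.

    Assumes one sentence (or fragment) per line, with line i of the
    source aligned to line i of the target.
    """
    for src, tgt in zip_longest(src_lines, tgt_lines):
        if src is None or tgt is None:
            # One side ran out first, so the inputs are not line-aligned.
            raise ValueError("source and target have different numbers of lines")
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:  # skip pairs where either side is empty
            yield src, tgt


# Hypothetical in-memory sample standing in for two aligned text files.
en = ["Hello.\n", "How are you?\n"]
de = ["Hallo.\n", "Wie geht es dir?\n"]
pairs = list(read_aligned_pairs(en, de))
```

With real data, the two lists would be replaced by two open file handles (for example, hypothetical files `corpus.en` and `corpus.de` opened with `encoding="utf-8"`), so the pairs are streamed rather than loaded into memory.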
More information is available in:
Pierre Lison and Jörg Tiedemann, 2016, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016).
In addition, we provide intra-lingual alignments between alternative subtitles in the same language:
More information about those alignments and how they are sorted into various categories can be found in:
Jörg Tiedemann, 2016, Finding Alternative Translations in a Large Corpus of Movie Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016).
Note that all data sets are created automatically using various pre-processing and alignment tools.
Expect problems at various levels. Feedback is very welcome!
Other new data sets in OPUS:
News Commentary version 11 (originally provided by CASMACAT):
Unlike the original source, this release is truly multilingual, with alignments across all languages.
Global Voices (also provided by CASMACAT):
Again, this version is multilingual.
A corpus of parallel sentences extracted from Wikipedia by Krzysztof Wołk and Krzysztof Marasek. More information:
Krzysztof Wołk and Krzysztof Marasek, 2014, Building Subject-aligned Comparable Corpora and Mining it for Truly Parallel Sentence Pairs. Procedia Technology, 18, Elsevier, pp. 126-132.
For more information on OPUS:
Select the language pair you are interested in to see all resources that are available for that particular language pair.
Data formats are explained here: http://opus.lingfil.uu.se/trac
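For the TMX format mentioned above, a minimal Python sketch of pulling sentence pairs out of a file might look like the following. It relies only on the generic TMX layout (`tu` units containing `tuv` variants with an `xml:lang` attribute and a `seg` child). The sample document is hand-written for illustration and is not taken from OPUS, and the helper name `tmx_pairs` is hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predefined xml: prefix under this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hand-written sample in generic TMX layout (an assumption, not OPUS data).
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="0"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Where are you going?</seg></tuv>
      <tuv xml:lang="sv"><seg>Vart ska du?</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def tmx_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs for the two given languages."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested in inline markup.
                segs[tuv.get(XML_LANG)] = "".join(seg.itertext()).strip()
        # Keep only units that contain both requested languages.
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs


result = tmx_pairs(SAMPLE_TMX, "en", "sv")
```

This version parses the whole document into memory, which is fine for small samples. For files of the size described here, an incremental parser such as `xml.etree.ElementTree.iterparse` would likely be a better fit.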