
Data




Best Papers:

 Year  Conference  Best Papers
 2014  VLDB
       SIGMOD
       KDD
       ICDE
       CIKM
       EDBT
       UAI
       WWW
       CHI
       NAACL
       ACLWEB
       ICML
 2013  VLDB
       SIGMOD
       KDD
       ICDE
       CIKM
       EDBT
       UAI
       WWW
       CHI
       NAACL
       ACLWEB
       ICML

2014

VLDB

SIGMOD

KDD

ICDE

CIKM

EDBT

UAI

WWW

CHI

NAACL

ACLWEB

2013

VLDB

SIGMOD

KDD

ICDE

CIKM

EDBT

UAI

WWW

CHI

NAACL

ACLWEB



Top conferences by order:

VLDB: Very Large Data Bases
SIGMOD: Management of Data
KDD: Knowledge Discovery and Data Mining
ICDE: International Conference on Data Engineering
CIKM: Conference on Information and Knowledge Management
EDBT: Extending Database Technology
AAAI/IAAA/UAI (UAI: Uncertainty in Artificial Intelligence)
WWW
CHI: Conference on Human Factors in Computing Systems
NAACL
ACLWEB

To find new faculty: check program committee websites, student pages, and best papers, and follow people on Twitter.



MongoDB, stream-processing databases, Truviso




People

 Person            Affiliation  Projects/Papers
 Michael Franklin  Berkeley     AMPLab (CrowdDB, Spark, Shark)
 Noah Smith        CMU
 Tom Mitchell      CMU          PIDGIN: ontology alignment using web text as interlingua; NELL
 Chris Manning     Stanford     NLP, Deep Learning for NLP
 Andrew McCallum   UMass        FACTORIE

NLP Tools
----------------
Stanford NLP
CMU Twitter NLP - POS tagging






[Corpora-List] New parallel corpus release: OpenSubtitles2016


OpenSubtitles2016

We just released a major update of the parallel subtitle corpus in OPUS:

2.8 million subtitle files in 60 languages with a total of over 17 billion tokens in 2.6 billion sentences and sentence fragments.
As usual in OPUS all languages are sentence-aligned creating a total of 1,689 bitexts.
The data sets are provided in standalone XML format with standoff sentence alignment, TMX and aligned plain text format (often used in training SMT models).
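
As a rough illustration of the aligned plain text format mentioned above, the sketch below assumes the common convention of two files with one sentence per line, where line N of the source file corresponds to line N of the target file. The function name and the file names in the usage note are hypothetical and do not come from the OPUS release itself.

```python
from itertools import zip_longest
from typing import Iterator, Tuple


def read_bitext(src_path: str, tgt_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned text files.

    Assumption: both files are UTF-8 and line N in one file is aligned
    with line N in the other.
    """
    with open(src_path, encoding="utf-8") as src, \
            open(tgt_path, encoding="utf-8") as tgt:
        for src_line, tgt_line in zip_longest(src, tgt):
            # zip_longest pads the shorter file with None, which signals
            # that the two files are not aligned line for line.
            if src_line is None or tgt_line is None:
                raise ValueError("files have different numbers of lines")
            s, t = src_line.strip(), tgt_line.strip()
            # Skip pairs where either side is empty.
            if s and t:
                yield s, t
```

With a hypothetical English-French pair of files, `for en, fr in read_bitext("subs.en", "subs.fr"): ...` would iterate over the sentence pairs without loading either file fully into memory.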

More information is available in:
Pierre Lison and Jörg Tiedemann, 2016, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)


In addition, we also provide intra-lingual alignments between alternative subtitles in the same language:

More information about those alignments and how they are sorted into various categories can be found in:
Jörg Tiedemann, 2016, Finding Alternative Translations in a Large Corpus of Movie Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)

Note that all data sets are automatically created using various pre-processing and alignment tools.
There will be problems at various levels. Feedback is very welcome!


Other new data sets in OPUS:

News Commentary version 11 (originally provided by CASMACAT):
Unlike the original source, this release is truly multilingual, with alignments across all languages.

Global Voices (also provided by CASMACAT):
Again, this version is multilingual.

Wikipedia:
A corpus of parallel sentences extracted from Wikipedia by Krzysztof Wołk and Krzysztof Marasek. More information: Krzysztof Wołk and Krzysztof Marasek, Building Subject-aligned Comparable Corpora and Mining it for Truly Parallel Sentence Pairs. Procedia Technology, 18, Elsevier, pp. 126-132, 2014


For more information on OPUS:
Select the language pair you are interested in to see all resources that are available for that particular language pair.
Data formats are explained here: http://opus.lingfil.uu.se/trac

Enjoy!






Morteza Shahriari Nia,
Mar 31, 2013, 9:16 PM