One of the recent developments with the greatest impact on natural language processing applications is the way words are represented. Learning these representations not only avoids laborious, application-dependent feature engineering, but also allows for a smooth integration with the deep learning frameworks that have been revolutionizing the field since 2014.
In this tutorial we will study these word representations in depth. A particular emphasis will be placed on "understanding by doing": the attendants will implement the algorithms presented, mostly from scratch and without the use of high-level toolkits, in order to gain a better understanding of how these algorithms work.
Session 1: Introduction & Supervised Learning
A basic introduction to supervised learning, covering loss functions and optimization with stochastic gradient descent (SGD). This first session will also include a very high-level introduction to natural language processing applications.
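As a taste of what will be implemented, here is a minimal sketch of SGD on a logistic-regression loss. The toy data and variable names are purely illustrative, not part of the course material:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary-classification data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # 100 examples, 5 features
y = (X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) > 0).astype(float)

w = np.zeros(5)       # model parameters
learning_rate = 0.1

# One pass of SGD: update on one example at a time.
for x_i, y_i in zip(X, y):
    p = sigmoid(x_i @ w)              # predicted probability of class 1
    grad = (p - y_i) * x_i            # gradient of the log-loss for this example
    w -= learning_rate * grad         # SGD step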
Session 2: Hands-on
The attendants will derive their own loss function and implement SGD. This will then be applied to a simplified version of Named Entity Recognition (NER). For this, please install NLTK and download its data before coming:
>>> import nltk
>>> nltk.download()  # opens the NLTK data downloader
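For orientation, the following hypothetical sketch shows how NER data can be loaded from NLTK and reduced to a per-token classification problem. It assumes the CoNLL-2002 Spanish corpus (an assumption; the template may use different data and features):

from nltk.corpus import conll2002   # requires nltk.download('conll2002')

# Illustrative simplification: label each token as entity (1) vs. non-entity (0).
train_sents = conll2002.iob_sents('esp.train')
tokens, labels = [], []
for sent in train_sents[:100]:                 # small slice, for illustration
    for word, pos, iob in sent:
        tokens.append(word)
        labels.append(0 if iob == 'O' else 1)  # 1 = token is part of a named entity
print(len(tokens), sum(labels))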
Get the template here.
A possible solution: classification and NER
Session 3: Word Embeddings
The third session will focus on word-embeddings, covering the theory behind them and the most popular implementations.
Session 4: Hands-on
Implementing skip-gram with negative sampling (SGNS) from scratch. If time allows, also positive pointwise mutual information (PPMI). A minimal sketch of the SGNS update appears after the links below.
Get the template here.
Download tokenized Spanish text here.
The derivation is here.
A possible solution here.
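For reference, a minimal sketch of a single skip-gram negative-sampling update is shown below. Vocabulary size, dimensions, and variable names are illustrative; the template and derivation linked above define the actual setup:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes; not taken from the course material.
vocab_size, dim, num_neg = 1000, 50, 5
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(vocab_size, dim))   # center-word vectors
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))  # context-word vectors
lr = 0.025

def sgns_update(center, context, negatives):
    # One SGD step on the SGNS loss for a (center, context) pair.
    v = W_in[center]
    u_pos = W_out[context]
    g_pos = sigmoid(v @ u_pos) - 1.0        # push the positive pair towards 1
    grad_v = g_pos * u_pos
    W_out[context] -= lr * g_pos * v
    for neg in negatives:                   # push sampled negatives towards 0
        u_neg = W_out[neg]
        g_neg = sigmoid(v @ u_neg)
        grad_v += g_neg * u_neg
        W_out[neg] -= lr * g_neg * v
    W_in[center] -= lr * grad_v

# Example call with hypothetical word indices.
sgns_update(center=3, context=17, negatives=rng.integers(0, vocab_size, num_neg))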
Session 5: Advanced Topics
Relevant papers from recent academic conferences on multilingual embeddings, biases in word embeddings, fine-tuning, etc.