Text Processing and Corpora

Thursdays, 16:45-18:15, Chemicum 364. Room change: the new room is 214.

This course will build on what you have learnt in your NLP (Python) and Statistics (R) courses.

You will learn how to use the analytical corpus tools:

concordances, collocations, colligations

You will learn the technical NLP tools:

GREP (regular expressions), corpus compilation (BootCaT), automatic annotation (TreeTagger)

You will learn the technical statistical tools:

correspondence analysis (pattern identification), cluster analysis (classification), regression analysis (machine learning)
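
As a small taste of the first two kinds of tool (regular expressions and concordances), here is a minimal Python sketch of a keyword-in-context (KWIC) concordancer. The function name, the sample sentence and the context width are all made up for illustration; it is not one of the course tools.

```python
import re

def concordance(text, pattern, width=20):
    """Return one keyword-in-context line per regex match in text."""
    lines = []
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        # Take up to `width` characters of context on each side of the match.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines

# Invented sample text, not course data.
sample = "Maybe she will come. Perhaps she is going to stay. She said maybe."
hits = concordance(sample, r"\b(maybe|perhaps)\b")
for line in hits:
    print(line)
```

The \b anchors in the pattern restrict matches to whole words, which is usually what you want when searching a corpus for a lexeme.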

##########

Assessment

Assessment will be based on two reports, one on the use of the corpus tools, the other on the use of the statistical tools.

Here is a link to the assessment description

Due Date: to be determined

Important: send reports to studentwork.glynnp8@gmail.com

#############

Data for Statistics Section

Semantics Data from 1st semester

Discourse Data from 1st semester

#############

Part 1 - Introduction to Categorical Stats
This class will look at what categorical statistics is, what assumptions it makes, why it is essential and what it can do to help us.

The class will be entirely discussion-based and may (hopefully) overlap a little with what you have done in your 1st semester R course.

slides 1

and if we have time

slides 2

#############

Part 2 - Basics

a. Chi-Squared, statistical independence and statistical significance
This class will make sure that everyone is confident with the basics in R: loading data, examining the data in R and then running the most basic categorical test, the Chi2 test. We will also make sure that everyone understands how the test works, since the principles involved form the basis of most categorical statistics.
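
To make the principle concrete, here is a minimal sketch of the Chi2 calculation written out by hand. In class we use R; this is plain Python for illustration only, and the counts are invented, not course data. The idea is to compare each observed count with the count expected if the two variables were independent.

```python
def chi_squared(table):
    """Return (statistic, degrees_of_freedom, expected_counts) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    # Under independence: expected = row total * column total / grand total.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

    # Sum (observed - expected)^2 / expected over every cell.
    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof, expected

# Invented 2x2 table of counts (rows and columns are two categorical variables).
observed = [[30, 10],
            [20, 40]]
statistic, dof, expected = chi_squared(observed)
print(round(statistic, 2), dof)
```

Note that this is the uncorrected statistic. R's chisq.test applies a continuity correction to 2x2 tables by default, so it will report a somewhat smaller value for the same table.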

b. Correspondence analysis and identifying structure in complex data
This class will introduce an exploratory method for complex categorical data: multiple correspondence analysis. The method produces complex plots, and the aim of the class is to build confidence in interpreting them.

#############

Part 3 - Correspondence Analysis Part 1

Today we work on the basics: getting a data frame into R and running a Chi2 test.

data

R-commands-Chi2

R Commands - Correspondence Analysis

#############

Part 4 - Correspondence Analysis Part 2

Today we will summarize bivariate analysis, and introduce multivariate analysis.

Please pre-install the packages:

ca
FactoMineR
explor

We will use these data from your first semester - the representation of women in magazines

We will use the R Commands for correspondence analysis

#################################

Part 5 - Cluster Analysis (16 April)

We will look at classification of data

1. Looking again at results of MCA - trying to sort the output of correspondence analysis

2. Looking at Hierarchical Cluster Analysis - exploring how best to sort data
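
To show what "agglomerative" means in practice, here is a minimal Python sketch of bottom-up clustering with complete linkage. It is an illustration under assumptions, not the pvclust procedure we use in R (in particular, it does no bootstrap resampling), and the five points are invented.

```python
import math
from itertools import combinations

def agglomerate(points, linkage=max):
    """Merge the two closest clusters repeatedly; return the merge history.

    linkage=max gives complete linkage, linkage=min gives single linkage.
    Each history entry is (members_a, members_b, distance).
    """
    clusters = [(i,) for i in range(len(points))]  # every point starts alone
    merges = []
    while len(clusters) > 1:
        def cluster_distance(a, b):
            return linkage(math.dist(points[i], points[j]) for i in a for j in b)

        # Find the pair of clusters with the smallest linkage distance.
        a, b = min(combinations(clusters, 2), key=lambda pair: cluster_distance(*pair))
        merges.append((a, b, cluster_distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Invented 2D points: two tight pairs and one outlier.
points = [(0, 0), (0, 1), (5, 5), (5, 6.5), (12, 0)]
merges = agglomerate(points)
for a, b, distance in merges:
    print(a, "+", b, "at distance", round(distance, 2))
```

The merge history is exactly the information a dendrogram draws: which groups were joined, and at what height.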

Please pre-install the packages:

pvclust

We will use the R Commands for cluster analysis

We will use these data from your first semester - the representation of women in magazines

Other data for learning Cluster Analysis: Data for play in cluster analysis

#####################################

Part 6 - K-Means Cluster and Loglinear Analysis (23 April)

Please pre-install the packages:

cluster

vcd

Looking at K-medoid / K-Means Cluster Analysis - testing how best to sort data
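
For orientation, here is a minimal Python sketch of the K-Means idea: assign each point to its nearest centroid, recompute the centroids, and repeat until nothing changes. It is an illustration only, not the R commands used in class. The points are invented, and the starting centroids are simply the first k points so that the result is reproducible (real implementations usually start from random points).

```python
import math

def k_means(points, k, iterations=100):
    """Return (labels, centroids) after running plain K-Means."""
    centroids = [points[i] for i in range(k)]  # assumption: deterministic start
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for each point.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]

        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(xs) / len(members) for xs in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid

        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return labels, centroids

# Invented 2D points forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.0), (8.0, 9.0)]
labels, centroids = k_means(points, k=2)
print(labels)
```

K-medoid clustering follows the same loop but requires each cluster centre to be an actual data point rather than a mean.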

We will use the R Commands for cluster analysis

Loglinear analysis - basically this is just a Chi2 test extended to tables with more than two variables

We will use these R commands for loglinear analysis

#############

Part 7 - Logistic Regression Part 1 (14 May)

We will start Logistic Regression today
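
As a warm-up, here is a minimal Python sketch of the arithmetic behind the simplest possible logistic regression: one binary outcome and one binary predictor. In that special case the model's intercept is the log odds of the outcome at the reference level, and the slope is the log odds ratio between the two levels. The counts and level names are invented for illustration; in class we fit models in R with the rms package.

```python
import math

def inverse_logit(x):
    """Convert log odds back to a probability."""
    return 1 / (1 + math.exp(-x))

# Invented counts: (outcome yes, outcome no) for each level of the predictor.
counts = {"reference": (30, 10), "other": (20, 40)}

yes_ref, no_ref = counts["reference"]
yes_other, no_other = counts["other"]

# Intercept: log odds of "yes" at the reference level.
intercept = math.log(yes_ref / no_ref)

# Slope: log of the odds ratio (odds at "other" divided by odds at "reference").
slope = math.log((yes_other / no_other) / (yes_ref / no_ref))

# Predicted probabilities recover the observed proportions in each group.
p_reference = inverse_logit(intercept)
p_other = inverse_logit(intercept + slope)
print(round(intercept, 3), round(slope, 3), round(p_reference, 3), round(p_other, 3))
```

A negative slope here means that the odds of "yes" are lower at the "other" level than at the reference level. With several predictors the coefficients can no longer be read straight off a table, which is why we need the fitting routines in R.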

Please pre-install the package:

rms

These are the R commands for LogReg

We will use both sets of data

data - your lexical semantics results from semester 1

data - your discourse analysis results from semester 1

data - future constructions in English

#############

Part 8 - Logistic Regression 2 (21 May)

Today we will work on Logistic Regression and wrap up the statistics part of the course.

We will discuss and agree upon the assessment for this part of the course.

#############

Part 9 - Logistic Regression 2 (11 June)

Aims:

Wrap up the stats part of the course with revision for assessment task.

Discuss and agree upon assessment task, Introduce LaTeX

Start collocations and colligations in SketchEngine

https://www.sketchengine.eu/

#############

Part 10

Aims: look at how collocations and colligations can help answer research questions
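
To see what a collocation score actually measures, here is a minimal Python sketch of two common association measures, the MI score and logDice, computed from raw frequencies. The frequencies are invented, and the function names are my own; this is an illustration of the formulas, not SketchEngine's implementation.

```python
import math

def mi_score(f_xy, f_x, f_y, corpus_size):
    """Mutual information: log2 of observed over expected co-occurrence frequency."""
    return math.log2(f_xy * corpus_size / (f_x * f_y))

def log_dice(f_xy, f_x, f_y):
    """logDice: 14 + log2 of the Dice coefficient; does not depend on corpus size."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))

# Invented frequencies: node word, collocate, and their co-occurrence.
f_node = 1000          # occurrences of the node word
f_collocate = 200      # occurrences of the collocate
f_together = 50        # times they occur together
corpus_size = 1_000_000

mi = mi_score(f_together, f_node, f_collocate, corpus_size)
dice = log_dice(f_together, f_node, f_collocate)
print(round(mi, 2), round(dice, 2))
```

The MI score tends to favour rare words, since a small expected frequency inflates the ratio, whereas logDice is bounded above by 14 and can be compared across corpora of different sizes.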

LaTeX - https://www.overleaf.com/project

Create an account and use the "Association for Computational Linguistics (ACL) conference" template

#############

Part 11

Revision and Preparation for Reports 1

Chi2

MCA

HCA

#############

Part 12

Finish HCA

Modelling Grammar through Meaning - Bresnan et al 2005

Syntax- Future constructions in English - semantic features

Syntax- Future constructions in English - formal features

Revision and Preparation for Reports 2

LaTeX (Association for Computational Linguistics (ACL) conference)

Regression - commands

Collocation / Colligation - SketchEngine

REPORTS

Instructions

Stats
Corpora

Data for Stats Report

a. Maybe vs Perhaps

b. tough vs hard

c. nervous vs anxious vs stressed

d. Appraisal of women

e. HAPPY in Czech, English and Polish

Commands for Stats Report

R-commands-Chi2

R Commands - Correspondence Analysis

R-Commands - Cluster Analysis

R-Commands - Logistic Regression

Data for Collocation Report

https://www.sketchengine.eu/

LaTeX - for Reports

Create an account:

https://www.overleaf.com/project

Use the "Association for Computational Linguistics (ACL) conference" template

##########

Data for learning

data - your lexical semantics results from semester 1

data - your discourse analysis results from semester 1

data - last year's lexical semantics results

data - discourse analysis from UAM previous year

data - discourse analysis from P8 previous year

######################

OLD

STRUCTURE OF CLASSES BEFORE ROOM MIX UP

##########

Class 3 -

a. Correspondence analysis and exploring complex data B.

This class will examine in more detail how to interpret and judge the reliability of the correspondence plots. It will also introduce clustering tools for aiding in the interpretation of the plots.

b. Agglomerative Cluster analysis and sorting complex data

This class will examine agglomerative methods for "sorting" data discretely, relative to a range of variables.

R Commands - Cluster Analysis

Class 4 -

a. K-Means Cluster analysis and confirming categories in complex data

This class will examine "top-down" methods for "sorting" data discretely, relative to a range of variables.

data - Some more lexical semantic data to play with (happy in Czech, English and Polish)

b. Loglinear analysis and confirming complex associations

R commands - Loglinear Analysis

Class 5 - Logistic Regression and machine learning

R commands - Logistic regression 1 - NEW VERSION

New Data - English Future CXs