Evaluation

Metrics

Two different evaluation metrics will be defined according to the task setting:

Baseline

The baseline for both tasks will be computed using a one-hot vector representation:

To decide whether the target sentence t is coherent with the paragraph P, we will first compute the median value across the whole training dataset and then use it as a threshold: every instance with a value above the median will be considered coherent, and every other instance incoherent.
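The median-threshold rule above can be sketched as follows. This is only an illustrative sketch, not the official baseline code: it assumes each instance has already been reduced to a single numeric baseline value, the function names `fit_threshold` and `classify` are hypothetical, and the training values are made up.

```python
from statistics import median


def fit_threshold(train_values):
    """Return the median of the baseline values over the training set."""
    return median(train_values)


def classify(value, threshold):
    """Return 1 (coherent) if the value is strictly above the threshold, else 0."""
    return 1 if value > threshold else 0


# Made-up training values, used only to illustrate the rule.
train_values = [0.10, 0.25, 0.40, 0.55, 0.80]
threshold = fit_threshold(train_values)
labels = [classify(v, threshold) for v in train_values]
```

Note that a value exactly equal to the median falls on the incoherent side here, since the text only counts values above the median as coherent.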

The proximity between each pair of adjacent vectors ⟨v_x, v_{x+1}⟩ ∈ V will then be computed through a distance metric Dist(s_1, s_2) (e.g. Jaccard), resulting in (n − 1) distance scores that capture the degree of overlap between each pair of neighbouring sentences. To compute the coherence score of the paragraph P, score(P), we will average the scores of all pairs of adjacent sentences. The resulting value will then be compared with the human ratings by means of a correlation index:

where corr indicates the Pearson or Spearman correlation index.
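The paragraph-level baseline described above might look roughly like the sketch below. It is not the code from the DisCoTex repository, and it rests on several assumptions: sentences are tokenized by lowercasing and splitting on whitespace, a binary one-hot sentence vector is treated as the set of tokens it contains, Jaccard is used as the distance, and Pearson is computed by hand. All function names are hypothetical.

```python
import math


def one_hot_set(sentence):
    """Represent a sentence by the set of its tokens.

    Assumption: a binary vector over the vocabulary is equivalent to the set
    of tokens that occur in the sentence (whitespace tokenization).
    """
    return set(sentence.lower().split())


def jaccard_distance(a, b):
    """Jaccard distance: 1 - |a ∩ b| / |a ∪ b| (0.0 if both sets are empty)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def paragraph_score(sentences):
    """Average the (n - 1) distances between adjacent sentences."""
    vectors = [one_hot_set(s) for s in sentences]
    if len(vectors) < 2:
        raise ValueError("a paragraph needs at least two sentences")
    dists = [jaccard_distance(vectors[i], vectors[i + 1])
             for i in range(len(vectors) - 1)]
    return sum(dists) / len(dists)


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Toy paragraph, used only for illustration.
example = ["the cat sat", "the cat slept", "dogs bark"]
example_score = paragraph_score(example)
```

In use, `paragraph_score` would be applied to every paragraph in the dataset, and the resulting list of scores passed to `pearson` together with the corresponding human ratings. A Spearman variant would rank both lists before applying the same formula.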

The code to run the baseline has been published on DisCoTex's GitHub repository.

Upload submission

To upload your submission, point your browser to the URL https://forms.gle/dsWGuLEJdGPykfvx7