Dataset & Evaluation Plan

Data:

Task 1 (Catchphrase Extraction):

The data will contain:

(i) 100 case documents and their corresponding gold standard catchphrases, for training.

(ii) Another 300 documents as test data. For each document in the test data, the participants have to find the catchphrases.

Task 2 (Precedence Retrieval):

Two sets of documents will be provided:

Query_docs -- Current cases, formed by removing the links to the prior cases
Object_docs -- The prior cases which have been cited by the cases in Query_docs (links to which are removed from the Query_docs) along with some random cases (not among Query_docs)

There will be 200 documents in Query_docs and more than 2000 documents in Object_docs. About half of the documents in Object_docs are cited by at least one of the cases in Query_docs. The other half of the cases in Object_docs were randomly chosen to deliberately make the task challenging. For each of the 200 documents in Query_docs, the task will be to rank the list of 2000+ documents in Object_docs (or a subset), so that the actually cited prior cases are ranked higher than the other documents.

Evaluation plan:

For Task 1, a set of catchphrases is expected as the result for each document in the test data. We plan to use set-based IR measures such as Precision, Recall, and F-Score to check how well the set of extracted catchphrases matches the set of gold standard catchphrases (obtained from the Manupatra legal system).
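As a rough illustration of set-based scoring for one document, the sketch below computes Precision, Recall, and F-Score (here the balanced F1) from a predicted and a gold set of catchphrases. The function name `set_prf` and the lowercase/whitespace normalization are assumptions made for this example; the official evaluation script may match phrases differently.

```python
def set_prf(predicted, gold):
    """Return (precision, recall, f1) for one document.

    Assumption: phrases are compared after lowercasing and trimming
    whitespace. The organizers' matching rule may differ.
    """
    pred_set = {p.strip().lower() for p in predicted}
    gold_set = {g.strip().lower() for g in gold}

    # Catchphrases that appear in both the prediction and the gold standard.
    true_positives = len(pred_set & gold_set)

    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical catchphrases, for illustration only.
p, r, f = set_prf(
    ["Writ Petition", "bail", "natural justice", "tenancy"],
    ["writ petition", "bail", "appeal"],
)
```

In this made-up example two of the four predicted phrases are correct and two of the three gold phrases are found, so precision is 0.5 and recall is 2/3. Per-document scores would then typically be averaged over the test documents.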

For Task 2, a ranked list of documents is expected as the result for each document in Query_docs. Measures such as Precision, Recall, MAP, DCG, and Mean Reciprocal Rank will be used to check how well the documents that were actually cited are ranked in the retrieved list.
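To make two of the rank-based measures concrete, the following sketch computes MAP and Mean Reciprocal Rank over a set of queries. The function names, the dictionary layout of `runs` and `qrels`, and the document identifiers are all assumptions for illustration, and each ranked list is assumed to contain no duplicate documents; the official evaluation may use a standard tool with different conventions.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list against the cited (relevant) docs."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    relevant = set(relevant)
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_over_queries(metric, runs, qrels):
    """Average a per-query metric over every query listed in qrels."""
    scores = [metric(runs.get(q, []), rel) for q, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical query and document identifiers, for illustration only.
runs = {"q1": ["d3", "d1", "d7", "d2"], "q2": ["d5", "d9"]}
qrels = {"q1": {"d1", "d2"}, "q2": {"d5"}}

map_score = mean_over_queries(average_precision, runs, qrels)
mrr_score = mean_over_queries(reciprocal_rank, runs, qrels)
```

For query `q1` the cited documents appear at ranks 2 and 4, giving an average precision of 0.5 and a reciprocal rank of 0.5; `q2` scores 1.0 on both, so MAP and MRR are each 0.75 in this toy example.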

NOTE:

A participant team may participate in either or both of the sub-tasks.
Each team can have at most 4 participants.

Download the dataset here

Please click the following link to download the dataset. Instructions for use are provided in the included README file.

[LINK TO THE DATASET]

We are happy to share the dataset. Please cite the overview paper when using this dataset in your research work.

The dataset is also used by the following papers:

Automatic Catchphrase Identification from Legal Court Case Documents [Mandal et al., 2017] (https://dl.acm.org/citation.cfm?id=3133102)
