ELL884 (2024) - Assignments

Assignments

Assignment 1 - Part-Of-Speech Tagging

Link: https://www.kaggle.com/t/b04f415a66074f68b074d317d9fa3af9

In corpus linguistics, part-of-speech tagging (POS tagging, PoS tagging, or POST), also called grammatical tagging, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context.
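To make the definition concrete, here is a minimal sketch of a most-frequent-tag baseline. It is neither an HMM nor an MEMM, and it deliberately ignores context. The toy corpus, the tag names, and the `default_tag` fallback are assumptions made for illustration and are not taken from the course dataset.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus of (word, tag) pairs; not taken from train.csv.
toy_corpus = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("run", "NN"), ("ended", "VBD")],
    [("they", "PRP"), ("run", "VBP")],
    [("dogs", "NNS"), ("run", "VBP")],
]


def train_baseline(corpus):
    """Count how often each (lower-cased) word carries each tag."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return counts


def tag_sentence(words, counts, default_tag="NN"):
    """Give each word its most frequent training tag, ignoring context."""
    tagged = []
    for word in words:
        seen = counts.get(word.lower())
        # Unseen words fall back to an assumed default tag.
        tag = seen.most_common(1)[0][0] if seen else default_tag
        tagged.append((word, tag))
    return tagged


counts = train_baseline(toy_corpus)
print(tag_sentence(["the", "run"], counts))
# prints [('the', 'DT'), ('run', 'VBP')]
```

In "the run", the word "run" is a noun, but the baseline picks the verb tag because that tag is more frequent in the toy corpus. Resolving this kind of ambiguity is why a tagger has to look at context as well as at the word itself.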

https://paperswithcode.com/sota/part-of-speech-tagging-on-penn-treebank

The URL above shows the current state-of-the-art performance in PoS tagging on the Penn Treebank benchmark. While most entries on the leaderboard use a deep neural network architecture, classical models like the Hidden Markov Model (HMM) and the Maximum Entropy Markov Model (MEMM) are still in use because they are easy to interpret and backed by extensive literature.

In the first assignment of this course, you will implement both HMM and MEMM taggers. Your models should be coded from scratch, without using any existing implementations.

Both taggers must be trained on the training dataset provided and will be evaluated on the test dataset provided. You are encouraged to modify these models or try different techniques to get as high a score as possible. All that is required is that you maintain integrity while applying novel techniques: DO NOT copy code from external sources. You are, however, encouraged to acknowledge and cite the sources you refer to.
The end of the competition will be followed by a short discussion with each student on how they tackled the assignment and which techniques beyond HMM and MEMM they could apply to improve their performance on the leaderboard.

BONUS TASK

By the end of the first week of the assignment, you are expected to have implemented at least one basic model for PoS tagging. At the end of that week, a bonus task will be shared with you to guide you through basic NLP research using the model you have coded.

The task, however basic and intuitive, will be a good way to introduce you to NLP research and its niches.

P.S. The task will be released only if the instructors find that the class can take on the extra load, given its experience after one week of coding PoS taggers.

DATA

train.csv: Training file with columns untagged_sentence and tagged_sentence.
test_small.csv: Test file with ID to match during submission and untagged_sentence for inference.
sample_submission.csv: Sample submission format consisting of ID and tagged_sentence columns.
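The file layout above can be exercised with a small sketch that reads a test file and writes a submission in the ID / tagged_sentence layout. The exact header spellings, the in-memory demo data, and the `predict` callable are assumptions; the upper-casing lambda is only a stand-in for a real tagger, and the serialization used inside each cell should be copied from sample_submission.csv.

```python
import csv
import io


def make_submission(test_file, out_file, predict):
    """Read rows with ID and untagged_sentence, write ID and tagged_sentence.

    `predict` is a hypothetical callable that maps the untagged_sentence
    string to a tagged_sentence string in the required format.
    """
    reader = csv.DictReader(test_file)
    writer = csv.DictWriter(out_file, fieldnames=["ID", "tagged_sentence"])
    writer.writeheader()
    for row in reader:
        writer.writerow({
            "ID": row["ID"],
            "tagged_sentence": predict(row["untagged_sentence"]),
        })


# In-memory demo with placeholder content instead of the real test_small.csv.
demo_test = io.StringIO("ID,untagged_sentence\n0,placeholder sentence\n")
demo_out = io.StringIO()
make_submission(demo_test, demo_out, predict=lambda s: s.upper())
print(demo_out.getvalue())
```

With real files, the same function can be called with handles from `open("test_small.csv", newline="")` and `open("submission.csv", "w", newline="")`; the output file name here is only an example.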

RULES

To ensure that the assignment is completed in a fair and ethical manner and that all participants are held to a high standard of conduct:

Participants are expected to maintain the integrity of the competition and not engage in any activity that could compromise the fairness or accuracy of the results.
Participants must not submit manually tagged solutions, as the instructors will also run the code to confirm the results.
Plagiarism is not allowed and will result in a deduction of marks.
Participants are expected to comply with all applicable laws, regulations, and ethical standards.
All participants are required to submit their assignments as short reports.

Assignment 3 - MultiLabel Text Classification

Deadline: 15/04/2024

Task: Multi-label classification is a supervised learning setting in which a single instance may be associated with several labels at once. For instance, a single research article may fall under both "Diseases" and "Chemicals and Drugs".

Dataset: The dataset is a collection of approximately 50,000 research articles. Each article is annotated with respect to 14 labels. The dataset can be downloaded from:

https://drive.google.com/file/d/1iqk6XbNtTMVBSw3ON-uCrJ7oifEW6n4V/view?usp=sharing

The labels are mapped to actual names as follows:

"A": "Anatomy"

"B": "Organisms"

"C": "Diseases"

"D": "Chemicals and Drugs"

"E": "Analytical, Diagnostic and Therapeutic Techniques, and Equipment"

"F": "Psychiatry and Psychology"

"G": "Phenomena and Processes"

"H": "Disciplines and Occupations"

"I": "Anthropology, Education, Sociology, and Social Phenomena"

"J": "Technology, Industry, and Agriculture"

"L": "Information Science"

"M": "Named Groups"

"N": "Health Care"

"Z": "Geographicals"

Note: Only PyTorch or TensorFlow is allowed for the assignment.

Your task is to split the dataset into train, test, and preferably validation sets, train a deep learning model on the training set to predict the labels of each text, and test the model on the test set. You are allowed to use pre-trained language models such as BERT or GPT, but you must either fine-tune the existing model or add extra layers to it and train them for this downstream task. You are not allowed to use Word2Vec, GloVe, FastText, etc. to generate embeddings and feed them into neural networks; the embeddings must be learned from the title and abstract text.

A comparison between the various settings you have tried would be good to see; however, implementing one model completely is sufficient. You are free to create additional features. Since this is multi-label classification, you are expected to experiment with the activation functions and loss functions.
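One reason the choice of activation and loss matters in the multi-label setting can be sketched in plain Python. The logits and targets below are made-up numbers for three hypothetical labels, and the 0.5 decision threshold is an assumption. This only illustrates the arithmetic; the assignment itself still has to be done in PyTorch or TensorFlow, where `torch.nn.BCEWithLogitsLoss` and `tf.keras.losses.BinaryCrossentropy(from_logits=True)` provide a sigmoid-based binary cross-entropy.

```python
import math


def sigmoid(z):
    """Independent per-label probability."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    """Probabilities that compete with each other and sum to 1."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean binary cross-entropy over the labels of one example."""
    losses = [
        -(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps))
        for p, t in zip(probs, targets)
    ]
    return sum(losses) / len(losses)


logits = [2.0, 2.0, -3.0]   # hypothetical scores for three labels
targets = [1, 1, 0]         # the first two labels are both correct

sig = [sigmoid(z) for z in logits]
soft = softmax(logits)
predicted = [int(p >= 0.5) for p in sig]  # assumed threshold of 0.5

print("sigmoid:", sig)
print("softmax:", soft)
print("predicted labels:", predicted)
print("BCE loss:", binary_cross_entropy(sig, targets))
```

With a sigmoid, both correct labels can receive a probability above 0.5 at the same time. With a softmax, the two equally strong labels have to share the probability mass, so neither reaches 0.5, which is why softmax with categorical cross-entropy is usually a poor fit when several labels can be active at once.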

Helper Modules:

Read File:

import pandas as pd

dataset_name = 'Multi Label Text Classification Dataset.csv'

df = pd.read_csv(dataset_name)

Encode Labels as One-Hot Vectors:

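# Assumes df_train is the training split of df and mesh_Heading_categories is the list of label column names.
# Each resulting vector may contain several 1s, one per label that applies to the article.
df_train['one_hot_labels'] = list(df_train[mesh_Heading_categories].values)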
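The encoding line above presupposes that a training split already exists. A minimal sketch of producing shuffled train/validation/test row positions follows; the 80/10/10 ratios, the seed, and the function name `split_indices` are assumptions, since the assignment does not prescribe them.

```python
import random


def split_indices(n_rows, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle row positions and cut them into train/validation/test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    n_train = int(n_rows * train_frac)
    n_val = int(n_rows * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]      # whatever remains goes to the test set
    return train, val, test


train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

With pandas, the returned positions can be used as `df_train = df.iloc[train_idx]`, and likewise for the validation and test sets. Splitting once with a fixed seed and reusing the same split across experiments keeps the comparisons between settings fair.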

Evaluation:

To avoid confusion, you will be evaluated on the performance of one of your models. The following metrics need to be computed for evaluation:

For each class:
- Precision
- Recall
- F1-Score
Aggregate metrics:
- Micro Average F1 Score
- Macro Average F1 Score
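The metrics listed above can be sketched in plain Python as follows. The two-label toy data is made up, and returning 0.0 when a denominator is zero is an assumed convention. Library routines such as scikit-learn's `precision_recall_fscore_support` can be used to cross-check the numbers.

```python
def per_class_counts(y_true, y_pred, n_classes):
    """Count true positives, false positives and false negatives per class."""
    tp = [0] * n_classes
    fp = [0] * n_classes
    fn = [0] * n_classes
    for t_row, p_row in zip(y_true, y_pred):
        for c in range(n_classes):
            if p_row[c] and t_row[c]:
                tp[c] += 1
            elif p_row[c]:
                fp[c] += 1
            elif t_row[c]:
                fn[c] += 1
    return tp, fp, fn


def prf(tp, fp, fn):
    """Precision, recall and F1 from counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def evaluate(y_true, y_pred, n_classes):
    tp, fp, fn = per_class_counts(y_true, y_pred, n_classes)
    per_class = [prf(tp[c], fp[c], fn[c]) for c in range(n_classes)]
    # Macro: average the per-class F1 scores, so every class weighs the same.
    macro_f1 = sum(f1 for _, _, f1 in per_class) / n_classes
    # Micro: pool the counts over all classes, then compute a single F1.
    _, _, micro_f1 = prf(sum(tp), sum(fp), sum(fn))
    return per_class, macro_f1, micro_f1


# Made-up multi-hot targets and predictions for four articles, two labels.
y_true = [[1, 0], [1, 1], [0, 1], [1, 0]]
y_pred = [[1, 0], [1, 0], [0, 1], [0, 1]]
per_class, macro_f1, micro_f1 = evaluate(y_true, y_pred, n_classes=2)
print(per_class, macro_f1, micro_f1)
```

Macro F1 treats rare and frequent labels equally, while micro F1 is dominated by the frequent ones, so reporting both shows how a model behaves on under-represented labels.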

Submission: Please submit the assignment by joining the following Google Classroom.
