The first assignment is about N-gram language modeling, smoothing, and simple binary text classification. Sentiment analysis serves as the application scenario for the text classification task, using the IMDB movie review corpus released by Maas et al. (2011).
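To give you a flavor of what is involved, here is a minimal sketch of a bigram language model with add-one (Laplace) smoothing. The toy corpus, and the particular choice of smoothing, are illustrative only; the notebook's actual data pipeline and required smoothing method may differ.

    from collections import Counter

    # Toy corpus standing in for tokenized IMDB reviews (illustrative only).
    corpus = [["the", "movie", "was", "great"], ["the", "plot", "was", "thin"]]

    unigram_counts = Counter(w for sent in corpus for w in sent)
    bigram_counts = Counter(
        (w1, w2) for sent in corpus for w1, w2 in zip(sent, sent[1:])
    )
    V = len(unigram_counts)  # vocabulary size

    def bigram_prob(w1, w2):
        # Add-one smoothing: add 1 to every bigram count and V to the
        # denominator, so unseen bigrams still receive nonzero probability.
        return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + V)

    print(bigram_prob("the", "movie"))   # seen bigram
    print(bigram_prob("movie", "plot"))  # unseen bigram, probability still > 0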
The assignment is released entirely as a notebook on Google Colab. Students can access it through their Stony Brook University single sign-on (NetID credentials):
Please go through the entire assignment before you jump into coding! If you start coding right away without understanding the overall requirements, you may find yourself having to undo or modify a lot of your own work.
This assignment is due on Brightspace by 11:59 pm, Feb 29 (Thursday).
References
A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. 2011. Learning Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA. ACL.
This assignment introduces an important task in NLP research called named entity recognition (NER), which is concerned with identifying mentions of real-world entities and categorizing them as person, location, organization, etc. Your task involves two main stages: first, you will identify person mentions with a binary logistic regression classifier; second, you will identify multiple types of entities with a multinomial logistic regression classifier. The assignment employs the dataset and ideas from two well-known shared tasks introduced by Tjong Kim Sang (2002) and Tjong Kim Sang and De Meulder (2003).
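For a sense of how the first stage is framed, here is a hypothetical sketch of a binary "person mention" token classifier. It uses scikit-learn and made-up features for brevity; the assignment's feature set and CoNLL data format differ, and you may well be asked to implement the classifier yourself rather than call a library.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    # Toy tokens and labels (1 = part of a person mention, 0 = other).
    tokens = ["John", "lives", "in", "Paris", "Smith"]
    labels = [1, 0, 0, 0, 1]

    def features(tok):
        # Hand-crafted token features, purely illustrative.
        return {"lower": tok.lower(),
                "is_capitalized": tok[0].isupper(),
                "suffix2": tok[-2:]}

    vec = DictVectorizer()
    X = vec.fit_transform(features(t) for t in tokens)

    clf = LogisticRegression().fit(X, labels)
    print(clf.predict(vec.transform([features("Mary")])))

The second stage generalizes this shape: with multiple entity labels instead of two, the same classifier becomes multinomial.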
The assignment is released entirely as a notebook on Google Colab. Students can access it through their Stony Brook University single sign-on (NetID credentials):
As with the first assignment, please go through the entire assignment first, and try to understand the overall scope of this work before you start programming.
This assignment is due on Brightspace by 11:59 pm, March 26 (Tuesday).
References
Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.
Erik F. Tjong Kim Sang. 2002. Introduction to the CoNLL-2002 Shared Task: Language-Independent Named Entity Recognition. In COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002).
Erik F. Tjong Kim Sang and Sabine Buchholz. 2000. Introduction to the CoNLL-2000 Shared Task: Chunking. In Fourth Conference on Computational Natural Language Learning and the Second Learning Language in Logic Workshop.
This assignment explores static word embeddings through Word2Vec, using two distinct loss functions. It also introduces extrinsic evaluation of models: instead of evaluating the model by dividing your data into training and test sets, you evaluate it in terms of how well it performs on an external task. Here, that task is solving word analogy questions. The assignment is based on the two famous research papers that led to Word2Vec (Mikolov et al., 2013a; 2013b). In addition, I have referred to a third paper that explains how the model parameters are learned (Rong, 2016). While that paper is recommended reading, I have condensed its material into a document in the style of lecture notes.
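As a rough illustration of the two objectives (assumed here to be the full softmax and negative sampling losses of Mikolov et al., 2013a), the NumPy sketch below computes both for a single (center, context) pair; the dimensions, initialization, and sampling scheme are placeholders, not the notebook's actual setup.

    import numpy as np

    rng = np.random.default_rng(0)
    V, d = 1000, 50                              # vocab size, embedding dim
    W_in = rng.normal(scale=0.1, size=(V, d))    # center-word vectors
    W_out = rng.normal(scale=0.1, size=(V, d))   # context-word vectors

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def softmax_loss(center, context):
        # Full softmax: normalizes over the entire vocabulary (expensive).
        scores = W_out @ W_in[center]
        log_prob = scores[context] - np.log(np.exp(scores).sum())
        return -log_prob

    def neg_sampling_loss(center, context, k=5):
        # Negative sampling: one positive pair plus k random negatives,
        # each scored with a sigmoid instead of a vocabulary-wide softmax.
        v_c = W_in[center]
        loss = -np.log(sigmoid(W_out[context] @ v_c))
        negatives = rng.integers(0, V, size=k)
        loss -= np.log(sigmoid(-(W_out[negatives] @ v_c))).sum()
        return loss

    print(softmax_loss(3, 17), neg_sampling_loss(3, 17))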
The assignment is released as a notebook on Google Colab. Students can access it through their Stony Brook University single sign-on (NetID credentials):
The additional data for the study of word analogies can be downloaded from here: word2vec-word-analogies.txt
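Analogy questions take the form "a is to b as c is to d", and are typically answered with vector arithmetic: find the vocabulary word closest to b − a + c in cosine similarity (often called the 3CosAdd method; Mikolov et al., 2013b). In the sketch below, emb and vocab are placeholders for your trained embedding matrix and word list.

    import numpy as np

    def solve_analogy(a, b, c, emb, vocab):
        # Unit-normalize rows so dot products equal cosine similarities.
        E = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        idx = {w: i for i, w in enumerate(vocab)}
        query = E[idx[b]] - E[idx[a]] + E[idx[c]]
        sims = E @ query
        for w in (a, b, c):
            sims[idx[w]] = -np.inf   # exclude the query words themselves
        return vocab[int(np.argmax(sims))]

    # Usage, given trained inputs: solve_analogy("man", "king", "woman", emb, vocab)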
This assignment is due on Brightspace by 11:59 pm, April 12 (Friday).
References
T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. 2013a. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26.
T. Mikolov, K. Chen, G. Corrado, and J. Dean. 2013b. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781.
X. Rong. 2016. word2vec Parameter Learning Explained. arXiv preprint arXiv:1411.2738.
This assignment moves from word embeddings to sentence embeddings by introducing a semantic textual similarity (STS) task. Unlike the previous assignments, there is no additional reading material this time; any conceptual advances are described directly in the notebook (linked below). It also asks you to write your own code instead of only filling in TODO blocks. The implementation, however, relies heavily on the kind of code given to you in previous assignments, so you should leverage that prior work here (e.g., reuse the training loop code from the previous assignment).
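To make the starting point concrete, one simple baseline (not the notebook's actual model) embeds a sentence by averaging its word vectors and scores a sentence pair with cosine similarity; emb and idx below are placeholders for your own embedding matrix and vocabulary lookup.

    import numpy as np

    def sentence_embedding(tokens, emb, idx):
        # Average the vectors of in-vocabulary tokens (OOV tokens are skipped).
        vecs = [emb[idx[t]] for t in tokens if t in idx]
        return np.mean(vecs, axis=0)

    def sts_score(sent1, sent2, emb, idx):
        # Cosine similarity between the two mean-pooled sentence vectors.
        u = sentence_embedding(sent1, emb, idx)
        v = sentence_embedding(sent2, emb, idx)
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))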
The assignment is released as a notebook on Google Colab. Students can access it through their Stony Brook University single sign-on (NetID credentials):
This assignment is due on Brightspace by 11:59 pm, April 30 (Tuesday), extended from the original April 28 (Sunday) deadline.
This last assignment is a concept-based walk-through of developing a realistic NLP system. The entire assignment is provided in this PDF file, and the submission instructions are included in it.
This assignment is due on Brightspace by 11:59 pm, May 6 (Monday).
The due date falls on the last day before final exams begin, so it cannot be extended.
Quiz submissions must be made on Brightspace by 11:59 pm each Friday (the submission portal closes exactly at that time).
Each quiz submission must be a single PDF file. Please do NOT submit a photograph or a scan of a handwritten document! Either use LaTeX, or use MS Word and export to PDF.
Access to the quiz documents requires a Stony Brook University NetID.