Professor of Natural Language Processing, University of Cambridge
Delighted to announce our paper 'On Reality and the Limits of Language Data' in collaboration with @EhsanShareghi and @hardy_qr at https://arxiv.org/abs/2208.11981. We've spent the last 9 months reading and thinking about the limitations of pre-trained language models like GPT-3 and what they understand about the complex physical world we live in. Establishing strong human norms for a wide set of physical common sense relations and comparing models has helped us narrow down our focus for future algorithmic development. We identify concrete categories of common sense relations on which learning from language data alone, regardless of the size of the model, falls short.
I'm happy to announce our papers accepted for ACL 2022. Thank you so much to all the co-authors.
It's a great honour to receive the award for best paper from EMNLP 2021. All credit to co-authors: Fangyu Liu, Emanuele Bugliarello, Edoardo Maria Ponti, Siva Reddy and Desmond Elliott.
Zaiqiao Meng, Fangyu Liu, Thomas Clark, Ehsan Shareghi and Nigel Collier. Mixture-of-Partitions: Infusing Large Biomedical Knowledge Graphs into BERT. To appear in EMNLP 2021.
Yixuan Su, David Vandyke, Sihui Wang, Yimai Fang and Nigel Collier. Plan-then-Generate: Controlled Data-to-Text Generation via Planning. To appear in Findings of EMNLP 2021.
Fangyu Liu, Emanuele Bugliarello, Edoardo Maria Ponti, Siva Reddy, Nigel Collier and Desmond Elliott. Visually Grounded Reasoning across Languages and Cultures. To appear in EMNLP 2021.
Yixuan Su, Zaiqiao Meng, Simon Baker and Nigel Collier. Few-Shot Table-to-Text Generation with Prototype Memory. To appear in Findings of EMNLP 2021.
Fangyu Liu, Ivan Vulić, Anna Korhonen and Nigel Collier. Fast, Effective, and Self-Supervised: Transforming Masked Language Models into Universal Lexical and Sentence Encoders. To appear in EMNLP 2021.
Qianchu Liu, Fangyu Liu, Nigel Collier, Anna Korhonen and Ivan Vulić. On Eliciting Word-in-Context Representations from Pretrained Language Models. To appear in CoNLL 2021.
Victor Prokhorov, Yingzhen Li, Ehsan Shareghi and Nigel Collier. Learning Sparse Sentence Encoding without Supervision: An Exploration of Sparsity in Variational Autoencoders. To appear in RepL4NLP 2021.
Fangyu Liu, Ivan Vulić, Anna Korhonen and Nigel Collier. Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking. To appear in ACL 2021.
Yixuan Su, Deng Cai, Qingyu Zhou, Zibo Lin, Simon Baker, Yunbo Cao, Shuming Shi, Nigel Collier, and Yan Wang. Dialogue Response Selection with Hierarchical Curriculum Learning. To appear in ACL 2021.
Yixuan Su, David Vandyke, Simon Baker, Yan Wang and Nigel Collier. Keep the Primary, Rewrite the Secondary: A Two-Stage Approach for Paraphrase Generation. To appear in Findings of ACL 2021.
Fangyu Liu, Ehsan Shareghi, Zaiqiao Meng, Marco Basaldella and Nigel Collier. Self-alignment Pretraining for Biomedical Entity Representations. To appear in NAACL-HLT 2021.
Fangyu Liu, Muhao Chen, Dan Roth and Nigel Collier. Visual Pivoting for (Unsupervised) Entity Alignment. To appear in AAAI 2021.
I have been working in NLP and AI for over 25 years. Before joining the University of Cambridge on an EPSRC Experienced Researcher Fellowship (2015-2020) I spent the early part of my career in Japan (1996-2012). I was a Toshiba Fellow, a postdoc at Tokyo University with Junichi Tsujii and Associate Professor at the newly formed National Institute of Informatics, where I led the NLP lab for 12 years before returning to the UK on a Marie Curie Research Fellowship. As an undergraduate I studied for a BSc in Computer Science at the University of Leeds (1992). I received an MSc in Machine Translation (1994) and a PhD in Computational Linguistics (1996) from the University of Manchester (UMIST) for my research on English-Japanese Lexical Transfer using a Hopfield Neural Network.
My work focuses on natural language processing and machine learning. My research interests are broadly in creating better models for natural language understanding (see selected publications below). I also have an interest in applications with the potential for tangible social impact, for example in the area of global health (see below for the BioCaster and EPI-AI projects) where I am a member of the WHO's Epidemic Intelligence from Open Sources initiative.
For a list of publications please see Google Scholar.
Prospective PhD students: I am always interested in supervising new NLP projects on the PhD in Computation, Cognition and Language. Before contacting me please make sure that you meet the minimum requirements and take time to check out my publications. In your email please send a CV with a brief statement of research interests. Please note the application deadline and the documents you need to submit with your application.
Enquiries for postdoctoral opportunities are always welcome. When needed I can help explore funding sources for fellowships from UK, EU and other agencies.
Over the years many people have contributed to the research and publications in my lab. Here's a list of current students, postdocs and alumni.
2020.2 to 2023.8 ESRC EPI-AI: Automated Understanding and Alerting of Disease Outbreaks from Global News Media (with Professor David Buckeridge and Dr Nick King, McGill University). The EPI-AI project aims to achieve a step change in automated global epidemic alerting using news media monitoring. Teams at McGill and Cambridge universities, in collaboration with national and international public health agencies, are adopting an interdisciplinary approach that combines natural language processing, epidemiology, biomedical informatics and bioethics to address this complex task.
2020.4 to 2022.3 Alan Turing Institute: Interpretable and Explainable Deep Learning for Natural Language Understanding and Commonsense Reasoning (with Professor Thomas Lukasiewicz, University of Oxford).
Selected publications on Language Representation
Prokhorov, V., Li, Y., Shareghi, E. and Collier, N. (2021). Learning Sparse Sentence Encoding without Supervision: An Exploration of Sparsity in Variational Autoencoders. To appear in RepL4NLP 2021.
Liu, F., Vulić, I., Korhonen, A. and Collier, N. (2021). Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking. To appear in ACL 2021.
Prokhorov, V., Li, Y., Shareghi, E., & Collier, N. (2020). Hierarchical Sparse Variational Autoencoder for Text Encoding. arXiv preprint arXiv:2009.12421. [pdf]
Basaldella, M., and Collier, N. (2019). "BioReddit: Word Embeddings for User-Generated Biomedical NLP." Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019). [pdf]
Prokhorov, V., Pilehvar, M. T., Kartsaklis, D., Lio, P., & Collier, N. (2019). Unseen Word Representation by Aligning Heterogeneous Lexical Semantic Spaces. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 33, pp. 6900-6907). [pdf]
Selected publications on Entities, Relations and Reasoning
Liu, F., Shareghi, E., Meng, Z., Basaldella, M. and Collier, N. Self-alignment Pre-training for Biomedical Entity Representations. In Proceedings of the 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2021), Mexico City, Mexico. [pdf]
Basaldella, M., Liu, F., Shareghi, E., & Collier, N. (2020, November). COMETA: A Corpus for Medical Entity Linking in the Social Media. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 3122-3137). [pdf]
Gritta, M., Pilehvar, M. T., & Collier, N. (2019). A pragmatic guide to geoparsing evaluation. Language Resources and Evaluation, 1-30. [pdf]
Le, H. Q., Can, D. C., Ha, Q. T., & Collier, N. (2019). A Richer-but-Smarter Shortest Dependency Path with Attentive Augmentation for Relation Extraction. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (Vol. 1, pp. 2902-2912). Association for Computational Linguistics. [pdf]
Kartsaklis, D., Pilehvar, M. T. and Collier, N. (2018), “Mapping Text to Knowledge Graph Entities using Multi-Sense LSTMs”, in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), Brussels, Belgium, pp. 1959-1970. [pdf]
Selected publications on Language and Image
Liu, F., Chen, M., Roth, D., & Collier, N. (2020), "Visual Pivoting for (Unsupervised) Entity Alignment", in Proceedings of the 35th AAAI International Conference on Artificial Intelligence (AAAI). Pre-print available on arXiv [pdf].
Selected publications on Fact Verification
Conforti, C., Berndt, J., Pilehvar, M. T., Giannitsarou, C., Toxvaerd, F., & Collier, N. (2020). Will-They-Won't-They: A Very Large Dataset for Stance Detection on Twitter. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020). [pdf]
Conforti, C., Berndt, J., Pilehvar, M. T., Giannitsarou, C., Toxvaerd, F., & Collier, N. (2020, November). STANDER: An Expert-Annotated Dataset for News Stance Detection and Evidence Retrieval. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings (pp. 4086-4101). [pdf]
Selected publications on Generation
Su, Y., Cai, D., Zhou, Q., Lin, Z., Baker, S., Cao, Y., Shi, S., Collier, N. and Wang, Y. Dialogue Response Selection with Hierarchical Curriculum Learning. To appear in ACL 2021.
Su, Y., Vandyke, D., Baker, S., Wang, Y. and Collier, N. Keep the Primary, Rewrite the Secondary: A Two-Stage Approach for Paraphrase Generation. To appear in Findings of ACL 2021.
Su, Y., Cai, D., Wang, Y., Vandyke, D., Baker, S., Li, P., & Collier, N. (2021). Non-Autoregressive Text Generation with Pre-trained Language Models. arXiv preprint arXiv:2102.08220. [pdf]
Su, Y., Cai, D., Wang, Y., Baker, S., Korhonen, A., Collier, N., & Liu, X. (2020). Stylistic dialogue generation via information-guided reinforcement learning strategy. arXiv preprint arXiv:2004.02202. [pdf]
Prokhorov, V., Shareghi, E., Li, Y., Pilehvar, M. T., & Collier, N. (2019). On the Importance of the Kullback-Leibler Divergence Term in Variational Autoencoders for Text Generation. In Proceedings of the 3rd Workshop on Neural Generation and Translation (pp. 118-127). [pdf]
Prokhorov, V., Pilehvar, M. T., & Collier, N. (2019). Generating Knowledge Graph Paths from Textual Definitions using Sequence-to-Sequence Models. In Proceedings of NAACL-HLT (pp. 1968-1976). [pdf]
Selected publications on NLP for Epidemic Detection and Mapping
Collier, N., Doan, S., Kawazoe, A., Goodwin, R. M., Conway, M., Tateno, Y., ... & Shigematsu, M. (2008). BioCaster: detecting public health rumors with a Web-based text mining system. Bioinformatics, 24(24), 2940-2941. [pdf]
Hay, S. I., Battle, K. E., Pigott, D. M., Smith, D. L., Moyes, C. L., Bhatt, S., ... & Gething, P. W. (2013). Global mapping of infectious disease. Philosophical Transactions of the Royal Society B: Biological Sciences, 368(1614), 20120250. [pdf]
Collier, N., Son, N. T., & Nguyen, N. M. (2011). OMG U got flu? Analysis of shared health messages for bio-surveillance. Journal of biomedical semantics, 2(5), S9. [pdf]
Kawazoe, A., Jin, L., Shigematsu, M., Barrero, R., Taniguchi, K., & Collier, N. (2006). The Development of a Schema for the Annotation of Terms in the Biocaster Disease Detecting/Tracking System. In KR-MED. [pdf]
Collier, N., Goodwin, R. M., McCrae, J., Doan, S., Kawazoe, A., Conway, M., ... & Dien, D. (2010). An ontology-driven system for detecting global health events. In Proceedings of the 23rd International Conference on Computational Linguistics (pp. 215-222). Association for Computational Linguistics. [pdf]
Recently completed projects
2015.2 to 2020.2 EPSRC SIPHS (EP/M005089/1): I was funded by a 1.2 million 5-year EPSRC fellowship for the Semantic Interpretation of Personal Health messages on the Web (SIPHS) project. This international collaborative effort leveraged social media data for digital disease applications such as detecting infectious disease outbreaks and adverse drug reactions.
2015.10 to 2018.10 MRC PheneBank (MR/M025160/1): This project aimed to develop a new method for the identification and harmonisation of human phenotypes from the scientific literature as well as their associations to entities of interest such as diseases, genes and other phenotypes.
Nigel Collier, Professor of Natural Language Processing
The Language Technology Lab, Faculty of Modern and Medieval Languages and Linguistics, University of Cambridge, 9 West Road, Cambridge CB3 9DB, United Kingdom
Tel: +44 (0)1223-760373
Email: nhc30 [AT] cam dot ac dot uk
Office: Room TR-23, English Faculty Building
ORCID ID: 0000-0002-7230-4164