I am an NLP Engineer at Arkhn.
I received my Ph.D. through a cotutelle program between La Rochelle University, France, and Jožef Stefan Institute, Slovenia, supervised by Prof. Antoine Doucet and Assist. Prof. Senja Pollak. Previously, I worked as a Data Scientist at Samsung SDSV.
My research interests are natural language processing and machine learning, information extraction, low-resource languages, generative AI, and large-scale language models.
NEWS
October 2024: I joined Arkhn in Paris, France, as an NLP Engineer.
September 2024: One of our three accepted papers received the Best Paper Award at the 28th International Conference on Theory and Practice of Digital Libraries (TPDL), 2024.
August 2024: I taught the Generative AI for Everyone course at VietAI.
July 2024: I gave seminars on "Terminology in the era of LLMs" at the University of Coimbra, Portugal, and the University of Malta, Malta.
June 2024: I became a reviewer for the SIGNLL Conference on Computational Natural Language Learning (CoNLL 2024) and the 31st International Conference on Computational Linguistics (COLING 2025).
February 2024: I joined the Program Committee of the 17th Workshop on Graph-Based Natural Language Processing (TextGraphs-17), co-located with ACL 2024 in Bangkok, Thailand.
December 2023: I became a Student Session Chair for the 35th European Summer School in Logic, Language, and Information in Leuven, Belgium.
October 2023: I taught the Build Applications with OpenAI API course at VietAI.
September 2023: I joined the Program Committee of Computational Terminology in NLP & Translation Studies (ConTeNTS) at RANLP 2023.
July - August 2023: I attended the 34th European Summer School in Logic, Language, and Information in Ljubljana, Slovenia.
June - July 2023: I attended the Machine Learning Summer School on Applications in Science in Krakow, Poland.
May 2023: I attended the 17th Conference of the European Chapter of the Association for Computational Linguistics in Dubrovnik, Croatia.
May 2023: I taught the ChatGPT/Bard for Everyone course at VietAI.
March 2023: I attended the EELISA spring school "Ethos + Tekhne: a new generation of AI researchers" in Pisa, Italy.
PUBLICATIONS
For a complete list of publications, please refer to my Google Scholar page.
LIAS: Layout Information-Based Article Separation in Historical Newspapers
Wenjun Sun, Hanh Thi Hong Tran, Carlos-Emiliano González-Gallardo, Mickaël Coustaty, Antoine Doucet
International Conference on Theory and Practice of Digital Libraries (TPDL 2024)
We propose LIAS, a method based on layout information, and conduct experiments on historical newspapers. The method first identifies the separator lines of the newspaper, then analyzes the layout information to reconstruct the information flow of the document, segments the text based on the semantic relationship between text blocks in that flow, and finally separates the articles.
LIT: Label-Informed Transformers on Token-Based Classification
Wenjun Sun, Hanh Thi Hong Tran, Carlos-Emiliano González-Gallardo, Mickaël Coustaty, Antoine Doucet
International Conference on Theory and Practice of Digital Libraries (TPDL 2024)
We propose LIT, an end-to-end pipeline architecture that integrates the transformer's encoder-decoder mechanism with additional label semantics for token classification tasks.
Leveraging Open Large Language Models for Historical Named Entity Recognition
Carlos-Emiliano González-Gallardo, Hanh Thi Hong Tran, Ahmed Hamdi, Antoine Doucet
International Conference on Theory and Practice of Digital Libraries (TPDL 2024)
(Best Paper Award)
We develop methods to detect semantically ambiguous and complex entities in short and low-context settings of Complex NER using three different prompt-based approaches.
Is Prompting What Term Extraction Needs?
Hanh Thi Hong Tran, Carlos-Emiliano González-Gallardo, Julien Delaunay, Antoine Doucet, Senja Pollak
International Conference on Text, Speech, and Dialogue (TSD 2024)
We evaluate the applicability of open- and closed-source LLMs to the ATE task, compared with two benchmarks that treat ATE as a sequence-labeling task (iobATE) and a seq2seq task (templATE).
Global-SEG: Text Semantic Segmentation Based on Global Semantic Pair Relations
Wenjun Sun, Hanh Thi Hong Tran, Carlos-Emiliano González-Gallardo, Mickaël Coustaty, Antoine Doucet
International Conference on Document Analysis and Recognition (ICDAR 2024)
We propose Global-SEG, utilizing global semantic pair relations from both token- and sentence-level language models for text semantic segmentation.
Can cross-domain term extraction benefit from cross-lingual transfer and nested term labeling?
Hanh Thi Hong Tran, Matej Martinc, Andraz Repar, Nikola Ljubešić, Antoine Doucet & Senja Pollak
Machine Learning, 2024.
We propose a novel NOBI annotation regime and evaluate the abilities of cross-lingual and multilingual versus monolingual learning in cross-domain automatic term extraction.
Hanh Thi Hong Tran, Tien Nam Nguyen, Antoine Doucet, Senja Pollak
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
We propose a comparative study among three groups of methods to trigger the detection: (1) Using metric-based models; (2) Using a fine-tuned sequence-labeling language model (LM); and (3) Using a fine-tuned large-scale language model (LLM).
L3I++ at SemEval-2023 Task 2: Prompting for Multilingual Complex NER
Carlos-Emiliano González-Gallardo, Hanh Thi Hong Tran, Nancy Girdhar, Emanuela Boros, Jose G Moreno, Antoine Doucet
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)
We develop methods to detect semantically ambiguous and complex entities in short and low-context settings of Complex NER using three different prompt-based approaches.
Ensembling Transformers for Cross-domain Automatic Term Extraction
Hanh Thi Hong Tran, Matej Martinc, Andraz Pelicon, Antoine Doucet, Senja Pollak
International Conference on Asian Digital Libraries (ICADL 2022)
We propose a comparative study on the predictive power of Transformers at extracting single- and multi-word terms in a multilingual cross-domain setting with and without ensembling approaches.
Can Cross-domain Term Extraction Benefit from Cross-lingual Transfer?
Hanh Thi Hong Tran, Matej Martinc, Antoine Doucet, Senja Pollak
International Conference on Discovery Science (DS 2022)
We evaluate the abilities of cross-lingual and multilingual versus monolingual learning in cross-domain automatic term extraction.
Named Entity Recognition Architecture Combining Contextual and Global Features
Hanh Thi Hong Tran, Antoine Doucet, Nicolas Sidere, Jose G Moreno, Senja Pollak
International Conference on Asian Digital Libraries (ICADL 2021)
We propose combining contextual features from XLNet with global features from a Graph Convolutional Network (GCN) to enhance NER performance.