Henry Hung Le

I am currently a Senior Research Scientist at Salesforce Research. Throughout my research career, I have received great guidance from Steven Hoi and Nancy Chen. During my PhD, I was fortunate to collaborate with leading research labs, including Facebook/Meta Research (with Satwik Kottur and Alborz Geramifard) and Salesforce Research (with Steven Hoi and Richard Socher). Earlier in my career, I worked at major companies in the banking and consulting sectors, including JP Morgan, Merrill Lynch, and Deloitte.

I grew up in Saigon, Vietnam (xin chào!) and I am now a Singaporean. 

Contacts:

Publications

CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules

Hung Le, Hailin Chen, Amrita Saha, Akash Gokul, Doyen Sahoo, Shafiq Joty (ICLR, 2024)

(Paper)(Code)(Blog)(MarkTechPost)

Citation

@inproceedings{

le2023codechain,

title={CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules},

author={Hung Le and Hailin Chen and Amrita Saha and Akash Gokul and Doyen Sahoo and Shafiq Joty},

booktitle={The Twelfth International Conference on Learning Representations},

year={2024},

url={https://openreview.net/forum?id=vYhglxSj8j}

}


CodeT5+: Open Code Large Language Models for Code Understanding and Generation

Yue Wang*, Hung Le*, Akhilesh Deepak Gotmare, Nghi D.Q. Bui, Junnan Li, Steven C.H. Hoi (EMNLP, 2023)

(Paper)(Code)(Blog)

Citation

@article{

    wang2023codet5plus,

    title={CodeT5+: Open Code Large Language Models for Code Understanding and Generation},

    author={Wang, Yue and Le, Hung and Gotmare, Akhilesh Deepak and Bui, Nghi D.Q. and Li, Junnan and Hoi, Steven C. H.},

    journal={arXiv preprint},

    year={2023}

}

CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning

Hung Le*, Yue Wang*, Akhilesh Deepak Gotmare, Silvio Savarese, Steven C.H. Hoi (NeurIPS, 2022)

(Paper)(Code)(Blog)(Poster)

Media: Medium, AI Supremacy, Reddit

Citation

@inproceedings{

le2022coderl,

title={Code{RL}: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning},

author={Hung Le and Yue Wang and Akhilesh Deepak Gotmare and Silvio Savarese and Steven Hoi},

booktitle={Advances in Neural Information Processing Systems},

editor={Alice H. Oh and Alekh Agarwal and Danielle Belgrave and Kyunghyun Cho},

year={2022},

url={https://openreview.net/forum?id=WaGvb7OzySA}

}

Multimodal Dialogue State Tracking

Hung Le, Nancy F. Chen, Steven C.H. Hoi

2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL, 2022).

(Slide)(Paper)(Code)

Citation

@inproceedings{le-etal-2022-multimodal,

    title = "Multimodal Dialogue State Tracking",

    author = "Le, Hung  and

      Chen, Nancy  and

      Hoi, Steven",

    booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",

    month = jul,

    year = "2022",

    address = "Seattle, United States",

    publisher = "Association for Computational Linguistics",

    url = "https://aclanthology.org/2022.naacl-main.248",

    pages = "3394--3415",

    abstract = "Designed for tracking user goals in dialogues, a dialogue state tracker is an essential component in a dialogue system. However, the research of dialogue state tracking has largely been limited to unimodality, in which slots and slot values are limited by knowledge domains (e.g. restaurant domain with slots of restaurant name and price range) and are defined by specific database schema. In this paper, we propose to extend the definition of dialogue state tracking to multimodality. Specifically, we introduce a novel dialogue state tracking task to track the information of visual objects that are mentioned in video-grounded dialogues. Each new dialogue utterance may introduce a new video segment, new visual objects, or new object attributes and a state tracker is required to update these information slots accordingly. We created a new synthetic benchmark and designed a novel baseline, Video-Dialogue Transformer Network (VDTN), for this task. VDTN combines both object-level features and segment-level features and learns contextual dependencies between videos and dialogues to generate multimodal dialogue states. We optimized VDTN for a state generation task as well as a self-supervised video understanding task which recovers video segment or object representations. Finally, we trained VDTN to use the decoded states in a response prediction task. Together with comprehensive ablation and qualitative analysis, we discovered interesting insights towards building more capable multimodal dialogue systems.",

}
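
To make the state representation described in the abstract above concrete, here is a minimal Python sketch: a multimodal dialogue state that tracks the current video segment and per-object attribute slots, updated as each new utterance introduces new segments, objects, or attributes. The class and field names are hypothetical assumptions for illustration, not the benchmark's actual schema or the VDTN model.

from dataclasses import dataclass, field

@dataclass
class MultimodalDialogueState:
    segment: tuple = (0, 0)                      # current video segment (start, end)
    objects: dict = field(default_factory=dict)  # object_id -> {attribute: value}

    def update(self, segment=None, object_updates=None):
        """Apply the information slots introduced by one new dialogue utterance."""
        if segment is not None:
            self.segment = segment
        for obj_id, attrs in (object_updates or {}).items():
            self.objects.setdefault(obj_id, {}).update(attrs)

state = MultimodalDialogueState()
state.update(segment=(0, 10), object_updates={"obj_1": {"color": "red", "shape": "cube"}})
state.update(object_updates={"obj_1": {"action": "sliding"}, "obj_2": {"color": "blue"}})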

VGNMN: Video-grounded Neural Module Networks for Video-Grounded Language Tasks

Hung Le, Nancy F. Chen, Steven C.H. Hoi

2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL, 2022).

(Slide)(Paper)

Citation

@inproceedings{le-etal-2022-vgnmn,

    title = "{VGNMN}: Video-grounded Neural Module Networks for Video-Grounded Dialogue Systems",

    author = "Le, Hung  and

      Chen, Nancy  and

      Hoi, Steven",

    booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",

    month = jul,

    year = "2022",

    address = "Seattle, United States",

    publisher = "Association for Computational Linguistics",

    url = "https://aclanthology.org/2022.naacl-main.247",

    pages = "3377--3393",

    abstract = "Neural module networks (NMN) have achieved success in image-grounded tasks such as Visual Question Answering (VQA) on synthetic images. However, very limited work on NMN has been studied in the video-grounded dialogue tasks. These tasks extend the complexity of traditional visual tasks with the additional visual temporal variance and language cross-turn dependencies. Motivated by recent NMN approaches on image-grounded tasks, we introduce Video-grounded Neural Module Network (VGNMN) to model the information retrieval process in video-grounded language tasks as a pipeline of neural modules. VGNMN first decomposes all language components in dialogues to explicitly resolve any entity references and detect corresponding action-based inputs from the question. The detected entities and actions are used as parameters to instantiate neural module networks and extract visual cues from the video. Our experiments show that VGNMN can achieve promising performance on a challenging video-grounded dialogue benchmark as well as a video QA benchmark.",

}
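
As a rough illustration of the pipeline described in the abstract above (decompose the question into entities and actions, instantiate one neural module per step, and run them over the video), here is a toy Python sketch. The module names, the parse format, and the placeholder logic are assumptions for illustration only, not VGNMN's actual modules.

def find_entity(video_feats, entity):
    """Select video cues related to one entity (placeholder logic)."""
    return [f for f in video_feats if entity in f["labels"]]

def filter_action(cues, action):
    """Keep cues in which the entity performs the given action."""
    return [c for c in cues if action in c["actions"]]

def answer(cues):
    """Map the remaining visual cues to a short answer."""
    return "yes" if cues else "no"

MODULES = {"find": find_entity, "filter": filter_action, "answer": answer}

def run_program(video_feats, program):
    state = video_feats
    for op, arg in program:
        state = MODULES[op](state, arg) if arg is not None else MODULES[op](state)
    return state

video = [{"labels": {"cube"}, "actions": {"slide"}},
         {"labels": {"cone"}, "actions": {"rotate"}}]
print(run_program(video, [("find", "cube"), ("filter", "slide"), ("answer", None)]))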

DVD: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded Dialogue

Hung Le, Chinnadhurai Sankar, Seungwhan Moon, Ahmad Beirami, Alborz Geramifard, Satwik Kottur. 

The 59th Annual Meeting of the Association for Computational Linguistics (ACL, 2021).

(Slide) (Paper) (Code) (Blog)

Citation

@inproceedings{le-etal-2021-dvd,

    title = "{DVD}: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded Dialogue",

    author = "Le, Hung  and

      Sankar, Chinnadhurai  and

      Moon, Seungwhan  and

      Beirami, Ahmad  and

      Geramifard, Alborz  and

      Kottur, Satwik",

    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",

    month = aug,

    year = "2021",

    address = "Online",

    publisher = "Association for Computational Linguistics",

    url = "https://aclanthology.org/2021.acl-long.439",

    doi = "10.18653/v1/2021.acl-long.439",

    pages = "5651--5665",

    abstract = "A video-grounded dialogue system is required to understand both dialogue, which contains semantic dependencies from turn to turn, and video, which contains visual cues of spatial and temporal scene variations. Building such dialogue systems is a challenging problem, involving various reasoning types on both visual and language inputs. Existing benchmarks do not have enough annotations to thoroughly analyze dialogue systems and understand their capabilities and limitations in isolation. These benchmarks are also not explicitly designed to minimise biases that models can exploit without actual reasoning. To address these limitations, in this paper, we present DVD, a Diagnostic Dataset for Video-grounded Dialogue. The dataset is designed to contain minimal biases and has detailed annotations for the different types of reasoning over the spatio-temporal space of video. Dialogues are synthesized over multiple question turns, each of which is injected with a set of cross-turn semantic relationships. We use DVD to analyze existing approaches, providing interesting insights into their abilities and limitations. In total, DVD is built from 11k CATER synthetic videos and contains 10 instances of 10-round dialogues for each video, resulting in more than 100k dialogues and 1M question-answer pairs. Our code and dataset are publicly available.",

}
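
The dataset sizes quoted in the abstract can be sanity-checked with a few lines of arithmetic (numbers rounded as in the abstract itself):

videos = 11_000               # CATER synthetic videos
dialogues_per_video = 10      # dialogue instances per video
rounds_per_dialogue = 10      # question-answer turns per dialogue

dialogues = videos * dialogues_per_video      # 110,000 (i.e. more than 100k dialogues)
qa_pairs = dialogues * rounds_per_dialogue    # 1,100,000 (roughly 1M QA pairs)
print(dialogues, qa_pairs)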

Learning Reasoning Paths over Semantic Graphs for Video-grounded Dialogues

Hung Le, Nancy F. Chen, Steven C.H. Hoi. 

International Conference on Learning Representations (ICLR, 2021).

(Slide) (Video) (Paper)

Citation

@inproceedings{

le2021learning,

title={Learning Reasoning Paths over Semantic Graphs for Video-grounded Dialogues},

author={Hung Le and Nancy F. Chen and Steven Hoi},

booktitle={International Conference on Learning Representations},

year={2021},

url={https://openreview.net/forum?id=hPWj1qduVw8}

}

BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded Dialogues

Hung Le, Doyen Sahoo, Nancy F. Chen, Steven C.H. Hoi. 

Conference on Empirical Methods in Natural Language Processing (EMNLP, 2020). 

(Slide) (Video) (Paper) (Code)

Citation

@inproceedings{le-etal-2020-bist,

    title = "{B}i{ST}: Bi-directional Spatio-Temporal Reasoning for Video-Grounded Dialogues",

    author = "Le, Hung  and

      Sahoo, Doyen  and

      Chen, Nancy  and

      Hoi, Steven C.H.",

    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",

    month = nov,

    year = "2020",

    address = "Online",

    publisher = "Association for Computational Linguistics",

    url = "https://www.aclweb.org/anthology/2020.emnlp-main.145",

    doi = "10.18653/v1/2020.emnlp-main.145",

    pages = "1846--1859",

    abstract = "Video-grounded dialogues are very challenging due to (i) the complexity of videos which contain both spatial and temporal variations, and (ii) the complexity of user utterances which query different segments and/or different objects in videos over multiple dialogue turns. However, existing approaches to video-grounded dialogues often focus on superficial temporal-level visual cues, but neglect more fine-grained spatial signals from videos. To address this drawback, we proposed Bi-directional Spatio-Temporal Learning (BiST), a vision-language neural framework for high-resolution queries in videos based on textual cues. Specifically, our approach not only exploits both spatial and temporal-level information, but also learns dynamic information diffusion between the two feature spaces through spatial-to-temporal and temporal-to-spatial reasoning. The bidirectional strategy aims to tackle the evolving semantics of user queries in the dialogue setting. The retrieved visual cues are used as contextual information to construct relevant responses to the users. Our empirical results and comprehensive qualitative analysis show that BiST achieves competitive performance and generates reasonable responses on a large-scale AVSD benchmark. We also adapt our BiST models to the Video QA setting, and substantially outperform prior approaches on the TGIF-QA benchmark.",

}
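
As a rough sketch of the bidirectional reasoning described in the abstract above, the snippet below runs query-guided attention over temporal and then spatial video features, and in the reverse order, before fusing the two directions. It is a minimal PyTorch rendering under assumed tensor shapes and a simple fusion layer, not the paper's exact BiST architecture.

import torch
import torch.nn as nn

class BiSTSketch(nn.Module):
    """Query-guided attention over video features in two directions,
    with shared attention weights per feature space and a linear fusion."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn_temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, query, spatial, temporal):
        # query: (B, L, D) text features; spatial: (B, S, D); temporal: (B, T, D)
        # temporal-to-spatial: ground the query in time, then refine over space
        t_cues, _ = self.attn_temporal(query, temporal, temporal)
        t2s, _ = self.attn_spatial(t_cues, spatial, spatial)
        # spatial-to-temporal: ground the query in space, then refine over time
        s_cues, _ = self.attn_spatial(query, spatial, spatial)
        s2t, _ = self.attn_temporal(s_cues, temporal, temporal)
        return self.fuse(torch.cat([t2s, s2t], dim=-1))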

UniConv: A Unified Conversational Neural Architecture for Multi-domain Task-oriented Dialogues

Hung Le, Doyen Sahoo, Chenghao Liu, Nancy F. Chen and Steven C.H. Hoi. 

Conference on Empirical Methods in Natural Language Processing (EMNLP, 2020). 

(Slides) (Video) (Paper) (Code)

Citation

@inproceedings{le-etal-2020-uniconv,

    title = "{U}ni{C}onv: A Unified Conversational Neural Architecture for Multi-domain Task-oriented Dialogues",

    author = "Le, Hung  and

      Sahoo, Doyen  and

      Liu, Chenghao  and

      Chen, Nancy  and

      Hoi, Steven C.H.",

    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",

    month = nov,

    year = "2020",

    address = "Online",

    publisher = "Association for Computational Linguistics",

    url = "https://aclanthology.org/2020.emnlp-main.146",

    doi = "10.18653/v1/2020.emnlp-main.146",

    pages = "1860--1877",

    abstract = "Building an end-to-end conversational agent for multi-domain task-oriented dialogues has been an open challenge for two main reasons. First, tracking dialogue states of multiple domains is non-trivial as the dialogue agent must obtain complete states from all relevant domains, some of which might have shared slots among domains as well as unique slots specifically for one domain only. Second, the dialogue agent must also process various types of information across domains, including dialogue context, dialogue states, and database, to generate natural responses to users. Unlike the existing approaches that are often designed to train each module separately, we propose {``}UniConv{''} - a novel unified neural architecture for end-to-end conversational systems in multi-domain task-oriented dialogues, which is designed to jointly train (i) a Bi-level State Tracker which tracks dialogue states by learning signals at both slot and domain level independently, and (ii) a Joint Dialogue Act and Response Generator which incorporates information from various input components and models dialogue acts and target responses simultaneously. We conduct comprehensive experiments in dialogue state tracking, context-to-text, and end-to-end settings on the MultiWOZ2.1 benchmark, achieving superior performance over competitive baselines.",

}
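
To make the bi-level state idea above concrete, here is a minimal sketch of a multi-domain dialogue state with shared and domain-specific slots, in the spirit of MultiWOZ-style domains. The slot names and the update helper are illustrative assumptions, not UniConv's actual tracker.

# Hypothetical multi-domain dialogue state: domains at the top level, slots
# (some shared across domains, some domain-specific) underneath.
dialogue_state = {
    "restaurant": {"area": "centre", "pricerange": "cheap", "food": "italian"},
    "hotel":      {"area": "centre", "pricerange": "cheap", "stars": "4"},
}

def update_state(state, domain, slot, value):
    """Overwrite one slot value for one domain; unmentioned slots persist."""
    state.setdefault(domain, {})[slot] = value
    return state

update_state(dialogue_state, "hotel", "parking", "yes")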

Non-Autoregressive Dialog State Tracking

Hung Le, Richard Socher, Steven C.H. Hoi. 

International Conference on Learning Representations (ICLR, 2020).

(Slides/Video) (Paper) (Code)

Citation

@inproceedings{

Le2020Non-Autoregressive,

title={Non-Autoregressive Dialog State Tracking},

author={Hung Le and Richard Socher and Steven C.H. Hoi},

booktitle={International Conference on Learning Representations},

year={2020},

url={https://openreview.net/forum?id=H1e_cC4twS}

}

Multimodal Transformer Networks for End-to-end Video-grounded Dialogue Systems

Hung Le, Doyen Sahoo, Nancy F. Chen, Steven C.H. Hoi. 

The 57th Annual Meeting of the Association for Computational Linguistics (ACL, 2019).

(Slides) (Video) (Paper) (Code)

Citation

@inproceedings{le-etal-2019-multimodal,

    title = "Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue Systems",

    author = "Le, Hung  and

      Sahoo, Doyen  and

      Chen, Nancy  and

      Hoi, Steven",

    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",

    month = jul,

    year = "2019",

    address = "Florence, Italy",

    publisher = "Association for Computational Linguistics",

    url = "https://aclanthology.org/P19-1564",

    doi = "10.18653/v1/P19-1564",

    pages = "5612--5623",

    abstract = "Developing Video-Grounded Dialogue Systems (VGDS), where a dialogue is conducted based on visual and audio aspects of a given video, is significantly more challenging than traditional image or text-grounded dialogue systems because (1) feature space of videos span across multiple picture frames, making it difficult to obtain semantic information; and (2) a dialogue agent must perceive and process information from different modalities (audio, video, caption, etc.) to obtain a comprehensive understanding. Most existing work is based on RNNs and sequence-to-sequence architectures, which are not very effective for capturing complex long-term dependencies (like in videos). To overcome this, we propose Multimodal Transformer Networks (MTN) to encode videos and incorporate information from different modalities. We also propose query-aware attention through an auto-encoder to extract query-aware features from non-text modalities. We develop a training procedure to simulate token-level decoding to improve the quality of generated responses during inference. We get state of the art performance on Dialogue System Technology Challenge 7 (DSTC7). Our model also generalizes to another multimodal visual-grounded dialogue task, and obtains promising performance.",

}
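
As a loose illustration of the query-aware attention through an auto-encoder mentioned in the abstract above, the sketch below attends over video features with the text query and adds an auxiliary loss that reconstructs the query from the attended features. The shapes, module names, and reconstruction head are assumptions for illustration, not the paper's exact MTN component.

import torch
import torch.nn as nn

class QueryAwareAutoEncoderSketch(nn.Module):
    """Attend over non-text (video) features with the text query, then try to
    reconstruct the query from the attended features as an auxiliary signal."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attend = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.reconstruct = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.recon_loss = nn.MSELoss()

    def forward(self, query_feats, video_feats):
        # query_feats: (B, L, D); video_feats: (B, T, D)
        attended, _ = self.attend(query_feats, video_feats, video_feats)
        recon = self.reconstruct(attended)              # try to recover the query
        aux_loss = self.recon_loss(recon, query_feats)  # auto-encoding objective
        return attended, aux_loss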

URLNet: Learning a URL Representation with Deep Learning for Malicious URL Detection

Hung Le, Quang Pham, Doyen Sahoo, Steven C.H. Hoi. 

Preprint

(Paper) (Code)

Citation

@article{le2018urlnet,

  title={URLNet: Learning a URL representation with deep learning for malicious URL detection},

  author={Le, Hung and Pham, Quang and Sahoo, Doyen and Hoi, Steven CH},

  journal={arXiv preprint arXiv:1802.03162},

  year={2018}

}