My broader research goal is to develop Machine Learning systems that extract knowledge from textual data. Specifically, I am interested in using unlabelled data to improve the performance of ML algorithms in low labelled-data regimes. I develop techniques based on Self-Supervised Learning (e.g. Contrastive Learning) and Semi-Supervised Learning (e.g. Self-Distillation), along with Graph Neural Networks, for in-domain and cross-domain information extraction.
My research interests include (non-exhaustively):
Self-Supervised Learning
Graph Neural Networks: https://tinyurl.com/GraphNNet (GCN Intro + hands-on)
Learning under Limited Supervision
Domain Adaptation and Transfer Learning in NLP: https://github.com/SamujjwalSam/Short-text_GNN
Multi-Label Text Classification and Social Network Analysis: https://tinyurl.com/Kerala-Flood-ML
Efficient GNN training and inference
During disasters, large numbers of short texts containing crucial situational information are generated. Proper extraction and identification of this information can be useful for various rescue and relief operations. A few specific types of infrequent situational information may be critical. However, obtaining labels for those resource-constrained classes is both challenging and expensive, and supervised methods have limited usability in such scenarios. To overcome this challenge, we propose a semi-supervised learning framework that utilizes abundantly available unlabelled data through self-learning. The proposed framework improves the performance of the classifier on resource-constrained classes by selectively incorporating highly confident samples from unlabelled data for self-learning. Incremental incorporation of unlabelled data, as and when it becomes available, makes the framework suitable for ongoing disaster mitigation. Experiments on three disaster-related datasets show that this improvement results in an overall performance increase over the standard supervised approach.
Ghosh, Samujjwal, & Desarkar, M. S. (2020). "Semi-supervised granular classification framework for resource constrained short-texts: Towards retrieving situational information during disaster events", In 12th ACM Web Science Conference 2020, ACM. 12th ACM Conference on Web Science (WebSci ’20), July 6–10, 2020, Southampton, United Kingdom. https://doi.org/10.1145/3394231.3397892
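The self-learning idea in the abstract above can be illustrated with a generic self-training loop. This is only a minimal sketch and not the framework from the paper: the tiny Naive Bayes classifier, the function names (`train_nb`, `predict_proba`, `self_train`), the confidence threshold of 0.75, and the toy texts are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes model on whitespace tokens."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_proba(model, doc):
    """Return {class: posterior probability}, using add-one smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label, count in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(count / total_docs)
        for token in doc.lower().split():
            if token in vocab:  # tokens never seen in training are ignored
                score += math.log(
                    (word_counts[label][token] + 1) / (total_words + len(vocab))
                )
        log_scores[label] = score
    top = max(log_scores.values())
    exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {c: v / norm for c, v in exp_scores.items()}


def self_train(docs, labels, unlabelled, threshold=0.75, max_rounds=5):
    """Move confidently predicted unlabelled texts into the training set.

    The threshold and round limit are illustrative assumptions.
    """
    docs, labels, pool = list(docs), list(labels), list(unlabelled)
    for _ in range(max_rounds):
        model = train_nb(docs, labels)  # one model per round
        remaining, added = [], 0
        for doc in pool:
            probs = predict_proba(model, doc)
            label, confidence = max(probs.items(), key=lambda kv: kv[1])
            if confidence >= threshold:
                docs.append(doc)
                labels.append(label)
                added += 1
            else:
                remaining.append(doc)
        pool = remaining
        if added == 0:  # nothing confident enough: stop early
            break
    return train_nb(docs, labels), docs, labels, pool


# Toy data, invented for this sketch.
labelled_docs = ["need water and food", "bridge collapsed road damaged"]
labelled_y = ["need", "damage"]
unlabelled_docs = ["urgent need food", "road damaged near bridge", "hello world"]

model, final_docs, final_labels, leftover = self_train(
    labelled_docs, labelled_y, unlabelled_docs
)
```

Because `self_train` accepts any pool of unlabelled texts, it can be called again with the enlarged training set whenever new unlabelled data arrives, which mirrors the incremental setting described above. Texts that never reach the threshold stay in the returned pool rather than being forced into a class.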
Proper formulation of features plays an important role in short-text classification tasks, as the amount of text available is very small. In the literature, Term Frequency - Inverse Document Frequency (TF-IDF) is commonly used to create feature vectors for such tasks. However, the TF-IDF formulation does not utilize the class information available in supervised learning. For classification problems, if it is possible to identify terms that strongly distinguish among classes, then more weight can be given to those terms during the feature construction phase. Incorporating this extra class-label information may result in improved classifier performance. We propose a supervised feature construction method to classify tweets posted during different disaster scenarios according to the actionable information they might contain. Improved classifier performance on such tasks can be helpful in rescue and relief operations. We used three benchmark datasets containing tweets posted during the Nepal and Italy earthquakes in 2015 and 2016 respectively. Experimental results show that the proposed method obtains better classification performance on these benchmark datasets.
Ghosh, Samujjwal, & Desarkar, M. S. (2018). "Class specific tf-idf boosting for short-text classification: Application to short-texts generated during disasters", In SMERP Workshop at Companion of The Web Conference 2018, ACM. International World Wide Web Conferences Steering Committee (WWW-W ’18), April 23–27, 2018, Lyon, France. https://doi.org/10.1145/3184558.3191621
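To make the idea of giving more weight to class-distinguishing terms concrete, here is a minimal sketch of one possible boosting rule. It is not the formulation from the paper: the purity-based boost, the `alpha` parameter, the function names (`tfidf_vectors`, `class_boosts`, `boosted_vectors`), and the toy tweets are assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Plain TF-IDF: (term count / doc length) * log(N / document frequency)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        vectors.append(
            {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors


def class_boosts(docs, labels, alpha=1.0):
    """Hypothetical boost per term, learned from labelled documents.

    purity = largest share of a term's document frequency held by one class.
    A term spread evenly over k classes gets boost 1.0; a term confined
    to a single class gets boost 1.0 + alpha.
    """
    n_classes = len(set(labels))
    per_class_df = defaultdict(Counter)  # term -> {class: document frequency}
    for doc, label in zip(docs, labels):
        for term in set(doc.lower().split()):
            per_class_df[term][label] += 1
    boosts = {}
    for term, counts in per_class_df.items():
        if n_classes < 2:
            boosts[term] = 1.0  # no class contrast is possible
            continue
        purity = max(counts.values()) / sum(counts.values())
        uniform = 1.0 / n_classes
        boosts[term] = 1.0 + alpha * (purity - uniform) / (1.0 - uniform)
    return boosts


def boosted_vectors(docs, boosts):
    """Scale each TF-IDF weight by its term's boost (1.0 for unseen terms)."""
    return [
        {t: w * boosts.get(t, 1.0) for t, w in vec.items()}
        for vec in tfidf_vectors(docs)
    ]


# Toy tweets, invented for this sketch.
train_docs = ["need water", "need food", "road damaged", "water damaged road"]
train_labels = ["need", "need", "damage", "damage"]

boosts = class_boosts(train_docs, train_labels)
plain = tfidf_vectors(train_docs)
boosted = boosted_vectors(train_docs, boosts)
```

In this toy data, "need" occurs only in the "need" class and so receives the full boost, while "water" occurs once in each class and keeps its plain TF-IDF weight. The boosts are learned from labelled documents only, so they can be applied unchanged to the TF-IDF vectors of unlabelled test tweets.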
Micro-blogging sites are an important source of real-time situational information during disasters such as earthquakes, hurricanes, wildfires, and floods. Such disasters cause misery in the lives of affected people. Timely identification of the steps needed to help them can mitigate that misery to a large extent. In this paper, we focus on the problem of automatically classifying disaster-related tweets into a set of predefined categories. Example categories include resource availability, resource requirement, and infrastructure damage. Properly annotating tweets with this class information can help in timely determination of the steps needed to address the concerns of people in the affected areas. Depending on the information type or category, different feature sets might be useful for identifying posts belonging to that category. In this work, we define multiple feature sets and use them with various supervised classification algorithms from the literature to study the effectiveness of our approach in annotating tweets with their appropriate information categories.
Ghosh, Samujjwal, Srijith, P., & Desarkar, M. S. (2017). Using social media for classifying actionable insights in disaster scenario. International Journal of Advances in Engineering Sciences and Applied Mathematics, 9(4), 224–237 (Springer). https://doi.org/10.1007/s12572-017-0197-2
Nowadays, a high-end car generally has close to 70 ECUs connected through CAN or FlexRay networks. These cars mainly follow a Federated Architecture, a pattern that describes an approach to enterprise architecture allowing interoperability and information sharing between semi-autonomous, de-centrally organized systems and applications. Here, all the sensors and actuators are connected either to the ECUs or to the bus directly, which costs more. The next generation, however, is evolving towards an Integrated Architecture, where all the sensors and actuators are connected to a specific type of interface device. Those interface devices are connected to the bus, which is in turn connected to the ECUs. This architecture requires fewer ECUs, which lowers the overall cost compared to the Federated Architecture. My task in this project is to design a system that can efficiently map a system designed as a functional model to the Integrated Architecture model. This mapping step must consider the various available options for both the functional and architectural models.
The Master's thesis can be found at: https://www.dropbox.com/s/hasfw7hr7vqnedg/10CS60R29-Samujjwal%20Ghosh-PPC.pdf?dl=0