GenAI for Natural Language Processing
School of Computer Science
Holon Institute of Technology
Course 65339, Spring 2025, Lecturer: Dr. Alexander (Sasha) Apartsin
The course "GenAI for Natural Language Processing" delves into the inner workings of large language models (LLMs) and explores their diverse applications.
Through a balanced integration of theoretical knowledge and end-to-end implementation projects, students will develop practical skills in designing, building, and deploying NLP applications based on advanced LLM techniques, using cutting-edge software libraries and tools.
Uncover the Hidden Leader: Reveal True Seniority in Every Resume
Matan Cohen, Shira Shany, Edan Menahem
Abstract
Accurately assessing candidate seniority from resumes is a critical yet challenging task, complicated by the prevalence of overstated experience and ambiguous self-presentation. This study investigates the effectiveness of large language models (LLMs), including fine-tuned BERT architectures, for automating seniority classification in resumes. To rigorously evaluate model performance, we introduce a hybrid dataset comprising real-world resumes and synthetically generated hard examples designed to simulate exaggerated qualifications and understated seniority. Through extensive benchmarking, we reveal the strengths and limitations of state-of-the-art LLMs in detecting subtle linguistic cues associated with seniority inflation and implicit expertise. Our findings highlight promising directions for enhancing AI-driven candidate evaluation systems and mitigating bias introduced by self-promotional language.
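To make the fine-tuned BERT baseline concrete, the following is a minimal sketch of seniority classification with an encoder and a classification head; the checkpoint, three-level label set, and toy resume snippets are illustrative assumptions, not the authors' exact setup.

# Minimal sketch: fine-tuning a BERT encoder to classify resume text into seniority levels.
# The checkpoint, label taxonomy, and toy examples are assumptions, not the paper's configuration.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

labels = ["junior", "mid", "senior"]                      # assumed seniority taxonomy
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels))

texts = ["Led a team of 12 engineers across three products.",
         "Completed a 3-month internship writing unit tests."]
y = torch.tensor([2, 0])                                  # senior, junior

enc = tok(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
optim = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):                                        # a few toy epochs
    out = model(**enc, labels=y)                          # cross-entropy loss computed internally
    out.loss.backward()
    optim.step()
    optim.zero_grad()

model.eval()
with torch.no_grad():
    pred = model(**enc).logits.argmax(dim=-1)
print([labels[i] for i in pred])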
LLM-Driven Insights for Effortless Customer Support
Shanu Kupiec, Inbal Bolshinsky, Almog Sasson, Nadav Margalit
Abstract
In the era of conversational AI, generating accurate and contextually appropriate service responses remains a critical challenge. A central question is: Is explicit intent recognition a prerequisite for generating high-quality service responses, or can models bypass this step and produce effective replies directly? This paper conducts a rigorous comparative study to address this fundamental design dilemma. Leveraging two publicly available service interaction datasets, we benchmark several state-of-the-art language models, including a fine-tuned T5 variant, across two pipelines: Intent-First Response Generation and Direct Response Generation. Evaluation metrics encompass linguistic quality and task success rates, revealing surprising insights into the necessity or redundancy of explicit intent modelling. Our findings challenge conventional assumptions in conversational AI pipelines, offering actionable guidelines for designing more efficient and effective response generation systems.
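The two pipelines can be contrasted with a short sketch using an instruction-tuned T5 checkpoint; the model name, prompts, and intent list below are illustrative assumptions rather than the fine-tuned variant studied in the paper.

# Minimal sketch contrasting Intent-First and Direct response generation with a public T5 model.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

def generate(prompt: str) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    return tok.decode(model.generate(ids, max_new_tokens=64)[0], skip_special_tokens=True)

query = "My package still hasn't arrived and the tracking page is blank."

# Pipeline A: Intent-First Response Generation.
intent = generate("Classify the customer's intent as one of "
                  "[delivery_issue, refund_request, product_question]: " + query)
reply_a = generate(f"Customer message: {query}\nDetected intent: {intent}\n"
                   "Write a short, polite support reply:")

# Pipeline B: Direct Response Generation, with no explicit intent step.
reply_b = generate(f"Customer message: {query}\nWrite a short, polite support reply:")

print(intent, reply_a, reply_b, sep="\n")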
Accelerating Code Reviews with Transfer Learning: Embracing Every Language at Lightning Speed
Yogev Cohen, Romi Simkin, David Ohayon
Abstract
Automatically determining whether a code change requires manual review is vital for maintaining software quality in modern development workflows. However, the emergence of new programming languages and frameworks creates a critical bottleneck: while large volumes of unlabelled code are readily available, there is insufficient labelled data to train supervised models for review classification. We address this challenge by leveraging Large Language Models (LLMs) to translate code changes from well-resourced languages into equivalent changes in underrepresented or emerging languages, generating synthetic training data where labelled examples are scarce.
We assume that although LLMs have learned the syntax and semantics of new languages from available unlabelled code, they have yet to fully grasp which code changes are considered significant or review-worthy within these emerging ecosystems. We use LLMs to generate synthetic change examples and train supervised classifiers to overcome this. We systematically compare the performance of these classifiers against models trained on real labelled data. Our experiments across multiple GitHub repositories and language pairs demonstrate that LLM-generated synthetic data can effectively bootstrap review recommendation systems, narrowing the performance gap even in low-resource settings. This approach provides a scalable pathway to extend automated code review capabilities to rapidly evolving technology stacks, even without annotated data.
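A minimal sketch of the synthetic-data step appears below: a labelled diff from a well-resourced language is translated into an emerging language while its review label is carried over. The call_llm helper is a hypothetical stand-in for any chat-completion client, and the prompt wording and label name are assumptions.

# Minimal sketch: generating synthetic labelled diffs in a target language via an LLM.
from typing import Dict, List

PROMPT = """You are given a code change (unified diff) written in {src_lang}.
Rewrite it as an equivalent change in {tgt_lang}, preserving its intent and scope.
Return only the translated diff.

{diff}"""

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder: plug in the LLM client of your choice here.
    raise NotImplementedError

def synthesize(labelled_diffs: List[Dict], src_lang: str, tgt_lang: str) -> List[Dict]:
    synthetic = []
    for item in labelled_diffs:                # item = {"diff": str, "needs_review": bool}
        translated = call_llm(PROMPT.format(src_lang=src_lang,
                                            tgt_lang=tgt_lang,
                                            diff=item["diff"]))
        # Assumption: the review-worthiness label transfers with the translated change.
        synthetic.append({"diff": translated, "needs_review": item["needs_review"]})
    return synthetic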
Aspect-Level Insights from Student Feedback
Omer Tsaadi, Ron Twito, Gleb Lukovsski, Valentine Gundorov
Abstract
The analysis of student course reviews offers valuable insights for improving educational programs; however, the unstructured nature of textual feedback presents significant challenges. This study investigates the effectiveness of aspect-based sentiment analysis (ABSA) for extracting actionable information from course evaluations. We construct a synthetic dataset of text reviews by systematically sampling relevant educational aspects and sentiment scores, ensuring controlled variability and balanced representation across aspects. We benchmark several state-of-the-art models using this dataset, including zero-shot and few-shot learning with large pretrained language models (LLMs) and fine-tuned BERT-based classifiers. The evaluation focuses on both aspect extraction accuracy and sentiment classification performance across varying levels of supervision. Results highlight the trade-offs between model complexity, supervision level, and sentiment analysis accuracy, offering practical guidelines for deploying ABSA systems in educational settings. Our findings demonstrate that modern LLMs exhibit strong generalization capabilities even with limited labeled data, while fine-tuned models remain competitive when aspect-specific annotations are available.
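One common fine-tuned ABSA formulation is BERT sentence-pair classification, where the review is the first segment and the aspect the second; the sketch below illustrates this setup with an assumed checkpoint, aspect list, and sentiment labels (the untrained head makes the printed predictions illustrative only).

# Minimal sketch: aspect-based sentiment via (review, aspect) sentence-pair classification.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

sentiments = ["negative", "neutral", "positive"]
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=len(sentiments))

review = "The lectures were engaging, but the grading felt arbitrary."
aspects = ["teaching quality", "grading"]

enc = tok([review] * len(aspects), aspects,        # encode (review, aspect) pairs
          padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    probs = model(**enc).logits.softmax(dim=-1)
for aspect, p in zip(aspects, probs):
    print(aspect, sentiments[int(p.argmax())])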
Feel the Emotion Behind Every Line
Naor Mazliah, Shay Dahary, Mazal Lemalem, Avi Edana
Abstract
The emotional content of song lyrics plays a pivotal role in shaping listener experiences and influencing musical preferences. This paper investigates the task of multi-label emotional attribution of song lyrics by predicting intensity scores for six fundamental emotions. A manually labeled dataset is constructed using a mean opinion score (MOS) approach, which aggregates annotations from multiple human raters to ensure reliable ground-truth labels. Leveraging this dataset, we comprehensively evaluate several publicly available large language models (LLMs) under zero-shot and few-shot learning scenarios. Additionally, we fine-tune a BERT-based model specifically for predicting multi-label emotion scores. Experimental results reveal the relative strengths and limitations of zero-shot, few-shot, and fine-tuned models in capturing the nuanced emotional content of lyrics. Our findings highlight the potential of LLMs for emotion recognition in creative texts, providing insights into model selection strategies for emotion-based music information retrieval applications.
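The sketch below illustrates the two building blocks described above: averaging rater scores into MOS labels and attaching a six-output regression head to a BERT encoder. The checkpoint, emotion list, rating scale, and toy lyric are assumptions.

# Minimal sketch: MOS aggregation plus a six-way regression head on BERT.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

emotions = ["joy", "sadness", "anger", "fear", "surprise", "disgust"]

# Mean opinion score: average each emotion's intensity over the human raters (assumed 0-5 scale).
ratings = np.array([[4, 1, 0, 0, 2, 0],       # rater 1
                    [5, 0, 1, 0, 3, 0],       # rater 2
                    [4, 1, 0, 1, 2, 0]])      # rater 3
mos = ratings.mean(axis=0) / 5.0              # normalised ground-truth intensity vector

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(emotions), problem_type="regression")

enc = tok("Tears on my pillow, but I'm dancing anyway", return_tensors="pt")
target = torch.tensor([mos], dtype=torch.float)
out = model(**enc, labels=target)             # MSE loss against the MOS vector
print(out.loss.item(), out.logits.shape)      # scalar loss, (1, 6) predicted intensities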
Tracing Emotions in Native Expressions
Yotam Hasid, Amit Keinan, Edo Koren
Abstract
Hebrew sentiment analysis remains challenging due to the scarcity of large, high-quality labeled datasets. This study explores a cross-lingual transfer approach by leveraging well-established English sentiment analysis datasets and automatically translating them into Hebrew. The translated datasets serve as a foundation for training and evaluating sentiment classifiers in the Hebrew language. We verify the accuracy of the translated test partitions and their corresponding labels to ensure the reliability of the evaluation. Our experimental framework includes assessing the few-shot learning capabilities of several state-of-the-art pretrained large language models (LLMs) and fine-tuning BERT-based models for sentiment classification. The results provide valuable insights into the effectiveness of cross-lingual dataset translation, the trade-offs between zero-shot, few-shot, and fully fine-tuned models, and the applicability of modern LLMs for sentiment analysis in low-resource languages.
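The dataset-translation step can be sketched with an off-the-shelf English-to-Hebrew MT model, carrying each sentiment label over unchanged; the OPUS-MT checkpoint name and toy examples below are assumptions, and any comparable translation model would serve.

# Minimal sketch: machine-translating labelled English sentiment examples into Hebrew.
from transformers import pipeline

translate = pipeline("translation", model="Helsinki-NLP/opus-mt-en-he")

english_examples = [("The plot was predictable and the acting was flat.", "negative"),
                    ("A warm, funny film I would happily watch again.", "positive")]

# Assumption: the sentiment label is preserved under translation.
hebrew_examples = [(translate(text)[0]["translation_text"], label)
                   for text, label in english_examples]
print(hebrew_examples)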
Decoding deceptive headlines with transparent LLM insights
Aviv Elbaz, Lihi Nofar, Tomer Protal
Abstract
The proliferation of clickbait headlines poses significant challenges to the credibility of information and user trust in digital media. While recent advances in machine learning have improved the detection of manipulative content, the lack of explainability limits their practical adoption. This paper presents an explainable framework for clickbait detection that identifies clickbait titles and attributes them to specific linguistic manipulation strategies. We introduce a synthetic dataset generated by systematically augmenting real news headlines using a predefined catalogue of clickbait strategies. This dataset enables controlled experimentation and detailed analysis of model behaviour. We explore and compare multiple detection pipelines, including zero-shot large language models (LLMs) and fine-tuned BERT classifiers, integrated with attribution modules that predict the underlying clickbait strategies present in each title. Experimental results demonstrate that attribution-enhanced models improve detection accuracy and provide interpretable insights into the persuasive techniques employed. This work advances the development of transparent and trustworthy AI systems for combating manipulative media content.
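As a rough illustration of a zero-shot attribution module, the sketch below scores a headline against a catalogue of clickbait strategies with an off-the-shelf NLI-based zero-shot classifier; the strategy catalogue, checkpoint, and example headline are assumptions rather than the paper's predefined catalogue.

# Minimal sketch: zero-shot attribution of a headline to clickbait strategies.
from transformers import pipeline

clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

strategies = ["curiosity gap", "exaggeration", "emotional appeal",
              "listicle framing", "urgency or scarcity"]

headline = "You won't believe what this one weird trick does to your electricity bill"
result = clf(headline, candidate_labels=strategies, multi_label=True)
for label, score in zip(result["labels"], result["scores"]):
    print(f"{label}: {score:.2f}")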
Empowering Fairness in AI: Detecting and Correcting Bias in LLM Text
Netta Robinson, Shay Yafet, Katrin Zablianov, Ariel Sofer
Abstract
This paper explores bias detection and mitigation in Natural Language Inference (NLI), with a focus on hypothesis statements that encode socially sensitive attributes such as gender, race, or profession. We introduce a curated dataset of premise–hypothesis pairs in which the hypothesis contains potentially biased language. Each pair is annotated to indicate whether the attribute reference is inferable from the premise or reflects an unwarranted bias. We develop a classification model that predicts this bias-aware NLI label, distinguishing between justified and biased inferences. To mitigate such biases, we implement a rewriting mechanism that generates bias-neutral hypotheses while preserving the original entailment relation. To evaluate the rewriting process, we propose a PCA-based metric that quantifies the semantic and structural changes correlated with bias mitigation. Our results show that the framework effectively detects and neutralizes biased language in NLI tasks, supporting more equitable and robust language understanding.
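One way to realize a PCA-based rewrite metric is to embed the original and bias-neutral hypotheses, fit PCA on the paired embeddings, and measure displacement along the leading components; the sketch below follows that idea, with the embedding checkpoint and toy pairs as assumptions rather than the authors' exact metric.

# Minimal sketch: measuring hypothesis rewrites in a PCA-reduced embedding space.
import numpy as np
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer

pairs = [("The nurse adjusted the IV because she is caring.",
          "The nurse adjusted the IV carefully."),
         ("The engineer fixed the bug because he is logical.",
          "The engineer fixed the bug methodically.")]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
orig = encoder.encode([p[0] for p in pairs])
rewr = encoder.encode([p[1] for p in pairs])

pca = PCA(n_components=2)
pca.fit(np.vstack([orig, rewr]))
shift = pca.transform(orig) - pca.transform(rewr)        # per-pair movement in PCA space
print(np.linalg.norm(shift, axis=1))                     # larger norm = larger change under rewriting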
Decoding Market Moods: Real-Time Sentiment Insights for Every Industry
Liel Ziv, Roy Levy, Adir Habkuk
Abstract
Financial news articles often span multiple economic sectors, with varying sentiment implications for each. Traditional sentiment analysis methods fail to capture this granularity, treating articles as monolithic units or ignoring segment-specific context. We present a novel system for segment-aware sentiment attribution that analyzes financial news at the chunk level, jointly estimating both the economic segment relevance and sentiment for each chunk. The system uses an attribution model to assign a soft vector of economic segments (e.g., finance, automotive, technology) to each chunk and a sentiment model to determine the corresponding sentiment polarity. Segment-level sentiment scores are computed via attribution-weighted aggregation across all chunks in a document. To train the attribution model, we construct a semi-synthetic dataset that blends real financial text with synthetic sentences labeled by economic segment, enabling precise supervision for multi-segment attribution. Experiments on real-world financial corpora demonstrate that our approach yields more accurate and explainable segment-level sentiment estimation than baseline document-level or sentence-level methods. The system can support more nuanced financial analysis, such as market impact forecasting and risk-aware investment strategies.
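The attribution-weighted aggregation step reduces to a small amount of linear algebra, sketched below with an assumed segment list and toy per-chunk numbers.

# Minimal sketch: combining per-chunk segment attributions and sentiments into
# one sentiment score per economic segment.
import numpy as np

segments = ["finance", "automotive", "technology"]

# One row per article chunk: soft segment-attribution vector (rows sum to 1).
attribution = np.array([[0.7, 0.1, 0.2],
                        [0.1, 0.8, 0.1],
                        [0.3, 0.2, 0.5]])

# Sentiment polarity per chunk in [-1, 1].
sentiment = np.array([0.6, -0.4, 0.1])

# Segment-level score: attribution-weighted average of chunk sentiments.
weights = attribution / attribution.sum(axis=0)          # normalise weights per segment
segment_sentiment = weights.T @ sentiment
print(dict(zip(segments, segment_sentiment.round(3))))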
Keeping Wikipedia Collaborative, Not Combative
Ron Butbul, Yuval Horesh, Rotem Mustacchi
Abstract
Toxicity detection in online discourse is crucial for maintaining healthy digital environments. This study compares classification methods for identifying toxic comments on Wikipedia. Using the publicly available Wikipedia Detox dataset, we benchmark a range of models, including traditional machine learning classifiers (e.g., logistic regression, Naive Bayes), deep neural networks, and transformer-based language models such as BERT. Each model is evaluated on its ability to detect various forms of toxicity, including insults, threats, and identity-based hate. We analyze performance across metrics such as precision, recall, F1-score, and area under the ROC curve, and we assess model robustness to noisy or adversarial input.
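One of the traditional baselines in such a comparison can be sketched as TF-IDF features with logistic regression, scored by F1 and ROC AUC; the toy comments below are illustrative assumptions, not the Wikipedia Detox data.

# Minimal sketch: a TF-IDF + logistic regression toxicity baseline with F1 and ROC AUC.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

comments = ["Thanks for the helpful edit!", "You are a complete idiot.",
            "Please cite a source for this claim.", "Get lost, nobody wants you here.",
            "Nice catch on the typo.", "This is vandalism and you know it, moron."]
toxic = [0, 1, 0, 1, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(comments, toxic, test_size=0.5,
                                                    random_state=0, stratify=toxic)
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(X_train, y_train)

pred = model.predict(X_test)
prob = model.predict_proba(X_test)[:, 1]
print("F1:", f1_score(y_test, pred), "ROC AUC:", roc_auc_score(y_test, prob))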