On language universals and the nature of the word
In the first part of this talk, I argue that understanding Human Language is best achieved through a description-comparison approach, rather than searching for an abstract “deep reality” or innate universal grammar. This method emphasizes the documentation and comparison of diverse languages using language-specific categories, avoiding assumptions that all languages share the same structures. Since individual languages vary in arbitrary ways, cross-linguistic comparison requires uniform measurement concepts—not shared building blocks. This allows linguists to identify language universals, such as coexpression and asymmetric coding patterns, which reflect functional and evolutionary pressures rather than deep cognitive structures. I critique the generative tradition's speculative assumptions about underlying universals, noting the lack of empirical success. Instead, I advocate for a solid empirical foundation built on descriptive accuracy and broad comparison. Ultimately, comparative linguistics resembles comparative cultural or religious studies—offering insight into the general nature of Human Language through structured analysis of linguistic diversity.
In the second part of the talk, I critique the imprecise use of the term "word" in linguistics, arguing that its ambiguity undermines distinctions between morphology and syntax. Despite frequent claims about universals (e.g., that all languages have words), definitions are rarely provided, and cross-linguistic criteria (e.g., phonological vs. morphosyntactic words) often conflict. I highlight inconsistencies in identifying words, citing examples from Asian and European languages, and I question whether the word-morphology-syntax division is a Eurocentric bias. I review proposed wordhood criteria (e.g., free occurrence, mobility, pauses) but find them inadequate or language-specific. I conclude that "word" may be an unnatural concept, proposing a retro-definition combining free morphs, clitics, roots, and compounds. Similarly, I redefine "compounds" as unexpandable root combinations, rejecting functional or language-specific criteria. Ultimately, I suggest that linguists acknowledge the term’s indeterminacy, much as astronomers retain geocentric expressions despite heliocentric understanding, and advocate for clearer, cross-linguistically applicable definitions.
Prof. Uma Maheshwar Rao
Session 1
Automating Translation: From Human Efforts to Machine Intelligence with Large Language Models
The human pursuit of translation across languages is a long-standing endeavor rooted in the need for cross-cultural communication and knowledge exchange. With technological advancements, particularly in artificial intelligence, we have witnessed substantial progress in simulating language understanding and production. The limitations of human translation, early efforts at automating translation, and the shortcomings of traditional rule-based systems have prompted the exploration of new paradigms in machine translation (MT). The talk also discusses evolutionary trends in MT development in India.
This presentation traces the evolution of translation from manual, human-led efforts to modern automated systems powered by large language models (LLMs). It provides an overview of the role of human translators and linguists, followed by a review of early computational approaches such as rule-based (RBMT), dictionary-based, and statistical machine translation (SMT) systems. It then examines the significant shift brought about by neural machine translation (NMT) in the 2010s, particularly with the advent of sequence-to-sequence (Seq2Seq) models.
Finally, the talk focuses on the emergence and impact of LLMs, highlighting how they represent a fundamental departure from earlier models in their architecture, capabilities, and approach to multilingual translation.
Session 2
Contribution of Large Language Modelling for Indian Languages in Overcoming Linguistic Imperialism
Emerging trends in the development of machine translation (MT) systems hold the potential not only to preserve endangered languages but also to protect major regional languages from being overshadowed by dominant global languages. MT thus emerges as a critical tool in resisting the homogenizing forces of linguistic imperialism. Machine translation—encompassing both speech and text—represents the pinnacle of language technology, as it attempts to simulate one of the most defining human traits: the ability to learn, comprehend, and communicate across languages.
While the development of MT systems often begins with the encoding of linguistic knowledge into computational frameworks, this alone is insufficient. A deeper understanding of human language requires integrating pragmatic, cultural, world, and sociolinguistic knowledge. Traditional linguistic theories, both major and minor, rely heavily on rule-based descriptions, which can number in the thousands—or theoretically, be infinite—since, as famously noted, “every grammar leaks.” In essence, every rule has exceptions, echoing the neogrammarian insight: “for every exception, there is a rule.”
Researchers in the field of MT have come to recognize the limitations of purely rule-based approaches. The aspiration to build a fully automatic, real-time, general-purpose speech-to-speech machine translation system is unattainable using conventional grammatical models alone. As a result, there has been a significant paradigm shift: from rule encoding to simulating human-like language acquisition through machine learning and large-scale data-driven approaches.
This presentation explores how the challenges posed by conventional MT systems have catalyzed this shift and examines the strategies that have emerged in response. Drawing examples from MT systems translating into various Indian languages, it highlights the latent strategies and innovations that are shaping the future of multilingual machine translation.
Prof. Shobhana Chelliah
Making use of computational tools for Language Documentation
Advances in computer technology have revolutionized the ways linguists create and search corpora, that is, large collections of language samples including connected texts. Methods for language description and documentation would benefit from thoughtful improvements to data collection approaches that can take advantage of these advances. In this presentation, I will share how the CoRSAL lab has attempted to use one such technology, automatic speech recognition (ASR), to improve documentation and description outcomes. The presentation will focus on data collection and preparation for ASR modeling and will include considerations of (1) orthography normalization, including word divisions and the representation of long and short vowels; (2) transcribing for tone and the need for both phonetic and phonemic transcription; and (3) creating acoustic models based on conjugated and declined wordlists. I will also share experiments we conducted to evaluate how well the ASR model performed and how we are trying to move newly transcribed data into our documentation workflow.
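As a rough illustration of what transcript preparation for ASR can involve, the following Python sketch normalizes orthography before model training; the specific conventions (collapsing doubled vowels to a length mark, stripping punctuation, lowercasing) are hypothetical examples, not the CoRSAL lab's actual rules.

```python
# Minimal sketch of transcript normalization before ASR training.
# All conventions below are hypothetical illustrations.
import re
import unicodedata

def normalize_transcript(line: str) -> str:
    line = unicodedata.normalize("NFC", line)        # one canonical Unicode form
    line = line.lower()                              # neutralize case
    line = re.sub(r"[.,!?;:\"']", "", line)          # drop punctuation
    line = re.sub(r"(aa|ii|uu)",                     # mark long vowels explicitly
                  lambda m: m.group(1)[0] + "ː", line)
    line = re.sub(r"\s+", " ", line).strip()         # consistent word divisions
    return line

print(normalize_transcript("Kii   baat,  huii?"))    # -> "kiː baːt huiː"
```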
Prof. Jyotiprakash Tamuli
Reading Concordances with Sinclair
This 90-minute session is designed as an introduction to the practice of reading concordances in corpus linguistics, specifically for students of linguistics who are new to the area. Drawing on J.M. Sinclair’s pedagogically-oriented work Reading Concordances, the session offers a hands-on introduction to the methods and mindset essential for developing "concordance literacy"—the skill of interpreting language through corpus evidence.
Corpus linguistics provides access to rich, empirical data on actual language use. Concordances, which consist of keyword-in-context (KWIC) displays of repeated word patterns, offer a powerful entry point into understanding language structure, meaning, and variation. However, as Sinclair argues, the ability to extract meaningful linguistic insights from concordances is not intuitive; it must be cultivated through systematic procedures, critical observation, and a healthy degree of scepticism toward received grammatical and semantic categories.
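As a rough sketch of what a KWIC display involves, the following Python fragment lines up a node word with its left and right co-text; the sample sentence and window size are illustrative only.

```python
# Minimal sketch of a keyword-in-context (KWIC) display.
def kwic(tokens, keyword, window=4):
    """Return aligned left-context | keyword | right-context lines."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left:>35}  [{tok}]  {right}")
    return lines

text = ("the budget set off a storm of protest and a storm of criticism "
        "followed the decision to weather the storm in silence").split()
print("\n".join(kwic(text, "storm")))
```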
In this workshop, participants will be guided through a set of carefully selected tasks from Sinclair’s book, with real concordance lines from English corpora. These will illustrate central themes such as the following:
How meanings are shaped by co-text and co-selection
Underlying regularity and idiomaticity in apparently free expressions
Semantic prosody and evaluative bias in collocational patterns
Literal versus metaphorical usage discernible through lexico-grammatical patterns
Participants will be encouraged to engage actively in analysing the concordance lines and formulating generalisations from bottom-up evidence. The pedagogical approach will follow Sinclair’s seven-step method: initiating with repeated patterns, interpreting them, consolidating evidence, reporting hypotheses, recycling the process with new data, and eventually refining or rejecting earlier assumptions.
By the end of the session, it is hoped that the participants will be able to:
Practise recognising and describing recurring patterns in concordance lines.
Gain insight into the interpretative process of moving from language data to linguistic generalisations.
Develop a critical stance toward assumptions embedded in traditional grammar or dictionaries, favouring data-driven insights.
This practical introduction seeks to lay the groundwork for deeper corpus-based projects, equipping students with a repeatable methodology to engage authentically with language data.
Session 1
An introduction to corpus linguistics
In this lecture, aimed at complete beginners with no background in the study of language using corpora, I will provide an overview of corpus linguistics – what it is, its history, some key terms and concepts – before explaining, from the ground up, the underpinnings of the four most basic methods of corpus analysis (concordances, frequency lists, collocations, and keywords) and giving an indicative discussion of how more complex methods (such as diachronic or synchronic comparisons) are built upon them. Finally, I will introduce some points for critical reflection on the use of corpora in language research, touching on such tricky issues as representativeness of corpus data, data collection biases, and the role of corpus linguistics in the environment of today’s new technologies.
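As an indicative sketch of two of these basic methods, the fragment below builds a frequency list and scores candidate keywords with a Dunning-style log-likelihood measure; the two toy "corpora" are illustrative only.

```python
# Minimal sketch: frequency list and keyword (keyness) comparison.
from collections import Counter
from math import log

study = "the cat sat on the mat the cat purred".split()
reference = "the dog ran and the dog barked at the cat".split()

freq_study, freq_ref = Counter(study), Counter(reference)
print(freq_study.most_common(3))   # frequency list: [('the', 3), ('cat', 2), ('sat', 1)]

def log_likelihood(a, b, c, d):
    """Dunning-style keyness for a word with frequency a in the study corpus
    (size c) and b in the reference corpus (size d)."""
    e1 = c * (a + b) / (c + d)     # expected frequency in the study corpus
    e2 = d * (a + b) / (c + d)     # expected frequency in the reference corpus
    ll = 0.0
    if a:
        ll += a * log(a / e1)
    if b:
        ll += b * log(b / e2)
    return 2 * ll

c, d = len(study), len(reference)
keyness = {w: log_likelihood(freq_study[w], freq_ref[w], c, d) for w in freq_study}
print(sorted(keyness.items(), key=lambda kv: -kv[1])[:3])   # most "key" words in the study corpus
```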
Session 2 (Hands-on Session)
Introduction to a corpus analysis tool: CQPweb
This is a practical session which aims to demonstrate to participants the use of one specific piece of software for corpus analysis. This program is CQPweb, a web-based system – that is, the program runs on a remote server and users access it via a web browser. CQPweb is open-source software, meaning anyone can run it, so many servers exist today, created and operated by different universities and research institutions. In this session, we will work on the first server ever set up, which runs at Lancaster University (https://cqpweb.lancs.ac.uk). First, we’ll look at the corpora available to all users on the platform – many in English, some in other languages – and then we will use these corpora to practise three basic corpus analyses using this system.
The first basic analysis is the concordance: a display of corpus query results, e.g. for a word query, showing how the search term is used in context. The second is the distribution: statistics on how occurrences of a word or phrase are spread out across whatever categories of text (e.g. genres or time periods) are defined within the corpus. The third is collocations: lists of statistically calculated associations between a search term and words that tend to co-occur in proximity to examples of that search term.
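As a rough illustration of the kind of statistic behind a collocation listing, the sketch below counts co-occurrences of a node word within a ±3-token window and scores them with pointwise Mutual Information; the sample text and window size are illustrative, and real systems typically offer a choice of association measures.

```python
# Minimal sketch of collocation scoring with pointwise Mutual Information (PMI).
from collections import Counter
from math import log2

tokens = ("he made a strong case for reform while she made a strong cup "
          "of tea and a strong argument against delay").split()
node, span = "strong", 3

freq = Counter(tokens)
N = len(tokens)
co = Counter()
for i, tok in enumerate(tokens):
    if tok == node:
        # collect words within the window around each occurrence of the node
        co.update(tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span])

def pmi(collocate):
    # PMI: log2( P(node, collocate) / (P(node) * P(collocate)) )
    return log2((co[collocate] / N) / ((freq[node] / N) * (freq[collocate] / N)))

for w, n in co.most_common(5):
    print(f"{w:10s}  co-occurrences={n}  PMI={pmi(w):.2f}")
```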
Finally, we’ll briefly overview the corpora of (South) Asian languages that are available on the Lancaster CQPweb server, and the way in which users can access a tool to upload their own textual data to the system.
Natural Language Understanding and Generation: From Cognitive to Machine Modelling in Low-Resourced Language Contexts
Natural Language Understanding (NLU) and Natural Language Generation (NLG) are two core pillars of Computational Linguistics and Artificial Intelligence. While high-resourced languages have seen remarkable advancements in these domains, low-resourced languages face unique challenges due to the scarcity of linguistic data, limited digital infrastructure, and complex morphological structures. This lecture traces the continuum from human cognitive models of language comprehension and production to their computational realizations, highlighting the interplay between linguistics, cognitive science, and machine learning. Special emphasis will be placed on adapting cognitive perspectives to computational frameworks such as rule-based, statistical, and neural models, and on designing resource-efficient approaches suitable for underrepresented languages. The session will also explore the creation and use of linguistic resources, including corpora, morphological analyzers, annotation layers such as part-of-speech (PoS) tags, and semantic resources and frameworks, as foundational enablers for NLU and NLG in low-resourced settings.
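As one small, hypothetical example of such a resource, the sketch below implements a rule-based, suffix-stripping morphological analyzer; the suffix table and toy roots are invented for illustration and do not describe any particular language.

```python
# Minimal sketch of a rule-based suffix-stripping morphological analyzer.
# The suffix table and lexicon are toy, hypothetical examples.
SUFFIXES = {          # suffix -> grammatical feature it signals
    "lu": "PLURAL",
    "ki": "DATIVE",
    "nu": "ACCUSATIVE",
}

def analyze(word, lexicon):
    """Peel known suffixes off the end of a word until a lexicon root remains."""
    features = []
    while word not in lexicon:
        for suffix, feature in SUFFIXES.items():
            if word.endswith(suffix) and len(word) > len(suffix):
                features.insert(0, feature)       # inner suffixes end up first
                word = word[:-len(suffix)]
                break
        else:
            return word, ["UNKNOWN"]              # no analysis found
    return word, features or ["ROOT"]

lexicon = {"pustaka", "illu"}                     # toy roots for illustration
print(analyze("pustakalu", lexicon))              # -> ('pustaka', ['PLURAL'])
print(analyze("pustakaluki", lexicon))            # -> ('pustaka', ['PLURAL', 'DATIVE'])
```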
From Inhalation to Intonation: Mapping the human voice for technology and therapy
The process of speech production begins with the inhalation of air, which is then used to initiate speech sounds. We have shown earlier that the acoustic characteristics of this inhalation also contain some speech-sound-specific information. Further, once speech sounds are initiated, their acoustic characteristics are complex and often colored by the surrounding speech sounds; such effects can be immediate as well as long-distance. I will draw examples from our previous work on segment-tone interaction and our current work on tonal coarticulation to illustrate this aspect. Later, I will draw examples from some recent analyses in which we have tried to determine normative values for speech phenomena such as nasality. Finally, I will illustrate how these findings are feeding into speech technology development and speech-therapy-related applications.
Session 1 (Lecture and Hands-on Session)
Corpus Query Techniques
This session introduces participants to fundamental techniques used in querying linguistic corpora. Designed for researchers and students working with digital language data, the session aims to show how corpus query techniques can help answer corpus-based research questions. Beginning with a brief overview of what corpus data is and how it is structured, the session then shifts to corpus query systems, which allow users to perform complex linguistic searches. Participants will be introduced to various query strategies and implement multiple search techniques, starting with basic searches based on word forms, part-of-speech (POS) tags, and lemmas. These techniques are essential when exploring grammatical structures, variations in usage, or stylistic features across different text types or speaker groups. Time permitting, the session will also introduce Corpus Query Language (CQL), a powerful formalism that allows users to create complex search patterns using corpus query syntax. By the end of the session, participants will be able to formulate their own corpus queries and understand how different search techniques influence the interpretation of corpus data.
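The sketch below illustrates, over a toy tagged corpus, how searches by word form, POS tag, and lemma differ; the token structure and tag labels are invented for illustration, and the CQL-style query mentioned in the comment is only indicative, since attribute names and tagsets vary by corpus.

```python
# Minimal sketch of word-form, POS, and lemma searches over a toy tagged corpus.
corpus = [  # (word form, POS, lemma) -- an illustrative scheme, not a real tagset
    ("She", "PRON", "she"), ("runs", "VERB", "run"), ("two", "NUM", "two"),
    ("running", "VERB", "run"), ("clubs", "NOUN", "club"), (".", "PUNCT", "."),
]

def search(corpus, word=None, pos=None, lemma=None):
    """Return positions of tokens matching all given constraints
    (roughly what a CQL query such as [lemma="run" & pos="VERB"] asks for)."""
    hits = []
    for i, (w, p, l) in enumerate(corpus):
        if word and w.lower() != word.lower():
            continue
        if pos and p != pos:
            continue
        if lemma and l != lemma:
            continue
        hits.append(i)
    return hits

print(search(corpus, word="running"))           # word-form search -> [3]
print(search(corpus, lemma="run"))              # lemma search -> [1, 3]
print(search(corpus, lemma="run", pos="VERB"))  # combined constraints -> [1, 3]
```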
Session 2 (Lecture and Hands-on Session)
Using XML in Corpus Encoding and Analysis
This session focuses on the role of XML (eXtensible Markup Language) in corpus linguistics, particularly for encoding textual or segmental annotation beyond words. The session begins with a brief introduction to XML, including its syntax, structure (elements, attributes), and well-formedness. Participants will learn how XML encodes various linguistic and metalinguistic information layers, such as speaker identity, text domains, and discourse features. A particular emphasis will be placed on how XML enables cross-segment annotation. Participants will have a hands-on session incorporating demographic information in XML format and indexing the XML-enriched text as a corpus. Next, the session will explore how corpora enriched with XML can be analyzed. Participants will see how XML-encoded values can be targeted in searches, enabling highly customized queries across multiple annotation layers. By the end of the session, attendees will have a foundational understanding of how XML can be effectively used in corpus development and analysis, especially for projects involving detailed annotation schemes or multimodal/multilingual data.
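As a small illustration of the idea, the sketch below parses a fragment of XML-encoded spoken data and restricts a search by a speaker attribute; the element and attribute names (u, who, age) are invented for illustration rather than taken from any standard scheme.

```python
# Minimal sketch of XML-encoded corpus data with speaker metadata,
# using only Python's standard library.
import xml.etree.ElementTree as ET

sample = """
<text id="conv01" domain="conversation">
  <u who="S1" age="34">so what did you think of it</u>
  <u who="S2" age="21">honestly I thought it dragged a bit</u>
  <u who="S1" age="34">really I quite enjoyed the second half</u>
</text>
"""

root = ET.fromstring(sample)
print(root.get("domain"))                  # text-level metadata

# A metadata-restricted "query": utterances by speaker S1 only.
for u in root.findall("u"):
    if u.get("who") == "S1":
        print(u.get("who"), u.get("age"), "->", u.text)
```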
Session 1
Foundations and Challenges in Machine Translation in the Indian Context
Machine Translation (MT) has come a long way in recent years, especially with the rise of neural networks and large language models. But when it comes to India, a country with incredible linguistic diversity, MT still faces many challenges. With 22 official languages and hundreds of regional dialects, translating between Indian languages isn’t just a technical task — it’s a cultural and linguistic puzzle. Many of these languages don’t have enough digital data or parallel corpora, which makes it difficult to train effective MT systems.
This talk looks at the evolution of MT in India, from early rule-based and statistical methods to today’s neural models. It highlights some of the core challenges, such as limited resources, structural differences between languages, and the need to support low-resource and morphologically rich languages. We also touch on important national initiatives like the National Language Translation Mission (NLTM), which aim to make digital content available in local languages.
Building truly inclusive and accurate MT systems for India will require more than just powerful algorithms — it will need better data, deeper linguistic understanding, and collaboration across technology, academia, and government. This talk offers a starting point for understanding what’s been achieved and what still needs to be done to bridge India’s language divide through technology.
Session 2
LLMs and their machine translation capability
Large Language Models (LLMs) like GPT, PaLM, and LLaMA have transformed how we think about language processing — and machine translation is no exception. Unlike traditional translation systems that rely on task-specific architectures and parallel corpora, LLMs are trained on massive amounts of multilingual data and can often translate between languages without being explicitly fine-tuned for the task. This shift has opened up new possibilities, especially for translating low-resource or less-represented languages.
We explore how LLMs handle translation, what makes them different from earlier approaches, and where they currently stand in terms of quality, consistency, and cultural sensitivity. While their performance on high-resource language pairs is impressive, challenges remain when it comes to preserving nuances, dealing with domain-specific content, and ensuring fairness across all languages — especially those that don’t have enough training data.
We also discuss some emerging trends, like instruction-tuning and multilingual prompting, that are helping improve translation accuracy in LLMs. While we’re not yet at the point where LLMs can fully replace dedicated translation systems, their flexibility and generalization ability make them a powerful tool in the evolution of machine translation.
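As a rough sketch of what multilingual prompting can look like, the fragment below composes an instruction-style few-shot translation prompt; the example sentence pairs are illustrative and no particular LLM API is assumed.

```python
# Minimal sketch of a few-shot, instruction-style translation prompt.
# The example pairs are illustrative placeholders, not gold translations.
def build_translation_prompt(source_lang, target_lang, examples, sentence):
    """Compose a prompt with a translation instruction and a few example pairs."""
    lines = [f"Translate from {source_lang} to {target_lang}."]
    for src, tgt in examples:
        lines.append(f"{source_lang}: {src}")
        lines.append(f"{target_lang}: {tgt}")
    lines.append(f"{source_lang}: {sentence}")
    lines.append(f"{target_lang}:")
    return "\n".join(lines)

examples = [
    ("Good morning.", "Suprabhaatam."),            # illustrative pair
    ("Where is the station?", "Station ekkada undi?"),
]
prompt = build_translation_prompt("English", "Telugu", examples, "The train is late.")
print(prompt)   # this string would be sent to whichever LLM is being evaluated
```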
Session 2 (Hands-on Session)
Machine Translation Implementation
Session 1
Grammar to code: The World of Computational Linguistics
This lecture explores fundamental text processing techniques that form the backbone of Natural Language Processing. We begin with tokenization, the process of breaking text into meaningful units like words or phrases, which serves as the first step in text analysis. Next, we examine part-of-speech tagging, where each token is labeled with its grammatical role, such as noun, verb, or adjective. Lemmatization and stemming are then introduced as techniques for reducing words to their base or root forms, enhancing consistency in textual data. The lecture also covers Named Entity Recognition, which involves identifying and classifying proper nouns in text, such as names of people, organizations, and locations. Finally, we delve into dependency and constituency parsing, advanced syntactic analysis methods that uncover the grammatical structure of sentences and relationships between words.
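As an indicative example, the sketch below runs these steps with spaCy, one possible toolkit; it assumes spaCy and its small English model are installed, and other libraries would work equally well.

```python
# Minimal sketch of the pipeline steps described above, using spaCy.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Dr. Rao teaches computational linguistics at the University of Hyderabad.")

for token in doc:
    # tokenization, POS tagging, lemmatization, and dependency parsing in one pass
    print(f"{token.text:15s} {token.pos_:6s} {token.lemma_:15s} "
          f"{token.dep_:10s} head={token.head.text}")

for ent in doc.ents:
    # named entity recognition: proper nouns grouped and classified
    print(ent.text, "->", ent.label_)
```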
Session 2 (Hands-on Session)
Text processing techniques
Building corpora for Indian languages with special reference to the North-East
Building corpora for Indian mother tongues (MTs), languages and their different varieties, particularly those in the diverse NE India region, is a crucial endeavour facing unique challenges and requiring collaborative efforts.
Many of them are endangered and lack sufficient documentation. Building corpora helps in capturing and archiving these MTs and languages for future generations. Corpora are the building blocks for developing various Natural Language Processing (NLP) applications like machine translation, sentiment analysis, and speech recognition; for examining phonetics, morphology, syntax, and semantics; for creating language educational materials like structured lessons, dictionaries, and grammars; and so on.
NE India boasts a high degree of linguistic diversity, including numerous MTs, languages and their varieties from various families, many of them unwritten and under-described. Many lack writing systems and the resources needed to support corpus creation and linguistic research. Acquiring data for many low-resource languages can be difficult, particularly in contexts like Computer-Mediated Communication (CMC) or specialized domains. Moreover, challenges arise from variations in character sets and keyboard layouts, making standardization difficult. There is also a lack of essential NLP tools like POS taggers and parsers for many of these MTs, languages and their varieties. Even the use of a common tag set for all the MTs, languages and their varieties from across different families in the country creates problems in the annotation of corpora. Finally, some MTs, languages and their varieties are spoken in remote and inaccessible areas, posing difficulties for data collection and fieldwork.
Educational and research institutes such as the Central Institute of Indian Languages (CIIL), Mysuru; Tezpur University, Tezpur; and IIT Guwahati, among others, have undertaken projects to document and digitize endangered languages and develop corpora. The Linguistic Data Consortium for Indian Languages (LDC-IL) scheme of CIIL has been releasing datasets to foster NLP development in Indian languages. Efforts are also underway to create parallel corpora for machine translation in North-East Indian languages.
Building robust corpora for Indian languages, especially the diverse and endangered languages of NE India, is a vital undertaking. Collaborative efforts, leveraging technology, and addressing the unique challenges faced by these MTs, languages and their varieties are essential to ensure their survival and promote their use in various domains.