Marcello Federico - Scientific Journey

My Scientific Journey in Speech and Language Technology

Opening: Why Language is Hard for Machines

Teaching Machines to Listen (1993–2003)

Crossing the Language Barrier: The Rise of Statistical Machine Translation (2004–2015)

The Neural Revolution: Smarter, More Fluent Translation (2016–2021)

Richer Context, Bigger Ambitions (2018–2023)

The Age of Large Language Models: New Questions at the Frontier (2022–present)

Opening: Why Language is Hard for Machines

Language is the most natural thing in the world — until you try to teach it to a machine.

We speak without thinking about it, switching effortlessly between listening, reading, and translating, filling in gaps, tolerating ambiguity, and making ourselves understood across accents, dialects, and contexts. For a computer, every one of those things is a puzzle of staggering complexity. Yet over the past three decades, machines have gone from barely recognizing a spoken sentence to translating live conversations, dubbing videos, and generating fluent text in dozens of languages. I have had the good fortune of contributing to this transformation — from its early, uncertain days to the era of large language models — and this page tells that story.

My journey began in the early 1990s when I joined IRST (Istituto per la Ricerca Scientifica e Tecnologica) in Trento, Italy, later reborn as FBK (Fondazione Bruno Kessler), at an exciting moment when the institute was still ramping up and the field of language technology was wide open. I was fortunate to enter a stimulating environment where curiosity was encouraged and where the culture of doing science that matters, rigorous but always connected to the real world, shaped my thinking from the very beginning. Over the course of my career I have been lucky enough to witness the "magic" happen several times: those rare moments when a new idea suddenly works, when a technology crosses a threshold and the world looks different on the other side. Looking back, I feel I have occasionally found myself working on problems that turned out to matter more than we realized at the time, and I have come to believe that the problems you choose are far more rewarding than the solutions you develop.

Over time my role evolved from individual contributor to research leader — building and guiding teams, setting scientific agendas, and keeping one eye always on problems with real-world impact. That pattern continued when I moved to Silicon Valley to join Amazon Web Services, where for six years I led the machine translation science team and — in what would prove to be an early, fascinating glimpse into where the field was heading — a research project on automated dubbing, at the time a largely unexplored frontier. I later moved to Madrid, where I now oversee science efforts across multimodal and multilingual AI for Amazon Stores Europe. Looking back, what has remained constant across three decades and three countries is a belief that the best research happens at the boundary between deep scientific curiosity and genuine human need. The tools and benchmarks built by our teams have been used by researchers and engineers worldwide, and helped companies of all sizes — from startups to large enterprises — build products and services they could not have created otherwise.

Teaching Machines to Listen (1993–2003)

When I started my research career, the idea of talking to a computer was still largely science fiction. There were no voice assistants, no dictation software worth using in practice, no systems that could reliably understand natural speech in real-world conditions. The fundamental challenge was deceptively simple to state: given an acoustic signal, how does a machine figure out what words were spoken?

The answer, it turned out, was statistical. Rather than programming rules for every possible pronunciation or sentence pattern, you build a language model — a mathematical description of which words and sequences of words are more or less likely in a given context — and combine it with an acoustic model that maps sounds to phonemes and words. The two work together: the acoustic signal narrows down the candidates, and the language model picks the most plausible interpretation. Getting this combination right, efficiently and robustly, was the central problem of the field.
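The combination described above can be illustrated with a toy example. This is a minimal sketch, not the systems built at the time: the bigram table, the acoustic scores, the `UNSEEN_PROB` floor, and the `lm_weight` parameter are all made-up assumptions chosen to show how a language model can overturn a slightly better acoustic score.

```python
import math

# Hypothetical toy bigram probabilities P(word | previous word); values are illustrative only.
BIGRAM_PROB = {
    ("<s>", "recognize"): 0.2,
    ("recognize", "speech"): 0.5,
    ("<s>", "wreck"): 0.05,
    ("wreck", "a"): 0.3,
    ("a", "nice"): 0.2,
    ("nice", "beach"): 0.1,
}
UNSEEN_PROB = 1e-4  # crude floor for unseen bigrams (an assumption, not a real smoothing method)


def lm_log_prob(words):
    """Log-probability of a word sequence under the toy bigram model."""
    total = 0.0
    prev = "<s>"
    for word in words:
        total += math.log(BIGRAM_PROB.get((prev, word), UNSEEN_PROB))
        prev = word
    return total


def decode(candidates, lm_weight=1.0):
    """Pick the hypothesis maximizing acoustic log-score + weighted LM log-prob."""
    return max(
        candidates,
        key=lambda c: c[1] + lm_weight * lm_log_prob(c[0].split()),
    )[0]


# Two acoustically similar hypotheses with made-up acoustic log-likelihoods.
hypotheses = [
    ("recognize speech", -12.0),
    ("wreck a nice beach", -11.5),
]
best = decode(hypotheses)
```

With the language model switched off (`lm_weight=0.0`), the acoustically stronger but implausible hypothesis wins; with it switched on, the more plausible word sequence is selected.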

My early work focused on making language models smarter and faster. Working within a rich international scientific community that was collectively defining the field, I contributed innovations to core problems of estimation and adaptation — tuning a general-purpose model to a specific topic or domain without starting from scratch. This is a problem that sounds technical but has immediate practical consequences: a system trained on newspaper text will struggle badly when a doctor starts dictating a radiology report. The ideas around language model adaptation developed in those early years turned out to be impactful and highly enduring; versions of them lasted until the dawn of neural models.
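One classic way to adapt a general model to a domain is linear interpolation with a small in-domain model. The sketch below uses that textbook technique purely as an illustration; it is not a reconstruction of the specific methods mentioned above, and the toy corpora and the `domain_weight` value are assumptions.

```python
from collections import Counter


def unigram_model(tokens):
    """Maximum-likelihood unigram probabilities from a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def interpolate(general, domain, domain_weight):
    """Mix two models: P(w) = lam * P_domain(w) + (1 - lam) * P_general(w)."""
    vocab = set(general) | set(domain)
    return {
        w: domain_weight * domain.get(w, 0.0)
        + (1.0 - domain_weight) * general.get(w, 0.0)
        for w in vocab
    }


# Toy "newspaper" and "radiology" corpora, invented for this example.
general = unigram_model("the minister said the budget will rise".split())
domain = unigram_model("the chest scan shows no lesion".split())
adapted = interpolate(general, domain, domain_weight=0.7)
```

After interpolation, a word such as "lesion", which the general model has never seen, receives non-zero probability, while the adapted model remains a proper distribution.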

Speaking of radiology reports: one of the most striking applications I worked on in this period was A.Re.S. (Automatic Reporting System), a system that allowed radiologists to dictate medical reports in a natural way, directly into a computer. In the mid-1990s, this was still a very challenging problem. Doctors were skeptical, the technology was fragile, and the vocabulary of medical imaging is highly specialized. Yet the system worked so well that it eventually spun off into a dedicated company, which successfully deployed dictation systems across Italian hospitals. It was one of the earliest real-world validations that speech technology could save professionals meaningful time in demanding environments. Around the same time, we explored voice telecontrol of robots (using spoken commands to direct a robot remotely) and real-time voice data entry into databases. These high-stakes applications forced us to build systems that were not just accurate in the lab, but robust under real conditions.

These applied experiments were early harbingers of a much broader trend: the idea that speech interfaces could, and should, be embedded in professional tools across every field. Today this seems obvious. In the mid-1990s, it required a significant leap of foresight.

A larger and more sustained project was the automatic transcription of Italian broadcast news. Together with colleagues at IRST, I helped build systems that could listen to radio and television news and produce a written transcript in near real time. This was harder than it sounds: broadcast news is fast, the vocabulary is broad and constantly changing, speakers have different accents, and audio conditions vary. The technology found its first major deployment at RAI, Italy's national broadcasting company, where it was used to feed both national and regional archives — a striking early example of speech recognition operating at genuine national scale. Over time, this work too gave rise to a spin-off company specializing in audio transcription services and media monitoring, a field that has only grown in importance as the volume of spoken content in the world has exploded. The experience gave us deep lessons in making speech recognition scale to the messiness of the real world.

The final strand of this period took the problem in a different direction: not just understanding speech, but helping people find spoken content across language boundaries. In a series of studies around cross-language information retrieval — conducted partly through the international CLEF evaluation campaigns — we asked: can a user search a spoken archive in English and find relevant documents in Italian, or vice versa? The answer was yes, with the right combination of translation and retrieval techniques. It was an early taste of the multilingual challenges that would come to dominate the next phase of my career.

By the early 2000s, the landscape was changing fast. Speech recognition was maturing, and a new frontier was opening up: not just understanding language, but translating it.

Crossing the Language Barrier: The Rise of Statistical Machine Translation (2004–2015)

By the early 2000s, a quiet revolution was underway in language technology. For decades, machine translation had been dominated by rule-based approaches: linguists would painstakingly encode the grammar, vocabulary, and idioms of a language pair into a set of hand-crafted rules, and the machine would follow them. The results were often stilted, brittle, and expensive to build. Then came a radically different idea, whose roots go back to foundational work by IBM researchers in the late 1980s and early 1990s: instead of programming the rules, why not learn them automatically from data? Feed a system millions of pairs of translated sentences, let it discover the patterns statistically, and it will find structure that no human linguist would have thought to write down. It took years for the broader community to fully embrace this vision, but by the mid-2000s statistical machine translation (SMT) was transforming the field.

I threw myself into this new paradigm with a team of talented researchers, and it turned out to be one of the most productive periods of my career. The problems were hard, the community was vibrant and competitive, and the applications were obvious: the world desperately needed better ways to move information across language barriers.

Our contributions touched many of the core challenges. One was how to handle the fact that languages order words differently — German, for instance, tends to push its verbs to the end of a sentence in ways that make direct word-for-word translation absurd. We developed smarter ways for translation systems to reorder words and phrases, a deceptively tricky problem with a large impact on translation quality. Another was efficiency: translation models are enormous, and making them fast and compact enough to be useful in practice required innovations in how linguistic knowledge was stored and retrieved.

Two open-source tools we built during this period had an important impact on the field. IRSTLM, a toolkit for building and managing the large statistical language models that sit at the heart of any translation system, was adopted by research groups and companies around the world. Moses, co-developed with an international consortium of leading research groups, became the standard open-source platform for statistical machine translation for nearly a decade: the tool that countless researchers, practitioners, and companies used to build, test, and deploy MT systems. Seeing a piece of software become the common infrastructure of an entire scientific community is one of those deeply satisfying experiences that make research feel worthwhile.

Alongside these technical contributions, I invested heavily in something that is easy to underestimate: evaluation. Science advances faster when the community agrees on how to measure progress, and in spoken language translation that consensus was missing. Together with colleagues, I helped establish and then lead the IWSLT evaluation campaign — an annual international competition in spoken language translation that used TED talks as a common testbed, making it possible for teams worldwide to compare their systems fairly. The WIT3 dataset of transcribed and translated talks that we built to support IWSLT was for a long time one of the most widely used resources in the field. Running a shared evaluation task sounds like organizational work, but it is also a form of scientific stewardship: you are helping provide a framework for what questions the community asks and how progress is measured.

A third thread was perhaps the most directly connected to human impact: computer-assisted translation and the question of how MT could make professional translators faster and better, rather than simply replacing them. Translation is a skilled, cognitively demanding profession, and the relationship between human translators and machine translation tools is subtle. We ran careful studies measuring how much MT actually helped — under what conditions, for which language pairs, for which text types — and developed methods for systems to adapt continuously to a translator's corrections, getting better with every edit. This work fed into MateCat, an open-source professional translation platform developed in collaboration with an industry partner, which brought these ideas into the hands of real translators working on real documents.

By the time neural networks began reshaping the field around 2016, the SMT era had left a lasting legacy: open tools used by thousands, evaluation benchmarks that structured a decade of research, and a much clearer understanding of what it actually takes — technically and humanly — to cross the language barrier at scale. But the transition, when it came, was swift and unsparing — even faster than the earlier shift from rule-based computational linguistics to data-driven statistical approaches, which had itself once seemed revolutionary. Within just a few years, neural machine translation had rendered most of what we had built technically obsolete, and forced us to radically rethink our approaches, our tools, and our intuitions about how machines learn to translate. It was instructive and exciting, and — as I would come to appreciate — exactly the kind of paradigm shift that keeps science alive.

The Neural Revolution: Smarter, More Fluent Translation (2016–2021)

The arrival of neural machine translation marked, for those of us in the field, the abrupt closure of an era. Almost overnight, systems based on deep neural networks began producing translations that were dramatically more fluent and natural-sounding than anything statistical approaches had achieved. The research community scrambled to understand what was happening, why it worked so well, and — crucially — whether fluency was the same thing as quality.

That last question turned out to be more important than it first appeared. Neural MT outputs could read beautifully while still being subtly wrong: mistranslating numbers, dropping clauses, or failing to respect the meaning of specialist terms. One of our first contributions in this new era was to look carefully and rigorously at where neural MT actually gained over statistical MT, and where it still fell short. This kind of honest comparative evaluation is less glamorous than building new systems, but it is essential: without it, the field risks being seduced by impressions rather than guided by evidence.

Perhaps the most vivid illustration of just how fast this transition happened was ModernMT — a real-time adaptive machine translation system we had been developing based on statistical approaches, which suddenly faced an existential choice: adapt or become irrelevant. In a matter of weeks, the team pivoted entirely, rebuilding ModernMT from the ground up on the emerging transformer neural architecture. That it worked — that a small, determined team could absorb a paradigm shift of that magnitude in such a compressed timeframe — is a testament to both the quality of the people involved and the clarity of the scientific moment. ModernMT went on to become an open-source project and eventually spun off into a startup company — one I had the privilege of leading as CEO before joining Amazon. Its open-source adoption, however, was more limited than we had hoped: the neural MT community grew with remarkable speed but also fragmented just as fast, with many competing solutions emerging in quick succession, each with its own following. It was a humbling lesson in how a vibrant, fast-moving scientific community can be both an enabler and a headwind at the same time.

At AWS, the focus was sharply different: building adaptive neural MT systems capable of serving at scale the needs of large enterprise customers — companies with high translation volumes, strict quality requirements, and the need to customize systems to their own domains and terminology. This meant investing in the full stack of neural MT technology, from model architecture to efficient serving infrastructure, always with an eye on reliability and scale rather than academic benchmarks alone.

One of the defining challenges of this period was multilinguality. Statistical MT had largely been built language pair by language pair, requiring separate systems and large amounts of data for each combination. Neural approaches opened up a tantalizing possibility: a single model trained on many languages simultaneously, capable of translating between pairs it had never explicitly seen, and of transferring what it had learned from data-rich languages to data-poor ones. We invested heavily in this direction, developing methods for multilingual NMT, transfer learning across languages, and adaptation to lower-resource languages. The Scandinavian languages are one example: they have less data available than languages more widely represented on the web, but a global service like AWS could not afford to neglect them.

A recurring theme throughout this period was control: how do you make a neural MT system not simply do what it has learned to do on average, but respond to what you specifically need at any given moment? This manifested in several concrete problems. One of the most insidious was terminology management — ensuring that a system translates a technical term the way a client requires, rather than the way it encountered it most frequently in training data — a problem that neural models, prone to paraphrasing freely, struggled to handle. We developed training methods that taught systems to respect client-imposed terminology constraints, a capability with direct commercial value in legal, medical, and technical translation. Related work addressed formality and style: how to translate into the appropriate formal or informal register for a given context, or how to model the individual style of a human translator. At AWS, these questions took on an additional industrial dimension: how do you deliver all of this reliably and efficiently, at the scale demanded by large enterprise customers processing millions of words a day?

By the early 2020s, neural MT had matured from a disruptive novelty into the undisputed foundation of the field. But even as we were consolidating our understanding of it, something larger was stirring. The same neural architectures that had transformed machine translation were being scaled up, trained on vast swaths of the internet, and turning into something qualitatively new: large language models capable of translating, writing, reasoning, and much else besides. Once again, a new paradigm shift was approaching.

Richer Context, Bigger Ambitions (2018–2023)

With that perspective in mind, the research directions we pursued in this period take on an added significance. A new question had come into focus: what if, instead of relying solely on what a model had learned during training, you could give it relevant examples at the moment of translation? This idea, which the community had begun exploring under the name of context-augmented or retrieval-augmented translation, was one we contributed to actively. The approach was straightforward in principle: retrieve the most relevant examples from a database of past translations — so-called fuzzy matches, sentences that are similar but not identical to the input — and feed them to the model alongside the source text, letting it draw on them as additional guidance. In practice, making this work well required solving non-trivial problems about how models should weigh retrieved examples against their own learned knowledge, and how to prevent them from being misled by matches that were similar in form but different in meaning. Our contributions to this space were recognized in a patent for fuzzy-match augmented machine translation. What strikes me in retrospect is how naturally this line of research aligned with ideas that were emerging in parallel in the broader AI community — what had become known as in-context learning and retrieval-augmented generation (RAG). We were working on the same fundamental intuition in the concrete and rigorous setting of translation, at a time when LLMs were not yet competitive in machine translation.
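The retrieve-then-augment idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the translation memory contents, the character-level similarity from Python's `difflib`, the `0.6` threshold, and the `" ||| "` separator are all hypothetical choices, not the method covered by the patent or used in production.

```python
from difflib import SequenceMatcher

# Hypothetical translation memory of (source, target) pairs; contents are illustrative.
TRANSLATION_MEMORY = [
    ("Press the power button to start the device.",
     "Premere il pulsante di accensione per avviare il dispositivo."),
    ("Remove the battery before cleaning.",
     "Rimuovere la batteria prima della pulizia."),
    ("The warranty lasts two years.",
     "La garanzia dura due anni."),
]


def best_fuzzy_match(source, memory, threshold=0.6):
    """Return the (source, target, score) entry most similar to `source`, or None."""
    scored = [
        (src, tgt, SequenceMatcher(None, source.lower(), src.lower()).ratio())
        for src, tgt in memory
    ]
    src, tgt, score = max(scored, key=lambda item: item[2])
    return (src, tgt, score) if score >= threshold else None


def augment_input(source, memory, sep=" ||| "):
    """Append the retrieved translation to the source; `sep` is an arbitrary marker."""
    match = best_fuzzy_match(source, memory)
    if match is None:
        return source
    return source + sep + match[1]


augmented = augment_input("Press the power button to stop the device.", TRANSLATION_MEMORY)
```

Note that the example query differs from its best match in a single word with the opposite meaning ("stop" versus "start"), which is exactly the kind of match that is similar in form but different in meaning and that a model must learn not to copy blindly.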

From translation enriched by context, it was a natural step — though a technically ambitious one — to ask a much bigger question: what if you could translate not just text, but an entire audiovisual experience?

When you watch a dubbed film or a translated documentary, you are consuming the product of an enormously labor-intensive process: translators, voice actors, directors, and sound engineers all working together to make a foreign-language performance feel natural in your own language. The lip movements have to roughly match the words. The rhythm has to fit the original cadence. The voice has to carry the right emotional weight. For decades this process had resisted automation, because it sits at the intersection of multiple hard problems that must be solved simultaneously.

From the moment I joined AWS, I had the opportunity — and the conviction that the time was right — to launch a pioneering research programme on automated dubbing, at a time when large language models were still far off on the horizon. The goal was to build an end-to-end system capable of taking a video in one language and producing a naturally dubbed version in another, designed from the outset to allow translators and dubbing professionals to intervene and correct results at every stage of the process — from translation to speech synthesis. This meant tackling a chain of interconnected problems that no one had yet addressed systematically as a whole.

The first was isometric machine translation: teaching a translation model to produce output of roughly the right length to match the duration of the original speaker's utterance. The second was prosodic alignment: once you have a translation of the right length, you need the synthesized voice to deliver it with a rhythm and timing that mirrors the original speaker's cadence. For this component we worked in close collaboration with the Alexa speech synthesis team, combining expertise that is rarely found under one roof and that in this project proved complementary and vital to our progress. The third was evaluation: how do you measure whether a dubbed video is good? We developed PEAVS, a perceptual evaluation metric for audio-visual synchrony grounded in actual viewer judgments. It is a general-purpose tool that proved particularly useful for evaluating the quality of dubbed videos.
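To make the length constraint concrete, here is a minimal sketch that selects, among candidate translations, the one closest in length to the source. It rests on loud assumptions: character count stands in for spoken duration, the 10% tolerance is an arbitrary illustrative value, and the candidates are invented. It also only reranks existing candidates, whereas the work described above trained the translation model itself to produce output of the right length.

```python
def length_ratio(source, target):
    """Character-length ratio of target to source (a rough proxy for duration)."""
    return len(target) / len(source)


def pick_isometric(source, candidates, tolerance=0.10):
    """Return the candidate whose length ratio is closest to 1,
    plus a flag saying whether it falls within the tolerance."""
    best = min(candidates, key=lambda c: abs(length_ratio(source, c) - 1.0))
    within = abs(length_ratio(source, best) - 1.0) <= tolerance
    return best, within


source = "Thank you all for coming today."
candidates = [
    "Grazie a tutti per essere venuti qui oggi da noi.",  # too long
    "Grazie a tutti per essere qui.",                     # about the same length
    "Grazie.",                                            # too short
]
best, within = pick_isometric(source, candidates)
```

The returned flag makes it easy to hand segments that miss the tolerance to a human professional, in the spirit of the human-in-the-loop design described earlier.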

Taken together, this body of work contributed to the growth of automated dubbing as a recognized area of research. The vision driving it — that the language barrier in video content could one day be crossed automatically, making lectures and stories accessible to anyone in any language — remains one of the most compelling applications I worked on.

The Age of Large Language Models: New Questions at the Frontier (2022–present)

The neural architectures behind large language models were not new. What had changed was the scale, and with scale had come something that felt, for the first time in my career, like genuine magic. I must admit that what these models managed to solve came as a profound surprise to many of us in the field, myself included. I had witnessed the decline of rule-based computational linguistics, the statistical revolution, and the advent of neural networks, but this was, without doubt, the most profound and disruptive paradigm shift of them all.

For the research community, however, the arrival of LLMs was not a moment to stop and admire. It raised urgent and uncomfortable questions, and I found myself drawn to several of them.

The first was one of the most striking empirical findings of recent years: just how much of the multilingual web is already machine-translated. This question arose directly from our work on developing the first multilingual LLMs at AWS, an effort aimed at preparing the transition from dedicated neural MT systems to LLM-based translation, and one that forced us to look very carefully at the quality and composition of the multilingual training data on which such models depend. As LLMs are trained on vast crawls of internet text, the authenticity of that data matters enormously. My colleagues and I set out to measure the prevalence of machine-translated content across the web, and the findings were startling. A large proportion of multilingual web content turns out to be machine-translated, often without any indication to the reader. This has profound implications: if the data used to train future AI systems is itself the output of earlier AI systems, the potential for compounding errors and self-reinforcing biases is real and serious. It is one of the more consequential feedback loops in modern AI, and one that the field is only beginning to grapple with.

The second question was about trust and reliability. LLMs are now widely used to retrieve and synthesise information: answering questions, summarising documents, helping people navigate complex topics. But how do you know whether the answer a model gives you is actually grounded in the sources it claims to draw on, or whether it is confabulating plausibly? This problem, known as faithfulness in retrieval-augmented generation, became the focus of a new line of work. We developed MEMERAG, a multilingual benchmark for evaluating how faithfully RAG systems answer questions across languages. It is a tool designed to give researchers and engineers a rigorous way to measure a property that is easy to describe but surprisingly hard to quantify. Alongside this, we explored whether models could be trained to evaluate their own faithfulness across languages, developing multilingual self-taught faithfulness evaluators: systems that learn to assess their own reliability without requiring expensive human annotation for every language.

These two threads — data quality and output trustworthiness — may seem like technical concerns, but they point to something larger. As AI systems become more capable and more widely deployed, the questions that matter most are not only "can the system do this?" but "can we trust what it produces, and do we understand where it learned it from?" These are questions that sit at the intersection of machine learning, information retrieval, and what might broadly be called responsible AI. They are also, I would argue, natural extensions of concerns that have run through my research from the very beginning: evaluation rigour, real-world reliability, and the gap between impressive-sounding outputs and genuinely trustworthy ones.

My move from AWS to Amazon Stores was itself a deliberate choice, driven by the conviction that we are now living through one of the most exciting moments in the history of AI applications: a time when, alongside the continued advances in core technology, an extraordinary range of powerful applications has become possible that simply was not within reach before. After years of developing foundational tools and models, I felt drawn to be closer to the customer problems, to see AI land in ways that directly and tangibly improve people's lives. Both sides of this equation, pushing the frontier of what AI can do and finding the best ways to put it to use, have never been more important or more interconnected. The challenge of making product information, customer interfaces, and shopping experiences work seamlessly across dozens of languages and cultures, at the scale and quality that customers expect, is one of the most demanding multilingual AI problems in existence, and one where the distance between a good idea and real impact can be remarkably short.

As my career has evolved, so has my relationship with research itself. Where I once measured my contribution in papers and systems, I now find myself focused on the bigger picture: setting the conditions for science to happen well, mentoring the next generation of researchers, and helping talented people do their best work. This sense of obligation to the next generation has deep roots. For twenty years I lectured at university, teaching the foundations of speech and language technology to students who would go on to shape the field themselves. I had the privilege of advising fifteen PhD students over the course of my career — each one a long and rewarding intellectual journey, from first research question to defended thesis. Many of them are now researchers and engineers at leading technology companies and universities around the world, working on problems that did not exist when they started their doctorates. Seeing former students thrive, and occasionally finding their work cited in papers crossing my desk, is one of the quieter and more lasting satisfactions of a life in science.

Increasingly, this also means leading our efforts on responsible AI — ensuring that the technology we build is not only capable and efficient, but safe, trustworthy, and developed with a clear sense of its broader impact on people and society. After three decades of pushing the boundaries of what machines can do with language, I find that the most important question has always been the same one: not just can we build this, but should we, and how do we make sure it genuinely serves people well?

Thirty years in, the field looks nothing like what I imagined when I first started working on language models in Trento. And yet the fundamental questions feel remarkably continuous: how do machines learn to understand and generate human language? How do we make them reliable enough to trust? How do we build AI systems — including agentic systems capable of acting on people's behalf — that truly serve people in all languages and cultures? Those questions were worth asking in 1993. They are worth asking today. And I suspect they will still be worth asking — in forms we cannot yet imagine — for a long time to come.
