This is my Master's dissertation for the MSc. in Speech and Language Processing (SLP) at the University of Edinburgh.
This blog post serves as a simplified version of my dissertation. If you are interested in reading the full dissertation, please contact me by email.
✨ Why I chose this topic ✨
In the Speech Processing class in the first semester of the SLP programme, I asked Festival to say: "Mt. Fuji in Hakodate is a must-visit." It was no surprise that Festival pronounced "Hakodate" incorrectly, since Festival is an English text-to-speech (TTS) system. The bigger surprise was that Festival also pronounced "Mt." wrongly! Among roughly 80 sentences, Festival incorrectly pronounced several instances of dates, times, and measurement units. These examples belong to the problem of text normalization, a component in the front end of TTS responsible for converting non-standard words (NSWs), broadly defined as atypical word strings such as dates, numbers, and currency symbols, into standard words.
Incorrect text normalization can mislead listeners, and miscommunication is always bad. However, text normalization is a much-overlooked problem in TTS, since most of the attention goes to producing natural-sounding speech. Hence, when I saw text normalization on the dissertation topic list, I immediately chose it.
✨ The dataset ✨
Based on my search so far, there are only two publicly available datasets, one from Google for English, Polish, and Russian (the Google Text Normalization dataset), and one from Baidu for Mandarin Chinese.
Of these datasets, I chose the English one from Google, since English is the only language among them that I am familiar with. I wanted to validate the results during inference myself rather than asking friends who are native speakers of the other languages.
There are fifteen classes in the English dataset. The two non-NSW classes are PLAIN, represented by <self> to denote ordinary, non-normalized words, and PUNCT, represented by sil to indicate a punctuation mark. The thirteen NSW classes are DATE, LETTERS (letter sequence), CARDINAL (cardinal number), VERBATIM (verbatim reading of a sequence of characters), MEASURE (measurement unit), ORDINAL (ordinal number), DECIMAL (decimal fraction), ELECTRONIC (electronic address), DIGIT (digit sequence), MONEY (money amount), FRACTION (non-decimal fraction), TIME (temporal expression), and ADDRESS (street address).
The raw dataset is in the following format:
PLAIN He <self>
PLAIN is <self>
CARDINAL 6 six
MEASURE ft feet
PLAIN tall <self>
PUNCT . sil
Looking at this dataset, text normalization resembles the core NLP task of machine translation. In simpler terms, the user gives the TTS system a sentence containing NSWs, and the system needs to "translate" both the ordinary words and the NSWs into their corresponding spoken-form strings.
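To make the framing concrete, here is a minimal sketch (my own illustration, not code from the dissertation) of collapsing the token-level annotations above into a preverbalized-verbalized sentence pair for a translation-style model; dropping punctuation from the spoken side is an assumption made for this example.

```python
# Turn token-level annotations into one (written, spoken) sentence pair.
def to_sentence_pair(tokens):
    """tokens: list of (semiotic_class, written_token, spoken_form) tuples."""
    source, target = [], []
    for cls, written, spoken in tokens:
        source.append(written)
        if spoken == "<self>":      # PLAIN: copy the written form
            target.append(written)
        elif spoken == "sil":       # PUNCT: no spoken output (assumption)
            continue
        else:                       # NSW: use the verbalization
            target.append(spoken)
    return " ".join(source), " ".join(target)

example = [
    ("PLAIN", "He", "<self>"),
    ("PLAIN", "is", "<self>"),
    ("CARDINAL", "6", "six"),
    ("MEASURE", "ft", "feet"),
    ("PLAIN", "tall", "<self>"),
    ("PUNCT", ".", "sil"),
]
print(to_sentence_pair(example))
# ('He is 6 ft tall .', 'He is six feet tall')
```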
✨So what can we do to solve the problem of text normalization? ✨
Previous researchers tried to solve the problem of text normalization in three ways: rules-based methods, machine-learning methods, and hybrid methods.
Rules-based methods:
The dominant method encodes pronunciation rules for NSWs in weighted finite-state transducers (WFSTs). The downside is that it is expensive to enumerate all the rules, and WFSTs do not handle morphosyntactically complex languages, such as Finnish, well.
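As a toy illustration of the WFST idea (not a system from the literature), the sketch below uses the pynini library to encode a couple of hand-written verbalization rules for measurement abbreviations; real grammars cover numbers, dates, currencies, and much more.

```python
# A toy rules-based verbalizer built as a transducer with pynini.
import pynini

# Hand-written rewrite rules for a few measurement abbreviations.
measure_rules = pynini.string_map([
    ("ft", "feet"),
    ("cm", "centimeters"),
    ("kg", "kilograms"),
]).optimize()

def verbalize(token: str) -> str:
    """Apply the rule transducer; fall back to the token itself if no rule fires."""
    lattice = pynini.compose(pynini.accep(token), measure_rules)
    if lattice.num_states() == 0:   # empty composition: no rule matched
        return token
    return pynini.shortestpath(lattice).string()

print(verbalize("ft"))    # -> "feet"
print(verbalize("tall"))  # -> "tall" (unchanged)
```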
Machine-learning methods:
Because of the drawbacks of rules-based methods, researchers turned to machine-learning models, ranging from n-grams to RNNs and LSTMs, to normalize text. To better capture the contextual information needed for good text normalization, Ro et al. (2022) fine-tuned BERT-base on a subset of the English Google Text Normalization dataset and on a private dataset to obtain word embeddings and word representations, and then trained stacked RNNs on top of them. In the most recent study at the time I wrote the dissertation, Zhang et al. (2024) prompted GPT-4, the most advanced GPT model at that time, to normalize text.
The outputs of these models were better than those produced by rules-based systems. However, each of these systems still made errors on certain NSW classes.
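For illustration only, this is roughly what prompting a chat LLM to normalize a sentence could look like with the OpenAI Python client; the prompt wording and model name here are my assumptions, not the setup used by Zhang et al. (2024).

```python
# Hedged sketch of prompting a chat LLM for text normalization.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "Rewrite the sentence with every non-standard word "
                    "(numbers, dates, units, currencies) spelled out as spoken words."},
        {"role": "user", "content": "He is 6 ft tall."},
    ],
)
print(response.choices[0].message.content)
# e.g. "He is six feet tall."
```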
Hybrid methods:
Some researchers believed that combining rules-based and machine-learning methods could help fix the errors produced by pure machine-learning models. The hybrid approaches include:
Zhang et al. (2019) incorporated rules into the decoder of their models in two different ways. First, they integrated Thrax grammar rules for measurement units and currency symbols, both of which had a normalization accuracy of 97.2% prior to the incorporation of rules, and a finite-state transducer for number expressions to guide their RNN decoder. This architecture achieved an accuracy of 99.3% in normalizing measurement units and 100% in normalizing currency symbols. Second, using the data from the first file of the English Google Text Normalization dataset, Zhang et al. trained a Gated Recurrent Unit (GRU) encoder to identify and classify NSWs, together with an RNN decoder constrained by covering grammars, i.e., lightweight grammars that generate a list of possible verbalizations for an NSW instead of attempting to determine the single correct one. Although this model normalized decimal numbers perfectly, it normalized electronic addresses poorly (63.30% accuracy) and temporal expressions poorly (75.00% accuracy).
Bakhturina et al. (2022) and Ying et al. (2024) combined rules and neural models in different ways. Specifically, Bakhturina et al. (2022) first used a non-deterministic WFST to produce all possible normalizations for pre-normalized sentences in a subset of the Kestrel dataset and in some LibriTTS transcripts produced by Zen et al. (2019). They then employed the BERT-base model, a Transformer-based model able to disambiguate contextual information, to re-score the candidate normalizations, and the highest-scoring candidate was selected as the output. Bakhturina et al.'s system achieved an accuracy of 94.37%, but it still made unrecoverable errors, such as normalizing "75F" as "seventy five degree" instead of "seventy five degrees Fahrenheit". Ying et al. (2024) fine-tuned a BERT model on a subset of the Kestrel dataset to obtain word embeddings, then trained a Convolution Bank and bidirectional GRU encoder with a masked Conditional Random Fields layer on those embeddings to produce normalizations, and finally applied expert-designed rules to refine them. Ying et al. argued that their text normalization component was strong because it achieved a state-of-the-art sentence error rate (SER) of 1.19% and could be trained jointly with other TTS front-end components, which could speed up the deployment of a TTS system.
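As a rough sketch of the re-scoring idea (my own simplification, not Bakhturina et al.'s implementation), each candidate normalization can be scored with BERT's pseudo-log-likelihood and the best-scoring candidate kept:

```python
# Score candidate normalizations with BERT pseudo-log-likelihood.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Mask each token in turn and sum BERT's log-probability of the original token."""
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, input_ids.size(0) - 1):          # skip [CLS] and [SEP]
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(input_ids=masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[input_ids[i]].item()
    return total

candidates = [
    "it was seventy five degrees fahrenheit outside",
    "it was seventy five f outside",
]
print(max(candidates, key=pseudo_log_likelihood))
```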
My approach:
It is noticeable that many different types of models have been employed to normalize text. Concerning large language models (LLMs) in particular, researchers have used BERT, an encoder-only LLM, to obtain rich representations of the input text (Ro et al., 2022; Ying et al., 2024), and others have fine-tuned or prompted decoder-only LLMs, namely GPT-4, to normalize text (Zhang et al., 2024). However, no researcher has used an encoder-decoder LLM to normalize text. This study aims to fill that gap by fine-tuning an encoder-decoder LLM to verbalize text, because obtaining the correct verbalization demands a good understanding of the context (Sproat et al., 2001), and encoder-decoder LLMs can both capture the context of the input text well and generate relevant output (Hui et al., 2022). Moreover, fine-tuning an LLM helps improve its accuracy and relevance by adapting its parameters to task-specific knowledge (Browning, 2024).
Eventually, I chose to fine-tune the FLAN-T5-small model, an encoder-decoder LLM from the FLAN-T5 family developed by Chung et al. (2024). Each model in the FLAN-T5 family was trained simultaneously on 1,836 tasks concerning language, reasoning, and mathematics, which gives the family rich knowledge of the syntax and semantics of various languages, as well as a good understanding of the connections between entities appearing in the training data. Because of this, Chung et al. (2024) claim that any FLAN-T5 model can perform well on unseen tasks when fine-tuned or prompted. As I needed a model that can capture context well and generate appropriate normalizations, the FLAN-T5 family suited my needs, and, due to computational constraints, I chose its smallest member, FLAN-T5-small, with 80 million parameters.
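For readers who want a starting point, below is a minimal sketch of fine-tuning FLAN-T5-small on sentence-level normalization pairs with the Hugging Face transformers library; the hyperparameters, the "normalize:" prefix, and the toy data are illustrative assumptions rather than the exact setup used in the dissertation.

```python
# Minimal sketch: fine-tune FLAN-T5-small on preverbalized/verbalized pairs.
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

# Toy sentence pairs standing in for the Google Text Normalization data.
pairs = [
    {"source": "He is 6 ft tall .", "target": "He is six feet tall"},
    {"source": "The meeting is at 1:00 PM .", "target": "The meeting is at one p m"},
]
dataset = Dataset.from_list(pairs)

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

def preprocess(batch):
    # The instruction prefix is an assumption; the real prompt format may differ.
    inputs = ["normalize: " + s for s in batch["source"]]
    model_inputs = tokenizer(inputs, truncation=True, max_length=128)
    labels = tokenizer(text_target=batch["target"], truncation=True, max_length=128)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-small-textnorm",
    per_device_train_batch_size=8,
    learning_rate=3e-4,          # illustrative hyperparameters
    num_train_epochs=3,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

At inference time, the fine-tuned model produces the verbalized sentence from the prefixed preverbalized input via model.generate().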
✨ The results ✨
This section is a shortened version of two chapters, Results and Discussion, in my dissertation.
Result 1: The accuracy of normalizing each class:
Among all the research on text normalization approaches, only Zhang et al. (2019) trained their model on the same dataset (the first file of the English Google Text Normalization Challenge dataset) and used accuracy as the evaluation metric. This makes the results of my study directly comparable to theirs. The accuracy in verbalizing/normalizing each class in my study and in Zhang et al.'s (2019) study is given in the following table:
Accuracy in verbalizing/normalizing each class in my study and in Zhang et al.'s (2019) study
(Boldfaced numbers indicate that my accuracy in verbalizing a class is higher)
Overall, my model's accuracy across all classes is 99.59%.
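For reference, here is a minimal sketch (my own illustration; the dissertation's exact evaluation script may differ) of how per-class accuracy can be computed from the model's predictions:

```python
# Per-class accuracy over (semiotic_class, reference, prediction) triples.
from collections import defaultdict

def per_class_accuracy(examples):
    correct, total = defaultdict(int), defaultdict(int)
    for cls, reference, prediction in examples:
        total[cls] += 1
        correct[cls] += int(prediction.strip() == reference.strip())
    return {cls: correct[cls] / total[cls] for cls in sorted(total)}

examples = [
    ("MEASURE", "six feet", "six feet"),
    ("TIME", "one p m", "one a m"),
]
print(per_class_accuracy(examples))  # {'MEASURE': 1.0, 'TIME': 0.0}
```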
The fine-tuned FLAN-T5-small model performed better on eight classes, namely PLAIN, PUNCT, DATE, LETTERS, CARDINAL, ELECTRONIC, DIGIT, and TIME. In particular, its accuracy in verbalizing ELECTRONIC and TIME was much higher than that of Zhang et al.'s model (97.35% vs. 63.30%, and 88.76% vs. 75.00%, respectively). The fine-tuned FLAN-T5-small model likely verbalized ELECTRONIC better because an LLM's ability to process sequences in parallel allows it to handle long sequences (Liu et al., 2024), such as electronic addresses, better and retain more of their representations than Zhang et al.'s (2019) GRU encoder and RNN decoder with covering grammars. It was perhaps better at verbalizing TIME because it could combine the representations of TIME verbalizations learned during fine-tuning with its own pre-trained knowledge, which Zhang et al.'s model did not have.
The fine-tuned FLAN-T5-small model was much worse at verbalizing DECIMAL, MONEY, and FRACTION than Zhang et al.'s model because Zhang et al.'s model incorporated covering grammars into the RNN decoder; the verbalization rules encoded in those grammars made the decoder more precise in generating the output verbalizations.
Result 2: The effect of the amount of training data on the performance of the model:
The three most frequent NSW classes are DATE, LETTERS, and CARDINAL, each of which has more than 95k pairs of preverbalized-verbalized sentences in the dataset. Meanwhile, three of the least frequent classes are MONEY, FRACTION, and TIME.
I experimented specifically with FRACTION, TIME, and MONEY to find out whether increasing the training data for these classes would increase the accuracy in verbalizing each of them, for the following reasons. FRACTION is the class with the lowest accuracy in the first experiment (76.31%); moreover, the FLAN-T5-small model fine-tuned on just the dataset could not verbalize complicated fractions, such as "24/29079". Regarding TIME, there are six common ways to express time: for instance, "one o'clock in the afternoon" can be written as "1:00 PM", "1 PM", "1:00 p.m.", "1 p.m.", "13:00", or "1300 hrs". Concerning MONEY, there are three ways to display an amount of money: "currency symbol + amount" (e.g., $100), "amount + currency symbol" (e.g., 500¥), and "amount + currency code" (e.g., 80.99 GBP). Additionally, almost every country has its own currency symbol and corresponding currency code, resulting in more than 150 symbol-code pairs. Possibly because of such variation in TIME and MONEY expressions, the FLAN-T5-small model fine-tuned on just the first file of the English Google Text Normalization Challenge dataset only achieved nearly acceptable accuracy in verbalizing TIME and MONEY (88.76% and 90.54%, respectively).
The accuracy of the FLAN-T5-small model when fine-tuned on different amounts of training data for DATE, LETTERS, CARDINAL, MONEY, FRACTION, and TIME is shown in the following figure:
The relationship between the accuracy of the fine-tuned FLAN-T5-small model in verbalizing DATE, LETTERS, CARDINAL, MONEY, FRACTION, and TIME individually and the amount of training data for each of those NSW classes
According to the figure, the more training data there is for an NSW class, the higher the accuracy in verbalizing that class.

Concerning FRACTION, the class with the least training data among the six, accuracy reached 89.50% when the training data was increased to 10k pairs of preverbalized-verbalized sentences, an increase of 13.19 percentage points over the model trained on the 905 FRACTION pairs present in the dataset. At this point, the fine-tuned FLAN-T5-small model could verbalize "24/29079" correctly, which it could not do with only the original 905 FRACTION pairs. Accuracy increased further to 94.38% with 30k FRACTION pairs, at which point the model could correctly verbalize more complicated fractions, such as "8960/968541" and "120 78/99". These gains occurred because the additional training data contained more complicated preverbalized-verbalized pairs, and because a model's performance on a class with initially little training data tends to improve as more data is added.

Concerning MONEY and TIME, accuracy rose moderately as training data increased. Specifically, the accuracy in verbalizing MONEY went from 90.54% to 93.89% at 10k training pairs and to 97.56% at 30k, while the accuracy in verbalizing TIME went from 88.76% to 92.81% and to 96.78% at 10k and 30k training pairs, respectively. The additional MONEY and TIME data was also more diverse and complex, which contributed to the higher accuracy. At the same time, I observed that the additional MONEY data is dominated by the US dollar, euro, and British pound, with a strong presence of some other Western currencies, such as the Canadian dollar. This imbalance between a few Western currencies and other countries' currencies can cause failures in verbalizing the latter, which is undesirable for TTS systems used by people in many countries.
Concerning DATE and CARDINAL, their accuracies when the FLAN-T5-small model was trained on 30k pairs of training data containing each class were 97.76% and 96.92%, respectively. This is quite close to the accuracies obtained when the model was trained on the full 252,995 DATE and 76,607 CARDINAL training pairs (99.89% and 99.72%, respectively). Perhaps for these classes, the accuracy can be raised with less data, provided it has greater diversity in DATE or CARDINAL expressions; for example, when FLAN-T5-small is trained on 60k diverse DATE pairs, the accuracy in verbalizing DATE might still reach 99.89%.
In short: An increase in the amount of training data with diverse expressions and complexity helped increase the accuracy in verbalizing DATE, LETTERS, CARDINAL, MONEY, FRACTION, and TIME.
🗂️Works cited:
J. H. Ro, F. Stahlberg, K. Wu, and S. Kumar, "Transformer-based models of text normalization for speech applications," arXiv, 2022. [Online]. Available: https://arxiv.org/abs/2202.00153.
Y. Zhang, T. M. Bartley, M. Graterol-Fuenmayor, V. Lavrukhin, E. Bakhturina, and B. Ginsburg, "A Chat about Boring Problems: Studying GPT-Based Text Normalization," in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 10921–10925. [Online]. Available: https://api.semanticscholar.org/CorpusID:262466068.
H. Zhang, R. Sproat, A. H. Ng, F. Stahlberg, X. Peng, K. Gorman, and B. Roark, "Neural Models of Text Normalization for Speech Applications," Computational Linguistics, vol. 45, no. 2, pp. 293–337, Jun. 2019. [Online]. Available: https://aclanthology.org/J19-2004/.
E. Bakhturina, Y. Zhang, and B. Ginsburg, "Shallow Fusion of Weighted Finite-State Transducer and Language Model for Text Normalization," in Proc. Interspeech 2022, 2022, pp. 491–495. doi: 10.21437/Interspeech.2022-11074.
Z. Ying, C. Li, Y. Dong, Q. Kong, Y. Huo, Y. Wang, and Y. Wang, "A Unified Front-End Framework for English Text-to-Speech Synthesis," in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 10181–10185. [Online]. Available: https://api.semanticscholar.org/CorpusID:258762799.
H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, "LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech," in Proc. Interspeech 2019, 2019, pp. 1526–1530. doi: 10.21437/Interspeech.2019-2441.
R. Sproat, A. W. Black, S. Chen, S. Kumar, M. Ostendorf, and C. D. Richards, "Normalization of non-standard words," Computer Speech & Language, vol. 15, no. 3, pp. 287–333, 2001. doi: 10.1006/csla.2001.0169. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S088523080190169X.
K. Hui, H. Zhuang, T. Chen, Z. Qin, J. Lu, D. Bahri, J. Ma, J. Gupta, C. N. dos Santos, Y. Tay, and D. Metzler, "ED2LM: Encoder-Decoder to Language Model for Faster Document Re-ranking Inference," in Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, 2022, pp. 3747–3758. doi: 10.18653/v1/2022.findings-acl.295. [Online]. Available: https://aclanthology.org/2022.findings-acl.295.
J. Browning, "Getting it right: the limits of fine-tuning large language models," Ethics and Information Technology, vol. 26, no. 2, p. 36, May 2024. doi: 10.1007/s10676-024-09779-1. [Online]. Available: https://doi.org/10.1007/s10676-024-09779-1.
H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma, A. Webson, S. S. Gu, Z. Dai, M. Suzgun, X. Chen, A. Chowdhery, A. Castro-Ros, M. Pellat, K. Robinson, D. Valter, S. Narang, G. Mishra, A. Yu, V. Zhao, Y. Huang, A. Dai, H. Yu, S. Petrov, E. H. Chi, J. Dean, J. Devlin, A. Roberts, D. Zhou, Q. V. Le, and J. Wei, "Scaling Instruction-Finetuned Language Models," Journal of Machine Learning Research, vol. 25, no. 70, pp. 1–53, 2024. [Online]. Available: http://jmlr.org/papers/v25/23-0870.html.
Y. Liu, H. He, T. Han, X. Zhang, M. Liu, J. Tian, Y. Zhang, J. Wang, X. Gao, T. Zhong, Y. Pan, S. Xu, Z. Wu, Z. Liu, X. Zhang, S. Zhang, X. Hu, T. Zhang, N. Qiang, T. Liu, and B. Ge, "Understanding LLMs: A Comprehensive Overview from Training to Inference," arXiv preprint arXiv:2401.02038, 2024. [Online]. Available: https://arxiv.org/abs/2401.02038.