Text to Speech using Natural Language Processing

TULSHIDAS LINGAYAT
9 min read · Jun 15, 2021


What is Text-to-Speech Synthesis?

Text-to-speech (TTS) is a type of assistive technology that reads digital text aloud. It’s sometimes called “read aloud” technology. With a click of a button or the touch of a finger, TTS can take words on a computer or other digital device and convert them into audio. TTS is very helpful for kids who struggle with reading. But it can also help kids with writing and editing, and even focusing.

Nowadays the goal of TTS technology is not simply to have machines talk, but to make them sound like humans of different ages and genders. Eventually, we will be able to listen to machine-voiced audiobooks and news broadcasts, or communicate with voice assistants, without noticing the difference.

The text-to-speech (TTS) synthesis procedure consists of two main phases. The first is text analysis, where the input text is transcribed into a phonetic or some other linguistic representation, and the second one is the generation of speech waveforms, where the output is produced from this phonetic and prosodic information.

Overview of Speech Synthesis

Speech synthesis can be described as the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer and can be implemented in software or hardware. A text-to-speech (TTS) system converts normal language text into speech. Synthesized speech can be created by concatenating pieces of recorded speech that are stored in a database.

A text-to-speech system (or “engine”) is composed of two parts: a front-end and a back-end. The front-end has two major tasks. First, it converts raw text containing symbols like numbers and abbreviations into the equivalent of written-out words. This process is often called text normalization, pre-processing, or tokenization. The front-end then assigns phonetic transcriptions to each word and divides and marks the text into prosodic units, like phrases, clauses, and sentences.

The back-end — often referred to as the synthesizer — then converts the symbolic linguistic representation into sound.

There are several ways to perform speech synthesis: concatenative TTS, formant synthesis, parametric TTS, and hybrid approaches. The choice depends on the task at hand.

Challenges in Text-to-Speech Conversion

Text preparation is usually a difficult undertaking that involves several language-dependent problems. Digits and numerals must be expanded according to context: for example, 1750 could be read as seventeen-fifty (as a year) or one-thousand-seven-hundred-and-fifty (as a measure). Fractions and dates are also problematic: 5/16 can be expanded as five-sixteenths (if a fraction) or May sixteenth (if a date), and the expansion of ordinal and Roman numerals raises similar issues. Abbreviations can be expanded into whole words and pronounced either as written or letter by letter; for example, kg can be either kilogram or kilograms depending on the preceding number. The surrounding context may be enough to determine the correct conversion, but to avoid misconversions systems often fall back on letter-by-letter reading. Special characters and symbols such as '$', '%', '&', '/', '-', and '+' present their own issues. In some situations the word order must even be changed: $71.50 must be expanded as seventy-one dollars and fifty cents, and $100 million as one hundred million dollars, not one hundred dollars million.
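
To make the word-order problem concrete, here is a minimal normalization sketch. It assumes the third-party num2words package for spelling out numerals, and the single regular-expression rule is a simplification for illustration (it ignores forms like "$100 million"):

```python
# A minimal currency-normalization sketch. num2words is a third-party
# package (pip install num2words); the rule below is illustrative only.
import re
from num2words import num2words

def normalize_currency(text: str) -> str:
    # "$71.50" -> "seventy-one dollars and fifty cents": note that the
    # "$" symbol is read *after* the amount, so the word order changes.
    def expand(match: re.Match) -> str:
        dollars, cents = match.group(1), match.group(2)
        words = num2words(int(dollars)) + " dollars"
        if cents:
            words += " and " + num2words(int(cents)) + " cents"
        return words

    return re.sub(r"\$(\d+)(?:\.(\d{2}))?", expand, text)

print(normalize_currency("The invoice totals $71.50."))
# -> The invoice totals seventy-one dollars and fifty cents.
```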

The second task is to determine the correct pronunciation. The most difficult challenge for TTS systems is homographs: words that share a spelling but differ in meaning and, in most cases, pronunciation. For example, lead is pronounced differently depending on its meaning (He followed her lead / He covered the hull with lead). One of the most difficult problems for any TTS system is determining the exact pronunciation of proper names, especially those borrowed from other languages.

Finding the correct intonation, stress, and duration from written text is another challenge. Together these features are called prosodic or suprasegmental features, and they may be considered the melody, rhythm, and emphasis of the speech at the perceptual level.

In concatenative synthesis, the process of gathering and categorizing voice samples takes a long time and can result in massive waveform databases, although the amount of data can be reduced with compression techniques. Concatenation points between samples can cause speech distortion. The coarticulation effect is also a difficulty with longer units, such as words or syllables, and large units can strain memory and system requirements.

Structure of a Text-to-Speech Synthesizer System

The text may contain linguistic, non-linguistic, and paralinguistic information, and we need to retrieve that information by processing the text. The important steps in text-to-speech synthesis are:

  1. Analyzing the text input and then transforming it into a phonetic description.
  2. Generating the prosody.

The structure of the text-to-speech synthesizer can be broken down into two major modules:

  • Natural Language Processing (NLP) module: It produces a phonetic transcription of the text read, together with prosody.
  • Digital Signal Processing (DSP) module: It transforms the symbolic information it receives from NLP into audible and intelligible speech.

The major operations of the NLP module are as follows:

  • Text Analysis and Text Normalization: First the text is segmented into tokens. The token-to-word conversion creates the orthographic form of the token.

Ex: For the token “Mr” the orthographic form “Mister” is formed by expansion; the token “12” gets the orthographic form “twelve”, and “1997” is transformed into “nineteen ninety-seven”.
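
A toy token-to-word pass covering exactly these cases might look as follows; the abbreviation table is hand-made, and the third-party num2words package is assumed for numerals (treating four-digit tokens as years is a heuristic adopted here only for illustration):

```python
# Toy token-to-word conversion: expand abbreviations from a small table
# and spell out numerals with num2words (a third-party package).
from num2words import num2words

ABBREVIATIONS = {"Mr": "Mister", "Dr": "Doctor", "kg": "kilograms"}

def token_to_words(token: str) -> str:
    if token in ABBREVIATIONS:
        return ABBREVIATIONS[token]
    if token.isdigit():
        # Heuristic: read four-digit numbers as years ("nineteen
        # ninety-seven"), everything else as cardinals ("twelve").
        style = "year" if len(token) == 4 else "cardinal"
        return num2words(int(token), to=style)
    return token

for tok in ["Mr", "12", "1997"]:
    print(tok, "->", token_to_words(tok))
```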

  • Application of Pronunciation Rules: After text analysis has been completed, pronunciation rules are applied. Letters cannot be transformed into phonemes on a 1:1 basis, because the correspondence is not always parallel. This step consists of a grapheme-to-phoneme transformation: each word is converted into its phoneme string.

Ex: In certain environments a single letter can correspond to no phoneme at all (for example, the “h” in “caught”), while in others a single letter can correspond to several phonemes (for example, the “x” in “mix”).
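
In practice this transformation is usually driven by a pronunciation lexicon, with letter-to-sound rules as a fallback. The sketch below uses a tiny hand-made dictionary in ARPAbet-style notation; a real system would load a full lexicon such as CMUdict:

```python
# Minimal grapheme-to-phoneme lookup. The two entries below follow
# CMUdict-style ARPAbet notation; a real system would load a full
# lexicon and fall back on letter-to-sound rules for unknown words.
LEXICON = {
    "caught": ["K", "AO1", "T"],       # "gh" maps to no phoneme at all
    "mix":    ["M", "IH1", "K", "S"],  # "x" maps to two phonemes
}

def grapheme_to_phoneme(word: str) -> list[str]:
    phones = LEXICON.get(word.lower())
    if phones is None:
        raise KeyError(f"{word!r} not in lexicon; apply letter-to-sound rules")
    return phones

print(grapheme_to_phoneme("caught"))  # ['K', 'AO1', 'T']
```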

  • Prosody Generation: After the pronunciation has been determined, the prosody is generated. The naturalness of a TTS system depends on prosodic factors such as intonation modeling (phrasing and accentuation), amplitude modeling, and duration modeling (the duration of sounds and of pauses), which together determine the length of syllables and the tempo of the speech.
  • Synthesis: Once text processing has produced the segmental and suprasegmental information, a signal processing algorithm, the synthesizer proper, uses that information to produce speech.
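
These stages are usually hidden behind a single engine call. As a rough end-to-end illustration, the sketch below uses the third-party pyttsx3 package, which wraps the operating system's offline TTS engine; the rate and volume properties shown are part of its documented API, but the values chosen are arbitrary:

```python
# End-to-end TTS via pyttsx3 (pip install pyttsx3), which drives the
# platform's built-in synthesizer. Values chosen here are arbitrary.
import pyttsx3

engine = pyttsx3.init()
engine.setProperty("rate", 150)    # speaking tempo, in words per minute
engine.setProperty("volume", 0.9)  # output amplitude, 0.0 to 1.0
engine.say("Mister Smith paid seventy-one dollars and fifty cents.")
engine.runAndWait()                # block until the utterance finishes
```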

Types of TTS Synthesis

  1. Concatenative Synthesis

Concatenative TTS relies on high-quality audio recordings, which are combined to form the speech. In the first step, voice actors are recorded saying a range of speech units, from whole sentences down to syllables; these recordings are then labeled and segmented by linguistic unit, from phones to phrases and sentences, forming a huge database. During synthesis, the TTS engine searches this database for speech units that match the input text, concatenates them, and produces an audio file. There are three major subtypes of concatenative synthesis, described below.

  • Domain-Specific Synthesis

Domain-specific synthesis concatenates prerecorded words and phrases to create complete utterances. It is used in applications where the variety of texts the system will output is limited to a particular domain, like transit schedule announcements or weather reports. The technology is very simple to implement and has been in commercial use for a long time, in devices like talking clocks and calculators.
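
Because the inventory of utterances is small and fixed, an implementation can be as simple as splicing prerecorded clips end to end. The sketch below outlines a talking-clock flavor of this idea; the .wav file names are placeholders, and the third-party pydub package is assumed for audio concatenation:

```python
# Domain-specific synthesis as clip splicing. File names are placeholders
# for prerecorded words; pydub (pip install pydub) joins the segments.
from pydub import AudioSegment

def say_time(hour: int, minute: int) -> AudioSegment:
    # The whole "domain" is a fixed inventory of recorded words.
    parts = ["it_is.wav", f"{hour}.wav", f"{minute}.wav"]
    clips = [AudioSegment.from_wav(p) for p in parts]
    return sum(clips[1:], clips[0])  # "+" concatenates AudioSegments

say_time(3, 15).export("time_announcement.wav", format="wav")
```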

  • Unit Selection Synthesis

Unit selection synthesis uses large databases of recorded speech. During database creation, each recorded utterance is segmented into some or all of the following: individual phones, diphones, half-phones, syllables, morphemes, words, phrases, and sentences. Typically, the division into segments is done using a specially modified speech recognizer set to a forced alignment mode with some manual correction afterward, using visual representations such as the waveform and spectrogram. An index of the units in the speech database is then created based on the segmentation and acoustic parameters like the fundamental frequency (pitch), duration, position in the syllable, and neighboring phones. At runtime, the desired target utterance is created by determining the best chain of candidate units from the database (unit selection). This process is typically achieved using a specially weighted decision tree.
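
The search itself can be framed as dynamic programming: choose one candidate unit per target position so that the sum of target costs (how well a unit matches the desired specification) and concatenation costs (how smoothly adjacent units join) is minimal. The sketch below follows that classic formulation with toy stand-in cost functions; real systems weight many acoustic features:

```python
# Schematic unit selection via dynamic programming (Viterbi-style).
# target_cost(t, u): mismatch between target spec t and candidate unit u.
# join_cost(p, u):   discontinuity when unit u follows unit p.
def select_units(targets, candidates, target_cost, join_cost):
    # best[i][u] = (cheapest total cost of a chain ending in u, backpointer)
    best = [{u: (target_cost(targets[0], u), None) for u in candidates[0]}]
    for i in range(1, len(targets)):
        layer = {}
        for u in candidates[i]:
            prev_u, prev_cost = min(
                ((p, c[0] + join_cost(p, u)) for p, c in best[i - 1].items()),
                key=lambda pair: pair[1],
            )
            layer[u] = (prev_cost + target_cost(targets[i], u), prev_u)
        best.append(layer)
    # Trace back from the cheapest final unit to recover the chain.
    u = min(best[-1], key=lambda k: best[-1][k][0])
    chain = [u]
    for i in range(len(targets) - 1, 0, -1):
        chain.append(best[i][chain[-1]][1])
    return list(reversed(chain))

# Toy usage: two target positions, two candidate units each.
targets = ["a", "b"]
candidates = [["a1", "a2"], ["b1", "b2"]]
tc = lambda t, u: 0.0 if u.startswith(t) else 1.0
jc = lambda p, u: 0.5 if p[-1] == u[-1] else 1.0
print(select_units(targets, candidates, tc, jc))  # -> ['a1', 'b1']
```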

  • Diphone Synthesis

Diphone synthesis uses a minimal speech database containing all the diphones (sound-to-sound transitions) occurring in a language. The number of diphones depends on the phonotactics of the language: for example, Spanish has about 800 diphones and German about 2,500. In diphone synthesis, only one example of each diphone is contained in the speech database. At runtime, the target prosody of a sentence is superimposed on these minimal units using digital signal processing techniques such as linear predictive coding, PSOLA, or MBROLA.
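
Before any lookup can happen, the phoneme sequence produced by text analysis has to be turned into the diphone sequence the database is indexed by: each unit runs from the middle of one phone to the middle of the next. A minimal decomposition, with "_" as an assumed silence marker, might look like this:

```python
# Phoneme sequence -> diphone sequence: n phones (padded with silence
# "_") yield n + 1 diphones, each spanning a sound-to-sound transition.
def to_diphones(phones: list[str]) -> list[str]:
    padded = ["_"] + phones + ["_"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

print(to_diphones(["h", "e", "l", "ou"]))
# -> ['_-h', 'h-e', 'e-l', 'l-ou', 'ou-_']
```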

  2. Formant Synthesis

The formant synthesis technique is a rule-based TTS technique. It produces speech segments by generating artificial signals based on a set of specified rules that mimic the formant structure and other spectral properties of natural speech. The synthesized speech is produced using additive synthesis and an acoustic model. The acoustic model uses parameters such as voicing, fundamental frequency, and noise levels, varied over time. Formant-based systems can control all aspects of the output speech, producing a wide variety of emotions and tones of voice with the help of prosodic and intonation modeling techniques.
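
A bare-bones version of this idea is a source-filter sketch: an impulse train at the fundamental frequency is passed through second-order resonators placed at the formant frequencies. The formant values below roughly approximate the vowel /a/ and are assumptions for illustration; real rule-based synthesizers (Klatt-style, for example) drive dozens of time-varying parameters:

```python
# Toy formant synthesis: a glottal impulse train filtered by a cascade
# of digital resonators, one per formant. Requires numpy and scipy.
import numpy as np
from scipy.signal import lfilter

fs, f0, dur = 16000, 120, 0.5       # sample rate, pitch (Hz), seconds

source = np.zeros(int(fs * dur))    # voicing source: impulse train at f0
source[:: fs // f0] = 1.0

signal = source
for freq, bw in [(800, 80), (1200, 90), (2500, 120)]:  # /a/-like formants
    r = np.exp(-np.pi * bw / fs)     # pole radius from formant bandwidth
    theta = 2 * np.pi * freq / fs    # pole angle from center frequency
    # Resonator: H(z) = 1 / (1 - 2 r cos(theta) z^-1 + r^2 z^-2)
    signal = lfilter([1.0], [1.0, -2 * r * np.cos(theta), r * r], signal)

signal /= np.max(np.abs(signal))    # normalize before playback or saving
```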

  3. Parametric TTS

To address the limitations of concatenative TTS, a more statistical method was developed. The idea behind it is that if we can approximate the parameters that make up speech, we can train a model to generate all kinds of speech. The parametric method combines parameters such as fundamental frequency and magnitude spectrum and processes them to generate speech. In the first step, the text is processed to extract linguistic features, such as phonemes or durations. The second step extracts vocoder features, such as cepstra, spectrogram, and fundamental frequency, that represent inherent characteristics of human speech and are used in audio processing. These features are hand-engineered and, along with the linguistic features, are fed into a mathematical model called a vocoder. While generating a waveform, the vocoder transforms the features and estimates parameters of speech such as phase, speech rate, and intonation. The technique typically uses hidden semi-Markov models: transitions between states are still Markov, but the duration within each state is modeled explicitly and is not.
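
As an illustration of the feature-extraction step, the sketch below pulls two common vocoder-style features from a recording using the third-party librosa package: mel-frequency cepstral coefficients for the spectral envelope and a pYIN fundamental-frequency track. The file name is a placeholder, and the settings are one common choice rather than a canonical recipe:

```python
# Extracting vocoder-style features with librosa (pip install librosa).
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)  # placeholder file name

# Spectral-envelope summary: 13 mel-frequency cepstral coefficients.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Fundamental-frequency track via pYIN (NaN in unvoiced frames).
f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=65, fmax=400, sr=sr)

print(mfcc.shape, f0.shape)  # (13, n_frames) (n_frames,)
```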

  4. Hybrid (Deep Learning) Approaches

The DNN (Deep Neural Network) based approach is another variation of statistical synthesis, developed to overcome the inefficiency of the decision trees that HMM-based systems use to model complex context dependencies. A step forward, and an eventual breakthrough, was letting machines design features without human intervention: features designed by humans are based on our understanding of speech, which is not necessarily correct. In DNN techniques, the relationship between input texts and their acoustic realizations is modeled by a DNN. The acoustic features are generated using maximum likelihood parameter generation with trajectory smoothing. Features obtained with the help of deep learning are not human-readable, but they are computer-readable, and they represent the data required by the model.
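
In the spirit of that description, the sketch below defines a small feed-forward network that maps per-frame linguistic features to acoustic features, using PyTorch. The layer sizes and the random "data" are made up for illustration; a real system trains on aligned corpora and feeds the predicted features to a vocoder:

```python
# Schematic DNN acoustic model: linguistic features in, acoustic features
# out, trained with mean-squared error. Dimensions here are arbitrary.
import torch
import torch.nn as nn

LINGUISTIC_DIM, ACOUSTIC_DIM = 300, 40   # assumed feature sizes

model = nn.Sequential(
    nn.Linear(LINGUISTIC_DIM, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, ACOUSTIC_DIM),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# One dummy training step on a random batch of 64 "frames".
x = torch.randn(64, LINGUISTIC_DIM)   # per-frame linguistic features
y = torch.randn(64, ACOUSTIC_DIM)     # per-frame acoustic targets
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
```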

Conclusion

Text-to-speech synthesis is a rapidly evolving area of computer technology that is becoming increasingly significant in how people interact with computers. This article has outlined the many activities and processes involved in text-to-speech synthesis. The system described here communicates with an American English-specific text-to-speech engine. In the future, we intend to work on developing engines for some localized languages to make text-to-speech technology accessible to a wider audience. Implementing text-to-speech on other platforms, such as phone systems, ATMs, and video games, is another area where more development is needed.

