How Text-to-Speech Works

Feb 24, 2009


You have probably used text-to-speech software, and may even be using it right now to read this article (if you have a good one!). But have you ever wondered how it works? You might guess it is simple: there must be a set of files containing already-recorded words, each file with a certain number, and when a particular piece of text is to be spoken, some logic runs, traces out the right file (perhaps a wav file), and plays it, and that is how text-to-speech works.

Or maybe you have never thought about it at all. For those developers who want to know how it happens, this article gives a technical overview of text-to-speech so you can understand how it works and better understand some of the capabilities and limitations of the technology.

Text-to-speech fundamentally functions as a pipeline that converts text into PCM digital audio. The elements of the pipeline are:

  • Text normalization
  • Homograph disambiguation
  • Word Pronunciation
  • Prosody
  • Concatenate wave segments

I’ll cover each of these steps individually.

Text Normalization

Text normalization is the part of a text-to-speech program that converts any input text into a series of spoken words. At a basic level, text normalization converts a string like "My name is Ritesh" to the series of words "My", "name", "is", "Ritesh", along with a marker indicating that a period occurred. However, it gets more complicated with strings like "John rode home at 23.5 mph", where "23.5 mph" must be converted to "twenty three point five miles per hour". Here’s how text normalization works:

First, text normalization isolates the words in the text. For the most part this is as simple as looking for sequences of alphabetic characters, allowing for an occasional apostrophe or hyphen.
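As a rough sketch of this word isolation step, here is a regular-expression tokenizer; the pattern and function name are my own illustration, not taken from any particular engine:

```python
import re

def isolate_words(text):
    """Pull out word tokens: runs of letters, optionally joined by
    internal apostrophes or hyphens (e.g. "don't", "text-to-speech")."""
    return re.findall(r"[A-Za-z]+(?:['\-][A-Za-z]+)*", text)
```

Here `isolate_words("My name is Ritesh.")` returns `["My", "name", "is", "Ritesh"]`; punctuation and numbers are left for the later normalization steps to handle.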

Text normalization then searches for numbers, times, dates, and other symbolic representations. These are analyzed and converted to words. (Example: "$54.32" is converted to "fifty four dollars and thirty two cents.") Someone needs to code up the rules for the conversion of these symbols into words, since they differ depending upon the language and context.
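To make the currency rule concrete, here is a deliberately small sketch that only handles dollar amounts under one hundred; the helper names are invented for illustration:

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def number_to_words(n):
    """Spell out an integer below 100 (enough for this sketch)."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])

def expand_currency(token):
    """Expand a token like '$54.32' into its spoken words."""
    dollars, _, cents = token.lstrip("$").partition(".")
    words = number_to_words(int(dollars)) + " dollars"
    if cents:
        words += " and " + number_to_words(int(cents)) + " cents"
    return words
```

A real normalizer needs many more rules (thousands separators, larger numbers, other currencies), each coded per language as the text says.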

Next, abbreviations are converted, such as "in." for "inches", and "St." for "street" or "saint". The normalizer will use a database of abbreviations and what they are expanded to. Some of the expansions depend upon the context of surrounding words, like "St. John" and "John St."
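The "St." case can be sketched with a simple context rule; the rule below (a capitalized word after the abbreviation means "saint") is a deliberate oversimplification of what a real normalizer does:

```python
# Tiny illustrative abbreviation database (not a real engine's table).
ABBREVIATIONS = {"in.": "inches", "mph": "miles per hour"}

def expand_st(tokens, i):
    """Disambiguate 'St.' from its neighbours: 'St. John' reads as
    'saint', while 'John St.' reads as 'street'."""
    nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
    if nxt[:1].isupper():          # followed by a proper name
        return "saint"
    return "street"

def expand_abbreviations(tokens):
    out = []
    for i, tok in enumerate(tokens):
        if tok == "St.":
            out.append(expand_st(tokens, i))
        else:
            out.append(ABBREVIATIONS.get(tok, tok))
    return out
```

So `["St.", "John"]` expands to `["saint", "John"]` and `["John", "St."]` to `["John", "street"]`.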

The text normalizer might perform other text transformations, such as internet addresses. "www.microsoft.com" is usually spoken as "w w w dot Microsoft dot com".

Whatever remains is punctuation. The normalizer will have rules dictating if the punctuation causes a word to be spoken or if it is silent. (Example: Periods at the end of sentences are not spoken, but a period in an Internet address is spoken as "dot.")

The rules will vary in complexity depending upon the engine.

Homograph Disambiguation

By now you may be thinking that all of the main work has already happened, but there are other things that must also be taken care of, or your text-to-speech won’t work well. This stage mainly deals with the pronunciation of words.

Actually it’s not a stage by itself, but is combined into the text normalization or pronunciation components. I’ve separated homograph disambiguation out since it doesn’t fit cleanly into either.

In English and many other languages, there are hundreds of words that have the same text, but different pronunciations. A common example in English is "read," which can be pronounced "reed" or "red" depending upon its meaning. A "homograph" is a word with the same text as another word, but with a different pronunciation. The concept extends beyond just words, and into abbreviations and numbers. "Ft." has different pronunciations in "Ft. Wayne" and "100 ft.". Likewise, the digits "1997" might be spoken as "nineteen ninety seven" if the author is talking about the year, or "one thousand nine hundred and ninety seven" if the author is talking about the number of people at a concert.

So how could a computer know when to pronounce "read" as "reed" and when as "red"? One way is to judge what the text is actually talking about and decide which meaning is most appropriate given the context. Once the right meaning is known, it’s usually easy to guess the right pronunciation.

Text-to-speech engines figure out the meaning of the text, and more specifically of the sentence, by parsing it and figuring out the part of speech of each word (see how complicated it is; do you remember how many parts of speech there are?). This is done by guessing the part of speech based on word endings, or by looking the word up in a lexicon. Sometimes a part of speech will remain ambiguous until more context is known, as with "read." Disambiguation of the part of speech may also require hand-written rules.
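Here is a toy version of this disambiguation for "read", using only the preceding word as a crude part-of-speech cue; the table, tags, and phoneme strings are invented for illustration:

```python
# Hypothetical homograph table: pronunciation depends on part of speech.
HOMOGRAPHS = {
    "read": {"VERB_PRESENT": "r iy d",   # sounds like "reed"
             "VERB_PAST":    "r eh d"},  # sounds like "red"
}

def guess_tense(prev_word):
    """Crude part-of-speech cue: 'have'/'has'/'had' before the verb
    signal the past participle; otherwise assume the present form."""
    if prev_word in ("have", "has", "had"):
        return "VERB_PAST"
    return "VERB_PRESENT"

def pronounce_homograph(word, prev_word):
    forms = HOMOGRAPHS.get(word)
    if forms is None:
        return None                      # not a homograph we know
    return forms[guess_tense(prev_word)]
```

So "I will read" yields "r iy d" while "I have read" yields "r eh d". A real engine parses the whole sentence rather than peeking at one neighbouring word.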

Once the homographs have been disambiguated, the words are sent to the next stage to be pronounced.

Word Pronunciation

The pronunciation module accepts the text, and outputs a sequence of phonemes, just like you see in a dictionary.

To get the pronunciation of a word, the text-to-speech engine first looks the word up in its own pronunciation lexicon. If the word is not in the lexicon, the engine falls back to "letter-to-sound" rules.
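This lookup-with-fallback fits in a few lines. The lexicon below holds just the "hello" example from later in this article, and the fallback is a naive one-letter-one-sound stand-in for the real letter-to-sound rules:

```python
LEXICON = {"hello": "h eh l oe"}   # hand-entered pronunciations

def letter_to_sound(word):
    """Placeholder fallback: naively treat each letter as a phoneme.
    The real rules are far smarter, as described below."""
    return " ".join(word)

def pronounce(word):
    """Prefer the lexicon; fall back to letter-to-sound rules."""
    return LEXICON.get(word, letter_to_sound(word))
```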

So what are letter-to-sound rules? They guess the pronunciation of a word from its text, and are kind of the inverse of the spelling rules you were taught in school. There are a number of techniques for guessing the pronunciation, but the algorithm described here is one of the more easily implemented ones.

The letter-to-sound rules are "trained" on a lexicon of hand-entered pronunciations. The lexicon stores each word and its pronunciation, such as:

hello h eh l oe

An algorithm is used to segment the word and figure out which letter "produces" which sound. You can clearly see that "h" in "hello" produces the "h" phoneme, the "e" produces the "eh" phoneme, the first "l" produces the "l" phoneme, the second "l" nothing, and "o" produces the "oe" phoneme. Of course, in other words the individual letters produce different phonemes. The "e" in "he" will produce the "ee" phoneme.

Once the words are segmented by phoneme, another algorithm determines which letter or sequence of letters is likely to produce which phonemes. The first pass figures out the most likely phoneme generated by each letter. "H" almost always generates the "h" sound, while "o" almost always generates the "ow" sound. A secondary list is generated, showing exceptions to the previous rule given the context of the surrounding letters. Hence, an exception rule might specify that an "o" occurring at the end of the word and preceded by an "l" produces an "oe" sound. The list of exceptions can be extended to include even more surrounding characters.

When the letter-to-sound rules are asked to produce the pronunciation of a word, they do the inverse of the training model. To pronounce "hello", the letter-to-sound rules first try to figure out the sound of the "h". They look through the exception table for an "h" beginning the word and followed by "e"; since they can’t find one, they use the default sound for "h", which is "h". Next, they look in the exceptions for how an "e" surrounded by "h" and "l" is pronounced, finding "eh". The rest of the characters are handled in the same way.
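A minimal sketch of this lookup uses a default table plus an exception table keyed on the surrounding letters; the tables below only cover the "hello" example and are not a real rule set:

```python
# Default phoneme for each letter.
DEFAULTS = {"h": "h", "e": "e", "l": "l", "o": "ow"}

# Exceptions keyed on (previous letter, letter, next letter);
# "" marks a word boundary, None marks a silent letter.
EXCEPTIONS = {
    ("h", "e", "l"): "eh",   # the 'e' in "hello"
    ("l", "l", "o"): None,   # the second 'l' is silent
    ("l", "o", ""):  "oe",   # 'o' after 'l' at the end of the word
}

def apply_rules(word):
    phonemes = []
    for i, letter in enumerate(word):
        prev = word[i - 1] if i > 0 else ""
        nxt = word[i + 1] if i + 1 < len(word) else ""
        phoneme = EXCEPTIONS.get((prev, letter, nxt), DEFAULTS.get(letter))
        if phoneme is not None:          # skip silent letters
            phonemes.append(phoneme)
    return phonemes
```

Running `apply_rules("hello")` reproduces the lexicon entry: `["h", "eh", "l", "oe"]`.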

This technique can pronounce any word, even one that wasn’t in the training set, and makes a very reasonable guess at the pronunciation, sometimes better than humans do. It doesn’t work too well for names, because most names are not of English origin and use different pronunciation rules. (Example: "Mejia" is pronounced as "meh-jee-uh" by anyone who doesn’t know it is Spanish.) Some letter-to-sound rules first guess what language the word came from, and then use a different set of rules to pronounce each language.

Word pronunciation is further complicated by people’s laziness. People will change the pronunciation of a word based upon what words precede or follow it, just to make the word easier to speak. An obvious example is the way "the" can be pronounced as "thee" or "thuh". Other effects include the dropping or changing of phonemes. A commonly used phrase such as "What you doing?" sounds like "Wacha doin?"

Once the pronunciations have been generated, these are passed onto the prosody stage.


Prosody

Now you might think it’s all over, but as they say, "the story isn’t finished yet, my friend." Being able to pronounce the words is not enough; you also need to speak them with a tone, at a speed, and with the other qualities involved in natural speech. This section deals with all of these factors.

Prosody is the pitch, speed, and volume that syllables, words, phrases, and sentences are spoken with. Without prosody text-to-speech sounds very robotic, and with bad prosody text-to-speech sounds like it’s drunk.

The technique that engines use to synthesize prosody varies, but there are some general techniques.

First, the engine identifies the beginning and ending of sentences. In English, the pitch will tend to fall near the end of a statement, and rise for a question. Likewise, volume and speaking speed ramp up when the text-to-speech first starts talking, and fall off on the last word when it stops. Pauses are placed between sentences.

Engines also identify phrase boundaries, such as noun phrases and verb phrases. These will have similar characteristics to sentences, but will be less pronounced. The engine can determine the phrase boundaries by using the part-of-speech information generated during the homograph disambiguation. Pauses are placed between phrases or where commas occur.

Algorithms then try to determine which words in the sentence are important to the meaning, and these are emphasized. Emphasized words are louder, longer, and will have more pitch variation. Words that are unimportant, such as those used to make the sentence grammatically correct, are de-emphasized. In a sentence such as "John and Bill walked to the store", the emphasis pattern might be "JOHN and BILL walked to the STORE." The more the text-to-speech engine "understands" what’s being spoken, the better its emphasis will be.
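One way to sketch this emphasis step, assuming the part-of-speech tags are already available from the earlier parsing stage (the tiny tag table here is invented for illustration):

```python
# Hypothetical part-of-speech tags from the earlier parsing stage.
POS = {"John": "NOUN", "and": "CONJ", "Bill": "NOUN",
       "walked": "VERB", "to": "PREP", "the": "DET", "store": "NOUN"}

def emphasize(words):
    """Mark the meaning-carrying words (here, the nouns) in upper case;
    a real engine would instead raise their pitch, length, and volume."""
    return [w.upper() if POS.get(w) == "NOUN" else w for w in words]
```

This reproduces the article's pattern: "John and Bill walked to the store" becomes "JOHN and BILL walked to the STORE".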

Next, the prosody within a word is determined. Usually the pitch and volume rise on stressed syllables.

All of the pitch, timing, and volume information from the sentence level, phrase level, and word level are combined together to produce the final output. The output from the prosody module is just a list of phonemes with the pitch, duration, and volume for each phoneme.

Play Audio

The speech synthesis is almost done by this point. All the text-to-speech engine has to do is convert the list of phonemes and their duration, pitch, and volume, into digital audio.

Methods for generating the digital audio will vary, but many text-to-speech engines generate the audio by concatenating short recordings of phonemes. The recordings come from a real person. In a simplistic form, the engine receives the phoneme to speak, loads the digital audio from a database, does some pitch, time, and volume changes, and sends it out to the sound card.

It isn’t quite that simple for a number of reasons.

Most noticeable is that one recording of a phoneme won’t have the same volume, pitch, and sound quality at its end as the next phoneme’s recording has at its beginning. This causes a noticeable glitch in the audio. An engine can reduce the glitch by blending the edges of the two segments together so that at their intersection they both have the same pitch and volume. Blending the sound quality, which is determined by the harmonics generated by the voice, is more difficult, and can be solved by the next step.
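The blending can be sketched as a linear crossfade over the overlapping samples of the two segments; real engines use more sophisticated pitch-synchronous techniques, so treat this as a toy illustration:

```python
def crossfade(a, b, overlap):
    """Blend the tail of segment `a` into the head of segment `b`
    over `overlap` samples, so the join in volume is smooth."""
    head, tail = a[:-overlap], a[-overlap:]
    mixed = []
    for i in range(overlap):
        w = (i + 1) / overlap              # ramp toward segment b
        mixed.append((1 - w) * tail[i] + w * b[i])
    return head + mixed + b[overlap:]
```

For example, crossfading four samples of 1.0 into four samples of 0.0 with a two-sample overlap yields `[1.0, 1.0, 0.5, 0.0, 0.0, 0.0]`: the join ramps down instead of jumping.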

The sound that a person makes when he/she speaks a phoneme, changes depending upon the surrounding phonemes. If you record "cat" in sound recorder, and then reverse it, the reversed audio doesn’t sound like "tak", which has the reversed phonemes of cat. Rather than using one recording per phoneme (about 50), the text-to-speech engine maintains thousands of recordings (usually 1000-5000). Ideally it would have all possible phoneme context combinations recorded, 50 * 50 * 50 = 125,000, but this would be too many. Since many of these combinations sound similar, one recording is used to represent the phoneme within several different contexts.
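This context-dependent selection can be sketched as a lookup keyed on (previous phoneme, phoneme, next phoneme), falling back to a context-free recording when no close match exists; the database and file names below are invented:

```python
# Hypothetical unit database: a few recordings stand in for many
# similar contexts, as the article describes.
UNITS = {
    ("k", "ae", "t"): "ae_between_stops.wav",
    (None, "ae", None): "ae_default.wav",   # context-free fallback
}

def select_unit(prev, phoneme, nxt):
    """Prefer a recording that matches the full phoneme context,
    otherwise fall back to the phoneme's default recording."""
    return UNITS.get((prev, phoneme, nxt),
                     UNITS.get((None, phoneme, None)))
```

So the "ae" in "cat" gets the in-context recording, while an "ae" in an unseen context falls back to the default one.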

Even a database of 1000 phoneme recordings is too large, so the digital audio is compressed into a much smaller size, usually between 8:1 and 32:1 compression. The more compressed the digital audio, the more muted the voice sounds.

Once the digital audio segments have been concatenated they’re sent off to the sound card, making the computer talk.

Generating a Voice

You might be wondering, "How do you get thousands of recordings of phonemes?"

The first step is to select a voice talent. The voice talent then spends several hours in a recording studio reading a wide variety of text. The text is designed so that as many phoneme sequence combinations as possible are recorded. You at least want them to read enough text that there are several occurrences of each of the 1000 to 5000 recording slots.

After the recording session is finished, the recordings are sent through a speech recognizer, which determines where the phonemes begin and end. Since the tools also know the surrounding phonemes, it’s easy to pull the right recordings out of the speech. The only trick is figuring out which recording sounds best. Usually an algorithm makes a guess, but someone must listen to the phoneme recordings just to make sure they’re good.

The selected phoneme recordings are compressed and stored away in the database. The result is a new voice.

Wrapping up

This was a high level overview of how text-to-speech works. Most text-to-speech engines work in a similar manner, although not all of them work this way. The overview doesn’t give you enough detail to write your own text-to-speech engine, but now you know the basics. If you want more detail you should purchase one of the numerous technical books on text-to-speech.

NOTE: The whole article has been somewhat modified to make it more interesting and easier to read.
The article is taken from Microsoft’s text to speech SDK.
Partially compiled & edited by – Ritesh Kawadkar

