There is a new voice synth chip on the market SpeakJet

ADPCM for Highly Intelligible Speech Synthesis

Copyright © 1983 Steven A. Ciarcia. All rights reserved. June 1983 © BYTE Publications Inc.

posted 1 February 2002

Steve Ciarcia
POB 582
Glastonbury, CT 06033
Special thanks to Bill Curlew for his software expertise.

Some new integrated circuits from Oki Semiconductor compress digitized speech data efficiently.

    Use ADPCM for Highly Intelligible Speech Synthesis
  • During the past few years I have presented four different computer speech-synthesizer projects (see references). With each article I have tried to present the latest technology and describe successively more cost-effective synthesis methods. This month I'd like to describe a new variation on digitized speech that uses adaptive differential pulse-code modulation (ADPCM).

    What Is Digitized Speech?
  • Computers communicate in a digital language, but the language of humans is analog. If computers are to speak as we do, this obvious barrier must be overcome. Fortunately for us, a number of techniques have been devised to allow a computer to synthesize a human voice, some of them quite effective.
    Some synthesized voices employ electronic circuitry to simulate the throat and vocal tract, but the purest form of machine-generated speech is simply a digital recording of an actual human voice, using digital circuitry to mimic the action of a tape recorder. For example, in most parts of the United States you can dial a telephone number and hear a recorded voice saying something like, "The number you have reached has been changed. The new number is 924-9281." The voice is distinctly human in quality, highly intelligible, and machine-generated: an excellent example of digitized speech. Although it uses a lot of memory, digitized speech is the most intelligible machine-generated speech currently possible.
    The basic concepts of producing stored digital speech are fairly simple. The process begins with data acquisition. A voice waveform can be treated like any other fluctuating voltage input; the computer can record the waveform by periodically taking a sample of the signal's voltage through an analog-to-digital (A/D) converter and storing it as a binary value. (The number of samples needed per second depends upon the frequency of the input signal.) Once the samples have been stored, the computer can recreate the original waveform by sequentially sending the stored values to a digital-to-analog (D/A) converter at the same rate as the original sampling.

    Pulse-Code Modulation
  • A common method of representing continuous analog values in digital form is pulse-code modulation, or PCM. In PCM, distinct binary representations (pulse codes) are chosen for a finite number of points along the continuum of possible states. Whenever the value is being measured and it falls between two encoded points, the code for the closer point is used. (This process is called quantization: the dividing of the range of values of a wave into subranges, each of which is represented by an assigned value.) A series of these pulse codes can be transmitted in a pulse train, resulting in a pulse-code modulated signal.
    Because the samples of digitized speech referred to above are stored in the form of digital pulses, the stored speech waveform can be thought of as an example of pulse-code modulation. Figure 1 shows a block diagram of a speech synthesizer that reproduces speech stored in pulse-code-modulated form.
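    The sampling and quantization steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the -1.0 to 1.0 input range, and the 440-Hz test tone are hypothetical choices, not part of the original project.

```python
import math

def pcm_encode(signal, sample_rate_hz, duration_s, bits):
    """Sample signal(t) (assumed range -1.0..1.0) and quantize to unsigned codes."""
    levels = 2 ** bits
    count = round(sample_rate_hz * duration_s)
    codes = []
    for n in range(count):
        t = n / sample_rate_hz                      # time of the nth sample
        v = max(-1.0, min(1.0, signal(t)))          # keep input inside the range
        # Map -1..1 onto 0..levels-1 and pick the code for the closer point.
        codes.append(round((v + 1.0) / 2.0 * (levels - 1)))
    return codes

def pcm_decode(codes, bits):
    """Turn the stored codes back into amplitudes, as a D/A converter would."""
    levels = 2 ** bits
    return [c / (levels - 1) * 2.0 - 1.0 for c in codes]

def tone(t):
    """Hypothetical test input: a 440-Hz sine wave."""
    return math.sin(2 * math.pi * 440 * t)

codes = pcm_encode(tone, 8000, 0.01, 8)   # 10 ms at 8 kHz, 8-bit codes
restored = pcm_decode(codes, 8)
```

    Playing `restored` back at the same 8-kHz rate would recreate the tone to within half of one quantization step.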


Figure 1: Functional block diagram of a digitized speech-reproduction system that employs pulse-code modulation.

    Sampling Rates and Other Messy Stuff
  • The sampling rate you use in recording any signal must be chosen with awareness of a theoretical limit called the Nyquist interval. At the very minimum, the sampling rate must be at least twice the highest frequency found in the input signal. With an input bandwidth of 2 kHz (kilohertz), adequate for intelligible speech, the sampling frequency would have to be at least 4 kHz.
    This rule holds strictly true only when an ideal lowpass filter is used on the output of the D/A converter.
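    As a quick check of the minimum-rate rule, here is a small Python sketch. The function name and the `factor` parameter are hypothetical; a factor of 2 is the theoretical minimum stated above, and a larger factor simply models a less-than-ideal output filter.

```python
def min_sample_rate_hz(bandwidth_hz, factor=2):
    """Lowest usable sampling rate for a given input bandwidth.

    factor=2 is the theoretical limit; it assumes an ideal low-pass
    filter on the D/A output.
    """
    if factor < 2:
        raise ValueError("sampling below twice the bandwidth loses information")
    return factor * bandwidth_hz

speech_rate = min_sample_rate_hz(2000)   # 2-kHz speech bandwidth
```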

  • In real equipment, sampling rates of 3 or 4 times the input bandwidth are sometimes necessary. So for speech reproduction, a sampling rate around 6 or 8 kHz is good. (Optical digitized-music recordings, which are just now coming to market, use 16-bit A/D conversion at a 50-kHz sample rate to achieve high fidelity. The resulting data rate is 800,000 bps, or bits per second.)
    Other technical limitations crop up. Once you have determined the sampling rate, you must consider the resolution of the analog-to-digital converter. A/D converters operate in discrete steps (quanta) rather than continuous levels, as shown in figure 2.

Fig. 2

Waveform sampling by pulse-code modulation (PCM). The interval between samples is T0;
the sampling frequency is the reciprocal of the interval. Each sample of PCM data consists of N bits;
the leftmost is the most significant bit and the rightmost is the least significant bit.

  • If a 4-bit A/D converter is used, then only 16 values are available to define the signal. Any reading could potentially be in error by about 1/16 of the full range, or about 6 percent. A 12-bit converter, which has 4096 potential levels, would have a possible quantization error of only 0.02 percent.
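    The arithmetic behind these figures is easy to reproduce. The sketch below follows the convention used in the text, treating one quantization step as a fraction of the full input range; the function names are hypothetical.

```python
def quantization_levels(bits):
    """Number of discrete values an N-bit converter can report."""
    return 2 ** bits

def worst_case_error_percent(bits):
    """One quantization step expressed as a percentage of full range."""
    return 100.0 / quantization_levels(bits)

error_4bit = worst_case_error_percent(4)     # about 6 percent
error_12bit = worst_case_error_percent(12)   # about 0.02 percent
```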

    Achieving Fidelity
  • In dealing with analog voice signals, we must accurately reproduce the input signal for it to be understood. The ear is sensitive, and too coarse a reproduction will sound unnatural or even unintelligible. A direct relationship exists between the PCM data rate and reproduced speech quality. Let's consider a case in which we have an 8-kHz sampling rate. If we use 12-bit A/D conversion, then the data rate (in bits per second) is found using the following equation:

  • bit rate = sample rate X conversion resolution
             = 8000 Hz X 12 bits
             = 96,000 bits/second

  • Using standard PCM on a voice signal with a 4-kHz bandwidth would require a 96,000-bps data rate. The average personal computer could store only about 8 seconds of speech in its 64K-byte memory. The data rate can be reduced somewhat by using an 8-bit A/D converter rather than a 12-bit unit. The raw data rate now becomes 8000 X 8 or 64,000 bps. (This reduces the signal-to-noise ratio from 66 to 42 dB (decibels), but the sound quality is more than adequate for experimentation. For commercial applications, however, I recommend a 12-bit converter.)
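    The two data rates, and the size of the signal-to-noise penalty, can be checked with a short Python sketch. The function names are hypothetical, and the noise figure uses the standard rule of thumb that each bit of resolution is worth about 6 dB, which is consistent with the 66-to-42-dB drop quoted above.

```python
import math

def pcm_bit_rate_bps(sample_rate_hz, bits_per_sample):
    """Raw PCM data rate: one full-resolution code per sample."""
    return sample_rate_hz * bits_per_sample

def snr_change_db(bits_before, bits_after):
    """Approximate change in signal-to-noise ratio when resolution changes."""
    return 20 * math.log10(2 ** (bits_after - bits_before))

rate_12bit = pcm_bit_rate_bps(8000, 12)   # 96,000 bps
rate_8bit = pcm_bit_rate_bps(8000, 8)     # 64,000 bps
penalty = snr_change_db(12, 8)            # roughly -24 dB
```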

    Delta Modulation
  • The pulse-code modulation we have been examining uses no data compression. In playback, the data bits representing the absolute values of each successive signal sample are sent to a full-resolution D/A converter and reproduced at the same rate at which they were recorded: 96,000 bps in, 96,000 bps out. The circuit can operate with no assumptions made about the signal it is to process.
    On the other hand, voice waveforms contain much redundant data. Long periods of silence are interspersed with sounds that vary in pitch slowly. If you take some time to analyze the A/D samples, you will notice that the changes are, for the most part, gradual and that the variations in the signal between adjacent samples are a limited portion of the full dynamic range.
    One method of reducing the data rate used in PCM voice reproduction is called delta modulation. This process assumes that the input signal's waveform has a fairly uniform and predictable slope (rate of rising and falling). Rather than storing an 8- or 12-bit quantity for each sample, a delta modulator stores only a single bit. When the computer samples the input signal from the A/D converter, it compares the current reading to the preceding sample. If the amplitude of the new sample is greater, then the computer stores a bit value of 1. Conversely, if the new sample is less, then a 0 will be stored. Figure 3 shows how this works.


Waveform sampling by delta modulation. Each sample of the source waveform is tested to see
if its amplitude is higher or lower (within the resolution of a fixed quantization value Δr, delta-r)
than that of the previous sample. If the amplitude is higher, the single-bit delta-modulated
encoding value is set to 1; if lower, the encoding value is set to 0.

Fig. 3b

Two potential problems occurring in delta modulation. When the source waveform changes too rapidly,
the fixed quantization value may be too small to express the full change in the input; this slope overload
causes a compliance error. Or when there is little change in the input waveform (at the extreme, a DC
signal), vertical deflection in the quantization value results in granular noise in the output.

  • Reproduction of the waveform is accomplished by sending the stored bits in sequence to the output, where their values are integrated.
    But, like other techniques, delta modulation has limitations, one of them the familiar sampling-rate restriction. Because only a single bit changes between samples, the rate at which samples are taken must be sufficiently fast that no significant information is lost from the input signal. Furthermore, if the slope of the input waveform varies a lot, the reproduced waveform may be audibly distorted. So using delta modulation may not reduce the data rate much, although there are many different variant schemes, and it's difficult to predict which is optimal in a given situation.
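    A minimal Python sketch of a delta modulator makes both problems easy to see. It is an illustration, not the circuit of any particular chip: the function names and the fixed `step` value are hypothetical, and the encoder compares each sample against its own running estimate (an assumption, so that the decoder, which integrates the same bits, stays in step with it).

```python
def delta_encode(samples, step, start=0.0):
    """Store one bit per sample: 1 if the input is above the estimate, else 0."""
    bits = []
    estimate = start
    for s in samples:
        bit = 1 if s > estimate else 0
        bits.append(bit)
        estimate += step if bit else -step   # track what the decoder will rebuild
    return bits

def delta_decode(bits, step, start=0.0):
    """Integrate the stored bits to rebuild the waveform."""
    out = []
    estimate = start
    for bit in bits:
        estimate += step if bit else -step
        out.append(estimate)
    return out

# Granular noise: a constant (DC) input still makes the output wobble.
dc_bits = delta_encode([0.0] * 6, step=0.1)

# Slope overload: a sudden jump to 1.0 can only be followed 0.1 at a time.
jump_bits = delta_encode([1.0] * 5, step=0.1)
jump_out = delta_decode(jump_bits, step=0.1)
```

    With a DC input the bits simply alternate, and after the jump the rebuilt waveform is still well short of 1.0 five samples later.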

Competing Digitizing Methods

  • The most effective application of delta modulation that I have observed is the technique developed by Dr. Forrest Mozer at the University of California and implemented in the National Semiconductor Digitalker voice-synthesis chip set (see reference 1). However, while the Digitalker's process is definitely a variant of delta modulation, the data-compression and zero-phase-encoding algorithms that produce the stored bit patterns take hours of processing per word; it's very difficult for you to program your own custom vocabulary.

    Differential PCM

  • We can actually reduce the amount of data stored for reproduction of speech by using a concept related to delta modulation. When the speech waveform is being sampled, a value is stored for each sample that represents the amplitude difference between it and the previous sample. This scheme, called differential pulse-code modulation, or DPCM, allows more than a single bit of difference between stored samples, accommodating more variation in the input waveform before severe distortion sets in. The DPCM value can be expressed as a fraction of the allowed input range or the absolute difference between samples (see figure 4).


Differential pulse-code modulation (DPCM) is an attempt to reduce the amount of data stored or
transmitted, as compared with regular PCM. For each sample, the difference between the previous
PCM code and the current code is expressed in terms of a fixed quantization value Δr (delta-r),
which must be chosen with attention to the characteristics of the source waveform.
If too large or small a quantization value is used, compliance errors occur.

  • DPCM exhibits some of the same limitations as simple delta modulation but to a lesser degree. Only when the difference between samples is greater than the maximum DPCM encoding value will distortion (called a compliance error) occur. Then the only solution is to reduce the input bandwidth or raise the sampling frequency.
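    The following Python sketch shows DPCM with a fixed quantization value and illustrates a compliance error. It is a simplified model under stated assumptions: the function names, the signed 4-bit code range, and the sample values are hypothetical, and the encoder differences each sample against its own reconstruction so that encoder and decoder agree.

```python
def dpcm_encode(samples, step, bits=4, start=0):
    """Encode each sample as a signed multiple of a fixed step."""
    max_code = 2 ** (bits - 1) - 1    # +7 for 4-bit codes
    min_code = -(2 ** (bits - 1))     # -8 for 4-bit codes
    codes = []
    estimate = start
    for s in samples:
        code = round((s - estimate) / step)
        code = max(min_code, min(max_code, code))   # clamp: source of compliance error
        codes.append(code)
        estimate += code * step
    return codes

def dpcm_decode(codes, step, start=0):
    """Rebuild the waveform by accumulating the stored differences."""
    out = []
    estimate = start
    for code in codes:
        estimate += code * step
        out.append(estimate)
    return out

gentle = dpcm_decode(dpcm_encode([0, 10, 25, 30, 28, 20], step=5), step=5)
abrupt = dpcm_decode(dpcm_encode([0, 100], step=5), step=5)
```

    The gentle waveform is followed to within one step, while the abrupt jump of 100 exceeds the largest encodable difference (7 X 5 = 35) and is clipped.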

ADPCM is a specialized form of PCM that offers significantly improved intelligibility at lower data rates.

    Adaptive Differential PCM

  • The real breakthrough in digitized speech is the technique known as adaptive differential pulse-code modulation (ADPCM), a specialized form of PCM that offers significantly improved intelligibility at lower data rates. This system was devised to overcome the defects of the delta-modulation techniques described thus far while still reducing the overall data rate and improving the output's compliance with the source waveform.
    ADPCM improves upon DPCM by dynamically varying the quantization between samples depending upon their rate of change while maintaining a low bit rate, condensing 12-bit PCM samples into only 3 or 4 bits. (The variations in the quantization value are regulated with regard to the characteristic complex sine waves that occur in voice. The technique is therefore not applicable to other kinds of signals, such as square waves.)

  • In ADPCM, each sample's encoding is derived by a complicated procedure that includes the following steps:
    a PCM-value differential dn is obtained by subtracting the previous PCM-code value from the current value; the quantization value Δn (delta-n) is obtained by multiplying the previous quantization value times a coefficient times the absolute value of the previous PCM-code value; the PCM-value differential is then expressed in terms of the quantization value and encoded in four bits, as shown in figure 5.
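    To make the idea of an adapting quantization value concrete, here is a minimal Python sketch of a 4-bit ADPCM coder. It is not the algorithm used inside the Oki chips: the multiplier table, the step limits, the starting step, and all names are hypothetical, and the step is adapted from the magnitude of the previous 4-bit code, which is one common way of realizing the scheme described above.

```python
import math

# Hypothetical multipliers, indexed by the 3-bit magnitude of the previous code:
# small codes shrink the step, large codes enlarge it.
STEP_MULTIPLIERS = [0.9, 0.9, 0.9, 0.9, 1.2, 1.6, 2.0, 2.4]
MIN_STEP, MAX_STEP = 1.0, 512.0

def _next_step(step, magnitude):
    """Adapt the quantization value for the next sample."""
    return max(MIN_STEP, min(MAX_STEP, step * STEP_MULTIPLIERS[magnitude]))

def _reconstruct(estimate, step, code):
    """Apply one 4-bit code (sign bit plus 3-bit magnitude) to the estimate."""
    diff = ((code & 0x7) + 0.5) * step
    return estimate - diff if code & 0x8 else estimate + diff

def adpcm_encode(samples, step=4.0):
    codes = []
    estimate = 0.0
    for s in samples:
        diff = s - estimate                        # differential against the estimate
        magnitude = min(7, int(abs(diff) / step))  # express it in units of the step
        code = magnitude | (0x8 if diff < 0 else 0)
        codes.append(code)
        estimate = _reconstruct(estimate, step, code)
        step = _next_step(step, magnitude)
    return codes

def adpcm_decode(codes, step=4.0):
    out = []
    estimate = 0.0
    for code in codes:
        estimate = _reconstruct(estimate, step, code)
        out.append(estimate)
        step = _next_step(step, code & 0x7)
    return out

# Hypothetical test input: a 200-Hz tone sampled at 8 kHz.
samples = [1000 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(80)]
codes = adpcm_encode(samples)
decoded = adpcm_decode(codes)
```

    Because the decoder repeats exactly the same step adaptation as the encoder, only the 4-bit codes need to be stored; the quantization value itself is never transmitted.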

    Build an ADPCM Speech Analyzer/Synthesizer

  • The Oki Semiconductor Corporation produces a number of integrated circuits (ICs) that perform ADPCM encoding and decoding. Of these, the MSM5218RS and the MSM5205RS are worthy of attention. The 5218 is designed to perform both storing and reproducing of digitized speech, while the 5205 provides only the reproducing function. Using these CMOS (complementary metal-oxide semiconductor) components, we can put together a cost-effective speech-synthesis system that produces highly intelligible output and yet makes efficient use of memory.

  • Figure 6 is the block diagram of the MSM5218RS IC. It is designed to work with 12-bit analog-to-digital converters and contains both an ADPCM analyzer and synthesizer. An internal 10-bit D/A converter is provided to reconstruct the waveform where direct analog output is wanted, or the decoded PCM data may be routed to an external D/A converter.
    The schematic in figure 7 diagrams a speech-synthesis circuit built around this chip (see photo 2). In the circuit, a low-cost 8-bit A/D converter is used in place of a higher-resolution, more costly 12-bit converter. The Oki MSM5204RS 8-bit CMOS A/D converter, employed here, uses a successive capacitor ladder conversion system. It also incorporates a sample-and-hold stage that enables direct input of rapidly changing analog signals. An external clock signal provides timing for the chip; the clock's frequency is not critical and can be anywhere from 450 to 500 kHz.


Adaptive differential pulse-code modulation (ADPCM) improves upon DPCM by dynamically varying
the quantization between samples, depending upon their rate of change, while maintaining a low bit rate,
condensing 12-bit PCM samples into only 3 or 4 bits. In ADPCM, each sample's encoding is derived
by a procedure that includes the following steps. A PCM-value differential dn is obtained by subtracting
the previous PCM-code value from the current value. The quantization value Δn (delta-n) is obtained by
multiplying the previous quantization value times a coefficient times the absolute value of the previous
PCM-code value. The PCM-value differential is then expressed in terms of the quantization value and encoded
in four bits. The mathematical relations are shown here in figure 5a, whereas figure 5b shows
a typical encoded waveform.



  • The frequency bandwidth of the signal input to the A/D converter is limited by an active low-pass filter, IC2, an Oki ALP-2 filter with a 1.7-kHz cutoff frequency. Attenuation is 18 dB per octave above the cutoff frequency. (Although frequencies up to 4 kHz can theoretically be captured with an 8-kHz sample rate, in this application the lower cutoff frequency gives better-sounding reproduction.)

    A/D Conversion in Operation

  • Data conversion is started when the S CON (start conversion) line (pin 13) of the MSM5218 forces the write line (WR, pin 15) on the 5204 A/D converter into a low state. After conversion is complete, the A/D read line (RD, pin 14) is brought low to latch the data onto the 5204's output lines. At a clock rate of 450 kHz, the 5204 completes the 8-bit conversion in approximately 73 microseconds.
    The digital representation of the input data from the 5204 is fed into a CD4014 parallel-to-serial converter (IC9) for transposition into the serial format required by the MSM5218's input. Because we are using an 8-bit converter and the MSM5218 expects 12-bit input, the four remaining low-order bits are clocked in as zeros by the CD4024 counter (IC7) and sections of the quad NAND gate (IC6). These components provide four extra SI-CK (serial-input clock) pulses with zero-logic-level data.
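    In software terms, the shift register and counter do the equivalent of the following Python sketch. The function name is hypothetical, and the most-significant-bit-first ordering is an assumption made for illustration; consult the MSM5218 data sheet for the actual serial timing.

```python
def to_serial_12bit(sample8):
    """Return the 12 serial bits for one 8-bit A/D reading.

    The 8 converter bits are sent first (MSB first, an assumed order),
    followed by four low-order bits clocked in as zeros.
    """
    if not 0 <= sample8 <= 0xFF:
        raise ValueError("expected an 8-bit value")
    word12 = sample8 << 4          # pad with four zero low-order bits
    return [(word12 >> bit) & 1 for bit in range(11, -1, -1)]

bits = to_serial_12bit(0xA5)
```

    Padding on the low-order side keeps the 8-bit reading aligned with the top of the 12-bit range, so the chip sees a full-scale signal with coarser resolution rather than a quieter one.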

    Selectable Parameters

  • The MSM5218 can analyze or synthesize ADPCM speech using a variable sampling rate. Three internal preset VCLOCK rates can be selected, or an externally supplied signal up to 384 kHz can be used. The logic levels on the 5218's pins S1 and S2 define the VCLOCK reference in both analysis and synthesis modes, as shown in the lower right corner of figure 7. The host computer, or any other external hardware, synchronizes itself with the 5218 by monitoring the state and transition timing of the VCLOCK signal (pin 1).
    In addition to selecting the VCLOCK rate, you can choose encoding of the ADPCM data in either 3 or 4 bits, depending upon the logic level on the 4B/3B line (pin 7). A logic 1 selects 4-bit ADPCM values.

    Data Transfer and Rates

  • In the dual-function MSM5218, the data lines D0 through D3 are bidirectional and used either for output of analyzed ADPCM data (for storage) or for input to the speech-synthesizer circuitry. In the analysis mode (with pin 6 held high), the current encoded ADPCM value is available on D0 through D3 at the occurrence of the rising edge of VCLOCK. If you have set S1 and S2 for 8 kHz and 4B/3B for 4-bit data, the resulting bit rate is calculated as follows: 8000 X 4 bits = 32,000 bps


Figure 6: Functional block diagram of the Oki Semiconductor MSM5218RS ADPCM integrated circuit.

  • Remember that we originally calculated that a rate of 96,000 bps would be needed to reproduce speech with this same fidelity. (Here we used an 8-bit A/D converter for economy; the bit rate would be the same for a 12-bit converter.)
    With a slight sacrifice in fidelity, the bit rate can be reduced further. By selecting the 4-kHz sample rate and 3-bit ADPCM codes, a 12,000 bps rate is achieved.
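    These rates, and the saving relative to the 96,000-bps PCM figure, follow from simple arithmetic, sketched here in Python with hypothetical names.

```python
PCM_RATE_BPS = 8000 * 12   # the 12-bit, 8-kHz PCM reference rate

def adpcm_rate_bps(sample_rate_hz, bits_per_code):
    """ADPCM data rate: one 3- or 4-bit code per VCLOCK period."""
    return sample_rate_hz * bits_per_code

settings = {"8 kHz, 4-bit": (8000, 4), "4 kHz, 3-bit": (4000, 3)}
summary = {}
for name, (rate_hz, bits) in settings.items():
    bps = adpcm_rate_bps(rate_hz, bits)
    summary[name] = {
        "bps": bps,
        "bytes_per_second": bps / 8,
        "reduction": PCM_RATE_BPS / bps,   # how many times smaller than PCM
    }
```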
    This may still sound like a lot of data, especially when you compare it to phoneme and LPC (linear-predictive coding) speech synthesizers like the Votrax SC-01A and the Digitalker, which by comparison use data rates of 70 to 1000 bps. The difference, of course, is speech quality and intelligibility. A phoneme or LPC synthesizer generates its own sounds and forms them into words. An ADPCM synthesizer, on the other hand, retains the inflection and intonation of the original human voice. With ADPCM, as with an analog recording, it's possible to have a voice output that reproduces the regional accents of the human speaker.

Photo 2

  • The circuit of figure 7 can be used as both an analyzer and a synthesizer. Both subsystems function concurrently when the MSM5218 is in the analysis mode; the result, the reconstructed waveform, can be heard in real time (delayed by 3 VCLOCK periods). In figure 7, this output is smoothed by a low-pass filter and externally amplified to drive a speaker.

Fig. 7

An ADPCM speech analysis and synthesis (storage and reproduction) circuit built around the Oki MSM5218RS chip.
A low-cost 8-bit A/D converter is used in place of a higher-resolution and more costly 12-bit converter. The Oki
MSM5204RS 8-bit CMOS A/D converter, used in this circuit, contains a successive-capacitor-ladder conversion system.
It also incorporates a sample-and-hold stage that enables direct input of rapidly changing analog signals. An external
clock signal provides timing for the chip; the clock's frequency is not critical and can be anywhere from 450 to 500 kHz. The
frequency bandwidth of the signal input to the A/D converter is limited by an active low-pass filter, IC2, an Oki
ALP-2 filter with a 1.7-kHz cutoff frequency and attenuation of 18 dB per octave above the cutoff frequency.

    Use of the ADPCM Circuit

  • As I said at the beginning, the purpose of this project is to create intelligible machine-generated speech. With the circuit of figure 7 connected to a Z80-based computer, and using the LOAD routine in the program of listing 1 (the algorithm shown in the flowchart of figure 8), you can analyze and store 10 seconds of speech (or 20 seconds at the 4-kHz sample rate).

Flowchart Fig. 8

Algorithm of the LOAD routine in the program of listing 1. Used with the circuit
of figure 7, LOAD takes analog voice signals from a microphone or other
source and stores them in ADPCM-encoded form in user memory.

  • The program simply turns on the synthesizer by lowering the reset line and then observes VCLOCK. At every negative-going transition it reads a 4-bit ADPCM nybble (there are 2 nybbles per byte) and stores it in a memory-resident table.
    For experimental purposes I set this table to occupy a rather large region of user memory (40K bytes). For most practical applications, you might prefer to store segments of speech on disk and load them into memory in smaller increments.
    Once some speech has been stored, you can play it back using the DUMP routine from listing 1 (whose algorithm appears in figure 9).
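    The table handling in LOAD and DUMP amounts to packing and unpacking nybbles. The Python sketch below is not a translation of the Z80 code in listing 1: the function names are hypothetical, and placing the first nybble in the high half of each byte is an assumption. It also shows how the storage times quoted above follow from a 40K-byte table.

```python
def pack_nybbles(codes):
    """Pack 4-bit ADPCM codes two per byte (first code in the high nybble)."""
    codes = list(codes)
    if len(codes) % 2:
        codes.append(0)            # pad an odd-length stream with a zero code
    return bytes(
        (codes[i] & 0xF) << 4 | (codes[i + 1] & 0xF)
        for i in range(0, len(codes), 2)
    )

def unpack_nybbles(data):
    """Recover the 4-bit codes in their original order for playback."""
    codes = []
    for byte in data:
        codes.append(byte >> 4)
        codes.append(byte & 0xF)
    return codes

def seconds_stored(table_bytes, sample_rate_hz, codes_per_byte=2):
    """Length of speech that fits in the table at one code per sample."""
    return table_bytes * codes_per_byte / sample_rate_hz

table = pack_nybbles([0x1, 0xA, 0xF])
at_8khz = seconds_stored(40 * 1024, 8000)   # about 10 seconds
at_4khz = seconds_stored(40 * 1024, 4000)   # about 20 seconds
```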

Listing 1

  • With the MSM5218 set to the synthesis mode, the ADPCM codes are sequentially loaded on each rising edge of VCLOCK.
    If you want to store some speech permanently and then play it back in a dedicated application (as an annunciator, for instance), you won't need the analysis part of the circuit after the ADPCM codes have been stored.

    For such cases you may wish to use the synthesize-only circuit of figure 10, which uses the 18-pin MSM5205RS ADPCM-synthesis chip instead of the dual-function 24-pin 5218 (see photo 3). The 5205's synthesis capabilities are equal in every way to those of the 5218, but the 5205 saves the expense and complication of the analysis section. The resulting 2-chip circuit, the parts of which cost less than $15, can be easily manufactured for a variety of applications.

One significant aspect of ADPCM speech synthesis is the ease of producing a custom vocabulary.

Photo 3

  • I was pleasantly surprised at the fidelity using ADPCM at 32,000 bps. It was still more intelligible than the majority of current synthesis techniques even at 12,000 bps. While testing the software I attached the input of the analysis unit to an FM radio. Even when using the 1.7-kHz filters, I was surprised how good even music sounded.

Fig. 10

A voice-reproduction circuit built around the Oki MSM5205RS speech synthesis chip.
This circuit is useful in applications where you need a fairly inexpensive means of
reproducing a custom vocabulary. You can store your vocabulary with the circuit
of figure 7 and load the encoded speech into this simple circuit for output.

    Summary of ADPCM Synthesis
  • Probably the most significant aspects of ADPCM speech synthesis are the simplicity of the hardware and the ease of producing a custom vocabulary. You don't have to send a word list and recording tape to a manufacturer and wait for the company to spend days doing a Fourier analysis of the tape. To produce a ROM (read-only memory) containing your custom vocabulary, you need merely a microphone and a simple LOAD/DUMP routine. It may require 4 to 5 times more memory space than other high-intelligibility speech-synthesis schemes, but the price of that memory is minuscule compared to the cost of producing vocabularies for the other schemes.

Fig. 9 Algorithm of the DUMP routine from listing 1.

    Future Applications of ADPCM
  • We've looked at ADPCM here only as it relates to voice synthesis, but in actuality, the possible applications of ADPCM to speech recognition prompted my initial interest. The first phase of any speech-recognition technique is digitizing the waveforms and getting them into the computer for analysis, compression, and comparison. My previous article on voiceprints (reference 5) demonstrated the large quantity of hardware necessary to merely condition the waveform for traditional speech-recognition methods. With ADPCM and these Oki chips, we have an inexpensive (under $30) circuit for digitizing voice waveforms and presenting them to a computer in a form that it can digest.
    Even though 1500 to 4000 bytes of raw data per second of speech stream into the computer, the data thus recorded should be unique for each individual word. Speech recognition could be accomplished by brute-force comparison of all the data, or perhaps there exists some applicable compression algorithm that might reduce one second of data to 200 bytes or so. The final compacted data would not be for reconstruction of the original waveform but rather stored as a signature of the input word (derived from an ADPCM code table) for use in comparison.
    We have accomplished the first step and now have means to place the ADPCM codes in memory. In the course of the next few months I will be experimenting with various compression and comparison techniques in hope of developing a practical speech-recognition project. But if by chance you happen upon the solution to the problem overnight, let me know.