COMPUTALKER CT-1 Speech Synthesiser
- The COMPUTALKER Model CT-1 Speech Synthesiser is a high quality voice generator unit designed for the standard S-100 I/O bus configuration. The synthesiser is controlled by acoustic-phonetic parameters transmitted on the microcomputer data bus. These parameters control the perceptually and physiologically fundamental aspects of speech as determined by contemporary phonetic research.
- With the COMPUTALKER Model CT-1, sounds are defined in real time
under software control. Parameters which represent the phonetic
structure of human speech are transmitted to the CT-1 at a rate of 500
to 900 bytes per second, depending on the data compression techniques
used. This allows the production of highly intelligible and quite
natural sounding speech output. Speaker characteristics and language or
dialect variations are retained in the output.
- COMPUTALKER CT-1 Speech Synthesiser Hardware Specifications
- Standard S-100 compatible board:
10 x 5 1/4 in PC board with 100-pin (dual 50 on .125 in centers) edge connector pattern
Depth approx. 11/16 in overall (occupies one slot on the I/O board)
- Components on board include:
CT-1 Synthesiser module set (2 calibrated modules, ea. 3x4x51/8)
14 digital and analog IC's, Power regulators, Address selector switch
2 extra sockets for Expansion and External Parameter Control
- Bus interface:
Uses 10 output addresses, one byte (8 bits) each
Block of 10 addresses is relocatable to any hex boundary via on-board switch
Remaining 6 ports in block of 16 are reserved for future use
A parameter data frame consists of a sequence of 9 output instructions which update each of the 9 parameter values. After addressing any of the 9 ports, a minimum of 20 microseconds must be allowed before addressing another port (a driver loop honoring this timing is sketched after these specifications).
- Good quality speech requires a frame rate of approx. 100 frames per second. Updating at this frame rate, the 8080 CPU is occupied approx. 2 to 3 percent of the time.
Connections on the PC board are provided for controlling the speech fundamental frequency (F0) from an external square wave source (such as an electronic music synthesiser) rather than the software controlled F0 parameter. This allows real-time control of the Compu-singer.
- Audio output:
RCA type phono jack mounted on PC board
1 V peak-to-peak output into 10K ohm load resistance
- Power requirements:
+8 V at 170 mA typ., 250 mA max. (on-board regulation to approx. 5 V)
approx. 16 V at 85 mA (on-board regulation to approx. 12 V)
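The frame structure and timing above suggest a very small driver loop. The C sketch below is only an illustration (the real driver would be 8080 assembly); outp() and delay_us() are assumed stand-ins for an OUT instruction and a calibrated software delay, the base address is merely an example of a boundary set by the on-board switch, and the port order is illustrative.

    /* Hypothetical sketch, not from the CT-1 manual: write one 9 byte
       parameter frame to the CT-1's block of output ports. */
    #define CT1_BASE   0xC0   /* example base address set by the on-board switch */
    #define CT1_PORTS  9      /* AV, F0, AH, AN, AF, F1, F2, F3, FF (order assumed) */

    extern void outp(unsigned char port, unsigned char value); /* assumed port write */
    extern void delay_us(unsigned int usec);                   /* assumed busy wait  */

    void ct1_write_frame(const unsigned char frame[CT1_PORTS])
    {
        unsigned char i;
        for (i = 0; i < CT1_PORTS; i++) {
            outp((unsigned char)(CT1_BASE + i), frame[i]);
            delay_us(20);        /* allow at least 20 us before the next port */
        }
    }

At 100 frames per second this loop runs once every 10 ms; nine port writes plus their 20 microsecond waits come to well under 300 microseconds per frame, consistent with the 2 to 3 percent CPU load quoted above.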
- The COMPUTALKER Model CT-1 can also be operated in a low data rate mode using phoneme definitions contained in the CSR1 Synthesis-by-Rule software package.
- The COMPUTALKER speech synthesis system, used in this way, has the advantage that the software driver can easily be modified to keep the naturalness and intelligibility of the speech output up to date with the constantly evolving state of the art of rule-governed speech.
Synthesizing speech by rule with the COMPUTALKER MODEL CT-1
- Synthesis-by-Rule is a method of producing synthetic speech which is
considerably easier than computer/hand analysis of recorded human
speech. The word or phrase to be synthesized is entered in the form of a
phonetic code to a software system which generates the control
parameters
for the CT-1 Synthesizer board. The result is speech which is
understandable to most people in all but the most difficult perceptual
situations with high noise levels or speech material having completely
unexpected content.
- The demonstration cassette contains a portion of the Gettysburg
Address synthesized using a system of software rules. Such a set of
software acoustic-phonetic rules is available from Computalker
Consultants coded for the 8080 CPU. This software system accepts a
string of ASCII coded phonetic symbols with stresses marked, and
produces a set of control parameters
for the Model CT-1 Synthesizer. The example on the cassette was
generated using a previous version of this software system coded in
FORTRAN, and running on a DEC PDP-12. As the parameter data was
generated, it was punched on paper tape in the data format as described
in the CT-1 Hardware User's Manual, and then read into the IMSAI 8080
for playback. That program, as
run on the larger machine, was originally written for a different speech
synthesizer and some parameters required special treatment for
conversion to the CT-1 parameter format. In some cases, this conversion
was not accurately fine-tuned for the CT-1, and the direct output of the
8080 version
of the program is somewhat clearer in some of the fine details.
- The CSR1 Synthesis-by-Rule software system is organized around
the philosophy of attempting to produce natural sounding, human quality
speech, rather than trying to produce a stereotypical robot-like sound.
Because the true structure of real human speech is not yet correctly
represented
in the software rules, the resulting speech sometimes has an eerie
quality that makes the listener try to assign human-like traits and
qualities to the "speaker" behind the voice. This psychological reaction to the voice does not occur when it is synthesized in a "robot" stereotype having little or no pitch variation and abrupt, blocky formant frequency transitions. The pitch control parameter (F0) can easily be
held to a constant value if the speech output sounds better to you that
way.
The CSR1 software system is structured around phonological, phonetic and
acoustic principles in such a way that it can be modified to keep pace
with the state of the art of synthesis of natural speech. The Model CT-1
has been designed as a general acoustic synthesizer so that the hardware
will not pose limitations to further improvements in the obtainable
speech output quality.
- The CSR1 software system is set up as a general callable
subroutine which accepts a string argument containing the phonetic text,
and on completion, plays the speech data in the buffer directly to the
CT-1. With this structure, CSR1 may be called either from a keyboard
input loop (supplied
with the code) giving an on-line phonetic synthesizer, or from another
system such as BASIC or an operating system, which passes a stored or
computed string argument containing the material to be synthesized. On
return, the buffer contains the actual CT-1 data as synthesized, which
may be written out to cassette or paper tape for editing with the CTMON
Monitor/Editor program.
The 8080 assembly code version of CSR1 fits in less than 6K bytes of
memory, including all phoneme feature and target tables. This code may
be located in ROM or RAM. Additional RAM will be required for parameter
data storage during the actual synthesis. The buffer space required is
300 bytes per second of speech. By comparison, the introductory phrase,
"Hello, I'm Computalker, A speech synthesizer designed to plug into the
standard bus on your 8080 microcomputer" is less than 7K bytes long.
CSR1 version 1.0 completes the computation of parameter data before
beginning playback. An interrupt driven version is currently under
development, which will begin playback as soon as sufficient data has
been computed and stored in the buffer.
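Since CSR1 is distributed as an 8080 assembly language subroutine, its exact entry conventions belong to the package itself; the C rendering below is only a sketch of the calling structure described above, and the function name, argument order and phonetic symbols are invented for illustration.

    /* Illustrative only: a C view of the CSR1 calling structure.
       csr1_speak() and its phonetic input string are assumptions. */
    #define CT1_BYTES_PER_SECOND  300   /* buffer cost quoted above */
    #define MAX_UTTERANCE_SECONDS 10
    #define CT1_BUF_SIZE (CT1_BYTES_PER_SECOND * MAX_UTTERANCE_SECONDS)

    static unsigned char ct1_buffer[CT1_BUF_SIZE];

    /* assumed interface: build parameter data for the phonetic text,
       play it to the CT-1, and return the number of bytes produced */
    extern unsigned int csr1_speak(const char *phonetic_text,
                                   unsigned char *buffer,
                                   unsigned int buffer_size);

    void demo(void)
    {
        /* ASCII phonetic symbols with stress marks (symbol set illustrative) */
        unsigned int n = csr1_speak("HEH1 LOH2", ct1_buffer, CT1_BUF_SIZE);

        /* on return the buffer holds the actual CT-1 data, which could be
           written to cassette or paper tape for editing with CTMON */
        (void)n;
    }

A BASIC program or an operating system could pass a stored or computed string through the same entry point, as described above.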
How to get natural sounding speech output from the COMPUTALKER MODEL CT-1
- The demonstration cassette, "Sounds of Couputalker",illustrates
several methods of obtaining the control parameters to operate the
Computalker Model CT-1 Speech Synthesizer. High quality speech output,
as exemplified by the introductory phrases, "Hello, I'm Computalker.
A speech synthesizer ... ", involves computer processing of recorded
human speech followed by a fair amount of hand work. The recordings were
initially digitized at 10K samples/second and then analysed using a
linear prediction algorithm to extract the formant frequencies, and a
cepstrum algorithm to measure the fundamental frequency. These
techniques are described in several texts on speech analysis
(Flanagan, J. L., Speech Analysis, Synthesis, and Perception, 2nd Ed., Springer Verlag, 1972; Markel, J. D. and Gray, A. H., Jr., Linear Prediction of Speech, Springer Verlag, 1976). In addition to these analyses, the amplitude was measured by RMS averaging with a smooth window every 10 msec. to
obtain the AV parameter. Some editing of the formant frequency data
was done by hand to eliminate falsely
detected peaks and fill in occasional gaps in the true formant data
before converting the frequency data to the Computalker parameters F1,
F2, and F3. Since the CT-1 control parameters consist of numerical
values within the range of 0-255, all frequency and amplitude data is
converted so that it stays within this range. All the above steps
required approximately 6 hours of time on a DEC PDP-12 set up for speech
analysis processing to produce the original data for the introductory
phrases on the cassette. At this stage, this data was punched on paper
tape and then read into the CT-1 Control Monitor program running on my
IMSAI 8080. From that point, I spent several
more evenings entering the data for parameters AH, AF, FF, and AN, and a
bit more touching up of the other parameters.
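The conversion into the 0-255 parameter range can be pictured as a simple linear mapping. The sketch below is a generic illustration only; the frequency span in the example is an assumption, not the calibrated CT-1 scale documented in the Hardware User's Manual.

    /* Generic linear mapping of an analysis value onto a 0-255 parameter. */
    unsigned char scale_to_param(double value, double lo, double hi)
    {
        if (value < lo) value = lo;         /* clamp to the chosen span */
        if (value > hi) value = hi;
        return (unsigned char)((value - lo) / (hi - lo) * 255.0 + 0.5);
    }

    /* example: an F1 measurement of 730 Hz on an assumed 200-1000 Hz span */
    /* unsigned char f1 = scale_to_param(730.0, 200.0, 1000.0); */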
- Given the frequency vs. time information obtained from the
initial computer analysis, the remaining aspiration and frication data
can be inserted by fairly straight-forward procedures. These procedures
will be described in the completed CT-1 Hardware User's Manual. The
Manual will also discuss the approximate formant frequency patterns
needed to construct the sounds of the various phonemes of English. It
would be feasible (although tedious work) to construct intelligible
sounds by hand editing based on this data. However, it is still quite
difficult to form these patterns to make natural sounding speech without
access to a spectrum analysis process of some kind. Such an analysis
gives you the frequency structure as a function of time, i.e. retaining
the natural timing structure.
- It is my plan to publish more extensive descriptions of the
above mentioned speech analysis techniques, to make them accessible to a
wider audience than they now have. The recent developments in floating
point hardware with multiplication in the 50-100 microsec. range make
it reasonable to do this sort of analysis on a microcomputer. The setup
would require a filter and A/D converter capable of sampling the speech
at at least 10K samples/sec. The low-pass speech filter ahead of the
A/D converter should be reasonably flat to at least 1/3 of the sampling
rate, and then down by at least 30-40 dB at 1/2 the sampling rate. 32K of RAM would allow sampling up to 3 seconds continuously, which is a workably sized chunk. Without floating point hardware the analysis
would proceed quite slowly but in many cases that is not a drawback on a
micro system.
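The storage figure is easy to check. Assuming one byte per 8 bit sample, the short calculation below reproduces the roughly 3 seconds of continuous sampling that 32K of RAM allows at 10K samples per second.

    /* Quick check of the sampling storage estimate above,
       assuming 8 bit (one byte) samples. */
    #include <stdio.h>

    int main(void)
    {
        const long sample_rate = 10000;        /* samples per second */
        const long ram_bytes   = 32L * 1024L;  /* 32K of RAM         */
        printf("%.1f seconds of speech fit in 32K\n",
               (double)ram_bytes / (double)sample_rate);  /* about 3.3 s */
        return 0;
    }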
- Alternatively, for a modest consulting fee, Computalker
Consultants could supply the basic, rough formant frequency, F0 and AV
data from your tape recording, leaving out the aspiration, frication
and nasal values, which must be added by hand. As a preliminary
estimate, I believe this work could be done for approx. $25 per second
of speech material to be analyzed. Working from this basic data the
desired speech could be produced following the tables and information
given in the CT-1 Hardware User's Manual, using the CTMON Monitor/Editor
to synthesize speech from the data as the work progresses.
Friends, Humans and Countryrobots: Lend Me Your Ears
by D. Lloyd Rice, published in Byte, August 1976
- You've got your microcomputer running and you invite your friends in
to show off the new toy. You ask Charlie to sit down and type in his
name. When he does, a loudspeaker on the shelf booms out a hearty Hello,
Charlie! Charlie then starts a game of Star Trek and as he warps around
thru the galaxy searching for invaders, each alarming new development
is announced by the ship's computer in a warning voice, Shield power
low!, Torpedo damage on lower decks!
The device that makes this possible is a peripheral with truly unlimited
applications, the speech synthesizer. This article describes what a
speech synthesizer is like, how it works and a general outline of how to
control it with a microcomputer. We will look at the structure of human
speech and see how that structure can be generated by a computer
controlled device.
How can you generate speech sounds artificially, under computer control?
Let's look at some of the alternatives. Simplest of all, with a fast
enough digital to analog converter (DAC) you can generate any sound you like. A 7 or 8 bit DAC can produce good quality sound, while somewhere around 4 or 5 bits the quantization noise starts to be bothersome. This
noise is produced because with a 5 bit data value it is possible to
represent only 32 discrete steps or voltage levels at the converted
analog output. Instead of a smoothly rising voltage slope, you would get
a series of steps as in figure 2.
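The staircase effect of figure 2 is easy to reproduce numerically. The short sketch below quantizes a smooth ramp to the 32 levels of a 5 bit converter; nothing in it is specific to any particular DAC.

    /* Show the quantization steps of a 5 bit DAC (32 levels)
       applied to a smoothly rising ramp. */
    #include <stdio.h>

    int main(void)
    {
        int i;
        for (i = 0; i <= 20; i++) {
            double smooth  = i / 20.0;              /* ideal ramp, 0.0 to 1.0      */
            int    code    = (int)(smooth * 31.0);  /* truncated to one of 32 codes */
            double stepped = code / 31.0;           /* the analog output level      */
            printf("%5.2f -> code %2d -> %5.3f\n", smooth, code, stepped);
        }
        return 0;
    }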
- As for the speed of the DAC, a conversion rate of 8,000 to 10,000
conversions per second [The sample rate in conversions per second or
samples per second is often quoted in units of Hertz. We will use that
terminology here, although conversions per second is a generalization of the concept of cycles per second] is sufficient for fairly good quality
speech. With sample rates below about 6 kHz the speech quality begins
to deteriorate badly because of inadequate frequency response.
Almost any microprocessor can easily handle the data rates described
above to keep the DAC going. The next question is, where do the samples
come from? One way to get them would be by sampling a real speech signal
with a matching analog to digital converter (ADC) running at the same
sample rate. You then have a complicated and expensive, but very
flexible, recording system. Each second of speech requires 8 K to 10 K
bytes of storage. If you want only a few words or short phrases, you
could store the samples in a ROM or two and dump them sequentially to
the DAC. Such a system appears in figure 3.
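A playback loop for the scheme of figure 3 is almost trivial. In the sketch below, dac_write() and wait_sample_tick() are hypothetical stand-ins for the DAC port and for whatever paces the output at the 8 to 10 kHz sample rate.

    /* Sketch of figure 3: dump ROM-stored samples sequentially to the DAC. */
    extern const unsigned char speech_rom[];      /* 8 bit samples of a word or phrase */
    extern const unsigned int  speech_rom_len;

    extern void dac_write(unsigned char sample);  /* hypothetical DAC output port */
    extern void wait_sample_tick(void);           /* paces output at 8 to 10 kHz  */

    void play_phrase(void)
    {
        unsigned int i;
        for (i = 0; i < speech_rom_len; i++) {
            dac_write(speech_rom[i]);             /* one conversion per sample period */
            wait_sample_tick();
        }
    }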
If you want more than a second or two of speech output, however, the
amount of ROM storage required quickly becomes impractical. What can be
done to minimize storage? Many words appear to have parts that could be
recombined in different ways to make other words. Could a lot of memory
be saved this way? A given vowel sound normally consists of several
repetitions of nearly identical waveform segments with the period of
repetition corresponding to the speech fundamental frequency or pitch.
Figure 4 shows such a waveform.
- Within limits, an acceptable sound is produced if we store only
one such cycle and construct the vowel sound by repeating this waveform
cycle for the duration of the desired vowel. Of course, the pitch will
be precisely constant over that entire interval. This will sound rather
unnatural, especially for longer vowel durations, because the period of
repetition in a naturally spoken vowel is never precisely constant,
but fluctuates slightly. In natural speech the pitch is nearly always
changing, whether drifting slowly or sweeping rapidly to a new level. It
is of interest that this jitter and movement of the pitch rate has a
direct effect on the perception of speech because of the harmonic
structure of the speech signal. In fact, accurate and realistic
modelling of the natural pitch structure is probably the one most
important ingredient of good quality synthetic speech. In order to have
smooth pitch changes across whole sentences, the number of separate
stored waveform cycles still gets unreasonable very quickly. From these
observations of the cyclic nature of vowels, let us move in for a closer
look at the structure of the speech signal and explore more
sophisticated possibilities for generating synthetic speech.
- How Do We Talk?
- The human vocal tract consists of an air filled tube about 16
to 18cm long, together with several connected structures which make the
air in the tube respond in different ways (see figure 1). The tube
begins at the vocal cords or glottis, where the flow of air up from the
lungs is broken up into a series of sharp pulses of air by the vibration
of the vocal cords. Each time the glottis snaps shut, ending the
driving pulse with a rapidly falling edge, the air in the tube above
vibrates or rings for a few thousandths of a second. The glottis then
opens and the airflow starts again, setting up conditions for the next
cycle.
The length of this vibrating air column is the distance from the closed
glottis up along the length of the tongue and ending at the lips, where
the air vibrations are coupled to the surrounding air. If we now
consider the frequency response of such a column of air, we see that it
vibrates in several modes or resonant frequencies corresponding to
different multiples of the acoustic quarter wavelength. There is a
strong resonance or
energy peak at a frequency such that the length of the tube is one
quarter wavelength, another energy peak where the tube is three quarter
wavelengths, and so on at every odd multiple of the quarter wavelength.
If a tube 17.4 cm long had a constant diameter from bottom to top, these
resonant energy peaks would have frequencies of 500 Hz, 1500 Hz, 2500
Hz and so on. These resonant energy peaks are known as the formant
frequencies. Figure 5 illustrates the simple acoustic resonator and
related physical equations.
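The resonance frequencies follow directly from the quarter wavelength relation f_n = (2n - 1)c / 4L. The short calculation below, assuming a sound speed of about 348 m/s, reproduces the 500, 1500 and 2500 Hz figures for a 17.4 cm tube.

    /* Quarter wavelength resonances of a uniform 17.4 cm tube:
       f_n = (2n - 1) * c / (4 * L). */
    #include <stdio.h>

    int main(void)
    {
        const double c = 34800.0;   /* assumed speed of sound, cm per second */
        const double L = 17.4;      /* tube length, cm                       */
        int n;
        for (n = 1; n <= 3; n++)
            printf("formant %d: %.0f Hz\n", n, (2 * n - 1) * c / (4.0 * L));
        return 0;
    }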
- The vocal tract tube, however, does not have a constant diameter
from one end to the other. Since the tube does not have constant shape,
the resonances are not fixed at 1000 Hz intervals as described above,
but can be swept higher or lower according to the shape. When you move
your tongue down to say ah, as in figure 6, the back part is pushed back
toward the walls of the throat
- and in the front part of the mouth the size of the opening is increased. The effect of changing the shape of the tube this way is to raise the frequency of the first resonance or formant 1 (F1) by several hundred Hz, while the frequency of formant 2 (F2) is lowered slightly. On the other hand, if you move your tongue forward and upward to say ee, as in figure 7,
- the size of the tube at the front, just behind the teeth, is much
smaller, while at the back the tongue has been pulled away from the
walls of the throat, leaving a large resonant cavity in that region.
This results in a sharp drop in F1 down to as low as 200 or 250 Hz, with F2
being increased to as much as 2200 or 2300 Hz.
We now have enough information to put together the circuit for the oral
tract branch of a basic formant frequency synthesizer. After discussing
that circuit, we will continue on in this way, describing additional
properties of the speech mechanism and building up the remaining branches of the synthesizer circuit.
- A Speech Synthesizer Circuit
- To start with, we must have a train of driving pulses, known as
the voicing source, which represents the pulses of air flowing up thru
the vibrating glottis. This could be simply a rectified sine wave as in
figure 8.
To get different voice qualities, the circuit may be modified to
generate different waveform shapes.
This glottal pulse is then fed to a sequence of resonators which
represent the formant frequency resonances of the vocal tract. These
could be simple operational amplifier bandpass filters which are tunable
over the range of each respective formant. Figure 9 shows the concept
of a typical resonator circuit which meets our requirements. IC1, IC2 and IC4 form the actual bandpass filter, while IC3 acts as a digitally controlled resistance element serving to vary the resonant frequency of the filter.
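For readers who would rather experiment in software first, a digital two pole resonator plays the same role as the tunable op amp filter of figure 9. The sketch below is a standard digital resonator, not the CT-1 circuit; the mapping from an 8 bit control value to frequency and bandwidth is left as an assumption.

    /* A software analogue of the tunable formant resonator: a two pole
       digital resonator whose center frequency and bandwidth are set at
       run time, much as IC3 retunes the analog filter. */
    #include <math.h>

    #define PI 3.14159265358979
    #define SAMPLE_RATE 10000.0

    typedef struct {
        double a1, a2;   /* feedback coefficients       */
        double y1, y2;   /* previous two output samples */
    } Resonator;

    void resonator_tune(Resonator *r, double freq_hz, double bandwidth_hz)
    {
        double pr = exp(-PI * bandwidth_hz / SAMPLE_RATE);        /* pole radius */
        r->a1 = 2.0 * pr * cos(2.0 * PI * freq_hz / SAMPLE_RATE);
        r->a2 = -pr * pr;
    }

    double resonator_step(Resonator *r, double input)
    {
        double y = input + r->a1 * r->y1 + r->a2 * r->y2;
        r->y2 = r->y1;
        r->y1 = y;
        return y;
    }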
- Several such resonator circuits are then combined as in figure 10 to
form the vocal tract simulator. The voicing amplitude control, AV, is
another digitally controlled resistance similar to IC3 of figure 9. This
gain controlled amplifier configuration is the means by which the
digital computer achieves its control of speech signal elements. The
data of one byte drives the switches to set the gain level of the
amplifier in question. In figures 10, 13 and 15 of this article, this same variable resistance under digital control is shown symbolically as a resistor with a parameter name, rather than as an operational amplifier with analog switches.
- Generating Vowel Sounds
- The vocal tract circuit as shown thus far is sufficient to
generate any vowel sound in any human language (no porpoise talk, yet).
Most of the vowels of American English can be produced by fixed, steady
state formant frequencies as given in table 1.
- A common word is given to clearly identify each vowel. The formant
frequency values shown here may occasionally be modified by adjacent
consonants. An alternative way to describe the formant relationships
among the vowels is by plotting formant frequencies F1 vs F2 as in
figure 11. F3 is not shown here because it varies only slightly for all
vowels (except those with very high F2, where it is somewhat higher).
- The F1-F2 plot provides a convenient space in which to study the
effects of different dialects and different languages. For example, in
some sections of the United States, the vowels in hod and paw are
pronounced the same, just above and to the right of paw on the graph.
Also, many people from the western states pronounce the sounds in head
and hid alike, about halfway between the two points plotted for these
vowels on the graph. A few English vowels are characterized by rapid
sweeps across the formant frequency
space rather than the relatively stable positions of those given in
table 1. These sweeps are produced by moving the tongue rapidly from one
position to another during the production of that vowel sound.
Approximate traces of the frequency sweeps of formants F1 and F2 are
shown in figure 12 for the vowels in bay, boy, buy, hoe and how. These
sweeps occur in roughly 150 to 250 ms, depending on the speaking rate.
- Consonant Sounds
- Consonant sounds consist mostly of various pops, hisses and
interruptions imposed on the vibrating column of air by the actions of
several components of the vocal tract shown in figure 1. We will divide
them into four classes: 1) stops, 2) liquids, 3) nasals, and 4)
fricatives and affricates. Considering first the basic stop consonants, p, t, k, b, d and g, the air stream is closed off, or stopped,
momentarily at some point along its length, either at the lips, by the
tongue tip just behind the teeth or by the tongue body touching the soft
palate near the velum. Stopping the air flow briefly has the effect of
producing a short period of silence or near silence, followed by a pulse
of noise as the burst of air rushes out of the narrow opening.
The shape of the vocal tract with the narrow opening at different points
determines the spectral shape of the noise pulse as well as the formant
locations when voicing is started. Both the noise burst spectrum and
the rapid sweeps of formant frequency as the F1-F2 point moves into
position for the following vowel are perceived as characteristic cues to
the location of the tongue as the stop closure is released. We need
only add a digitally controlled noise generator to the vocal tract
circuit of figure 10 to simulate the noise of the burst of air at the
closure release and we can then generate all the stop consonants as well
as the vowels.
- Figure 13 shows the speech synthesizer with such a noise generator
added. The breakdown noise of a zener diode is amplified by IC1 and its amplitude is set by the digitally controlled resistor AH. IC2 is a mixer
amplifier which combines the glottal source and aspiration noise at the
input to the formant resonators. It is important to notice at this
point the range of different sounds that can be generated by small
changes in the relative timing of the control parameters. The most
useful of these timing details is the relationship between the pulse of
aspiration noise and a sharp increase in the amplitude of voicing (see
figure 14).
- For example, if we set the noise generator to come on for a noise
pulse about 40 ms long and immediately after this pulse, F1 sweeps
rapidly from 300 up to 775 Hz and F2 moves from 2000 down to 1100 Hz,
the sound generated will correspond to moving the tip of the tongue down
rapidly from the roof of the mouth. Observe, however, that the formant
output is silent after the noise pulse until the voicing amplitude is
turned up. If voicing is turned on before or during a short noise burst,
the circuit generates the sound da, whereas if the voicing comes on
later, after a longer burst and during the formant frequency sweeps, the
output sounds like ta. This same timing distinction characterizes the
sounds ba vs pa and ga vs ka, as well as several other pairs which we
will explore later. Figure 14 gives the formant frequency patterns
needed to produce all the stop consonants when followed by the vowel
ah. When the consonant is followed by a different vowel, the formants
must move to different positions corresponding to that vowel.
The important thing to note about a stop transition is that the starting
points of the frequency sweeps correspond to the point of closure in
the vocal tract, even though these sweeps may be partially silent for
the unvoiced stops p, t and k, where the voicing amplitude comes on
after the sweep has begun.
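The voice onset timing can be made concrete with a short frame sequence. The sketch below uses the 10 ms frame interval implied by the 100 frames per second rate discussed earlier and the frequency sweep quoted above (F1 300 to 775 Hz, F2 2000 to 1100 Hz); the amplitude values and the set_params() and next_frame() helpers are assumptions for illustration, not figures from the article.

    /* Illustrative da/ta timing: voicing on during a short burst gives "da",
       voicing delayed until after a longer burst gives "ta". */
    extern void set_params(int av, int ah, int f1_hz, int f2_hz); /* hypothetical */
    extern void next_frame(void);                                 /* wait 10 ms   */

    void say_da_or_ta(int voiced)        /* nonzero -> "da", zero -> "ta" */
    {
        int burst_frames = voiced ? 2 : 5;   /* roughly 20 ms vs 50 ms of noise */
        int t;

        /* closure: near silence, formants parked at the alveolar locus */
        set_params(0, 0, 300, 2000);
        next_frame();

        /* release burst: aspiration noise on, voicing already on for "da" */
        for (t = 0; t < burst_frames; t++) {
            set_params(voiced ? 200 : 0, 180, 300, 2000);
            next_frame();
        }

        /* sweep into "ah": F1 300 -> 775 Hz, F2 2000 -> 1100 Hz over about 50 ms,
           with voicing on in both cases by this point */
        for (t = 1; t <= 5; t++) {
            set_params(200, 0,
                       300 + t * (775 - 300) / 5,
                       2000 - t * (2000 - 1100) / 5);
            next_frame();
        }
    }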
The second consonant group comprises the liquids, w, y, r and l. These
sounds are actually more like vowels than any of the other consonants
except that the timing of formant movements is crucial to the liquid
quality. W and y can be associated with the vowels oo and ee,
respectively. The difference is one of timing. If the vowel oo is
immediately followed by the vowel ah, and then the rate of F1 and F2 transitions is increased, the result will sound like wa. A comparison of the resulting traces of F1 and F2 vs time in wa with the transition pattern for ba in figure 14 points out a further similarity. The direction of movement is basically the same, only the rate of transition for ba is faster still than for wa. Thus we see the parallelism in the
acoustic signal due to the common factor of lip closeness in the three
sounds ua, wa and ba. Y can be compared with the vowel ee in the same
way, so the difference between ia and ya is only a matter of transition
rates. Generally, l is marked by a brief increase of F3, while r is indicated by a sharp drop in F3, in many cases almost to the level of
F2.
- The third group of consonants consists of the nasals, m, n and
ng. These are very similar to the related voiced stops b, d and g,
respectively, except for the addition of a fixed nasal formant. This
extra formant is most easily generated by an additional resonator tuned
to approximately 1400 Hz and having a fairly wide bandwidth. It is only
necessary to control the amplitude of this extra resonator during the
closure period to achieve the nasal quality in the synthesizer output.
The fourth series of consonants to be described are the fricatives, s,
sh, zh, z, f, v and th and the related affricates ch and j. The
affricates ch and j consist of the patterns for t and d followed
immediately by the
fricative sh or zh, respectively, that is, ch = t+sh and j = d+zh. The
sound zh is otherwise rare in English. An example occurs in the word
azure. With the letters th, two different sounds are represented, as
contained in the words then and thin. All the fricatives are
characterized by a pulse of high frequency noise lasting from 50 to 150
msec. The first subclassification of fricatives is according to voicing
amplitude during the noise pulse, just as previously described for the
stop consonants. Thus, s, sh, f, ch and th as in thin have no voicing
during the noise pulse, while z, zh, v, j and th as in then have high
voice amplitude. When a voiceless fricative is followed by a vowel, the
voicing comes on during the formant sweeps to the vowel position, just
as in the case of the voiceless stops. The different fricatives within
each voice group are distinguished by the spectral characteristics of
the fricative noise pulse. This noise signal differs from that
previously described for the stop bursts in that it does not go thru the
formant resonators, but is mixed directly into the output after
spectral shaping by a single pole filter.
- Table 2 gives the fricative resonator settings needed to produce the
various fricative and affricate consonants. Fricative noise amplitude
settings are shown on a scale of 0 to 1.
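The fricative path can also be mocked up in software. The sketch below shapes a noise source with a single pole filter and scales it by AF, mixing the result straight into the output rather than through the formant resonators; the coefficient mapping for FF is an assumption for illustration, not the table 2 values.

    /* Software stand-in for the fricative branch: noise, a single pole
       shaping filter set by FF, and an AF amplitude control, mixed
       directly into the output. */
    #include <stdlib.h>

    static double fric_state = 0.0;

    double fricative_sample(double ff_coeff, double af_gain)
    {
        /* roughly white noise in the range -1..1 */
        double noise = 2.0 * ((double)rand() / (double)RAND_MAX) - 1.0;

        /* single pole filter; 0 < ff_coeff < 1, larger values darken the hiss */
        fric_state = ff_coeff * fric_state + (1.0 - ff_coeff) * noise;

        return af_gain * fric_state;   /* AF sets the fricative amplitude */
    }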
- The Complete Synthesiser
- The system level diagram of a complete synthesizer for voice
outputs is summarized in figure 15.
- The information contained in this article should be sufficiently
complete for individual readers to begin experimenting with the
circuitry needed to produce speech outputs. In constructing a
synthesizer on this model, the result will be a device which is
controlled in real time by the following parameters:
AV amplitude of the voicing source, 8 bits
FV frequency of the voicing source, 8 bits
AH amplitude of the aspiration noise component, 8 bits
AN amplitude of the nasal resonator component, 8 bits
AF amplitude of the fricative noise component, 8 bits
F1 frequency of the formant 1 filter, 8 bit setting.
F2 frequency of the formant 2 filter, 8 bit setting.
F3 frequency of the formant 3 filter, 8 bit setting.
FF frequency of fricative resonator filter, 8 bit setting.
This is the basic hardware of a system to synthesize sound; in order to
complete the system, a set of detailed time series for settings for
these parameters must be determined (by a combination of the theory in
this article and references, plus experiment with the hardware). Then,
software must be written for your own computer to present the right time
series of settings for each sound you want to produce. Commercial
synthesizers often come with a predefined set of phonemes which are
accessed by an appropriate binary code. The problem of creating and
documenting such a set of phonemes is beyond the scope of this
introductory article, but is well within the dollar and time budgets
of an experimenter.
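One convenient way to organize that driving software is to hold each control frame in a small structure, one byte per parameter in the list above. The layout below simply mirrors that list; the structure name and the comments are our own working notation, not part of any particular commercial design.

    /* One control frame for a synthesizer built on this model,
       one byte per parameter as listed in the article. */
    typedef struct {
        unsigned char av;   /* amplitude of the voicing source       */
        unsigned char fv;   /* frequency of the voicing source       */
        unsigned char ah;   /* amplitude of the aspiration noise     */
        unsigned char an;   /* amplitude of the nasal resonator      */
        unsigned char af;   /* amplitude of the fricative noise      */
        unsigned char f1;   /* formant 1 filter setting              */
        unsigned char f2;   /* formant 2 filter setting              */
        unsigned char f3;   /* formant 3 filter setting              */
        unsigned char ff;   /* fricative resonator filter setting    */
    } SpeechFrame;

    /* a time series of such frames, stepped at roughly 100 per second,
       is what the control software must compute for each utterance */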