

Simple Speech

Experimental Speech System, Part One published by Practical Electronics November 1983
Posted 27 October 2002

    • OVER recent years some extremely sophisticated techniques have been developed
      for the digital encoding of speech, and for its reconstitution. The purpose of
      this article is not to discuss these techniques (see references), but to present
      a very simple method, and its implementation using a 6502-based microcomputer
      with minimal extra hardware, whereby speech may be digitised, and subsequently
      regenerated, with an adequate degree of intelligibility.
      This method is relatively economic in its use of memory for storage of the speech
      data, and has the outstanding advantage that the encoding process is simple. One
      may readily construct any chosen vocabulary, whether composed of complete words,
      phrases, or phonemes. This is in distinct contrast to the most commonly used method
      of speech digitisation, involving linear predictive coding to reduce redundancy,
      and utilising a digital model (a lattice filter) of the human vocal tract for the
      reconstitution of speech. Although highly complex, this hardware is now cheap and
      readily available. However, it is not at all easy for the user to construct his
      own vocabulary, as the encoding process is so complicated.

      The method presented here is not new in concept and is based upon the well-known
      observation that the intelligibility of infinitely clipped, or zero-crossed speech,
      is remarkably high. This is due to the particular spectral characteristics of speech,
      especially its quasi-periodic nature. Anyway, given this observation, it is but a
      simple step to see that one may encode zero-crossed speech waveforms by storing
      digitally the time intervals between successive zero-crossings.
      The zero-crossed speech waveform may be faithfully reconstructed by an inverse
      process of generating switching intervals corresponding to the magnitudes of the
      digital speech data. Note also that speech may be speeded up or slowed down on
      replay (but the pitch also changes accordingly). In practice it was not found
      possible to speed up the speech appreciably on replay, because of the limited
      speed of the microprocessor.
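The encoding and its inverse can be illustrated with a minimal sketch in Python. This is not the article's 6502 code: the function names, the use of a list of signed samples as input, and the `stretch` parameter are assumptions made purely for illustration.

```python
def encode_intervals(samples):
    """Return the number of samples between successive sign changes."""
    intervals = []
    count = 0
    previous = samples[0] >= 0
    for s in samples[1:]:
        count += 1
        current = s >= 0
        if current != previous:      # a zero-crossing has occurred
            intervals.append(count)
            count = 0
            previous = current
    return intervals


def decode_intervals(intervals, start_high=True, stretch=1):
    """Rebuild a two-level waveform by switching after each stored interval.

    stretch > 1 lengthens every interval, which slows the replay and
    lowers the pitch by the same factor.
    """
    level = 1 if start_high else -1
    out = []
    for n in intervals:
        out.extend([level] * (n * stretch))
        level = -level
    return out
```

Only the positions of the sign changes survive the round trip; the amplitude of the original waveform is discarded, as it is in infinitely clipped speech.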

      Initial experiments had the simple aim of digitising a few seconds of continuous
      speech (including any periods of silence). Once this had been achieved, and the
      reconstituted speech shown to be almost identical to the original zero-crossed
      speech waveform, the programs and hardware were developed further to enable isolated
      words or word segments to be digitised and stored between known limits in memory.
      Knowing where in memory the speech data comprising a word starts and ends, it becomes
      a simple matter to replay any particular chosen word under program control.
    • Acquisition and replay of a few seconds of continuous speech, including silences.
      A little consideration suggested that sufficient speed could be obtained only by
      using the 6502 interrupt system; it would take rather too long to repeatedly examine
      a digital input line, testing for a change of state. The acquisition hardware operates
      as follows: First a source of audio signal with a peak amplitude of a few volts
      rather than millivolts, is required.
      This can be obtained from a microphone followed by a suitable preamplifier (Fig. 1.1.),
      but if such a pre-amp is not available, then the audio signal may be pre-recorded and
      replayed from a cassette recorder at maximum volume, taking the audio output from the
      earphone socket. In Fig. 1.2, IC1 is configured as a comparator receiving an input
      which is filtered to give a bandpass response between about 300 and 3200Hz. The
      comparator threshold may be set within a certain range either side of zero by means
      of VR1. Ideally, the threshold should be set at zero, but in practice it is preferable
      to set the threshold just above the noise level, either positively or negatively.
      Clearly it is necessary to conduct these experiments in an environment relatively
      free of acoustic noise.
      The comparator provides a "zero-crossed" output or, expressed another way,
      an infinitely clipped version of the audio waveform. The function of the pre-filtering
      is to restrict the bandwidth to the minimum compatible with reasonable intelligibility.
      This will economise on memory usage for the storage of the words, especially those
      with a high sibilant content. IC1 is followed by an edge detector, consisting of a
      simple delay and an exclusive-OR gate. This gives positive-going pulses of minimal
      duration (the acquisition Count routine must not interrupt itself).
      These pulses are inverted by IC3a (1/2 74LS20) and then supplied to the IRQ input of
      the 6502 microprocessor. The reason for using a 4-input NAND gate rather than a simple
      inverter is to enable the system to be expanded later on.
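The comparator and edge detector described above can be modelled in software as a rough sketch. The hardware works on continuous signals; here the signal is assumed to be a list of sampled voltages, the delay is assumed to be a whole number of samples, and the function names are hypothetical.

```python
def comparator(samples, threshold=0.0):
    """Return 1 where the input exceeds the threshold, otherwise 0."""
    return [1 if s > threshold else 0 for s in samples]


def edge_pulses(bits, delay=1):
    """Exclusive-OR each bit with a delayed copy of the same signal.

    The result is 1 for `delay` samples after every transition,
    whether rising or falling, and 0 elsewhere.
    """
    delayed = [bits[0]] * delay + bits[:-delay]
    return [b ^ d for b, d in zip(bits, delayed)]
```

Raising `threshold` slightly above zero corresponds to setting VR1 just above the noise level: small excursions no longer produce transitions, and so no interrupt pulses.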

    • Table 1.1 gives the assembly listing. Table 1.2 gives a hexdump of the acquisition program.
      The storage area reserved for speech data is $1000 to $1FFF, so a minimum of 8K of memory
      must be available. 4K of memory is thus available for storage of speech data, which is
      enough for a few seconds of speech.
      INTA1 is the acquisition program. It consists of two parts, an Initialisation routine, and
      a Count routine. Superboard II has a small area of free memory, located at addresses $0250
      to $02FF, which is not used by BASIC, so it was decided to assemble the programs into this area.
      Note that the machine used was fitted with WEMON; on machines without WEMON this area of free
      memory in page 3 is slightly larger.
      The first part of the acquisition program is concerned with initialisation. It ensures that
      all arithmetic operations are conducted in twos complement binary rather than in BCD; then it
      sets up the system IRQ vector to access the Count routine. The remaining instructions set up a
      2-byte indirect pointer to access memory for the storage of zero-crossing intervals, starting
      at $1000; then the counter (the X-register) used for timing the intervals is cleared, interrupts
      are made possible, and the processor cycles in a jump-self wait loop until the first interrupt
      pulse arrives. Operation of the second part, the Count routine, is best understood by reference
      to the assembly listing. Essentially it stores the last count value, and counts again until
      another interrupt pulse occurs, in which case the counting starts again, or until a count of $FF
      is reached, in which case the same thing happens, except this time via a JMP rather than an
      interrupt. Thus intervals longer than $FF count units will be truncated to that value.
      In practice most intervals do not exceed $FF units, but the reason for incorporating
      this feature is to enable silences also to be encoded, as a train of bytes of value $FF.
      For this first experiment this was of interest, because some words do contain brief silences,
      and also to enable continuous speech to be encoded. Later it is attempted to record single
      words one at a time, both with and without this feature, to see how great a difference the
      omission of brief intra-word silences makes. For the purposes of the present programs, however,
      it is enough to note that the replay program is designed to skip over bytes having the value
      $FF, i.e. it treats them as silence. At the other extreme there is a minimum interval between
      consecutive zero-crossings that can be correctly resolved; this is the time taken for the
      execution of instructions in lines 180 through 300. Again, the quality of the results
      obtained suggests that intervals so brief are rare in band-limited speech.
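The effect of the Count routine on intervals longer than $FF can be sketched as follows. This is a simplified Python model, not a transcription of the assembly listing: it assumes the intervals are already known in count units, and the names are hypothetical.

```python
SILENCE = 0xFF  # count at which a byte is stored and counting restarts


def count_routine(intervals):
    """Encode zero-crossing intervals (in count units) as single bytes."""
    data = []
    for n in intervals:
        while n >= SILENCE:
            data.append(SILENCE)   # counter reached $FF with no interrupt
            n -= SILENCE
        data.append(n)             # count stored when the interrupt arrives
    return data
```

For example, `count_routine([40, 600, 12])` gives `[40, 255, 255, 90, 12]`: the long interval of 600 units is stored as two $FF bytes followed by the remaining 90 units.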
    • The hardware required for reconstitution of the zero-crossed speech is even simpler than
      that employed for acquisition. Best results are obtained if the replay is filtered to remove
      the high-frequency raggedness of the sound, and the low-frequency rumble component. This is easily
      achieved using an audio amplifier (Fig. 1.3.) equipped with bass and treble controls.
      Alternatively, a simple audio amplifier may be used in conjunction with the same pair of filters
      used for the acquisition process. The replay program, Table 1.3, generates pulses on a particular
      write line, in this case W2, whose address is decimal 61314.
      The pulses occur at intervals corresponding to the original zero-crossing intervals, and are used
      to toggle a flip-flop, Fig. 1.4, to generate the reconstituted infinitely clipped speech waveform.
      Clearly at least one address-decode write line is needed.
      See references for suitable circuitry. The function of capacitors C1 and C2 is not immediately
      obvious, but they were found necessary to ensure clean toggling operation of the flip-flop.
      I speculate that there may have been double-pulsing or ringing on the write line, although I could
      not detect it by eye on an oscilloscope. Thus these capacitors may not prove necessary, and
      I suggest experimenting with their values.
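The replay process can be modelled in a few lines of Python. This is only a sketch of the idea, not the program of Table 1.3: it assumes that a byte of $FF lets its time elapse without producing a write pulse, so that the flip-flop holds its level, and the function name is hypothetical.

```python
def replay(data, silence=0xFF):
    """Return (duration, level) segments of the flip-flop output."""
    level = 0
    held = 0
    segments = []
    for byte in data:
        held += byte
        if byte == silence:
            continue               # assumed: no pulse, output holds its level
        segments.append((held, level))
        level ^= 1                 # the write pulse toggles the flip-flop
        held = 0
    return segments
```

With the data `[40, 255, 255, 90, 12]` this gives the segments `[(40, 0), (600, 1), (12, 0)]`, so an interval that was split into several bytes during acquisition is restored to its full length.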
    • The second part of this article will deal with programs and additional hardware for building
      up a vocabulary of isolated words, and for replaying specific words under program control. It will
      also consider some of the possible uses of these techniques for producing complex sound effects
      rather than speech, and look at zero-crossed speech data graphically with a view to possibilities
      for speech recognition.
    • 1) Speech Synthesis, Practical Electronics, Nov., Dec. 1980.
      2) Interfacing Compukit, by D. E. Graham, Practical Electronics, Jan.-July 1981.

Experimental Speech System, Part Two published by Practical Electronics November 1983


    • TWO approaches have been tried for the systematic acquisition of a vocabulary and for selective
      replay of a specified word. They have both been considered because they have their respective merits.
      In the first approach the acquisition count routine halts during silences; this has the great
      advantage of ignoring redundant silence either side of a word as spoken during the period that
      the Acquisition Enable button is depressed. However, intra-word silences are elided. One solution
      to this difficulty might have been to make acquisition conditional on the state of the output of an
      envelope detector; however, this would still be unsatisfactory, because the first few milliseconds
      of a word would be missed, owing to the response time of the envelope detector. Furthermore,
      the acquisition procedure would be fooled, since it would be unable to distinguish between such
      a brief hiatus and the true end of a word.

      Yet other approaches might be tried: obviously the problem could be overcome by using a double-precision
      counter, which would time for long enough to cover the intra-word silences; but it would be wasteful of
      memory to store two bytes per intertransition interval, for much of the time the high-order byte would
      be redundant; it would also (in the case of the 6502) increase the count loop time, so making for poorer
      resolution of intervals.

      The first approach was tried, and was quite successful, apart from the elision of intra-word silences.
      In the second approach, for which programs are here presented, provision is made for the acquisition
      of silences of a limited duration, whose maximum value may be specified before execution of the
      acquisition program. The end address of the speech data is automatically recorded in a table when the
      press button switch is released. This is achieved by means of a routine to which control is transferred
      by a negative-going transition on the 6502's NMI line. The replay program regenerates any chosen word,
      the appropriate speech data being accessed simply by a number, this being the position of the word in
      the sequence that was originally stored. Note that 16K of RAM is needed, and that BASIC must be restricted
      to decimal 5567 bytes. The additional hardware needed has already been given in the lower part of Fig. 1.2
      last month, but revised, rationalised circuitry is given in Fig.2.1.
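Selecting a word by its position in the stored sequence can be sketched as below. The article does not give the start address of the word data area in this part, so `DATA_START` is a hypothetical value, and it is an assumption that each word begins where the previous one ended.

```python
DATA_START = 0x1600  # hypothetical start of the word data area


def word_bounds(end_addresses, n, data_start=DATA_START):
    """Return (start, end) addresses for word n, counting from 1.

    end_addresses holds the recorded end address of each word in the
    order in which the words were stored.
    """
    if not 1 <= n <= len(end_addresses):
        raise IndexError("no such word")
    start = data_start if n == 1 else end_addresses[n - 2]
    return start, end_addresses[n - 1]
```

A replay routine then needs only the word number: it looks up the two addresses and replays the bytes stored between them.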


    • The acquisition circuitry has been modified to avoid the need for a dual power supply, and to
      incorporate some nominal filtering to restrict the speech bandwidth. If you wish to experiment with
      different bandwidths, then omit this filtering and use the variable low and high-cut filters
      described earlier. A further refinement is the provision of an l.e.d. driven by the comparator
      output; this is helpful in setting the comparator threshold
      just above the noise level: VR1 should be set so that the l.e.d. just stays off at the ambient
      'silence' level. (It is important that background noise be reasonably low within the operating
      bandwidth). A delay interposed between the comparator and the l.e.d. stretches brief threshold
      crossings enough to visibly light the l.e.d.
    • Quite acceptable speech output can be obtained, without the need for an audio amplifier, by
      connecting a small loudspeaker to the flip-flop as shown. Placing the loudspeaker cone downwards
      in a small plastic
      bowl of suitable tapering diameter was found to provide a beneficial resonance; this is something
      to experiment with.
    • Setting up the acquisition circuitry involves only adjustment of VR1 as already described.
      To verify that the D-type flip-flop can toggle correctly, fill up part or all of the word data
      storage area with some arbitrary interval value (except 255, which is treated as silence), and execute
      the routine for continuous replay, i.e. RPFF3 given earlier. This should yield a continuous tone of
      steady pitch; if it sounds ragged or irregular, then try changing the values of the capacitors marked
      with an asterisk.
    • The circuitry is not complicated or critical in layout, and because it is regarded as experimental
      and open to improvement, a p.c.b. design has not been provided. It can readily be assembled on Veroboard
      or fitted into space on an existing board.

    • Having set up the hardware, the programs may now be tried out.
      Enter a value into SILMAX, i.e. page zero location 51 hex. This gives the maximum length of continuous
      silence that can be acquired; a value of 10 hex seems about right. Now execute the acquisition program
      from 029B hex. No speech data will be stored in memory until the Acquisition Enable button is pressed.
      Thus the procedure is as follows: decide on the vocabulary sequence you wish to enter; execute the program
      from the machine code monitor; press the button and hold down, starting to speak the word as soon as the
      button is depressed, and releasing the button as soon as the word is finished; continue in this fashion
      until the program returns you to the monitor, which will happen when the memory storage area for either
      the word data or the end-of-word address table becomes filled up. Depending upon the bandwidth in use, up
      to 20 average-length words can be stored. If you have more than 16K of memory, then you can modify the
      programs to increase the vocabulary storage capacity.
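The role of SILMAX can be illustrated with a small Python model. The article does not state the units of SILMAX; the sketch assumes that it counts consecutive silence bytes of value $FF, and that silence beyond the limit is simply not stored. The function name is hypothetical.

```python
def acquire_with_limit(byte_stream, silmax=0x10, silence=0xFF):
    """Store interval bytes, keeping at most `silmax` silence bytes in a row."""
    stored = []
    run = 0
    for b in byte_stream:
        if b == silence:
            run += 1
            if run > silmax:
                continue           # silence beyond the limit is dropped
        else:
            run = 0                # any real interval resets the limit
        stored.append(b)
    return stored
```

Under these assumptions a larger SILMAX preserves longer pauses within a word at the cost of memory, while a smaller value trims them.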

      Having entered the vocabulary sequence, verify that page zero location 57 hex contains the number of
      words which you were able to enter; then have a look at the contents of the table of end addresses
      of words; the two-byte values in low-high order should be in an ascending order.
      The replay routine is conveniently tested from BASIC using the test programs given, one of which
      replays a single specified word, the other replaying the entire vocabulary out in sequence. Clearly the
      replay routine can be called from any other program of your choice.
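The check on the table of end addresses can be expressed as a short Python sketch. It assumes the table has been copied out of memory as a flat list of byte values, low byte first; the function names are hypothetical.

```python
def decode_table(raw):
    """Combine pairs of bytes, stored low byte first, into 16-bit addresses."""
    return [raw[i] | (raw[i + 1] << 8) for i in range(0, len(raw) - 1, 2)]


def is_ascending(addresses):
    """True if every end address is higher than the one before it."""
    return all(a < b for a, b in zip(addresses, addresses[1:]))
```

For instance, the bytes `00 17 40 1A 00 1F` decode to the addresses $1700, $1A40 and $1F00, which are in ascending order as expected.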

      It is actually possible to manage without a microphone and preamplifier, because you can store
      the vocabulary sequence on cassette tape, and then replay the tape recording into the acquisition circuitry
      at full volume, pressing and releasing the button switch as described above; a particular vocabulary for use
      with a particular program can be stored on tape along with the program; clearly the vocabulary data could also
      be stored in its encoded form, in which case the acquisition procedure would not have to be repeated. If you
      have a disc-based system, then the speech facility becomes much more useful.


  • Practical Electronics, November 1983