

Advanced Music Synthesis

  • published in E and MM, November 1981, pages 51/52, by Alan Davies


  • 'Twas brillig, and the slithy toves Did gyre and gimble in the wabe; All mimsy were the borogoves, And the mome raths outgrabe. 'Beware the Jabberwock, my son! The jaws that bite, the claws that catch! Beware the Jubjub bird, and shun The frumious Bandersnatch!' from 'Jabberwocky' - Lewis Carroll. Reproduced by kind permission of Frederick Warne Publishers.
... or is there more to this than meets the ear?
  • In this article we take a look at some methods of electronic speech production and manipulation and investigate some applications of this technology in the recording industry and the rapidly expanding market for the 'talking chip'.
  • Undeniably some of the most interesting and frequently employed special effects in recent popular music have been various treatments and subtle uses of the human voice and the qualities which it possesses. A very good example of the use of the untreated voice is David Bowie's 'Ashes to Ashes'. A close examination of this reveals an extremely compelling use of background voices which seem to half chant, half whisper the words of the song. This has the effect of drawing the listener's interest in just the same way as two people whispering across the room arouses curiosity as to what is being said. This ploy commands attention and is certainly a powerful musical 'hook'. Another common effect which trades upon vocal qualities is the ubiquitous Wah-Wah pedal - creatively used in the film music 'Theme from Shaft'. Related to this are the 'Mouth Tube' and the much more sophisticated 'Vocoder'. One of the earliest examples of the use of the latter's sound in popular music was 'Sparky's Magic Piano'. More recently ELO's 'Mr. Blue Sky' and television's talking robot 'Metal Mickey' have both used this equipment.
  • But why this fascination with vocal effects? To explain this it is important to appreciate the way in which the human ear responds to sounds. Recent research has shown that the hearing system is not only sensitive to the frequency and amplitude of an incoming signal but also to the way in which both these parameters vary temporally. For example, if a pure sine wave (no harmonics) of constant pitch and amplitude is played to a listener then he soon tires of this - the ear becomes fatigued by the stimulus. If however the signal is mildly frequency modulated (i.e. a slight vibrato introduced) at a rate of say 8-10Hz then the ear is able to sustain a greater exposure to this before fatigue sets in. The same principle applies to the introduction of amplitude modulation (tremolo): in both cases the incoming signal is more interesting to the ear. If the principles of frequency and amplitude modulation are now extended to waveforms having much higher harmonic content (such as a ramp wave) then the result is even more interesting, as modulations of the fundamental then produce more complex modulations of the harmonic structure, resulting in an extremely 'active' sound. This discussion tends to suggest that there are receptors within the auditory system which are 'tuned' to detect both frequency and amplitude variations in an incoming signal.
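The vibrato effect described above is easy to model in software. The following is a small illustrative Python sketch, not something from the article: the sample rate, modulation depth and duration are my own assumed figures. The pitch wobble is produced by accumulating phase from an instantaneous frequency that swings around the centre pitch at the modulation rate.

```python
import math

SAMPLE_RATE = 8000  # assumed sample rate for this sketch

def vibrato_sine(freq_hz, mod_rate_hz, mod_depth_hz, duration_s):
    """Generate a sine wave whose pitch wobbles at mod_rate_hz (vibrato)
    by accumulating phase from an instantaneous frequency."""
    samples = []
    phase = 0.0
    for i in range(int(SAMPLE_RATE * duration_s)):
        t = i / SAMPLE_RATE
        # instantaneous frequency swings +/- mod_depth_hz around freq_hz
        inst_freq = freq_hz + mod_depth_hz * math.sin(2 * math.pi * mod_rate_hz * t)
        phase += 2 * math.pi * inst_freq / SAMPLE_RATE
        samples.append(math.sin(phase))
    return samples

# One second of a 440Hz tone with an 8Hz vibrato of +/-5Hz:
tone = vibrato_sine(440.0, 8.0, 5.0, 1.0)
```

Tremolo is the same idea applied to the amplitude instead of the frequency: multiply each sample by a slowly varying gain rather than modulating the pitch.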
  • Indeed it has been shown that they are even capable of determining the shape of the modulating waveforms! Thus it is transients within a sound which are important to maintain interest and also very important when it comes to the recognition of, for example, musical instruments or speech. This may be easily seen when you try to simulate the sounds of conventional musical instruments on a synthesiser. The problem arises from the fact that it is very difficult to introduce sufficient variation of both frequency and amplitude into the waveforms produced. With sounds of short duration it is just about possible to deceive the ear, but with any sustained sound such as the imitation of a held violin or oboe note the ear is able to detect the too regular nature of the waveform and labels it as electronically generated. With the advent of the new generation of computer synthesisers (such as the Fairlight CMI), real-time control of frequency and amplitude parameters is possible to a very fine degree, but it is still difficult to produce a really convincing 'held sound' - the waveforms are still too 'perfect' and do not possess the unpredictable irregularities common to all natural sounds. So, the human ear is highly sensitive to changes in an incoming signal and this is a clue as to the power of the human voice as a communicator and also its magnetic attraction when used for special effects. There is no more flexible sound generator known to man than the human voice. It is capable of extremely precise amplitude, frequency and harmonic control over a relatively wide range, resulting in a vast repertoire of expression.
  • How is the human voice able to achieve all this? Let's take a closer look at the way in which speech is produced and the reasons why certain sounds are described as possessing a 'vocal character'. Speech is composed of two main component sounds: (1) VOICED SOUNDS. These are produced when air from the lungs is forced between the vocal chords, which are situated in the windpipe, causing these membranes to vibrate and a pulsating column of air to enter the mouth and nasal cavities. The fundamental pitch of the resultant note is determined by the length, thickness and tension of the vocal chords. (2) UNVOICED SOUNDS. If the air from the lungs is not forced through the vocal chords but simply expelled through the mouth then unvoiced sounds such as 'f' or 'h' are produced. These are very similar in nature to the sounds which may be produced by the filtering of a 'white noise' source. The shape of the mouth and the nasal cavities determines the character of both the above types of sound - they act as complex filters, the response of which is variable by altering the shape of the mouth. (Try vocalising the sound 'ah' and then slowly altering the shape of the mouth and listen carefully for the changes in the harmonic structure which results from this. All the vowel sounds can be produced in this manner). Precise variations are obtained by movements of the tongue and lips which alter the resonant features of the filter system, creating areas in which certain frequencies are boosted and others cut. The ranges in which frequencies are boosted are known as formant bands (which are also present in the resonant structures of musical instruments and largely account for their different sounds - each instrument can be said to have its own formant 'fingerprint'). The lips play a particularly important role in the production of sounds which may be distinguished by their dynamic amplitude characteristics such as the percussive attack transients in sounds such as 'p'. 
Overall then, the voice may be regarded as a complex sound generating instrument consisting of an amplitude and frequency controlled oscillator (vocal chords and lungs), a noise generator (lungs) and a set of formant filters (mouth and nasal cavities). Viewed in this light it would seem that the basic ingredients required for voice production are available on a conventional music synthesiser, and this poses the question as to the feasibility of producing vocal sounds using conventional synthesis techniques. These would involve using a voltage controlled oscillator to simulate the vocal chords (ensuring that the waveform produced is sufficiently rich in harmonic content, e.g. a pulse wave) and a noise generator for the unvoiced sounds. Circuitry would be required to switch back and forth between these two sound sources depending on whether voiced or unvoiced sounds were desired. For the filtering section, a bank of voltage controlled bandpass filters could be used, each tuned a quarter or a third of an octave apart, covering the area of the audio band in which speech components are most prominent (approx. 150-8000Hz). The array of filters would be similar to those employed in a graphic equaliser except that those of course are not voltage controlled. If you possess a graphic equaliser with sufficient frequency discrimination between its bands - preferably a 20 channel unit - then you can have a go at simulating various vowel shapes on it by using a pulse wave as a signal source and adjusting the slider controls to approximate the outlines of the frequency spectra of the vowel shapes shown in Figure 1.
  • Figure 1. Guide to frequency spectra of vowel shapes.
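If you have no suitable graphic equaliser, the pulse-source-plus-formant-filter idea can also be tried in software. The sketch below is a toy Python model, not the article's circuit: the sample rate, filter Q and the 'ah' formant frequencies are illustrative guesses of my own, not values taken from Figure 1. A harmonically rich pulse wave stands in for the vocal chords and is passed through three resonant bandpass filters acting as formants.

```python
import math

FS = 8000  # assumed sample rate

def pulse_wave(freq, duration, duty=0.1):
    """Harmonically rich pulse train standing in for the vocal chords."""
    period = FS / freq
    return [1.0 if (i % period) < duty * period else -0.1
            for i in range(int(FS * duration))]

def bandpass(signal, centre, q=8.0):
    """Biquad bandpass filter acting as one formant resonance."""
    w = 2 * math.pi * centre / FS
    alpha = math.sin(w) / (2 * q)
    a0 = 1 + alpha
    b0, b2 = alpha / a0, -alpha / a0
    a1, a2 = -2 * math.cos(w) / a0, (1 - alpha) / a0
    out, x1, x2, y1, y2 = [], 0.0, 0.0, 0.0, 0.0
    for x in signal:
        y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2
        x1, x2, y1, y2 = x, x1, y, y1
        out.append(y)
    return out

def vowel(freq, formants, duration=0.5):
    """Sum of formant-filtered copies of the pulse source."""
    src = pulse_wave(freq, duration)
    bands = [bandpass(src, f) for f in formants]
    return [sum(vals) for vals in zip(*bands)]

# Rough first three formants for an 'ah'-like vowel (illustrative figures):
ah = vowel(110.0, [700.0, 1100.0, 2450.0])
```

Changing the three centre frequencies moves the simulated vowel around, exactly as moving the sliders on an equaliser would.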

  • Interesting backing sounds for songs may be produced in this way, especially if the output from the equaliser is passed through a chorus unit, producing not just one vowel sound but a multiple effect. Thus far the possibility of speech production using conventional synthesis techniques seems on the cards. However, it is when considering the extremely complicated control voltages which would be required to manipulate the filter bank that we come up against the main snag with this system. How can we overcome this problem? One possibility is to store the control voltages digitally, resulting in a hybrid analogue-digital speech synthesiser. Control voltages would be stored in ROM (Read Only Memory) and a microprocessor could read these out and convert them via a D/A (Digital to Analogue) converter into analogue voltages for the filter bank. This would enable a limited vocabulary of words to be produced, governed by the storage capacity of the ROM. A slightly different approach to speech synthesis, and one which is now becoming more commonplace, is the entirely digital system. This is in some ways an extension of the one described above, in that the components of words are stored in ROM. Now the data stored is such that when read out and fed through a D/A converter, the analogue voltage produced is no longer just a control voltage to be applied to a filter bank but may be immediately fed to an amplifier and will produce the desired sound of say a vowel or consonant. The beauty of this system is that instead of having to store lots of control voltages - up to 22 for each sound in a large hybrid system consisting of 22 bandpass filter channels - it is now possible to store fewer values for the same resultant sound. A further extension of this principle leads to even more compact storage of words.
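The fully digital scheme amounts to holding quantised waveform samples in ROM and replaying them through a D/A converter. A toy Python sketch of that round trip (all figures are my own assumptions - 8-bit samples at 8000 samples per second - not values from the article):

```python
import math

def quantise_8bit(samples):
    """Encode samples in the range -1..1 as unsigned bytes,
    as they might sit in a speech chip's ROM."""
    return bytes(int(round((s + 1.0) / 2.0 * 255)) for s in samples)

def dac_8bit(rom):
    """Read the 'ROM' back out through a model 8-bit D/A converter."""
    return [b / 255.0 * 2.0 - 1.0 for b in rom]

# One second of a 440Hz tone at an assumed 8000 samples/second:
wave = [math.sin(2 * math.pi * 440 * i / 8000) for i in range(8000)]
rom = quantise_8bit(wave)
decoded = dac_8bit(rom)

# With 8 bits the worst-case reconstruction error is about half a
# quantisation step, i.e. roughly 0.004 on a -1..1 scale:
max_err = max(abs(a - b) for a, b in zip(wave, decoded))
```

The storage cost is the point: even at this modest rate, one second of sound costs 8000 bytes of ROM, which is why the compact word-building scheme described next matters.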
  • Consider the following: better; batter; matter; match; fetch; mud. It is possible to divide these up into component sounds:

better - (1)beh (2)tur
batter - (3)bah (2)tur
matter - (4)mah (2)tur
match - (4)mah (5)ch
fetch - (6)feh (5)ch
mud - (7)muh (8)d

From these individual components it is now possible to make new words such as:

fetter - (6)feh (2)tur
batch - (3)bah (5)ch
bed - (1)beh (8)d
bad - (3)bah (8)d
mad - (4)mah (8)d
much - (7)muh (5)ch
fed - (6)feh (8)d
etc.
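The component-sound scheme above is, at heart, a lookup table. A minimal Python sketch using the article's own numbering (the `speak` function name is my own, purely illustrative):

```python
# The article's eight numbered component sounds:
components = {
    1: "beh", 2: "tur", 3: "bah", 4: "mah",
    5: "ch", 6: "feh", 7: "muh", 8: "d",
}

# Words as sequences of component numbers, as listed in the text:
lexicon = {
    "better": [1, 2], "batter": [3, 2], "matter": [4, 2],
    "match": [4, 5], "fetch": [6, 5], "mud": [7, 8],
    "fetter": [6, 2], "batch": [3, 5], "bed": [1, 8],
    "bad": [3, 8], "mad": [4, 8], "much": [7, 5], "fed": [6, 8],
}

def speak(word):
    """Return the sequence of stored sounds a talking chip would replay."""
    return [components[n] for n in lexicon[word]]
```

Here `speak('fetter')` yields `['feh', 'tur']` - thirteen words from only eight stored sounds, which is exactly the economy the talking chip exploits.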
  • As may be seen, an extension of this system will result in a large vocabulary being available from relatively few component parts. This is the method which is used in various devices such as the talking calculator or spelling game, or in anything which uses a 'talking chip'. So much for methods of producing speech 'from scratch'. Let's return to the topic of vocal special effects. These make use of an existing human voice and subject it to different forms of electronic processing. One of the most obvious of these is of course the addition of either reverberation or echo. Another is the use of a frequency-shifter to generate the effect of two voices singing together a fixed interval apart. But perhaps one of the most popular vocal effects units is the vocoder. Vocoding or VOice-CODING is not a new concept. Indeed the original idea was conceived before the Second World War. There was interest in Germany in the thirties due to the military potential of the unit for encoding secret messages. The first person to use the term vocoder to describe a commercial unit was an American called Homer Dudley who in 1936 devised a machine for the compression of the bandwidth of speech for transmission purposes. The modern vocoder still operates on the same principles, namely that of the real-time superimposition of speech onto a 'carrier signal' - nowadays this usually means a musical instrument. Utilising this system it is possible to make almost anything speak, from a guitar to a full symphony orchestra.
  • The way in which the unit works may be seen by referring to the block diagram of the circuitry of a typical vocoder (Figure 2).
  • This is somewhat simplified but gives an overall view of the processes involved. Speech is input at point 'A' and is then split up into discrete frequency bands by a series of bandpass filters. At the output of each of these there is an envelope follower which produces a DC voltage proportional to the amplitude of the signal present in the particular frequency band. The bank of bandpass filters thus produces a series of control voltages which precisely follow the frequency spectrum of the incoming speech signal. These control voltages are used to control a bank of VCAs (Voltage Controlled Amplifiers) as shown. Connected to the signal input of each one of these is the 'carrier signal' (e.g. a guitar sound) which enters the vocoder at point 'B'. This carrier is used for the production of the 'voiced' portions of the speech, and a noise generator for the 'unvoiced'. The circuit which selects either 'carrier' or 'noise' is the 'voiced/unvoiced detector'. This compares the relative levels of high and low frequencies in the incoming speech signal. When there is a higher proportion of frequencies above 4000Hz than below, the noise generator is switched in, as the component of speech being input at that moment will be 'unvoiced'. The outputs of the VCAs go to a bank of bandpass filters identical to those used for the analysis of the incoming speech signal. Therefore, the control voltages derived from the speech input now determine the amplitude of each frequency band in the carrier signal allowed through to the output summing amplifier. The speech has therefore imposed its frequency spectrum on the musical carrier. Result - talking music! The combination of the transient nature of both speech and music which this unit affords provides a formidable tool for the making of aurally arresting sound effects which, if used sparingly, will always demand the listener's interest. So, we have now considered some methods of speech production and processing but what of this Jabberwocky? 
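The chain of analysis filters, envelope followers and VCAs just described can be sketched in a few dozen lines of Python. This is a toy model under stated assumptions, not the circuit of Figure 2: the sample rate, filter Q, band centres and test signals are all my own illustrative choices, and the voiced/unvoiced detector is omitted for brevity.

```python
import math

FS = 8000  # assumed sample rate for this sketch

def bandpass(signal, centre, q=5.0):
    """Biquad bandpass filter: one analysis/synthesis channel."""
    w = 2 * math.pi * centre / FS
    alpha = math.sin(w) / (2 * q)
    a0 = 1 + alpha
    b0, b2 = alpha / a0, -alpha / a0
    a1, a2 = -2 * math.cos(w) / a0, (1 - alpha) / a0
    out, x1, x2, y1, y2 = [], 0.0, 0.0, 0.0, 0.0
    for x in signal:
        y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2
        x1, x2, y1, y2 = x, x1, y, y1
        out.append(y)
    return out

def envelope(signal, smoothing=0.995):
    """Envelope follower: rectify then one-pole lowpass - this is
    the 'DC control voltage' described in the text."""
    env, level = [], 0.0
    for x in signal:
        level = smoothing * level + (1 - smoothing) * abs(x)
        env.append(level)
    return env

def vocode(speech, carrier, centres):
    """Impose the speech's per-band envelopes on the carrier
    (the VCA stage), then sum the bands."""
    out = [0.0] * len(carrier)
    for f in centres:
        env = envelope(bandpass(speech, f))
        band = bandpass(carrier, f)
        for i in range(len(out)):
            out[i] += band[i] * env[i]
    return out

# Half a second of tremolo'd 500Hz 'speech' on a 110Hz ramp 'carrier':
speech = [math.sin(2 * math.pi * 500 * i / FS)
          * (0.5 + 0.5 * math.sin(2 * math.pi * 4 * i / FS))
          for i in range(4000)]
carrier = [((i * 110 / FS) % 1.0) * 2 - 1 for i in range(4000)]
talking = vocode(speech, carrier, [250.0, 500.0, 1000.0, 2000.0])
```

A practical unit uses many more than four channels, but the principle is identical: the speech supplies the control voltages, the carrier supplies the sound.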
Perhaps, when it comes to effects, it's not so much what is said but the way it sounds which is important! E and MM