IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-22, NO. 5, OCTOBER 1974, 353

Average Magnitude Difference Function Pitch Extractor

MYRON J. ROSS, MEMBER, IEEE, HARRY L. SHAFFER, ANDREW COHEN, MEMBER, IEEE, RICHARD FREUDBERG, AND HAROLD J. MANLEY

Abstract—This paper describes a method for using the average magnitude difference function (AMDF) and associated decision logic to estimate the pitch period of voiced speech sounds. The AMDF is a variation on autocorrelation analysis where, instead of correlating the input speech at various delays (where multiplications and summations are formed at each value of delay), a difference signal is formed between the delayed speech and the original and, at each delay, the absolute magnitude of the difference is taken. The difference signal is always zero at delay = 0, and exhibits deep nulls at delays corresponding to the pitch period of voiced sounds. Some of the reasons the AMDF is attractive include the following. 1) It is a simple measurement which gives a good estimate of pitch contour, 2) it has no multiply operations, 3) its dynamic range characteristics are suitable for implementation on a 16-bit machine, and 4) the nature of its operations makes it suitable for implementation on a programmable processor or in special-purpose hardware. The implementation of the AMDF pitch extractor (nonreal-time simulation and real-time) is described and experimental results are presented to illustrate its basic measurement properties.

I.
INTRODUCTION

IN RECENT years, digital speech compression techniques, which model the human mechanism for generating speech sounds, have been very successful for information rates of 7200 bits/s and below [1], [2]. These schemes generally comprise a synthesizer filter (referred to as the vocal tract filter) which performs spectral shaping on an excitation signal much the same way the vocal tract shapes the air stream generated by the lungs and respiratory muscles, and modulated by the vocal cords (see Fig. 1). During voiced synthesis, the digital equivalent of an impulse carrier (sometimes referred to as a buzz source) operating at the measured pitch rate is convolved with the vocal tract impulse response to produce the synthetic output speech. A noise carrier is used to excite the vocal tract response filter during unvoiced synthesis. The vocal tract impulse response used in this process is a discretely time-varying function whose parameters are updated at the frame updating rate. The problem of generating the pitch impulse carrier (or unit sample pitch carrier) is primarily one of determining the points in time at which unit sample pitch pulses occur. Since the human ear is sensitive to small variations in this carrier, a method for generating an acceptable pitch contour is critical to the scheme if high quality speech is to be generated.

Manuscript received November 26, 1973; revised April 24, 1974. This work was supported by the Narrowband Systems Group, Electronics Command, Fort Monmouth, N. J., under Contract DAAB07-71-C-0207. M. J. Ross is with CNR, Newton, Mass. 02159. H. L. Shaffer, A. Cohen, R. Freudberg, and H. J. Manley are with GTE Sylvania, Needham, Mass.

II. AUTOCORRELATION AND CROSS-CORRELATION ANALYSIS

It is well known that the autocorrelation function (ACF) of a speech signal (of suitable length) can be used for pitch detection [see Fig. 2(c)].
It is generally not necessary to compute the entire autocorrelation function for each L second segment of speech; values of delay (or shift) falling in a search range of approximately 3-15 ms are usually computed. Since values of pitch frequency generally fall within the range of 70-300 Hz (corresponding to the search range of 3-15 ms), much data processing can be eliminated in ACF pitch detection by obviating the need to compute the entire ACF of each L second analysis interval for shifts outside this range.¹ This portion of the ACF is then "scanned" for a strong peak. Assuming that various voicing criteria are satisfied in the pitch logic, the pitch period is then taken to be the location or delay value of the "strong" peak relative to zero delay, or the true origin of the ACF. A further reduction in data processing can be achieved by computing over a portion of the analysis interval L' [Fig. 2(b)], where L' < L, which is cross correlated with the full L second interval [Fig. 2(d)]. This is the cross-correlation function (CCF). A truncated version of this function [Fig. 2(e)] is quite suitable for the detection of pitch. The length L' is chosen in accordance with the expected pitch period, e.g., it may be approximately equal to two pitch periods of the normal male speaker (i.e., 100 Hz). Under these conditions, L' is about 20 ms and L might typically be 36 ms, resulting in a pitch search range of L − L', or about 16 ms. Experimental results have shown that L' can be as low as 8-9 ms and L as low as 23 ms without causing severe deterioration in the pitch estimation process. An advantage of the CCF method is that the relative sizes of the correlation peaks tend to remain constant as a function of delay. In this regard, an essentially linear decrease in correlation peak size versus delay can be observed in the ACF method. On the other hand, the constancy of correlation peak size in Fig.
2(e) occurs since there is always full overlap of data between the two segments being cross correlated in the CCF method. This is not the case in the ACF method, since data overlap falls off linearly with delay. Because the correlation peaks tend to remain relatively large in the CCF, they are more easily discernible in the detection of pitch.

Mathematically, the autocorrelation function of an L second segment of digitized speech is defined as

ACF(τ) = Σ_{j=1}^{L−τ} S_j S_{j+τ},   τ = 0, 1, ..., τ_max    (1)

and the cross-correlation function of the L' second subinterval with the full segment as

CCF(τ) = Σ_{j=1}^{L'} S_j' S_{j+τ},   τ = 0, 1, ..., τ_max    (2)

where

S_j = jth sample of the speech waveform vector
S_j' = jth sample of the speech waveform subvector
L' = portion of speech segment
τ = delay value
τ_max = maximum delay shift (τ_max ≤ L − L')
(S_j) = (S_1, S_2, ..., S_L)
(S_j') = (S_1, S_2, ..., S_{L'}).

¹ Also, only one-half of the ACF need be computed since it is an even function.

Fig. 1. Model of the speech production mechanism. [Block diagram: a pulse generator (voiced sounds) and a noise generator (unvoiced sounds) feed a voiced/unvoiced switch; the selected excitation drives the synthesizer (time-varying vocal tract) filter, with updated filter parameters, to produce output speech.]

Fig. 2. Comparison of autocorrelation and cross-correlation functions. (a) Typical L second segment of voiced speech. (b) Subinterval consisting of first L' seconds. (c) Autocorrelation function. (d) Cross-correlation function of S_L and S_{L'}. (e) Subinterval of CCF(τ) consisting of delay values from 0 to L − L'.

If L and L' are chosen judiciously, the truncated CCF minimizes computational requirements yet contains sufficient data to yield accurate pitch estimation.
III. AVERAGE MAGNITUDE DIFFERENCE FUNCTION (AMDF) [3], [4]

A variation of autocorrelation analysis for measuring the periodicity of voiced speech uses the AMDF, defined by the relation

D_τ = (1/L) Σ_{j=1}^{L} | S_j − S_{j−τ} |,   τ = 0, 1, ..., τ_max    (3)

where S_j are the samples of input speech, (S_j) = (S_1, S_2, ..., S_L), and S_{j−τ} are the samples time shifted τ seconds. The vertical bars denote taking the magnitude of the difference S_j − S_{j−τ}. Thus a difference signal D_τ is formed by delaying the input speech various amounts, subtracting the delayed waveform from the original, and summing the magnitude of the differences between sample values. The difference signal [Fig. 4(b)] is always zero at delay = 0, and is observed to exhibit deep nulls at delays corresponding to the pitch period of a voiced sound having a quasi-periodic structure.

An approximate expression that provides a useful relationship between the AMDF and the ACF of a sampled sequence will now be developed. This relationship is based on the well known bound

(1/N) Σ_{k=0}^{N−1} | x_k | ≤ [ (1/N) Σ_{k=0}^{N−1} x_k² ]^{1/2}.    (4)

In (4), the left side is the average magnitude of the sample sequence {x_k} while the right side of the equation is the rms value of the sequence. The bound represented by (4) is readily established by Schwarz's inequality [5], [6]. The AMDF for a sequence of samples {S_k} is defined by the relation

D_n = (1/N) Σ_k | S_k − S_{k−n} |    (5)

where the delay index n ranges from −(N − 1) to +(N − 1) (i.e., n = −(N − 1), ..., −1, 0, 1, 2, 3, ..., N − 1) to generate the complete AMDF. In implementing (5), the summing index k ranges from k = n to k = N − 1 for n ≥ 0. That is, the AMDF is formed only in the region of overlap of the sequences S_k and S_{k−n}. Thus for n < 0, the summing index k ranges from 0 to N − 1 + n. It is seen that D_n is an even function (i.e., D_n = D_{−n}) according to the above definitions.
Using (4) we can approximate D_n in the form

D_n ≈ β_n [ (1/N) Σ_k (S_k − S_{k−n})² ]^{1/2}.    (6)

In (6), the coefficient β_n is a scale factor. For Gaussian sequences it is possible to determine a value for β_n (analytically) that would achieve equality on the average between the average magnitude and rms sums. For other distributions, a value for β_n can be determined experimentally by testing a large number of sequences. It is evident that β_n depends upon the joint probability density function (pdf) of S_k and S_{k−n}. Since the joint pdf of S_k and S_{k−n} will in general vary with the delay index n, the coefficient β_n will therefore be a function of n. Our experience is that β_n can vary from about 0.6 to 1.0 depending upon the sampled sequence, but is not a rapidly varying function of the delay index n.

By expanding the squared term in braces under the square root sign in (6) we can express D_n in the form

D_n ≈ β_n [ (1/N) Σ_k S_k² + (1/N) Σ_k S_{k−n}² − (2/N) Σ_k S_k S_{k−n} ]^{1/2}.    (7)

By defining the ACF of the sequence {S_k} as

R_n = (1/N) Σ_k S_k S_{k−n}    (8)

it is seen that the third sum in the braces is −2R_n. Assuming that the sequence {S_k} corresponds to a stationary process, it is evident that the first two sums in (7) are simply the ACF evaluated at n = 0. That is,

(1/N) Σ_k S_k² = (1/N) Σ_k S_{k−n}² = R_0    (9)

under the assumption of stationarity. Using (8) and (9) in (7) yields D_n as

D_n ≈ β_n [ 2(R_0 − R_n) ]^{1/2}.    (10)

The properties of the AMDF are accurately characterized by (10). Specifically, the AMDF is seen to be zero at zero delay (n = 0) and varies as the square root of the ACF that has been negated and "dc shifted" by R_0. This is precisely the character of the AMDF that has been observed experimentally in actual computations using the definitions of (5). Nulls will appear in D_n at those points where R_n is comparable to R_0. This occurs when the sequence {S_k} is taken from a voiced speech sound containing two or more pitch periods in the sequence. The separation of the nulls is a direct measure of the pitch period.
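The relationship (10) can be checked numerically. For a zero-mean Gaussian sequence, the ratio of average magnitude to rms is √(2/π) ≈ 0.8, so the empirical β_n should sit comfortably inside the 0.6-1.0 range the text reports. This sketch (the function name and test sequence are mine) estimates β_n directly from (5), (8), and (10):

```python
import math
import random

def empirical_beta(s, n):
    # Ratio D_n / sqrt(2 * (R_0 - R_n)), with D_n and R_n estimated over
    # the region of overlap and a common 1/N normalization.
    N = len(s)
    d_n = sum(abs(s[k] - s[k - n]) for k in range(n, N)) / N
    r0 = sum(x * x for x in s) / N
    rn = sum(s[k] * s[k - n] for k in range(n, N)) / N
    return d_n / math.sqrt(2 * (r0 - rn))
```

For white Gaussian input, the estimate is close to √(2/π) at every delay, consistent with β_n being a slowly varying function of n.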
Fig. 3. Sketches of R_n, (R_0 − R_n), and (R_0 − R_n)^{1/2} for a typical periodic ACF.

Reference is made to Fig. 3, showing sketches of a typical (periodic) ACF, R_n, followed by the negated and shifted ACF, (R_0 − R_n). This latter function is compared with its square root, (R_0 − R_n)^{1/2}. The square root characteristic is helpful in increasing the resolution of pitch measurement (when comparing the AMDF with the ACF) in that the reduced width of the nulls enables greater accuracy in determining their locations. These characteristics of the AMDF predicted by (10) can be observed in the oscilloscope pictures of Fig. 4. The experimental results that have been observed indicate that (10) not only provides an accurate characterization of the AMDF in terms of the ACF, but could be used to compute the ACF from the AMDF² with many fewer multiplies than are required in computing the ACF directly.

Fig. 4 compares oscilloscope patterns of the ACF and AMDF for 36 ms of the voiced sound "aw". D_τ is zero at τ = 0, corresponding to the peak of the ACF. D_τ also exhibits deep nulls at time spacings equal to the pitch period, corresponding to the highest peak of the ACF, and thus provides an indication of the periodic structure of the voiced sound. In Fig. 6, for the unvoiced segment "sh," neither an ACF secondary peak nor an AMDF null is observed, since a periodic structure does not exist for this sound. Fig. 5 shows that the spacing between pitch peaks for the "aw" sound is still evident despite a 10 dB signal-to-noise ratio.

Fig. 5. Oscilloscope pictures of ACF and AMDF of voiced speech sound "aw" for 10 dB SNR. (a) Speech signal (top trace) and its ACF (bottom trace) for 10 dB SNR. (b) Speech signal (top) and its AMDF (bottom) for 10 dB SNR.
The AMDF is a variation of ACF analysis where, instead of correlating the input speech at various delays (where multiplications and summations are formed at each value of delay), a difference signal is formed between the delayed speech and the original, and at each delay value the absolute magnitude is taken. Unlike the autocorrelation or cross-correlation function, however, the AMDF calculations require no multiplications, a desirable property for real-time applications, with acceptable accuracy for many speech processing applications. For each value of delay, computation is made over an integrating window of L' samples, similar to the procedure used to obtain the truncated cross-correlation function of Fig. 2(e). To generate the entire range of delays, the window is "cross differenced" with the full analysis interval. An advantage of this method is that the relative sizes of the nulls tend to remain constant as a function of delay. This is because there is always full overlap of data between the two segments being cross differenced.

Fig. 6. Oscilloscope pictures of ACF and AMDF of unvoiced speech sound "sh" for high SNR. (a) Speech signal (top) and its ACF (bottom) for high SNR. (b) Speech signal (top) and its AMDF (bottom) for high SNR.

In extractors of this type, the limiting factor on accuracy is the inability to completely separate the fine structure from the effects of the spectral envelope. For this reason, decision logic and prior knowledge of voicing are used along with the function itself to help make the pitch decision more reliable.

IV. PITCH DETECTION LOGIC

Fig. 7 shows the set of logical rules developed for extraction of pitch information from the AMDF. It is comparable in complexity to the logic one might find in an ACF pitch extractor. There are five separate logic paths, each of which is selected based on the three most recent voiced/unvoiced (VUV) decisions.
The parameter LOGIC is the weighted number obtained by treating the three consecutive decisions as a binary number, i.e.,

LOGIC = VUV(n) + 2 VUV(n − 1) + 4 VUV(n − 2)    (11)

where VUV(n) = 0 if the nth interval was unvoiced, and VUV(n) = 1 if the nth interval was voiced. The range of values that LOGIC can assume is 0-7. Thus, there are eight possible conditions that the pitch logic is designed to handle. Thresholds indicated in the flow chart were determined empirically by examining speech data for many different utterances and speakers.

In path A, the present VUV decision is unvoiced and the logic asks whether this decision should be changed to voiced. A change is justified by the presence of a strong periodic waveform within the analysis interval.

In path B, the present VUV decision is voiced. Normally the pitch should equal the minimum position of the AMDF in the search range. However, an unvoiced decision can occur if either the maximum AMDF value is not sufficiently strong or the ratio of the maximum to minimum value is below the specified threshold.

In path C, the nth and (n − 1)th VUV decisions are voiced but the (n − 2)th interval was unvoiced. This is an indication of the onset of voicing; the pitch extractor changes to voiced and chooses the minimum value of the AMDF as the pitch.

In path D, we extend voicing an additional frame when the VUV decision indicates unvoicing after an extended period of voicing. If the frame should actually be unvoiced at this point, the speech waveform, typically, is of such low amplitude that it makes little difference to the synthesizer. More importantly, though, in the event that the extractor is in error and the interval should be voiced, this extension eliminates the possibility of an unvoiced interval somewhere in the middle of a voiced sound.

Path E is the normal path for sustained voicing and employs a feature for locking onto the true pitch and tracking it.
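Before turning to the tracking details of path E, the state encoding of (11) is trivial to implement; the function name below is mine:

```python
def logic_state(vuv_n, vuv_n1, vuv_n2):
    # Eq. (11): pack the three most recent VUV decisions (1 = voiced,
    # 0 = unvoiced) into a 3-bit number in the range 0-7, selecting one of
    # the eight conditions the pitch logic handles.
    return vuv_n + 2 * vuv_n1 + 4 * vuv_n2
```

For example, LOGIC = 3 (voiced, voiced, unvoiced) is the onset-of-voicing condition handled by path C, and LOGIC = 7 is sustained voicing (path E).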
A tracking window of ±12 samples about the last measured pitch period is defined, within which the logic looks for a minimum. This minimum is then compared with the minimum in the entire AMDF search range. Normally, the tracking minimum is selected as the pitch, but the logic will change to the nontracking position if the amplitude of the minimum outside the tracking range is less than 1/2 the tracking amplitude minimum. For higher frequencies, more nulls are present in the AMDF, so a null outside the tracking window is required to be less than 1/8 the minimum in the tracking window to be chosen. There is also a path for changing the VUV decision from voiced to unvoiced, and for extending the previous pitch value.

For all UV intervals, as well as the first voiced interval, the samples of the input segment are reversed in time. This is introduced to overcome a serious problem related to the onset of voicing. For this analysis interval, the waveform is partially unvoiced and partially voiced, with the integrating window falling on the UV portion. A correlation of these samples will not exist for any calculated data value. Reversing the time function places the samples from the voiced portion of the speech waveform in the integrating window. If the interval is basically voiced, the AMDF will have a low amplitude null; if less than 50 percent of the interval is voiced, then the decision would be an unvoiced output.

For all voiced intervals other than at the onset of voicing, the samples in the analysis window are not reversed. At the trailing end of voicing this has the same advantage that reversing the time function had at the onset of voicing, i.e., the periodic portion is contained in the integrating window, and it is possible to track the pitch more accurately to the completion of the voiced sound.
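The path-E tracking rule can be sketched as follows. This is a hypothetical reconstruction from the description above, not the paper's actual code: the function name, the defaulted 1/2 ratio (the text also uses 1/8 at higher pitch frequencies), and the flat search range are all my assumptions.

```python
def select_pitch(d, last_pitch, window=12, ratio=0.5):
    # `d` is the AMDF over the search range (indexed by delay in samples);
    # `last_pitch` is the previous pitch period in samples.
    lo = max(0, last_pitch - window)
    hi = min(len(d), last_pitch + window + 1)
    track = min(range(lo, hi), key=lambda t: d[t])      # tracking minimum
    overall = min(range(len(d)), key=lambda t: d[t])    # global minimum
    # Leave the tracking window only for a much deeper null elsewhere.
    if d[overall] < ratio * d[track]:
        return overall
    return track
```

The effect is hysteresis: small AMDF fluctuations outside the window cannot pull the extractor off a locked pitch track.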
V. EXPERIMENTAL RESULTS

Fig. 8 shows the pitch period contour for the utterance "health suffers when food goes bad," measured by an AMDF and an ACF pitch extractor, each with appropriate decision logic and VUV detection. The frames where the ACF extractor differed from the decision of the AMDF extractor are indicated by a circle.

Fig. 8. Pitch period contour for "health suffers when food goes bad."

The curves are in good general agreement, differing mainly at the onset or trailing end of voicing. It is reasonable to assume that, at these points, the signal amplitude is so small that it makes little difference to a synthesizer whether they are voiced or unvoiced.

In the simulation of the AMDF logic, the voicing decision could be arbitrarily changed. Errors were thus introduced in the voicing decision and the effects on the measured pitch contour were noted. Fig. 9 is a plot of the fundamental frequency for the words "health suffers" and the degradations which occurred when the voicing decisions were constrained to be, respectively, all voiced, all unvoiced, or voiced/unvoiced randomly with equal probability of occurrence. The results show that the logic will attempt to lock onto the correct pitch period in spite of errors which try to offset it. Also evident is a bias toward voicing, since the case where the voicing decisions were forced to be unvoiced produced the most serious degradation.

Adding noise to the input signal caused pitch errors to be generated.
These errors were speaker dependent but appeared to consist mostly of pitch doublings occurring at the onset or central portion of voiced sounds. Few dropouts were evident; for the most part, the extractor tended to maintain voicing. As the signal-to-noise ratio was varied from 30 dB to 10 dB, the number of errors increased, although not by a substantial amount. A more substantial increase in error was found in going from the uncorrupted speech to a high S/N (30 dB) than in decreasing the S/N appreciably. Some evidence is available which shows that the AMDF remains suitable for pitch extraction down to a 0 dB S/N [7]. However, more extensive evaluation on a wide variety of sounds and speakers is required in order to completely evaluate the lowest S/N at which the AMDF pitch extractor will successfully operate.

Fig. 9. Fundamental frequency for the utterance "health suffers" with the introduction of errors in the voicing decision.

VI. REAL-TIME IMPLEMENTATION

A. Run-Time Estimates

The ability to implement an AMDF pitch extractor in real time is a direct consequence of the number of operations necessary and the computational speed of the available machine. A flow chart of the instruction code which would generate samples of the AMDF in real time on a GTE Sylvania programmable signal processor (PSP) is depicted in Fig. 10. For this particular machine (shown in Fig. 11), a multiply instruction takes 750 ns, while additions, subtractions, and manipulations of data generally take 250 to 375 ns apiece. For each set of operations included in the dashed block of Fig. 10, 1.875 μs of time are required. This includes 375 ns each for the load, subtract, test, and negate instructions and an additional 375 ns for storing the partial sum.
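The instruction-time budget can be checked mechanically. This sketch takes as givens the 64-sample integrating window and the 77 AMDF samples per frame described in the next paragraph, and verifies that the quoted per-point and per-frame times follow from the per-instruction figures:

```python
# Five 375 ns instructions (load, subtract, test, negate, store) per sample
# of the inner loop.
block_ns = 5 * 375                 # time for one pass of the dashed block
point_us = 64 * block_ns / 1000    # one AMDF point: 64-sample window
frame_ms = 77 * point_us / 1000    # one frame: 77 AMDF points
# At 7040 Hz, delays of 24-100 samples cover roughly 70-300 Hz,
# which is 100 - 24 + 1 = 77 delay values.
```

These figures reproduce the 1.875 μs block time, the 120 μs per-point time, and the 9.24 ms per-frame time quoted in the text.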
For each analysis interval, there are 64 computations (equal to the number of samples in the integrating window³) in order to calculate a single point of the AMDF. Thus, the time interval is 64 × 1.875 μs, or 120.0 μs. Since 77 samples of the AMDF are generated within each frame, the total time required is about 9.24 ms. These 77 samples are adequate to allow pitch analysis in the range of 70 Hz to 300 Hz at a 7040 Hz sampling rate. The time estimate should also include the required loop control to maintain accuracy in the AMDF generation. This accounts for approximately 250 μs. The total time interval, with the inclusion of approximately 0.2 ms for the decision logic, is the running-time estimate for pitch extraction on this machine. Actual system times were approximately 10 ms for generation of the AMDF and an average of 0.25 ms to perform the decision logic.

³ This corresponds to about 9 ms of speech sampled at a 7040 Hz rate.

B. Conservation of Dynamic Range

In order to preserve accuracy in performing the summations of (3), the calculation of each point of the AMDF is divided into four partial summations, each dealing with 16 terms, i.e.,

D_τ = (1/4) [ Σ_{j=1}^{16} | S_j − S_{j−τ} | + Σ_{j=17}^{32} | S_j − S_{j−τ} | + Σ_{j=33}^{48} | S_j − S_{j−τ} | + Σ_{j=49}^{64} | S_j − S_{j−τ} | ],   τ = 0, 1, ..., τ_max.    (12)

The input data are scaled so the largest value obtainable is ±(2^10 − 1). A factor of 2 is required since each difference sample is the result of combining two input samples, yielding a maximum of 2(2^10 − 1) = 2^11 − 2. Since 64 difference samples (2^6) are summed in the integrating window, the accuracy required to calculate each point of the AMDF is (2^6)(2^11 − 2) = 2^17 − 2^7. This will overload a 16-bit computer. The accuracy required in summing 16 points is (2^4)(2^11 − 2), or 2^15 − 2^5. Each partial sum is factored by 2^{−2}; the four partial sums can then be combined to generate one point of the AMDF.
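The partial-sum scheme of (12) can be sketched directly; the function name is mine, and the worst-case bound in the assertion is the 2^15 − 2^5 figure derived above.

```python
def amdf_point_scaled(s, start, tau):
    # Eq. (12): the 64-term sum is split into four 16-term partial sums,
    # each bounded by 16 * (2**11 - 2) = 2**15 - 2**5 when |s| <= 2**10 - 1,
    # then scaled by 2**-2 before being combined.  Assumes start >= tau so
    # the delayed samples exist.
    total = 0
    for block in range(4):
        p = sum(abs(s[start + j] - s[start + j - tau])
                for j in range(block * 16, block * 16 + 16))
        assert p <= 2**15 - 2**5   # each partial sum fits in 16 bits
        total += p >> 2            # factor each partial sum by 2**-2
    return total
```

Even the worst-case input (full-scale samples of alternating sign, so every difference is 2^11 − 2) stays within 16-bit range at every intermediate step.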
This is suitable for calculation on a 16-bit machine.

C. Methods for Reducing Run-Time Requirements

There are several ways to reduce the time for pitch extraction. One possibility is to generate the AMDF function in an external hardwired device and feed the results into the computer via an I/O channel. Pitch extraction logic on the AMDF signal would require a complex hardware design but would represent a minor load on the processor; thus, a reasonable approach would be to perform the logic inside the processor. The hardware interface to an external AMDF generator could be quite simple. For input, the AMDF function generator would use the processor's A/D converter, and for output the AMDF function would be transmitted to the main processor at the sampling rate. An alternate input scheme is to use a sign-magnitude A/D converter, thus eliminating the magnitude test instruction within the dashed block of Fig. 10. This would reduce the AMDF generating time from 9.24 ms to 6.8 ms.

Fig. 10. Real-time AMDF generator program flow chart.

Fig. 11. GTE Sylvania programmable signal processor (PSP) with card reader.

Another possibility for faster operation is to develop special instructions to decrease the computation. An instruction that added the magnitude of the accumulator into a 20-bit precision register in one instruction cycle would decrease the time for the inner loop and the time for scaling the intermediate sums. Such an instruction would reduce the AMDF generation time from 9.24 ms to about 5.44 ms.

Other interesting approaches include such ideas as band-limiting the input signal to 1000 Hz and resampling at one-fifth the sampling rate [8]. This can provide a reduction in the number of computations necessary to obtain the AMDF.
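A rough operation count suggests why resampling is attractive. The sketch below uses the frame sizes quoted earlier (a 64-sample window and 77 delay points) as assumed figures; at one-fifth the sampling rate, both the window and the delay range shrink by about 5×, so the work per frame drops by roughly 25×.

```python
# Difference operations per frame at the full 7040 Hz rate vs. after
# band-limiting to 1000 Hz and resampling at one-fifth the rate.
full_ops = 77 * 64                      # 77 delays x 64-sample window
decimated_ops = (77 // 5) * (64 // 5)   # both dimensions shrink ~5x
```

This ignores the cost of the band-limiting filter itself, and as the text notes for the related point-skipping scheme, a cruder AMDF may cost some accuracy.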
Some success was found with the simple scheme of generating every second or third point of the AMDF, while restricting the decision logic to ignore points which have not been calculated. In both these cases, however, it should be recognized that a cruder version of the AMDF is generated, which could result in a possible loss of accuracy.

Implementation of the ACF or CCF requires multiplies in place of the sum and magnitude instructions. In general, for a machine with a fast multiply, it takes no more time to generate an ACF or CCF than it does to generate an AMDF. However, fast multiply hardware is somewhat expensive. Also, more scaling is required to handle the large dynamic range associated with the multiply operation than would be for an equivalent summation process. Thus the use of the AMDF, which requires no multiply instructions and does not have restrictive dynamic range constraints, has an advantage over the ACF or CCF with respect to machine complexity.

REFERENCES

[1] M. R. Schroeder, "Vocoders: analysis and synthesis of speech," Proc. IEEE, vol. 54, pp. 720-734, May 1966.
[2] J. L. Flanagan, Speech Analysis, Synthesis, and Perception, 2nd ed. New York: Academic, 1972.
[3] "Narrow band autocorrelation study," GTE Sylvania, Inc., Needham, Mass., Final Rep., Contract DAAB07-71-C-0207, July 1972. (Requests for this document must be referred to: Commanding General, U. S. Army Electronics Command, Attn: AMSEL-NL-Y-4, Fort Monmouth, N. J. 07703.)
[4] H. L. Shaffer, M. J. Ross, and A. Cohen, "AMDF pitch extractor," presented at the 85th meeting of the Acoustical Society of America, Boston, Mass., Apr. 10-13, 1973.
[5] R. L. Freudberg, "A comparison of the averages (1/N) Σ |x_k| and [(1/N) Σ x_k²]^{1/2}," GTE Sylvania, Inc., Needham, Mass., Res. Note, July 1970.
[6] J. M. Wozencraft and I. M. Jacobs, Principles of Communication Engineering. New York: Wiley, 1965.
[7] H. L. Shaffer and C. Howard, "Real-time generation of the cross-correlation function and difference magnitude function," GTE Sylvania, Inc., Needham, Mass., Res. Note, Sept. 1970.
[8] J. D. Markel, "The SIFT algorithm for fundamental frequency estimation," IEEE Trans. Audio Electroacoust., vol. AU-20, pp. 367-377, Dec. 1972.

A Parametrically Controlled Spectral Analysis System for Speech

Abstract—The parametrically controlled analyzer (PCA) is a large PL/I program which has been designed to perform spectral analysis of speech signals. PCA features parametric selection of several analysis methods, including discrete Fourier transformation and linear predictive coding. Also, selection may be made among various smoothing, normalization, and interpolation methods. PCA develops high-quality spectrographic representations of speech for standard line printers and CRT displays. The PCA is described and numerous examples of various parameter settings are presented and discussed.

I. INTRODUCTION

THIS PAPER describes and shows output from a large PL/I program called the parametrically controlled analyzer (PCA). It develops high-quality spectrographic data from a digital speech signal. Previous descriptions of digital sound spectrography for speech may be found in [1]-[4].
Two well known techniques are implemented, the discrete Fourier transform (DFT) and linear predictive coding (LPC). The focus of the paper, however, will not be a description of these algorithms; they are well described in the literature (see, e.g., [5], [6] for DFT and [7]-[11] on LPC). Emphasis will be placed upon parametric control of the various normalizations, smoothings, and interpolations which may be necessary for the derivation of high-quality spectrographic output. PCA has been designed assuming the following recording conditions:

High sampling frequency
Low background noise
Large-dynamic-range A/D converter
Normal adult male speaker.

This has been done in the context of an overall program dedicated to the problem of automatic recognition of continuous speech. Many of the comments which follow, therefore, must be interpreted with this in mind. However, the system has been implemented with complete control over the important system parameters, and thus should be applicable to nearly all acoustic environments as well as other speech processing problems, assuming the user is willing to determine and change control parameters appropriately. For the above environment, the system is self-normalizing and convenient to use.

Manuscript received February 7, 1974; revised May 3, 1974. The authors are with the Speech Processing Group, Department of Computer Sciences, IBM Thomas J. Watson Research Center, Yorktown Heights, N. Y. 10598.

Fig. 1 shows the five major processing stages of the PCA. In the following sections each stage will be discussed in detail. In Section VII, examples of the output will be presented and interactions among important system parameters discussed. The authors wish to emphasize that these examples do not represent a serious attempt to compare LPC and DFT spectral estimations; rather, they are vehicles for explaining the present system.

II. AMPLITUDE NORMALIZATION

A.
Recording System

All recordings are made in a 10′ × 10′ double-walled, sound-treated room which has an allowable interior noise level of 40 dB at 63 Hz and less than 20 dB for frequencies > 500 Hz, re 20 μN/m². After amplification, the microphone signal is fed directly to a dealiasing filter and a 14-bit + sign A/D converter. Sixteen-bit words are recorded directly on digital magnetic tape at a 20-kHz sampling frequency. A peak-detecting digital counter indicates the number of samples which caused the A/D to limit, thus allowing re-recording if the number is excessive. Normally, some of the available 65-dB signal-to-noise ratio is sacrificed so that this counter will read zero after a recording is made. Therefore, there is no need for automatic level control.

B. Amplitude Calibration

In a recording system of this quality, it has been noted that the distribution of the log of energy over short-time intervals of speech is essentially bimodal. This allows the determination of maximum and minimum energy values which define the speech range of the recording.
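The bimodal observation can be illustrated with a minimal sketch. This is my own illustration, not the PCA's code: it computes log energy over consecutive 20 ms windows (400 samples at the 20 kHz rate above); on a clean recording, the resulting values cluster into a low mode (background) and a high mode (speech), whose locations bracket the usable speech range.

```python
import math

def short_time_log_energy(s, win=400):
    # Log energy (dB, arbitrary reference) over consecutive non-overlapping
    # windows; the small additive constant guards against log of zero.
    out = []
    for i in range(0, len(s) - win + 1, win):
        e = sum(x * x for x in s[i:i + win])
        out.append(10.0 * math.log10(e + 1e-12))
    return out
```

A histogram of these values for a whole recording would show the two modes directly; the maximum and minimum calibration energies fall at the upper and lower modes.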