Notes
Note 1: Selecting "Floating point FFT" provides additional accuracy at the cost of substantial extra computation time for large spectrograms on older systems. This extra accuracy is rarely needed. If your spectrograms seem to have slowed to a crawl, check to see if "Floating point FFT" is checked. If a spectrum or spectrogram is giving you an unexpected result, especially from a very low-amplitude signal, turn on "Floating point FFT", and redo the problem display. If there are significant differences, believe the one with floating point turned on.
Note 2: Warning: When you make arbitrary cuts to play a segment of a syllable, you can introduce various perceptual artifacts. Small segments may not sound like speech at all: It usually takes 40 or 50 ms to get a vowel percept out of context. Larger pieces may appear to begin or end with unexpected consonants: Cut off most of an initial [s]-friction, and you're likely to hear /t/ or /d/. These artifacts generally have good auditory or speech-perception explanations, but they can be confusing at first. For your present task of analyzing vowel formants, you are usually safe to ignore the unexpected consonants.
Note 3: At normal speech rates, the transitions will often overlap. In cases like that, the best part of the syllable for measuring the vowel formants may still have substantial influence from the adjacent consonants. Kent & Read 1992 have a useful example of "vowel undershoot" (Fig. 5-4, p. 89). (See also Note 11.)
Note 4: Wide-band spectral cross-sections: Why do we use an averaged spectrum for our wide-band measurement, but a regular spectral cross section for the narrow-band measurement? The crucial difference is the window width for wide-band and narrow-band cross sections.
To see the problem, display several wide-band spectral cross-sections at different places in the part of the syllable that you selected for measurement. You'll probably see significant differences between the cross-sections: This happens because the wide-band displays look at a very small window (8 ms, 5 ms, or 3.3 ms), so that there will be substantial random variation between measurements, depending on exactly where that small window falls.
We can avoid this random variation by computing an "averaged spectrum" over some suitable segment of the syllable. Usually 20 or 30 ms is about the right length of segment to average; with a very short syllable (like a real-world schwa) or with rapidly changing formants, you may need to compute an average over a shorter period, 10 ms or even 5 ms. Averaging over even a short segment helps a lot in evening out the random fluctuations and giving you a better representation of the real formants. (Test this by trying some different examples.)
A narrow-band cross section looks at a 25 ms window (for "Narrow Band") or a 33 ms window (for "Very Narrow Band"), so it's already taking in enough information to give you a stable measurement.
You might ask, what are the wide-band cross-sections good for, except for averaging? The answer is that they are very useful to get a spectral measurement of brief events, things like stop bursts that may last only 5 or 10 ms.
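The contrast in Note 4 between single wide-band cross-sections and an averaged spectrum can be sketched numerically. What follows is a minimal illustration in Python with NumPy, not a description of how Signalyze computes its displays: the synthetic one-formant signal, the 10 kHz sampling rate, the Hamming window, the 1 ms hop between windows, and the choice to average power spectra are all assumptions made for the example.

```python
import numpy as np

FS = 10000    # sampling rate in Hz (assumed for this sketch)
N_FFT = 512   # zero-padded FFT length (assumed)

def synth_one_formant(duration_s, f0=107.0, formant=800.0, bandwidth=80.0):
    """Impulse train at f0 passed through a single two-pole resonance."""
    n = int(round(duration_s * FS))
    source = np.zeros(n)
    source[np.arange(0, n, FS / f0).astype(int)] = 1.0
    r = np.exp(-np.pi * bandwidth / FS)
    c1 = 2.0 * r * np.cos(2.0 * np.pi * formant / FS)
    c2 = -r * r
    out = np.zeros(n)
    for i in range(n):
        out[i] = source[i]
        if i >= 1:
            out[i] += c1 * out[i - 1]
        if i >= 2:
            out[i] += c2 * out[i - 2]
    return out

def short_window_power(signal, start_s, span_s, win_s=0.005, hop_s=0.001):
    """Power spectra of successive short windows inside the chosen span."""
    win_len = int(round(win_s * FS))
    hop = int(round(hop_s * FS))
    start = int(round(start_s * FS))
    stop = start + int(round(span_s * FS))
    window = np.hamming(win_len)
    frames = np.array([signal[i:i + win_len] * window
                       for i in range(start, stop - win_len + 1, hop)])
    return np.abs(np.fft.rfft(frames, n=N_FFT, axis=1)) ** 2

signal = synth_one_formant(0.2)
power = short_window_power(signal, start_s=0.05, span_s=0.03)
freqs = np.fft.rfftfreq(N_FFT, d=1.0 / FS)

single_db = 10.0 * np.log10(power + 1e-12)                 # one curve per 5 ms window
averaged_db = 10.0 * np.log10(power.mean(axis=0) + 1e-12)  # one curve for the 30 ms span

peak_bin = int(np.argmax(averaged_db))
spread_db = single_db[:, peak_bin].max() - single_db[:, peak_bin].min()
print(f"averaged-spectrum peak near {freqs[peak_bin]:.0f} Hz")
print(f"single-window levels at that bin vary by {spread_db:.1f} dB")
```

The individual 5 ms windows differ mainly because each one catches a different part of the glottal cycle; the average over the 30 ms span smooths that out into a single curve.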
Note 5: This procedure assumes that the biggest harmonics in a given region are also the most strongly amplified harmonics. More generally, it assumes that height in the narrow-band display tells us how much a given harmonic has been amplified by the nearby formant resonance(s). This will be true if the source function slopes smoothly downward like the textbook pictures, and if the "pre-emphasis" operation and the radiation effect combine to cancel out this downward slope. Real source functions do not necessarily have the smooth textbook slope, and the amplitude adjustment is necessarily an approximation; but for normal voices these assumptions usually work pretty well. If you are working with a patient with a voice pathology, this may not be true. See also the discussion of Slice 2 in section 6.3.3.
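To make the slope argument in Note 5 concrete, here is a minimal Python sketch of a first-order pre-emphasis filter and its gain at a few frequencies. This is a common textbook form of pre-emphasis, not necessarily the one Signalyze applies; the coefficient 0.97, the 10 kHz sampling rate, and the round-number slopes for the source (-12 dB/octave) and radiation (+6 dB/octave) are assumptions taken from the idealized source/filter account.

```python
import cmath
import math

FS = 10000      # sampling rate in Hz (assumed)
ALPHA = 0.97    # pre-emphasis coefficient (a common textbook value, assumed)

def pre_emphasize(samples, alpha=ALPHA):
    """First-order pre-emphasis: y[n] = x[n] - alpha * x[n-1]."""
    out = []
    previous = 0.0
    for x in samples:
        out.append(x - alpha * previous)
        previous = x
    return out

def pre_emphasis_gain_db(freq_hz, alpha=ALPHA, fs=FS):
    """Gain of the filter above at one frequency, in dB."""
    omega = 2.0 * math.pi * freq_hz / fs
    response = 1.0 - alpha * cmath.exp(-1j * omega)
    return 20.0 * math.log10(abs(response))

for f in (250, 500, 1000, 2000, 4000):
    print(f"{f:5d} Hz: {pre_emphasis_gain_db(f):6.1f} dB")

# Idealized slopes in dB per octave (textbook round numbers, assumed):
SOURCE_SLOPE = -12
RADIATION_SLOPE = +6
PRE_EMPHASIS_SLOPE = +6
net_slope = SOURCE_SLOPE + RADIATION_SLOPE + PRE_EMPHASIS_SLOPE
print("net slope:", net_slope, "dB/octave")
```

In the mid-frequency range the filter gains roughly 6 dB per octave, which is what lets it offset the remaining downward tilt in the idealized case. A real voice whose source slope departs from the textbook figure will leave a residual tilt, which is the caution the note raises.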
Note 6: LPC analysis is intended to help with this evaluation. The Signalyze Manual p57, 171-72 gives extensive advice about setting parameters for LPC analysis, and accompanying cautions: "Caution: Use LPC with care." "...the chances of error in the model are relatively high. For this reason, LPC extractions are quite often "strange" (no peaks, weird-looking peaks, or peaks in patently wrong places) Nudge the cursor over just a bit and try again. Chances are you'll do much better." This is useful advice for the experienced analyst, who already knows pretty much what the answer "ought" to be from the narrow-band FFT display, and can keep trying until the system gives an LPC display that fits with the rest of his or her analysis. It is less useful for the student, who doesn't know in advance what ought to count as strange.
Unfortunately, despite the Manual's good advice, I haven't been able to work out a convenient system for getting reliable LPC measurements using Signalyze 3.12. Until I figure this out, we're stuck with a judgement call for interpreting formant center frequencies from narrow-band cross-sections.
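For readers who want to see what an LPC analysis does in principle, the following Python/NumPy sketch implements the generic textbook autocorrelation method and reads formant candidates off the roots of the predictor polynomial. It is not the Signalyze 3.12 algorithm, whose internals are not described here. The test signal is an idealized assumption: the impulse response of two hypothetical resonances (800 Hz and 1700 Hz, with 80 and 100 Hz bandwidths) at an assumed 10 kHz sampling rate.

```python
import numpy as np

FS = 10000  # sampling rate in Hz (assumed)

def resonator_coeffs(freq_hz, bw_hz, fs=FS):
    """Denominator A(z) of a two-pole resonance at freq_hz with bandwidth bw_hz."""
    r = np.exp(-np.pi * bw_hz / fs)
    theta = 2.0 * np.pi * freq_hz / fs
    return np.array([1.0, -2.0 * r * np.cos(theta), r * r])

def all_pole_filter(a, x):
    """Apply 1/A(z) to x with a direct recursion (a[0] must be 1)."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = x[n]
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y[n] = acc
    return y

def lpc_autocorrelation(x, order):
    """Solve the autocorrelation normal equations; return A(z) coefficients."""
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    predictor = np.linalg.solve(R, r[1:])
    return np.concatenate(([1.0], -predictor))

def formants_from_lpc(a, fs=FS):
    """Frequencies (Hz) of the complex pole pairs of 1/A(z), sorted ascending."""
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]
    return np.sort(np.angle(roots) * fs / (2.0 * np.pi))

# Idealized test signal: impulse response of two cascaded resonances.
a_true = np.convolve(resonator_coeffs(800.0, 80.0), resonator_coeffs(1700.0, 100.0))
impulse = np.zeros(4000)
impulse[0] = 1.0
signal = all_pole_filter(a_true, impulse)

estimated = formants_from_lpc(lpc_autocorrelation(signal, order=4))
print("estimated resonances (Hz):", np.round(estimated, 1))
```

On this idealized signal the model assumptions hold exactly, so the estimates land on the true resonances. Real speech adds a periodic source, noise, changing formants, and an uncertain model order, which is why LPC output on real data can look "strange" in the way the Manual describes.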
Note 7: How much precision can we reasonably expect in measurements of formant center frequencies? Classical studies like Lehiste and Peterson 1961 (JASA 33:268-77) report test-retest accuracies of +/- 25 Hz; with modern interactive systems, we should be able to do better, but not hugely better. The real answer is that it depends on the signal. There will be easy cases, where the interpretation is obvious, and everyone should come out within a few Hz. There will be genuinely difficult cases, where you may be guessing over a range of 100 Hz or more. For these hard cases, usually you will want to look for ways of collecting data that is easier to interpret rather than struggling with the problem signals.
Note 8: Normative data: Textbooks like the ones listed in section 1 typically include graphs or tables of formant center frequencies taken from classical large-scale studies like Peterson and Barney 1952 (JASA 24:175-84). Handbooks like Baken 1987 (Clinical Measurement of Speech and Voice, Allyn and Bacon, Boston) include more extensive normative data and references. In comparing your measurements to normative data, it is important to remember:
1. Speakers with different vocal tract sizes will have different formant center frequencies, even though their acoustic vowel quadrilaterals may show the same basic pattern. Most normative studies provide data for men, women, and pre-adolescent children, with perhaps some further break-down among the children; you want to make sure you are looking at the right part of the table.
2. The average values that are graphed or listed represent mid-points of distributions that have a substantial range of variation, even after controlling for vocal tract size. You want to think of norms as areas on an acoustic vowel quadrilateral, not as points. Peterson and Barney provide helpful graphs (their Figs 8 and 9) plotting each English monophthong as a substantial ellipse on a vowel quadrilateral.
3. In addition to the variation within speakers of a given dialect, there are also substantial differences between geographical and social dialects, and there may also have been significant linguistic change in the time between a study like Peterson and Barney (1952) and the data that you are collecting today. Just because you're seeing data outside the range of the textbook norms doesn't necessarily mean you're seeing a speech pathology: You need to know the norms for this speaker's speech community.
Note 9: Acoustic Vowel Quadrilaterals: Everyone agrees that it is useful to plot F1 and F2 center frequencies on a two-dimensional graph that we could call an "acoustic vowel quadrilateral". Unfortunately, there is no standard way to do this. Linguists usually arrange their axes to look like a traditional vowel chart, which puts F1 on the vertical axis and zero in the top right corner. Psychologists usually put zero in the usual bottom left position, and put F1 on the horizontal axis. Sometimes the second axis is F2; sometimes it is F2-F1. Scales are sometimes log Hz (rarely linear Hz), sometimes a psychologically-based measurement (Koenig, mel, bark, etc.), sometimes a physiologically-based measurement. The only safe rule for interpreting acoustic vowel quadrilateral displays is to look very attentively at the axis labels and the scales, and to make sure you are comparing like with like.
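Because the same F1/F2 pair lands in very different places depending on the scale, it can help to convert values explicitly before comparing plots. The sketch below, in Python, uses two widely cited closed-form approximations as assumptions: the O'Shaughnessy formula for mel and the Traunmüller (1990) formula for bark. Other publications use other approximations, so check which one your source used. The function names are hypothetical, and the example values 790 and 1700 Hz are the F1 and F2 best estimates from Table 2.

```python
import math

def hz_to_mel(f_hz):
    """Mel value from Hz (O'Shaughnessy approximation, assumed)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def hz_to_bark(f_hz):
    """Bark value from Hz (Traunmueller 1990 approximation, assumed)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def vowel_point(f1_hz, f2_hz, scale="hz"):
    """Return (F1, F2) on the requested scale: 'hz', 'log', 'mel', or 'bark'."""
    convert = {
        "hz": float,
        "log": math.log10,
        "mel": hz_to_mel,
        "bark": hz_to_bark,
    }[scale]
    return convert(f1_hz), convert(f2_hz)

for scale in ("hz", "log", "mel", "bark"):
    f1, f2 = vowel_point(790, 1700, scale)
    print(f"{scale:>4}: F1 = {f1:8.2f}, F2 = {f2:8.2f}")
```

Axis orientation (which formant goes on which axis, and which corner holds zero) is a separate plotting choice that still has to be matched by hand.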
Note 10: See section 3, initial set-up: To get the clean waveform display in (1a) and also get the horizontal reference lines in the spectrograms in (1b, 2a, 3a), you need to work in three steps: First, select Y-axis scales and horizontal grid lines in Signal > Scale Setup. This gives you labelled horizontal reference lines in your waveform; any spectrograms you create while the waveform has horizontal lines will also get horizontal lines. Now, second step, create all your spectrogram displays, and they will appear with the helpful horizontal reference lines you see in (1b, 2a, 3a). Finally, third step, go back to Signal > Scale Setup and turn off Y-axis scales and horizontal grid lines. This will delete the non-useful horizontal lines in the waveform, while leaving the useful reference lines in the spectrograms you have already displayed.
Apologies for the complications here. If you're not looking at details in the waveform, and you're not trying to do pretty print-outs, you can always just leave the horizontal lines turned on permanently, and ignore the clutter in the waveform display.
Note 11: At this very slow speech rate, there really is a part of the syllable that we could fairly label as "the vowel": The effect of the /h/ stops about 890 ms, and the effect of the /d/ doesn't begin until about 1150, so the part between is just /æ/. At more normal speech rates, it's more common for everything to overlap, so that we usually talk about "the syllable nucleus" or "the vocalic part of the syllable" rather than "the vowel". (See also Note 3.)
Note 12: According to textbook descriptions, English /æ/ in "had" is a monophthong (one vowel quality), while the English "long /i/" in "heed" is often transcribed as a diphthong /ij/ or [ij]. It is therefore interesting to notice that, for these tokens, the vowel in "had" is a lot more diphthongized than the vowel in "heed". Acoustic analysis is full of little surprises like this.
Note 13: Do some of the short chunks sound more like [ ] than [æ]? Some of that effect may be vowel quality, but there is also an important perceptual effect of duration: English / / is substantially shorter than English /æ/, so a short chunk of [æ] is quite likely to be heard as / /. Try varying the duration, and see what you hear. This is an example of what we were discussing in Note 2.
Note 14: In this example, F1 and F2 run reasonably straight from the beginning of the syllable to the inflection point where the stop transition begins, so a measurement at each end can fairly represent the trajectory for this /æ/. Sometimes you see more complex trajectories, requiring 3 measurement points (a "triphthong") or even more.
Note 15: The averaged spectrum display looks like a step-function, whether or not smoothing is turned on in Spectral Analysis Setup. Spectral cross-sections with smoothing give you smooth curves. This can be a useful visual cue, reminding you when you are looking at an averaged spectrum.
Note 16: When doing spectral cross-sections, it is usually good practice to use the "Floating point FFT" setting, even on slower machines where you wouldn't use it for spectrograms. The spectra in Figures 1 and 2 were all calculated with floating point on.
Note 17: If you are in doubt about the relative amplitudes of different harmonics, recall that you can read a dB measurement of the relative amplitude from the lower display in the window to the left of the spectrum display. In this case, the harmonic at about 528 Hz has an amplitude of 21.6 dB, while the harmonic at about 842 Hz has an amplitude of 20.8 dB.
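To get a feel for how small the difference in Note 17 is, the two readings can be converted to a linear amplitude ratio. This short Python sketch assumes the display reports amplitude in the usual 20 log10 sense; the function names are hypothetical.

```python
import math

def db_to_amplitude_ratio(delta_db):
    """Linear amplitude ratio corresponding to a level difference in dB."""
    return 10.0 ** (delta_db / 20.0)

def amplitude_ratio_to_db(ratio):
    """Level difference in dB corresponding to a linear amplitude ratio."""
    return 20.0 * math.log10(ratio)

level_528_db = 21.6   # harmonic at about 528 Hz, from the note
level_842_db = 20.8   # harmonic at about 842 Hz, from the note

delta_db = level_528_db - level_842_db
ratio = db_to_amplitude_ratio(delta_db)
print(f"{delta_db:.1f} dB difference, amplitude ratio {ratio:.3f}")
```

A difference under 1 dB corresponds to an amplitude ratio only about ten percent above 1, which is why the two harmonics look nearly equal in the display and the dB readout is needed to rank them.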
Note 18: In the display in (2c), we can't see the fundamental frequency f0 = h1, which we would expect to find at about 107 Hz. Similarly, we can't see h2 or h3; the first visible bump corresponds to h4 at about 428 Hz. This sort of pattern is very common in pre-emphasized narrow-band displays: Recall that this display is designed to show you the harmonics that got amplified; if the lowest formant is above 600 Hz, the fundamental at about 100 Hz isn't getting much amplification at all. For other purposes, we might want to change the spectral analysis settings so that we could see the low harmonics, but the low harmonics aren't relevant for our current analysis of /æ/.
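The harmonic numbering in Note 18 is simple arithmetic: harmonic k sits at k times the fundamental. A small Python sketch, assuming a perfectly steady f0 of 107 Hz (real voices drift, so measured harmonics will sit slightly off these values); the function names are hypothetical.

```python
F0_HZ = 107.0  # fundamental frequency from the note, assumed steady

def harmonic_frequency(k, f0=F0_HZ):
    """Frequency in Hz of the k-th harmonic (h1 is the fundamental)."""
    return k * f0

def nearest_harmonic(freq_hz, f0=F0_HZ):
    """Number of the harmonic closest to a measured peak frequency."""
    return max(1, round(freq_hz / f0))

for k in range(1, 7):
    print(f"h{k}: {harmonic_frequency(k):.0f} Hz")

print("peak at 428 Hz is h" + str(nearest_harmonic(428.0)))
```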
Note 19: This is an important difference between the human speech production system and reed instruments like a clarinet. In the clarinet, the resonances control the rate of vibration of the reed, and therefore control the fundamental: The reed locks on to one of the principal resonances of the tube; opening and closing holes in the clarinet tube changes the resonance, and therefore changes the fundamental. In a classical source/filter model of human speech production, the vibrating vocal cord source works independently of the resonances (formants) in the upper vocal tract. This turns out not to be quite true, but it is close enough for our present purposes.
Note 20: An LPC spectrum is designed to give you a more exact estimate of the resonance peak, but I haven't been able to get reliable results from the Signalyze 3.12 LPC algorithm; see Note 6.
Note 21: You'll notice that I've located the resonance peak slightly to the left of the centerline between the two big harmonics, but that the next harmonic below the big pair (about 1791 Hz) is substantially smaller than the next harmonic above the big pair (about 2098 Hz). This is not what we would expect if all four harmonics were amplified by a single, symmetrical resonance.
This asymmetry might be an accident, since resonances are not always symmetrical. It is more likely, however, that the 2098 Hz harmonic and the other harmonics between F2 and F3 are getting amplified by both the F2 resonance and the F3 resonance. In Track 2c, it is easy to see where the formants go without worrying about this sort of resonance overlap. When the formant center frequencies get closer together, allowing for the overlap can play a crucial role in interpreting the data: You want to visualize the two overlapping resonance curves, and then estimate how big the middle harmonics would be if they were getting amplification from only one of the curves.
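The overlap reasoning in Note 21 can be tried out numerically. The Python sketch below models each formant as a single two-pole resonance and treats the two as a cascade, so their gains add in dB. Everything except the two harmonic frequencies (1791 and 2098 Hz) is a hypothetical value chosen for illustration: the F2 and F3 center frequencies, the 100 Hz bandwidths, and the cascade assumption itself.

```python
import math

def resonance_gain_db(freq_hz, center_hz, bandwidth_hz):
    """Gain of one two-pole resonance, normalized to 0 dB at 0 Hz."""
    half_bw_sq = (bandwidth_hz / 2.0) ** 2
    numerator = center_hz ** 2 + half_bw_sq
    denominator = math.sqrt(((freq_hz - center_hz) ** 2 + half_bw_sq)
                            * ((freq_hz + center_hz) ** 2 + half_bw_sq))
    return 20.0 * math.log10(numerator / denominator)

F2_HZ = 1945.0    # hypothetical F2 center, roughly midway between the harmonics
F3_HZ = 2540.0    # hypothetical F3 center
BW_HZ = 100.0     # hypothetical bandwidth for both resonances
LOWER_HZ = 1791.0  # harmonic below the big pair, from the note
UPPER_HZ = 2098.0  # harmonic above the big pair, from the note

# Gain from F2 alone, and the extra gain contributed by F3, at each harmonic.
f2_only = {f: resonance_gain_db(f, F2_HZ, BW_HZ) for f in (LOWER_HZ, UPPER_HZ)}
f3_extra = {f: resonance_gain_db(f, F3_HZ, BW_HZ) for f in (LOWER_HZ, UPPER_HZ)}

for f in (LOWER_HZ, UPPER_HZ):
    total = f2_only[f] + f3_extra[f]
    print(f"{f:6.0f} Hz: F2 alone {f2_only[f]:5.1f} dB, "
          f"F3 adds {f3_extra[f]:4.1f} dB, total {total:5.1f} dB")
```

Under these assumptions F2 alone treats the two flanking harmonics almost equally, while the F3 contribution lifts the upper one by several dB more than the lower one. Subtracting that estimated contribution is one way to judge how big the middle harmonics would be under a single resonance.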
Note 22: There's a big amplitude drop in the waveform between about 950 ms and 970 ms. It's natural to speculate that this amplitude change is a reflection of the same irregular vocal fold vibration that is producing the gaps in the source function.
Note 23:
Table 2. Measurements for [æ] in "had", Slice 2, in Hz
Data from Figure 2.
| Formant (Slice 1) | Averaged wide band: Raw | Averaged wide band: Calibration | Averaged wide band: Corrected | Narrow band | Difference | Combined best estimate |
| F1 | 834 | -40 | 794 | 796 | +2 | 790 |
| F2 | 1738 | -40 | 1698 | 1707 | +9 | 1700 |
| F3 | 2573 | -40 | 2533 | 2557 | +24 | 2540 |
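The Corrected and Difference columns of Table 2 follow from simple arithmetic, which this Python sketch reproduces. It assumes that the single calibration value of -40 Hz applies to all three formants, as the Corrected column implies. The combined best estimate is a judgement call and is not computed here.

```python
CALIBRATION_HZ = -40  # wide-band calibration offset (assumed to apply to F1-F3)

raw_wide_band = {"F1": 834, "F2": 1738, "F3": 2573}   # averaged wide band, raw
narrow_band = {"F1": 796, "F2": 1707, "F3": 2557}     # narrow-band readings

# Corrected wide-band value = raw value plus the calibration offset.
corrected = {name: raw + CALIBRATION_HZ for name, raw in raw_wide_band.items()}

# Difference = narrow-band reading minus corrected wide-band value.
difference = {name: narrow_band[name] - corrected[name] for name in raw_wide_band}

for name in raw_wide_band:
    print(f"{name}: corrected {corrected[name]} Hz, "
          f"narrow band {narrow_band[name]} Hz, difference {difference[name]:+d} Hz")
```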
Note 24: Duration is also an important cue distinguishing / / from /æ/; see Note 13.
Note 25: English has a vowel transcribed [a] in diphthongs like [a ] and [a ]; but this is central rather than front, and not in contrast with /æ/.
Note 26: I am fudging the question of the phonological status of [a], which does not contrast directly with either [æ] or [ ] for this speaker.
Note 27: Notice, however, that the nucleus of the "heed" syllable is almost as long, but the [i] is much less diphthongized than our /æ/: Duration by itself does not necessarily cause diphthongization.
Note 28: This ellipse in Figure 8 contains data points from men, women, and children, so it greatly overstates the range of variation for men. If we compare P&B's Figure 8 with their Figure 9 (not so often reprinted), we can see that the whole travel of our /æ/ falls within the adult male range, and more specifically within the range of /æ/ productions that were unanimously identified as /æ/ in P&B's listening tests.