Okay, I'll give this one a shot. Keep in mind that this answer is not rigorous in
any sense; it is based on my general understanding of the algorithms rather than
their implementations in Csound, and it may be apocryphal or flat-out wrong.
LPC works on the assumption that the source sound is basically a filtered buzz. In
the analysis process, the formants (the resonances of the filter) are estimated and
filtered out of the sound. What remains is called the residue. From the residue, the
intensity and frequency of the buzz can be calculated. As with STFT and the streaming
phase vocoder implementations, this process is done on short frames of audio.
Wikipedia indicates 30-50 frames/sec are usually successful for speech. To
resynthesize a signal analyzed with LPC, you then filter a source signal (typically a
mix of buzz and noise) through the estimated filter, which should yield approximately
the same output.
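As a rough illustration of that analysis/resynthesis loop (this is not Csound's
lpanal, just a sketch of the textbook autocorrelation method under my own
assumptions; the function names and the order of 12 are made up), one frame might
look something like this in Python:

    import numpy as np
    from scipy.linalg import solve_toeplitz
    from scipy.signal import lfilter

    def lpc_analyze_frame(frame, order=12):
        """Estimate an all-pole filter and the residue for one windowed frame."""
        frame = frame * np.hanning(len(frame))
        # Autocorrelation method: solve the Toeplitz normal equations
        # for the predictor coefficients.
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        a = solve_toeplitz(r[:order], r[1:order + 1])
        inverse_filter = np.concatenate(([1.0], -a))
        residue = lfilter(inverse_filter, [1.0], frame)  # whitened excitation
        gain = np.sqrt(np.mean(residue ** 2))            # intensity of the buzz
        return inverse_filter, gain

    def lpc_resynth_frame(inverse_filter, gain, source):
        """Drive the estimated all-pole filter with a buzz/noise source frame."""
        return lfilter([gain], inverse_filter, source)

Estimating the frequency of the buzz from the residue (e.g. by autocorrelation of
the residue) is left out here, but that is where the pitch information would come
from.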
The streaming phase vocoder is based on the short-time Fourier transform, a
completely different method of analysis. Each frame of audio is transformed into a
series of frequency bins; the number of bins depends on the length of the analysis
frame. The analysis produces an amplitude-phase pair for each bin, and the frame can
then be resynthesized from those pairs using an inverse Fourier transform.
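Again just as a sketch (not pvsanal itself, and ignoring overlap-add and
phase-tracking details), the amplitude-phase pairs for one frame could be produced
and inverted like this:

    import numpy as np

    def stft_analyze_frame(frame):
        """Return an (amplitude, phase) pair for each frequency bin of one frame."""
        spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
        return np.abs(spectrum), np.angle(spectrum)

    def stft_resynth_frame(amps, phases):
        """Rebuild the time-domain frame from its amplitude/phase pairs."""
        return np.fft.irfft(amps * np.exp(1j * phases))

For a frame of N samples this gives N/2 + 1 bins, which is why the bin count is tied
to the analysis frame length.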
Given this information, there's a clear reason why STFT methods are often more
successful in musical contexts. LPC assumes that sound is produced by a filtered
buzz. While this is a reasonable approximation for speech and the voice, it is less
accurate for many musical instruments and other audio sources, and it falls apart
completely in polyphonic contexts. Furthermore, the output of LPC, at least in
Csound's implementation, varies widely depending on the analysis parameters; I
haven't seen as wide a variance with STFT methods. This makes it much easier to get
bad results with LPC. Presumably, if you use the method a lot, it becomes much easier
to choose good parameters at the outset.
I cannot comment on pitch-synchronous overlap-add methods.
There's also a clear reason why sociolinguists would use LPC. It has a long history
of use in speech applications and in publications, so it is well understood within
the field. The same cannot be said for the phase vocoder. Besides that, since LPC
analysis is built on the assumption that the sound source is vocal-like, the analysis
data is directly applicable to vocal models. With an STFT-based analysis, there would
need to be an intermediate step of interpreting the analysis output to match it to a
vocal model.
I doubt any studies exist that you could cite to prove that STFT analysis is superior
to LPC for the purposes of linguists; such studies would almost certainly have been
performed by linguists, and they're probably too busy doing their real work to
compare LPC to some other method they don't know about. I'm not convinced it's true
myself (I prefer LPC to pvsanal et al. when the source is suitable for LPC).
If you want to convince sociolinguists to use pvsanal-like tools, you may need to get
them interested enough in the tool to do such research themselves. I would begin
such a conversation by asking about how LPC data is used, what the known limitations
of the method are, and if there's anything they wish the analysis could provide that
it doesn't.
John W. Lato
Sarah and Ernest Butler School of Music
The University of Texas at Austin
1 University Station E3100
Austin, TX 78712-0435
(512) 232-2090
David Akbari wrote:
> Not yet.
>
> The reason I'm asking is because I know many people involved in
> sociolinguistics who are using the LPC/PSOLA for analysis/resynthesis
> of speech, specifically.
>
> I know from musical experience that the streaming f-sig analysis
> format implemented in CDP and Csound is far superior. I just need some
> resources to cite to prove this to these individuals. Simply producing
> sound for A/B comparison has been OK... but it would be nice to have a
> more pedantic, substantive basis for these claims of superiority. Then
> we might see a wider adoption of this technology beyond the scope of
> computer music circles.
>
>
> -David
>
> On Sat, Jun 21, 2008 at 9:46 AM, Richard Bowers
> wrote:
>> There has been no reply on the list to this. Did anyone reply to David
>> privately? I would be interested in the responses if there were any.
>>
>> --Richard.
>>
>> David Akbari wrote:
>>> Hi List and Dr. Dobson,
>>>
>>> In my recent work I have come across the paradigm of creating a
>>> continuum from endpoint stimuli in experimental procedures using
>>> synthetic sounds as the end points.
>>>
>>> I'm specifically wondering, what are the major differences in the
>>> abstract between the linear predictive coding analysis and
>>> pitch-synchronous-overlap-add resynthesis and the spectral streaming
>>> phase vocoder analysis/resynthesis as it is implemented today in
>>> Csound ?
>>>
>>> Many people are using the LPC/PSOLA but I know from musical experience
>>> that the PVS/PVX format sounds much better. I'm trying to get a better
>>> idea of why this is so... any scholarly papers, websites, or similar
>>> online resources would be greatly appreciated!
>>>
>>>
>>> Thank you for your time and consideration,
>>>
>>> David Akbari
>>>
>
>