Questions to pvs opcodes
Date | 2016-10-04 11:11 |
From | Karin Daum |
Subject | Questions to pvs opcodes |
Hi,

I'm using the pvs opcodes intensively (mainly pvstanal) for pitch scaling of voices. With large scale factors I get the well-known problems: the "helium effect" when the pitch is raised too much, and features of the voice getting washed out when it is lowered too much. The problem is that some parts of the spectrum have to be shifted (with some modifications to the amplitudes) while others should be kept unchanged (mainly the higher frequencies, e.g. the sound components created by tongue, teeth and lips).

My first attempt was to split the signal with pvsbandp into low- and high-frequency parts, scale the low-frequency part and sum the two parts afterwards. This created quite a few artefacts (distortions). Then I tried to improve the quality by using pvswarp with imode=1 or 2, but this also produces artefacts (some resonances) and does not really avoid the helium effect or the washing out.

Currently I'm trying to use pvsftw to get arrays of amplitudes and frequencies, which can then be modified and transformed back to an fsig using pvsftr. I realised that I can change low frequencies (say below about 1 kHz) as intended, but I'm not able to change the high-frequency part, regardless of the factors I apply to these frequencies in the corresponding table. I saw this by doing spectral analyses of the output files. Maybe I'm too naive, but I was expecting that using pvsftw/pvsftr and scaling the frequencies in the tables would have the same effect as using pvscale.

To demonstrate the problem I attach a small csd file (and a wav file used as input). Instrument 10 triggers 5 instruments which are "played" one at a time. Instrument 21 prints out the information on the average pitch, centroid etc.

- Instr 1 plays the original sound.
- Instr 2 plays the sound scaled by p4 of instr 10 using pvscale.
- Instr 3 plays the sound scaled by p4 of instr 10 using pvsftw/pvsftr.
- Instr 4 should reproduce the original sound by first scaling by p4 of instr 10 using pvscale and then by 1/p4 using pvsftw/pvsftr.
- Instr 5 should reproduce the original sound by first scaling by p4 of instr 10 using pvsftw/pvsftr and then by 1/p4 using pvsftw/pvsftr again.

I expected sounds 1, 4 and 5 to be the same and sounds 2 and 3 to be the same. But only instr 5 reproduces the original. From the output of instr 21 you can see what happens: pvsftw/pvsftr preserve the centroid. This makes sounds 2 and 3 (and 4 and 5) sound different.

Is there a way to avoid the preservation of the centroid while keeping the possibility to modify the different frequencies independently?

I would be grateful for any help.

Karin

Csound mailing list Csound@listserv.heanet.ie https://listserv.heanet.ie/cgi-bin/wa?A0=CSOUND Send bugs reports to https://github.com/csound/csound/issues Discussions of bugs and features can be posted here |
Date | 2016-10-04 11:23 |
From | Victor Lazzarini |
Subject | Re: Questions to pvs opcodes |
Did you try pvscale with formant retention (mode 1 or 2)? |
Date | 2016-10-04 11:48 |
From | Karin Daum |
Subject | Re: Questions to pvs opcodes |
Yes, I did. As far as I remember, this yields results similar to pvswarp. |
Date | 2016-10-04 12:00 |
From | Victor Lazzarini |
Subject | Re: Questions to pvs opcodes |
I am surprised, as my experience with pvscale is that it works fairly well to keep formants. |
Date | 2016-10-04 12:50 |
From | Karin Daum |
Subject | Re: Questions to pvs opcodes |
I’ll check it again. I remember that I had some issues with it when I was testing it quite some time ago. Thanks. I will come back with my findings when I’ve tested it thoroughly. |
Date | 2016-10-04 13:52 |
From | Victor Lazzarini |
Subject | Re: Questions to pvs opcodes |
This does not sound too bad to my ears, but maybe it’s not up to what you need.

instr 1
 S1 = "fox.wav"
 p3 = filelen(S1)                  ; play for the length of the file
 a1 diskin S1
 fs1 pvsanal a1,2048,256,2048,1    ; fftsize 2048, hop 256, window 2048, Hanning
 fs2 pvscale fs1,1.9,1             ; scale pitch by 1.9, formants kept (mode 1)
 out pvsynth(fs2)
endin

(Changing the mode from 1 to 0 shows the difference.) |
Date | 2016-10-04 16:57 |
From | Karin Daum |
Subject | Re: Questions to pvs opcodes |
Thanks Victor,

this sounds really quite good, much better than what I had before. It shows the wanted behaviour: the spectrum above 2 kHz does not change much (that is what I wanted to get by using pvsftw/pvsftr). It needs some rebalancing of the amplitudes for f0 and f1: the amplitude of f0 has to be increased for scales below 1, and the amplitude of f1 has to be increased for scale>1. At least this is what I see when looking at the spectra.

I don’t know why I concluded some months ago that using imode=1 or 2 does not work the way I wanted it to. I must have done something wrong at that time.

Thanks again,
Karin |
Date | 2016-10-04 17:58 |
From | Victor Lazzarini |
Subject | Re: Questions to pvs opcodes |
The way this works is that the formants are extracted from the original and then reapplied. So there is no way to shift or rebalance them. pvswarp could possibly be used for this, as formants can be scaled and shifted in frequency.

Victor Lazzarini
Dean of Arts, Celtic Studies, and Philosophy
Maynooth University
Ireland |
Date | 2016-10-04 19:21 |
From | Karin Daum |
Subject | Re: Questions to pvs opcodes |
What I have to do is to increase/decrease the amplitudes of certain formants. When scaling down the pitch with pvscale (imode=2 or 1) I have to increase the amplitude of f0 and reduce the amplitude of f2. When scaling up I have to increase the amplitude of f1. Since I know the pitch f0 of the input I can do this using pvsftw and pvsftr. This works fine for me, but it may be more complicated in general.

From the spectral analysis I realise that pvscale introduces a frequency cutoff at almost exactly sr/2*(scale factor)^2. This means lowering the pitch by a factor of 2 introduces a high-frequency cutoff at 6 kHz. The missing range can be recovered using pvsbandp and then mixing it with the result of pvscale; this also works fine.

Meanwhile I understand why I did not consider pvscale with imode=1 or 2: when applying a scale factor of 0.5 (which I may have used at that time for testing), f0 is suppressed by about 25 - 30 dB relative to f1 and may have got lost in the noise if the input signal used at that time did not show very prominent formants. f1 and f3 of the scaled sound are at the positions of f0 and f1 of the input signal, with similar amplitudes for f1 and f3 of the scaled signal as for f0 and f1 of the input. This looked odd and led me to the conclusion that I could not use it.

Cheers,
Karin |
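The cutoff figures quoted in this message can be checked with a few lines of arithmetic. The following is not Csound code but a small Python sketch; the function names are invented for illustration, sr = 48000 is assumed because it is the rate at which 6 kHz follows from a factor of 0.5, and the sr/2*scale^2 relation is only the empirical observation reported above, not a documented property of pvscale.

```python
def transposed_limit(sr, scale):
    """Highest frequency that can carry energy after plain transposition.

    The input only contains content up to sr/2, so scaling by `scale`
    moves that limit to sr/2 * scale (never above Nyquist).
    """
    nyquist = sr / 2
    return min(nyquist, nyquist * scale)


def observed_keepform_cutoff(sr, scale):
    """Empirical cutoff reported in the thread for formant-keeping modes:
    sr/2 * scale**2, capped at Nyquist (the cap is an assumption here)."""
    nyquist = sr / 2
    return min(nyquist, nyquist * scale * scale)


if __name__ == "__main__":
    sr = 48000
    for scale in (0.5, 0.7, 1.0):
        print(scale, transposed_limit(sr, scale), observed_keepform_cutoff(sr, scale))
```

For a scale factor of 0.5 the two helpers give 12 kHz and 6 kHz respectively, which is the gap that the pvsbandp/mix step is meant to fill.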
Date | 2016-10-04 19:28 |
From | Victor Lazzarini |
Subject | Re: Questions to pvs opcodes |
When scaling up, pvscale will get rid of high harmonics as they go beyond sr/2. When scaling down, the top end of the spectrum will be empty, as only freqs up to sr/2 exist in the original sound (there is nothing to transpose down beyond it). Maybe that is why you noticed a cutoff. |
Date | 2016-10-05 09:54 |
From | Karin Daum |
Subject | Re: Questions to pvs opcodes |
Attachments | pvscale.pdf |
Hi Victor,

the behaviour you are describing is what you get with pvscale using kkeepform=0. This was the starting point for the problems I have (had), because it also affects those phonemes which are produced using tongue, teeth and lips, e.g. ch, f, s, sh, st, t, th, x …, which makes it difficult to understand the synthesised words. When scaling down by 0.5 the spectrum dies out at sr/4 = 12 kHz in my case. The cutoff I see for kkeepform=2 is at 6 kHz.

I’ve attached a pdf file which shows an example of how pvscale behaves, to visualise what I’ve described in words before. This analysis was done with praat. It shows four panels:

- The top left shows the input spectrum of a German “a” spoken by me. You can nicely see the formants (f0 - f11).
- The top right shows the spectrum scaled using pvscale with kscal=0.5 and kkeepform=0. The entire spectrum is scaled by 0.5. The spectrum dies out at 12 kHz, as expected with sr=48000.
- The lower left shows the same but using kkeepform=2. Here the formants are scaled, but in the range 2-6 kHz the spectrum is similar to the input spectrum. This is the desired effect. You also see the features I mentioned before. The amplitude of f0 is reduced by about 30 dB, f2 is enhanced by about 10 dB and there is a sharp cutoff at 6 kHz.
- The lower right panel shows the same as the lower left, but with f0 enhanced by about 25 dB and f2 reduced by about 10 dB (red), and with the input spectrum added for f > 6 kHz (blue).

The modification of the amplitudes is certainly application dependent. However, with the large suppression of f0 and enhancement of f2 it doesn’t sound natural, because you hear the broad resonance in the range 200 - 700 Hz below f1 to f4 that you can see in the lower left panel.

I guess this information may be interesting for others too.

Cheers,
Karin |
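The dB corrections mentioned for the lower right panel translate into linear amplitude factors of 10^(dB/20). As a rough illustration only, the Python sketch below applies such a factor to the bins around a chosen frequency in a plain list that stands in for an amplitude table (such as one filled by pvsftw). The helper names, the bandwidth argument and the example values are hypothetical and are not taken from the attached csd or pdf.

```python
def db_to_gain(db):
    """Convert a level change in dB to a linear amplitude factor."""
    return 10.0 ** (db / 20.0)


def nearest_bin(freq, sr, fftsize):
    """Index of the analysis bin whose centre is closest to freq (freq >= 0)."""
    return int(freq * fftsize / sr + 0.5)


def rebalance(amps, sr, fftsize, centre_hz, width_hz, db):
    """Return a copy of amps with the bins within width_hz of centre_hz
    scaled by the given number of dB. amps holds fftsize/2 + 1 values."""
    gain = db_to_gain(db)
    lo = nearest_bin(max(0.0, centre_hz - width_hz), sr, fftsize)
    hi = min(len(amps) - 1, nearest_bin(centre_hz + width_hz, sr, fftsize))
    out = list(amps)
    for k in range(lo, hi + 1):
        out[k] *= gain
    return out


if __name__ == "__main__":
    # Hypothetical values: boost a partial near 110 Hz by 25 dB
    # in a 1024-point analysis at sr = 48000.
    amps = [1.0] * (1024 // 2 + 1)
    boosted = rebalance(amps, 48000, 1024, 110.0, 30.0, 25.0)
    print(db_to_gain(25.0), boosted[:5])
```

A boost of 25 dB corresponds to a factor of roughly 17.8, and a cut of 10 dB to roughly 0.32.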
Date | 2016-10-05 11:26 |
From | Victor Lazzarini |
Subject | Re: Questions to pvs opcodes |
Hi Karin,

there is no way to transpose a digital signal down, say an octave, and expect that frequencies beyond sr/4 will appear. This is regardless of the method you are using to transpose, because there is nothing in the original signal beyond sr/2. After transposing, components might be synthesised up there by non-linear distortion, added noise or some other method. But there is nothing in the original to go with.

This is how pvscale works:

- mode 0: multiplies the bin freq by the scaling factor and moves the amp and freq data to a different bin according to where the freq is supposed to be.
- mode 1: does as mode 0, but before the scaling it extracts the spectral envelope from the original sound using cepstrum and liftering. After transposition, it applies the spectral envelope, shaping all amplitudes as per the original.
- mode 2: as 1, but uses the true envelope method to get the spectral envelope.

So in any mode, whenever you transpose down, everything gets moved down, noisy and pitched components alike. It looks like you want to process noise and sinusoids separately. That’s not quite possible with the phase vocoder on its own (ATS and SMS do this, but then you might not have the facilities that pv gives you).

Another thing you can try is this: use high-pass filtering to isolate the high freqs you want to keep and re-apply this after transposition by mixing it.

Regards
Victor |
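The mode 0 description can be sketched in a few lines. The following Python is a simplified illustration under stated assumptions, not the Csound source: a frame is represented as two plain lists of amplitudes and frequencies (one pair per bin), the target bin is chosen from the scaled frequency, and when two source bins land in the same target the louder one is kept (that collision rule is a choice made for this sketch only).

```python
def scale_frame_mode0(amps, freqs, scale, sr):
    """Multiply each bin frequency by scale and move the amp/freq pair
    to the bin nearest the new frequency. Bins pushed beyond sr/2 are dropped."""
    nbins = len(amps)                      # nbins = fftsize/2 + 1
    bin_width = sr / (2 * (nbins - 1))     # spacing between bin centres in Hz
    out_amps = [0.0] * nbins
    out_freqs = [0.0] * nbins
    for amp, freq in zip(amps, freqs):
        new_freq = freq * scale
        target = int(new_freq / bin_width + 0.5)
        if target < 0 or target >= nbins:
            continue                       # beyond sr/2: nothing to write
        if amp >= out_amps[target]:        # sketch-only rule: keep the louder bin
            out_amps[target] = amp
            out_freqs[target] = new_freq
    return out_amps, out_freqs


if __name__ == "__main__":
    # Tiny 5-bin frame (fftsize 8 at sr = 8000, bins 1000 Hz apart).
    amps = [0.0, 1.0, 0.0, 0.5, 0.0]
    freqs = [0.0, 1000.0, 2000.0, 3000.0, 4000.0]
    print(scale_frame_mode0(amps, freqs, 0.5, 8000))
    print(scale_frame_mode0(amps, freqs, 2.0, 8000))
```

Scaling the small test frame down leaves its upper bins empty, and scaling it up drops the partial that would land above sr/2, which matches the behaviour described for both directions of transposition.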
Date | 2016-10-05 11:46 |
From | Victor Lazzarini |
Subject | Re: Questions to pvs opcodes |
Sorry, I misread a bit there. The reason why there is nothing above 6k is possibly because the spectral envelope as extracted by mode 2 has little energy there. |
Date | 2016-10-05 16:00 |
From | Karin Daum |
Subject | Re: Questions to pvs opcodes |
Attachments | pvscale1.pdf |
Hi Victor,

>> Another thing you can try is this: use high-pass filtering to isolate the high freqs you want to keep and re-apply this
>> after transposition by mixing it.

This was my starting point some time ago, but it produced quite some distortions in the way I did it (different approaches were tried), such that I finally gave it up. |
Date | 2016-10-05 17:36 |
From | Steven Yi |
Subject | Re: Questions to pvs opcodes |
Hi Karin,

Could you mention what settings you are using for sr (I think you mentioned 48000?), ksmps, hop size, and window size? I was thinking that if the CPU requirements are okay, you could process at 96k or 192k to get additional spectrum to work with when scaling down the spectra.

steven |
Date | 2016-10-06 10:08 |
From | Karin Daum |
Subject | Re: Questions to pvs opcodes |
Hi Steven,

thanks for your comment. I’m using Csound for an audio installation in which objects (essentially speakers) are talking to each other in a pseudo language with the structure of the German language in terms of probabilities w.r.t. phonemes, number of syllables in a word, intonation etc., based on my own voice. But it can also talk normal German (initially done for checking). Since the words to be spoken are generated randomly, they are constructed from a library of phonemes and the sound is stored into tables which are then processed with pvstanal.

Up to now I’m using

sr = 48000
ksmps = 128
…
gifftsize init 1024
gihop init gifftsize/4

Since yesterday the relevant parts of the pitch manipulation look like this (kFreq is the scale factor for the pitch, which may change at k-rate and normally does so because of intonation):

klow = sr/2*kFreq*kFreq ; this is the upper cutoff obtained for kkeepform=2
fsig0 pvstanal 1/p12,kVol*kcorr,1,iscrtable,0,0,0,gifftsize,gihop
fsig pvscale fsig0,kFreq,2
….
; and now add the high frequency part beyond the cutoff of pvscale
if klow<20000 then
 fsigh pvsbandp fsig0,klow,klow+100,sr/2-500,sr/2
 fsig pvsmix fsigh,fsig
endif

This works well in terms of the high-frequency response. Before Victor’s proposal to try kkeepform=1 or 2 I had two streams: one for the phonemes which are affected by the pitch modulation (vowels, l, m, n and some plosives) and one stream for the other consonants, like:

fsig pvstanal 1/p12,kVol/kcorr,kFreq,iscrtable,0,0,0,gifftsize,gihop
fsig1 pvstanal 1/p12,kVol,1,iscrtable1,0,0,0,gifftsize,gihop
….
fsigs pvsmix fsig,fsig1

I don’t need this distinction in terms of phonemes anymore after moving to kkeepform=2 and the code given above, and I no longer have the problems which triggered my question.

There is only one caveat I could hear / see so far. There is a strong enhancement of 10 dB in the region of 300 - 500 Hz and a significant suppression at lower frequencies (about -10 dB @ 100 Hz) even for kscal=1 (this is absent if kkeepform=0 is used). The effect is reduced when using gifftsize = 2048.

Concerning timing, this is very time consuming (Mac Core i7). The CPU load is about 25 % (gifftsize = 1024) and 45 % (gifftsize = 2048) for a single voice active at a time. If several voices are talking at a time, the CPU time increases correspondingly. Changing to sr=96000 or 192000 is not really an option in terms of CPU time and not needed for the code outlined at the beginning.

Karin |
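To make the band edges in the pvsbandp call above easier to follow, here is a small Python sketch (not Csound) of the numbers involved. It assumes that pvsbandp ramps linearly between its "cut" and "full" frequencies, as its manual page describes; the function names are invented for illustration, and the klow formula is the empirical cutoff used in the code above.

```python
def highband_edges(sr, scale):
    """Band edges as used in the thread's pvsbandp call:
    (lowcut, lowfull, highfull, highcut) in Hz."""
    klow = sr / 2 * scale * scale          # empirical pvscale cutoff (kkeepform=2)
    return klow, klow + 100, sr / 2 - 500, sr / 2


def trapezoid_gain(freq, lowcut, lowfull, highfull, highcut):
    """Gain of a trapezoidal pass band with linear transition ramps (assumed shape)."""
    if freq <= lowcut or freq >= highcut:
        return 0.0
    if freq < lowfull:
        return (freq - lowcut) / (lowfull - lowcut)
    if freq <= highfull:
        return 1.0
    return (highcut - freq) / (highcut - highfull)


if __name__ == "__main__":
    sr, scale = 48000, 0.5
    edges = highband_edges(sr, scale)
    print(edges)
    for f in (5000, 6050, 10000, 23750):
        print(f, trapezoid_gain(f, *edges))
```

With sr = 48000 and a scale factor of 0.5 the untransposed band starts fading in at 6 kHz and is fully passed from 6.1 kHz, so it fills the range that pvscale leaves empty without touching the region below the cutoff.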
Date | 2016-10-06 10:24 |
From | Karin Daum |
Subject | Re: Questions to pvs opcodes |
Hi Steve,

the timing is not as bad as I wrote in my previous mail once I remove all the gymnastics I’m doing on fsig concerning the amplitudes of the formants after applying pvscale. Without this code it becomes about a factor of 3 faster.

Karin |
Date | 2016-10-07 01:09 |
From | Steven Yi |
Subject | Re: Questions to pvs opcodes |
Hi Karin,

Nothing really sticks out to me in terms of settings, but it's good to know them in order to understand the processing context. The problem you mentioned, though, reminded me of something I had read in a paper:

http://articles.ircam.fr/textes/Roebel05b/index.pdf

Röbel, Axel, and Xavier Rodet. "Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation." International Conference on Digital Audio Effects. 2005.

On page 6, it discusses:

"Nevertheless, for high pitched sounds the transposed signals sound rather dull. This is especially true if the pitch shift lowers the pitch of the signal. Further inspection of the problem reveals the following issues. The spectral envelope below the fundamental partial will generally have a rather steep slope towards 0. Pitch shifting down will therefore attenuate the fundamental and create a less complete sound perception. Moreover, due to the fact that the cepstral order needs to be adjusted to fit the highest fundamental frequency that is present in the whole signal, the formants that may be observed for lower pitched signals will be smoothed such that the sound is perceived as dull. A third problem for down transposition is that the originally unvoiced high frequency parts will be amplified by the larger amplitude of the lower frequency envelope."

It sounded similar to some of the things you mentioned, though I suppose the attenuation details you gave for fixed frequencies depend upon the spectral content of the source material. The paper discusses an adjusted pre-warping strategy (eq. 8 and 9). I was researching voice modulation with FFT earlier this year and came across this, but I never gave it a try myself.

Anyway, not sure if that's directly useful, but thought I'd mention it!

steven

p.s. The installation sounds fascinating, best of luck with it! |
Date | 2016-10-07 08:02 |
From | Karin Daum |
Subject | Re: Questions to pvs opcodes |
Hi Steve,

thank you for pointing me to the article. I will read it carefully to understand better how the algorithm works. The part you cited seems to coincide with what I observed when lowering the pitch. I also observed the third problem mentioned in that part (and determined corrections for my code) yesterday. When scanning through the article very quickly I also found some mention of splitting the spectrum into pitch-shifted and unshifted parts, which would also agree with what I’ve seen/heard experimentally.

This reference will be very helpful for me in understanding the algorithm and its consequences for the coding in my application.

Karin |