Re: [Csnd] 31 bit pseudo-random number gen in C, C++ & dsPIC assembly code
Date | 2005-09-22 15:17 |
From | Michael Gogins |
Subject | Re: [Csnd] 31 bit pseudo-random number gen in C, C++ & dsPIC assembly code |
How efficient is the Whittle algorithm compared to the Mersenne Twister? And, how random is it? The MT has become the standard high-performance pseudo-random noise generator in many software packages. It is the default PRNG in boost, for example. Regards, Mike -----Original Message----- From: Istvan Varga |
Date | 2005-09-22 16:43 |
From | jlato@mail.utexas.edu |
Subject | Re: [Csnd] 31 bit pseudo-random number gen in C, C++ & dsPIC assembly code |
It is not as random as the Mersenne Twister. The period is much smaller, and there is serial correlation between successive outputs. There are some other flaws as well. The current algorithm should be faster, though, assuming it's implemented properly. Although I personally would advocate a move to the MT, I wouldn't consider it a high priority. I don't think that super high-quality randomness is necessary for audio work; specifically I think people are much more likely to impose patterns on the audio signal than to actually hear them (assuming the 31-bit PRNG; the period of a 16-bit PSRG is clearly audible), and I don't know that anyone is doing any algorithms that depend on good randomness. Also, the smaller footprint and faster speed of the current algo. is nice for people with less optimal computers, perhaps allowing more real-time DSP than otherwise possible. john lato Quoting Michael Gogins |
Date | 2005-09-22 16:50 |
From | jlato@mail.utexas.edu |
Subject | Re: [Csnd] 31 bit pseudo-random number gen in C, C++ & dsPIC assembly code |
Argh. I fell victim to the same error as Michel, and I said in my post that it's 31-bit. Thanks, Istvan. Anyway, if it were mod(2^32), it would definately be faster than the MT. As mod(2^31), I'm not sure that's true. I might have a chance to run a comparison later this week. jwl Quoting jlato@mail.utexas.edu: > It is not as random as the Mersenne Twister. The period is much smaller, > and > there is serial correlation between successive outputs. There are some > other > flaws as well. The current algorithm should be faster, though, assuming > it's > implemented properly. > > Although I personally would advocate a move to the MT, I wouldn't consider it > a > high priority. I don't think that super high-quality randomness is > necessary > for audio work; specifically I think people are much more likely to impose > patterns on the audio signal than to actually hear them (assuming the 31-bit > PRNG; the period of a 16-bit PSRG is clearly audible), and I don't know that > anyone is doing any algorithms that depend on good randomness. Also, the > smaller footprint and faster speed of the current algo. is nice for people > with > less optimal computers, perhaps allowing more real-time DSP than otherwise > possible. > > john lato > > Quoting Michael Gogins |
Date | 2005-09-23 10:47 |
From | Robin Whittle |
Subject | Re: [Csnd] 31 bit pseudo-random number gen in C, C++ & dsPIC assembly code |
Michael Gogins wrote: > How efficient is the Whittle algorithm compared to the Mersenne Twister? Its not my algorithm - its David G. Carta's. The properly commented code is here: http://www.firstpr.com.au/dsp/rand31/rand31-park-miller-carta-int.c.txt and the active lines are: long unsigned int hi, lo; lo = constapmc * (seed31pmc & 0xFFFF); hi = constapmc * (seed31pmc >> 16); lo += (hi & 0x7FFF) << 16; lo += hi >> 15; if (lo > 0x7FFFFFFF) lo -= 0x7FFFFFFF; return ( seed31pmc = (long)lo ); The "oscbnk_rand31(long seed)" code posted by Istvan is more complex than my code and probably somewhat slower, because it involves a 64 bit variable, and I think a 32 x 32 = 64 bit multiply. Carta's algorithm uses 16 x 16 = 32 bit multiplies and 32 bit variables. I don't know how fast the Mersenne Twister is. Probably, for Csound, where there is no shortage of RAM, the Mersenne Twister would be a good choice. Its easier to use a known good (to all contemporary critics) solution rather than waste human time in debate about the quality of a lesser PRNG, just to save a few CPU cycles. On the other hand, I think that there is no audio problem with the Park-Miller-Carta algorithm in terms of using it for noise, or in terms of its cycle length. For some probabilistic purposes there might be subtle problems, but it is hard for me to imagine how this would occur. Please consider: http://www.firstpr.com.au/dsp/rand31/#Quality and try to imagine what problems may be caused. Just because we can't imagine problems doesn't mean there won't be any, but I would want to see some pretty concrete arguments before adopting a slower algorithm, since speed is so important for real-time, and for increasing the ability to work on large pieces without excessively old waiting for the file to render. My main interest with the Park-Miller-Carta algorithm was embedded microcontrollers where CPU cycles means battery drain. - Robin -- Send bugs reports to this list. To unsubscribe, send email to csound-unsubscribe@lists.bath.ac.uk |
Date | 2005-09-23 13:08 |
From | Istvan Varga |
Subject | Re: [Csnd] 31 bit pseudo-random number gen in C, C++ & dsPIC assembly code |
Robin Whittle wrote: > The "oscbnk_rand31(long seed)" code posted by Istvan is more complex > than my code and probably somewhat slower, because it involves a 64 bit > variable, and I think a 32 x 32 = 64 bit multiply. Carta's algorithm > uses 16 x 16 = 32 bit multiplies and 32 bit variables. Actually, it is faster, at least on x86 with GCC, because the compiler can take advantage of the fact that 32 bit multiplication always produces a 64 bit result (storing the most significant bits in edx). So, the function below from the current CVS sources static CS_PURE long oscbnk_rand31(long seed) { uint64_t tmp1; uint32_t tmp2; /* x = (16807 * x) % 0x7FFFFFFF */ tmp1 = (uint64_t) ((int32_t) seed * (int64_t) 16807); tmp2 = (uint32_t) tmp1 & (uint32_t) 0x7FFFFFFF; tmp2 += (uint32_t) (tmp1 >> 31); if ((int32_t) tmp2 < (int32_t) 0) tmp2 = (tmp2 + (uint32_t) 1) & (uint32_t) 0x7FFFFFFF; return (long) tmp2; } compiles to (GCC 4.1 beta, -O2 -march=pentium3 -mfpmath=387 -mno-sse; better code would be generated with -fomit-frame-pointer, eliminating the lines marked with FP): 00000000 |
Date | 2005-09-23 13:44 |
From | Robin Whittle |
Subject | Re: [Csnd] 31 bit pseudo-random number gen in C, C++ & dsPIC assembly code |
Hi Istvan, I have never tried looking at the assembly code generated from C-code like this. I imagine a 32 x 32 multiply takes more CPU cycles than a 16 x 16 - maybe twice as many? It would depend a great deal on the CPU, I expect. I understand it can be hard to test the performance of code, due to cache issues for code and data and probably other things I can't think of now. You wrote: > (note: for some reason, if the tmp2 = (tmp2 + 1) & 0x7FFFFFFF is replaced > with tmp2 -= 0x7FFFFFFF which would do the same, the compiler generates > much worse code, but only if the use of cmov instructions (-march=i686 and > above) is enabled) Hmmm . . . I think it is debatable whether code should be made more complex to generate faster code on just some CPUs with some compilers and some options set for those compilers - but I don't feel like getting into such a debate! - Robin -- Send bugs reports to this list. To unsubscribe, send email to csound-unsubscribe@lists.bath.ac.uk |
Date | 2005-09-23 14:11 |
From | Istvan Varga |
Subject | Re: [Csnd] 31 bit pseudo-random number gen in C, C++ & dsPIC assembly code |
Robin Whittle wrote: > I have never tried looking at the assembly code generated from C-code > like this. I imagine a 32 x 32 multiply takes more CPU cycles than a 16 > x 16 - maybe twice as many? It would depend a great deal on the CPU, I > expect. I understand it can be hard to test the performance of code, > due to cache issues for code and data and probably other things I can't > think of now. No, your code compiles to 32 bit multiplies even if I add explicit casts to 16 bit types (perhaps because a 32 bit multiplication is actually faster than a 16 bit one on a modern CPU running in 32 bit mode ?). > Hmmm . . . I think it is debatable whether code should be made more > complex to generate faster code on just some CPUs with some compilers > and some options set for those compilers - but I don't feel like getting > into such a debate! It is not more complex (in fact it has less statements), and only looks complicated because of the many (and perhaps often redundant) explicit casts. The only bit that is made more complex to work around inefficient code generation by GCC is the use of "add one, then AND by 0x7FFFFFFF" instead of just "subtract 0x7FFFFFFF". However, on the other hand, more complexity is removed by not implementing workarounds for legacy compilers and CPUs by software emulation of multiplying 32 bit numbers to get a 64 bit result. It is left up to the compiler to solve this in whatever way is the most efficient, but the problem is that the compiler is not always very smart. Nevertheless, I can replace the random generator with your code to be on the safe side, and use the original only if __GNUC__ and __i386__ are defined (but then, for that single platform and compiler, I may just as well use inline assembly). -- Send bugs reports to this list. To unsubscribe, send email to csound-unsubscribe@lists.bath.ac.uk |
Date | 2005-09-28 00:11 |
From | Istvan Varga |
Subject | Re: [Csnd] 31 bit pseudo-random number gen in C, C++ & dsPIC assembly code |
Here is a comparison of the time it takes to generate 100,000,000 random numbers with three different generators, compiled with various compilers on Linux. rand31: linear congruential generator (m = 0x7FFFFFFF, a = 742938285) rand: ANSI C rand() (note that it is much slower if not compiled with -static) MT: Mersenne Twister (MT19937) ------------------------------------------------------------------------ GCC 4.1 beta (-O2 -march=pentium3 -static) rand31: 1.703, rand: 4.559, MT: 2.501 GCC 4.1 beta (-O2 -march=pentium3 -fomit-frame-pointer -static) rand31: 1.731, rand: 4.561, MT: 2.321 GCC 4.1 beta (-O2 -march=pentium3) rand31: 1.702, rand: 7.004, MT: 2.500 GCC 4.0.0 (-O2 -march=pentium3 -static) rand31: 1.702, rand: 4.561, MT: 2.405 GCC 4.0.0 (-O2 -march=pentium3 -fomit-frame-pointer -static) rand31: 1.732, rand: 4.562, MT: 2.339 GCC 3.3.4 (-O2 -march=pentium3 -static) rand31: 1.863, rand: 4.172, MT: 2.579 GCC 3.3.4 (-O2 -march=pentium3 -fomit-frame-pointer -static) rand31: 1.730, rand: 4.172, MT: 2.385 ICC 9.0 (-O2 -xK -static) rand31: 1.841, rand: 4.410, MT: 2.474 ------------------------------------------------------------------------ #include |
Date | 2005-09-29 03:06 |
From | Robin Whittle |
Subject | Re: 31 bit pseudo-random number gen in C, C++ & dsPIC assembly code |
Thanks Istvan for these timing results. What CPU and clock rate were you using? Andrew Simper, on the Music-DSP mailing list ( http://shoko.calarts.edu/musicdsp ), suggested an exceedingly nifty optimisation to the mod(0x7FFFFFFF) for my code: > Change > if (lo > 0x7FFFFFFF) lo -= 0x7FFFFFFF; > to: > lo = (lo & 0x7FFFFFFF) + (lo >> 31); I think this could also be applied to the 32 x 32 = 64 bit multiplication approach you use in csoundRand31(). I think it is highly likely that an add of two math operations which can be done in parallel will be faster than any kind of test and conditional branch. This should be applicable to my dsPIC assembly code too. Andrew Simper also plans to consider using XOR on the output of the basic PRNG to minimise any value correlations. I wrote the following about XORing: On XORing, I imagine that choosing a suitable fixed 31 bit value would do a pretty good job of messing up whatever patterns might exist in the value dimension of the output stream. (Shuffling the order of the whole output words would do a good job of messing up the time dimension, which would also reduce or eliminate short-term value dimension correlations.) To choose a fixed XOR value, maybe we could choose a binary representation of "pi" or "e", or scoop out some bits from a .wav file of Jimi Hendrix playing Purple Haze. Any such XOR factor would ensure that the resulting output stream included one instance of 0x7FFFFFFF and one instance of 0, whilst removing instances of the XOR factor and its complement. I was thinking of something more fiendish, but which would be more expensive than a fixed XOR value. Suppose we maintained in RAM a 32 bit XOR factor, we could make it a left to right shift-register, feeding it bit 0 of the raw PRNG result (or bit 31 of the intermediate result before the mod() operation). This changing XOR value would surely break up any value patterns, but it would introduce a number of other "features" which may or may not be desirable. Alternatively, the XOR factor could be a counter, incrementing every cycle, or being incremented (or added to by some factor) only when particular bits were set in the raw PRNG result. That would probably extend the cycle length by some significant factor. It seems the Mersenne Twister is not at all slow, so in many situations there's no point trying to do anything too complex. However, the Mersenne Twister: http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt.html requires 624 x 32 bit words of RAM, which rules out its use in some embedded applications. - Robin |
Date | 2005-09-29 12:40 |
From | Istvan Varga |
Subject | Re: 31 bit pseudo-random number gen in C, C++ & dsPIC assembly code |
Robin Whittle wrote: > Thanks Istvan for these timing results. What CPU and clock rate were > you using? A rather old Pentium III (Celeron) 1300 MHz. >>Change >> if (lo > 0x7FFFFFFF) lo -= 0x7FFFFFFF; >>to: >> lo = (lo & 0x7FFFFFFF) + (lo >> 31); Thanks for this tip, it seems to work well, so I will use it. > I think this could also be applied to the 32 x 32 = 64 bit > multiplication approach you use in csoundRand31(). Yes, as far as the trick to avoid using % is concerned, it is identical to the code you posted. The only difference is in the multiplication; by splitting the multiplication in the suggested way, the performance depends less on the compiler and CPU type. However, when 32 to 64 bit multiply is indeed available, using that is faster, and it also removes the limitation of having to use small multipliers. Trying to compile with the optimization to remove the conditional, and using a large multiplier: /* simple linear congruential generator */ PUBLIC int csoundRand31(int *seedVal) { uint64_t tmp1; uint32_t tmp2; /* x = (742938285 * x) % 0x7FFFFFFF */ tmp1 = (uint64_t) ((int32_t) (*seedVal) * (int64_t) 742938285); tmp2 = (uint32_t) tmp1 & (uint32_t) 0x7FFFFFFF; tmp2 += (uint32_t) (tmp1 >> 31); tmp2 = (tmp2 & (uint32_t) 0x7FFFFFFF) + (tmp2 >> 31); (*seedVal) = (int) tmp2; return (int) tmp2; } I get this with GCC 4.0 and 4.1: 00000000 |