Csound Csound-dev Csound-tekno Search About

Re: [Csnd] 31 bit pseudo-random number gen in C, C++ & dsPIC assembly code

Date2005-09-22 15:17
FromMichael Gogins
SubjectRe: [Csnd] 31 bit pseudo-random number gen in C, C++ & dsPIC assembly code
How efficient is the Whittle algorithm compared to the Mersenne Twister? And, how random is it? The MT has become the standard high-performance pseudo-random noise generator in many software packages. It is the default PRNG in boost, for example.

Regards,
Mike

-----Original Message-----
From: Istvan Varga 
Sent: Sep 22, 2005 4:48 AM
To: csound@lists.bath.ac.uk
Subject: Re: [Csnd] 31 bit pseudo-random number gen in C, C++ & dsPIC assembly code

Robin Whittle wrote:

> I haven't been doing anything with Csound in recent years, but I lurk on
> this list.  Nearly 10 years ago I contributed a 31 bit pseudo-random
> number generator to Csound for noise etc.  As far as I know, it is still
> there.

Do you mean the x = (x * 16807) % 0x7FFFFFFF algorithm ? It is
actually used in some newer opcodes as well. In fact, I think using
rand() is a bad idea for a few reasons: it may be 16 bit on some systems,
it may be 32 bit or more but slow on others, different implementations
generate different pseudo-random sequences with the same seed, and
it is not thread safe (even though Csound 5 would require that).
-- 
Send bugs reports to this list.
To unsubscribe, send email to csound-unsubscribe@lists.bath.ac.uk



-- 
Send bugs reports to this list.
To unsubscribe, send email to csound-unsubscribe@lists.bath.ac.uk

Date2005-09-22 16:43
Fromjlato@mail.utexas.edu
SubjectRe: [Csnd] 31 bit pseudo-random number gen in C, C++ & dsPIC assembly code
It is not as random as the Mersenne Twister.  The period is much smaller, and
there is serial correlation between successive outputs.  There are some other
flaws as well.  The current algorithm should be faster, though, assuming it's
implemented properly.

Although I personally would advocate a move to the MT, I wouldn't consider it a
high priority.  I don't think that super high-quality randomness is necessary
for audio work; specifically I think people are much more likely to impose
patterns on the audio signal than to actually hear them (assuming the 31-bit
PRNG; the period of a 16-bit PSRG is clearly audible), and I don't know that
anyone is doing any algorithms that depend on good randomness.  Also, the
smaller footprint and faster speed of the current algo. is nice for people with
less optimal computers, perhaps allowing more real-time DSP than otherwise
possible.

john lato

Quoting Michael Gogins :

> How efficient is the Whittle algorithm compared to the Mersenne Twister? And,
> how random is it? The MT has become the standard high-performance
> pseudo-random noise generator in many software packages. It is the default
> PRNG in boost, for example.
> 
> Regards,
> Mike
> 
> -----Original Message-----
> From: Istvan Varga 
> Sent: Sep 22, 2005 4:48 AM
> To: csound@lists.bath.ac.uk
> Subject: Re: [Csnd] 31 bit pseudo-random number gen in C, C++ & dsPIC
> assembly code
> 
> Robin Whittle wrote:
> 
> > I haven't been doing anything with Csound in recent years, but I lurk on
> > this list.  Nearly 10 years ago I contributed a 31 bit pseudo-random
> > number generator to Csound for noise etc.  As far as I know, it is still
> > there.
> 
> Do you mean the x = (x * 16807) % 0x7FFFFFFF algorithm ? It is
> actually used in some newer opcodes as well. In fact, I think using
> rand() is a bad idea for a few reasons: it may be 16 bit on some systems,
> it may be 32 bit or more but slow on others, different implementations
> generate different pseudo-random sequences with the same seed, and
> it is not thread safe (even though Csound 5 would require that).
> -- 
> Send bugs reports to this list.
> To unsubscribe, send email to csound-unsubscribe@lists.bath.ac.uk
> 
> 
> 
> -- 
> Send bugs reports to this list.
> To unsubscribe, send email to csound-unsubscribe@lists.bath.ac.uk
> 



-- 
Send bugs reports to this list.
To unsubscribe, send email to csound-unsubscribe@lists.bath.ac.uk

Date2005-09-22 16:50
Fromjlato@mail.utexas.edu
SubjectRe: [Csnd] 31 bit pseudo-random number gen in C, C++ & dsPIC assembly code
Argh.  I fell victim to the same error as Michel, and I said in my post that
it's 31-bit.  Thanks, Istvan.
Anyway, if it were mod(2^32), it would definately be faster than the MT.  As
mod(2^31), I'm not sure that's true.  I might have a chance to run a comparison
later this week.
jwl

Quoting jlato@mail.utexas.edu:

> It is not as random as the Mersenne Twister.  The period is much smaller,
> and
> there is serial correlation between successive outputs.  There are some
> other
> flaws as well.  The current algorithm should be faster, though, assuming
> it's
> implemented properly.
> 
> Although I personally would advocate a move to the MT, I wouldn't consider it
> a
> high priority.  I don't think that super high-quality randomness is
> necessary
> for audio work; specifically I think people are much more likely to impose
> patterns on the audio signal than to actually hear them (assuming the 31-bit
> PRNG; the period of a 16-bit PSRG is clearly audible), and I don't know that
> anyone is doing any algorithms that depend on good randomness.  Also, the
> smaller footprint and faster speed of the current algo. is nice for people
> with
> less optimal computers, perhaps allowing more real-time DSP than otherwise
> possible.
> 
> john lato
> 
> Quoting Michael Gogins :
> 
> > How efficient is the Whittle algorithm compared to the Mersenne Twister?
> And,
> > how random is it? The MT has become the standard high-performance
> > pseudo-random noise generator in many software packages. It is the default
> > PRNG in boost, for example.
> > 
> > Regards,
> > Mike
> > 
> > -----Original Message-----
> > From: Istvan Varga 
> > Sent: Sep 22, 2005 4:48 AM
> > To: csound@lists.bath.ac.uk
> > Subject: Re: [Csnd] 31 bit pseudo-random number gen in C, C++ & dsPIC
> > assembly code
> > 
> > Robin Whittle wrote:
> > 
> > > I haven't been doing anything with Csound in recent years, but I lurk on
> > > this list.  Nearly 10 years ago I contributed a 31 bit pseudo-random
> > > number generator to Csound for noise etc.  As far as I know, it is still
> > > there.
> > 
> > Do you mean the x = (x * 16807) % 0x7FFFFFFF algorithm ? It is
> > actually used in some newer opcodes as well. In fact, I think using
> > rand() is a bad idea for a few reasons: it may be 16 bit on some systems,
> > it may be 32 bit or more but slow on others, different implementations
> > generate different pseudo-random sequences with the same seed, and
> > it is not thread safe (even though Csound 5 would require that).
> > -- 
> > Send bugs reports to this list.
> > To unsubscribe, send email to csound-unsubscribe@lists.bath.ac.uk
> > 
> > 
> > 
> > -- 
> > Send bugs reports to this list.
> > To unsubscribe, send email to csound-unsubscribe@lists.bath.ac.uk
> > 
> 
> 
> 
> -- 
> Send bugs reports to this list.
> To unsubscribe, send email to csound-unsubscribe@lists.bath.ac.uk
> 



-- 
Send bugs reports to this list.
To unsubscribe, send email to csound-unsubscribe@lists.bath.ac.uk

Date2005-09-23 10:47
FromRobin Whittle
SubjectRe: [Csnd] 31 bit pseudo-random number gen in C, C++ & dsPIC assembly code
Michael Gogins wrote:

> How efficient is the Whittle algorithm compared to the Mersenne Twister? 

Its not my algorithm - its David G. Carta's.   The properly commented
code is here:

  http://www.firstpr.com.au/dsp/rand31/rand31-park-miller-carta-int.c.txt

and the active lines are:

    long unsigned int hi, lo;

    lo = constapmc * (seed31pmc & 0xFFFF);
    hi = constapmc * (seed31pmc >> 16);

    lo += (hi & 0x7FFF) << 16;
    lo += hi >> 15;

    if (lo > 0x7FFFFFFF) lo -= 0x7FFFFFFF;

    return ( seed31pmc = (long)lo );

The "oscbnk_rand31(long seed)" code posted by Istvan is more complex
than my code and probably somewhat slower, because it involves a 64 bit
variable, and I think a 32 x 32  = 64 bit multiply.  Carta's algorithm
uses 16 x 16 = 32 bit multiplies and 32 bit variables.

I don't know how fast the Mersenne Twister is.  Probably, for Csound,
where there is no shortage of RAM, the Mersenne Twister would be a good
choice.  Its easier to use a known good (to all contemporary critics)
solution  rather than waste human time in debate about the quality of a
lesser PRNG, just to save a few CPU cycles.

On the other hand, I think that there is no audio problem with the
Park-Miller-Carta algorithm in terms of using it for noise, or in terms
of its cycle length.  For some probabilistic purposes there might be
subtle problems, but it is hard for me to imagine how this would occur.

Please consider:

  http://www.firstpr.com.au/dsp/rand31/#Quality

and try to imagine what problems may be caused.  Just because we can't
imagine problems doesn't mean there won't be any, but I would want to
see some pretty concrete arguments before adopting a slower algorithm,
since speed is so important for real-time, and for increasing the
ability to work on large pieces without excessively old waiting for the
file to render.

My main interest with the Park-Miller-Carta algorithm was embedded
microcontrollers where CPU cycles means battery drain.

  - Robin

-- 
Send bugs reports to this list.
To unsubscribe, send email to csound-unsubscribe@lists.bath.ac.uk

Date2005-09-23 13:08
FromIstvan Varga
SubjectRe: [Csnd] 31 bit pseudo-random number gen in C, C++ & dsPIC assembly code
Robin Whittle wrote:

> The "oscbnk_rand31(long seed)" code posted by Istvan is more complex
> than my code and probably somewhat slower, because it involves a 64 bit
> variable, and I think a 32 x 32  = 64 bit multiply.  Carta's algorithm
> uses 16 x 16 = 32 bit multiplies and 32 bit variables.

Actually, it is faster, at least on x86 with GCC, because the compiler can
take advantage of the fact that 32 bit multiplication always produces a 64
bit result (storing the most significant bits in edx). So, the function
below from the current CVS sources

static CS_PURE long oscbnk_rand31(long seed)
{
     uint64_t tmp1;
     uint32_t tmp2;

     /* x = (16807 * x) % 0x7FFFFFFF */
     tmp1 = (uint64_t) ((int32_t) seed * (int64_t) 16807);
     tmp2 = (uint32_t) tmp1 & (uint32_t) 0x7FFFFFFF;
     tmp2 += (uint32_t) (tmp1 >> 31);
     if ((int32_t) tmp2 < (int32_t) 0)
       tmp2 = (tmp2 + (uint32_t) 1) & (uint32_t) 0x7FFFFFFF;
     return (long) tmp2;
}

compiles to (GCC 4.1 beta, -O2 -march=pentium3 -mfpmath=387 -mno-sse;
better code would be generated with -fomit-frame-pointer, eliminating
the lines marked with FP):

00000000 :
        0:       b9 a7 41 00 00          mov    ecx,0x41a7
        5:       f7 e9                   imul   ecx
        7:       55                      push   ebp          ; FP
        8:       89 e5                   mov    ebp,esp      ; FP
        a:       89 c1                   mov    ecx,eax
        c:       81 e1 ff ff ff 7f       and    ecx,0x7fffffff
       12:       0f ac d0 1f             shrd   eax,edx,0x1f
       16:       c1 ea 1f                shr    edx,0x1f     ; junk ???
       19:       01 c8                   add    eax,ecx
       1b:       78 03                   js     20 
       1d:       5d                      pop    ebp          ; FP
       1e:       c3                      ret
       1f:       90                      nop
       20:       5d                      pop    ebp          ; FP
       21:       40                      inc    eax
       22:       25 ff ff ff 7f          and    eax,0x7fffffff
       27:       c3                      ret

(note: for some reason, if the tmp2 = (tmp2 + 1) & 0x7FFFFFFF is replaced
with tmp2 -= 0x7FFFFFFF which would do the same, the compiler generates much
worse code, but only if the use of cmov instructions (-march=i686 and above)
is enabled)

Now, here is a slightly modified (to make it easier to compare with the
above function) version of the code you posted:

#define constapmc 16807

static __attribute__((__pure__, __noinline__)) long rand31(long seed31pmc)
{
     long unsigned int hi, lo;

     lo = constapmc * (seed31pmc & 0xFFFF);
     hi = constapmc * (seed31pmc >> 16);

     lo += (hi & 0x7FFF) << 16;
     lo += hi >> 15;

     if (lo > 0x7FFFFFFF) lo -= 0x7FFFFFFF;

     return (long) lo;
}

this compiles to (using the same compiler and flags):

00000000 :
    0:   0f b7 c8                movzx  ecx,ax
    3:   c1 f8 10                sar    eax,0x10
    6:   69 c0 a7 41 00 00       imul   eax,eax,0x41a7
    c:   69 c9 a7 41 00 00       imul   ecx,ecx,0x41a7
   12:   55                      push   ebp                  ; FP
   13:   89 e5                   mov    ebp,esp              ; FP
   15:   89 c2                   mov    edx,eax
   17:   25 ff 7f 00 00          and    eax,0x7fff
   1c:   c1 ea 0f                shr    edx,0xf
   1f:   01 d1                   add    ecx,edx
   21:   c1 e0 10                shl    eax,0x10
   24:   01 c8                   add    eax,ecx
   26:   5d                      pop    ebp                  ; FP
   27:   83 f8 ff                cmp    eax,0xffffffff
   2a:   8d 90 01 00 00 80       lea    edx,[eax-2147483647]
   30:   0f 4e c2                cmovle eax,edx
   33:   c3                      ret

Now there are two multiplications instead of just one. However, it does have
the advantage of depending less on the compiler and CPU type; while versions
of GCC I tested (3.3.4, 4.0.0, 4.1 beta) can all optimize 32 to 64 bit
multiplies and produce similar assembly code to the above, the Intel compiler
is apparently not as good this time and generates a full 64 bit multiply.
-- 
Send bugs reports to this list.
To unsubscribe, send email to csound-unsubscribe@lists.bath.ac.uk

Date2005-09-23 13:44
FromRobin Whittle
SubjectRe: [Csnd] 31 bit pseudo-random number gen in C, C++ & dsPIC assembly code
Hi Istvan,

I have never tried looking at the assembly code generated from C-code
like this.  I imagine a 32 x 32 multiply takes more CPU cycles than a 16
x 16 - maybe twice as many?  It would depend a great deal on the CPU, I
expect.  I understand it can be hard to test the performance of code,
due to cache issues for code and data and probably other things I can't
think of now.

You wrote:

> (note: for some reason, if the tmp2 = (tmp2 + 1) & 0x7FFFFFFF is replaced
> with tmp2 -= 0x7FFFFFFF which would do the same, the compiler generates
> much worse code, but only if the use of cmov instructions (-march=i686 and
> above) is enabled)

Hmmm . . . I think it is debatable whether code should be made more
complex to generate faster code on just some CPUs with some compilers
and some options set for those compilers - but I don't feel like getting
into such a debate!

  - Robin

-- 
Send bugs reports to this list.
To unsubscribe, send email to csound-unsubscribe@lists.bath.ac.uk

Date2005-09-23 14:11
FromIstvan Varga
SubjectRe: [Csnd] 31 bit pseudo-random number gen in C, C++ & dsPIC assembly code
Robin Whittle wrote:

> I have never tried looking at the assembly code generated from C-code
> like this.  I imagine a 32 x 32 multiply takes more CPU cycles than a 16
> x 16 - maybe twice as many?  It would depend a great deal on the CPU, I
> expect.  I understand it can be hard to test the performance of code,
> due to cache issues for code and data and probably other things I can't
> think of now.

No, your code compiles to 32 bit multiplies even if I add explicit casts
to 16 bit types (perhaps because a 32 bit multiplication is actually faster
than a 16 bit one on a modern CPU running in 32 bit mode ?).

> Hmmm . . . I think it is debatable whether code should be made more
> complex to generate faster code on just some CPUs with some compilers
> and some options set for those compilers - but I don't feel like getting
> into such a debate!

It is not more complex (in fact it has less statements), and only looks
complicated because of the many (and perhaps often redundant) explicit
casts. The only bit that is made more complex to work around inefficient
code generation by GCC is the use of "add one, then AND by 0x7FFFFFFF"
instead of just "subtract 0x7FFFFFFF". However, on the other hand,
more complexity is removed by not implementing workarounds for legacy
compilers and CPUs by software emulation of multiplying 32 bit numbers
to get a 64 bit result. It is left up to the compiler to solve this in
whatever way is the most efficient, but the problem is that the compiler
is not always very smart.
Nevertheless, I can replace the random generator with your code to be on
the safe side, and use the original only if __GNUC__ and __i386__ are
defined (but then, for that single platform and compiler, I may just as
well use inline assembly).
-- 
Send bugs reports to this list.
To unsubscribe, send email to csound-unsubscribe@lists.bath.ac.uk

Date2005-09-28 00:11
FromIstvan Varga
SubjectRe: [Csnd] 31 bit pseudo-random number gen in C, C++ & dsPIC assembly code
Here is a comparison of the time it takes to generate 100,000,000 random
numbers with three different generators, compiled with various compilers
on Linux.

rand31: linear congruential generator (m = 0x7FFFFFFF, a = 742938285)
rand:   ANSI C rand()
         (note that it is much slower if not compiled with -static)
MT:     Mersenne Twister (MT19937)

------------------------------------------------------------------------

GCC 4.1 beta (-O2 -march=pentium3 -static)

   rand31: 1.703, rand: 4.559, MT: 2.501

GCC 4.1 beta (-O2 -march=pentium3 -fomit-frame-pointer -static)

   rand31: 1.731, rand: 4.561, MT: 2.321

GCC 4.1 beta (-O2 -march=pentium3)

   rand31: 1.702, rand: 7.004, MT: 2.500

GCC 4.0.0 (-O2 -march=pentium3 -static)

   rand31: 1.702, rand: 4.561, MT: 2.405

GCC 4.0.0 (-O2 -march=pentium3 -fomit-frame-pointer -static)

   rand31: 1.732, rand: 4.562, MT: 2.339

GCC 3.3.4 (-O2 -march=pentium3 -static)

   rand31: 1.863, rand: 4.172, MT: 2.579

GCC 3.3.4 (-O2 -march=pentium3 -fomit-frame-pointer -static)

   rand31: 1.730, rand: 4.172, MT: 2.385

ICC 9.0 (-O2 -xK -static)

   rand31: 1.841, rand: 4.410, MT: 2.474

------------------------------------------------------------------------

#include 
#include 
#include 
#include 

#define CS_PURE __attribute__ ((__pure__))
#define CS_NOINLINE __attribute__ ((__noinline__))
#define PUBLIC

typedef struct CsoundRandMTState_ {
     int         mti;
     uint32_t    mt[624];
} CsoundRandMTState;

PUBLIC CS_PURE int csoundRand31(int seedVal)
{
     uint64_t  tmp1;
     uint32_t  tmp2;

     /* x = (742938285 * x) % 0x7FFFFFFF */
     tmp1 = (uint64_t) ((int32_t) seedVal * (int64_t) 742938285);
     tmp2 = (uint32_t) tmp1 & (uint32_t) 0x7FFFFFFF;
     tmp2 += (uint32_t) (tmp1 >> 31);
     if ((int32_t) tmp2 < (int32_t) 0)
       tmp2 = (tmp2 + (uint32_t) 1) & (uint32_t) 0x7FFFFFFF;
     return (int) tmp2;
}

/* Period parameters */

#define N           624
#define M           397
#define MATRIX_A    0x9908B0DFU     /* constant vector a */
#define UPPER_MASK  0x80000000U     /* most significant w-r bits */
#define LOWER_MASK  0x7FFFFFFFU     /* least significant r bits */

/* initializes mt[N] with a seed */

PUBLIC void csoundSeedRandMT(CsoundRandMTState *p, uint32_t seedVal)
{
     int       i;
     uint32_t  x;

     p->mt[0] = x = seedVal;
     for (i = 1; i < N; i++) {
       /* See Knuth TAOCP Vol2. 3rd Ed. P.106 for multiplier. */
       /* In the previous versions, MSBs of the seed affect   */
       /* only MSBs of the array mt[].                        */
       /* 2002/01/09 modified by Makoto Matsumoto             */
       x = ((uint32_t) 1812433253 * (x ^ (x >> 30)) + (uint32_t) i);
       p->mt[i] = x;
     }
     p->mti = N;
}

static CS_NOINLINE void MT_update_state(CsoundRandMTState *p)
{
     /* mag01[x] = x * MATRIX_A  for x=0,1 */
     const uint32_t  mag01[2] = { (uint32_t) 0, (uint32_t) MATRIX_A };
     int       i;
     uint32_t  y;
     for (i = 0; i < (N - M); i++) {
       y = (p->mt[i] & UPPER_MASK) | (p->mt[i + 1] & LOWER_MASK);
       p->mt[i] = p->mt[i + M] ^ (y >> 1) ^ mag01[y & (uint32_t) 1];
     }
     for ( ; i < (N - 1); i++) {
       y = (p->mt[i] & UPPER_MASK) | (p->mt[i + 1] & LOWER_MASK);
       p->mt[i] = p->mt[i + (M - N)] ^ (y >> 1) ^ mag01[y & (uint32_t) 1];
     }
     y = (p->mt[N - 1] & UPPER_MASK) | (p->mt[0] & LOWER_MASK);
     p->mt[N - 1] = p->mt[M - 1] ^ (y >> 1) ^ mag01[y & (uint32_t) 1];
}

/* generates a random number on [0,0xffffffff]-interval */

PUBLIC uint32_t csoundRandMT(CsoundRandMTState *p)
{
     int       i = p->mti;
     uint32_t  y;

     if (i >= N) {                   /* generate N words at one time */
       MT_update_state(p);
       i = 0;
     }
     y = p->mt[i];
     p->mti = i + 1;
     /* Tempering */
     y ^= (y >> 11);
     y ^= (y << 7) & (uint32_t) 0x9D2C5680U;
     y ^= (y << 15) & (uint32_t) 0xEFC60000U;
     y ^= (y >> 18);

     return y;
}

volatile uint32_t seed_;

int main(void)
{
     CsoundRandMTState st;
     double t0, t1, t2, t3;
     struct timeval tv;
     int i;

     gettimeofday(&tv, NULL);
     t0 = (double) tv.tv_usec * 0.000001 + (double) tv.tv_sec;
     seed_ = 12345;
     for (i = 0; i < 100000000; i++)
       seed_ = csoundRand31(seed_);
     gettimeofday(&tv, NULL);
     t1 = (double) tv.tv_usec * 0.000001 + (double) tv.tv_sec;
     srand(314159);
     for (i = 0; i < 100000000; i++)
       seed_ = rand();
     gettimeofday(&tv, NULL);
     t2 = (double) tv.tv_usec * 0.000001 + (double) tv.tv_sec;
     csoundSeedRandMT(&st, 5489);
     for (i = 0; i < 100000000; i++) {
       seed_ = csoundRandMT(&st);
/*    if (i < 500) {
         printf("%10u ", seed_);
         if ((i % 5) == 4) printf("\n");
       } */
     }
     gettimeofday(&tv, NULL);
     t3 = (double) tv.tv_usec * 0.000001 + (double) tv.tv_sec;
     printf("rand31: %.3f, rand: %.3f, MT: %.3f\n",
            (t1 - t0), (t2 - t1), (t3 - t2));
     return 0;
}
-- 
Send bugs reports to this list.
To unsubscribe, send email to csound-unsubscribe@lists.bath.ac.uk

Date2005-09-29 03:06
FromRobin Whittle
SubjectRe: 31 bit pseudo-random number gen in C, C++ & dsPIC assembly code
Thanks Istvan for these timing results.  What CPU and clock rate were
you using?

Andrew Simper, on the Music-DSP mailing list (
http://shoko.calarts.edu/musicdsp ), suggested an exceedingly nifty
optimisation to the mod(0x7FFFFFFF) for my code:

> Change
>           if (lo > 0x7FFFFFFF) lo -= 0x7FFFFFFF;
> to:
>           lo = (lo & 0x7FFFFFFF) + (lo >> 31);

I think this could also be applied to the 32 x 32 = 64 bit
multiplication approach you use in csoundRand31().

I think it is highly likely that an add of two math operations which can
be done in parallel will be faster than any kind of test and conditional
branch.  This should be applicable to my dsPIC assembly code too.

Andrew Simper also plans to consider using XOR on the output of the
basic PRNG to minimise any value correlations.  I wrote the following
about XORing:

On XORing, I imagine that choosing a suitable fixed 31 bit value would
do a pretty good job of messing up whatever patterns might exist in the
value dimension of the output stream.  (Shuffling the order of the whole
output words would do a good job of messing up the time dimension, which
 would also reduce or eliminate short-term value dimension correlations.)

To choose a fixed XOR value, maybe we could choose a binary
representation of "pi" or "e", or scoop out some bits from a .wav file
of Jimi Hendrix playing Purple Haze.  Any such XOR factor would ensure
that the resulting output stream included one instance of 0x7FFFFFFF and
one instance of 0, whilst removing instances of the XOR factor and its
complement.

I was thinking of something more fiendish, but which would be more
expensive than a fixed XOR value.  Suppose we maintained in RAM a 32 bit
XOR factor, we could make it a left to right shift-register, feeding it
bit 0 of the raw PRNG result (or bit 31 of the intermediate result
before the mod() operation).  This changing XOR value would surely break
up any value patterns, but it would introduce a number of other
"features" which may or may not be desirable.

Alternatively, the XOR factor could be a counter, incrementing every
cycle, or being incremented (or added to by some factor) only when
particular bits were set in the raw PRNG result.  That would probably
extend the cycle length by some significant factor.

It seems the Mersenne Twister is not at all slow, so in many situations
there's no point trying to do anything too complex.  However, the
Mersenne Twister:

  http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt.html

requires 624 x 32 bit words of RAM, which rules out its use in some
embedded applications.

  - Robin

Date2005-09-29 12:40
FromIstvan Varga
SubjectRe: 31 bit pseudo-random number gen in C, C++ & dsPIC assembly code
Robin Whittle wrote:

> Thanks Istvan for these timing results.  What CPU and clock rate were
> you using?

A rather old Pentium III (Celeron) 1300 MHz.

>>Change
>>          if (lo > 0x7FFFFFFF) lo -= 0x7FFFFFFF;
>>to:
>>          lo = (lo & 0x7FFFFFFF) + (lo >> 31);

Thanks for this tip, it seems to work well, so I will use it.

> I think this could also be applied to the 32 x 32 = 64 bit
> multiplication approach you use in csoundRand31().

Yes, as far as the trick to avoid using % is concerned, it is
identical to the code you posted. The only difference is in the
multiplication; by splitting the multiplication in the suggested
way, the performance depends less on the compiler and CPU type.
However, when 32 to 64 bit multiply is indeed available, using
that is faster, and it also removes the limitation of having to
use small multipliers.
Trying to compile with the optimization to remove the conditional,
and using a large multiplier:

/* simple linear congruential generator */

PUBLIC int csoundRand31(int *seedVal)
{
     uint64_t  tmp1;
     uint32_t  tmp2;

     /* x = (742938285 * x) % 0x7FFFFFFF */
     tmp1 = (uint64_t) ((int32_t) (*seedVal) * (int64_t) 742938285);
     tmp2 = (uint32_t) tmp1 & (uint32_t) 0x7FFFFFFF;
     tmp2 += (uint32_t) (tmp1 >> 31);
     tmp2 = (tmp2 & (uint32_t) 0x7FFFFFFF) + (tmp2 >> 31);
     (*seedVal) = (int) tmp2;
     return (int) tmp2;
}

I get this with GCC 4.0 and 4.1:

00000000 :
    0:   55                      push   ebp
    1:   b8 ad 56 48 2c          mov    eax,0x2c4856ad
    6:   89 e5                   mov    ebp,esp
    8:   53                      push   ebx
    9:   8b 5d 08                mov    ebx,DWORD PTR [ebp+8]
    c:   f7 2b                   imul   DWORD PTR [ebx]
    e:   89 c1                   mov    ecx,eax
   10:   81 e1 ff ff ff 7f       and    ecx,0x7fffffff
   16:   0f ac d0 1f             shrd   eax,edx,0x1f
   1a:   01 c1                   add    ecx,eax
   1c:   89 c8                   mov    eax,ecx
   1e:   25 ff ff ff 7f          and    eax,0x7fffffff
   23:   c1 e9 1f                shr    ecx,0x1f
   26:   01 c8                   add    eax,ecx
   28:   89 03                   mov    DWORD PTR [ebx],eax
   2a:   c1 ea 1f                shr    edx,0x1f
   2d:   5b                      pop    ebx
   2e:   5d                      pop    ebp
   2f:   c3                      ret

So, even with a frame pointer, and taking a pointer to the seed instead
of the seed itself (this makes the function slightly more complex, but
easier to use), the code fits into 48 bytes.
Using the previously posted benchmark program, I get about the same
times when compiled with frame pointer, but the new version is apparently
faster when -fomit-frame-pointer is used.

> Andrew Simper also plans to consider using XOR on the output of the
> basic PRNG to minimise any value correlations.  I wrote the following
> about XORing:

Does using a large multiplier have an effect on this issue ?

> It seems the Mersenne Twister is not at all slow, so in many situations
> there's no point trying to do anything too complex.  However, the
> Mersenne Twister:
> 
>   http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt.html
> 
> requires 624 x 32 bit words of RAM, which rules out its use in some
> embedded applications.

The average CPU usage is indeed quite good (I found it to be better than
the rand() of glibc on Linux, which is not bad given that rand() is also
not of particularly high quality), the problem is the memory usage, and
that some calls are much slower than the average, because on every 624th
call the whole table needs to be updated.
So, this generator may be good for use as a single global PRNG, but having
a separate instance per random opcode is likely to be too expensive.
I think I will use the Mersenne Twister in the "x-class" random opcodes
(unirand, trirand, gauss, etc.) and GEN21 which is based on the same code,
using a single shared generator for an instance of Csound, and use a 31 bit
LCG in other opcodes where rand() is to be replaced.