I was able to reproduce your results easily. On my Windows XP Qosmio notebook with a Core 2 Duo, over multiple trials, I got:
1 thread: Elapsed time at end of performance: real: 6.606s, CPU: 6.609s
2 threads: Elapsed time at end of performance: real: 4.263s, CPU: 4.250s
So, with 2 threads, it runs about 1.5 times as fast (6.606s / 4.263s is roughly 1.55). I would imagine that this will scale more or less linearly with the number of cores.
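As a rough check on these figures, the sketch below computes the measured speedup and, purely as an assumption, fits Amdahl's law to the two-thread result to guess what more threads might give. The Amdahl model and the function names are illustrative and not part of Csound. A single two-thread measurement is far too little data to treat the projection as more than a guess.

```python
def speedup(t_serial, t_parallel):
    """Ratio of single-thread wall time to multi-thread wall time."""
    return t_serial / t_parallel


def amdahl_parallel_fraction(s, n):
    """Solve S = 1 / ((1 - p) + p / n) for the parallel fraction p."""
    return (1.0 - 1.0 / s) / (1.0 - 1.0 / n)


def amdahl_speedup(p, n):
    """Speedup predicted by Amdahl's law for parallel fraction p on n threads."""
    return 1.0 / ((1.0 - p) + p / n)


# "real" times reported above for 1 and 2 threads.
s2 = speedup(6.606, 4.263)
p = amdahl_parallel_fraction(s2, 2)

print(f"measured speedup with 2 threads: {s2:.2f}")
print(f"implied parallel fraction (assumed model): {p:.2f}")
for n in (2, 4, 8):
    print(f"projected speedup with {n} threads: {amdahl_speedup(p, n):.2f}")
```

If this model held, scaling would flatten out rather than stay linear, and the serial part (for example the partitioning step) would be the place to look for gains. Whether it holds for Csound is untested here.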
Your .csd is obviously much better than mine for testing this. I think we have proved that multi-threading can work, and will do Csound some good.
Now we just need to make sure it works with odd numbers of instances, that spout is protected, and that it works on the other platforms. We will then be more or less up to speed with Max/MSP 5.0.
I have now, of course, examined the code pretty closely. I don't see an easy way to speed up the partitioning, since the number of active instances can change in any kperiod. This might take some more work, since the partitioning overhead may become significant with the dozens or even hundreds of instances to be expected in a real rendering.
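To make the partitioning problem concrete, here is a minimal sketch of splitting the currently active instances into near-equal contiguous chunks, one per thread, so that odd counts are handled. It is an illustration only and not the partitioning code in Csound; the name partition and the use of plain lists are assumptions.

```python
def partition(active, num_threads):
    """Split the active instances into num_threads contiguous chunks.

    Chunk sizes differ by at most one, so an odd number of instances
    (or fewer instances than threads) is handled without special cases.
    """
    base, extra = divmod(len(active), num_threads)
    chunks = []
    start = 0
    for t in range(num_threads):
        # The first `extra` threads each take one additional instance.
        size = base + (1 if t < extra else 0)
        chunks.append(active[start:start + size])
        start += size
    return chunks


# Seven active instances on two threads: one thread gets four, the other three.
print(partition(list(range(7)), 2))
```

Because the set of active instances can differ from one kperiod to the next, something like this would have to run every kperiod, which is where the overhead comes from.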
John ffitch mentioned some idea that he has for multi-core Csound. I wish he would tell us what it is, but it looks like it has something to do with 'costing' opcodes, probably to figure out how much load each instrument template imposes. If that were known, the partitioning loop would not need to have an outer pass for instrument templates, but would only need to make one pass, which would thus be slightly more efficient.
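The details of that idea are not known here, so the following is only a guess at how per-template costs could be used: a single pass over the active instances, always handing the next instance to the thread with the smallest accumulated cost. The cost table, the template names, and the function name are all hypothetical.

```python
import heapq


def partition_by_cost(instances, template_cost, num_threads):
    """Greedy one-pass partition of (instance_id, template) pairs.

    Each instance goes to the thread whose estimated load is currently
    lowest. This is a heuristic and does not guarantee an optimal balance.
    """
    # Heap of (accumulated cost, thread index); ties go to the lower index.
    heap = [(0.0, t) for t in range(num_threads)]
    bins = [[] for _ in range(num_threads)]
    for inst_id, template in instances:
        load, t = heapq.heappop(heap)
        bins[t].append(inst_id)
        heapq.heappush(heap, (load + template_cost[template], t))
    return bins


# Hypothetical costs: a "heavy" template is three times as costly as a "light" one.
cost = {"heavy": 3.0, "light": 1.0}
active = [(0, "heavy"), (1, "heavy"), (2, "light"), (3, "light"), (4, "light")]
print(partition_by_cost(active, cost, 2))
```

With a heap the pass costs O(n log t) for n instances and t threads, so it stays cheap even with hundreds of instances.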
I've also gained some understanding of the Max/MSP approach to multi-threading. It depends on the poly~ object, which works with pluggo-ready sub-patches that have defined signal inlets and outlets (plugin~ and plugout~ for signals, pp for control parameters). These seem to be the only (or at least the main) interface with the main thread. So, I presume that the Max approach is very similar to ours. They probably run each instance of the sub-patch in its own thread, or on a thread from a pool, and synchronize the threads with barriers. So Max would need to protect the plugin~, plugout~, and pp interfaces, where Csound needs to protect the spin, spout, and channel interfaces. I imagine that the general efficiency and complexity will work out to be very similar.
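The barrier-plus-protected-output pattern described here can be sketched as follows. This is not Csound's or Max's implementation: the instances are stand-ins that each add a constant to every sample, the name render is made up, and in CPython the global interpreter lock means this shows the synchronization only, not a real speedup.

```python
import threading


def render(instances, num_threads, ksmps, num_kperiods):
    """Mix hypothetical instances into a shared buffer, one kperiod at a time."""
    spout = [0.0] * ksmps                 # shared output buffer
    spout_lock = threading.Lock()         # protects writes to spout
    # The main thread takes part in both barriers, hence num_threads + 1.
    start = threading.Barrier(num_threads + 1)
    end = threading.Barrier(num_threads + 1)
    rendered = []

    def worker(index):
        for _ in range(num_kperiods):
            start.wait()                  # wait until the kperiod begins
            for amp in instances[index::num_threads]:
                local = [amp] * ksmps     # compute outside the lock
                with spout_lock:          # only the mix-down is serialized
                    for i in range(ksmps):
                        spout[i] += local[i]
            end.wait()                    # signal that this thread is done

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for _ in range(num_kperiods):
        for i in range(ksmps):
            spout[i] = 0.0                # workers are parked at the start barrier
        start.wait()                      # release the workers
        end.wait()                        # wait for every worker to finish
        rendered.append(list(spout))
    for t in threads:
        t.join()
    return rendered


# Three instances (an odd count) on two threads.
print(render([1.0, 2.0, 3.0], num_threads=2, ksmps=4, num_kperiods=3))
```

Keeping the per-instance computation outside the lock and serializing only the mix into the shared buffer is the same idea as wrapping just the output opcode in a mutex.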
The main difference would seem to be that Csound dynamically allocates new instances as the score requires, whereas the Max user pre-allocates instances for poly~ (somebody correct me if I am wrong). Here, I think Csound has a real advantage for composers. Note, however, that Pure Data now has a form of dynamic allocation, and I am sure that the Pure Data developers will now be thinking about multi-threading.
Thanks,
Mike
-----Original Message-----
>From: Steven Yi
>Sent: Jun 29, 2008 12:04 AM
>To: Developer discussions
>Subject: [Cs-dev] Multithreaded Performance
>
>Hi All,
>
>I ran the attached CSD with csound using the normal non-threaded kperf
>and also with --num-threads=2. The results of the run are listed at
>end of email. The CSD's are simple but they use the moogladder opcode
>which is pretty heavy CPU-wise. I ran it twice with both setups on
>WinXP using latest from CVS and got pretty much the same results both
>times. The elapsed time at end of runs is below:
>
>Elapsed time at end of performance: real: 5.243s, CPU: 5.250s (using
>normal kperf)
>Elapsed time at end of performance: real: 3.776s, CPU: 3.781s (--num-threads=2)
>
>This makes me hopeful that even with multithreading as-is, there is
>some benefit. I did try with adding
>
>mutex_lock 0
>outs aout, aout
>mutex_unlock 0
>
>and it seemed to bump up the performance time with --num-threads=2 to
>about 4.05 seconds on average, but still an improvement.
>
>Just wanted to report some findings and that I think we should take a
>similar approach to creating a test suite like the one that was built
>for the new compiler, which is to build simple CSD's and slowly add a
>feature to each new test CSD (i.e. next add if-goto's).
>
>Thanks,
>steven
>
>
>$ csound threadtest.csd -o test.wav
>time resolution is 279.365 ns
>virtual_keyboard real time MIDI plugin for Csound
>0dBFS level = 32768.0
>Csound version 5.08.91 beta (double samples) Jun 28 2008
>libsndfile-1.0.17
>Reading options from $HOME/.csoundrc
>UnifiedCSD: threadtest.csd
>STARTING FILE
>Creating orchestra
>Creating score
>orchname: C:/DOCUME~1/syi/LOCALS~1/Temp\cs376.orc
>scorename: C:/DOCUME~1/syi/LOCALS~1/Temp\cs377.sco
>rtaudio: WinMM module enabled
>orch compiler:
>14 lines read
> instr 1
>Elapsed time at end of orchestra compile: real: 0.078s, CPU: 0.078s
>sorting score ...
> ... done
>Elapsed time at end of score sort: real: 0.080s, CPU: 0.078s
>Csound version 5.08.91 beta (double samples) Jun 28 2008
>displays suppressed
>0dBFS level = 32768.0
>orch now loaded
>audio buffered in 4096 sample-frame blocks
>writing 16384-byte blks of shorts to test.wav (WAV)
>SECTION 1:
>new alloc for instr 1:
>new alloc for instr 1:
>new alloc for instr 1:
>new alloc for instr 1:
>new alloc for instr 1:
>new alloc for instr 1:
>new alloc for instr 1:
>B 0.000 .. 10.000 T 10.000 TT 10.000 M: 29397.7 29397.7
>Score finished in csoundPerform().
>inactive allocs returned to freespace
>end of score. overall amps: 29397.7 29397.7
> overall samples out of range: 0 0
>0 errors in performance
>Elapsed time at end of performance: real: 5.243s, CPU: 5.250s
>118 16384-byte soundblks of shorts written to test.wav (WAV)
>Removing temporary file C:/DOCUME~1/syi/LOCALS~1/Temp\cs378.srt ...
>Removing temporary file C:/DOCUME~1/syi/LOCALS~1/Temp\cs377.sco ...
>Removing temporary file C:/DOCUME~1/syi/LOCALS~1/Temp\cs376.orc ...
>
>
>$ csound threadtest.csd -o test.wav --num-threads=2
>time resolution is 279.365 ns
>virtual_keyboard real time MIDI plugin for Csound
>0dBFS level = 32768.0
>Csound version 5.08.91 beta (double samples) Jun 28 2008
>libsndfile-1.0.17
>Reading options from $HOME/.csoundrc
>UnifiedCSD: threadtest.csd
>STARTING FILE
>Creating orchestra
>Creating score
>orchname: C:/DOCUME~1/syi/LOCALS~1/Temp\cs376.orc
>scorename: C:/DOCUME~1/syi/LOCALS~1/Temp\cs377.sco
>rtaudio: WinMM module enabled
>orch compiler:
>14 lines read
> instr 1
>Elapsed time at end of orchestra compile: real: 0.080s, CPU: 0.078s
>sorting score ...
> ... done
>Elapsed time at end of score sort: real: 0.082s, CPU: 0.078s
>Csound version 5.08.91 beta (double samples) Jun 28 2008
>displays suppressed
>0dBFS level = 32768.0
>orch now loaded
>audio buffered in 4096 sample-frame blocks
>Multithread performance: insno: -1 thread 1 of 2 starting.
>writing 16384-byte blks of shorts to test.wav (WAV)
>SECTION 1:
>new alloc for instr 1:
>Multithread performance: insno: -1 thread 0 of 2 starting.
>new alloc for instr 1:
>new alloc for instr 1:
>new alloc for instr 1:
>new alloc for instr 1:
>new alloc for instr 1:
>new alloc for instr 1:
>B 0.000 .. 10.000 T 10.000 TT 10.000 M: 29397.7 29397.7
>Score finished in csoundPerform().
>inactive allocs returned to freespace
>end of score. overall amps: 29397.7 29397.7
> overall samples out of range: 0 0
>0 errors in performance
>Elapsed time at end of performance: real: 3.776s, CPU: 3.781s
>118 16384-byte soundblks of shorts written to test.wav (WAV)
>Removing temporary file C:/DOCUME~1/syi/LOCALS~1/Temp\cs378.srt ...
>Removing temporary file C:/DOCUME~1/syi/LOCALS~1/Temp\cs377.sco ...
>Removing temporary file C:/DOCUME~1/syi/LOCALS~1/Temp\cs376.orc ...
_______________________________________________
Csound-devel mailing list
Csound-devel@lists.sourceforge.net