[Csnd] --num-threads question

Date	2013-07-16 13:47
From	Richard Henninger
Subject	[Csnd] --num-threads question
I had understood that num-threads (or number_of_threads in the CSOUND_PARAMS structure) was for indicating to csound how many cores it could use during processing and that it should be set to the maximum core count of your processor. Experimenting with this, I got curious behavior which makes me question whether I am using this parameter correctly.

On my quad core (Q9550) rendering “xanadu.csd” with ksmps=2, I got deteriorating results every time I increased number_of_threads:

num-threads=1: 7 seconds
num-threads=2: 8 seconds
num-threads=3: 13 seconds
num-threads=4: 22 seconds

All of those are ok for a 60-second piece, and 7 seconds is very acceptable to me. I am just curious why I am seeing the inverse of the behavior I expected. Have I misunderstood the purpose of this flag?

Richard Henninger
richard@rghmusic.com

Date	2013-07-16 14:00
From	Michael Gogins
Subject	Re: [Csnd] --num-threads question
Multicore rendering always involves a trade-off between the overhead of managing multiple threads (which is always considerable) and the efficiency of running multiple threads (which is only sometimes considerable). The benefit will outweigh the cost only for certain kinds of pieces (usually those with multiple instances of similar instruments) and for relatively large ksmps (2 is WAY too small).

The way to do this is to set -j equal to the number of cores and ksmps to 1000 or so and see if you get a speedup. If not, forget about multicore. If so, then reduce ksmps until results are optimal.

The overhead of threading is locking and unlocking protected state and switching between threads. Another overhead is switching code and variables into and out of low-level cache (which takes time). The gain of multi-threading is running code in parallel. You will only see such gains if there is not too much switching of code and variables in and out of low-level cache. I'm not completely sure what controls this in Csound pieces. The reason multiple instances of the same instrument speed up more is probably because it is easier for the semantic analyzer to partition the signal flow graph into efficiently manageable layers.

For pieces that are optimal for multi-threading, multiplying the number of cores by 4 produces a speedup of about 2. I do not know how far beyond 4 cores this kind of speedup goes. Usually, in parallel code, the speedups taper off after a certain number of cores because the overheads of memory management and locking increase too much. Based on what I know about other parallel codes, I think 8 cores might produce about another factor of 2 speedup, but after that I think it might begin to taper off.

I've worked on a number of parallel codes, and I can say that the parallel code in Csound is the best I have seen.
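The tuning procedure described above (start with a large ksmps and all cores, check for a speedup, then reduce ksmps) could be scripted roughly as follows. This is only a sketch and not part of the original message. The helper names `build_command`, `time_render`, `sweep`, and `main` are hypothetical. `--num-threads` is the flag named in the subject of this thread and `-n` suppresses writing sound to disk. The `--ksmps=` override is an assumption about your Csound build; check `csound --help`, and if it is not available, edit ksmps in the CSD instead.

```python
import shutil
import subprocess
import time


def build_command(csd_path, threads, ksmps, csound="csound"):
    """Return the argument list for one render of csd_path.

    --ksmps= is assumed to be supported by the installed Csound;
    verify with `csound --help` before relying on it.
    """
    return [csound, "-n", f"--num-threads={threads}", f"--ksmps={ksmps}", csd_path]


def time_render(cmd):
    """Run one render and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start


def sweep(csd_path, cores, ksmps_values, timer=time_render):
    """For each ksmps, time 1 thread against `cores` threads.

    Returns a list of (ksmps, single_seconds, multi_seconds, speedup) rows,
    where speedup > 1 means the multi-threaded render was faster.
    """
    rows = []
    for ksmps in ksmps_values:
        single = timer(build_command(csd_path, 1, ksmps))
        multi = timer(build_command(csd_path, cores, ksmps))
        rows.append((ksmps, single, multi, single / multi))
    return rows


def main():
    """Example sweep for a quad-core machine, starting from a large ksmps."""
    if shutil.which("csound") is None:
        raise SystemExit("csound not found on PATH")
    for ksmps, single, multi, speedup in sweep("xanadu.csd", 4, [1000, 500, 100, 50, 10]):
        print(f"ksmps={ksmps:5d}  1 thread: {single:6.2f}s  "
              f"4 threads: {multi:6.2f}s  speedup: {speedup:4.2f}x")
```

Calling `main()` prints one row per ksmps value. If no row shows a speedup above 1, that would match the "forget about multicore" case for the piece. Because `-n` skips disk output, the absolute times will not be directly comparable with renders that write a file.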
Hope this helps,
Mike

===========================
Michael Gogins
Irreducible Productions
http://michaelgogins.tumblr.com
Michael dot Gogins at gmail dot com

Date	2013-07-16 14:23
From	Richard Henninger
Subject	Re: [Csnd] --num-threads question
Thanks, Mike. It does help. I was just curious if I had misunderstood the parameter. From your explanation, I presume that I’d gotten it right from the doc - such as it is.

Setting parameters is like tuning an instrument. Here, the simplest setup turns out to be the best - I like that. Any of these times would be acceptable. And if telling csound to use only one core is three times better than using all four, why not? Eight to one over real time is way good enough! And for sound quality, obviously ksmps=2 is way preferred over ksmps=1000.

In an idle moment, I’ll play around with your ksmps suggestion and see if the predicted behavior occurs. The more we understand and document this new capability of csound, the more the community benefits. I am happy to hear of your confidence in the parallel code implementation.

Richard

Richard Henninger
richard@rghmusic.com
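For reference, the ratios mentioned in this thread ("three times better", "eight to one over real time") can be checked from the reported timings with a few lines of arithmetic. This is an illustrative sketch only. The piece length of 60 seconds is an assumption based on the original post, and the variable names are hypothetical.

```python
# Render times in seconds as reported in the original post, keyed by num-threads.
timings = {1: 7, 2: 8, 3: 13, 4: 22}

# Assumption: the piece is 60 seconds long, as the original post suggests.
PIECE_SECONDS = 60

baseline = timings[1]

# How many times longer each run took than the single-threaded run.
slowdown = {n: t / baseline for n, t in timings.items()}

# How many times faster than real time each run rendered.
realtime_ratio = {n: PIECE_SECONDS / t for n, t in timings.items()}

for n in sorted(timings):
    print(f"num-threads={n}: {timings[n]:2d}s  "
          f"{slowdown[n]:.2f}x the single-thread time  "
          f"{realtime_ratio[n]:.1f}x faster than real time")
```

Under that assumption, four threads take about 3.1 times as long as one thread, and the single-threaded render runs about 8.6 times faster than real time.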