Thanks for your informative response. I have re-read parts of Wilson's
dissertation, and I have a few more questions and comments. Please
give them some thought.
Also, please let me know if there is any part of this work that you
would like me to take on, either in the ParCS code itself, or in
fixing bugs to free up time for yourself.
Once again, I think the dissertation is an impressive piece of work
and covers many cases that are not obvious when one first begins to
think about the problem.
Question
--------
The dissertation documents significant speedups in pieces of interest
to me (heavy, mostly parallel instruments). Are we sure that the ParCS
code reflects all the speedups discussed in the dissertation? Several
alternative designs are discussed. Which one does the ParCS code
actually implement, or does the code follow a different design
altogether?
Comment
-------
Off the top of my head, I don't see any place to prefer mutexes to
spinlocks except while building the DAG.
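For the record, by a spinlock I just mean a busy-wait on an atomic
flag. A minimal sketch, using the GCC atomic builtins purely for
illustration (not the actual ParCS code), would be:

    /* Minimal spinlock sketch using GCC atomic builtins; for
       illustration only, not the ParCS implementation. */
    typedef volatile int spin_t;

    static inline void spin_lock(spin_t *lock)
    {
        /* busy-wait until the flag is atomically changed from 0 to 1 */
        while (__sync_lock_test_and_set(lock, 1))
            ;
    }

    static inline void spin_unlock(spin_t *lock)
    {
        __sync_lock_release(lock);   /* reset the flag to 0 */
    }

A mutex, by contrast, can put the waiting thread to sleep, which only
pays off when the wait may be long, as it can be while the DAG is
being built; the critical sections around opcode buffers are only a
handful of instructions, and there spinning is much cheaper.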
Comment
-------
It is critical to find out whether the ParCS code can be made fast
enough. That requires optimizing and debugging the code. If it is
still not fast enough, we need to consider whether another design
might work, or whether this approach is simply not going to work well
enough.
Comment
-------
What is "fast enough?" In my view it is fast enough if (a) complex
pieces run significantly faster with 2 or more cores, and (b)
performance shows some kind of scaling up with increasing numbers of
cores. I don't think it's fast enough if adding cores does no good or
even slows things down again.
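To put numbers on that: taking the figures john gives below, 13.837
seconds elapsed on one core against 3.618 seconds on four cores is a
speedup of about 3.8, which would clearly qualify; 6.92 seconds on
three cores, no better than on two, is the kind of flat spot I would
count against it.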
Comment
-------
What would be an alternative design? It seems clear that more complex
designs would incur even more threading overhead, so we should think
about simpler designs. But simpler designs would have to give up some
features of ParCS.
The simplest design I can think of would start with Steven Yi's
approach. Currently it ignores global data and simply runs multiple
instances of a given template in parallel with a barrier at the end.
There is a spinlock or something to protect the out opcode buffers.
Period. To this could be added atomic operations or spinlocks
(whichever is more efficient) on all opcode buffers. I think these
would incur little overhead (but a lot of detail work to implement).
They would, however, throw out the ParCS feature of protecting the
whole operation of first writing a global variable and then reading
it back, without interference from other threads.
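To make concrete what a spinlock on an output buffer would look like,
here is a sketch of guarding the out opcode's mix into spout, using
the spin_lock sketch above; out_perf and spout_lock are names I am
making up for illustration, and double stands in for MYFLT:

    /* Hypothetical guard around the out opcode's mix into spout.
       spout, nchnls and ksmps are the usual Csound names; out_perf
       and spout_lock are made up for this sketch. */
    static spin_t spout_lock = 0;

    void out_perf(double *spout, const double *asig,
                  int ksmps, int nchnls)
    {
        int n;
        spin_lock(&spout_lock);
        for (n = 0; n < ksmps; n++)
            spout[n * nchnls] += asig[n];  /* mix this instance in */
        spin_unlock(&spout_lock);
    }

The same pattern, repeated over every shared buffer, is what I mean
by a lot of detail work but little overhead: each lock is uncontested
almost all of the time.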
In other words, the alternative would trade DAG analysis plus
protection of only the global data at DAG edges for no DAG analysis
plus protection of all global data, on the theory that instrument
order would make contention on global data rare, and that uncontested
lock operations have very low cost.
The simpler design would throw out the DAG analysis and just use
instrument number order to determine the concurrent layers. This
would throw out some of the ParCS semantic analysis that is aimed at
finding potential speedups. But I'm not sure those potential speedups
can actually be realized, because the DAG code introduces its own
threading overhead.
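To sketch what dispatch would look like without the DAG -- layers
defined purely by instrument number, with a barrier between layers --
here is some hedged pseudocode; layer_t, run_instance and
perform_kcycle are names invented for illustration:

    /* Sketch of the simpler dispatch, all names hypothetical. Each
       layer holds the active instances of one instrument number;
       instances in a layer run in parallel, and the barrier between
       layers preserves instrument-number order for global data. */
    #include <pthread.h>

    typedef struct instance_s instance_t;
    void run_instance(instance_t *ip);  /* one k-cycle of one instance */

    typedef struct {
        instance_t **instances;
        int          count;
    } layer_t;

    void perform_kcycle(layer_t *layers, int num_layers,
                        int thread_id, int num_threads,
                        pthread_barrier_t *bar)
    {
        for (int i = 0; i < num_layers; i++) {
            /* threads divide this layer's instances among themselves;
               within a layer only the buffer locks are needed */
            for (int j = thread_id; j < layers[i].count; j += num_threads)
                run_instance(layers[i].instances[j]);
            /* the next instrument number must see everything this
               layer wrote to globals before it starts */
            pthread_barrier_wait(bar);
        }
    }

The cost per k-cycle is then one barrier wait per layer plus the
uncontested buffer locks, which is easy both to reason about and to
measure.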
Question
--------
So, is the DAG analysis actually required to compute the correct
sound, or is it only aimed at finding more speedups in comparison with
just relying on instrument number?
Regards,
Mike
On Thu, Jul 15, 2010 at 5:26 AM, john ffitch wrote:
>>>>>> "Michael" == Michael Gogins writes:
>
> Michael> How big is the "waste" instrument?
>
> At a loop of 30,000 times it did not do much but at 100,000,000 it
> made a simple oscillator take a very long time.
>
> This is for a very short sound
>                Elapsed     CPU
> 1 processor     13.837   13.83
> 2 processors     6.915   13.82
> 3 processors     6.92    13.85
> 4 processors     3.618   13.85
>
> There were 4 instruments playing, two oscillators and two plucks, and
> no common variables except spout, which is protected by a spinlock
>
> Michael> The granularity is affected both by pure instruction overhead and by
> Michael> other overheads involved in synchronizing and scheduling (threading
> Michael> overhead, if you will). With pure instruction overhead I would guess
> Michael> that a few thousand machine instructions will justify a thread context
> Michael> switch. I would also guess that's a bit bigger than a small instrument
> Michael> instance, i.e., the instruments need to be "smallish or larger" not
> Michael> "small". I'm confident of this because Steven Yi's code definitely got
> Michael> significant speedups with mid size instruments.
>
> There is code to count the "cost" of an instrument and balance one
> instr or a sequence against others
>
> Michael> Threading overhead might well throw that out the window. I would think
> Michael> that the design of synchronizing opcodes that access global data would
> Michael> introduce additional threading overhead and require many more machine
> Michael> instructions to balance out. If so then, for a given layer of
> Michael> concurrency (e.g. a set of instances of the same instrument template
> Michael> without any potential data races), running some of those instances in
> Michael> series should balance it out. If there are too many potential data
> Michael> races, i.e. too much synchronization in the opcode graphs, then it may
> Michael> never really work out.
>
> That is what needs to be viewed. At present I think some of the DAG
> dispatch is over-complex and needs a close review
>
> Michael> Using spinlocks for opcode synchronization would almost certainly
> Michael> help, does ParCS do that? I don't recall...
>
> Most is spinlock and two barriers. There are a few mutex sections and
> I am not sure why -- one is for one read of a variable which is a
> classic spinlock case
>
> Michael> Does this make any sense?
>
> Sure. I need to finish my opcode cost counter to augment the parser,
> and look critically at the dag_consume and dag_update code
>
> But clearing some of the bug list also needs to be done.
>
> Michael> Regards
> Michael> Mike
>
> ==John ffitch
>
--
Michael Gogins
Irreducible Productions
http://www.michael-gogins.com
Michael dot Gogins at gmail dot com