[Cs-dev] ParCS questions
Date | 2010-08-20 13:55 |
From | Michael Gogins |
Subject | [Cs-dev] ParCS questions |
I am going to put padded buffers in the outs opcode to see if that speeds things up by getting rid of false sharing. I am also looking closely at the ParCS code to see if there are any obvious ways to speed it up or improve it, and I have found a few things to try, so far nothing big. In the meantime, I have questions.

ParCS differs from the head branch as follows. The compiled abstract syntax tree is scanned to produce an analysis of the directed acyclic graph (DAG) that represents the signal flow graph during performance. This appears to be done for the following purposes:

(1) To identify chunks of work that may safely be assigned to different threads.
(2) To identify global read/write operations, which are inserted into critical sections protected by spinlocks.
(3) To compute the computational cost of chunks of work at the same layer of the DAG, to see if it makes sense to run those chunks in separate threads.

I would like to know the answers to the following questions:

(a) How often does (1) produce an analysis that differs from just running each insno in a separate layer, as the head branch does? I can see that theoretically the DAG analysis could produce larger and more evenly balanced chunks of work, but I would like to know whether that is actually what happens in practice.

(b) (2) implies that this approach is likely to be more efficient than simply protecting, with spinlocks in the opcode structures, all global data that could possibly be involved in data races. My question is: how often will the DAG locking scheme come up with fewer locks than just automatically locking all global data that could produce data races? If the answer is "not often", then it would be better to move the spinlocks out of the DAG analysis and into the opcode calls, as in the head branch, since that would save the work of locating read/write locations and inserting locks. I understand that (2) has the advantage of protecting all opcodes, including plugins, even if they do not already have any locking.

(c) Has anyone done any tests of (3) to see its effect, e.g. by comparing ParCS performances with and without the costing analysis?

--
Michael Gogins
Irreducible Productions
http://www.michael-gogins.com
Michael dot Gogins at gmail dot com
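[Editorial note: as an illustration of the padded-buffer idea mentioned at the top of this message, here is a minimal C sketch. It is not the actual outs code; KSMPS, NTHREADS and CACHE_LINE are assumptions chosen for the example. Each thread mixes into its own buffer, separated by a full cache line of padding so two threads never write into the same line, and the private buffers are summed into the shared output after the threads synchronize.]

#include <string.h>

#define CACHE_LINE 64     /* assumed cache line size on the target CPU */
#define KSMPS      100    /* samples per k-cycle, for illustration only */
#define NTHREADS   4

/* One mixing buffer per thread, followed by a full cache line of padding,
   so no two threads' buffers can ever fall in the same cache line
   (no false sharing on writes). */
typedef struct {
    double buf[KSMPS];
    char   pad[CACHE_LINE];
} padded_buf_t;

static padded_buf_t spout_private[NTHREADS];

/* A worker thread mixes its instrument's audio into its own buffer. */
void mix_out(int thread_id, const double *asig)
{
    double *dst = spout_private[thread_id].buf;
    for (int i = 0; i < KSMPS; i++)
        dst[i] += asig[i];
}

/* After all threads have finished the k-cycle, one thread collects the
   private buffers into the shared output buffer. */
void collect(double *spout)
{
    memset(spout, 0, sizeof(double) * KSMPS);
    for (int t = 0; t < NTHREADS; t++)
        for (int i = 0; i < KSMPS; i++)
            spout[i] += spout_private[t].buf[i];
}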
Date | 2010-08-20 16:23 |
From | John ff |
Subject | Re: [Cs-dev] ParCS questions |
>>>>> "Michael" == Michael Gogins |
Date | 2010-08-20 17:03 |
From | Michael Gogins |
Subject | Re: [Cs-dev] ParCS questions |
Could you please elaborate on:

> Just protecting global variable is not the same as the are
> possibilities for changing the instrument in order semantics

I assume you mean "there are" rather than "the are." What, in practice, would change the instrument in order semantics?

Since there are no cost data as yet, I will code an option to disable this part of the processing loop. But actually, this is the part of the ParCS branch that makes the most sense to me.

Regards,
Mike

--
Michael Gogins
Irreducible Productions
http://www.michael-gogins.com
Michael dot Gogins at gmail dot com
Date | 2010-08-20 17:21 |
From | Victor Lazzarini |
Subject | Re: [Cs-dev] ParCS questions |
I think this is the case when you have, say, instr 1 feeding instr 2: this pipeline has to be enforced, otherwise there will be a 1-ksmps delay between the two outputs. For instance:

instr 1
a1   oscil p4, p5, 1
     out a1
ga1  =  a1 + ga1
endin

instr 2
a1   reverb ga1, 1
     out a1
endin

Here instr 1 has to produce its output before instr 2 is run. Just protecting ga1 will not enforce this order. Or am I missing the point?

Victor

On 20 Aug 2010, at 17:03, Michael Gogins wrote:

> I assume you mean "there are" not "the are." What, in practice, would
> change the instrument in order semantics?
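[Editorial note: to make the ordering concern concrete, here is a minimal C sketch, an illustration only and not ParCS or Csound code. The spinlock makes each access to the global atomic, but it says nothing about which thread acquires it first, so the "reverb" side may read the value before the "oscillator" side has added this k-cycle's contribution.]

#include <pthread.h>

static double             ga1;       /* stands in for the global ga1 above */
static pthread_spinlock_t ga1_lock;

/* One k-cycle of "instr 1": add its output into the global. */
void *instr1_kcycle(void *arg)
{
    double a1 = 0.5;                  /* placeholder for the oscil output */
    pthread_spin_lock(&ga1_lock);
    ga1 += a1;                        /* ga1 = a1 + ga1 */
    pthread_spin_unlock(&ga1_lock);
    return NULL;
}

/* One k-cycle of "instr 2": read the global to feed the reverb. */
void *instr2_kcycle(void *arg)
{
    pthread_spin_lock(&ga1_lock);
    double in = ga1;                  /* may or may not see instr 1's write */
    pthread_spin_unlock(&ga1_lock);
    (void)in;                         /* ... would be fed to the reverb ... */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_spin_init(&ga1_lock, PTHREAD_PROCESS_PRIVATE);
    /* If both instruments are scheduled in parallel, the lock alone cannot
       guarantee that t1 runs before t2: either interleaving is legal. */
    pthread_create(&t1, NULL, instr1_kcycle, NULL);
    pthread_create(&t2, NULL, instr2_kcycle, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    pthread_spin_destroy(&ga1_lock);
    return 0;
}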
Date | 2010-08-20 17:49 |
From | Michael Gogins |
Subject | Re: [Cs-dev] ParCS questions |
I don't think that's the case. Locking alone would simply make one instance wait until another was done.

It sounds to me like ParCS is trying to add some new capability to Csound, e.g. one instance of the same template feeding into another instance of the same template. That's not possible with single-threaded Csound, except maybe with fractional insnos to guarantee order of signal flow.

Regards,
Mike

On Fri, Aug 20, 2010 at 12:21 PM, Victor Lazzarini
Date | 2010-08-20 18:20 |
From | Victor Lazzarini |
Subject | Re: [Cs-dev] ParCS questions |
But because both instances are accessing the variable, can we guarantee that it is instr 2 waiting for instr 1, and not the other way round?

Victor

On 20 Aug 2010, at 17:49, Michael Gogins wrote:

> I don't think that's the case. Locking alone would simply make one
> instance wait until another was done.
>
> It sounds to me like ParCS is trying to add some new capability to
> Csound, e.g. one instance of the same template feeding into another
> instance of the same template.
>
> That's not possible with single-threaded Csound, except maybe with
> fractional insnos to guarantee order of signal flow.
>
> Regards,
> Mike
>
> On Fri, Aug 20, 2010 at 12:21 PM, Victor Lazzarini
Date | 2010-08-20 19:01 |
From | Michael Gogins |
Subject | Re: [Cs-dev] ParCS questions |
Instr 2 always executes after instr 1. That's the way the Csound orc compiler puts together the list of instances to execute. This is why I'm wondering about the additional semantic analysis. I'm trying to figure out where it's actually needed.

Here we go:

0 instance.insno = 0   ; global
1 instance.insno = 1   ; instance 1 of insno 1
2 instance.insno = 1   ; instance 2 of insno 1
3 instance.insno = 2   ; instance 1 of insno 2
4 instance.insno = 3   ; instance 1 of insno 3
5 instance.insno = 3   ; instance 2 of insno 3
6 instance.insno = 3.1 ; instance 3 of insno 3

Et cetera. They execute in top-down order, always. A number of opcodes implicitly depend on this ordering already.

Regards,
Mike

On Fri, Aug 20, 2010 at 1:20 PM, Victor Lazzarini
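[Editorial note: a minimal C sketch of that execution model, illustrative only and not the actual Csound structures. The active instances sit in one list kept sorted by insno, and a sequential k-cycle simply walks it from the top, which is why instr 1's instances always run before instr 2's.]

#include <stddef.h>

typedef struct instance {
    double insno;                     /* fractional insnos keep a defined order too */
    struct instance *next;            /* next instance, in ascending insno order */
    void (*perform)(struct instance *self);
} INSTANCE;

/* One k-cycle of a sequential performance: strictly top-down, in insno order. */
void perform_kcycle(INSTANCE *active_head)
{
    for (INSTANCE *ip = active_head; ip != NULL; ip = ip->next)
        ip->perform(ip);
}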
Date | 2010-08-20 19:38 |
From | Victor Lazzarini |
Subject | Re: [Cs-dev] ParCS questions |
I know this is the case with a sequential execution, obviously, but is this observed when instruments are split into threads? In any case, my example should not be split into two threads, as it is not really parallel.

Victor

On 20 Aug 2010, at 19:01, Michael Gogins wrote:

> Instr 2 always executes after instr 1. That's the way the Csound orc
> compiler puts together the list of instances to execute. This is why
> I'm wondering about the additional semantic analysis. I'm trying to
> figure out where it's actually needed.
>
> Here we go:
>
> 0 instance.insno = 0   ; global
> 1 instance.insno = 1   ; instance 1 of insno 1
> 2 instance.insno = 1   ; instance 2 of insno 1
> 3 instance.insno = 2   ; instance 1 of insno 2
> 4 instance.insno = 3   ; instance 1 of insno 3
> 5 instance.insno = 3   ; instance 2 of insno 3
> 6 instance.insno = 3.1 ; instance 3 of insno 3
>
> Et cetera. They execute in top-down order, always. A number of
> opcodes implicitly depend on this ordering already.
>
> Regards,
> Mike
>
> On Fri, Aug 20, 2010 at 1:20 PM, Victor Lazzarini
Date | 2010-08-20 20:27 |
From | Michael Gogins |
Subject | Re: [Cs-dev] ParCS questions |
Attachments | None |
In the head branch the list of instances, already sorted by insno, is divided into layers by insno. All instances with the same insno can run at the same time, but the layers still run in insno order.

MKG from cell phone

On Aug 20, 2010 2:34 PM, "Victor Lazzarini" <Victor.Lazzarini@nuim.ie> wrote:
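[Editorial note: a minimal C sketch of that head-branch scheme, an illustration under stated assumptions rather than the real scheduler. Each layer holds all instances of one insno, the worker threads split the instances of the current layer between them, and a barrier keeps any thread from starting layer N+1 until every instance in layer N has finished.]

#include <pthread.h>

#define NTHREADS 4

/* One layer = all active instances of one insno. */
typedef struct {
    void (**jobs)(int idx);   /* one job per instance in this layer */
    int    njobs;
} LAYER;

typedef struct {
    LAYER *layers;
    int    nlayers;
    int    tid;               /* this worker's index, 0..NTHREADS-1 */
} WORKER_ARG;

static pthread_barrier_t layer_barrier;

void *worker(void *p)
{
    WORKER_ARG *w = (WORKER_ARG *)p;
    for (int l = 0; l < w->nlayers; l++) {
        LAYER *lay = &w->layers[l];
        /* instances within a layer may run concurrently:
           each thread takes every NTHREADS-th one */
        for (int j = w->tid; j < lay->njobs; j += NTHREADS)
            lay->jobs[j](j);
        /* but layers still run in insno order: wait for everyone */
        pthread_barrier_wait(&layer_barrier);
    }
    return NULL;
}

int main(void)
{
    LAYER *layers  = NULL;    /* building the layers from the sorted */
    int    nlayers = 0;       /* instance list is elided in this sketch */
    pthread_t  th[NTHREADS];
    WORKER_ARG args[NTHREADS];

    pthread_barrier_init(&layer_barrier, NULL, NTHREADS);
    for (int t = 0; t < NTHREADS; t++) {
        args[t].layers  = layers;
        args[t].nlayers = nlayers;
        args[t].tid     = t;
        pthread_create(&th[t], NULL, worker, &args[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(th[t], NULL);
    pthread_barrier_destroy(&layer_barrier);
    return 0;
}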
Date | 2010-08-21 22:16 |
From | Michael Gogins |
Subject | Re: [Cs-dev] ParCS questions |
Duh, after reading the ParCS code again I finally get it. The "semantic analysis" and insertion of locks is about global variables, not global data in opcodes. Obviously, two instruments in the same layer can write to and read from the same global variable. Without protection, a global variable that multiple instances of the same instrument both read and write cannot be accessed safely.

I believe the same functionality could be achieved by using opcodes, instead of variables, to read and write shared data, and putting spinlocks on those opcodes, but this would not be backwards compatible. However, do we need to assure backwards compatibility for an old orchestra in multi-threaded new Csound, when we would still have backwards compatibility with new Csound running single-threaded?

In any event, after a fresh checkout and build today, the ParCS branch is performing slightly better, especially at smaller ksmps. There is now a real speedup at 100 ksmps, which is much more musical.

CSD              Branch               -r     -k     ksmps  Threads  Time    Speedup
CloudStrata.csd  ParCS/all spinlock   96000   960     100  4        38.815  1.45
CloudStrata.csd  ParCS/all spinlock   96000    96    1000  4        29.696  1.89
CloudStrata.csd  ParCS/mutex          96000    96    1000  4        31.325  1.79
CloudStrata.csd  ParCS/mutex          96000    96    1000  --       56.092
CloudStrata.csd  ParCS/spinlock       96000    96    1000  4        30.189  1.86
CloudStrata.csd  head/all spinlock    96000   960     100  4        60.759  1.19
CloudStrata.csd  head/all spinlock    96000   960     100  --       72.515
CloudStrata.csd  head/all spinlock    96000    96    1000  4        33.110  1.82
CloudStrata.csd  head/all spinlock    96000    96    1000  --       60.120
xanadu.csd       ParCS/all spinlock   96000  9600      10  4        12.243  0.95
xanadu.csd       ParCS/all spinlock   96000  9600      10  --       11.657
xanadu.csd       ParCS/all spinlock   96000   960     100  4         4.920  1.68
xanadu.csd       ParCS/all spinlock   96000   960     100  --        8.286
xanadu.csd       ParCS/all spinlock   96000    96    1000  4         3.916  1.94
xanadu.csd       ParCS/all spinlock   96000    96    1000  --        7.609

I'm not sure how much of this is due to my changing all mutexes and spinlocks to plain Pthreads spinlocks throughout Csound, and how much is due to other changes that jpff has made in the ParCS code (e.g. costing numbers are now printing out). But since the improvement at smaller ksmps with all spinlocks is not as great in the head branch, the implication is that jpff's changes in ParCS have helped.

Regards,
Mike

On Fri, Aug 20, 2010 at 3:27 PM, Michael Gogins
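[Editorial note: a sketch of the alternative suggested above, reading and writing shared data through opcodes rather than bare global variables. This is a minimal C illustration of how such opcodes might look, an assumption rather than existing Csound code. Each bus carries its own pthread spinlock, so any two instances in the same layer can mix into it or read it safely; note that this only prevents data races, it does not by itself enforce the instr 1 before instr 2 ordering discussed earlier in the thread.]

#include <pthread.h>

/* A global "bus" with its own spinlock, in the spirit of chnmix/chnget. */
typedef struct {
    double             value;
    pthread_spinlock_t lock;
} GLOBAL_BUS;

void bus_init(GLOBAL_BUS *b)
{
    b->value = 0.0;
    pthread_spin_init(&b->lock, PTHREAD_PROCESS_PRIVATE);
}

/* Write side: an instrument instance accumulates its signal onto the bus. */
void bus_mix(GLOBAL_BUS *b, double sig)
{
    pthread_spin_lock(&b->lock);
    b->value += sig;
    pthread_spin_unlock(&b->lock);
}

/* Read side: another instance takes the current bus value, e.g. for a reverb. */
double bus_get(GLOBAL_BUS *b)
{
    pthread_spin_lock(&b->lock);
    double v = b->value;
    pthread_spin_unlock(&b->lock);
    return v;
}

int main(void)
{
    GLOBAL_BUS ga1bus;
    bus_init(&ga1bus);
    bus_mix(&ga1bus, 0.5);            /* instr 1 side */
    double in = bus_get(&ga1bus);     /* instr 2 side */
    (void)in;
    pthread_spin_destroy(&ga1bus.lock);
    return 0;
}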