Hi Michael,
Thanks very much for this report! Also thanks to Victor and John for
their work on ParCS. This is all very exciting to hear!
Regarding the OpenMP branch, does it also require no intervention by
the user, as the ParCS branch does (i.e., no need for the mutex
opcodes I wrote)? If so, that's great!
Also, I recalled the other day that the last time I spent any time
with the original multiprocessor code, it showed speedups when
rendering to disk, but it seemed to produce more breakups during
realtime rendering. It would be nice to have a test orchestra that
just keeps calling event to add more instances of a sine-wave
oscillator, so we could retest how many instances one can get up to
before breakups occur.
Thanks again!
steven
On Wed, Aug 18, 2010 at 11:34 PM, Michael Gogins
wrote:
> CSD              Branch   -k   Threads   Time     Speedup
>
> CloudStrata.csd  head     96   --        59.250
> CloudStrata.csd  head     96   4         38.183   1.55
> CloudStrata.csd  OpenMP   96   --        57.376
> CloudStrata.csd  OpenMP   96   4         30.641   1.87
> CloudStrata.csd  ParCS    96   --        57.296
> CloudStrata.csd  ParCS    96   4         31.653   1.81
>
> xanadu.csd       head     96   --        7.730
> xanadu.csd       head     96   4         12.246   0.63
> xanadu.csd       OpenMP   96   --        7.703
> xanadu.csd       OpenMP   96   4         4.994    1.54
> xanadu.csd       ParCS    96   --        7.944
> xanadu.csd       ParCS    96   4         4.175    1.90
>
> head is Steven Yi's original implementation of concurrency in Csound.
> It uses a pool of pthreads and barriers to synchronize multiple
> instances of instruments with the same insno.
>
> OpenMP is my re-implementation of Steven's code using OpenMP.
> According to online writings about OpenMP, its internal implementation
> of concurrency would in this case also use a pool of threads and
> barriers pretty much identical to Steven's code. The main difference
> is that my code invokes no threading operations if there is only one
> instance for an insno, thus cutting down on threading overhead.
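The single-instance shortcut can be expressed very compactly in OpenMP via the if() clause on the parallel construct. A hedged sketch (perform_layer is a hypothetical function, not the branch's actual code):

```c
/* Sketch: OpenMP runs the loop serially when the if() clause is false,
   so no threads are created or woken for a single-instance layer. */
#include <stddef.h>

double perform_layer(const double *instance_out, int n)
{
    double mix = 0.0;
    /* With n == 1 the parallel region is suppressed entirely,
       avoiding all threading overhead for that common case. */
    #pragma omp parallel for reduction(+:mix) if(n > 1)
    for (int i = 0; i < n; i++)
        mix += instance_out[i];
    return mix;
}
```

Note the pragma degrades gracefully: compiled without OpenMP support, the loop simply runs serially with the same result.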
>
> ParCS is John ffitch's branch implementing a more sophisticated
> version of concurrency, which tracks the cost of instrument execution
> to balance against threading overhead.
>
> Synchronization of shared data to prevent data races is done using
> spinlocks to protect various data objects. Currently this includes
> the out opcodes and some memory operations. The synchronization
> covers the same objects in all branches, but is done using pthread
> spinlocks in the OpenMP branch and more primitive operations in the
> other branches.
>
> These results are instructive. It is beginning to make sense to use
> multithreading in Csound for certain types of orchestras if you have
> multiple cores. It should not be difficult to extend the spinlocks to
> protect all, or at least almost all, potential data races. I have
> other data, which I will post soon, indicating that in the OpenMP
> branch at least, performance scales more or less linearly with the
> number of cores. I will test this with the other branches and post the
> results.
>
> The most instructive implication of the data is that the OpenMP and
> ParCS data appear, at this very early stage of analysis, to be roughly
> equivalent in terms of speedup. That, in turn, may imply that the
> signal flow graph analysis and costing in the ParCS branch do not
> impose any significant overhead above and beyond the thread pooling
> and barriers that are common to all branches.
>
> It is clear from tests with higher values of -k that there is
> significant threading overhead, equivalent to hundreds of iterations
> of the inner performance loop for these orchestras. Unfortunately,
> this data implies that much of this overhead is more or less
> irreducible, since we cannot get rid of the barrier and sleep/awaken
> thread operations required to synchronize the "layers" of instruments
> with the same insno.
>
> I will merge the OpenMP branch back into the head branch to replace
> Steven Yi's original implementation, since the OpenMP implementation
> is both faster and simpler.
>
> I will now focus on looking at the ParCS code to see if it can be made
> faster in most cases than the OpenMP implementation. It's beginning to
> look like this can only be done by more rigorously avoiding all
> threading operations that cost more than they are worth. Possibly
> optimizing the signal flow graph and costing analysis would help, so
> that is what I will look at first.
>
> --
> Michael Gogins
> Irreducible Productions
> http://www.michael-gogins.com
> Michael dot Gogins at gmail dot com
>
> _______________________________________________
> Csound-devel mailing list
> Csound-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/csound-devel
>