Hi Michael,
Thanks very much for this report! Also thanks to Victor and John for
their work on ParCS. This is all very exciting to hear!
Regarding the OpenMP branch, does it also require no intervention by
the user, as the ParCS branch does (i.e., no need for the mutex
opcodes I wrote)? If so, that's great!
Also, I recalled the other day that the last time I spent any time
with the original multiprocessor code, it showed speedups when
rendering to disk, but it seemed to produce more breakups during
realtime rendering. It would be nice to have a test orchestra that
just keeps calling event to add more instances of a sine-wave
oscillator, so we could retest how many instances one can get up to
before breakups occur.
Thanks again!
steven
On Wed, Aug 18, 2010 at 11:34 PM, Michael Gogins
wrote:
> CSD              Branch   -k   Threads   Time     Speedup
>
> CloudStrata.csd  head     96   --        59.250
> CloudStrata.csd  head     96   4         38.183   1.55
> CloudStrata.csd  OpenMP   96   --        57.376
> CloudStrata.csd  OpenMP   96   4         30.641   1.87
> CloudStrata.csd  ParCS    96   --        57.296
> CloudStrata.csd  ParCS    96   4         31.653   1.81
>
> xanadu.csd       head     96   --        7.730
> xanadu.csd       head     96   4         12.246   0.63
> xanadu.csd       OpenMP   96   --        7.703
> xanadu.csd       OpenMP   96   4         4.994    1.54
> xanadu.csd       ParCS    96   --        7.944
> xanadu.csd       ParCS    96   4         4.175    1.90
>
> head is Steven Yi's original implementation of concurrency in Csound.
> It uses a pool of pthreads and barriers to synchronize multiple
> instances of instruments with the same insno.
>
> OpenMP is my re-implementation of Steven's code using OpenMP.
> According to online writings about OpenMP, its internal implementation
> of concurrency would in this case also use a pool of threads and
> barriers pretty much identical to Steven's code. The main difference
> is that my code invokes no threading operations if there is only one
> instance for an insno, thus cutting down on threading overhead.
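The single-instance shortcut can be expressed very compactly in OpenMP via the if() clause on the parallel construct. A hedged sketch (perform_layer is a hypothetical function, not the branch's actual code):

```c
/* Sketch: OpenMP runs the loop serially when the if() clause is false,
   so no threads are created or woken for a single-instance layer. */
#include <stddef.h>

double perform_layer(const double *instance_out, int n)
{
    double mix = 0.0;
    /* With n == 1 the parallel region is suppressed entirely,
       avoiding all threading overhead for that common case. */
    #pragma omp parallel for reduction(+:mix) if(n > 1)
    for (int i = 0; i < n; i++)
        mix += instance_out[i];
    return mix;
}
```

Note the pragma degrades gracefully: compiled without OpenMP support, the loop simply runs serially with the same result.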
>
> ParCS is John ffitch's branch implementing a more sophisticated
> version of concurrency, which tracks the cost of instrument execution
> to balance against threading overhead.
>
> Synchronization of shared data to prevent data races is done using
> spinlocks to protect various data objects. Currently this includes
> the out opcodes and some memory operations. The synchronization
> covers the same objects in all branches, but is done using pthread
> spinlocks in the OpenMP branch and more primitive operations in the
> other branches.
>
> These results are instructive. It is beginning to make sense to use
> multithreading in Csound for certain types of orchestras if you have
> multiple cores. It should not be difficult to extend the spinlocks to
> protect all, or at least almost all, potential data races. I have
> other data, which I will post soon, indicating that in the OpenMP
> branch at least, performance scales more or less linearly with the
> number of cores. I will test this with the other branches and post the
> results.
>
> The most instructive implication of the data is that the OpenMP and
> ParCS data appear, at this very early stage of analysis, to be roughly
> equivalent in terms of speedup. That, in turn, may imply that the
> signal flow graph analysis and costing in the ParCS branch do not
> impose any significant overhead above and beyond the thread pooling
> and barriers that are common to all branches.
>
> It is clear from tests with higher values of -k that there is
> significant threading overhead, equivalent to hundreds of iterations
> of the inner performance loop for these orchestras. Unfortunately,
> this data implies that much of this overhead is more or less
> irreducible, since we cannot get rid of the barrier and sleep/awaken
> thread operations required to synchronize the "layers" of instruments
> with the same insno.
>
> I will merge the OpenMP branch back into the head branch to replace
> Steven Yi's original implementation, since the OpenMP implementation
> is both faster and simpler.
>
> I will now focus on looking at the ParCS code to see if it can be made
> faster in most cases than the OpenMP implementation. It's beginning to
> look like this can only be done by more rigorously avoiding all
> threading operations that cost more than they are worth. Possibly
> optimizing the signal flow graph and costing analysis would help, so
> that is what I will look at first.
>
> --
> Michael Gogins
> Irreducible Productions
> http://www.michael-gogins.com
> Michael dot Gogins at gmail dot com
>
> _______________________________________________
> Csound-devel mailing list
> Csound-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/csound-devel
>