
[Cs-dev] Some preliminary numerical comparisons of different implementations of concurrency in Csound

Date: 2010-08-19 04:34
From: Michael Gogins
Subject: [Cs-dev] Some preliminary numerical comparisons of different implementations of concurrency in Csound

CSD              Branch  -k  Threads    Time  Speedup

CloudStrata.csd  head    96       --  59.250
CloudStrata.csd  head    96        4  38.183     1.55
CloudStrata.csd  OpenMP  96       --  57.376
CloudStrata.csd  OpenMP  96        4  30.641     1.87
CloudStrata.csd  ParCS   96       --  57.296
CloudStrata.csd  ParCS   96        4  31.653     1.81

xanadu.csd       head    96       --   7.730
xanadu.csd       head    96        4  12.246     0.63
xanadu.csd       OpenMP  96       --   7.703
xanadu.csd       OpenMP  96        4   4.994     1.54
xanadu.csd       ParCS   96       --   7.944
xanadu.csd       ParCS   96        4   4.175     1.90
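
The Speedup column is consistent with dividing each single-threaded time (the "--" row) by the matching 4-thread time. A minimal Python sketch that recomputes it from the posted timings follows. The efficiency figure (speedup divided by thread count) is an added illustration and is not part of the original table.

```python
# Timings copied from the table: (single-threaded time, 4-thread time).
timings = {
    ("CloudStrata.csd", "head"): (59.250, 38.183),
    ("CloudStrata.csd", "OpenMP"): (57.376, 30.641),
    ("CloudStrata.csd", "ParCS"): (57.296, 31.653),
    ("xanadu.csd", "head"): (7.730, 12.246),
    ("xanadu.csd", "OpenMP"): (7.703, 4.994),
    ("xanadu.csd", "ParCS"): (7.944, 4.175),
}
THREADS = 4


def speedup(serial, parallel):
    """Ratio of single-threaded time to multi-threaded time."""
    return serial / parallel


for (csd, branch), (serial, parallel) in timings.items():
    s = speedup(serial, parallel)
    # Efficiency is an extra figure: speedup per thread used.
    print(f"{csd:16s} {branch:7s} speedup {s:.2f}  efficiency {s / THREADS:.0%}")
```

A speedup below 1.0, as for xanadu.csd on head, means the 4-thread run was slower than the single-threaded one.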

head is Steven Yi's original implementation of concurrency in Csound.
It uses a pool of pthreads and barriers to synchronize multiple
instances of instruments with the same insno.

OpenMP is my re-implementation of Steven's code using OpenMP.
According to online writings about OpenMP, its internal implementation
of concurrency would in this case also use a pool of threads and
barriers pretty much identical to Steven's code. The main difference
is that my code invokes no threading operations if there is only one
instance for an insno, thus cutting down on threading overhead.
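
The dispatch rule described above (no threading operations when an insno has only one instance) can be sketched in Python. This is an illustration of the control flow only, not the actual Csound or OpenMP code: `perform_kperiod`, the `(insno, callable)` representation, and running layers in ascending insno order are assumptions made for the example, and Python threads will not show the real speedups.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def perform_kperiod(instances, pool):
    """Run one control period.

    `instances` is a list of (insno, callable) pairs; both the name and
    the representation are hypothetical, chosen for this sketch.
    """
    layers = defaultdict(list)
    for insno, perform in instances:
        layers[insno].append(perform)

    results = []
    for insno in sorted(layers):  # assumed: layers run in insno order
        layer = layers[insno]
        if len(layer) == 1:
            # Single instance: call it directly, no threading operations.
            results.append(layer[0]())
        else:
            # Several instances: run them on the pool. Consuming the map
            # iterator waits for all of them, which acts as the barrier
            # before the next layer starts.
            results.extend(pool.map(lambda f: f(), layer))
    return results


with ThreadPoolExecutor(max_workers=4) as pool:
    demo = [(1, lambda: "a"), (2, lambda: "b"), (2, lambda: "c"), (3, lambda: "d")]
    print(perform_kperiod(demo, pool))
```

Here only the layer for insno 2 touches the thread pool; insno 1 and 3 run inline on the calling thread.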

ParCS is John ffitch's branch implementing a more sophisticated
version of concurrency, which tracks the cost of instrument execution
to balance against threading overhead.
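
The idea of weighing execution cost against threading overhead can be illustrated with a toy decision rule. This is not the ParCS algorithm, whose details are not given here. The cost units, the `THREAD_OVERHEAD` value, and the function `should_parallelize` are all hypothetical.

```python
# All numbers are hypothetical, in arbitrary cost units.
THREAD_OVERHEAD = 500.0  # assumed fixed cost of handing a layer to threads


def should_parallelize(instance_costs, threads, overhead=THREAD_OVERHEAD):
    """Return True if the estimated parallel cost beats the serial cost."""
    if not instance_costs:
        return False
    serial = sum(instance_costs)
    # Optimistic estimate: work splits evenly across threads, but can never
    # finish faster than the single most expensive instance.
    parallel = max(serial / threads, max(instance_costs)) + overhead
    return parallel < serial


print(should_parallelize([100.0, 120.0], threads=4))  # cheap layer
print(should_parallelize([900.0] * 8, threads=4))     # expensive layer
```

Under these assumed numbers the cheap layer is run serially and the expensive one is dispatched to threads.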

Synchronization of shared data to prevent data races is done using
spinlocks to protect various data objects. Currently this includes the
out opcodes and some memory operations. The synchronization is for the
same objects in all branches, but done using pthread spinlocks in the
OpenMP branch and more primitive operations in the other branches.
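
As an illustration of why the out opcodes need protection, here is a minimal Python sketch in which several instrument instances mix into one shared output buffer. Python has no spinlock in its standard library, so `threading.Lock` is used as a stand-in. `OutputBuffer`, `mix_in`, and `run_instances` are hypothetical names, not Csound internals.

```python
import threading


class OutputBuffer:
    """Shared output buffer for one control period (hypothetical)."""

    def __init__(self, ksmps):
        self.samples = [0.0] * ksmps
        self._lock = threading.Lock()  # stand-in for a spinlock

    def mix_in(self, block):
        # The read-modify-write on each sample must not interleave with
        # another instance's, so the whole accumulation holds the lock.
        with self._lock:
            for i, x in enumerate(block):
                self.samples[i] += x


def run_instances(buffer, blocks):
    """Mix every block into the buffer, one thread per block."""
    threads = [threading.Thread(target=buffer.mix_in, args=(b,)) for b in blocks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


buf = OutputBuffer(ksmps=4)
run_instances(buf, [[0.25] * 4 for _ in range(8)])
print(buf.samples)
```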

These results are instructive. It is beginning to make sense to use
multithreading in Csound for certain types of orchestras if you have
multiple cores. It should not be difficult to extend the spinlocks to
protect all, or at least almost all, potential data races. I have
other data, which I will post soon, indicating that in the OpenMP
branch at least, performance scales more or less linearly with the
number of cores. I will test this with the other branches and post the
results.

The most instructive implication of the data is that the OpenMP and
ParCS data appear, at this very early stage of analysis, to be roughly
equivalent in terms of speedup. That, in turn, may imply that the
signal flow graph analysis and costing in the ParCS branch do not
impose any significant overhead above and beyond the thread pooling
and barriers that are common to all branches.

It is clear from tests with higher values of -k that there is
significant threading overhead, equivalent to hundreds of iterations
of the inner performance loop for these orchestras. Unfortunately,
this data implies that much of this overhead is more or less
irreducible, since we cannot get rid of the barrier and sleep/awaken
thread operations required to synchronize the "layers" of instruments
with the same insno.
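
A rough model makes the dependence on -k concrete. Assume each control period pays a fixed synchronization cost, expressed as a number of inner-loop iterations, while the useful work per period is ksmps iterations split perfectly across threads. The sample rate, the overhead of 200 iterations, and the perfect split are assumptions for illustration, not measurements.

```python
SR = 44100            # assumed sample rate
THREADS = 4
OVERHEAD_ITERS = 200  # hypothetical per-period sync cost, in loop iterations


def modeled_speedup(kr, sr=SR, threads=THREADS, overhead=OVERHEAD_ITERS):
    """Speedup predicted by the toy model for control rate kr."""
    ksmps = sr / kr                        # inner-loop iterations per period
    serial = ksmps                         # single-threaded cost per period
    parallel = ksmps / threads + overhead  # split work plus fixed sync cost
    return serial / parallel


for kr in (96, 441, 4410):
    print(kr, round(modeled_speedup(kr), 2))
```

In this model a higher control rate shrinks ksmps while the fixed cost stays the same, so the predicted speedup drops below 1.0 once the overhead dominates.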

I will merge the OpenMP branch back into the head branch to replace
Steven Yi's original implementation, since the OpenMP implementation
is both faster and simpler.

I will now focus on looking at the ParCS code to see if it can be made
faster in most cases than the OpenMP implementation. It's beginning to
look like this can only be done by more rigorously avoiding all
threading operations that cost more than they are worth. Possibly
optimizing the signal flow graph and costing analysis would help, so
that is what I will look at first.

--
Michael Gogins
Irreducible Productions
http://www.michael-gogins.com
Michael dot Gogins at gmail dot com

------------------------------------------------------------------------------
This SF.net email is sponsored by 

Make an app they can't live without
Enter the BlackBerry Developer Challenge
http://p.sf.net/sfu/RIM-dev2dev 
_______________________________________________
Csound-devel mailing list
Csound-devel@lists.sourceforge.net

Date: 2010-08-19 04:53
From: Steven Yi
Subject: Re: [Cs-dev] Some preliminary numerical comparisons of different implementations of concurrency in Csound

Hi Michael,

Thanks very much for this report! Also thanks to Victor and John for
their work on ParCS.  This is all very exciting to hear!

Regarding the OpenMP branch, does it also require no intervention by
the user, as with the ParCS branch (i.e. no need for the mutex opcodes
I wrote)?  If so, that's great!

Also, I had a thought the other day. The last time I spent any time
with the original multiprocessor code, speedups were exhibited in
rendering to disk, but I seem to remember getting more breakups during
realtime rendering.  To retest, it would be nice to have a test
orchestra that just keeps calling event to add more instances of a
sine wave oscillator, to see how many one can get up to before
breakups.
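
One way to approximate such a stress test offline is to generate a score in which one more note starts at each step and all notes sustain to the end, so the number of active instances grows steadily. This swaps in score generation for the event opcode mentioned above. The function `staircase_score` and the p-field layout (p4 amplitude, p5 frequency, for a sine instrument numbered 1 that would have to be written separately) are assumptions.

```python
def staircase_score(steps, step_dur=1.0, instr=1, amp=0.01, freq=440.0):
    """Return Csound score lines that add one more note every step_dur seconds."""
    total = steps * step_dur
    lines = []
    for n in range(steps):
        start = n * step_dur
        # Each note lasts until the end of the score, so the number of
        # simultaneously active instances grows by one per step.
        lines.append(f"i {instr} {start:g} {total - start:g} {amp:g} {freq:g}")
    return lines


print("\n".join(staircase_score(4)))
```

The step at which breakups first appear during realtime playback then gives a rough instance count for comparing branches.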

Thanks again!
steven


Date: 2010-08-19 08:31
From: Victor Lazzarini
Subject: Re: [Cs-dev] Some preliminary numerical comparisons of different implementations of concurrency in Csound

John's work completely. I'm just testing it.

Victor
On 19 Aug 2010, at 04:53, Steven Yi wrote:

> Thanks very much for this report! Also thanks to Victor and John for
> their work on ParCS.  This is all very exciting to hear!



Date: 2010-08-19 08:40
From: Victor Lazzarini
Subject: Re: [Cs-dev] Some preliminary numerical comparisons of different implementations of concurrency in Csound

This is to me the most pressing problem, and it could be a real show
stopper. It's not realistic to use a low kr (or high ksmps) to get
speedups.

By the way, could you give us a quick explanation of how to use
Steven's/your parallelism in Csound? As far as I understand, it
involves modifying Csound code, but how? I want to try it.

Victor
On 19 Aug 2010, at 04:34, Michael Gogins wrote:

> It is clear from tests with higher values of -k that there is
> significant threading overhead, equivalent to hundreds of iterations
> of the inner performance loop for these orchestras. Unfortunately,
> this data implies that much of this overhead is more or less
> irreducible, since we cannot get rid of the barrier and sleep/awaken
> thread operations required to synchronize the "layers" of instruments
> with the same insno.



Date: 2010-08-19 11:57
From: jpff@cs.bath.ac.uk
Subject: Re: [Cs-dev] Some preliminary numerical comparisons of different implementations of concurrency in Csound

Good to see some improvements with multicore.

I am still stuck with trapped.csd; it never completes, with an error
message that I think means "I cannot find work but I know there is some".
It can happen at different times, so I suspect a missing mutex somewhere,
but so far I have not found it.

I have also used macros so the locks can be switched between spinlocks
and mutexes (the latter being easier for the tools).

But overall encouraging.


==John ff

