
[Cs-dev] ParCS update

Date: 2010-07-14 16:35
From: john ffitch
Subject: [Cs-dev] ParCS update
I have been struggling to check it, and it seems that if I add serious
compute to an instrument (I have written a "waste" opcode) then I get
a serious speedup.  This suggests that the code is correct (or mainly
so) but that the calculation for grain size is all wrong; in fact it
has not really been written yet.
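
For reference, a minimal sketch of what such a compute-burning opcode
might look like; the real "waste" opcode may well differ, and the
OENTRY registration boilerplate is omitted, so take the names as
illustrative only:

    #include "csdl.h"          /* Csound plugin API header */

    typedef struct {
        OPDS  h;               /* standard opcode header */
        MYFLT *kcount;         /* k-rate input: dummy iterations per k-cycle */
    } WASTE;

    static int waste_perf(CSOUND *csound, WASTE *p)
    {
        volatile double x = 0.0;     /* volatile keeps the loop from being optimised away */
        long n, count = (long) *p->kcount;
        (void) csound;               /* unused */
        for (n = 0; n < count; n++)
          x += (double) n * 1.0e-9;  /* pointless arithmetic, touches no shared state */
        return OK;
    }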

Will check it in when I have done a little more tidying of the code
==John ffitch


Date: 2010-07-14 17:02
From: Michael Gogins
Subject: Re: [Cs-dev] ParCS update
How big is the "waste" instrument?

The granularity is affected both by pure instruction overhead and by
other overheads involved in synchronizing and scheduling (threading
overhead, if you will). With pure instruction overhead I would guess
that a few thousand machine instructions will justify a thread context
switch. I would also guess that's a bit bigger than a small instrument
instance, i.e., the instruments need to be "smallish or larger" not
"small". I'm confident of this because Steven Yi's code definitely got
significant speedups with mid size instruments.
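
Just to put rough (assumed, not measured) numbers on that: if handing a
task to another thread costs on the order of a few thousand cycles, and
a simple oscillator costs perhaps ten cycles per sample, then at, say,
ksmps = 64 one k-cycle of that oscillator is only around 640 cycles of
work, well below break-even; you would want several such opcodes, or
one instance of a heavier instrument, per dispatched task before the
parallelism pays for itself.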

Threading overhead might well throw that out the window. I would think
that the design of synchronizing opcodes that access global data would
introduce additional threading overhead and require many more machine
instructions to balance out. If so then, for a given layer of
concurrency (e.g. a set of instances of the same instrument template
without any potential data races), running some of those instances in
series should balance it out. If there are too many potential data
races, i.e. too much synchronization in the opcode graphs, then it may
never really work out.

Using spinlocks for opcode synchronization would almost certainly
help; does ParCS do that? I don't recall...

Does this make any sense?

Regards
Mike

-- 
Michael Gogins
Irreducible Productions
http://www.michael-gogins.com
Michael dot Gogins at gmail dot com


Date: 2010-07-15 10:26
From: john ffitch
Subject: Re: [Cs-dev] ParCS update
>>>>> "Michael" == Michael Gogins  writes:

 Michael> How big is the "waste" instrument?

At a loop of 30,000 iterations it did not do much, but at 100,000,000
it made a simple oscillator take a very long time.

This is for a very short sound:
                   Elapsed      CPU
1 processor        13.837       13.83
2 processors        6.915       13.82
3 processors        6.92        13.85
4 processors        3.618       13.85

There were four instruments playing (two oscillators and two plucks),
with no common variables except spout, which is protected by a spinlock.
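
For the record, those figures imply speedups of about 13.837/6.915 = 2.0
on two processors and 13.837/3.618 = 3.8 on four.  The flat result at
three processors is roughly what you would expect with four
similar-cost instruments: one core still has to run two of them, so the
elapsed time stays near half the serial figure.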

 Michael> The granularity is affected both by pure instruction overhead and by
 Michael> other overheads involved in synchronizing and scheduling (threading
 Michael> overhead, if you will). With pure instruction overhead I would guess
 Michael> that a few thousand machine instructions will justify a thread context
 Michael> switch. I would also guess that's a bit bigger than a small instrument
 Michael> instance, i.e., the instruments need to be "smallish or larger" not
 Michael> "small". I'm confident of this because Steven Yi's code definitely got
 Michael> significant speedups with mid size instruments.

There is code to count the "cost" of an instrument and balance one
instr or a sequence against others.
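
Something along these lines, although the weights and names here are
made up for illustration and are not the actual ParCS code:

    #include <string.h>

    /* Illustrative only: per-opcode weights in arbitrary units. */
    typedef struct { const char *opname; int weight; } OPCOST;

    static const OPCOST cost_table[] = {
        { "oscil", 40 }, { "pluck", 60 }, { "out", 10 }, { NULL, 0 }
    };

    /* Sum the weights of an instrument's opcodes. */
    static int instr_cost(const char **opcodes, int n)
    {
        int i, j, total = 0;
        for (i = 0; i < n; i++)
          for (j = 0; cost_table[j].opname != NULL; j++)
            if (strcmp(opcodes[i], cost_table[j].opname) == 0)
              total += cost_table[j].weight;
        return total;
    }

    /* Keep cheap instruments chained in one task; only split work off */
    /* to another thread once the accumulated cost crosses a threshold. */
    #define DISPATCH_THRESHOLD 1000
    static int worth_own_task(int accumulated_cost)
    {
        return accumulated_cost >= DISPATCH_THRESHOLD;
    }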

 Michael> Threading overhead might well throw that out the window. I would think
 Michael> that the design of synchronizing opcodes that access global data would
 Michael> introduce additional threading overhead and require many more machine
 Michael> instructions to balance out. If so then, for a given layer of
 Michael> concurrency (e.g. a set of instances of the same instrument template
 Michael> without any potential data races), running some of those instances in
 Michael> series should balance it out. If there are too many potential data
 Michael> races, i.e. too much synchronization in the opcode graphs, then it may
 Michael> never really work out.

That is what needs to be reviewed.  At present I think some of the DAG
dispatch is over-complex and needs a close look.

 Michael> Using spinlocks for opcode synchronization would almost certainly
 Michael> help; does ParCS do that? I don't recall...

Most of it uses spinlocks, plus two barriers.  There are a few mutex
sections and I am not sure why -- one protects a single read of a
variable, which is a classic spinlock case.
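
For that single-read case the spinlock needed is tiny; a sketch using
GCC-style atomic builtins (illustrative only, not the actual ParCS
code):

    static volatile int var_lock = 0;   /* one lock word per protected variable */

    static inline void spin_lock(volatile int *l)
    {
        while (__sync_lock_test_and_set(l, 1))
          ;                              /* busy-wait: the critical section is a single read */
    }

    static inline void spin_unlock(volatile int *l)
    {
        __sync_lock_release(l);
    }

    /* usage:
         spin_lock(&var_lock);
         value = shared_variable;
         spin_unlock(&var_lock);
    */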

 Michael> Does this make any sense?

Sure.  I need to finish my opcode cost counter to augment the parser,
and look critically at the dag_consume and dag_update code.

But clearing some of the bug list also needs to be done.

 Michael> Regards
 Michael> Mike

==John ffitch


Date: 2010-07-17 16:42
From: Michael Gogins
Subject: Re: [Cs-dev] ParCS update
Thanks for your informative response. I have re-read parts of Wilson's
dissertation, and I have a few more questions and comments. Please
give them some thought.

Also, please let me know if there is any part of this work that you
would like me to take on, either in the ParCS code itself, or in
fixing bugs to free up time for yourself.

Once again, I think the dissertation is an impressive piece of work
and covers many cases not obvious when beginning to think about the
problem.

Question
--------

The dissertation documents significant speedups in pieces of interest
to me (heavy, mostly parallel instruments). Are we sure that the ParCS
code reflects all the speedups discussed in the dissertation? Several
alternative designs are discussed. Which one does the ParCS code
actually implement, or does the code follow a different design
altogether?

Comment
-------

Off the top of my head, I don't see any place to prefer mutexes to
spinlocks except while building the DAG.

Comment
-------

It is critical to find out if the ParCS code can be made fast enough.
This requires optimizing and debugging the code. If it is still not
fast enough, we need to consider whether another design might work, or
whether it is just not going to work well enough.

Comment
-------

What is "fast enough?" In my view it is fast enough if (a) complex
pieces run significantly faster with 2 or more cores, and (b)
performance shows some kind of scaling up with increasing numbers of
cores. I don't think it's fast enough if adding cores does no good or
even slows things down again.
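
One rough yardstick for (b), assuming the textbook model in which a
fraction p of the work parallelises perfectly: speedup(N) = 1 / ((1 - p) + p/N).
With p = 0.9 that gives about 1.8x on two cores and 3.1x on four; if
measured speedups fall far short of that, or get worse as N grows,
threading overhead is dominating.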

Comment
-------

What would be an alternative design? It seems clear that more complex
designs would incur even more threading overhead, so we should think
about simpler designs. But they would have to throw out features of
ParCS.

The simplest design I can think of would start with Steven Yi's
approach. Currently it ignores global data and simply runs multiple
instances of a given instrument template in parallel, with a barrier at
the end and a spinlock (or something like it) protecting the out opcode
buffers. Period. To this could be added atomic operations or spinlocks
(whichever is more efficient) on all opcode buffers. I think these
would incur little overhead (though a lot of detail work to implement).
They would, however, throw out the ParCS feature of protecting the
sequence of first writing to a global variable and then reading it back
without interference from other threads.
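
To make the shape concrete, a rough pthreads sketch of that design
(hypothetical names throughout, including perform_instance; this is
neither Steven Yi's actual code nor ParCS):

    #include <pthread.h>
    #include "csoundCore.h"            /* for INSDS */

    /* One "layer" = the set of instrument instances that may run concurrently. */
    typedef struct {
        INSDS           **instances;   /* instrument instances in this layer    */
        int               count;
        volatile int      next;        /* shared work index, claimed atomically */
        volatile int      out_lock;    /* spinlock protecting the spout buffer  */
        pthread_barrier_t done;        /* all workers meet here at layer end    */
    } LAYER;

    static void *worker(void *arg)
    {
        LAYER *layer = (LAYER *) arg;
        for (;;) {
            int i = __sync_fetch_and_add(&layer->next, 1); /* claim one instance */
            if (i >= layer->count)
              break;
            perform_instance(layer->instances[i]);  /* hypothetical perform call;   */
                                                    /* mixing into spout would take */
                                                    /* layer->out_lock briefly      */
        }
        pthread_barrier_wait(&layer->done);         /* layer finished */
        return NULL;
    }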

In other words, the alternative would trade DAG analysis with
protection of global data only at DAG edges for no DAG analysis plus
protection of all global data, on the theory that instrument order
would leave few contended global data operations, and that uncontended
operations would have very low cost.

The simpler design would throw out the DAG analysis and just use
instrument number order to determine the concurrent layers. This would
discard some of the ParCS semantic analysis that is aimed at finding
potential speedups. But I'm not sure those potential speedups can be
made actual, because the DAG code introduces its own threading
overhead.
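
For example, if instr 1 writes a global k-rate variable that instr 2
reads, layering by instrument number means every instance of instr 1
has finished before any instance of instr 2 starts, exactly as in
serial Csound, so the read needs no per-variable analysis; only
instances within the same layer ever run concurrently.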

Question
--------

So, is the DAG analysis actually required to compute the correct
sound, or is it only aimed at finding more speedups in comparison with
just relying on instrument number?

Regards,
Mike

-- 
Michael Gogins
Irreducible Productions
http://www.michael-gogins.com
Michael dot Gogins at gmail dot com
