
[Cs-dev] Profiling multithreaded rendering in Csound 6

Date: 2012-07-04 16:33
From: Michael Gogins
Subject: [Cs-dev] Profiling multithreaded rendering in Csound 6
I've done some preliminary profiling of Csound 6 rendering using
valgrind's callgrind tool.

Results indicate that multi-threaded rendering in Csound 6 is burdened
by building the DAG for assignment to threads on every kperf call;
csp_dag_build_edges alone accounts for 73% of inclusive time (!). One
of the major consumers of time in these routines is simply memory
allocation. This is with the standard Trapped in Convert example.

With the current code, multithreaded rendering cannot be competitive
without very large ksmps (and probably long notes also). If the code
can be optimized, then multithreaded rendering in Csound 6 could lead
the field.

I welcome suggestions for optimizing the csp calls or even the entire
design. I may perform some experimental optimization of these calls
myself, or of the whole Csound memory management system.

Regards,
Mike



-- 
Michael Gogins
Irreducible Productions
http://www.michael-gogins.com
Michael dot Gogins at gmail dot com

------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and 
threat landscape has changed and how IT managers can respond. Discussions 
will include endpoint security, mobile security and the latest in malware 
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
_______________________________________________
Csound-devel mailing list
Csound-devel@lists.sourceforge.net

Date: 2012-07-05 17:13
From: Michael Gogins
Subject: Re: [Cs-dev] Profiling multithreaded rendering in Csound 6
More on this. In multithreaded performances, the calls to build the
multithreaded graph occur in each kperiod.

A great deal of time could be saved if these calls occurred only as
required. I assume they are required whenever the graph specified by
the orchestra changes, which I believe occurs only when a new instance
of an instrument or UDO is allocated and activated as the result of an
i statement or dynamic schedule call, or when an existing instance of
an instrument or UDO is deactivated or reactivated.

Can someone confirm my understanding?

If that is the case, then a flag in the Csound instance can be set to
false at the end of every kperiod, then set to true whenever an
instrument or UDO instance is activated or deactivated. Then the
multithread graph calls will do any work only if this flag is true.

I believe this would lead to a big speedup.

Comments are appreciated. I have a feeling I'm missing something here.

Regards,
Mike





Date: 2012-07-06 09:05
From: Steven Yi
Subject: Re: [Cs-dev] Profiling multithreaded rendering in Csound 6
I haven't looked at the multithreaded code in depth myself, but I
thought that it was doing an analysis on every k-pass.  I'm curious
about having open-ended analysis strategies, so that we could do some
research work to figure out if analysis-on-activation, every n-kpass,
etc. might lead to an optimal solution, or maybe lead to a dynamic
solution that optimizes to a specific strategy per-instrument.
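
As a rough sketch of what open-ended strategies could look like, each
strategy could be a function that decides per k-pass whether to run
the analysis. All names below (AnalysisState, AnalysisStrategy,
count_analyses) are made up for illustration and are not part of
Csound; the driver only counts how often each strategy would fire.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-performance state visible to a strategy. */
typedef struct {
    long kcount;   /* index of the current k-pass, starting at 0 */
    bool activity; /* an activation or deactivation happened since the last analysis */
} AnalysisState;

/* A strategy answers: should the analysis run on this k-pass? */
typedef bool (*AnalysisStrategy)(const AnalysisState *s, void *param);

bool analyze_every_kpass(const AnalysisState *s, void *param)
{
    (void)s; (void)param;
    return true;
}

bool analyze_on_activation(const AnalysisState *s, void *param)
{
    (void)param;
    return s->activity;
}

/* param points to a long n > 0; analysis runs on every n-th k-pass. */
bool analyze_every_n(const AnalysisState *s, void *param)
{
    long n = *(const long *)param;
    assert(n > 0);
    return s->kcount % n == 0;
}

/* Simulates `kperiods` k-passes with a single activation at k-pass
   `activate_at`, and returns how many times the strategy ran the analysis. */
long count_analyses(AnalysisStrategy strategy, void *param,
                    long kperiods, long activate_at)
{
    AnalysisState s = { 0, false };
    long count = 0;
    for (long k = 0; k < kperiods; k++) {
        s.kcount = k;
        if (k == activate_at) {
            s.activity = true;
        }
        if (strategy(&s, param)) {
            count++;
            s.activity = false; /* the analysis has caught up with the change */
        }
    }
    return count;
}
```

Swapping the function pointer would let us compare strategies on the
same score without touching the scheduler itself.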

One thing that would be nice is to start building up some tools to
help with the multithreaded work. I am thinking in particular that we
already have the possibility to print out the TREE* structure after
parsing as well as after optimization, just before compiling.  Having
some visualization tools to show the tree and its transformations
would be nice, and might help us to see things a bit
better.  Also, extracting the tree transformation code for multicore
out from the bison parser and into a separate transformation pass
would probably help to clear up the codebase and make it easier for
all of us to get more involved with the multithreaded code.


Date: 2012-07-07 21:24
From: Michael Gogins
Subject: Re: [Cs-dev] Profiling multithreaded rendering in Csound 6
Well, the question is whether the actual instrument and opcode lists
that are created from the abstract syntax tree change for any reason
except inserts, activations, or deactivations. If they change only for
those reasons, then the DAG manipulation code needs to be fired only
when an insert, activation, or deactivation occurs, and not on every
kperiod.

Regards,
Mike



