
[Csnd] OT: Parallella, a Supercomputer for Csound ?

Date: 2013-03-18 09:36
From: Anthony Kozar
Subject: [Csnd] OT: Parallella, a Supercomputer for Csound ?

Hello all,

Someone sent me this link yesterday for a project that aims to create an
open 16-core parallel computing platform for $99 and eventually a 64-core
version too:



The funding campaign appears to be over.  If you care to wade through the
layer of marketing hype, I think this _could_ be an interesting platform for
a parallel version of Csound.  The company, Adapteva, hopes that their
Epiphany chips will end up as accelerators in smartphones and tablets too.
If they become common in these devices, then a mobile Csound might run at
desktop-like speeds in the near future. ^_^

What do the rest of you think?

Anthony

P.S.  I was pleased to see that an alpha of Csound 6 (with better
parallelism, I understand) is already available.  Fantastic work!!

P.P.S.  Here are some more thoughts & analysis that I did on the Parallella
computer for those who are interested (1):

I am a little skeptical that the 16-core version of the chip will provide
better performance than a typical desktop CPU for applications like Csound.
The 64-core version would be more interesting of course.

First, the Epiphany cores only support 32-bit, single-precision floating
point calculations (64-bit doubles require software emulation). (2)
Operations take 1 cycle, so a 16-core chip can theoretically perform 16 FP
ops per cycle (3), or 16 * clock speed FLOPS.  A typical 4-core CPU today can
perform at least 4 FP ops per cycle without vectorization, or 16 per cycle
using SIMD instructions.  So with maximum optimization, a naïve analysis
would expect roughly the same performance at the same clock.  I've seen
several different clock speeds listed for the Epiphany chips, so I'm not
sure whether they will be 700 MHz, 800 MHz, or 1 GHz in the end, but
all of those are obviously much lower than a desktop CPU.  The 64-core chips
will give nearly a 4x improvement, but their clock speeds appear lower for
now, and even an 800 MHz chip may only give roughly the same performance as
a 4-core, 3.2 GHz desktop CPU using SIMD. (4)
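
The comparison above can be checked with a few lines of arithmetic.  This is
only a sketch of theoretical peak numbers built from the figures quoted in
this message (core counts, ops per cycle, clock speeds); the function name
`peak_gflops` is made up for illustration, and nothing here is a measurement.

```python
# Back-of-the-envelope peak throughput from the figures quoted in the text.
# These are theoretical peaks, not benchmarks.

def peak_gflops(cores, ops_per_cycle_per_core, clock_ghz):
    """Theoretical peak in GFLOPS: cores * ops per cycle per core * GHz."""
    return cores * ops_per_cycle_per_core * clock_ghz

# Epiphany at an assumed 800 MHz, one single-precision op per core per cycle.
epiphany_16 = peak_gflops(16, 1, 0.8)
epiphany_64 = peak_gflops(64, 1, 0.8)

# Same 16-core chip if a fused multiply-add is counted as 2 ops (see note 3).
epiphany_16_fma = peak_gflops(16, 2, 0.8)

# 4-core desktop CPU at 3.2 GHz: scalar (1 op/cycle/core) vs. 4 ops/cycle/core
# with SIMD, which is where "16 per cycle" for the whole chip comes from.
desktop_scalar = peak_gflops(4, 1, 3.2)
desktop_simd = peak_gflops(4, 4, 3.2)

for name, value in [
    ("Epiphany 16-core", epiphany_16),
    ("Epiphany 16-core (FMA = 2)", epiphany_16_fma),
    ("Epiphany 64-core", epiphany_64),
    ("Desktop 4-core scalar", desktop_scalar),
    ("Desktop 4-core SIMD", desktop_simd),
]:
    print(f"{name:<28}{value:6.1f} GFLOPS")
```

Under these assumptions the 64-core chip at 800 MHz and the 4-core, 3.2 GHz
desktop CPU with SIMD both come out at 51.2 GFLOPS, which is the "roughly
the same performance" claim.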

A bigger problem for Csound and other applications might be the apparent
lack of any division hardware on the Epiphany chips (I could not find either
FP or integer division operations in the instruction set!).  So, I am
guessing division will also have to be performed in software. :(
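
To give an idea of what "division in software" can look like, here is a
minimal sketch of one well-known approach: approximate the reciprocal with
Newton-Raphson iteration, then multiply.  I do not know how the Epiphany
toolchain actually handles division, so this is an assumption chosen purely
for illustration; the name `soft_divide` and the iteration count are
hypothetical.  Python floats are doubles, so the sketch uses 4 iterations; 3
would already be enough for single precision.

```python
import math

def soft_divide(n, d, iterations=4):
    """Approximate n / d; the core loop uses only multiply and subtract."""
    if d == 0.0:
        raise ZeroDivisionError("division by zero")
    # Split |d| into a mantissa m in [0.5, 1) and exponent e: |d| = m * 2**e.
    m, e = math.frexp(abs(d))
    # Linear first guess for 1/m on [0.5, 1); the constants are 48/17 and
    # 32/17, precomputed so that no division is needed at run time.
    x = 2.823529411764706 - 1.8823529411764706 * m
    # Each Newton-Raphson step roughly squares the relative error.
    for _ in range(iterations):
        x = x * (2.0 - m * x)
    # Undo the scaling: 1/|d| = (1/m) * 2**-e, then restore the sign.
    recip = math.ldexp(x, -e)
    if d < 0.0:
        recip = -recip
    return n * recip

print(soft_divide(1.0, 3.0))
print(soft_divide(-7.5, 2.5))
```

The point is the cost: every division becomes a handful of multiplies and
subtracts instead of a single instruction, which adds up in code that
divides per sample.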

Other surprises for me were that the cores are nearly-"pure" 32-bit
processors: they have 32-bit registers and a 32-bit address space but do
support 64-bit data loads/stores (presumably with a 64-bit data bus).  So
the maximum addressable memory is 4 GB.  The cores use a shared memory model
and each has (only) 32 KB of local memory, directly addressable by the other
cores.  That sounds small to me for efficiently parallelizing Csound at a
large-grained task size, but the on-chip network for data transfer is
supposedly quite efficient (compared to what?).
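
The memory figures above work out as follows.  This is just arithmetic on
the numbers in this message; the block size of 64 samples is a hypothetical
value picked for illustration, and the sketch optimistically assumes that
all of a core's local memory is free for sample data.

```python
KIB = 1024

local_bytes_per_core = 32 * KIB   # 32 KB of local memory per core
bytes_per_float32 = 4             # single-precision sample
address_bits = 32                 # 32-bit address space

# How many single-precision samples fit in one core's local memory.
floats_per_core = local_bytes_per_core // bytes_per_float32

# Hypothetical audio block size, chosen only for this example.
block_size = 64
blocks_per_core = floats_per_core // block_size

# Combined local memory across a 16-core chip, and the address-space limit.
total_local_16 = 16 * local_bytes_per_core
addressable_bytes = 2 ** address_bits

print(f"samples per core:      {floats_per_core}")
print(f"64-sample blocks/core: {blocks_per_core}")
print(f"16-core local total:   {total_local_16 // KIB} KiB")
print(f"addressable memory:    {addressable_bytes // KIB**3} GiB")
```

So each core holds at most 8192 single-precision samples, which has to
cover everything a task keeps locally.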

Notes:

(1)  Note that I am not an expert in computer architecture, so there may be
errors and misunderstandings in my analysis.
 
(2)  Not a huge problem for Csound but there would be a few issues.

(3)  Adapteva's website says 32 FLOP/cycle.  I think they must be counting
the fused multiply-add instruction as 2 ops because the FPU can only issue
and complete one instruction per cycle.  This may be the normal way to
measure but then the same logic would apply to desktop CPUs too.  In fact,
desktop CPUs may have a further advantage if they have multiple FPUs per
core.  I am not familiar enough with multi-core architectures to know how
common redundant execution units (particularly FPUs) within each core are.

(4)  Of course, my comparison assumes that the desktop CPU software will be
parallelized for both multiple cores and SIMD vector operations.  That could
be more work than making a parallel version for the Parallella depending on
how much extra programming is required to manage tasks on the Epiphany
cores.  However, the 64-core chip is expected to consume only 5 watts of
power, so if it gives comparable performance to a 4-core, 3.x GHz desktop
CPU in a potentially mobile device (and at a lower price point), that would
be interesting. :)



Date: 2013-03-19 00:34
From: Andres Cabrera
Subject: Re: [Csnd] OT: Parallella, a Supercomputer for Csound ?

Interesting! It does sound like a great match to multicore Csound. It
would be interesting to see how multicore scales across that larger
number of CPUs...

Cheers,
Andrés
