Intel details future Larrabee graphics chip


  #21  
Old 08-06-2008, 11:38 AM
Default Re: Intel details future Larrabee graphics chip


In article ,
Terje Mathisen writes:
|>
|> > Yes, it's important, but let's see if it is possible to program the
|> > thing and get reliable results even with integers! And that is so
|> > far unproven. Remember the Itanic?
|>
|> I'm very confident that the chip will actually work, and give useful,
|> repeatable results, but I don't expect things like fast (or even any?)
|> denormal handling except flush to zero.

That's not what I mean. Yes, I agree that the chip should work
according to specification. But, if it were so foul to program
that only a few dozen people, worldwide, could write code for it
that was reliable, efficient AND useful, then what?

The record of almost all seriously parallel features so far is
that they are straightforward to use for simple, vectorisable codes
like the BLAS and similar operations, or embarrassingly parallel
applications, and utterly evil for almost anything else.

There isn't a problem for embarrassingly parallel codes - or is
there? Well, yes. The killer is that almost all computationally
intensive codes are memory intensive, too, and the memory conflict
kills you. There ARE exceptions, yes - cryptography, some work in
number theory, QCD and (to some extent) CODECs being the main ones.


Regards,
Nick Maclaren.
  #22  
Old 08-06-2008, 11:28 PM
Default Re: Intel details future Larrabee graphics chip

On Aug 5, 5:26 am, Dirk Bruere at NeoPax
wrote:
> Skybuck Flying wrote:
> > As the number of cores goes up the watt requirements goes up too ?
> >
> > Will we need a zillion watts of power soon ?
> >
> > Bye,
> >   Skybuck.
>
> Since the ATI Radeon™ HD 4800 series has 800 cores you work it out.
>
> --
> Dirk



The 800 "cores" in the ATI RV770 (Radeon 4800 series) are simple stream
processors; they are not comparable to the 16, 24, 32 or 48 cores
that will be in Larrabee, just as they are not comparable to the
240 "cores" in the Nvidia GeForce GTX 280. I'm not saying you didn't
realize that; this is just for those who might not have.
  #23  
Old 08-06-2008, 11:57 PM
Default Re: Intel details future Larrabee graphics chip

On Tue, 5 Aug 2008 12:54:14 -0700, "Chris M. Thomasson"
wrote:

>"John Larkin" wrote in message
>news:rtrg9458spr43ss941mq9p040b2lp6hbgg-at-4ax.com...
>> On Tue, 5 Aug 2008 13:30:52 +0200, "Skybuck Flying"
>> wrote:
>>
>>>As the number of cores goes up the watt requirements goes up too ?

>>
>> Not necessarily, if the technology progresses and the clock rates are
>> kept reasonable. And one can always throttle down the CPUs that aren't
>> busy.
>>
>>>
>>>Will we need a zillion watts of power soon ?
>>>
>>>Bye,
>>> Skybuck.
>>>

>>
>> I saw suggestions of something like 60 cores, 240 threads in the
>> reasonable future.

>
>I can see it now... A mega-core GPU chip that can dedicate 1 core per-pixel.
>
>lol.
>
>
>
>
>> This has got to affect OS design.

>
>They need to completely rethink their multi-threaded synchronization
>algorithms. I have a feeling that efficient distributed non-blocking
>algorithms, which are comfortable running under a very weak cache coherency
>model will be all the rage. Getting rid of atomic RMW or StoreLoad style
>memory barriers is the first step.


Run one process per CPU. Run the OS kernel, and nothing else, on one
CPU. Never context switch. Never swap. Never crash.

John

  #24  
Old 08-07-2008, 04:47 AM
Default Re: Intel details future Larrabee graphics chip


In article ,
John Larkin writes:
|> On Tue, 5 Aug 2008 12:54:14 -0700, "Chris M. Thomasson"
|> wrote:
|> >"John Larkin" wrote in message
|> >news:rtrg9458spr43ss941mq9p040b2lp6hbgg-at-4ax.com...
|> >
|> >> This has got to affect OS design.
|> >
|> >They need to completely rethink their multi-threaded synchronization
|> >algorithms. I have a feeling that efficient distributed non-blocking
|> >algorithms, which are comfortable running under a very weak cache coherency
|> >model will be all the rage. Getting rid of atomic RMW or StoreLoad style
|> >memory barriers is the first step.
|>
|> Run one process per CPU. Run the OS kernel, and nothing else, on one
|> CPU. Never context switch. Never swap. Never crash.

Been there - done that :-)

That is precisely how the early SMP systems worked, and it works
for dinky little SMP systems of 4-8 cores. But the kernel becomes
the bottleneck for many workloads even on those, and it doesn't
scale to large numbers of cores. So you HAVE to multi-thread the
kernel.
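
One way to put rough numbers on this bottleneck is Amdahl's law, treating the time spent in a single-CPU kernel as the serial fraction of the workload. That framing is an assumption of this sketch, not something stated in the post, and the function name `speedup` is made up for the example:

```c
#include <assert.h>

/* Amdahl-style estimate: speedup on n cores when a fraction `serial`
   of the work (here, time spent in a single-threaded kernel) cannot
   be spread across cores. */
static double speedup(double serial, unsigned n)
{
    assert(serial >= 0.0 && serial <= 1.0 && n > 0);
    return 1.0 / (serial + (1.0 - serial) / (double)n);
}
```

With a serial fraction of 0.05 the estimate is roughly 5.9 on 8 cores and can never reach 20, however many cores are added.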

SGI were (are?) the leaders, but all of HP, IBM and Sun have been
along the same path. Modern Linux is multi-threaded.


Regards,
Nick Maclaren.
  #25  
Old 08-07-2008, 05:00 AM
Default Re: Intel details future Larrabee graphics chip

Nick Maclaren wrote:
> In article ,
> Terje Mathisen writes:
> |>
> |> > Yes, it's important, but let's see if it is possible to program the
> |> > thing and get reliable results even with integers! And that is so
> |> > far unproven. Remember the Itanic?
> |>
> |> I'm very confident that the chip will actually work, and give useful,
> |> repeatable results, but I don't expect things like fast (or even any?)
> |> denormal handling except flush to zero.
>
> That's not what I mean. Yes, I agree that the chip should work
> according to specification. But, if it were so foul to program
> that only a few dozen people, worldwide, could write code for it
> that was reliable, efficient AND useful, then what?


I do believe asm programmers/thinkers will stay employed for the
foreseeable future, yes. :-)

More seriously, the gather/scatter hw seems like a very good match for
more advanced codes that use sparse matrix techniques, with indirect
addressing etc.:

The G/S unit should be able to take a group of 16 aligned pointers to
data items, then lookup all very quickly and return the set of actual
data to be worked on. If the data blocks have been allocated in
sequential memory, then the "load all items from a given cache line in a
single cycle" would make such access patterns quite efficient.
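
As a rough illustration of the access pattern described above, here is a scalar C sketch of what a 16-wide gather does, followed by a sparse row dot product that would use it. The names (`gather16`, `csr_row_dot`, `LANES`) and the CSR layout are assumptions chosen for this example, not anything taken from Larrabee documentation; real hardware would issue the gather as a vector operation rather than a loop.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 16  /* assumed vector width: 16 single-precision lanes */

/* Scalar emulation of a gather: out[i] = base[idx[i]] for each lane. */
static void gather16(float out[LANES], const float *base,
                     const int idx[LANES])
{
    for (int i = 0; i < LANES; ++i)
        out[i] = base[idx[i]];
}

/* Dot product of one CSR row with a dense vector x.
   vals/cols hold the row's nnz nonzero values and column indices. */
static float csr_row_dot(const float *vals, const int *cols, size_t nnz,
                         const float *x)
{
    float sum = 0.0f;
    size_t k = 0;

    assert(vals && cols && x);

    /* Full groups of 16 indirect loads go through the gather. */
    for (; k + LANES <= nnz; k += LANES) {
        float xs[LANES];
        gather16(xs, x, &cols[k]);
        for (int i = 0; i < LANES; ++i)
            sum += vals[k + i] * xs[i];
    }
    /* Remainder handled one element at a time. */
    for (; k < nnz; ++k)
        sum += vals[k] * x[cols[k]];
    return sum;
}
```

If the column indices in a group fall in the same cache line, all 16 loads hit one line, which is the favourable case the paragraph above describes.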

>
> The record of almost all seriously parallel features so far is
> that they are straightforward to use for simple, vectorisable codes
> like the BLAS and similar operations, or embarrassingly parallel
> applications, and utterly evil for almost anything else.


You're (unfortunately) almost certainly right. :-(

I believe most problems _can_ be mapped onto LRB style architectures,
but not without significant work by good programmers, i.e. nothing at
all like the "just recompile with our magic compiler" that seems to be
the holy grail.

Re. total memory bandwidth:

I agree that LRB will be no good at all for codes that work best without
caches, i.e. where blocking is impossible. The big question is whether this
is an absolute requirement of the underlying problem, or if there is
some other way to solve it, even at the cost of doing (much) more work?

This is of course the area where nearly everyone has been working for
the last 2-3 decades, as the memory wall has rushed closer and closer.
I.e. this problem must be solved no matter which architecture you work with!

Terje
--
-
"almost all programming can be viewed as an exercise in caching"
  #26  
Old 08-07-2008, 05:21 AM
Default Re: Intel details future Larrabee graphics chip


In article ,
Terje Mathisen writes:
|>
|> > That's not what I mean. Yes, I agree that the chip should work
|> > according to specification. But, if it were so foul to program
|> > that only a few dozen people, worldwide, could write code for it
|> > that was reliable, efficient AND useful, then what?
|>
|> I do believe asm programmers/thinkers will stay employed for the
|> foreseeable future, yes. :-)

You know that's not what I meant :-)

More seriously, the history of the past 30 years has been to reduce
the requirements for such people by dropping standards. Will that
deliver something that can be claimed to work, even by salesdroids?
The jury hasn't even retired yet!

Several parallel systems of the past failed because their users
couldn't handle them, not because they didn't work. Are we about to
see a change? I just don't know.

|> More seriously, the gather/scatter hw seems like a very good match for
|> more advanced codes that use sparse matrix techniques, with indirect
|> addressing etc.:

[ Other relevant points snipped ]

Actually, I disagree. I think that it's a gimmick. Few people are
interested in sparsity within a cache line (or even page). What is
needed is something too radical for Intel, which is to separate off
the MMU aspects of the ISA and allow much better designed control
of cache preloading. And that DOESN'T mean adding yet another hack,
but a step back and serious reconsideration.

Sun's scout thread approach is along the right lines, though I doubt
that it is a very good one.

For example, consider an architecture where there was a sparse 'touch'
instruction, with some kind of prioritisation. Combine that with an
LRU algorithm that used different rates for touched pages that had not
yet been accessed and ones that had. I can see how to generate code
for that which would have the potential of reducing latency considerably.
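
The closest thing available in software today is an ordinary prefetch hint, so the following C sketch only approximates the idea: it "touches" the data an indirect access will need a few iterations ahead. It does not model the prioritisation or the modified LRU policy proposed above. `__builtin_prefetch` is a GCC/Clang builtin; `indirect_sum` and `PREFETCH_AHEAD` are names invented for this example, and the lookahead distance is a guess that would need tuning.

```c
#include <assert.h>
#include <stddef.h>

#define PREFETCH_AHEAD 8  /* assumed lookahead distance, needs tuning */

/* Sum table[idx[i]] for i in [0, n), touching future entries early. */
static double indirect_sum(const double *table, const size_t *idx, size_t n)
{
    double sum = 0.0;

    assert(table && idx);

    for (size_t i = 0; i < n; ++i) {
        /* "Touch" the line needed PREFETCH_AHEAD iterations from now.
           Arguments: read access (0), low temporal locality (1). */
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&table[idx[i + PREFETCH_AHEAD]], 0, 1);
        sum += table[idx[i]];
    }
    return sum;
}
```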


Regards,
Nick Maclaren.



  #27  
Old 08-07-2008, 07:44 AM
Default Re: Intel details future Larrabee graphics chip

Hello all,

"Wilco Dijkstra" wrote in message
news:jzVlk.63795$dz3.20374-at-newsfe20.ams2...
> "Chris M. Thomasson" wrote in message
> news:kXPlk.7164$QX3.5075-at-newsfe02.iad...
>> Are they saying that programming this chip will be easier than
>> programming a GPU because it honors the well established x86 arch?
>
> That's rubbish indeed. The cache coherency seems to be the only advantage
> as other GPUs also support C. However the claimed x86 "compatibility"
> isn't. If you use C the ISA doesn't matter much, and if you write
> assembler then there is no compatibility as the new SIMD instructions
> don't exist on any current x86's and Larrabee doesn't appear to support
> SSE instructions either...
>
> It would have made far more sense to use a simpler and more streamlined
> ISA which would give a significant codesize, area and power saving, if not
> a performance boost. But Intel is always keen to push their inefficient
> ISA where it doesn't belong...


"Never underestimate the power of x86!" - Yogurt

- Forward compatibility: DX9 (IIRC?) used the assembly language for
NVIDIA's GPU. I remember at Intel someone having to write a dynamic
translator. Of course, NVIDIA was in the same boat as everyone else for
their next generation (which wanted a whole new instruction set). x86 has
been a stable platform for over twenty years.

- It's not an advantage for the customer, it's an advantage for the
designers! Think verification - all your unit tests continue to work.
Tools infrastructure (simulators, etc.)

The overhead is somewhat painful for tiny cores. But there are a lot of
tricks you can play... And when you are a process generation ahead of
everyone else...

Ned

Not speaking for Intel


  #28  
Old 08-07-2008, 11:08 AM
Default Re: Intel details future Larrabee graphics chip

On 7 Aug 2008 07:47:13 GMT, nmm1-at-cus.cam.ac.uk (Nick Maclaren) wrote:

>
>In article ,
>John Larkin writes:
>|> On Tue, 5 Aug 2008 12:54:14 -0700, "Chris M. Thomasson"
>|> wrote:
>|> >"John Larkin" wrote in message
>|> >news:rtrg9458spr43ss941mq9p040b2lp6hbgg-at-4ax.com...
>|> >
>|> >> This has got to affect OS design.
>|> >
>|> >They need to completely rethink their multi-threaded synchronization
>|> >algorithms. I have a feeling that efficient distributed non-blocking
>|> >algorithms, which are comfortable running under a very weak cache coherency
>|> >model will be all the rage. Getting rid of atomic RMW or StoreLoad style
>|> >memory barriers is the first step.
>|>
>|> Run one process per CPU. Run the OS kernel, and nothing else, on one
>|> CPU. Never context switch. Never swap. Never crash.
>
>Been there - done that :-)
>
>That is precisely how the early SMP systems worked, and it works
>for dinky little SMP systems of 4-8 cores. But the kernel becomes
>the bottleneck for many workloads even on those, and it doesn't
>scale to large numbers of cores. So you HAVE to multi-thread the
>kernel.


Why? All it has to do is grant run permissions and look at the big
picture. It certainly wouldn't do I/O or networking or file
management. If memory allocation becomes a burden, it can set up four
(or fourteen) memory-allocation cores and let them do the crunching.
Why multi-thread *anything* when hundreds or thousands of CPUs are
available?

Using multicore properly will require undoing about 60 years of
thinking, 60 years of believing that CPUs are expensive.

John


  #29  
Old 08-07-2008, 11:25 AM
Default Re: Intel details future Larrabee graphics chip


In article ,
John Larkin writes:
|>
|> >|> Run one process per CPU. Run the OS kernel, and nothing else, on one
|> >|> CPU. Never context switch. Never swap. Never crash.
|> >
|> >Been there - done that :-)
|> >
|> >That is precisely how the early SMP systems worked, and it works
|> >for dinky little SMP systems of 4-8 cores. But the kernel becomes
|> >the bottleneck for many workloads even on those, and it doesn't
|> >scale to large numbers of cores. So you HAVE to multi-thread the
|> >kernel.
|>
|> Why? All it has to do is grant run permissions and look at the big
|> picture. It certainly wouldn't do I/O or networking or file
|> management. If memory allocation becomes a burden, it can set up four
|> (or fourteen) memory-allocation cores and let them do the crunching.
|> Why multi-thread *anything* when hundreds or thousands of CPUs are
|> available?

I don't have time to describe 40 years of experience to you, and
it is better written up in books, anyway. Microkernels of the sort
you mention were trendy a decade or two back (look up Mach), but
introduced too many bottlenecks.

In theory, the kernel doesn't have to do I/O or networking, but
have you ever used a system where they were outside it? I have.

The reason that exporting them to multiple CPUs doesn't solve the
scalability problems is that the interaction rate goes up more
than linearly with the number of CPUs. And the same problem
applies to memory management, if you are going to allow shared
memory - or even virtual shared memory, as in PGAS languages.

And so it goes. TANSTAAFL.

|> Using multicore properly will require undoing about 60 years of
|> thinking, 60 years of believing that CPUs are expensive.

Now, THAT is true.


Regards,
Nick Maclaren.
  #30  
Old 08-07-2008, 11:42 AM
Default Re: Intel details future Larrabee graphics chip

"John Larkin" wrote in message
news:d10m94d7etb6sfcem3hmdl3hk8qnels3kg-at-4ax.com...
> On 7 Aug 2008 07:47:13 GMT, nmm1-at-cus.cam.ac.uk (Nick Maclaren) wrote:
>
>>
>>In article ,
>>John Larkin writes:
>>|> On Tue, 5 Aug 2008 12:54:14 -0700, "Chris M. Thomasson"
>>|> wrote:
>>|> >"John Larkin" wrote in message
>>|> >news:rtrg9458spr43ss941mq9p040b2lp6hbgg-at-4ax.com...
>>|> >
>>|> >> This has got to affect OS design.
>>|> >
>>|> >They need to completely rethink their multi-threaded synchronization
>>|> >algorithms. I have a feeling that efficient distributed non-blocking
>>|> >algorithms, which are comfortable running under a very weak cache
>>|> >coherency model will be all the rage. Getting rid of atomic RMW or
>>|> >StoreLoad style memory barriers is the first step.
>>|>
>>|> Run one process per CPU. Run the OS kernel, and nothing else, on one
>>|> CPU. Never context switch. Never swap. Never crash.
>>
>>Been there - done that :-)
>>
>>That is precisely how the early SMP systems worked, and it works
>>for dinky little SMP systems of 4-8 cores. But the kernel becomes
>>the bottleneck for many workloads even on those, and it doesn't
>>scale to large numbers of cores. So you HAVE to multi-thread the
>>kernel.

>
> Why? All it has to do is grant run permissions and look at the big
> picture. It certainly wouldn't do I/O or networking or file
> management. If memory allocation becomes a burden, it can set up four
> (or fourteen) memory-allocation cores and let them do the crunching.



FWIW, I have a memory allocation algorithm which can scale because it's based
on per-thread/core/node heaps:

http://groups.google.com/group/comp....c40d42a04ee855

AFAICT, there is absolutely no need for memory-allocation cores. Each thread
can have a private heap such that local allocations do not need any
synchronization. Also, thread local deallocations of memory do not need any
sync. Local meaning that Thread A allocates memory M which is subsequently
freed by Thread A. When a thread's memory pool is exhausted, it then tries to
allocate from the core local heap. If that fails, then it asks the system,
and perhaps virtual memory comes into play.


A scalable high-level memory allocation algorithm for a super-computer
could look something like:
_____________________________________________________________
#include <stddef.h>  /* size_t, NULL */

/* The *_Try_Allocate helpers and Report_Allocation_Failure are assumed
   to be defined elsewhere; each allocator returns NULL on failure. */
void* malloc(size_t sz) {
  void* mem;

  /* level 1 - thread local */
  if (! (mem = Per_Thread_Try_Allocate(sz))) {

    /* level 2 - core local */
    if (! (mem = Per_Core_Try_Allocate(sz))) {

      /* level 3 - physical chip local */
      if (! (mem = Per_Chip_Try_Allocate(sz))) {

        /* level 4 - node local */
        if (! (mem = Per_Node_Try_Allocate(sz))) {

          /* level 5 - system-wide */
          if (! (mem = System_Try_Allocate(sz))) {

            /* level 6 - failure */
            Report_Allocation_Failure(sz);
            return NULL;
          }
        }
      }
    }
  }

  return mem;
}
_____________________________________________________________



Level 1 does not need any atomic RMW OR membars at all.

Level 2 does not need membars, but needs atomic RMW.

Level 3 would need membars and atomic RMW.

Level 4 is same as level 3

Level 5 is worst case scenario, may need MPI...

Level 6 is total memory exhaustion! Ouch...



All local frees have same overhead while all remote frees need atomic RMW
and possibly membars.


This algorithm can scale to very large numbers of cores, chips and nodes.
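
To make the difference between levels 1 and 2 concrete, here is a minimal C11 sketch of what the two innermost helpers might look like. The function names `Per_Thread_Try_Allocate` and `Per_Core_Try_Allocate` come from the outline above; everything else (fixed-size bump pools, the pool sizes, `round_up`, and a single global `core_pool` standing in for one core's pool) is an assumption made for illustration, and freeing is left out entirely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define THREAD_POOL_BYTES 4096   /* assumed pool sizes */
#define CORE_POOL_BYTES   65536
#define ALIGNMENT         16

static_assert((ALIGNMENT & (ALIGNMENT - 1)) == 0,
              "ALIGNMENT must be a power of two");

/* Round sz up to a multiple of ALIGNMENT (wraps to 0 for huge sz). */
static size_t round_up(size_t sz)
{
    return (sz + ALIGNMENT - 1) & ~(size_t)(ALIGNMENT - 1);
}

/* Level 1: each thread owns its pool, so plain loads and stores suffice. */
static _Thread_local _Alignas(ALIGNMENT)
    unsigned char thread_pool[THREAD_POOL_BYTES];
static _Thread_local size_t thread_used;

void *Per_Thread_Try_Allocate(size_t sz)
{
    sz = round_up(sz);
    if (sz == 0 || sz > THREAD_POOL_BYTES - thread_used)
        return NULL;             /* exhausted: caller falls to level 2 */
    void *mem = &thread_pool[thread_used];
    thread_used += sz;
    return mem;
}

/* Level 2: pool shared by the threads of one core. The bump offset is
   advanced with an atomic RMW (compare-and-swap) using relaxed
   ordering, i.e. no memory barrier is requested. */
static _Alignas(ALIGNMENT) unsigned char core_pool[CORE_POOL_BYTES];
static _Atomic size_t core_used;

void *Per_Core_Try_Allocate(size_t sz)
{
    sz = round_up(sz);
    size_t old = atomic_load_explicit(&core_used, memory_order_relaxed);
    do {
        if (sz == 0 || sz > CORE_POOL_BYTES - old)
            return NULL;         /* exhausted: caller falls to level 3 */
    } while (!atomic_compare_exchange_weak_explicit(
                 &core_used, &old, old + sz,
                 memory_order_relaxed, memory_order_relaxed));
    return &core_pool[old];
}
```

The thread-local path touches only `_Thread_local` state, which is why it needs no atomics at all; the core-level path pays for one compare-and-swap per allocation.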




> Using multicore properly will require undoing about 60 years of
> thinking, 60 years of believing that CPUs are expensive.


The bottleneck is the cache-coherency system. Luckily, there are years of
experience in dealing with weak cache schemes... Think RCU.




> Why multi-thread *anything* when hundreds or thousands of CPUs are
> available?


You don't think there is any need for communication between cores on a chip?


All times are GMT -4.


Powered by vBulletin® Version 3.6.8
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Integrated by bbpixel2008 :: jvbPlugin R1013.368.1

Search Engine Friendly URLs by vBSEO 3.1.0
vB Ad Management by =RedTyger=
In an effort to better serve ads to our visitors, cookies are used on Mydatabasesupport.com. For more information, check out our Privacy Policy.