|
#11
|
| Wilco Dijkstra wrote: > "Chris M. Thomasson" >> "NV55" >>> Larrabee will be a stand-alone chip, meaning it will be very different >>> than the low-end--but widely used--integrated graphics that Intel now >>> offers as part of the silicon that accompanies its processors. And >>> Larrabee will be based on the universal Intel x86 architecture. >> [...] >> >> Are they saying that programming this chip will be easier than programming a GPU because it honors the well >> established x86 arch? > > That's rubbish indeed. The cache coherency seems to be the only advantage > as other GPU also support C. The real advantage has been lost in the Page Ranking: Larrabee doesn't just support C, it supports pthreads (and thus any other concurrency model that can be built on pthreads). MIMD + cache coherence + x86 is a significant advantage over CUDA (which I would describe as "C, but not as we know it"). I noticed recently that Cilk++, TBB, Fortress, and X10 are all using work-stealing rather than static partitioning. AFAIK MIMD is a prerequisite for work-stealing, so many of the future parallel programming languages may not be able to run on conventional GPUs at all. Wes Felter - wesley-at-felter.org |
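The contrast between static partitioning and work-stealing mentioned above can be sketched with a small deterministic simulation. This is a toy model only, not the scheduler used by Cilk++, TBB, Fortress, or X10: the function names, the unit-time clock, and the "steal from the fullest deque" policy are all assumptions made for illustration, and task costs are assumed to be positive integers.

```python
from collections import deque

def static_makespan(queues):
    """Static partitioning: each worker runs only its own tasks,
    so the finish time is the load of the heaviest queue."""
    return max(sum(q) for q in queues)

def stealing_makespan(queues):
    """Work-stealing, simulated in unit time steps.
    An idle worker first takes the newest task from its own deque;
    if that is empty it steals the oldest task from the fullest deque."""
    deques = [deque(q) for q in queues]
    remaining = [0] * len(deques)      # time left on each worker's current task
    clock = 0
    while any(deques) or any(remaining):
        for w, own in enumerate(deques):
            if remaining[w] == 0:
                if own:
                    remaining[w] = own.pop()             # owner end of the deque
                else:
                    victim = max(deques, key=len)
                    if victim:
                        remaining[w] = victim.popleft()  # thief end of the deque
        clock += 1
        remaining = [max(0, r - 1) for r in remaining]
    return clock

# Hypothetical, badly balanced workload: all eight unit tasks start on worker 0.
unbalanced = [[1] * 8, [], [], []]
print(static_makespan(unbalanced), stealing_makespan(unbalanced))
```

Each simulated worker decides for itself, at run time, what to execute next. That independent control flow per worker is what the post means by MIMD being a prerequisite.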
|
#12
|
| In article <48987d7b$1@kcnews01>, Wes Felter |> |> The real advantage has been lost in the Page Ranking: Larrabee doesn't just |> support C, it supports pthreads (and thus any other concurrency model |> that can be built on pthreads). Unfortunately, the very concept of supporting C and pthreads is ill-formed. The standards are so grossly inconsistent that God alone knows what they mean. I know for a certainty that nobody who worked on them does. The reason that pthreads causes only as much problem as it does is that users don't use pthreads as such for high-communication applications, and so the incidence of failing race conditions and exposed inconsistencies is low. That applies EVEN to codes written solely for the x86! If users start using Larrabee or Niagara etc. for high-communication applications, and use pthreads, all that will change. |> I noticed recently that Cilk++, TBB, Fortress, and X10 are all using |> work-stealing rather than static partitioning. AFAIK MIMD is a |> prerequisite for work-stealing, so many of the future parallel |> programming languages may not be able to run on conventional GPUs |> at all. I notice your implication that those have a future - well, we can agree that they don't have a past :-) More seriously, I agree with you, whether it is those languages or others. SIMD has been proven to be a massively successful model, for a restricted set of problems. And attempts to extend it to a very much wider range of problems have failed, over a period of 30+ years. I teach that you should always look at SIMD first, and use it if at all possible, but don't be surprised if it isn't. Regards, Nick Maclaren. |
|
#13
|
| Nick Maclaren wrote: > In article <48987d7b$1@kcnews01>, Wes Felter > |> > |> The real advantage has been lost in the Page Ranking: Larrabee doesn't just > |> support C, it supports pthreads (and thus any other concurrency model > |> that can be built on pthreads). > > Unfortunately, the very concept of supporting C and pthreads is > ill-formed. The standards are so grossly inconsistent that God > alone knows what they mean. I know for a certainty that nobody > who worked on them does. According to the nice white paper Intel published, they've already extended pthreads: http://softwarecommunity.intel.com/U...e_manycore.pdf "We have extended the API to also allow developers to specify thread affinity with a particular HW thread or core." and then they go on to say: "Although P-threads is a powerful thread programming API, its thread creation and thread switching costs may be too high for some application threading. To amortize such costs, Larrabee Native provides a task scheduling API based on a light weight distributed task stealing scheduler [Blumofe et al. 1996]. A production implementation of such a task programming API can be found in Intel Thread Building Blocks" The key missing item, at least to me, was a specification of the double vs single precision performance. On the original Cell, double ran at 1/8 the speed of float, but it seems like more recent versions is fixing this, to the point where you get about 50% of the throughput. This is an important point for people (like me) who would like to have a TFlop or so available in single chip and then gang up a cluster of them to run serious simulation tasks. Terje -- - "almost all programming can be viewed as an exercise in caching" |
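The effect of the double-to-single ratio on cluster sizing can be worked through as simple arithmetic. The 1 TFlop single-precision peak and the 10 TFlop target below are hypothetical figures chosen for illustration; only the 1/8 and 50% ratios come from the post.

```python
import math

def double_precision_tflops(single_tflops, dp_ratio):
    """Double-precision throughput, given the single-precision peak
    and the fraction of it that survives in double precision."""
    return single_tflops * dp_ratio

def chips_needed(target_dp_tflops, single_tflops, dp_ratio):
    """Number of chips needed to reach a double-precision target."""
    per_chip = double_precision_tflops(single_tflops, dp_ratio)
    return math.ceil(target_dp_tflops / per_chip)

# Hypothetical chip with a 1 TFlop single-precision peak, 10 TFlop DP target.
for ratio in (1 / 8, 0.5):
    print(ratio, chips_needed(10, 1.0, ratio))
```

Under these assumptions, moving from a 1/8 ratio to a 50% ratio cuts the chip count by a factor of four, which is why the missing specification matters for simulation clusters.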
|
#14
|
| In article Terje Mathisen |> |> > Unfortunately, the very concept of supporting C and pthreads is |> > ill-formed. The standards are so grossly inconsistent that God |> > alone knows what they mean. I know for a certainty that nobody |> > who worked on them does. |> |> According to the nice white paper Intel published, they've already |> extended pthreads: |> |> http://softwarecommunity.intel.com/U...e_manycore.pdf |> |> "We have extended the API to also allow developers to specify thread |> affinity with a particular HW thread or core." Clearly useful, but it doesn't address my points. If they had defined a proper memory model, or sorted out the thread- safety mess, that would be much more useful. |> and then they go on to say: |> |> "Although P-threads is a powerful thread programming API, its |> thread creation and thread switching costs may be too high for |> some application threading. To amortize such costs, Larrabee |> Native provides a task scheduling API based on a light weight |> distributed task stealing scheduler [Blumofe et al. 1996]. A |> production implementation of such a task programming API can |> be found in Intel Thread Building Blocks" Well, the actual specification may say something more rational; as it stands, that is codswallop. Because there is so much state in C and a pthread, you can't quiesce one section of code and start another without doing it at the thread level. |> The key missing item, at least to me, was a specification of the double |> vs single precision performance. On the original Cell, double ran at 1/8 |> the speed of float, but it seems like more recent versions is fixing |> this, to the point where you get about 50% of the throughput. A key point compared with the chip being unprogrammable? Yes, it's important, but let's see if it is possible to program the thing and get reliable results even with integers! And that is so far unproven. Remember the Itanic? Regards, Nick Maclaren. |
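The kind of failing race condition referred to above can be illustrated with a deterministic toy model. This is not pthreads and says nothing about what the C or POSIX standards guarantee; the generator-based "threads" and the explicit schedule are assumptions used to make one interleaving reproducible.

```python
def increment(shared):
    """One non-atomic increment, split into the load and the store
    that a real machine would perform as separate steps."""
    value = shared["counter"]        # load
    yield                            # another thread may run here
    shared["counter"] = value + 1    # store

def run(schedule):
    """Run two simulated threads, stepping them in the given order."""
    shared = {"counter": 0}
    threads = [increment(shared), increment(shared)]
    for thread_id in schedule:
        next(threads[thread_id], None)
    return shared["counter"]

print(run([0, 0, 1, 1]))   # threads run one after the other
print(run([0, 1, 0, 1]))   # loads and stores interleave, one update is lost
```

With low-communication code the second schedule is rare, which matches the observation that such bugs stay hidden until the communication rate goes up.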
|
#15
|
| On Tue, 05 Aug 2008 08:24:04 -0700, John Larkin >On Tue, 5 Aug 2008 13:30:52 +0200, "Skybuck Flying" > > >>As the number of cores goes up the watt requirements goes up too ? > >Not necessarily, if the technology progresses and the clock rates are >kept reasonable. And one can always throttle down the CPUs that aren't >busy. > >> >>Will we need a zillion watts of power soon ? >> >>Bye, >> Skybuck. >> > >I saw suggestions of something like 60 cores, 240 threads in the >reasonable future. > Oops, 4 threads per core is 320 threads. My XP is currently running 33 processes and maybe a couple dozen device drivers. John |
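The claim that more cores need not mean more watts follows from the usual first-order approximation for CMOS dynamic power, P ≈ N · C · V² · f. The sketch below uses that approximation only; it ignores leakage, and the voltage and clock figures are hypothetical values picked for illustration.

```python
def relative_power(cores, voltage, frequency):
    """Dynamic power relative to one core at voltage 1.0 and clock 1.0,
    using P ~ cores * V^2 * f with the capacitance term held constant."""
    return cores * voltage ** 2 * frequency

baseline = relative_power(cores=1, voltage=1.0, frequency=1.0)

# Hypothetical design point: four cores at half the clock and 70% of the voltage.
many_slow_cores = relative_power(cores=4, voltage=0.7, frequency=0.5)

print(baseline, many_slow_cores)
```

Under these assumed numbers, four slower cores draw about the same dynamic power as one fast core while offering roughly twice the aggregate clock cycles, provided the workload is parallel enough to use them.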
|
#16
|
| "Terje Mathisen" news:XZWdnTqNEJhJFgXVnZ2dnUVZ8sDinZ2d-at-giganews.com ... > Nick Maclaren wrote: >> In article <48987d7b$1@kcnews01>, Wes Felter >> |> |> The real advantage has been lost in the Page Ranking: Larrabee doesn't just >> |> support C, it supports pthreads (and thus any other concurrency model >> |> that can be built on pthreads). >> >> Unfortunately, the very concept of supporting C and pthreads is >> ill-formed. The standards are so grossly inconsistent that God >> alone knows what they mean. I know for a certainty that nobody >> who worked on them does. > > According to the nice white paper Intel published, they've already > extended pthreads: > > http://softwarecommunity.intel.com/U...e_manycore.pdf > > "We have extended the API to also allow developers to specify thread > affinity with a particular HW thread or core." > > and then they go on to say: > > "Although P-threads is a powerful thread programming API, its > thread creation and thread switching costs may be too high for > some application threading. To amortize such costs, Larrabee > Native provides a task scheduling API based on a light weight > distributed task stealing scheduler [Blumofe et al. 1996]. A > production implementation of such a task programming API can > be found in Intel Thread Building Blocks" FWIW, last time I checked, there was a very nasty race condition in the TBB "scheduler": http://groups.google.com/group/comp....e96ade96038553 (read all...) Also, there is a much better work-stealing algorithm out there: http://research.sun.com/scalable/pub...rkstealing.pdf http://groups.google.com/group/comp....d297f61b369a41 However, knowing SUN, it probably has a patent application... > The key missing item, at least to me, was a specification of the double vs > single precision performance. On the original Cell, double ran at 1/8 the > speed of float, but it seems like more recent versions is fixing this, to > the point where you get about 50% of the throughput. 
> > This is an important point for people (like me) who would like to have a > TFlop or so available in single chip and then gang up a cluster of them to > run serious simulation tasks. |
|
#17
|
| "John Larkin" news:rtrg9458spr43ss941mq9p040b2lp6hbgg-at-4ax.com... > On Tue, 5 Aug 2008 13:30:52 +0200, "Skybuck Flying" > > >>As the number of cores goes up the watt requirements goes up too ? > > Not necessarily, if the technology progresses and the clock rates are > kept reasonable. And one can always throttle down the CPUs that aren't > busy. > >> >>Will we need a zillion watts of power soon ? >> >>Bye, >> Skybuck. >> > > I saw suggestions of something like 60 cores, 240 threads in the > reasonable future. I can see it now... A mega-core GPU chip that can dedicate 1 core per-pixel. lol. > This has got to affect OS design. They need to completely rethink their multi-threaded synchronization algorithms. I have a feeling that efficient distributed non-blocking algorithms, which are comfortable running under a very weak cache coherency model, will be all the rage. Getting rid of atomic RMW or StoreLoad style memory barriers is the first step. |
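One common way to avoid atomic read-modify-write operations is to give each thread its own single-writer slot and let a reader combine the slots afterwards. The sketch below shows only that structure. It is an illustration, not the poster's algorithm: the function name is made up, and CPython's global interpreter lock hides the hardware memory-ordering questions (such as StoreLoad barriers) that a C or C++ version would have to address.

```python
import threading

def count_events(num_threads, events_per_thread):
    """Distributed counter: each thread publishes its running total with a
    plain store to its own slot, so no shared location is ever updated
    with a read-modify-write by more than one thread."""
    slots = [0] * num_threads

    def worker(index):
        local = 0
        for _ in range(events_per_thread):
            local += 1              # private to this thread
            slots[index] = local    # plain store, single writer per slot

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(slots)               # reader combines the per-thread slots

print(count_events(4, 1000))
```

The trade-off is that a reader sampling the slots while writers are still running sees only an approximate total, which is often acceptable for statistics counters.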
|
#18
|
| Chris M. Thomasson wrote: > "John Larkin" > message news:rtrg9458spr43ss941mq9p040b2lp6hbgg-at-4ax.com... >> On Tue, 5 Aug 2008 13:30:52 +0200, "Skybuck Flying" >> >> >>> As the number of cores goes up the watt requirements goes up too ? >> >> Not necessarily, if the technology progresses and the clock rates are >> kept reasonable. And one can always throttle down the CPUs that aren't >> busy. >> >>> >>> Will we need a zillion watts of power soon ? >>> >>> Bye, >>> Skybuck. >>> >> >> I saw suggestions of something like 60 cores, 240 threads in the >> reasonable future. > > I can see it now... A mega-core GPU chip that can dedicate 1 core > per-pixel. Why not? Probably configured as a systolic array http://en.wikipedia.org/wiki/Systolic_array -- Dirk http://www.transcendence.me.uk/ - Transcendence UK http://www.theconsensus.org/ - A UK political party http://www.onetribe.me.uk/wordpress/?cat=5 - Our podcasts on weird stuff |
|
#19
|
| "Dirk Bruere at NeoPax" news:6fqv72Fcv806U1-at-mid.individual.net... > Skybuck Flying wrote: >> As the number of cores goes up the watt requirements goes up too ? >> >> Will we need a zillion watts of power soon ? >> >> Bye, >> Skybuck. > > Since the ATI Radeon™ HD 4800 series has 800 cores you work it out. Just note that the 4870 needs TWO of those 6 pin power leads... Rarius |
|
#20
|
| Nick Maclaren wrote: > In article > Terje Mathisen > |> The key missing item, at least to me, was a specification of the double > |> vs single precision performance. On the original Cell, double ran at 1/8 > |> the speed of float, but it seems like more recent versions is fixing > |> this, to the point where you get about 50% of the throughput. > > A key point compared with the chip being unprogrammable? > > Yes, it's important, but let's see if it is possible to program the > thing and get reliable results even with integers! And that is so > far unproven. Remember the Itanic? I'm very confident that the chip will actually work, and give useful, repeatable results, but I don't expect things like fast (or even any?) denormal handling except flush to zero. Terje -- - "almost all programming can be viewed as an exercise in caching" |
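The difference between gradual underflow (denormals) and flush to zero can be shown with IEEE 754 doubles. The `flush_to_zero` helper is a software emulation written for this illustration, not a description of any chip's behaviour, and it assumes the Python interpreter itself runs with denormals enabled, as is normally the case.

```python
import math
import sys

SMALLEST_NORMAL = sys.float_info.min    # smallest positive normal double

def flush_to_zero(x):
    """Emulate flush-to-zero: any nonzero result below the smallest
    normal magnitude is replaced by a zero of the same sign."""
    if x != 0.0 and abs(x) < SMALLEST_NORMAL:
        return math.copysign(0.0, x)
    return x

x = 1.5 * SMALLEST_NORMAL
y = SMALLEST_NORMAL

gradual = x - y                  # denormal result, still nonzero
flushed = flush_to_zero(x - y)   # same subtraction under emulated flush to zero

print(x != y, gradual != 0.0, flushed == 0.0)
```

With denormals, `x - y == 0` implies `x == y`; under flush to zero that property is lost, which is the sort of detail that matters for code that divides by a difference after testing it against zero.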