Sun fattens up Niagara for middleware play

Sun Microsystems' Niagara processors are growing up. The company this week confirmed that the third-generation Niagara chip code-named "Victoria Falls" will slot into two- and four-socket servers. (We revealed this move about 18 months ago and thank Sun for getting with the program.) When Victoria Falls ships during the …


This topic is closed for new posts.
  1. Anonymous Coward

    Has anyone seen a Sun processor roadmap update since 4/21/06?


    “Although the Niagara 1 core executes just one instruction per clock cycle, it manages four separate execution threads simultaneously…At a processor frequency of 1.2GHz, the four executable threads each run at a speed of 300MHz.” Microprocessor Report

    T2 1.4GHz chip - each light thread is only 1.4GHz/4 = 350MHz

    What kind of performance servers would allow the consolidation of 64 servers onto one chip with a 350MHz thread each?

    When will Sun have dynamic thread allocation for its chip partitioning?

    Did Sun really run 64 copies of the SPEC benchmark and add up the result of each one and pretend it was single application performance?

    I think you mean a "four chip box" in the article. Will Victoria Falls be a true SMP? One OS running across both chips transparently?

    Where does this leave "ROCK", which Sun calls the "single thread performance chip"? Is it 4 cores or 16 mini-cores?

  2. Anonymous Coward

    Re: Has anyone seen a Sun processor roadmap update since 4/21/06?

    You're making assumptions here that make your analysis needlessly negative.

    For example, you say: T2 1.4GHz chip - each light thread is only 1.4GHz/4 = 350MHz

    This is only true if you have four threads running on the same core, each running at CPI=1 (Cycles per instruction). However, it has been shown repeatedly that for commercial software 2<CPI<8 depending on the application. The T2 does not execute a thread that is waiting for data to arrive. Let's assume that our applications run at CPI=2 on average (fairly typical for webservers on UltraSPARC II), then the calculation becomes:

    1.4GHz/4*2=700MHz. The T2 appears to perform as 4x700MHz!

    Frequent I/O accesses (network/disk) provide much more benefit - as long as you have the calculation threads ready to run.
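
    The arithmetic above can be sketched in a few lines. This is a rough model, not Sun's own: it assumes each of the four threads on a core gets an equal share of the issue slots, that a thread running at a given CPI needs only one slot every CPI cycles, and that a thread can never look faster than the core clock. The function name and the cap are illustrative assumptions.

```python
def equivalent_clock_mhz(core_clock_mhz, threads_per_core, cpi):
    """Clock of a conventional single-threaded CPU that would match one thread.

    Assumption (not from Sun): each thread gets 1/threads_per_core of the
    issue slots, and a thread running at `cpi` cycles per instruction on a
    conventional CPU only needs one slot every `cpi` cycles.
    """
    share = core_clock_mhz / threads_per_core  # issue slots per microsecond
    # A thread cannot appear faster than the core clock itself.
    return min(core_clock_mhz, share * cpi)


worst_case = equivalent_clock_mhz(1400, 4, 1)  # CPI=1: the 350MHz figure
typical = equivalent_clock_mhz(1400, 4, 2)     # CPI=2: the 700MHz figure
```

    Under these assumptions the 350MHz figure is the floor for a fully loaded core, not the typical case.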

    Regarding your other points:

    Dynamic thread allocation for chip partitioning - care to explain what it is you mean? It sounds like your optimized question process did not adhere to all 6 sigma required procedures.

    Results from running multiple copies of the SPEC benchmark simultaneously are known as SPECrate results, and they are an accepted standard.

    Victoria Falls is a true SMP design (see You are confusing T2 (non-SMP) with Niagara2/Victoria Falls (true SMP)

    Rock = 16 cores, multiple threads per core (see

  3. amanfromMars Silver badge

    Joining up the Dots.

    "One OS running across both chips transparently?"

    Chip Designers running Operating Systems is more likely. Check out more detail on "Calloused hands and AI Bug at Total Information Arrivals" ... which may be posited as Comment here/there ....

    "did not adhere to all 6 sigma required procedures." Are you saying, anonymous, that there are rules and regulations for Imaginative Processor IntelAIgent Design?

    I'm sure you will agree that that is QuITe Preposterous.

  4. Anonymous Coward

    Re: Has anyone seen a Sun processor roadmap update since 4/21/06?

    In response to some of the other questions:

    "n socket" is fairly conventional terminology by now.

    Yes, of course it will be a true SMP, and one OS instance will run across all the sockets, if you want that. I imagine that, like the current Niagara offerings, you will be able to partition the system if you want to support more than one OS instance as well. Sun's terminology for this partitioning is LDoms - it's distinct from the domains offered in larger systems which offer more isolation, and from zones/containers where there is only one OS instance. Of course you can use LDoms with zones within each LDom etc.


  5. This post has been deleted by its author

  6. Anonymous Coward

    Re: Has anyone seen a Sun processor roadmap update since 4/21/06?

    I would agree the first post here seems either negative or based on some loose talk from the linked source of this post, but it's clear Sun are using a bit of unprofessional licence with the word "simultaneous" (i.e. saying 4 threads can execute "simultaneously").

    I thought each core keeps track of 4 separate threads and only switches threads when a thread has stalled waiting for new data or instructions to execute: the instruction or data cache doesn't hold what the thread needs, causing a miss, and the thread stalls.

    So the maths of dividing 1.2GHz by 4 threads is ridiculous. One thread at a time runs at 1.2GHz, and when it stalls it gets switched out for another thread that runs until it stalls, and so on. This approach seems to beat the ass out of the old-world "single core, single thread, high GHz" chips that sit there stalled, doing nothing while waiting for their payload.

    Even dual-core, high-MHz chips are affected; having 32 threads means good utilisation.

    MPRonline simplifying this to core speed divided by four is a bit of a joke; Sun released simplified diagrams explaining this.

    Depending on your own situation, your cores may have lots to wait for (fetching from disk, network or other slow resources), which is where the multithreading works, I believe; or you may do lots with in-cache data with little stalling, which reduces the benefit somewhat.

    As they say, "Your mileage may vary"

  7. Lewis

    T2 vs T1

    Can anyone confirm that the T2 has removed T1's max thread speed of 1/4 core clock?

    The only place I see any reference to this feature is

    "Support for up to 32 simultaneous threads, with eight threads executed per clock cycle."

    from this page:

    And even that doesn't mention that idle threads get just as many executions as the busy ones.

    Now obviously it's not good marketing to say "we scale up but not down", but over-inflating expectations is just asking for disappointment.

    PS I'm not pulling this out my arse, I was involved in evaluating the T1 when it was released and witnessed 250MHz max per thread on a 1GHz chip.

  8. Matt Bryant Silver badge

    Re: Re: Has anyone seen a Sun processor roadmap update since 4/21/06?

    I'm sure you're trying to paint the chip in the best light possible, but maybe comparing the performance to UltraSPARC II webserving isn't such a smart idea seeing as most of that old junk has been replaced by cheaper and faster x86 servers.....

    So, basically then, this is more of the same, only with more cores? So, good for simple webserving, just about pants at anything else. I'm also not too sure there is enough cache to keep all the threads happy if they are all stalling - how good is Sun's cache-hit ratio, and how often will the small amount of cache have to be flushed and reloaded from slower RAM or much slower disk? Frequent I/O accesses only work if the cache can keep the cores supplied.

  9. Anonymous Coward

    T2 threads per core vs. T1 threads per core

    The UltraSPARC T1 uses vertical multithreading (only one thread executing at a time). Each core timeslices between four active threads. Each clock cycle, the thread changes. If a thread is stalled and waiting for data, it is not run; the next one is. This way, there is no penalty like that of "switch on event" (SOE) vertical multithreading, where one thread has to stall before another thread is loaded, and loading that other thread costs time.

    The round-robin nature of T1 threading is simple, requiring fewer transistors to implement. It is also efficient. I saw a presentation which showed that four threads on a T1 core produce about 3X the output of a single thread. That compares with about 1.25X for Intel Pentium 4's Hyper-Threading and 1.5X for IBM POWER5's SMT.
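
    The round-robin policy described above can be sketched as a toy simulation. This is an illustration only, not a model of the real T1 pipeline: the stall pattern (each thread issues a fixed number of instructions, then waits a fixed number of cycles) and all names are assumptions made up for the example, so the ideal 4X it produces is not the roughly 3X figure quoted from the presentation.

```python
def simulate_core(num_threads, run_cycles, stall_cycles, total_cycles):
    """Count instructions issued by one single-issue core.

    Hypothetical workload: every thread issues `run_cycles` instructions
    (one per cycle in which it is selected) and then stalls for
    `stall_cycles` cycles. Each cycle the core picks the next ready thread
    in round-robin order, skipping any thread that is still stalled.
    """
    ready_at = [0] * num_threads            # first cycle a thread may issue
    issued_since_stall = [0] * num_threads  # progress towards the next stall
    last = -1                               # thread selected most recently
    issued = 0
    for cycle in range(total_cycles):
        for offset in range(1, num_threads + 1):
            t = (last + offset) % num_threads
            if ready_at[t] <= cycle:
                issued += 1
                issued_since_stall[t] += 1
                if issued_since_stall[t] == run_cycles:
                    issued_since_stall[t] = 0
                    ready_at[t] = cycle + 1 + stall_cycles
                last = t
                break  # single-issue: at most one instruction per cycle
    return issued


one_thread = simulate_core(1, run_cycles=1, stall_cycles=3, total_cycles=1200)
four_threads = simulate_core(4, run_cycles=1, stall_cycles=3, total_cycles=1200)
```

    With this made-up workload a lone thread leaves three cycles in four empty, while four such threads fill every cycle. Real stall patterns are irregular, which is one reason a measured gain would fall short of the ideal.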

    Each UltraSPARC T2 core has two integer execution units, each supporting four threads. See the preso on (

    So in the T1, 8 threads are simultaneously executing, and up to 24 threads are waiting. In the T2, 16 threads are simultaneously executing, and up to 48 threads are waiting.
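
    The counts above follow from multiplying out the chip layout. A small sketch, assuming eight cores per chip for both parts (consistent with the 32-thread and 64-thread totals quoted elsewhere in this thread); the function name is hypothetical.

```python
def thread_slots(cores, pipelines_per_core, threads_per_pipeline):
    """Return (threads executing per cycle, threads waiting) for one chip."""
    total = cores * pipelines_per_core * threads_per_pipeline
    executing = cores * pipelines_per_core  # one thread issues per pipeline
    return executing, total - executing


t1 = thread_slots(cores=8, pipelines_per_core=1, threads_per_pipeline=4)
t2 = thread_slots(cores=8, pipelines_per_core=2, threads_per_pipeline=4)
```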

    Regarding the earlier comment on "dynamic thread allocation", I am assuming the commenter means some kind of ability to dynamically allocate threads between LDOMs. I assume this could be done with some external scripting, as CPU threads are dynamic in LDOMs. If "dynamic thread allocation" refers to the ability to allocate fractional threads to LDOMs, I don't see a need for this. A T1 chip has about the same throughput as two POWER5 chips (four cores), and offers up to 32 LDOMs vs. the two POWER5s' up to 40 LPARs. I doubt customers have much need for so many partitions on such small machines, nor do I believe they would want to fractionally weight partitions by less than 3% of the total compute available (in the IBM scenario, the fractional weight is 0.25%, as each core can be weighted in 100 increments). The overhead required to share threads between partitions (cache flushes, etc.), as in IBM's Micro-Partitioning, likely outweighs some of the benefits that the additional 0.25%-3.0% performance increments provide.
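
    The granularity figures above come from simple division. A short sketch, assuming (as the comment does) one hardware thread per LDOM on a 32-thread T1, and 100 weight increments on each of four POWER5 cores; the function name is hypothetical.

```python
def smallest_share_percent(allocation_units):
    """Smallest slice of total compute that one allocation unit represents."""
    return 100.0 / allocation_units


ldom_step = smallest_share_percent(32)       # one of 32 hardware threads
lpar_step = smallest_share_percent(4 * 100)  # 4 cores x 100 increments each
```

    This puts the LDOM step at roughly 3% of the chip and the LPAR step at 0.25%, the two ends of the range discussed above.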

    That brings up another advantage of the T1 and T2's timeslicing of threads in an LDOM environment. Each thread's context (cache state, TLBs, etc.) is maintained in hardware, so there is very low overhead to have different OSs on each thread state. It makes the hypervisor's job easier too.

    Regarding the question of whether Victoria Falls is an SMP, the answer is yes. It has cache coherency and interconnect logic on the chip.


Biting the hand that feeds IT © 1998–2021