T2 threads per core vs. T1 threads per core
The UltraSPARC T1 uses vertical multithreading (only one thread executes at a time). Each core timeslices among four active threads, switching to a different thread every clock cycle. If a thread is stalled waiting for data, it is skipped and the next ready thread runs. This avoids the penalty of "switch on event" (SOE) vertical multithreading, where one thread must stall while another thread's state is loaded.
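The selection policy is easy to picture in code. Here is a minimal sketch (my own illustration, not actual T1 hardware logic) of round-robin thread selection that skips stalled threads, so a cache miss in one thread never blocks the others:

```python
def pick_next_thread(threads, last):
    """Round-robin over `threads`, skipping any marked stalled.

    `threads` is a list of dicts with a boolean 'stalled' flag;
    `last` is the index issued on the previous cycle.
    Returns the index to issue this cycle, or None if all are stalled.
    """
    n = len(threads)
    for step in range(1, n + 1):
        candidate = (last + step) % n
        if not threads[candidate]["stalled"]:
            return candidate
    return None  # every thread is waiting on memory

# Example: four threads per core, thread 1 stalled on a cache miss.
threads = [{"stalled": False}, {"stalled": True},
           {"stalled": False}, {"stalled": False}]
order, last = [], 3
for _ in range(6):
    last = pick_next_thread(threads, last)
    order.append(last)
print(order)  # -> [0, 2, 3, 0, 2, 3]; thread 1 never issues while stalled
```

Because each thread's state is already resident in hardware, this "switch" costs nothing, which is the contrast with SOE the paragraph above draws.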
The round-robin nature of T1 threading is simple, requiring fewer transistors to implement, and it is also efficient. I saw a presentation showing that four threads on a T1 core produce about 3X the throughput of a single thread. That compares with about 1.25X for Intel Pentium 4's Hyper-Threading and 1.5X for IBM POWER5's SMT.
Each UltraSPARC T2 core has two integer execution units, each supporting four threads. See the preso on OpenSPARC.net (http://www.opensparc.net/pubs/preszo/06/HotChips06_09_ppt_master.pdf).
So in the T1, 8 threads are simultaneously executing, and up to 24 threads are waiting. In the T2, 16 threads are simultaneously executing, and up to 48 threads are waiting.
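Those counts follow directly from the core geometry (8 cores in both chips); a quick sanity check of the arithmetic:

```python
# Per-chip thread arithmetic for T1 and T2 (8 cores each).
cores = 8
t1_threads_per_core = 4   # one integer pipeline, 4 threads
t2_threads_per_core = 8   # two integer pipelines, 4 threads each

t1_running = cores * 1    # one thread issues per core per cycle
t1_waiting = cores * t1_threads_per_core - t1_running
t2_running = cores * 2    # one thread per integer unit per cycle
t2_waiting = cores * t2_threads_per_core - t2_running

print(t1_running, t1_waiting)  # -> 8 24
print(t2_running, t2_waiting)  # -> 16 48
```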
Regarding the earlier comment on "dynamic thread allocation", I assume the commenter means some ability to dynamically allocate threads between LDOMs. That could be done with external scripting, since CPU threads can be added to and removed from LDOMs dynamically. If "dynamic thread allocation" refers to the ability to allocate fractional threads to LDOMs, I don't see a need for it. A T1 chip has about the same throughput as two POWER5 chips (four cores), and offers up to 32 LDOMs versus the two POWER5 chips' up to 40 LPARs. I doubt customers on such small machines have much need for so many partitions, nor do I believe they would want to weight partitions by less than about 3% of the total compute available (in the IBM scenario, the weighting granularity is 0.25%, since each core can be weighted in 100 increments). The overhead required to share threads between partitions (cache flushes, etc.), as in IBM's Micro-Partitioning, likely outweighs whatever benefit the additional 0.25%-3.0% performance increment provides.
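To make the granularity comparison concrete, here is the arithmetic behind those two percentages (just restating the numbers above, not vendor-published figures):

```python
# Partition-weighting granularity: IBM Micro-Partitioning vs. T1 whole threads.
power5_cores = 4                       # two POWER5 chips, four cores total
ibm_increments = power5_cores * 100    # each core weightable in 100 steps
ibm_granularity = 1 / ibm_increments   # smallest slice of the machine

t1_threads = 32                        # whole-thread allocation on T1
t1_granularity = 1 / t1_threads

print(f"{ibm_granularity:.2%}  {t1_granularity:.2%}")  # -> 0.25%  3.12%
```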
That brings up another advantage of the T1 and T2's timeslicing of threads in an LDOM environment: each thread's context (cache state, TLBs, etc.) is maintained in hardware, so the overhead of running a different OS on each thread is very low. It makes the hypervisor's job easier, too.
Regarding the question of whether Victoria Falls is an SMP, the answer is yes. It has cache coherency and interconnect logic on the chip.