[GTALUG] How to go fast without speculating... maybe
David Collier-Brown
davec-b at rogers.com
Mon Jan 29 17:36:18 EST 2018
Kunle Olukotun didn't like systems that wasted their time stalled on
loads and branches. He and his team at Afara Websystems therefor
designed a non-speculating processor that did work without waits. It
became the Sun T1.
Speed without speculating
The basic idea is to have more decoders than ALUs, so you can have lots
of threads competing for an ALU. If, for example, thread 0 comes to a
load, it will stall, so on the next instruction thread 1 gets the ALU,
and runs... until it stalls and thread 2 get the ALU. Ditto for thread
3, and control goes back to thread 0, which has completed a multi-cycle
fetch from cache and is ready to proceed once more.
That is the basic idea of the Sun T-series processors.
The strength is that the ALUs are never waiting for work. The weakness
is that individual threads still have to wait for data to come from cache.
You can improve on that
Now imagine it isn't entire ALUs that are the available resources, its
individual ALU component, like adders. Now the scenario becomes
* thread 0 stalls
* thread 1 get an adder
* thread 2 gets a compare (really a subtracter)
* thread 3 gets a branch unit, and will probably need to wait in the
next cycle
* thread 4 gets an adder
* thread 5 gets an FPU
... and so on. Each cycle, the hardware assigns as many ALU components
as it has available to threads, all of which can run. Only the stalled
threads are waiting, and they don't need ALU bits to do that.
Now more threads can run at the same time, the ALU components are
(probabilistically) all busy, and we have increased capacity. But
individual threads are still waiting for cache...
Do I feel lucky?
In principle, we could allocate two adders to thread 5, one doing the
current instruction and another doing a subsequent, non-dependent
instruction. It's not speculative, but it is out-of-order. That makes
some threads twice as fast when doing non-interacting calculations.
Allocate it three adders and it's three times as fast.
If we're prepared to have more ALU components than decoders, decode
deeply and we have enough of each to be likely to be able to find lots
of non-dependent instructions, then we can be executing multiple
instructions at once in multiple streams, and probabilistically get
/startlingly/ better performance.
I can see a new kind of optimizing compiler, too: one which tries to
group non-dependent instructions together.
Conclusion
Is this what happens in a T5? That's a question for a hardware
developer: I have no idea... yet
Links:
https://en.wikipedia.org/wiki/Kunle_Olukotun
https://en.wikipedia.org/wiki/Afara_Websystems
https://web.archive.org/web/20110720050850/http://www-hydra.stanford.edu/~kunle/
--
David Collier-Brown, | Always do right. This will gratify
System Programmer and Author | some people and astonish the rest
davecb at spamcop.net | -- Mark Twain
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://gtalug.org/pipermail/talk/attachments/20180129/fc82d26a/attachment.html>
More information about the talk
mailing list