[GTALUG] How to go fast without speculating... maybe
David Collier-Brown
davec-b at rogers.com
Wed Jan 31 11:09:20 EST 2018
I was writing a note for the GTA Linux user group about how, in
principle, a T-like processor could avoid falling into the hole that
speculative processors with slow access checks have fallen into... but I
realize I don't know enough about the published designs.
In your opinion, can a T5-like system dodge the bullet?
And should we write an article for ACM queue if it can? Or should Dr.
Olukotun?
--dave
> Kunle Olukotun didn't like systems that wasted their time stalled on
> loads and branches. He and his team at Afara Websystems therefore
> designed a non-speculating processor that did work without waits. It
> became the Sun T1.
>
>
> Speed without speculating
>
> The basic idea is to have more decoders than ALUs, so you can have
> lots of threads competing for an ALU. If, for example, thread 0 comes
> to a load, it will stall, so on the next instruction thread 1 gets the
> ALU, and runs... until it stalls and thread 2 gets the ALU. Ditto for
> thread 3, and control goes back to thread 0, which has completed a
> multi-cycle fetch from cache and is ready to proceed once more.
>
> That is the basic idea of the Sun T-series processors.
>
> The strength is that the ALUs are never waiting for work. The weakness
> is that individual threads still have to wait for data to come from cache.
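>
> The round-robin-on-stall idea can be sketched as a toy simulator.
> Everything here is invented for illustration (one shared ALU, four
> threads, a 3-cycle load latency) and is not a claim about real T1
> numbers:

```python
MEM_LATENCY = 3  # assumed: cycles a load stalls its own thread

def simulate(programs, cycles):
    """programs: per-thread instruction lists of 'op' or 'load'.
    Returns the (cycle, thread) pairs the single shared ALU executed."""
    n = len(programs)
    pc = [0] * n          # next instruction per thread
    ready_at = [0] * n    # first cycle each thread may issue again
    issued = []
    start = 0             # round-robin starting point
    for cycle in range(cycles):
        for k in range(n):
            t = (start + k) % n
            if cycle >= ready_at[t] and pc[t] < len(programs[t]):
                if programs[t][pc[t]] == "load":
                    # only this thread stalls; the ALU stays busy
                    ready_at[t] = cycle + 1 + MEM_LATENCY
                pc[t] += 1
                issued.append((cycle, t))
                start = (t + 1) % n
                break
    return issued
```

> With four threads each doing a load followed by an op, the ALU issues
> an instruction every cycle even though each individual thread spends
> most of its time stalled waiting on cache.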
>
>
> You can improve on that
>
> Now imagine the available resources aren't entire ALUs but
> individual ALU components, like adders. Now the scenario becomes
>
> * thread 0 stalls
> * thread 1 gets an adder
> * thread 2 gets a compare (really a subtracter)
> * thread 3 gets a branch unit, and will probably need to wait in the
> next cycle
> * thread 4 gets an adder
> * thread 5 gets an FPU
>
> ... and so on. Each cycle, the hardware assigns as many ALU components
> as it has available to threads, all of which can run. Only the stalled
> threads are waiting, and they don't need ALU bits to do that.
>
> Now more threads can run at the same time, the ALU components are
> (probabilistically) all busy, and we have increased capacity. But
> individual threads are still waiting for cache...
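>
> The per-component version amounts to a per-cycle allocator: a pool of
> units, and the ready threads bidding for them. A minimal sketch,
> where the unit counts and the request format are assumptions, not a
> real design:

```python
from collections import Counter

# Assumed pool of ALU components available each cycle
POOL = Counter({"adder": 2, "subtracter": 1, "branch": 1, "fpu": 1})

def allocate(requests):
    """requests: (thread_id, unit_kind) pairs for the threads that are
    ready this cycle. Returns the subset that gets a unit; the rest
    simply wait a cycle without consuming any ALU resources."""
    free = POOL.copy()
    granted = []
    for thread, kind in requests:
        if free[kind] > 0:
            free[kind] -= 1
            granted.append((thread, kind))
    return granted
```

> With two adders in the pool, three threads asking for an adder in the
> same cycle means one of them waits; the stalled threads never appear
> in the request list at all.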
>
>
> Do I feel lucky?
>
> In principle, we could allocate two adders to thread 5, one doing the
> current instruction and another doing a subsequent, non-dependent
> instruction. It's not speculative, but it is out-of-order. That makes
> some threads twice as fast when doing non-interacting calculations.
> Allocate it three adders and it's three times as fast.
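>
> The safety check behind that is ordinary dependence analysis, not
> speculation: the second instruction may issue alongside the first
> only if neither reads or clobbers the other's registers. A sketch,
> assuming three-address instructions written as (dest, src1, src2)
> tuples:

```python
def independent(a, b):
    """True if b can execute in the same cycle as the earlier a."""
    da, s1a, s2a = a
    db, s1b, s2b = b
    return (da not in (s1b, s2b)      # b does not read a's result
            and db not in (s1a, s2a)  # b does not clobber a's inputs
            and da != db)             # no write-write conflict

# r3 = r1 + r2 ; r6 = r4 + r5  -> independent, can use two adders at once
# r3 = r1 + r2 ; r4 = r3 + r1  -> dependent, must serialize
```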
>
> If we're prepared to have more ALU components than decoders, to
> decode deeply, and to have enough of each that we're likely to find
> lots of non-dependent instructions, then we can execute multiple
> instructions at once in multiple streams, and probabilistically get
> /startlingly/ better performance.
>
> I can see a new kind of optimizing compiler, too: one which tries to
> group non-dependent instructions together.
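>
> Such a compiler pass might look like this toy scheduler, which hoists
> a later independent instruction in between a dependent pair whenever
> it can do so without violating any register dependence. The
> three-address tuple IR is invented for illustration:

```python
def depends(b, a):
    """True if reordering b relative to a would change results."""
    da, s1a, s2a = a
    db, s1b, s2b = b
    return (da in (s1b, s2b)      # read-after-write
            or db in (s1a, s2a)   # write-after-read
            or da == db)          # write-after-write

def hoist_independent(instrs):
    """Reorder so a dependent pair gets an independent instruction
    pulled in between them when one exists further down the list."""
    out, rest = [], list(instrs)
    while rest:
        ins = rest.pop(0)
        out.append(ins)
        if rest and depends(rest[0], ins):
            for i in range(1, len(rest)):
                cand = rest[i]
                # cand must be independent of ins and of everything
                # it would be hoisted over
                if (not depends(cand, ins)
                        and not any(depends(cand, r) for r in rest[:i])):
                    rest.insert(0, rest.pop(i))
                    break
    return out
```

> Given "r3 = r1 + r2; r4 = r3 + r1; r6 = r5 + r5", the pass slides the
> independent third instruction between the dependent first two, so the
> hardware can keep two adders busy.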
>
>
> Conclusion
>
> Is this what happens in a T5? That's a question for a hardware
> developer: I have no idea... yet.
>
>
> Links:
>
> https://en.wikipedia.org/wiki/Kunle_Olukotun
>
> https://en.wikipedia.org/wiki/Afara_Websystems
>
> https://web.archive.org/web/20110720050850/http://www-hydra.stanford.edu/~kunle/
>
> --
> David Collier-Brown, | Always do right. This will gratify
> System Programmer and Author | some people and astonish the rest
> davecb at spamcop.net | -- Mark Twain
>