[GTALUG] How to go fast without speculating... maybe

Wed Jan 31 11:09:20 EST 2018

I was writing a note for the GTA Linux user group about how, in 
principle, a T-like processor could avoid falling into the hole that 
speculative processors with slow access checks have fallen into... but I 
realize I don't know enough about the published designs.

In your opinion, can a T5-like system dodge the bullet?

And should we write an article for ACM Queue if it can? Or should Dr. 
Olukotun?

--dave

> Kunle Olukotun didn't like systems that wasted their time stalled on 
> loads and branches. He and his team at Afara Websystems therefore 
> designed a non-speculating processor that did work without waits. It 
> became the Sun T1.
>
>
>   Speed without speculating
>
> The basic idea is to have more decoders than ALUs, so you can have 
> lots of threads competing for an ALU.  If, for example, thread 0 comes 
> to a load, it will stall, so on the next instruction thread 1 gets the 
> ALU, and runs... until it stalls and thread 2 gets the ALU.  Ditto for 
> thread 3, and control goes back to thread 0, which has completed a 
> multi-cycle fetch from cache and is ready to proceed once more.
>
> That is the basic idea of the Sun T-series processors.
>
> The strength is that the ALUs are never waiting for work. The weakness 
> is that individual threads still have to wait for data to come from cache.
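The rotation described above can be tried out in a toy model. This is only a sketch under stated assumptions: one ALU, one issue slot per cycle, a fixed load latency, and a made-up two-instruction program. The names simulate, program and load_latency are hypothetical, and none of this is taken from a published T1 design.

```python
def simulate(num_threads, program, load_latency, cycles):
    """Return the fraction of cycles in which the single ALU issued work.

    Toy model (an assumption, not the real T1 pipeline): every thread runs
    the same looping program, a "load" makes its thread wait load_latency
    extra cycles, and any other op lets the thread issue again next cycle.
    """
    pc = [0] * num_threads        # next instruction index for each thread
    ready_at = [0] * num_threads  # first cycle in which each thread may issue
    busy = 0
    last = -1                     # thread that issued most recently
    for cycle in range(cycles):
        # Round-robin: start looking just after the thread that issued last.
        for offset in range(1, num_threads + 1):
            t = (last + offset) % num_threads
            if ready_at[t] <= cycle:
                op = program[pc[t] % len(program)]
                pc[t] += 1
                if op == "load":
                    # The load issues now, then the thread waits on cache.
                    ready_at[t] = cycle + 1 + load_latency
                else:
                    ready_at[t] = cycle + 1
                busy += 1
                last = t
                break
    return busy / cycles


if __name__ == "__main__":
    prog = ["load", "add"]
    print(simulate(1, prog, load_latency=3, cycles=100))  # 0.4
    print(simulate(4, prog, load_latency=3, cycles=100))  # 1.0
```

With one thread the ALU sits idle during every load; with four threads there is always somebody ready, yet each individual thread still takes just as long to get its data.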
>
>
>   You can improve on that
>
> Now imagine it isn't entire ALUs that are the available resources, it's 
> individual ALU components, like adders.  Now the scenario becomes
>
>   * thread 0 stalls
>   * thread 1 gets an adder
>   * thread 2 gets a compare (really a subtracter)
>   * thread 3 gets a branch unit, and will probably need to wait in the
>     next cycle
>   * thread 4 gets an adder
>   * thread 5 gets an FPU
>
> ... and so on. Each cycle, the hardware assigns as many ALU components 
> as it has available to threads, all of which can run. Only the stalled 
> threads are waiting, and they don't need ALU bits to do that.
>
> Now more threads can run at the same time, the ALU components are 
> (probabilistically) all busy, and we have increased capacity. But 
> individual threads are still waiting for cache...
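The per-cycle hand-out of ALU components can be sketched the same way. The assumptions are mine: each ready thread asks for exactly one kind of component per cycle, a stalled thread asks for nothing, and ties go to the lowest-numbered thread, which real hardware would presumably not do. The names allocate, units and requests are hypothetical.

```python
from collections import Counter


def allocate(units, requests):
    """Grant each non-stalled thread the component it asks for, if one is free.

    units:    dict mapping component kind to how many exist
    requests: requests[t] is the kind thread t needs this cycle,
              or None if thread t is stalled
    Returns a dict mapping thread number to the kind it was granted.
    Fixed priority by thread number is a simplifying assumption.
    """
    free = Counter(units)
    granted = {}
    for t, kind in enumerate(requests):
        if kind is not None and free[kind] > 0:
            free[kind] -= 1
            granted[t] = kind
    return granted


if __name__ == "__main__":
    # The six threads from the list above; thread 0 is stalled.
    requests = [None, "adder", "compare", "branch", "adder", "fpu"]
    units = {"adder": 2, "compare": 1, "branch": 1, "fpu": 1}
    print(allocate(units, requests))
    # With only one adder, thread 4 has to wait for the next cycle.
    print(allocate({**units, "adder": 1}, requests))
```

The stalled thread costs nothing here: it simply makes no request, so every component that exists can go to a thread that can use it.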
>
>
>   Do I feel lucky?
>
> In principle, we could allocate two adders to thread 5, one doing the 
> current instruction and another doing a subsequent, non-dependent 
> instruction. It's not speculative, but it is out-of-order. That makes 
> some threads up to twice as fast when doing non-interacting 
> calculations. Allocate it three adders and it's up to three times as 
> fast, as long as it can find that many non-dependent instructions.
>
> If we're prepared to have more ALU components than decoders, to decode 
> deeply, and to have enough of each that we are likely to find lots of 
> non-dependent instructions, then we can be executing multiple 
> instructions at once in multiple streams, and probabilistically get 
> /startlingly/ better performance.
>
> I can see a new kind of optimizing compiler, too: one which tries to 
> group non-dependent instructions together.
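Grouping non-dependent instructions, whether the hardware or a compiler does it, can be sketched as a greedy packer. The assumptions are mine: every instruction takes one cycle, each register is written only once so that only read-after-write dependencies matter, and width stands for the number of adders given to the thread. The names schedule, instrs and width are hypothetical.

```python
def schedule(instrs, width):
    """Pack instructions into issue groups of at most `width` per cycle.

    instrs: list of (dest, sources) in program order; each dest is assumed
            to be written exactly once (no WAR/WAW hazards to handle).
    Returns a list of cycles, each a list of instruction indices.
    """
    produced_in = {}  # register -> cycle in which its value is computed
    cycles = []
    for i, (dest, sources) in enumerate(instrs):
        # Earliest cycle after every locally produced source is available.
        earliest = max(
            (produced_in[s] + 1 for s in sources if s in produced_in),
            default=0,
        )
        c = earliest
        # Skip cycles whose issue group is already full.
        while c < len(cycles) and len(cycles[c]) >= width:
            c += 1
        while len(cycles) <= c:
            cycles.append([])
        cycles[c].append(i)
        produced_in[dest] = c
    return cycles


if __name__ == "__main__":
    prog = [
        ("a", ("x", "y")),  # a = x + y
        ("b", ("z", "w")),  # b = z + w   (independent of a)
        ("c", ("a", "b")),  # c = a + b   (needs both)
        ("d", ("x", "z")),  # d = x + z   (independent)
    ]
    print(schedule(prog, width=1))  # [[0], [1], [2], [3]]
    print(schedule(prog, width=2))  # [[0, 1], [2, 3]]
    print(schedule(prog, width=3))  # [[0, 1, 3], [2]]
```

Two adders halve the cycle count for this little program, but a third buys nothing because c has to wait for a and b. That is the "probabilistically" part: the gain depends on how many non-dependent instructions are actually there to be found.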
>
>
>   Conclusion
>
> Is this what happens in a T5? That's a question for a hardware 
> developer: I have no idea... yet
>
>
> Links:
>
> https://en.wikipedia.org/wiki/Kunle_Olukotun
>
> https://en.wikipedia.org/wiki/Afara_Websystems
>
> https://web.archive.org/web/20110720050850/http://www-hydra.stanford.edu/~kunle/
>
> -- 
> David Collier-Brown,         | Always do right. This will gratify
> System Programmer and Author | some people and astonish the rest
> davecb at spamcop.net            |                      -- Mark Twain
>