<html>

  <head>

    <meta http-equiv="content-type" content="text/html; charset=utf-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <p>Kunle Olukotun didn't like systems that wasted their time stalled

      on loads and branches. He and his team at Afara Websystems

      therefore designed a non-speculating processor that did work

      without waits. It became the Sun T1.</p>

    <h1>Speed without speculating</h1>

    <p>The basic idea is to have more decoders than ALUs, so you can

      have lots of threads competing for an ALU.  If, for example,

      thread 0 comes to a load, it will stall, so on the next

      cycle thread 1 gets the ALU and runs... until it stalls and

      thread 2 gets the ALU.  Ditto for thread 3, and control goes back

      to thread 0, which has completed a multi-cycle fetch from cache

      and is ready to proceed once more.</p>

    <p>That is the basic idea of the Sun T-series processors.</p>

    <p>The strength is that the ALUs are never waiting for work. The

      weakness is that individual threads still have to wait for data to

      come from cache.</p>
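    <p>The switch-on-stall idea can be sketched as a toy simulation.
      This is only an illustration, not the T1's actual logic: the
      names Thread, run and LOAD_LATENCY, the three-cycle load latency
      and the two-opcode "instruction set" are all assumptions made
      for the example.</p>

```python
from dataclasses import dataclass

LOAD_LATENCY = 3  # assumed: cycles a thread stays stalled after a load


@dataclass
class Thread:
    program: list      # opcodes in order, e.g. ["load", "add"]
    pc: int = 0        # index of the next instruction to issue
    ready_at: int = 0  # first cycle at which the thread may issue again

    def done(self):
        return self.pc >= len(self.program)


def run(threads, max_cycles=1000):
    """Share one ALU among threads; return (cycles_used, busy_cycles)."""
    busy = 0
    current = 0  # thread that last held the ALU
    for cycle in range(max_cycles):
        if all(t.done() for t in threads):
            return cycle, busy
        # Keep the current thread if it can still run; otherwise hand the
        # ALU to the next thread that is neither finished nor stalled.
        for offset in range(len(threads)):
            i = (current + offset) % len(threads)
            t = threads[i]
            if not t.done() and cycle >= t.ready_at:
                op = t.program[t.pc]
                t.pc += 1
                if op == "load":
                    # The load issues this cycle, then the thread waits.
                    t.ready_at = cycle + 1 + LOAD_LATENCY
                busy += 1
                current = i
                break
        # If no thread was ready, the ALU simply idles this cycle.
    return max_cycles, busy
```

    <p>Running one thread whose program alternates loads and adds
      leaves the ALU idle during every load. Running four such
      threads lets the others fill those idle cycles, while each
      individual thread still waits out its own loads.</p>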

    <h1>You can improve on that</h1>

    <p>Now imagine it isn't entire ALUs that are the available
      resources, but individual ALU components, like adders.  Now the
      scenario becomes:</p>

    <ul>

      <li>thread 0 stalls</li>

      <li>thread 1 gets an adder</li>

      <li>thread 2 gets a compare (really a subtracter)</li>

      <li>thread 3 gets a branch unit, and will probably need to wait in

        the next cycle</li>

      <li>thread 4 gets an adder</li>

      <li>thread 5 gets an FPU</li>

    </ul>

    <p>... and so on. Each cycle, the hardware assigns as many ALU

      components as it has available to threads, all of which can run.

      Only the stalled threads are waiting, and they don't need ALU bits

      to do that.</p>

    <p>Now more threads can run at the same time, the ALU components are

      (probabilistically) all busy, and we have increased capacity. But

      individual threads are still waiting for cache...</p>
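    <p>One cycle of that per-component allocation might look like the
      sketch below. It is a guess at the bookkeeping, not a
      description of real hardware: the function name allocate, the
      unit names and the first-come-first-served policy are all
      assumptions.</p>

```python
from collections import Counter


def allocate(units, wanted):
    """Hand out functional units for a single cycle.

    units:  dict mapping a unit kind to how many exist, e.g. {"adder": 2}
    wanted: list of (thread_id, unit_kind) pairs; unit_kind is None
            when the thread is stalled and needs nothing
    Returns (granted, waiting) as two lists of thread ids.
    """
    free = Counter(units)
    granted, waiting = [], []
    for thread_id, kind in wanted:
        if kind is None:
            continue  # a stalled thread does not occupy any unit
        if free[kind] > 0:
            free[kind] -= 1
            granted.append(thread_id)
        else:
            waiting.append(thread_id)  # every unit of that kind is taken
    return granted, waiting


units = {"adder": 2, "compare": 1, "branch": 1, "fpu": 1}
wanted = [(0, None), (1, "adder"), (2, "compare"),
          (3, "branch"), (4, "adder"), (5, "fpu")]
granted, waiting = allocate(units, wanted)
```

    <p>With the request list from the scenario above, threads 1 to 5
      all get a unit and thread 0 costs nothing while it is stalled.
      A seventh thread asking for an adder would have to wait, since
      both adders are already taken.</p>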

    <h1>Do I feel lucky?</h1>

    <p>In principle, we could allocate two adders to thread 5, one doing

      the current instruction and another doing a subsequent,

      non-dependent instruction. It's not speculative, but it is

      out-of-order. That makes some threads up to twice as fast when
      doing non-interacting calculations. Allocate it three adders and
      it's up to three times as fast.</p>

    <p>If we're prepared to have more ALU components than decoders, to
      decode deeply, and to have enough of each that we are likely to
      find lots of non-dependent instructions, then we can be
      executing multiple instructions at once in multiple streams, and
      probabilistically get <em>startlingly</em> better performance.</p>
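    <p>To see why the gain depends on finding non-dependent
      instructions, here is a small cycle counter for one thread. It
      is a sketch under simplifying assumptions: every instruction
      takes one cycle on an adder, each destination name is written
      only once (so only read-after-write dependencies exist), and
      the function name cycles_needed and the window parameter are
      invented for the example.</p>

```python
def cycles_needed(instrs, adders, window=4):
    """Count cycles to run `instrs` on `adders` one-cycle adders.

    instrs: list of (dest, sources) tuples in program order.
    Each cycle, up to `adders` of the oldest `window` unissued
    instructions may issue, but only if every value they read was
    produced in an earlier cycle or is an input to the whole sequence.
    """
    written = {dest for dest, _ in instrs}  # names some instruction produces
    done = set()                            # names whose value is available
    pending = list(instrs)
    cycles = 0
    while pending:
        issued = []
        for instr in pending[:window]:
            _, sources = instr
            ready = all(s in done or s not in written for s in sources)
            if ready and adders > len(issued):
                issued.append(instr)
        if not issued:
            raise ValueError("an instruction reads a result that is never produced first")
        # Results only become visible to instructions issued in later cycles.
        for instr in issued:
            pending.remove(instr)
            done.add(instr[0])
        cycles += 1
    return cycles


independent = [("a", ("x", "y")), ("b", ("x", "z")),
               ("c", ("y", "z")), ("d", ("x", "x"))]
chain = [("a", ("x", "y")), ("b", ("a", "y")), ("c", ("b", "y"))]
```

    <p>The four independent additions need four cycles on one adder
      and two cycles on two adders. The dependent chain needs three
      cycles however many adders it is given, which is why the
      speed-up is only probable, not guaranteed.</p>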

    <p>I can see a new kind of optimizing compiler, too: one which tries

      to group non-dependent instructions together.</p>
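    <p>A very rough sketch of what such a grouping pass could do
      follows. It is not how any particular compiler works: the
      function name group_independent and the (dest, sources)
      instruction format are made up, and it assumes each
      destination is written only once and sources refer to inputs
      or to earlier instructions.</p>

```python
def group_independent(instrs):
    """Group instructions so no group member depends on another member.

    instrs: list of (dest, sources) tuples in program order.
    Returns a list of groups; the instructions in group k only read
    inputs or results computed by groups before k.
    """
    level = {}   # maps each dest to the index of the group computing it
    groups = []
    for dest, sources in instrs:
        # Place the instruction one group after its latest producer;
        # sources with no producer are inputs and impose no constraint.
        depth = max((level[s] + 1 for s in sources if s in level), default=0)
        level[dest] = depth
        if depth == len(groups):
            groups.append([])
        groups[depth].append((dest, sources))
    return groups


program = [("t1", ("a", "b")), ("t2", ("t1", "c")),
           ("t3", ("d", "e")), ("t4", ("t3", "f")),
           ("t5", ("t2", "t4"))]
groups = group_independent(program)
```

    <p>For this program the pass puts t1 and t3 in the first group,
      t2 and t4 in the second, and t5 alone in the third, so the
      hardware would find two non-dependent instructions side by side
      in each of the first two groups.</p>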

    <h1>Conclusion</h1>

    <p>Is this what happens in a T5? That's a question for a hardware

      developer: I have no idea... yet.</p>


    <p>Links:</p>

    <p><a class="moz-txt-link-freetext" href="https://en.wikipedia.org/wiki/Kunle_Olukotun">https://en.wikipedia.org/wiki/Kunle_Olukotun</a></p>

    <p><a class="moz-txt-link-freetext" href="https://en.wikipedia.org/wiki/Afara_Websystems">https://en.wikipedia.org/wiki/Afara_Websystems</a></p>

    <p><a class="moz-txt-link-freetext" href="https://web.archive.org/web/20110720050850/http://www-hydra.stanford.edu/~kunle/">https://web.archive.org/web/20110720050850/http://www-hydra.stanford.edu/~kunle/</a></p>

    <pre class="moz-signature" cols="72">-- 

David Collier-Brown,         | Always do right. This will gratify

System Programmer and Author | some people and astonish the rest

<a class="moz-txt-link-abbreviated" href="mailto:davecb@spamcop.net">davecb@spamcop.net</a>           |                      -- Mark Twain

</pre>

  </body>

</html>