Hello, been a while, dual CPU mobos

Lennart Sorensen lsorense-1wCw9BSqJbv44Nm34jS7GywD8/FfD2ys at public.gmane.org
Thu Sep 9 18:12:12 UTC 2004


On Thu, Sep 09, 2004 at 08:57:39PM -0400, Peter L. Peres wrote:
> Is it a secret ? ;-)

I despise the design of the P4.  I don't like inefficient designs.

The athlon 64 is what I want in my next machine.

> Question: if the pipeline length is so important then these cpus should 
> work better with 'flattened' code (no jumps, no loops). Do the speed tests 
> just so happen to be compiled flattened ? I.e. do the chip makers add 
> pipeline length to look better in tests or to give better all-round 
> performance. I suspect that code with very short runs between jumps and 
> loops (such as code compiled for size optimisation) will run relatively 
> slowly on such a cpu. True ?

Well the longer pipeline means less of the instruction is being
performed at each stage, so each stage is simpler and can hence be
executed faster.  This helps you to increase clock speed.

If your code has no branches, and doesn't have too much dependency on
previous results, your instructions can pretty much just flow through
with no problems (out of order execution helps a bit with the
dependencies on prior results by doing other non-dependent instructions
ahead of the ones waiting).  So in the ideal case, the pipeline length
doesn't hurt at all, and allows higher clock rates, so the cpu is
faster.

In the bad case, with lots of branches, or with instructions
continuously waiting for the prior instruction's result before being
able to start executing, the pipeline length becomes an issue.  If it
takes an instruction 31 stages to execute and you have to wait for most
of the previous instruction to finish before you can start your turn,
then you essentially end up dividing the clock speed by the pipeline
length to determine the number of instructions executed per second,
which in the worst case is terrible with a long pipeline.
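
As a rough back-of-the-envelope sketch of those two cases (a toy model,
not a simulator: the function names and the 3 GHz clock are made up for
illustration, the 31 stages are the figure used above, and it ignores
the fact that real cpus can finish more than one instruction per clock):

```python
# Toy model of pipeline throughput; all names and numbers are illustrative.

def ideal_ips(clock_hz):
    """Pipeline stays full: one instruction completes every clock."""
    return clock_hz

def worst_case_ips(clock_hz, stages):
    """Each instruction waits for the previous one to leave the pipeline,
    so only one instruction completes every `stages` clocks."""
    return clock_hz / stages

clock_hz = 3.0e9   # hypothetical 3 GHz clock, not a measured value
stages = 31        # pipeline length used in the text above

print(f"ideal:      {ideal_ips(clock_hz) / 1e6:.0f} M instr/s")
print(f"worst case: {worst_case_ips(clock_hz, stages) / 1e6:.0f} M instr/s")
```

Real code lands somewhere between those two numbers, depending on how
often the pipeline actually has to wait.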

Fortunately branch prediction often works by assuming that what happened
at a certain branch location last time is likely to be what happens
again this time, so the code from that branch can already go through the
pipeline.  Some even run both choices of a branch at the same time
using unused execution units of the cpu (while instructions are waiting
for prior instructions and such) and evaluate both choices, and simply
throw away the wrong branch when the branch to take is determined.
Smarter branch prediction that can detect patterns in when a branch is
taken and such is important to cpu design when the pipeline length is
increased.  Compiler optimizations such as loop unrolling can also help
by turning 1000 short loop iterations into 100 longer ones (saves 90% of
the branch checks and helps the branch prediction logic).

Basically lots can be done to help a cpu with a longer pipeline perform
well, but it depends on the software algorithms involved and on the
compiler optimizations.  This is why when the P4 came out, it didn't
look so good, but eventually software came out that had been optimized
for its pipeline length and instruction preferences, and all of a sudden
its performance got much better.  The athlon and athlon 64 with a
shorter pipeline deal much better with legacy code optimized for older
generations since they are more similar in behaviour to those older
chips.

The itanium (IA64) doesn't do any instruction reordering, and has 3
instructions for every 128-bit instruction word that goes into it, and
those 3 instructions are placed together by the compiler to run at the
same time.  A good compiler is required for the itanium to perform well,
but it made the design simpler.  It just requires the compiler to do all
the work to determine what can be done at once and what order everything
would work best in.  If a problem happens to be very dynamic and can't
be determined optimally at compile time, the itanium won't perform as
well.  Worst case for it would be a problem that is so linear that only
one instruction would actually be in each instruction word, leaving 2/3
of the cpu unused at all times running that code.
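
A very simplified sketch of that packing job, just to show how
dependencies decide how full each instruction word gets.  The function
names and the dependency lists are invented for the example, and it
ignores the real IA-64 rules about which kinds of instruction may share
a word; a real compiler would also reorder instructions, which this
in-order toy does not.

```python
# Toy model of packing instructions into 3-slot instruction words.
# A simplification: no instruction types, no reordering by the "compiler".

SLOTS = 3  # instructions per 128-bit instruction word, as described above

def pack(deps):
    """deps[i] is the set of earlier instructions that instruction i needs.
    Instructions are taken in program order; one joins the current word
    only if nothing it depends on is sitting in that same word."""
    bundles = [[]]
    for i, needs in enumerate(deps):
        current = bundles[-1]
        if len(current) == SLOTS or needs & set(current):
            current = []              # start a new instruction word
            bundles.append(current)
        current.append(i)
    return bundles

def utilization(bundles):
    """Fraction of available slots that actually hold an instruction."""
    used = sum(len(b) for b in bundles)
    return used / (SLOTS * len(bundles))

independent = [set() for _ in range(6)]            # nothing depends on anything
chain = [set()] + [{i - 1} for i in range(1, 6)]   # each needs the previous one

print(utilization(pack(independent)))
print(utilization(pack(chain)))
```

The independent instructions fill every slot, while the fully dependent
chain ends up with one instruction per word.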

Hmm, that may have been wayyy too long, and I think my kernel compile
just finished.

Lennart Sorensen
--
The Toronto Linux Users Group.      Meetings: http://tlug.ss.org
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://tlug.ss.org/subscribe.shtml




