Hello, been a while, dual CPU mobos

Thu Sep 9 18:28:32 UTC 2004

On Thu, 2004-09-09 at 14:12, Lennart Sorensen wrote:
> On Thu, Sep 09, 2004 at 08:57:39PM -0400, Peter L. Peres wrote:
> > Is it a secret ? ;-)
> 
> I despise the design of the P4.  I don't like inefficient designs.
> 
> The athlon 64 is what I want in my next machine.
> 
> > Question: if the pipeline length is so important then these cpus should
> > work better with 'flattened' code (no jumps, no loops). Do the speed tests
> > just so happen to be compiled flattened? I.e. do the chip makers add
> > pipeline length to look better in tests or to give better all-round
> > performance? I suspect that code with very short runs between jumps and
> > loops (such as code compiled for size optimisation) will run relatively
> > slowly on such a cpu. True?
> 
> Well the longer pipeline means less of the instruction is being
> performed at each stage, so each stage is simpler and can hence be
> executed faster.  This helps you to increase clock speed.
> 
> If your code has no branches, and doesn't have too much dependency on
> previous results, your instructions can pretty much just flow through
> with no problems (out of order execution helps a bit with the
> dependencies on prior results by running other, non-dependent
> instructions ahead of the ones that are waiting).  So in the ideal case,
> the pipeline length doesn't hurt at all, and allows higher clock rates,
> so the cpu is faster.
> 
> In the bad case, with lots of branches, or with instructions
> continuously waiting for the prior instruction's result before being
> able to start executing, the pipeline length becomes an issue.  If it
> takes an instruction 31 stages to execute and you have to wait for most
> of the previous instruction to finish before you can start your turn,
> then you essentially end up dividing the clock speed by the pipeline
> length to determine the number of instructions executed per second,
> which in the worst case is terrible with a long pipeline.
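
A rough sketch of that arithmetic in Python, for anyone who wants numbers.
The 3 GHz clock and the function names are assumptions made up for this
illustration, and the two rates are bounds from a toy model, not
measurements of any real chip.

```python
# Toy model only: real cpus overlap work and forward results between
# stages, so these two rates are bounds, not predictions.

def ideal_rate(clock_hz):
    """Pipeline always full: one instruction completes per clock."""
    return clock_hz

def serialized_rate(clock_hz, depth):
    """Every instruction waits for the previous one to drain completely,
    so one instruction completes every `depth` clocks."""
    return clock_hz / depth

CLOCK_HZ = 3.0e9  # assumed 3 GHz clock, for illustration only

for depth in (10, 20, 31):
    best = ideal_rate(CLOCK_HZ)
    worst = serialized_rate(CLOCK_HZ, depth)
    print(f"{depth:2d} stages: best {best / 1e6:6.0f} M instr/s, "
          f"worst {worst / 1e6:6.0f} M instr/s")
```

With these assumed numbers the fully serialized 31-stage case works out to
roughly 97 million instructions per second, against 3000 million when the
pipeline stays full.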
> 
> Fortunately branch prediction often works by assuming that what happened
> at a certain branch location last time is likely to be what happens
> again this time, so the code from that branch can already go through the
> pipeline.  Some designs even run both sides of a branch at the same time
> using otherwise unused execution units of the cpu (while instructions
> are waiting for prior instructions and such), and simply throw away the
> wrong side once the branch to take is determined.  Smarter branch
> predictors that can detect patterns in when a branch is taken become
> more and more important to cpu design as the pipeline length is
> increased.  Compiler optimizations such as loop unrolling can also help,
> by turning 1000 passes through a short loop into 100 passes through a
> longer one (which saves 90% of the branch checks and helps the branch
> prediction logic).
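
The "same as last time" predictor and the unrolling arithmetic can be
sketched in a few lines of Python. This is a simplified model with
hypothetical function names, not how any particular cpu implements
prediction: it tracks a single branch and counts wrong guesses.

```python
def loop_branch_outcomes(iterations):
    """Outcomes of a loop-closing branch: taken every pass except the last."""
    return [True] * (iterations - 1) + [False]

def last_outcome_mispredicts(outcomes, first_guess=True):
    """Guess that each branch goes the way it went last time.
    Returns the number of wrong guesses."""
    guess = first_guess
    misses = 0
    for taken in outcomes:
        if taken != guess:
            misses += 1
        guess = taken  # remember what actually happened
    return misses

def branch_checks(total_work, unroll):
    """Loop-closing branch checks when each pass does `unroll` units of work."""
    assert total_work % unroll == 0
    return total_work // unroll

# A plain counted loop is easy: only the final exit is guessed wrong.
print(last_outcome_mispredicts(loop_branch_outcomes(1000)))

# A strictly alternating branch defeats "same as last time" almost always,
# which is where pattern-detecting predictors earn their keep.
print(last_outcome_mispredicts([True, False] * 500))

# Unrolling by 10 turns 1000 branch checks into 100.
rolled = branch_checks(1000, 1)
unrolled = branch_checks(1000, 10)
print(rolled, unrolled, 1 - unrolled / rolled)
```

The alternating case is the interesting one: a predictor that only
remembers the last outcome is wrong on nearly every pass, while one that
tracks a short history of outcomes could learn the pattern.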
> 
> Basically lots can be done to help a cpu with a longer pipeline perform
> well, but it depends on the software algorithms involved and on the
> compiler optimizations.  This is why when the P4 came out, it didn't
> look so good, but eventually software came out that had been optimized
> for its pipeline length and instruction preferences, and all of a sudden
> its performance got much better.  The athlon and athlon 64, with their
> shorter pipelines, deal much better with legacy code optimized for older
> generations since they are more similar in behaviour to those older
> chips.
> 
> The itanium (IA64) doesn't do any instruction reordering, and packs 3
> instructions into every 128-bit instruction word that goes into it; the
> compiler places those 3 instructions together to run at the same time.
> A good compiler is required for the itanium to perform well, but this
> made the design simpler.  It just requires the compiler to do all the
> work to determine what can be done at once and what order everything
> would work best in.  If a problem happens to be very dynamic and can't
> be scheduled optimally at compile time, the itanium won't perform as
> well.  Worst case for it would be a problem that is so linear that only
> one instruction would actually be in each instruction word, leaving 2/3
> of the cpu unused at all times while running that code.
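
To make the slot-filling point concrete, here is a toy packer in Python.
It follows the simplified picture in the text (three slots per word, and
an instruction that needs the previous result cannot share a word with
it); the names and the greedy strategy are assumptions for illustration,
not the real IA64 bundling rules or any real compiler's scheduler.

```python
SLOTS_PER_BUNDLE = 3  # three instruction slots per 128-bit word, per the text

def pack(depends_on_prev):
    """Greedy in-order packing of instructions into instruction words.

    depends_on_prev[i] is True when instruction i needs the result of
    instruction i-1 and so, in this toy model, must start a new word.
    Returns the number of instructions placed in each word.
    """
    bundles = []
    for dep in depends_on_prev:
        if bundles and not dep and bundles[-1] < SLOTS_PER_BUNDLE:
            bundles[-1] += 1    # room left and no dependency: share the word
        else:
            bundles.append(1)   # start a new word
    return bundles

def utilization(bundles):
    """Fraction of issue slots that hold a real instruction."""
    return sum(bundles) / (len(bundles) * SLOTS_PER_BUNDLE)

independent = [False] * 12          # nothing depends on its neighbour
chain = [False] + [True] * 11       # each instruction needs the previous one

print(pack(independent), utilization(independent and pack(independent)))
print(pack(chain), utilization(pack(chain)))
```

In this model twelve independent instructions fill four words completely,
while a twelve-instruction dependency chain needs twelve words with one
slot in three doing useful work.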
> 
> Hmm, that may have been wayyy too long, and I think my kernel compile
> just finished.
> 
> Lennart Sorensen


I was going to pipe in with a defence of the P4 design but after
Lennart's email I now realize that I know nothing about computers and
that my only possible rebuttal is...."My cat's breath smells like cat
food"....I will now proceed to go home and throw my brand new Pentium
machine in the garbage :)

Later


-- 
Devin Whalen
Programmer
Synaptic Vision Inc
Phone-(416) 539-0801
Fax- (416) 539-8280
1179A King St. West
Toronto, Ontario
Suite 309 M6K 3C5
Home-(416) 653-3982