Linux Kernel Network Subsystem Patching

Lennart Sorensen lsorense-1wCw9BSqJbv44Nm34jS7GywD8/FfD2ys at public.gmane.org
Wed Jan 22 14:44:42 UTC 2014


On Wed, Jan 22, 2014 at 01:41:27AM -0500, D. Hugh Redelmeier wrote:
> It all depends on what your computing bottleneck is.
> 
> For example, hyperthreading can be quite good at hiding memory
> latency: while one hyperthread is waiting for a memory fetch, another
> hyperthread can be using the CPU's resources.
> 
> This was used to great effect in the Sun Niagara.
> 
> If your bottleneck is the Floating Point Unit, and that unit is shared
> between hyperthreads, there is no improvement.
> 
> If all threads are competing for memory bandwidth, it is quite
> possible that hyperthreading could make the memory access patterns
> worse.

Exactly.  There are cases you can think of where it helps, and cases
where it hurts.  So in theory it could do either, and in practice it
does both, depending on the workload.
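
To make that concrete, here is a rough sketch of the two extremes
(my own toy code, nothing I have actually measured on real hardware):
a pointer-chasing loop that is almost pure memory latency, where a
sibling hyperthread should find plenty of idle execution resources,
and a dependent floating point loop, where a sibling thread mostly
just competes for the shared FP unit:

  /* Toy microbenchmark sketch, not from the thread: two kernels you
   * could pin to sibling hyperthreads to see which case you are in.
   * chase() is dominated by memory latency; fpu() is dominated by the
   * (shared) floating point unit. */
  #include <stdio.h>
  #include <stdlib.h>

  #define N (1 << 22)

  /* Memory-latency bound: dependent loads through a shuffled chain. */
  static size_t chase(const size_t *next, size_t iters)
  {
      size_t i = 0;
      while (iters--)
          i = next[i];          /* each load waits on the previous one */
      return i;
  }

  /* FPU bound: a long dependent chain of multiplies, no memory traffic. */
  static double fpu(size_t iters)
  {
      double x = 1.0000001;
      while (iters--)
          x = x * 1.0000001;    /* keeps the FP unit busy every cycle */
      return x;
  }

  int main(void)
  {
      size_t *next = malloc(N * sizeof *next);
      if (!next)
          return 1;
      /* Sattolo shuffle: one big cycle, so the chase defeats caching
       * and prefetching. */
      for (size_t i = 0; i < N; i++)
          next[i] = i;
      for (size_t i = N - 1; i > 0; i--) {
          size_t j = (size_t)rand() % i;
          size_t t = next[i]; next[i] = next[j]; next[j] = t;
      }
      printf("%zu %f\n", chase(next, N), fpu(N));
      free(next);
      return 0;
  }

Pin one of each to the two siblings of a core, then try them on
separate cores, and you get a feel for which case a real workload is
closer to.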

> I know that the Pentium 4 hyperthreading often made performance worse.
> I never heard why.  It could have been badly implemented.  This claims
> to be an answer: <https://en.wikipedia.org/wiki/Replay_system>.  That
> seems odd since it claims that the problem (uselessly tied up
> execution resources) is in just the case that hyperthreading should be
> best (if the resources hadn't been tied up).

The instruction decoder is too small on the P4.  There is only one
(shared between the two threads), and it can only decode one x86
instruction into micro-ops per clock cycle.  So unless your code is
already decoded in the L1 trace cache, you can't start more than one
instruction per clock cycle for the core, which means only one of the
two threads can start anything on that clock cycle.  As far as I
recall, up to 3 micro-ops from the trace cache can be started in one
clock cycle.  There are 3 integer units, 2 floating point units and
2 memory units in total.  Even before Intel added hyperthreading, the
instruction decoder was a bottleneck on a lot of code.  Only code with
fairly small tight loops that could run out of the L1 trace cache would
really have a chance to get full speed, and even then only if the
branch predictor didn't make a mistake (given the cost of the pipeline
flush on the P4's extremely long pipeline, which was between 20 and 31
stages depending on the generation of the P4).  It was a very good
branch predictor, but it could still be wrong.
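
To see how much a wrong prediction can matter on a deep pipeline, here
is a quick sketch (again just my own toy code with made-up thresholds,
not a proper benchmark): the same loop, once over shuffled data where
the branch is unpredictable, and once over sorted data where it is
predicted almost perfectly.

  /* Predictable vs. unpredictable branch; on a P4-class core each
   * mispredict throws away roughly a pipeline's worth of work. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  #define N (1 << 20)

  static long count_big(const int *a, size_t n)
  {
      long sum = 0;
      for (size_t i = 0; i < n; i++)
          if (a[i] >= 128)      /* the branch the predictor must guess */
              sum += a[i];
      return sum;
  }

  static int cmp(const void *p, const void *q)
  {
      return *(const int *)p - *(const int *)q;
  }

  int main(void)
  {
      int *a = malloc(N * sizeof *a);
      if (!a)
          return 1;
      for (size_t i = 0; i < N; i++)
          a[i] = rand() % 256;

      clock_t t0 = clock();
      long shuffled = count_big(a, N);  /* unpredictable branch */
      clock_t t1 = clock();

      qsort(a, N, sizeof *a, cmp);
      clock_t t2 = clock();
      long sorted = count_big(a, N);    /* predictable branch */
      clock_t t3 = clock();

      printf("shuffled: %ld in %ld ticks\n", shuffled, (long)(t1 - t0));
      printf("sorted:   %ld in %ld ticks\n", sorted, (long)(t3 - t2));
      free(a);
      return 0;
  }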

> (Reading more Wikipedia.  They point out that the generic (non-Intel)
> name is <https://en.wikipedia.org/wiki/Simultaneous_multithreading>.
> They also claim Niagara isn't SMT, but I don't see an important
> difference.)

Yes, the Niagara (which I have never cared to look into very much)
seems to essentially do a round robin between threads that have
instructions to run, which can certainly help hide memory stalls and
such.  So essentially it ends up doing very fast switching between
threads, without the overhead of a context switch, because it does
have the duplicated register sets of an SMT design.  But where SMT can
issue instructions from multiple threads at once, the Niagara can only
issue one instruction each clock and simply takes turns between the
threads.  Nice and simple, and for some workloads it would very likely
work quite well.
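
Roughly like this toy model of the scheme (my own sketch with made-up
thread counts and miss rates, not anything from Sun's documentation):
each cycle the core issues a single instruction from the next thread
that isn't stalled on memory, so one thread's stalls are hidden by
work from the others.

  #include <stdio.h>

  #define NTHREADS 4
  #define MEM_STALL 20          /* assumed miss latency, in cycles */

  struct hwthread {
      int stall;                /* cycles left before it can issue again */
      long issued;              /* instructions issued so far */
  };

  int main(void)
  {
      struct hwthread t[NTHREADS] = { 0 };
      long cycles = 1000, idle = 0;
      int rr = 0;

      for (long c = 0; c < cycles; c++) {
          int issued = 0;
          for (int k = 0; k < NTHREADS; k++) {
              int i = (rr + k) % NTHREADS;
              if (t[i].stall == 0) {
                  t[i].issued++;
                  /* pretend every 5th instruction misses the cache */
                  if (t[i].issued % 5 == 0)
                      t[i].stall = MEM_STALL;
                  rr = (i + 1) % NTHREADS;  /* take turns */
                  issued = 1;
                  break;
              }
          }
          if (!issued)
              idle++;
          for (int i = 0; i < NTHREADS; i++)
              if (t[i].stall > 0)
                  t[i].stall--;
      }

      for (int i = 0; i < NTHREADS; i++)
          printf("thread %d issued %ld\n", i, t[i].issued);
      printf("idle cycles: %ld of %ld\n", idle, cycles);
      return 0;
  }

With one thread the core would sit idle for most of every miss; with a
few threads to rotate through, far more of the cycles do useful work,
which is the whole point of the design.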

-- 
Len Sorensen
--
The Toronto Linux Users Group.      Meetings: http://gtalug.org/
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://gtalug.org/wiki/Mailing_lists




