RHEL kernel patch backport [was Re: Are you running Linux as your desktop?]

Mon Nov 15 19:25:18 UTC 2010

| From: Lennart Sorensen <lsorense-1wCw9BSqJbv44Nm34jS7GywD8/FfD2ys at public.gmane.org>

| On Sat, Nov 13, 2010 at 04:36:37PM -0500, D. Hugh Redelmeier wrote:

| > Linux fans (like me) don't always add up the time we spend to do
| > things the better way.
| > 
| > (Example from today: RHEL 5 / CentOS 5 introduced a kernel bug a year
| > and a half ago that causes one of my boxes to fail to lower the CPU
| > speed when idle.  I'm the only one reporting the bug.  The kernel.org
| > kernels don't have this bug -- it's from a backport to RHEL of half a
| > change.  It isn't going to get fixed (note: RH was willing to fix it
| > but I suggested that the chances of breaking something outweighed the
| > inconvenience to me (I'm not even a customer)). So whenever there is a
| > CentOS kernel update, I have to build a fixed kernel and install it.
| > Today was such a day.)
| 
| I would say you were wrong.  The chances half a change backported breaks
| something is rather high.  It broke your machine after all.  They should
| very much have fixed it.

The story is long and probably boring to most folks.  But sometimes
one can learn from others' travails, so here's a summary.

My original summary "half a change" isn't accurate enough.  Even this new
summary won't be 100% accurate since I'm basing it on my decaying memory.

My CentOS box is an old first-generation HP AMD64 box.
<http://h10025.www1.hp.com/ewfrf/wc/softwareCategory?lc=en&dlc=en&cc=ca&os=228&product=404646>

Optional boring details about my computer; skip if you feel like it:

    I bought it in 2004 or 2005 from the precursor of TechSource,
    debranded and refurbished
    <http://www.techsourcecanada.ca/store/flyer.html>.  It was probably
    a customer return from a US retail store, sold off in bulk.
    "Debranded" meant that the HP branding was covered up or removed
    and no software was included, not even Windows.  An awesomely
    inexpensive way for me to get into the AMD64 world.  The
    motherboard was made for HP by Asus.  It turns out to be a great
    box: quiet and reliable.  I use it as a server now, hence CentOS.

The ACPI system allows the BIOS to export functionality to any willing
OS.  This is a great idea: the original way of exporting
functionality, code entry-points accessed through INT instructions,
required that your CPU be in the stupidest mode (i.e. 16-bit mode with
no memory mapping).  The way ACPI does this is to use a specified
pseudo-machine code and have each OS include an interpreter for this
machine code.  Intel even provides and maintains tools for this ACPI
machine code that work in Linux and Windows (assemblers,
disassemblers, and interpreters).

It is up to the machine maker to write/customize/maintain the ACPI
code itself that is embedded in the BIOS.  Sadly, like all BIOS
functionality, most manufacturers whack on the code until MS Windows
seems to work and then never look back.  It turns out that that leaves
several problems for Linux machines.

Linux ACPI support has to deal with broken ACPI code: otherwise the
machines won't be supported.

My machine's ACPI has a kind of breakage (I think).  There are two
ways of determining the number of entries in the table that describes
the available power states.  By one method, all is well.  By
another, there are bad entries at the end that must be ignored.  No
problem: the Linux kernel's AMD64 ACPI code ignores bad entries (at
least the kind I have).

Apparently the Linux kernel's Intel 64 ACPI code does not ignore bad
entries.  At some point, some Xeon BIOSes were produced with such bad
entries.  The Linux kernel hung on those machines.  As a fix, the
Linux kernel folks put sanity checks in for these table entries,
common to AMD64 and Intel 64, upstream of where the control diverges.
This sanity check says: if any entry is bad, consider the whole table
to be bad.
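
A toy Python model of that all-or-nothing check might look like the
following.  It is only a sketch, not the kernel's actual C code: the
PState fields and the "zero frequency means bad" rule are assumptions
made up for illustration.

```python
from typing import List, NamedTuple, Optional


class PState(NamedTuple):
    """One power-state table entry (hypothetical fields)."""
    freq_mhz: int
    power_mw: int


def entry_is_sane(entry: PState) -> bool:
    # Assumed validity rule: an entry with no frequency is garbage.
    return entry.freq_mhz > 0


def check_table(entries: List[PState]) -> Optional[List[PState]]:
    """Return the table if every entry is sane; otherwise reject it all."""
    if all(entry_is_sane(e) for e in entries):
        return list(entries)
    return None  # one bad entry condemns the whole table
```

Under this model, a table with good entries followed by a single
zeroed entry is rejected outright, which leaves the OS with no usable
power states at all.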

With the table ignored, Linux would no longer run my server at less
than full clock rate.  More precisely, Linux didn't know how to change
the clock rate, so it left it at the initial speed: full.
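
One way to see this kind of symptom from user space is to look at the
cpufreq files in sysfs.  The sketch below assumes the usual layout
under /sys/devices/system/cpu/cpu0/cpufreq, and assumes that the
directory is simply absent when no frequency-scaling driver could be
set up; cpufreq_status and read_khz are illustrative helper names.

```python
import os


def read_khz(path):
    """Read a single integer (a frequency in kHz) from a sysfs-style file."""
    with open(path) as f:
        return int(f.read().strip())


def cpufreq_status(cpu_dir="/sys/devices/system/cpu/cpu0"):
    """Return (current_khz, max_khz), or None if there is no cpufreq dir."""
    freq_dir = os.path.join(cpu_dir, "cpufreq")
    if not os.path.isdir(freq_dir):
        return None  # no scaling driver active: the CPU stays at one speed
    current = read_khz(os.path.join(freq_dir, "scaling_cur_freq"))
    maximum = read_khz(os.path.join(freq_dir, "cpuinfo_max_freq"))
    return (current, maximum)
```

On an idle machine, a result of None, or a current value that never
drops below the maximum, would be consistent with the problem
described here.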

In the kernel.org kernel, a second change was made.  Before the sanity
check was done, the length of the table was calculated to be the
lesser of the lengths yielded by the two ways of determining table
size.  Since the bad entries on my machine are beyond one of these
lengths, my machine would operate as expected if this second change
were included.
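
The effect of trimming before checking can be sketched in Python as
follows.  This is a toy model, not the kernel code: it assumes the two
lengths are the number of entries physically present and a separately
declared count, and it treats a zero frequency as a bad entry.  The
name usable_entries is made up for the example.

```python
def usable_entries(freqs_mhz, declared_count):
    """Trim the table to the lesser of its two lengths, then sanity-check.

    freqs_mhz      -- every entry physically present (first length)
    declared_count -- the count given by the other method (second length)
    """
    n = min(len(freqs_mhz), declared_count)
    trimmed = freqs_mhz[:n]
    # Same all-or-nothing rule, now applied only to the trimmed table.
    if all(freq > 0 for freq in trimmed):
        return trimmed
    return None


# Bad entries beyond the shorter length are never examined:
print(usable_entries([2000, 1800, 1000, 0, 0], 3))
# A bad entry inside the shorter length still rejects the table:
print(usable_entries([2000, 0, 1000], 3))
```

With only the sanity check and no trimming, the first table would be
rejected because of its trailing zero entries.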

Unfortunately, RHEL only backported the first change.  So my computer
does not get properly throttled on RHEL or CentOS.

This showed up about a year and a half ago -- 4+ years into the life
of the computer.  Probably few of them are still in service running
RHEL/CentOS.

Googling found me no other reports of this problem.

Figuring this out took me a long time.  Convincing others took me a
long time.  I eventually reported it to the kernel.org kernel bugzilla only
to have the experts come back and point out that it couldn't happen in
that kernel (due to the second change).  I then took that report back
to Red Hat and that was enough to get them to finally see that there
was a problem.

Reporting this to CentOS was worse than useless.  They will not
diverge from RHEL (a good thing).  But reporting and discussing did
take my time (and theirs).  Red Hat seems fairly open to reports of
bugs from CentOS users.  Wow.

My CentOS bug report:
<https://www.centos.org/modules/newbb/viewtopic.php?viewmode=flat&topic_id=22341&forum=44>

RHEL bz that culminated in the fix that broke my system:
<https://bugzilla.redhat.com/show_bug.cgi?id=500311>

My RHEL bz entry:
<https://bugzilla.redhat.com/show_bug.cgi?id=559357>

My kernel.org bz entry.  Oops.
<https://bugzilla.kernel.org/show_bug.cgi?id=15174>
Note: Zhang Rui and Bob Moore are at Intel and are kernel
developers.

Should Red Hat backport the second change?

- kernel patch backporting can be dangerous.  Heck, my problem is an
  example of that.  Skilled/experienced kernel folks are already busy
  so it might fall onto an inexperienced person.

- the current situation is only known to hurt one non-customer who
  knows a work-around.  True, others may experience this, but where's
  the evidence?
--
The Toronto Linux Users Group.      Meetings: http://gtalug.org/
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://gtalug.org/wiki/Mailing_lists