Fixed a very hard bug at work today.

Jan 21, 2010

In my work, I’m a computer programmer. Lately, I work mainly on a couple of linux device drivers. Today, with the help of a colleague, a very hard to find bug got fixed. It was pretty interesting in retrospect, but frustrating, right up until we found it.

So the driver we were working on controls some hardware, which shall remain nameless. The way it works, in very general terms, is commands come into the driver, are sort of reformatted, are sent down to the nameless hardware, which processes them, and then every so often this hardware dumps a bunch of information into a chunk of host memory, and generates an interrupt. The driver gets the interrupt (which appears as a function call) and processes the information that was dumped into host memory. The information that gets dumped into hsot memory is a list of which commands the hardware is finished processing, essentially.

There are some details which need to be understood. When a command is “sent down” to the hardware, what’s really sent down is a “bus address” of a block of host memory which contains the details of the command. This bus address is jammed into a memory mapped register on the nameless hardware. The hardware takes this bus address, and uses DMA to suck the details of the command into it’s own memory for processing. One of the things inside the command details is a “tag.” This tag uniquely identifies the commands. When the hardware periodically dumps its information into host memory, what it’s dumping is a bunch of these tags.

The symptoms of the bug that we were seeing were as follows:

  1. On one particular set of hardware (several different instances), things seemed to work pretty well. Oddly, this was the newest, least likely to work hardware.
  2. On another set of hardware (several different instances) things didn’t work so well. Oddly, this was older, proven hardware.
  3. In the failure cases, we’d occasionally see what appeared to be bogus tags dumped into the host memory instead of the tags we were expecting. Sometimes these tags looked suspiciously like kernel virtual addresses, which were not what was expected at all.
  4. On the failing system, it seemed that 80 to 90 percent of the time, commands worked, but 10 to 20 percent of the time, we’d get these bogus tags instead. That tags always appeared just where we’d expect the hardware to be writing new tags.

The fact that the bogus tags looked a lot like kernel virtual addresses, and that these addresses could actually be dereferenced, and dumped out, combined with the fact that the failures occurred on known, good, proven hardware made me think that it had to be some stray write coming from the driver, accidentally writing into this area of host memory. It pretty much had to be a write coming from the kernel, it couldn’t be coming from the hardware. How could the hardware come up with a valid kernel virtual address? Not just any valid kernel address, but one obviously within the data space of the driver. Although dumping out what was at that address did not reveal anything obvious about exactly what it was. All of this combined, well, it didn’t make sense. Made you go, “hmmm….”

I spent quite a lot of time trying to come up with ways of trapping these suspected accidental writes. I investigated mmiotrace, but this relies on the use of ioremap(), and we weren’t using that. (we were using pci_alloc_consistent instead.) I tried changing the driver to superflously ioremap this region, just so I could use mmiotrace on it, but as soon as I activated mmiotrace, the system hung. I tried using crash, but gdb (via crash) does not allow watch points to be set on a running kernel. I tried setting up kgdb and debugging remotely, but ran into some roadblocks there as well. All in all, extremely frustrating.

My colleague ended up spotting the problem.

Had I thought of what he thought of, I likely would have dismissed it, as I’d actually made this mistake before, and previously, the symptom was a kernel panic, not the mess we were seeing now. I didn’t even think of it though.

The problem was that we were not aligning the commands in the way that the hardware expected. Consequently, when we stuffed our bus address into the register on the hardware, it zeroed out some low order bits, and sucked down the command details from a slightly lower address than the one we had set up the command in. The hardware interpreted the data as a command, — a garbled command, to be sure — and the only thing it could do was post the “tag”, which was whatever data was where it expected the tag to be in the supposed command that it sucked down from the “wrong” place, with a couple low order bits set to indicate problems. So, that’s where the mysterious kernel virtual addresses that we were seeing came from.

But, how come most commands worked? And how come this worked on the new hardware, but not the old hardware? Well, most commands worked because we allocated an entire block of commands at once with pci_alloc_consistent, which returns a page-aligned block of memory. It so happened that the first command in this block would be page aligned, and many times, this first block would get reused over and over, and the alignment so happened to work out that every fourth command or so was naturally aligned. And it worked on the new hardware, because it just so happened that all the new hardware happened to be installed in 64 bit systems, and on those, the size of a pointer is 64 bits, and it just so happened that the command structure worked out to be perfectly sized so that all the commands were properly aligned. Not so on 32 bit systems, which just happened to be what all the old hardware was installed in. So it wasn’t a distinction between old and new hardware, as it appeared, it was a distinction between 32 bit and 64 bit systems.

All the symptoms pointed to a driver bug, in which the driver was writing somehow to this area of memory. But the write was actually coming from the hardware, even though the hardware was just fine! Ultimately the clue of the mysterious kernel virtual address appearing amongst the tags was indicative of a driver bug, just not the straightforward kind of driver bug you might first think of.

We actually did have a BUILD_BUG_ON in the code that is designed to catch precisely this problem at build time, but somehow in a patch, the update to this safety check got dropped, (and we verified older versions of the patch contained the correct check) and so the safety check was incorrect, and it slipped through, causing me to spend the last three and a half days at work staring a the code trying to figure out how in the heck the tags were getting corrupted, thinking blasphemous, forbidden thoughts, like, “maybe it’s a compiler bug.”

Feels good to be rid of that bug, as it was one of those bugs that put me in the position of not having enough ideas to try. It’s not fun staring at data that seems impossible, and racking your brains to try to come up with some kind of experiment that can pry a new clue out of the system and coming up with nothing for hours at a time. And when I finally did come up with experiments, I was thwarted at every turn, by mmiotrace, by kgdb, by crash. And, even if I hadn’t been thwarted, it wouldn’t have worked, as ultimately, the writes were coming from the hardware, and my attempts to trap the non-existent stray host-writes would have come up empty.

But, then, sometimes, I read about the biologists and chemists that have figured out so much about how the chemistry of life works over the years, and I realize that my days or hours of staring at impenetrable data, searching for an explanation is nothing compared to what real scientists face every day.

Even so, it feels pretty good to be rid of that bug.

Anyway, if you ever wondered what it’s like to work on linux device drivers, what’s above represents one of the tougher things I’ve run up against lately.

~ by scaryreasoner on January 22, 2010.

2 Responses to “Fixed a very hard bug at work today.”

  1. Nice blog, i like it, its informative,
    i will visit his blog more often.
    i like your article specially about
    Fixed a very hard bug at work today.


  2. I like this blog too, but I didn’t understand a word of this post. It’s painful to be so ignorant of this stuff.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: