The Life of I/O in the linux kernel

September 14, 2013 — The Life of I/O in the linux kernel

Yesterday I needed to trace the path that i/o takes through the linux kernel: entering the read(2) system call, going down through the block layer, the SCSI layer and the low level driver to the hardware, then back from the hardware into the low level driver’s interrupt handler, back up through the SCSI layer and block layer, and finally back out to userland. This is something I’ve tried a few times before, but this time I did a more thorough job of it and wrote down a lot of what I found. It’s still by no means a complete explanation, but it’s the closest to one that I’ve been able to make so far. So I figured I’d document it here, in case anyone finds it useful, interesting or amusing.

It is a long, twisty, confusing trip. “One does not simply walk into Mordor.” One first has to spend a few years in an apprenticeship at the Ministry of Complicated Walks.

I made it through alive, though I may have taken a wrong turn or two here and there, and there are still some hand-wavy bits around certain function pointers. Still, it was a worthwhile trip.

So unpack your 3.11 kernel source and put on your complicated shoes and let’s go.

Assume we’re starting with a read(2) system call on a file descriptor opened with O_DIRECT.

Via glibc, read(2) will eventually do something like:

	syscall(SYS_read, fd, buf, len);
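
For anyone who wants to poke at this entry point from userland, here is a minimal sketch of an O_DIRECT read issued through syscall(SYS_read, ...). The helper name read_start_direct(), the 4096-byte alignment, and the fallback to a buffered read when the filesystem refuses O_DIRECT are all my own assumptions for illustration, not anything the kernel or glibc prescribes.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes O_DIRECT in <fcntl.h> on glibc */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0              /* assumption: without it, do a plain buffered read */
#endif

#define DEMO_ALIGN 4096         /* assumed to satisfy the device's alignment rules */

/* Read the start of `path` with O_DIRECT; copy at most `len` bytes to `out`. */
ssize_t read_start_direct(const char *path, void *out, size_t len)
{
        void *buf = NULL;
        long n = -1;
        int flags = O_RDONLY | O_DIRECT;
        int attempt;

        /* O_DIRECT wants an aligned buffer and an aligned length. */
        if (posix_memalign(&buf, DEMO_ALIGN, DEMO_ALIGN) != 0)
                return -1;

        for (attempt = 0; attempt < 2; attempt++) {
                int saved_errno;
                int fd = open(path, flags);

                if (fd >= 0)
                        n = syscall(SYS_read, fd, buf, (size_t)DEMO_ALIGN);
                saved_errno = errno;
                if (fd >= 0)
                        close(fd);
                if (n >= 0 || saved_errno != EINVAL)
                        break;
                flags = O_RDONLY;   /* O_DIRECT refused here: retry buffered */
        }

        if (n > 0) {
                if ((size_t)n > len)
                        n = (long)len;
                memcpy(out, buf, (size_t)n);
        }
        free(buf);
        return (ssize_t)n;
}
```

Calling read_start_direct("/some/file", out, sizeof(out)) returns the number of bytes copied, or -1 on error. The aligned buffer matters because O_DIRECT transfers go straight between the device and user memory.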

This eventually enters the kernel in fs/read_write.c:

        SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)

This does:

        ret = vfs_read(f.file, buf, count, &pos);

vfs_read(), also in fs/read_write.c, ends up doing this:

         ret = file->f_op->read(file, buf, count, pos);

or possibly this:

          ret = do_sync_read(file, buf, count, pos);

Where that f_op->read() function pointer leads depends on the filesystem.

On an ext3 filesystem (from fs/ext3/file.c):

const struct file_operations ext3_file_operations = {
        .llseek         = generic_file_llseek,
        .read           = do_sync_read, <------------------- here
        .write          = do_sync_write,
        .aio_read       = generic_file_aio_read,
        .aio_write      = generic_file_aio_write,

On xfs (from fs/xfs/xfs_file.c):

const struct file_operations xfs_file_operations = {
        .llseek         = xfs_file_llseek,
        .read           = do_sync_read,<------------------- here
        .write          = do_sync_write,
        .aio_read       = xfs_file_aio_read,
        .aio_write      = xfs_file_aio_write,

So, it tends to go to do_sync_read()
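
To make the function pointer hopping a bit more concrete, here is a toy userland model of the dispatch described above: a read() entry point that calls f_op->read if it is set and otherwise falls back to a synchronous wrapper around aio_read. Every name with a toy_ prefix is made up for this sketch, and the "file" is just an in-memory string. It only mirrors the shape of the kernel code, not its details.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

struct toy_file;

/* Hypothetical, cut-down analogue of struct file_operations. */
struct toy_file_operations {
        ssize_t (*read)(struct toy_file *f, char *buf, size_t len, long *pos);
        ssize_t (*aio_read)(struct toy_file *f, char *buf, size_t len, long *pos);
};

struct toy_file {
        const struct toy_file_operations *f_op;
        const char *data;       /* stands in for the file contents */
        size_t size;
};

/* Plays the role of generic_file_aio_read(): does the actual copying. */
static ssize_t toy_generic_aio_read(struct toy_file *f, char *buf,
                                    size_t len, long *pos)
{
        size_t off = (size_t)*pos;
        size_t avail = off < f->size ? f->size - off : 0;

        if (len > avail)
                len = avail;
        memcpy(buf, f->data + off, len);
        *pos += (long)len;
        return (ssize_t)len;
}

/* Plays the role of do_sync_read(): a synchronous wrapper around aio_read. */
static ssize_t toy_do_sync_read(struct toy_file *f, char *buf,
                                size_t len, long *pos)
{
        return f->f_op->aio_read(f, buf, len, pos);
}

/* Plays the role of vfs_read(): use ->read if set, else the sync wrapper. */
ssize_t toy_vfs_read(struct toy_file *f, char *buf, size_t len, long *pos)
{
        if (f->f_op->read)
                return f->f_op->read(f, buf, len, pos);
        return toy_do_sync_read(f, buf, len, pos);
}

/* Same wiring as the ext3 table: .read goes to the sync wrapper. */
const struct toy_file_operations toy_ext3_like_ops = {
        .read     = toy_do_sync_read,
        .aio_read = toy_generic_aio_read,
};
```

A filesystem that wanted different behavior would only swap the .aio_read entry, which is the same trick the xfs table plays with xfs_file_aio_read.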

do_sync_read(), in fs/read_write.c, does this:

ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
{
        struct iovec iov = { .iov_base = buf, .iov_len = len };
        struct kiocb kiocb;
        ssize_t ret;

        init_sync_kiocb(&kiocb, filp);
        kiocb.ki_pos = *ppos;
        kiocb.ki_left = len;
        kiocb.ki_nbytes = len;

        ret = filp->f_op->aio_read(&kiocb, &iov, 1, kiocb.ki_pos);
        if (-EIOCBQUEUED == ret)
                ret = wait_on_sync_kiocb(&kiocb);
        *ppos = kiocb.ki_pos;
        return ret;
}

“kiocb” means “kernel i/o control block”. So it calls through the function pointer filp->f_op->aio_read(), and potentially waits for this to complete by calling wait_on_sync_kiocb(&kiocb) if the request gets queued.

As for f_op->aio_read(): if we assume ext3, or a plain block device (e.g. /dev/sda), this goes through generic_file_aio_read(). For xfs, it will go through xfs_file_aio_read().

In generic_file_aio_read() (in mm/filemap.c), there is a section at the top where I believe it sets up mappings for the userland buffers, sort of linking them to LBA ranges on the device. (My interpretation is pretty sketchy and hand-wavy because I don’t really understand what’s going on here.) Then it does this:

                retval = mapping->a_ops->direct_IO(READ, iocb,
                                                iov, pos, nr_segs);

It looks like this part above may actually do the i/o, or possibly only part of the i/o, or maybe none of the i/o, depending on circumstances which I’m not sure about. The above eventually ends up calling do_direct_IO() in fs/direct-io.c, probably first via blkdev_direct_IO() (in fs/block_dev.c), which calls:

__blockdev_direct_IO() (in fs/direct-io.c), which calls:
do_blockdev_direct_IO() (note: the end_io param is NULL and iocb is not NULL; note also that end_io and iocb are copied into dio), which passes dio to:
do_direct_IO().

do_direct_IO() ends up calling get_more_blocks() in fs/direct-io.c, which calls through another function pointer:

         ret = (*sdio->get_block)(dio->inode, fs_startblk,
                                                map_bh, create);

I think the get_block() function is filesystem specific. I kind of lost the trail here, but for block devices (e.g. /dev/sda without a filesystem) it probably goes through fs/block_dev.c:blkdev_get_block(), which just does…

static int
blkdev_get_block(struct inode *inode, sector_t iblock,
                struct buffer_head *bh, int create)
{
        bh->b_bdev = I_BDEV(inode);
        bh->b_blocknr = iblock;
        return 0;
}

Hmm, so that would appear to be just setting up mapping between buffers and devices and LBAs (logical block addresses), I think.
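
As a worked example of the bookkeeping involved: a byte offset in the file (or device) becomes a block number by shifting off the block size bits, and the block layer then addresses the device in 512-byte sectors. The two helper names below are hypothetical, and this is only the arithmetic, not the kernel's code. With 4096-byte blocks (blkbits = 12), byte offset 8192 is block 2, which starts at sector 16.

```c
#include <stdint.h>

#define TOY_SECTOR_SHIFT 9      /* block layer sectors are 512 bytes */

/* Byte offset -> block number, for a block size of (1 << blkbits) bytes. */
uint64_t toy_offset_to_block(uint64_t byte_offset, unsigned int blkbits)
{
        return byte_offset >> blkbits;
}

/* Block number -> first 512-byte sector of that block (needs blkbits >= 9). */
uint64_t toy_block_to_sector(uint64_t block, unsigned int blkbits)
{
        return block << (blkbits - TOY_SECTOR_SHIFT);
}
```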

do_direct_IO() will call submit_page_section(), and this calls:
dio_send_cur_page() in a few different cases, and this can call:
dio_new_bio(), which can call:
dio_bio_alloc(), which will set
bio->bi_end_io to either dio_bio_end_aio or dio_bio_end_io. <– this is important!

Then a bit further down in do_blockdev_direct_IO(), it calls:

        if (sdio.bio)
                dio_bio_submit(dio, &sdio);

dio_bio_submit() calls:

         submit_bio(dio->rw, bio); (I am ignoring sdio->submit_bio().)

See the submit_bio() section below, which follows the i/o all the way through the block layer to the low level driver and back up again.

Eventually, do_blockdev_direct_IO() calls:

        if (retval != -EIOCBQUEUED)
                dio_await_completion(dio);

dio_await_completion() (in fs/direct-io.c) looks like:

static void dio_await_completion(struct dio *dio)
{
        struct bio *bio;
        do {
                bio = dio_await_one(dio);
                if (bio)
                        dio_bio_complete(dio, bio);
        } while (bio);
}

dio_await_one(dio) does…

        while (dio->refcount > 1 && dio->bio_list == NULL) {
                dio->waiter = current;   <------------------  current is the CURRENT TASK.
                spin_unlock_irqrestore(&dio->bio_lock, flags);
                io_schedule(); <-----  will put process to sleep.
                /* wake up sets us TASK_RUNNING */
                spin_lock_irqsave(&dio->bio_lock, flags);
                dio->waiter = NULL;
        }

io_schedule() (in kernel/sched/core.c) looks like:

void __sched io_schedule(void)
{
        struct rq *rq = raw_rq();

        current->in_iowait = 1;
        schedule(); <--------------- will put process to sleep
        current->in_iowait = 0;
}

Eventually (see below) dio_bio_end_aio() or dio_bio_end_io() will get called, and these will do:

        wake_up_process(dio->waiter);

which will make it return from schedule(),
which will return back up through
dio_await_one(), and up through
dio_await_completion(), then up through
do_blockdev_direct_IO(),
and back up through

        retval = mapping->a_ops->direct_IO(READ, iocb,
                                        iov, pos, nr_segs);

which was in generic_file_aio_read() (mm/filemap.c). So assuming that the whole i/o got done direct, this returns back into do_sync_read(), which returns back to vfs_read() and back out through the read(2) system call to userland.

----- this section below is if the i/o can't all be done direct
      in generic_file_aio_read() and there is some remaining
      non-direct i/o to do.

So after the virtual memory/disk mapping stuff, in case it did a short
read(???), mm/filemap.c:generic_file_aio_read() ends up calling:

        do_generic_file_read(filp, ppos, &desc, file_read_actor);

(eh, that file_read_actor is a function pointer, but it appears to just copy data from page cache pages out to the user buffer, not actually do i/o.)

do_generic_file_read() is also in mm/filemap.c and is horrendously complicated, but it looks like the business end is this line:

          error = mapping->a_ops->readpage(filp, page);

(oh goody, another function pointer.)

Let’s assume a straight block device with no filesystem, in which case (I guess) it goes through blkdev_readpage() in fs/block_dev.c:

static int blkdev_readpage(struct file * file, struct page * page)
{
        return block_read_full_page(page, blkdev_get_block);
}

block_read_full_page() is in fs/buffer.c, and is horrendously complicated (you expected something else?) The business end appears to be this line:

         submit_bh(READ, bh);

submit_bh starts building a struct bio (yay! some stuff I recognize!) and calls submit_bio().

----- End of section (above) is if the direct i/o can't be done all direct
      and there is some remaining non-direct i/o to do.


submit_bio() is in block/blk-core.c (so we’re out of mysterious filesystem / mm territory and into the hopefully slightly less mysterious block layer now.)

The business end of submit_bio() appears to be:

        generic_make_request(bio);

generic_make_request(), also in block/blk-core.c, contains:

        /*
         * We only want one ->make_request_fn to be active at a time, else
         * stack usage with stacked devices could be a problem.  So use
         * current->bio_list to keep a list of requests submited by a
         * make_request_fn function.  current->bio_list is also used as a
         * flag to say if generic_make_request is currently active in this
         * task or not.  If it is NULL, then no make_request is active.  If
         * it is non-NULL, then a make_request is active, and new requests
         * should be added at the tail
         */
        if (current->bio_list) {
                bio_list_add(current->bio_list, bio);
                return;
        }

Presuming there is not another active ->make_request_fn(),
generic_make_request() continues, processing the bio list and calling:

         q->make_request_fn(q, bio);

This calls through struct request_queue->make_request_fn().
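
The current->bio_list trick is easier to see in a toy. Below is a minimal sketch, assuming a single task (so plain globals stand in for the per-task list) and a made-up stacked device that resubmits each bio to the layer below it. All the toy_ names are hypothetical. The point is that resubmitting from inside a make_request function only queues the bio; the outermost call drains the queue in a loop, so make_request functions never nest.

```c
#include <stddef.h>

struct toy_bio {
        struct toy_bio *next;
        int layers_left;        /* stacked devices still to pass through */
};

/* Stand-ins for current->bio_list: a FIFO plus an "active" flag. */
static struct toy_bio *toy_list_head, *toy_list_tail;
static int toy_active;

int toy_nesting, toy_max_nesting, toy_reached_bottom;

void toy_generic_make_request(struct toy_bio *bio);

/* A stacked device's make_request: remap and resubmit to the layer below. */
static void toy_make_request_fn(struct toy_bio *bio)
{
        if (++toy_nesting > toy_max_nesting)
                toy_max_nesting = toy_nesting;
        if (bio->layers_left > 0) {
                bio->layers_left--;
                toy_generic_make_request(bio);  /* only queues the bio */
        } else {
                toy_reached_bottom++;           /* bottom device reached */
        }
        toy_nesting--;
}

void toy_generic_make_request(struct toy_bio *bio)
{
        if (toy_active) {
                /* A make_request is already running: add at the tail. */
                bio->next = NULL;
                if (toy_list_tail)
                        toy_list_tail->next = bio;
                else
                        toy_list_head = bio;
                toy_list_tail = bio;
                return;
        }

        toy_active = 1;
        do {
                toy_make_request_fn(bio);
                /* Pop whatever that call queued, if anything. */
                bio = toy_list_head;
                if (bio) {
                        toy_list_head = bio->next;
                        if (!toy_list_head)
                                toy_list_tail = NULL;
                }
        } while (bio);
        toy_active = 0;
}
```

Submitting a bio with layers_left = 3 walks through all three layers, yet toy_max_nesting stays at 1, which is the stack usage saving the kernel comment is talking about.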

(begin diversion about scsi device initialization)

For each disk device, scsi_add_device() gets called either by the low level driver (if the driver uses scsi_host->scan_start()/scan_finished()), or by the scsi mid layer on behalf of the low level driver if the generic scanning code is used.

scsi_add_device() calls:
__scsi_add_device(), which calls:
scsi_probe_and_add_lun(), which calls:
scsi_alloc_sdev(), which calls:
scsi_alloc_queue(), which calls:
__scsi_alloc_queue(), which calls:
q = blk_init_queue(request_fn, NULL), in block/blk-core.c, which calls:

^^^ note the parameter request_fn, above

        request_fn  == scsi_request_fn in drivers/scsi/scsi_lib.c <--- important!

blk_init_queue_node(), which calls:
blk_init_allocated_queue(), which assigns:
q->request_fn = rfn; <—- == scsi_request_fn()
and which calls:
blk_queue_make_request(q, blk_queue_bio); (in block/blk-settings.c)
which assigns:
q->make_request_fn = mfn; <—- mfn == blk_queue_bio(). <– important!

(end diversion about scsi device initialization)

So, q->make_request_fn() calls blk_queue_bio().
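
Here is a small sketch of that wiring, with hypothetical toyq_ names throughout. An init function stores the driver's request_fn and installs a generic make_request_fn, and the generic function turns a bio into a request and then kicks the driver through the stored pointer. Real queues hold lists of requests, elevators, plugging and locking, none of which is modeled here.

```c
#include <stddef.h>

struct toyq_queue;

struct toyq_bio {
        unsigned long sector;
};

struct toyq_request {
        unsigned long sector;
};

typedef void (*toyq_request_fn)(struct toyq_queue *q);
typedef void (*toyq_make_request_fn)(struct toyq_queue *q, struct toyq_bio *bio);

struct toyq_queue {
        toyq_request_fn request_fn;             /* driver side (scsi_request_fn role) */
        toyq_make_request_fn make_request_fn;   /* block side (blk_queue_bio role) */
        struct toyq_request pending;            /* a one-slot "queue" */
        int has_pending;
        unsigned long last_dispatched;
};

/* blk_queue_bio() role: build a request from the bio, then run the queue. */
static void toyq_queue_bio(struct toyq_queue *q, struct toyq_bio *bio)
{
        q->pending.sector = bio->sector;
        q->has_pending = 1;
        q->request_fn(q);
}

/* scsi_request_fn() role: pull the request off the queue and dispatch it. */
void toyq_scsi_like_request_fn(struct toyq_queue *q)
{
        if (q->has_pending) {
                q->last_dispatched = q->pending.sector;
                q->has_pending = 0;
        }
}

/* blk_init_queue() role: remember rfn, install the generic make_request. */
void toyq_init_queue(struct toyq_queue *q, toyq_request_fn rfn)
{
        q->request_fn = rfn;
        q->make_request_fn = toyq_queue_bio;
        q->has_pending = 0;
        q->last_dispatched = 0;
}
```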

blk_queue_bio calls:

         req = get_request(q, rw_flags, bio, GFP_NOIO);

Note get_request() may sleep, which is another opportunity for your i/o to get preempted.

Next, it makes a request from the bio:

       init_request_from_bio(req, bio);

Next it checks the queue flags and associates the current cpu with the request if the right queue flags are set (see Documentation/block/queue-sysfs.txt section about rq_affinity)

       if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags))
                req->cpu = raw_smp_processor_id();

This means the request is trying to remember which cpu it is being submitted on. It is not clear to me that this is not preemptible somewhere along the way, so this may just be a “best effort” rather than a guarantee.

Then it calls:

        __blk_run_queue(q);

__blk_run_queue() calls __blk_run_queue_uncond(q);
__blk_run_queue_uncond() calls:

        q->request_fn(q);

where q->request_fn == scsi_request_fn().

scsi_request_fn() in drivers/scsi/scsi_lib.c builds a scsi_cmnd from the request and calls:

        rtn = scsi_dispatch_cmd(cmd);

scsi_dispatch_cmd() in drivers/scsi/scsi.c does this, eventually:

        rtn = host->hostt->queuecommand(host, cmd);

This calls into the low level driver’s
scsi_host_template->queuecommand() function, which the low level driver supplied in the scsi_host_template it passed to scsi_host_alloc().

Let’s assume we’re using the hpsa driver for HP’s Smart Array controllers (because that’s what I know because that’s my job). This calls
hpsa_scsi_queue_command(), or through some macro magic, hpsa_scsi_queue_command_lck().

This function builds up a hardware specific command,
maps the buffers in the request for DMA, and submits
the command to the hardware, via:

        set_performant_mode(h, c);
                 h->access.submit_command(h, c);

Note that set_performant_mode(h, c) does this:

static void set_performant_mode(struct ctlr_info *h, struct CommandList *c)
{
        if (likely(h->transMethod & CFGTBL_Trans_Performant)) {
                c->busaddr |= 1 | (h->blockFetchTable[c->Header.SGList] << 1);
                if (likely(h->msix_vector))
                        c->Header.ReplyQueue =
                                smp_processor_id() % h->nreply_queues;
        }
}

Note that c->Header.ReplyQueue is derived from the current processor via smp_processor_id(), modulo the number of reply queues. This dictates which msix vector and (if irq affinity is set) potentially which cpu the interrupt will come back on.
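
The reply queue choice is just modular arithmetic. A tiny sketch with a hypothetical helper name: with 4 reply queues, a command submitted on cpu 5 lands on reply queue 1, and so does one submitted on cpu 1, so several cpus share a queue whenever there are more cpus than reply queues.

```c
/* Reply queue for a command submitted on `cpu`; nreply_queues must be > 0. */
unsigned int toy_reply_queue(unsigned int cpu, unsigned int nreply_queues)
{
        return cpu % nreply_queues;
}
```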

At this point, the i/o is submitted, and the current thread of control eventually unwinds back up the stack.

When the i/o completes, the interrupt handler of the hpsa driver will be called. If irq affinity is set, the cpu on which the interrupt handler runs can be arranged to be the same cpu from which the command was submitted to the hardware.

The interrupt handler looks like this:

static irqreturn_t do_hpsa_intr_msi(int irq, void *queue)
{
        struct ctlr_info *h = queue_to_hba(queue);
        u32 raw_tag;
        u8 q = *(u8 *) queue;

        h->last_intr_timestamp = get_jiffies_64();
        raw_tag = get_next_completion(h, q);
        while (raw_tag != FIFO_EMPTY) {
                if (likely(hpsa_tag_contains_index(raw_tag)))
                        process_indexed_cmd(h, raw_tag);
                else
                        process_nonindexed_cmd(h, raw_tag);
                raw_tag = next_command(h, q);
        }
        return IRQ_HANDLED;
}

The normal i/o path would have the command completing through the call to process_indexed_cmd().

/* process completion of an indexed ("direct lookup") command */
static inline void process_indexed_cmd(struct ctlr_info *h,
        u32 raw_tag)
{
        u32 tag_index;
        struct CommandList *c;

        tag_index = hpsa_tag_to_index(raw_tag);
        if (!bad_tag(h, tag_index, raw_tag)) {
                c = h->cmd_pool + tag_index;
                finish_cmd(c);
        }
}

finish_cmd(c) is what would normally be called here.

static inline void finish_cmd(struct CommandList *c)
{
        unsigned long flags;

        spin_lock_irqsave(&c->h->lock, flags);
        removeQ(c);
        spin_unlock_irqrestore(&c->h->lock, flags);
        dial_up_lockup_detection_on_fw_flash_complete(c->h, c);
        if (likely(c->cmd_type == CMD_SCSI))
                complete_scsi_command(c);
        else if (c->cmd_type == CMD_IOCTL_PEND)
                complete(c->waiting);
}

The normal i/o path would complete via complete_scsi_command(c);

complete_scsi_command() will unmap the command for DMA, fill in any status information in the struct scsi_cmnd, then in the normal non-error case, call cmd->scsi_done() which points to
scsi_done() in drivers/scsi/scsi.c.

scsi_done() calls
blk_complete_request(cmd->request) (in block/blk-softirq.c), which calls
__blk_complete_request() (also in block/blk-softirq.c),
which does some interesting things regarding the cpu.

        cpu = smp_processor_id();

^^^ gets the current cpu.

        /*
         * Select completion CPU
         */
        if (req->cpu != -1) {
                ccpu = req->cpu;
                if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags))
                        shared = cpus_share_cache(cpu, ccpu);
        } else
                ccpu = cpu;

^^^ NOTICE here it selects the completion cpu.

        /*
         * If current CPU and requested CPU share a cache, run the softirq on
         * the current CPU. One might concern this is just like
         * QUEUE_FLAG_SAME_FORCE, but actually not. blk_complete_request() is
         * running in interrupt handler, and currently I/O controller doesn't
         * support multiple interrupts, so current CPU is unique actually. This
         * avoids IPI sending from current CPU to the first CPU of a group.
         */
        if (ccpu == cpu || shared) {
                struct list_head *list;
do_local:
                list = &__get_cpu_var(blk_cpu_done);
                list_add_tail(&req->csd.list, list);

                /*
                 * if the list only contains our just added request,
                 * signal a raise of the softirq. If there are already
                 * entries there, someone already raised the irq but it
                 * hasn't run yet.
                 */
                if (list->next == &req->csd.list)
                        raise_softirq_irqoff(BLOCK_SOFTIRQ);
        } else if (raise_blk_irq(ccpu, req))
                goto do_local;

So in the end, it either calls raise_softirq_irqoff() or raise_blk_irq(), (and notice that funky goto in the “else” clause that jumps back into the “then” clause of that “if” statement.)
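
Since that if/else/goto is a bit of a brain twister, here is the same decision written as a plain function. The function name and the boolean parameters are mine: same_force stands in for QUEUE_FLAG_SAME_FORCE, shares_cache for the result of cpus_share_cache(), and ipi_ok for whether handing the request to the other cpu worked (I am assuming raise_blk_irq() returns nonzero when it could not do so, which is what the goto suggests). It returns the cpu on which the block softirq ends up being raised.

```c
#include <stdbool.h>

/*
 * cur_cpu: cpu running the completion interrupt
 * req_cpu: cpu recorded in the request at submit time, or -1 if none
 */
int toy_pick_completion_cpu(int cur_cpu, int req_cpu, bool same_force,
                            bool shares_cache, bool ipi_ok)
{
        int ccpu;
        bool shared = false;

        if (req_cpu != -1) {
                ccpu = req_cpu;
                /* Cache sharing only counts when SAME_FORCE is not set. */
                if (!same_force)
                        shared = shares_cache;
        } else {
                ccpu = cur_cpu;
        }

        if (ccpu == cur_cpu || shared)
                return cur_cpu;         /* the "do_local" path */

        /* Try the submitting cpu; fall back to local if that fails. */
        return ipi_ok ? ccpu : cur_cpu;
}
```

So with rq_affinity style settings, a completion only hops to another cpu when the request remembered one, that cpu differs from the current one, and either the two share no cache or SAME_FORCE insists.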

raise_softirq_irqoff(BLOCK_SOFTIRQ); ends up triggering blk_done_softirq() to get called.

See kernel/softirq.c, where raise_softirq_irqoff() does:

        if (!in_interrupt())
                wakeup_softirqd();

and wakeup_softirqd() does:

        /* Interrupts are disabled: no need to stop preemption */
        struct task_struct *tsk = __this_cpu_read(ksoftirqd);

        if (tsk && tsk->state != TASK_RUNNING)
                wake_up_process(tsk);

The ksoftirqd code looks like:

static void run_ksoftirqd(unsigned int cpu)
{
        if (local_softirq_pending()) {
                __do_softirq();
        }
}

Notice that here is potentially another context switch, though I think it will remain on the same cpu (since it’s called from under a goto label that’s named “do_local:”?)

The business end is __do_softirq(), which looks pretty complicated, but the business end of __do_softirq() appears to be:

        h->action(h);

which was presumably set by open_softirq():

void open_softirq(int nr, void (*action)(struct softirq_action *))
{
        softirq_vec[nr].action = action;
}

which was previously called from blk_softirq_init():

        open_softirq(BLOCK_SOFTIRQ, blk_done_softirq);

So it does call blk_done_softirq(), eventually.
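
The register / raise / run dance can be modeled in a few lines. This is a toy sketch with made-up toy_ names and a single global pending mask, where the kernel keeps pending bits per cpu and does a lot more (irq masking, restarts, ksoftirqd handoff). Raising only sets a bit; the handler runs later, when something walks the pending bits and calls each registered action.

```c
#include <stddef.h>

enum { TOY_HI_SOFTIRQ, TOY_BLOCK_SOFTIRQ, TOY_NR_SOFTIRQS };

struct toy_softirq_action {
        void (*action)(struct toy_softirq_action *);
};

static struct toy_softirq_action toy_softirq_vec[TOY_NR_SOFTIRQS];
static unsigned int toy_pending;        /* one bit per softirq number */

int toy_block_done_calls;

/* Register a handler for softirq number nr. */
void toy_open_softirq(int nr, void (*action)(struct toy_softirq_action *))
{
        toy_softirq_vec[nr].action = action;
}

/* Mark softirq nr as pending; nothing runs yet. */
void toy_raise_softirq(int nr)
{
        toy_pending |= 1u << nr;
}

/* Walk the pending bits, lowest first, calling each registered action. */
void toy_do_softirq(void)
{
        unsigned int pending = toy_pending;
        struct toy_softirq_action *h = toy_softirq_vec;

        toy_pending = 0;
        while (pending) {
                if ((pending & 1) && h->action)
                        h->action(h);
                h++;
                pending >>= 1;
        }
}

/* Stands in for blk_done_softirq(). */
void toy_blk_done_softirq(struct toy_softirq_action *h)
{
        (void)h;
        toy_block_done_calls++;
}
```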

The business end of blk_done_softirq() is:

        rq->q->softirq_done_fn(rq);

This is the softirq_done_fn of the queue, which in this case is scsi_softirq_done() (in drivers/scsi/scsi_lib.c), previously set up in scsi_alloc_queue() via a call to blk_queue_softirq_done().

scsi_softirq_done() (we will ignore all but the successful path) calls
scsi_finish_command() (in drivers/scsi/scsi.c) which calls:
scsi_io_completion() (in drivers/scsi/scsi_lib.c) which calls:
blk_end_request() (in block/blk-core.c) which calls:
blk_end_bidi_request (“bidi” means bi-directional) which calls
blk_update_bidi_request() which calls
blk_update_request() which calls
req_bio_endio() for each bio in the request, which calls
bio_endio() (in fs/bio.c) which calls
bio->bi_end_io, which points to either:
dio_bio_end_aio() or dio_bio_end_io(), which both call:

        wake_up_process(dio->waiter);

which will wake up the process that called schedule(), which if you recall, was: dio_await_one(), called by
dio_await_completion(), called by
do_blockdev_direct_IO(), called by
blkdev_direct_IO(), called by
generic_file_aio_read(), called by
do_sync_read(), called by
vfs_read(), called by (drum roll…)
the read() system call!
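
The last leg of that chain, from ending a request to waking the sleeper, can be sketched as a toy. All names here are hypothetical, and "waking" is just clearing a flag and bumping a counter where the kernel calls wake_up_process(). I am also assuming the submitter holds one reference on the dio in addition to one per in-flight bio, so the waiter is only woken when the count drops back to 1.

```c
#include <stddef.h>

struct toy_dio {
        int refcount;           /* 1 for the submitter + 1 per in-flight bio */
        int waiter_sleeping;    /* stands in for dio->waiter being set */
        int wakeups;            /* how many times we "woke" the waiter */
};

struct toy_cbio {
        struct toy_cbio *bi_next;
        void (*bi_end_io)(struct toy_cbio *bio, int error);
        void *bi_private;       /* points back at the owning toy_dio */
};

struct toy_req {
        struct toy_cbio *bio;   /* chain of bios carried by this request */
};

/* Completion callback installed at submit time (dio_bio_end_io role). */
void toy_dio_bio_end_io(struct toy_cbio *bio, int error)
{
        struct toy_dio *dio = bio->bi_private;

        (void)error;
        if (--dio->refcount == 1 && dio->waiter_sleeping) {
                dio->waiter_sleeping = 0;       /* the wake_up_process() moment */
                dio->wakeups++;
        }
}

/* End a request: call bi_end_io on every bio it carries. */
void toy_end_request(struct toy_req *req, int error)
{
        while (req->bio) {
                struct toy_cbio *bio = req->bio;

                req->bio = bio->bi_next;
                bio->bi_next = NULL;
                if (bio->bi_end_io)
                        bio->bi_end_io(bio, error);
        }
}
```

With two bios in flight the first completion just drops the count, and only the second one wakes the waiter, which is why the reader sleeps until the whole i/o is done.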


And that’s how you get to Mordor and back.

Can you believe all of that works?

~ by scaryreasoner on September 14, 2013.

3 Responses to “The Life of I/O in the linux kernel”

  1. Heiroglyphics to me, but impressive that you did that.

  2. May I just say congratulations on making it out alive.

  3. Did it recently … found your post well documented ;) thanks
