An Introduction to Block Device Drivers

Introduction

I have been researching IO for a long time and have been unable to connect bio and block device drivers. I only knew that bio is passed to the block device driver through the IO scheduling algorithm, but I never fully understood how this happens or where the IO scheduling algorithm plays its role. After reviewing many materials, I finally gained an understanding of block device drivers and connected bio to the block device.

1. Traditional Block Devices

Let’s first implement a traditional block device driver based on memory.

1.1 Initializing Some Things

// Temporarily use COMPAQ_SMART2_MAJOR as the major device number to prevent device number conflicts
#define SIMP_BLKDEV_DEVICEMAJOR   COMPAQ_SMART2_MAJOR
// Block device name
#define SIMP_BLKDEV_DISKNAME "simp_blkdev"

// Use an array to simulate a physical storage
#define SIMP_BLKDEV_BYTES (16*1024*1024)
unsigned char simp_blkdev_data[SIMP_BLKDEV_BYTES];

static struct request_queue *simp_blkdev_queue;// Request queue
static struct gendisk *simp_blkdev_disk;// Block device

struct block_device_operations simp_blkdev_fops = {// Block device operation functions
    .owner = THIS_MODULE, 
};

1.2 Loading the Driver

The whole process: 1. Create request_queue (one queue for each block device), bind function simp_blkdev_do_request 2. Create a gendisk (each block device is a gendisk) 3. Bind request_queue and gendisk 4. Register gendisk

static int __init simp_blkdev_init(void)
{
    int ret;
    // Initialize request queue
    simp_blkdev_queue = blk_init_queue(simp_blkdev_do_request, NULL);// This method will be analyzed in detail in 1.5
    simp_blkdev_disk = alloc_disk(1);// Allocate simp_blkdev_disk
  
    // Initialize simp_blkdev_disk
    strcpy(simp_blkdev_disk->disk_name, SIMP_BLKDEV_DISKNAME);// Device name
    simp_blkdev_disk->major = SIMP_BLKDEV_DEVICEMAJOR;// Major device number
    simp_blkdev_disk->first_minor = 0;// Minor device number
    simp_blkdev_disk->fops = &simp_blkdev_fops;// Block device operation function pointer
    simp_blkdev_disk->queue = simp_blkdev_queue; 
    // Set the size of the block device, size is the number of sectors, one sector is 512B
    set_capacity(simp_blkdev_disk, SIMP_BLKDEV_BYTES>>9);
    add_disk(simp_blkdev_disk);// Register simp_blkdev_disk
    return 0;
}

1.3 simp_blkdev_do_request

1. Call the scheduling algorithm’s elv_next_request method to get the next request to process 2. If it is a read, copy simp_blkdev_data to request.buffer 3. If it is a write, copy request.buffer to simp_blkdev_data 4. Call end_request to notify completion

static void simp_blkdev_do_request(struct request_queue *q) 
{
    struct request *req;
    while ((req = elv_next_request(q)) != NULL) {// Get the next request based on the scheduling algorithm
        switch (rq_data_dir(req)) {// Determine read or write
        case READ:
            memcpy(req->buffer, simp_blkdev_data + (req->sector << 9), 
            req->current_nr_sectors << 9);
            end_request(req, 1);// Completion notification
            break;
        case WRITE:
            memcpy(simp_blkdev_data + (req->sector << 9), req->buffer, 
            req->current_nr_sectors << 9); 
            end_request(req, 1);// Completion notification
            break;
        default:
             /* No default because rq_data_dir(req) is 1 bit */
             break;
        }
}

1.4 Unloading the Driver

static void __exit simp_blkdev_exit(void)
{
    del_gendisk(simp_blkdev_disk);// Unregister simp_blkdev_disk
    put_disk(simp_blkdev_disk);// Release simp_blkdev_disk
    blk_cleanup_queue(simp_blkdev_queue);// Release request queue
}

Don’t forget the following code

module_init(simp_blkdev_init);
module_exit(simp_blkdev_exit);

1.5 blk_init_queue

Looking at the code above, you may still not clearly understand how request_queue connects bio and block device drivers. Let’s take a deeper look.

simp_blkdev_queue = blk_init_queue(simp_blkdev_do_request, NULL);// Call blk_init_queue

struct request_queue *blk_init_queue(request_fn_proc *rfn, spinlock_t *lock)
{
    return blk_init_queue_node(rfn, lock, NUMA_NO_NODE);// Jump to 1.5.1
}
EXPORT_SYMBOL(blk_init_queue);

// 1.5.1
struct request_queue *
blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
{
    struct request_queue *q;
    q = blk_alloc_queue_node(GFP_KERNEL, node_id, lock);
    if (!q)
        return NULL;

    q->request_fn = rfn;// This is simp_blkdev_do_request
    if (blk_init_allocated_queue(q) < 0) {// Jump to 1.5.2
        blk_cleanup_queue(q);
        return NULL;
    }
    return q;
}
EXPORT_SYMBOL(blk_init_queue_node);

// 1.5.2
int blk_init_allocated_queue(struct request_queue *q)
{
    ...
    blk_queue_make_request(q, blk_queue_bio);// Jump to 1.5.3
    if (elevator_init(q))// Initialize IO scheduling algorithm
        goto out_exit_flush_rq;
    return 0;
    ...
}
EXPORT_SYMBOL(blk_init_allocated_queue);

// 1.5.3
void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
{
    ...
    q->make_request_fn = mfn;// mfn is also blk_queue_bio
    ...
}
EXPORT_SYMBOL(blk_queue_make_request);

static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)// Complete how bio is inserted into request_queue
{
    // IO scheduling algorithm plays its role here
}
After the entire call is completed, two important methods of the current block device’s request_queue are bound
q->make_request_fn = blk_queue_bio;// Linux default implementation
q->request_fn = simp_blkdev_do_request;// Driver's own implementation
1.5.1 make_request_fn(struct request_queue *q, struct bio *bio)

submit_bio will call make_request_fn to encapsulate bio into request and insert it into request_queue, by default, it will use the Linux system’s implementation blk_queue_bio. If we replace make_request_fn, it will cause the IO scheduling algorithm to become ineffective, which is generally not modified.

1.5.2 request_fn(struct request_queue *q)

This method is generally implemented by the driver, which is simp_blkdev_do_request, to take the appropriate request from request_queue for processing, and will generally call the scheduling algorithm’s elv_next_request method to obtain a recommended request.

1.5.3 bio-block device

Through make_request_fn and request_fn, we have connected bio and block device drivers. Moreover, the IO scheduling algorithm will play its role in these two functions.

An Introduction to Block Device Drivers

I have dug two pits for myself: 1. Throughout the process, I was influenced by the IO scheduling algorithm. How does the IO scheduling algorithm play its role? 2. How does make_request_fn trigger request_fn?

2. Ultra-High-Speed Block Devices

Traditional block device access is through the magnetic head, and the IO scheduling algorithm can optimize the order of moving the magnetic head when multiple IO requests occur.

IO Scheduling Algorithm

If you are a librarian, and ten people come to borrow ten books from you, located in different corners of the library, you will definitely choose the shortest route to get those ten books. This is actually the IO scheduling algorithm.

Ultra-High-Speed Block Devices

If this library only has one window, and borrowers just have to say the book title, the book will fly out from the window, then there is no need for any librarian, and no need for any IO scheduling algorithm. This library is an ultra-high-speed block device.

The memory-based block device we wrote above is not an ultra-high-speed block device. Can we write a driver without intermediaries?

2.1 simp_blkdev_init

We need to rewrite the init code and not call blk_init_queue. Use the methods in 2.1.1 and 2.1.2 directly. After initialization, we will set make_request_fn to simp_blkdev_make_request.

static int __init simp_blkdev_init(void)
{
    int ret;
    // Initialize request queue
    simp_blkdev_queue = blk_alloc_queue(GFP_KERNEL);// 2.1.1
    // Bind simp_blkdev_make_request to request_queue's make_request_fn.
    blk_queue_make_request(simp_blkdev_queue, simp_blkdev_make_request);// 2.1.2
    simp_blkdev_disk = alloc_disk(1);// Allocate simp_blkdev_disk

    // Initialize simp_blkdev_disk
    strcpy(simp_blkdev_disk->disk_name, SIMP_BLKDEV_DISKNAME);// Device name
    simp_blkdev_disk->major = SIMP_BLKDEV_DEVICEMAJOR;// Device number
    simp_blkdev_disk->first_minor = 0;
    simp_blkdev_disk->fops = &amp;simp_blkdev_fops;// Block device operation function pointer
    simp_blkdev_disk->queue = simp_blkdev_queue; 
    set_capacity(simp_blkdev_disk, SIMP_BLKDEV_BYTES>>9);// Set the size of the block device, size is the number of sectors, one sector is 512B
    add_disk(simp_blkdev_disk);// Register simp_blkdev_disk
    return 0;

err_alloc_disk:
    blk_cleanup_queue(simp_blkdev_queue);
err_alloc_queue:
    return ret;
}

2.2 simp_blkdev_make_request

Bypassing intermediaries, directly copy simp_blkdev_data to bio’s page, and call bio_endio to notify read/write completion. From start to finish, request_queue and request are not used at all.

static int simp_blkdev_make_request(struct request_queue *q, struct bio *bio) {
    struct bio_vec *bvec;
    int i;
    void *dsk_mem;
    // Get the starting address of the block device memory, bi_sector represents the starting sector
    dsk_mem = simp_blkdev_data + (bio->bi_sector << 9);
    bio_for_each_segment(bvec, bio, i) {// Traverse each block
        void *iovec_mem;
        switch (bio_rw(bio)) {
            case READ:
            case READA:
                // Page represents high-end memory that cannot be accessed directly and needs to be mapped to a linear address
                iovec_mem = kmap(bvec->bv_page) + bvec->bv_offset;// Obtain the corresponding memory address by adding page number and offset
                memcpy(iovec_mem, dsk_mem, bvec->bv_len);// Copy data to memory
                kunmap(bvec->bv_page);// Return linear address
                break;
            case WRITE:
                iovec_mem = kmap(bvec->bv_page) + bvec->bv_offset; 
                memcpy(dsk_mem, iovec_mem, bvec->bv_len); 
                kunmap(bvec->bv_page);
                break;
            default:
                printk(KERN_ERR SIMP_BLKDEV_DISKNAME": unknown value of bio_rw: %lu\n", bio_rw(bio));
#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 24)
                bio_endio(bio, 0, -EIO);// Report error
#else
                bio_endio(bio, -EIO);// Report error
#endif
                return 0;
        }
        dsk_mem += bvec->bv_len;// Move address
    }

#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 24)
                bio_endio(bio, bio->bi_size, 0);
#else
                bio_endio(bio, 0);
#endif
                return 0;
}

2.2 No Intermediaries

Because we directly implement data access in make_request_fn, which is simp_blkdev_make_request, we have thus eliminated the request_queue and IO scheduling algorithm. No intermediaries, access speed is excellent.

An Introduction to Block Device Drivers

The zram device in the kernel is also a memory-based block device without intermediaries profiting from the price difference. The code is very similar; if interested, you can take a look.

3. Conclusion

After such a long time of study, breaking through layers of obstacles, I finally connected IO. However, the file system, IO scheduling algorithm, and each module are worth my in-depth study; the real challenge has just begun.

Code Reference

Write a Block Device Driver.pdf

Material Reference

“Linux Kernel Design and Implementation” “Complete Commentary on Linux Kernel” Linux.Generic.Block.Layer.pdf https://zhuanlan.zhihu.com/c_132560778

Read the original text to obtain the complete code

An Introduction to Block Device Drivers

Reply “ Basketball’s Big Belly” to enter the technical group chat
Reply “1024” to get 1000G of learning materials

Leave a Comment