Joel Becker | 11 Feb 00:07 2011

Re: [LSF/MM TOPIC] Utilizing T10/DIF in Filesystems

On Wed, Feb 09, 2011 at 12:00:35PM -0800, Darrick J. Wong wrote:
> I would like to talk about the status and future direction of T10/DIF support
> in the kernel.  Here's something resembling an abstract:
> Utilizing T10/DIF in Filesystems
> For quite some time, we've been discussing the inclusion of T10/DIF
> functionality into the kernel to associate a small amount of checksum/integrity
> data with each sector.  Now that actual hardware is appearing on the market, it
> is time to take another look at what we can do with this feature.

	I like it.

> 2. How do we expose a API that enables userspace to read and write  application
>    tags that go with a file's data blocks?

	Martin and I have been working on an interface for this.  I was
going to propose it for LSF, but since I haven't had time to put
together code, I figured it could wait.  However, it might be worth
discussing under this umbrella, as it can easily fit the DIF/DIX stuff.



sys_dio_*() - A direct interface to batched direct I/O.

Joel Becker
Martin Petersen


Direct I/O has a problem: it's coupled to the buffered I/O API.  It's an
advisory layer atop POSIX semantics.  The POSIX calls are fine for what
they are, but they interact with direct I/O in ways that are not
predictable.  In addition, asynchronous operations require a specialized
interface that frees some restrictions while imposing others.

Finally, the classic interfaces do not allow for expansion.  This
affects us directly when trying to pass Integrity information from an
application through to the hardware.  Currently there is no generic
facility to do so; Integrity can be attached after the payload has
entered the kernel, but not before.

We propose a new set of system calls, sys_dio_*(), that isolate direct
I/O in a new, batched API.  sys_dio_*() leverages various things we've
learned with block I/O.  It can batch up synchronous and asynchronous
operations, share file handles, and attach Integrity information to

The initial implementation works directly on block devices, bypassing
the pagecache completely.  This allows the most efficient operation for
block devices.  It does not preclude adding file operations later.

[The I/O Interface]

The I/O interface for sys_dio_*() is simple.  There are only three
actual I/O calls, sys_dio_submit(), sys_dio_wait(), and
sys_dio_collect().  They all work on a batch of requests.

All I/Os must be submitted at some point, so there is but one submission
interface, sys_dio_submit().  Any number of dios can be sent in one
batch.  These submissions are asynchronous and will be queued for
processing (or even perhaps complete) when sys_dio_submit() returns.

The sys_dio_wait() and sys_dio_collect() APIs handle synchronous and
asynchronous completion respectively.  If a process wishes to block for
one or more submitted I/Os, it passes a list to sys_dio_wait().  It will
not return until all I/Os in the list have completed.  Conversely,
sys_dio_collect() takes an empty array of request pointers.  It blocks
until some min_nr have been filled in by completed requests.


In today's world of thousands of disks, it is a significant penalty for
every process sharing disks to hold their own open descriptors.  It
takes up a lot of kernel memory.  sys_dio_*() contains an interface to
share handles among processes.

More specifically, sys_dio_open() converts a block device fd into a
sys_dio handle.  This handle is what gets passed to sys_dio_submit().
Thus, one process can open a block device and any coordinating
processes can use the handle however they wish to communicate it: shm,
tcp, $dontcare.  Unlike fd passing, it does not bloat each process's
task structures.

If more than one process in a context opens the same block device, they
will get the same handle.  The handle is just reference counted.  More
on contexts below.  In addition, the block device is claimed with
bd_claim(), giving the users of the context sole access to their device
(XXX should this be a flag to sys_dio_open()?).

The original fd can be closed as soon as sys_dio_open() has a reference.
The handle is released with sys_dio_close().  These handles are
automatically released when the opening process exits.  Note that the
lifetime of the handle is tied to the processes on the context that have
actually opened it.  If the opening process exits, other processes will
get errors from future submissions.


We don't want all processes in the system sharing everything.  Instead,
sys_dio_*() defines a context.  All processes requesting the same
context will share handles.

But wait, there's more!  A context is requested via a unsigned int key,
but the actual context returned is a file descriptor.  This file
descriptor can be polled for completions.  This way sys_dio operations
can be added to the poll loop.

	if (pollfd.fd == context_fd)
		sys_dio_collect(context_fd, collect_list, sizeof(collect_list)); 

[The API]

 * XXX Should we have flags?  What about O_CREAT|O_EXCL semantics on
 * key->context?
int sys_dio_init(unsigned int key, unsigned int flags);
int sys_dio_exit(int context);

 * XXX I'm providing int as the handle type because of my idea to have a
 * fd_table hanging off of the context.  But perhaps the handles should
 * be unsigned long or __u64 for future proofing?
int sys_dio_open(int context, int fd, int flags);
int sys_dio_close(int context, int handle);

 * All of these calls have read(2) semantics.  They return the number of
 * I/Os submitted/waited on/collected.  If they return less then
 * submitted, it means error (or possibly no more outstanding for
 * sys_dio_collect()).
 * XXX struct dio_request *requests or **requests?  The former is easy
 * to set up but a pain to use later.  The latter is a pain to set up
 * for submission but much nicer later.  I'm thinking most users would
 * have the requests independent in memory, so I chose **requests.
int sys_dio_submit(int context, struct dio_request **requests, int nr_requests);
int sys_dio_wait(int context, struct dio_request **waits, int nr_waits);
int sys_dio_collect(int context, struct dio_request **collects, int nr_collects);


"You look in her eyes, the music begins to play.
 Hopeless romantics, here we go again."
			jlbec <at>
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo <at>
More majordomo info at