4 Oct 2005 02:48
Re: Contribution to XFS on Linux <Support block sizes larger than the page size>
David Chinner <dgc <at> sgi.com>
2005-10-04 00:48:51 GMT
On Sat, Oct 01, 2005 at 05:08:29PM -0400, Chandan Talukdar wrote:
> Hi,
>
> Thanks for your responses. I have one more query:
>
> My filesystem development experience has been on systems with separate
> buffer cache and page cache.

And that makes solving this problem a _lot_ easier if the buffer cache
supports scatter-gather multi-page constructs that you can map and unmap
into kernel vm space. That way the filesystem simply needs to map buffers
of filesystem block size and alignment, and the buffer cache does the
rest...

> But Linux has a unified file cache.

And that is the real issue here - there is no greater-than-page-size
construct for the filesystems to use when needing to do atomic I/O
operations on more than one page at a time.

> So, any
> recommended reading for getting a feel of the differences in
> implementation would be much appreciated.

In a different life, XFS relies on a chunk cache that sits above the
page cache to provide atomicity and coherency on operations that span
multiple pages. The chunk cache contains only the currently active
subset of the entire page cache, but the abstraction makes the
filesystem block size independent of the system page size.

That's the really hard bit about this - guaranteeing atomicity of access
across the multiple pages in a filesystem block. There needs to be some
way of enforcing this at all levels of operation (read, write, reclaim,
etc), and when you have a buffer cache this is typically done simply by
locking the buffer.

Without a buffer cache or any other method of atomically aggregating
pages together and operating on that aggregation, you have to lock each
page individually before you can do any operation on the filesystem
block. This is deadlock prone and very difficult to prove correct.

However, I really don't think that reintroducing a buffer-cache-like
construct for atomic aggregation is the way to go here because it makes
many smart things harder to do (e.g. window based readahead) or involve
substantially more overhead due to buffer setup and teardown.

Perhaps doing something like making the fundamental unit of caching a
pagevec rather than a page (i.e. page size independent) would be a more
appropriate way to abstract this. This would be deeply invasive, though,
and as Andi Kleen wrote:

>I don't see how you can make it work without major effort.

It's a major effort ;)

FWIW, one aspect of this multipage caching mechanism still exists in
Linux XFS - the pagebuf - which is needed because metadata buffers can
be larger than a single page and XFS needs to guarantee both
transactional and I/O atomicity for metadata buffers.....

Cheers,

Dave.
--
Dave Chinner
R&D Software Engineer
SGI Australian Software Group
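
To illustrate the per-page locking point above, here is a minimal
userspace sketch in C. It is not kernel code and not how XFS does it:
struct page_sketch, page_cache[], PAGES_PER_BLOCK and the fsblock_*()
helpers are all hypothetical names made up for this example, and the
locks are plain pthread mutexes. It assumes a filesystem block that
covers four pages and takes the page locks in ascending index order, so
that two callers locking the same block cannot deadlock against each
other.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define PAGES_PER_BLOCK 4   /* assumption: e.g. a 16k block on 4k pages */
#define NR_PAGES        16  /* size of the toy "page cache" */

/* Hypothetical stand-in for a cached page with its own lock. */
struct page_sketch {
    pthread_mutex_t lock;
    int data;
};

static struct page_sketch page_cache[NR_PAGES];

static void page_cache_init(void)
{
    for (size_t i = 0; i < NR_PAGES; i++) {
        pthread_mutex_init(&page_cache[i].lock, NULL);
        page_cache[i].data = 0;
    }
}

/*
 * Lock every page backing filesystem block `blk`. With no aggregate
 * object to lock, each page lock is taken individually, always in
 * ascending page index order.
 */
static void fsblock_lock(size_t blk)
{
    size_t first = blk * PAGES_PER_BLOCK;

    assert(first + PAGES_PER_BLOCK <= NR_PAGES);
    for (size_t i = 0; i < PAGES_PER_BLOCK; i++)
        pthread_mutex_lock(&page_cache[first + i].lock);
}

/* Release the page locks in the reverse order. */
static void fsblock_unlock(size_t blk)
{
    size_t first = blk * PAGES_PER_BLOCK;

    for (size_t i = PAGES_PER_BLOCK; i-- > 0; )
        pthread_mutex_unlock(&page_cache[first + i].lock);
}

/* Update all pages of one block as a single unit. */
static void fsblock_fill(size_t blk, int value)
{
    size_t first = blk * PAGES_PER_BLOCK;

    fsblock_lock(blk);
    for (size_t i = 0; i < PAGES_PER_BLOCK; i++)
        page_cache[first + i].data = value;
    fsblock_unlock(blk);
}
```

After page_cache_init(), a call such as fsblock_fill(1, 7) updates pages
4 to 7 under all four locks and leaves the neighbouring blocks alone.
The fixed ordering only helps if every path that touches these pages
obeys it. Any path that takes page locks one at a time in its own order
breaks that assumption, which is the "difficult to prove correct" part.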