Brent Welch | 9 Sep 2005 01:18

Re: pNFS some minor changes


>>>Marc Eshel said:
 > 
 > Brent Welch <welch <at> panasas.com> wrote on 09/08/2005 02:05:14 PM:
 > 
 > > 
 > > >>>"J. Bruce Fields" said:
 > >  > 
 > >  > On Wed, Sep 07, 2005 at 11:25:14AM -0700, Brent Welch wrote:
 > >  > > Keep in mind that some applications have single files that are
 > >  > > multi-terabytes in size, and so distributing them over many, many
 > >  > > servers may be just what you want to do.
 > >  > > 
 > >  > > I think Garth has put a parameter in that specifies how much
 > >  > > memory the client has for the returned layout.
 > >  > 
 > >  > I see that LAYOUTGET has a "maxcount" parameter, and the server can
 > >  > return TOOSMALL errors.  So a client can retry with increasingly large
 > >  > buffers.  Does a client that wants to interoperate with any pNFS server
 > >  > need to be prepared to retry with arbitrarily large buffers?
 > > 
 > > Not necessarily.  Even if my multi-terabyte file is distributed over
 > > 1000's of data servers, I could get back a much smaller map that just
 > > provides a multi-gigabyte window into that file.   So, the model is not
 > > to retry until you get a multi-terabyte layout, but to do your I/O
 > > in smaller ranges of the file.  You may need to be creative in your
 > > layout definition to do that efficiently, but in the worst case of a
 > > 64K stripe unit spread over 10 million data servers, the server
 > > could give out a layout for 64 Meg that listed 1000 servers.
 > > Now, I wouldn't implement a layout like that, partly for this reason,
 > > but the client could make forward progress.  I would only expect TOOSMALL if
 > > the client-supplied buffer were just a handful of bytes or something.
 > > 
 > The only problem with this approach is that if you cannot fit all the
 > data servers into one layout that describes a striped file, you have to
 > switch to a one-to-one mapping of the file, which can take many messages
 > to describe for a very big file.
 > Marc.

Right, we don't use that for our really widely striped files.
Instead, we use a two-level scheme where you stripe the
first N gigabytes over M servers with a traditional striping
pattern, and then shift to another M servers for the next
N gigabytes, and so forth.  An advantage of this is that it
lets clients focus their attention on a smaller number of
data servers.  One client can't effectively draw data from 1000
servers at once, at least in our experience. And, we actually
do give out the complete map, even if it covers 1000 servers.
If you wanted to give out fewer than the complete set of servers
in the layout, then you would supply the initial offset so
the client can do the math right.  Going back to the other
thread on "equivalent servers" and different layout and aggregation
schemes, this is yet another aggregation scheme for dealing 
with really large files.
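
To make the arithmetic concrete, here is a rough sketch of how a client might map a file offset to a data server under the two-level scheme described above. It is only an illustration under stated assumptions, not the Panasas implementation or anything from the pNFS draft: the function name `locate` and its parameters are hypothetical, striping within a group is assumed to be plain round-robin, each group is assumed to use its own M servers listed consecutively, and N is assumed to be a multiple of the stripe unit.

```python
def locate(offset, stripe_unit, m, n_bytes, layout_offset=0):
    """Map a file byte offset to a position in a two-level striped layout.

    All names are hypothetical. Assumptions: round-robin striping inside a
    group, consecutive groups use disjoint sets of m servers, and n_bytes
    is a multiple of stripe_unit.

    Returns (index into the layout's server list, byte offset within the
    stripe unit). layout_offset is the initial file offset the layout
    covers, so a partial layout can list only the groups it needs.
    """
    if offset < layout_offset:
        raise ValueError("offset lies before the range covered by the layout")

    first_group = layout_offset // n_bytes   # first group listed in the layout
    group = offset // n_bytes                # which run of N bytes we are in
    within_group = offset % n_bytes          # byte position inside that run
    unit = within_group // stripe_unit       # stripe unit number inside the run
    server_in_group = unit % m               # round-robin over the m servers

    index = (group - first_group) * m + server_in_group
    return index, within_group % stripe_unit


KIB = 1024
GIB = 1024 ** 3

# A full layout: 64K stripe unit, 4 servers per group, 1 GiB per group.
print(locate(0, 64 * KIB, 4, GIB))
print(locate(GIB, 64 * KIB, 4, GIB))

# A partial layout that starts at the second group.
print(locate(GIB + 64 * KIB, 64 * KIB, 4, GIB, layout_offset=GIB))
```

The `layout_offset` argument plays the role of the initial offset: without it, a client holding a partial server list would index past the end of the list for any offset beyond the first group.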

--
Brent Welch
Software Architect, Panasas Inc
Accelerating Time to Results(tm) with Clustered Storage

www.panasas.com
welch <at> panasas.com

_______________________________________________
nfsv4 mailing list
nfsv4 <at> ietf.org
https://www1.ietf.org/mailman/listinfo/nfsv4

