<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE rfc [
 <!ENTITY nbsp "&#160;">
 <!ENTITY zwsp "&#8203;">
 <!ENTITY nbhy "&#8209;">
 <!ENTITY wj "&#8288;">
]>
<rfc xmlns:xi="http://www.w3.org/2001/XInclude" category='std' docName='draft-ietf-nfsv4-layrec-04' number="9737" ipr='trust200902' obsoletes='' updates="" sortRefs='true' submissionType='IETF' symRefs='true' tocDepth='3' tocInclude='true' consensus='true' version='3' xml:lang='en'>
<front>
<title abbrev='Reporting Errors via LAYOUTRETURN'>Reporting Errors in NFSv4.2 via LAYOUTRETURN</title>
<seriesInfo name='RFC' value='9737'/>
<author fullname='Thomas Haynes' initials='T.' surname='Haynes'>
<organization abbrev='Hammerspace'>Hammerspace</organization>
<address>
<email>loghyr@gmail.com</email>
</address>
</author>
<author fullname='Trond Myklebust' initials='T.' surname='Myklebust'>
<organization abbrev='Hammerspace'>Hammerspace</organization>
<address>
<email>trondmy@hammerspace.com</email>
</address>
</author>
<date year='2025' month='February'/>
<area>WIT</area>
<workgroup>nfsv4</workgroup>
<keyword>NFSv4</keyword>
<abstract>
<t>
The Parallel Network File System (pNFS) allows for a file's metadata and data to be on different servers (i.e., the metadata server (MDS) and the data server (DS)). When the MDS is restarted, the client can still modify the data file component. During the recovery phase of startup, the MDS and the DSs work together to recover state. If the client has not encountered errors with the data files, then the state can be recovered and the resilvering of the data files can be avoided. With any errors, there is no means by which the client can report errors to the MDS. As such, the MDS has to assume that a file needs resilvering.
This document presents an extension to RFC 8435 to allow the client to update the metadata via LAYOUTRETURN and avoid the resilvering.
</t>
</abstract>
</front>
<middle>
<section anchor='sec_intro' numbered='true' toc='default'>
<name>Introduction</name>
<t>
In the Network File System version 4 (NFSv4) with a Parallel NFS (pNFS) flexible file layout <xref target='RFC8435' format='default' sectionFormat='of'/> server, during recovery after a restart, there is no mechanism for the client to inform the metadata server (MDS) about an error that occurred during a WRITE operation (see <xref section="18.32" target='RFC8881' format='default' sectionFormat='of'/>) to the data servers (DSs) in the period of the outage.
</t>
<t>
Using the process detailed in <xref target='RFC8178' format='default' sectionFormat='of'/>, the revisions in this document become an extension of NFSv4.2 <xref target='RFC7862' format='default' sectionFormat='of'/>. They are built on top of the External Data Representation (XDR) <xref target='RFC4506' format='default' sectionFormat='of'/> generated from <xref target='RFC7863' format='default' sectionFormat='of'/>.
</t>
<section anchor='sec_defs' numbered='true' toc='default'>
<name>Definitions</name>
<t>
See <xref section="1.1" target='RFC8435' format='default' sectionFormat='of'/> for a set of definitions.
</t>
</section>
<section numbered='true' toc='default'>
<name>Requirements Language</name>
<t>
The key words "<bcp14>MUST</bcp14>", "<bcp14>MUST NOT</bcp14>", "<bcp14>REQUIRED</bcp14>", "<bcp14>SHALL</bcp14>", "<bcp14>SHALL NOT</bcp14>", "<bcp14>SHOULD</bcp14>", "<bcp14>SHOULD NOT</bcp14>", "<bcp14>RECOMMENDED</bcp14>", "<bcp14>NOT RECOMMENDED</bcp14>", "<bcp14>MAY</bcp14>", and "<bcp14>OPTIONAL</bcp14>" in this document are to be interpreted as described in BCP 14 <xref target="RFC2119"/> <xref target="RFC8174"/> when, and only when, they appear in all capitals, as shown here.
</t>
</section>
</section>
<section anchor='layout_state_recovery' numbered='true' toc='default'>
<name>Layout State Recovery</name>
<t>
When an MDS restarts, clients are provided a grace recovery period where they are allowed to recover any state that they had established. With open files, the client can send an OPEN operation (see <xref section="18.16" target='RFC8881' format='default' sectionFormat='of'/>) with a claim type of CLAIM_PREVIOUS (see <xref section="9.11" target='RFC8881' format='default' sectionFormat='of'/>). The client uses the RECLAIM_COMPLETE operation (see <xref section="18.51" target='RFC8881' format='default' sectionFormat='of'/>) to notify the MDS that it is done reclaiming state.
</t>
<t>
The NFSv4 flexible file layout type allows for the client to mirror files (see <xref section="8" target='RFC8435' format='default' sectionFormat='of'/>). With client-side mirroring, it is important for the client to inform the MDS of any I/O errors encountered with one of the mirrors. This is the only way for the MDS to determine if one or more of the mirrors are corrupt and then repair the mirrors via resilvering (see <xref section="1.1" target='RFC8435' format='default' sectionFormat='of'/>). The client can use LAYOUTRETURN (see <xref section="18.44" target='RFC8881' format='default' sectionFormat='of'/>) and the ff_ioerr4 structure (see <xref section="9.1.1" target='RFC8435' format='default' sectionFormat='of'/>) to inform the MDS of I/O errors.
</t>
<t>
A problem arises when the MDS restarts and the client has errors it needs to report but cannot do so. <xref section="12.7.4" target='RFC8881' format='default' sectionFormat='of'/> requires that the client <bcp14>MUST</bcp14> stop using layouts. While the intent there is that the client <bcp14>MUST</bcp14> stop doing I/O to the storage devices, it is also true that the layout stateids are no longer valid. The LAYOUTRETURN needs a layout stateid to proceed, and the client cannot get a layout during grace recovery (see <xref section="12.7.4" target='RFC8881' format='default' sectionFormat='of'/>) to recover layout state. As such, clients have no choice but to not recover files with I/O errors. In turn, the MDS <bcp14>MUST</bcp14> assume that the mirrors are inconsistent and pick one for resilvering.
It is a <bcp14>MUST</bcp14> because even if the MDS can determine that the client did modify data during the outage, it <bcp14>MUST NOT</bcp14> assume those modifications were consistent.
</t>
<t>
To fix this issue, the MDS <bcp14>MUST</bcp14> accept the anonymous stateid of all zeros (see <xref section="8.2.3" target='RFC8881' format='default' sectionFormat='of'/>) for the lrf_stateid in LAYOUTRETURN (see <xref section="18.44.1" target='RFC8881' format='default' sectionFormat='of'/>). The client can use this anonymous stateid to inform the MDS of errors encountered. The MDS can then accurately resilver the file by picking the mirror(s) that does not have any associated errors.
</t>
<t>
During the grace period, if the client sends an lrf_stateid in the LAYOUTRETURN with any value other than the anonymous stateid of all zeros, then the MDS <bcp14>MUST</bcp14> respond with an error of NFS4ERR_GRACE (see <xref section="15.1.9.2" target='RFC8881' format='default' sectionFormat='of'/>). After the grace period, if the client sends an lrf_stateid in the LAYOUTRETURN with a value of the anonymous stateid of all zeros, then the MDS <bcp14>MUST</bcp14> respond with an error of NFS4ERR_NO_GRACE (see <xref section="15.1.9.3" target='RFC8881' format='default' sectionFormat='of'/>).
</t>
<t>
Also, when the MDS builds the reply to the LAYOUTRETURN with an lrf_stateid with the value of the anonymous stateid of all zeros, it <bcp14>MUST NOT</bcp14> bump the seqid of the lorr_stateid.
</t>
<t>
If the MDS detects that the layout being returned in the LAYOUTRETURN does not match the current mirror instances found for the file, then it <bcp14>MUST</bcp14> ignore the LAYOUTRETURN and resilver the file in question.
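</t>
<t>
The stateid checks described above can be illustrated with a minimal, non-normative sketch in Python. The names Status, Stateid, and check_layoutreturn_stateid are hypothetical and are not defined by this document or by RFC 8881; the sketch assumes only that a stateid carries a seqid and a 12-byte "other" field and that the anonymous stateid has both set to zero.
</t>
<sourcecode type="python"><![CDATA[
```python
from dataclasses import dataclass
from enum import Enum

# Assumption: the anonymous stateid has seqid 0 and 12 zero bytes of "other".
ANONYMOUS_OTHER = bytes(12)


class Status(Enum):
    NFS4_OK = "NFS4_OK"
    NFS4ERR_GRACE = "NFS4ERR_GRACE"
    NFS4ERR_NO_GRACE = "NFS4ERR_NO_GRACE"


@dataclass(frozen=True)
class Stateid:
    seqid: int
    other: bytes

    def is_anonymous(self):
        # True only for the all-zeros stateid.
        return self.seqid == 0 and self.other == ANONYMOUS_OTHER


def check_layoutreturn_stateid(lrf_stateid, in_grace):
    """Hypothetical MDS-side check of lrf_stateid against the grace period."""
    if in_grace and not lrf_stateid.is_anonymous():
        # Only the anonymous stateid is accepted during grace.
        return Status.NFS4ERR_GRACE
    if not in_grace and lrf_stateid.is_anonymous():
        # The anonymous stateid is rejected once grace has ended.
        return Status.NFS4ERR_NO_GRACE
    # Otherwise the MDS goes on to process the LAYOUTRETURN.
    return Status.NFS4_OK
```
]]></sourcecode>
<t>
In this sketch, a result of NFS4_OK only means that the stateid check passed; the MDS would still compare the returned layout against the current mirror instances before acting on the reported errors.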
</t>
<t>
The MDS <bcp14>MUST</bcp14> resilver any files that have neither been explicitly recovered with a CLAIM_PREVIOUS nor had an error reported via a LAYOUTRETURN. The client has most likely restarted and lost any state.
</t>
<section anchor='sec_when_to_resilver' numbered='true' toc='default'>
<name>When to Resilver</name>
<t>
A write intent occurs when a client opens a file and gets a LAYOUTIOMODE4_RW from the MDS. The MDS <bcp14>MUST</bcp14> track outstanding write intents, and when it restarts, it <bcp14>MUST</bcp14> track recovery of those write intents. The method that the MDS uses to track write intents is implementation specific, i.e., outside the scope of this document.
</t>
<t>
The decision to resilver a file depends on how the client recovers the file before the grace period ends. If the client reclaims the file and reports no errors, the MDS <bcp14>MUST NOT</bcp14> resilver the file. If the client reports an error on the file, then the file <bcp14>MUST</bcp14> be resilvered. If the client does not reclaim or report an error before the grace period ends, then under the old behavior, the MDS <bcp14>MUST</bcp14> resilver the file.
</t>
<t>
The resilvering process is broadly to:
</t>
<ol>
<li>
fence the file (see <xref section="2.2" target='RFC8435' format='default' sectionFormat='of'/>),
</li>
<li>
record the need to resilver,
</li>
<li>
release the write intent, and
</li>
<li>
once there are no write intents on the file, start the resilvering process.
</li>
</ol>
<t>
The MDS <bcp14>MUST NOT</bcp14> resilver a file if there are clients with outstanding write intents, i.e., multiple clients might have the file open with write intents. As the MDS <bcp14>MUST</bcp14> track write intents, it <bcp14>MUST</bcp14> also track the need to resilver, i.e., if the MDS restarts during the grace period, it <bcp14>MUST</bcp14> restart the file recovery if it replays the write intent, or else it <bcp14>MUST</bcp14> start the resilvering if it replays the resilvering intent.
</t>
<t>
Whether the MDS prevents all I/O to the file until the resilvering is done, forces all I/O to go through the MDS, or allows a proxy server to update the new data file as it is being resilvered is all an implementation choice. The constraint is that the MDS is responsible for the reconstruction of the data file and for the consistency of the mirrors.
</t>
<t>
If the MDS does allow the client access to the file during the resilvering, then the client <bcp14>MUST</bcp14> have the same layout (set of mirror instances) after the MDS restarts as it had before. One way that such a resilvering can occur is for a proxy server to be inserted into the layout. That server will be copying a good mirror instance to a new instance. As it gets I/O via the layout, it will be responsible for updating the copy it is performing. The requirement is that the proxy server <bcp14>MUST</bcp14> stay in the layout until the grace period is finished.
</t>
</section>
<section anchor='sec_vers_mismatch' numbered='true' toc='default'>
<name>Version Mismatch Considerations</name>
<t>
The MDS has no expectations for the client to use this new functionality. Therefore, if the client does not use it, the MDS will function normally.
</t>
<t>
If the client does use the new functionality and the MDS does not support it, then the MDS <bcp14>MUST</bcp14> reply with an NFS4ERR_BAD_STATEID to the LAYOUTRETURN. If the client detects an NFS4ERR_BAD_STATEID error in this scenario, it should fall back to the old behavior of not reporting errors.
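</t>
<t>
The client-side fallback can be illustrated with a minimal, non-normative sketch in Python. The class RecoveryErrorReporter and the send_layoutreturn callable are hypothetical; the callable stands in for whatever the client uses to send a LAYOUTRETURN with the anonymous stateid and is assumed to return the reply status as a string. Remembering the lack of support for later files is also an assumption of this sketch, as one reading of "fall back to the old behavior".
</t>
<sourcecode type="python"><![CDATA[
```python
class RecoveryErrorReporter:
    """Hypothetical client helper for reporting I/O errors during grace."""

    def __init__(self, send_layoutreturn):
        # send_layoutreturn(file_handle, io_errors) -> status string.
        # This callable is an assumption of the sketch, not a real API.
        self._send = send_layoutreturn
        self.server_supports_reporting = True

    def report(self, file_handle, io_errors):
        """Return True if the MDS accepted the error report."""
        if not self.server_supports_reporting:
            # Old behavior: do not report; the MDS will resilver.
            return False
        status = self._send(file_handle, io_errors)
        if status == "NFS4ERR_BAD_STATEID":
            # The MDS does not support the extension; stop trying.
            self.server_supports_reporting = False
            return False
        return status == "NFS4_OK"
```
]]></sourcecode>
<t>
With such a helper, a client talking to an MDS that lacks the extension would send one LAYOUTRETURN, observe NFS4ERR_BAD_STATEID, and then skip reporting for the remaining files.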
</t>
</section>
</section>
<section anchor='sec_security' numbered='true' toc='default'>
<name>Security Considerations</name>
<t>
There are no new security considerations beyond those in <xref target='RFC7862' format='default' sectionFormat='of'/>.
</t>
</section>
<section anchor='sec_iana' numbered='true' toc='default'>
<name>IANA Considerations</name>
<t>This document has no IANA actions.</t>
</section>
</middle>
<back>
<references>
<name>References</name>
<references>
<name>Normative References</name>
<xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml"/>
<xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.4506.xml"/>
<xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.7862.xml"/>
<xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.7863.xml"/>
<xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8174.xml"/>
<xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8178.xml"/>
<xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8435.xml"/>
<xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8881.xml"/>
</references>
</references>
<section numbered='false' toc='default'>
<name>Acknowledgments</name>
<t><contact fullname="Tigran Mkrtchyan"/>, <contact fullname="Jeff Layton"/>, and <contact fullname="Rick Macklem"/> provided reviews of the document.</t>
</section>
</back>
</rfc>