<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rfc [
<!ENTITY nbsp " ">
<!ENTITY zwsp "​">
<!ENTITY nbhy "‑">
<!ENTITY wj "⁠">
]>
<rfc xmlns:xi="http://www.w3.org/2001/XInclude" category="info" ipr='trust200902' tocInclude="true" obsoletes="" updates="" consensus="true" submissionType="IETF" xml:lang="en" version="3" docName="draft-ietf-rift-applicability-17" number="9696" symRefs="true" sortRefs="true">
<front>
<!-- [rfced] Please note that the title of the document has been updated to
expand "RIFT" per Section 3.6 of RFC 7322 ("RFC Style Guide"). Please
review.
Original:
RIFT Applicability and Operational Considerations
Current:
Routing in Fat Trees (RIFT) Applicability and Operational Considerations
-->
<title abbrev='RIFT Applicability Statement'>Routing in Fat Trees (RIFT) Applicability and Operational Considerations</title>
<seriesInfo name="RFC" value="9696"/>
<author fullname='Yuehua Wei' initials='Y.' surname='Wei' role='editor' >
<organization>ZTE Corporation</organization>
<address>
<postal>
<street>No.50, Software Avenue</street>
<city>Nanjing</city>
<region/>
<code>210012</code>
<country>China</country>
</postal>
<email>wei.yuehua@zte.com.cn</email>
</address>
</author>
<author fullname='Zheng (Sandy) Zhang' initials='Z.' surname='Zhang'>
<organization>ZTE Corporation</organization>
<address>
<postal>
<street>No.50, Software Avenue</street>
<city>Nanjing</city>
<region/>
<code>210012</code>
<country>China</country>
</postal>
<email>zhang.zheng@zte.com.cn</email>
</address>
</author>
<author fullname='Dmitry Afanasiev' initials='D.' surname='Afanasiev'>
<organization>Yandex</organization>
<address>
<postal>
<street/>
<city/>
<region/>
<code/>
<country/>
</postal>
<email>fl0w@yandex-team.ru</email>
</address>
</author>
<author fullname='Pascal Thubert' initials='P.' surname='Thubert'>
<organization abbrev='Cisco Systems'>Cisco Systems, Inc</organization>
<address>
<postal>
<street>Building D</street>
<street>45 Allee des Ormes - BP1200</street>
<city>Mougins - Sophia Antipolis</city>
<code>06254</code>
<country>France</country>
</postal>
<phone>+33 497 23 26 34</phone>
<email>pthubert@cisco.com</email>
</address>
</author>
<author fullname='Tony Przygienda' initials='T.' surname='Przygienda'>
<organization abbrev='Juniper Networks'>Juniper Networks</organization>
<address>
<postal>
<street>1194 N. Mathilda Ave</street>
<city>Sunnyvale</city>
<region>CA</region>
<code>94089</code>
<country>United States of America</country>
</postal>
<email>prz@juniper.net</email>
</address>
</author>
<date month="December" year="2024"/>
<area>RTG</area>
<workgroup>rift</workgroup>
<!-- [rfced] Please insert any keywords (beyond those that appear in
the title) for use on https://www.rfc-editor.org/search. -->
<keyword>example</keyword>
<abstract>
<t>
This document discusses the properties, applicability, and operational
considerations of Routing in Fat Trees (RIFT) in different network scenarios
with the intention of providing a rough guide on how RIFT can be deployed
to simplify routing operations in Clos topologies and their variations.
</t>
</abstract>
</front>
<!-- ***** MIDDLE MATTER ***** -->
<middle>
<section><name>Introduction</name>
<t>This document discusses the properties and applicability of
<xref target='RFC9692'>"RIFT: Routing in Fat Trees"</xref> in
different deployment scenarios and highlights the operational simplicity of the
technology compared to traditional routing solutions.
It also documents special considerations when RIFT is used with or without overlays and/or controllers and how RIFT identifies miscablings and reroutes around node and link failures.
</t>
</section>
<section><name>Terminology</name>
<!-- [rfced] To avoid repetition and make the text more concise, we have
updated the following sentences in Section 2. Please let us know any
objections.
Original:
This document uses the terminology of RIFT [RIFT]. The most
frequently used terminologies defined in RIFT are listed here. These terms
are consistent with definition in RIFT [RIFT]
Current:
This document uses the terminology defined in [RIFT]. The most
frequently used terms and their definitions from that document are listed
here.
-->
<t>This document uses the terminology defined in <xref target='RFC9692'/>.
The most frequently used terms and their definitions from that document are
listed here.</t>
<dl newline="true" spacing="normal">
<dt>Clos / Fat Tree:</dt>
<dd>
This document uses the terms "Clos" and "Fat Tree" interchangeably,
where they always refer to a folded spine-and-leaf topology with possibly multiple Points of Delivery (PoDs) and one or multiple Top of Fabric (ToF) planes.
Several modifications such as leaf-to-leaf
shortcuts and multiple level shortcuts are possible and described further in
the document.
</dd>
<dt>Crossbar:</dt>
<dd>
Physical arrangement of ports in a switching matrix without
implying any further scheduling or buffering disciplines.
</dd>
<dt>Directed Acyclic Graph (DAG):</dt>
<dd>A finite directed graph with no directed cycles (loops).
<!-- [rfced] What is "vice versa" referring to in this sentence?
Original:
If links in a Clos are considered as either being all directed
towards the top or vice versa, each of such two graphs is a DAG.
Perhaps:
If links in a Clos are considered as either being all directed
towards the top or bottom, each of such two graphs is a DAG.
-->
If links in a Clos are considered as either being all directed towards the top or vice versa, each
of two such graphs is a DAG.
</dd>
<dt>Disaggregation:</dt>
<dd>
The process in which a node decides to
advertise more specific prefixes southwards, either positively to
attract the corresponding traffic or negatively to repel it.
Disaggregation is performed to prevent traffic loss and suboptimal
routing to the more specific prefixes.</dd>
<dt>Leaf:</dt>
<dd>A node without southbound adjacencies. Level 0 implies a leaf in RIFT, but a leaf does not have to be level 0.
</dd>
<dt>LIE:</dt>
<dd>This is an acronym for a "Link Information Element" exchanged
on all the system's links running RIFT to form <em>ThreeWay</em>
adjacencies and carry information used to perform RIFT Zero Touch
Provisioning (ZTP) of levels.
</dd>
<dt>South Reflection:</dt>
<dd>Often abbreviated just as
"reflection", South Reflection defines a mechanism where South Node TIEs
are "reflected" from the level south back up north to allow
nodes in the same level
without East-West links to be aware of each other's node Topology
Information Elements (TIEs).</dd>
<dt>Spine:</dt>
<dd>Any nodes north of leaves and south of ToF nodes. Multiple
layers of spines in a PoD are possible.
</dd>
<dt>TIE:</dt>
<dd>This is an acronym for a "Topology Information Element". TIEs are
exchanged between RIFT nodes to describe parts of a network such as
links and address prefixes. A TIE always has a direction and a
type. North TIEs (sometimes abbreviated as N-TIEs) are used when
dealing with TIEs in the northbound representation, and South TIEs
(sometimes abbreviated as S-TIEs) are used for the southbound
equivalent. TIEs have different types, such as node and prefix TIEs.
</dd>
</dl>
<!--End of Terminology-->
</section>
<section><name>Problem Statement of Routing in Modern IP Fabric Fat Tree Networks</name>
<!-- [rfced] We are unable to verify if the term "homonym" is used correctly in [FATTREE]. May we rephrase the following sentence for accuracy?
Original:
Clos [CLOS] topologies (called commonly a fat tree/network in modern
IP fabric considerations as homonym to the original definition of the
term Fat Tree [FATTREE]) have gained prominence in today's
networking, primarily as a result of the paradigm shift towards a
centralized data-center based architecture that deliver a majority of
computation and storage services.
Perhaps:
Clos [CLOS] topologies (commonly called a Fat Tree/network in modern
IP fabric considerations as a similar term for the original definition of the
term Fat Tree [FATTREE]) have gained prominence in today's
networking, primarily as a result of the paradigm shift towards a
centralized data-center-based architecture that delivers a majority of
computation and storage services.
-->
<t><xref target="CLOS">Clos</xref> topologies (commonly called a Fat Tree/network in modern IP fabric considerations as a homonym to the original definition of the term <xref target="FATTREE">Fat Tree</xref>) have gained prominence in today's networking, primarily as a result of the paradigm shift towards a centralized data-center-based architecture that delivers a majority of computation and storage services.
</t>
<t>Current routing protocols were geared towards a network with an
irregular topology with isotropic properties and a low degree of connectivity.
When applied to Fat Tree topologies:
</t>
<ul spacing="normal">
<li>They tend to need extensive configuration or provisioning
during initialization and adding or removing nodes from the
fabric.</li>
<li>For link-state routing protocols, all nodes including
spine-and-leaf nodes learn the entire network topology and routing
information, which is actually not needed on the leaf nodes during
normal operation. They flood significant amounts of duplicate
link-state information between spine-and-leaf nodes during
topology updates and convergence events, requiring that additional
CPU and link bandwidth be consumed. This may impact the stability
and scalability of the fabric, make the fabric less reactive to
failures, and prevent the use of cheaper hardware at the lower
levels (i.e., spine-and-leaf nodes).
</li>
</ul>
</section>
<section><name>Applicability of RIFT to Clos IP Fabrics</name>
<t>
The remainder of this document assumes that the reader is familiar with the
terms and concepts used in the <xref target='RFC2328'>Open Shortest Path First
(OSPF)</xref>, <xref target='RFC5340'>OSPF for IPv6</xref>, and <xref target='ISO10589-Second-Edition'>Intermediate System to Intermediate System
(IS-IS)</xref> link-state
protocols. <xref target='RFC9692'/> outlines the
requirements of routing in IP fabrics and RIFT protocol concepts.
</t>
<section><name>Overview of RIFT</name>
<t>
RIFT is a dynamic routing protocol that is tailored for use in Clos, Fat Tree, and other anisotropic topologies.
Therefore, a core property of RIFT is that its operation is
sensitive to the structure of the fabric: it is anisotropic. RIFT acts as a link-state protocol when "pointing north", advertising southward routes to northward peers (parents) through flooding and database synchronization. When "pointing south", RIFT operates hop-by-hop like a distance-vector protocol, typically advertising a fabric default route towards the ToF, aka superspine, to southward peers (children).
</t>
<t>
The fabric default is typically the default route as described in
Section 6.3.8, <xref target='RFC9692' sectionFormat='bare' section='6.3.8'>"Southbound Default Route Origination"</xref>, of <xref target="RFC9692"/>.
The ToF nodes may alternatively originate more specific prefixes (P') southbound
instead of the default route. In such a scenario, all addresses carried within
the RIFT domain must be contained within P', and it is possible for a leaf that
acts as gateway to the Internet to advertise the default route instead.
</t>
<t>RIFT floods flat link-state information northbound only so that each level
obtains the full topology of the levels that are south of it. That information is never flooded
East-West or back south again, so a top-tier node has a full set of prefixes from
the Shortest Path First (SPF) calculation.
</t>
<t>In the southbound direction, the protocol operates like a "fully summarizing,
unidirectional" path-vector protocol or, rather, a distance-vector protocol with implicit split horizon. Routing information, normally just the default route, propagates one hop south and is "re-advertised" by nodes at the next lower level.
</t>
<figure align='center' anchor='pic-rift'><name>RIFT Overview</name>
<artwork align='center'><![CDATA[
+---------------+ +----------------+
| ToF | | ToF | LEVEL 2
+ ++------+--+--+-+ ++-+--+----+-----+
| | | | | | | | | ^
+ | | | +-------------------------+ |
Distance- | +-------------------+ | | | | |
Vector | | | | | | | | +
South | | | | +--------+ | | | Link-State
+ | | | | | | | | Flooding
| | | +----------------+ | | | North
v | | | | | | | | +
++---+-+ +------+ +-+----+ ++----++ |
|SPINE | |SPINE | | SPINE| | SPINE| | LEVEL 1
+ ++----++ ++---+-+ +-+--+-+ ++----++ |
+ | | | | | | | | | ^ N
Distance- | +-------+ | | +--------+ | | | E
Vector | | | | | | | | | +------>
South | +-------+ | | | +------+ | | | |
+ | | | | | | | | | +
v ++--++ +-+-++ ++--++ ++--++ +
|LEAF| |LEAF| |LEAF| |LEAF| LEVEL 0
+----+ +----+ +----+ +----+
]]></artwork>
</figure>
<t>A spine node only has information necessary for its level, which is all
destinations south of the node based on SPF calculation, the default route, and
potentially disaggregated routes.
</t>
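<t>The per-level view described above can be illustrated with a minimal Python sketch. It is not taken from the RIFT specification: the six-node fabric, the prefixes, and the names SOUTH_LINKS, LEAF_PREFIXES, southern_prefixes, and routing_view are all assumptions made for illustration, and disaggregated routes are deliberately left out.</t>
<sourcecode type="python"><![CDATA[
# Hypothetical three-level fabric: node -> list of southbound neighbors.
SOUTH_LINKS = {
    "tof1": ["spine1", "spine2"],
    "tof2": ["spine1", "spine2"],
    "spine1": ["leaf1", "leaf2"],
    "spine2": ["leaf1", "leaf2"],
    "leaf1": [],
    "leaf2": [],
}

# Hypothetical prefixes attached to the leaves.
LEAF_PREFIXES = {
    "leaf1": ["10.0.1.0/24"],
    "leaf2": ["10.0.2.0/24"],
}


def southern_prefixes(node):
    """Prefixes a node learns through northbound flooding, i.e. the
    prefixes of every node reachable by walking southbound links."""
    seen = set()
    stack = list(SOUTH_LINKS[node])
    prefixes = set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        prefixes.update(LEAF_PREFIXES.get(current, []))
        stack.extend(SOUTH_LINKS[current])
    return prefixes


def routing_view(node):
    """Destinations a node holds: its own prefixes, everything south of
    it, and a default route if it has at least one northbound neighbor."""
    view = set(LEAF_PREFIXES.get(node, [])) | southern_prefixes(node)
    has_north = any(node in south for south in SOUTH_LINKS.values())
    if has_north:
        view.add("0.0.0.0/0")  # fabric default received from the north
    return sorted(view)
]]></sourcecode>
<t>In this toy fabric, routing_view("leaf1") holds only the leaf's own prefix and the default route, while routing_view("tof1") holds every leaf prefix and no default route, which mirrors the asymmetry between the northbound and southbound directions.</t>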
<!-- [rfced] May we specify "link-state" and "distance-vector" for clarity in
the following instances?
Original:
RIFT combines the advantage of both link-state and distance-vector...
RIFT also eliminates major disadvantages of link-state and distance-vector
with...
Perhaps:
RIFT combines the advantages of both link-state and distance-vector
protocols...
RIFT also eliminates major disadvantages of link-state and distance-vector
protocols...
-->
<t>RIFT combines the advantages of both link-state and distance-vector protocols:
</t>
<ul spacing="normal">
<li>Fastest possible convergence</li>
<li>Automatic detection of topology</li>
<li>Minimal routes/information on Top-of-Rack (ToR) switches, aka leaf nodes</li>
<li>High degree of ECMP</li>
<li>Fast decommissioning of nodes</li>
<li>Maximum propagation speed with flexible prefixes in an update</li>
</ul>
<t>There are two types of link-state databases: North Topology
Information Elements (N-TIEs) and South
Topology Information Elements (S-TIEs). The N-TIEs contain a link-state
topology description of lower levels, and the S-TIEs simply carry default and
disaggregated routes for the lower levels.
</t>
<t>RIFT also eliminates major disadvantages of link-state and distance-vector protocols with the following:
</t>
<ul spacing="normal">
<li>Reduced and balanced flooding</li>
<li>Level-constrained automatic neighbor discovery</li>
</ul>
<t>To achieve this, RIFT builds on the art of IGPs, not only OSPF and IS-IS but also those developed for Mobile Ad Hoc Networks (MANETs) and the Internet of Things (IoT), to provide unique features:
</t>
<ul spacing="normal">
<li>Automatic (positive or negative) route disaggregation of northward routes upon fallen leaves</li>
<li>Recursive operation in the case of negative route
disaggregation</li>
<li>Anisotropic routing that extends a principle seen in the <xref target='RFC6550'>Routing Protocol for Low-Power and Lossy Networks (RPL)</xref> to wide superspines</li>
<li>Optimal flooding reduction that derives from the concept of a "multipoint relay" (MPR) found in <xref target='RFC3626'>Optimized Link State Routing (OLSR)</xref> and
balances the flooding load over northbound links and nodes</li>
</ul>
<t>Additional advantages that are unique to RIFT are listed below. The details of these advantages can be found in <xref target='RFC9692'>RIFT</xref>.
</t>
<ul spacing="normal">
<li>True ZTP</li>
<li>Minimal blast radius on failures</li>
<li>Can utilize all paths through fabric without looping</li>
<li>Simple leaf implementation that can scale down to servers</li>
<li>Key-value store</li>
<li>Horizontal links used for protection only</li>
<!-- [rfced] Some author comments are present in the XML. Please confirm that
no updates related to these comments are outstanding. Note that the
comments will be deleted prior to publication.
-->
<!--li>Supports non-equal cost multipath and can replace multi-chassis link aggregation group (MLAG or MC-LAG)</li-->
</ul>
</section>
<section><name>Applicable Topologies</name>
<t>
Although RIFT is specified primarily for "proper" Clos or Fat Tree topologies,
the protocol natively supports Points of Delivery (PoD) concepts, which, strictly speaking, are not found in the original Clos concept.
</t>
<t>Further, the specification explains and supports operations of multi-plane
Clos variants where the protocol recommends the use of inter-plane rings at the
ToF level to allow the reconciliation of topology view of different planes
to make Negative Disaggregation viable in case of failures within a plane.
These observations hold not only in case of RIFT but also in the generic
case of dynamic routing on Clos variants with multiple planes and failures
in bisectional bandwidth, especially on the leaves.
</t>
<section><name>Horizontal Links</name>
<t>
RIFT is not limited to pure Clos divided into PoD and multi-planes but
supports horizontal (East-West) links below the ToF level. Those links
are used only for last-resort northbound forwarding when a spine loses all its
northbound links or cannot compute a default route through them.
</t>
<!-- [rfced] May we update the following sentence for clarity? Additionally,
should "employed" be updated to "deployed"? We note that this is the only
instance of "employed" that appears in the document.
Original:
A full-mesh connectivity between nodes on the same level can be
employed and that allows N-SPF to provide for any node losing all its
northbound adjacencies (as long as any of the other nodes in the
level are northbound connected) to still participate in northbound
forwarding.
Perhaps:
A full-mesh connectivity between nodes on the same level can be
deployed, which allows North SPF (N-SPF) to provide for any node losing all its
northbound adjacencies (as long as any of the other nodes in the
level are northbound connected) and still participate in northbound
forwarding.
-->
<t>Full-mesh connectivity between nodes on the same level can be employed.
It allows North SPF (N-SPF) to let a node that has lost all its
northbound adjacencies still participate in northbound forwarding, as long as
any of the other nodes in the level is northbound connected.
</t>
<t>Note that a "ring" of horizontal links at any level below ToF does not provide a "ring-based protection" scheme since the SPF computation would necessarily have to deal with breaking "loops", an application for which RIFT is not intended.
</t>
</section>
<section><name>Vertical Shortcuts</name>
<t>
Through relaxations of the specified adjacency forming rules, RIFT implementations can be extended to support vertical "shortcuts". The RIFT specification
itself does not provide the exact details since the resulting solution suffers from
either a much larger blast radius with increased flooding volumes or
bow-tie problems in the case of maximum aggregation routing.
</t>
</section>
<section><name>Generalizing to Any Directed Acyclic Graph</name>
<t>
RIFT is an anisotropic routing protocol, meaning that it has a sense of direction (northbound, southbound, East-West) and operates differently depending on the direction.
</t>
<t>
Since a DAG provides a sense of north (the
direction of the DAG) and of south (the reverse), it can be used to
apply RIFT: a vertex in the DAG that has only incoming edges is a
ToF node.
</t><t>
There are a number of caveats though:
</t>
<ul spacing="normal">
<li>The DAG structure must exist before RIFT starts, so there is a need for a companion protocol to establish the logical DAG structure.
</li>
<li>A generic DAG does not have a sense of East and West. The operation specified for East-West links and the southbound reflection between nodes are not applicable.
Also, ZTP will derive a sense of depth that will eliminate some links. Variations of ZTP could be derived to meet specific objectives, e.g., making it so that most routers have at least two parents to reach the ToF.
</li>
<li>
RIFT applies to any Destination-Oriented DAG (DODAG) where there's only one ToF node and the problem of disaggregation does not exist.
<!-- [rfced] Should "Link State" be specified as "link-state protocols" here?
Original:
In that case, RIFT operates very much like RPL [RFC6550], but using
Link State for southbound routes (downwards in RPL's terms).
Perhaps:
In that case, RIFT operates very much like RPL [RFC6550], but uses
link-state protocols for southbound routes (downwards in RPL's terms).
-->
In that case, RIFT
operates very much like RPL <xref target='RFC6550'/> but uses link state for southbound routes (downwards in RPL's terms).
For an arbitrary DAG with multiple destinations (ToFs), the way disaggregation happens has to be considered.
</li>
<li>Positive Disaggregation expects that most of the ToF nodes reach most of the leaves, so disaggregation is the exception as opposed to the rule. When this is no longer true, it makes sense to turn off disaggregation and route between the ToF nodes over a ring, a full mesh, a transit network, or a form of area zero. Then again, this operation is similar to RPL operating as a single DODAG with a virtual root.
</li>
<li>
In order to aggregate and disaggregate routes, RIFT requires that all the ToF nodes share the full knowledge of the prefixes in the fabric. This can be achieved with a ring as suggested by <xref target='RFC9692'>RIFT</xref>, by some preconfiguration, or by synchronizing with a common repository where all the active prefixes are registered.
</li>
</ul>
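<t>As a rough illustration of the ToF and DODAG notions used in the caveats above, the following Python sketch finds the vertices of a graph that have no northbound edge and checks whether there is exactly one of them. The graph encoding and the names tof_nodes, is_dodag, SINGLE_ROOT, and TWO_ROOTS are assumptions made for this example only; the sketch also assumes the input is already acyclic and does not verify that.</t>
<sourcecode type="python"><![CDATA[
def tof_nodes(north_edges):
    """Return the vertices without any northbound edge.

    north_edges maps each node to the list of its northbound
    neighbors (parents). Nodes that appear only as parents are
    treated as having no northbound edges of their own.
    """
    nodes = set(north_edges)
    for parents in north_edges.values():
        nodes.update(parents)
    return sorted(n for n in nodes if not north_edges.get(n))


def is_dodag(north_edges):
    """True when exactly one vertex sits at the top of the graph."""
    return len(tof_nodes(north_edges)) == 1


# Hypothetical graph with a single ToF: disaggregation is not needed.
SINGLE_ROOT = {
    "leaf1": ["spine1", "spine2"],
    "leaf2": ["spine1", "spine2"],
    "spine1": ["tof1"],
    "spine2": ["tof1"],
}

# Hypothetical graph with two ToFs: disaggregation has to be considered.
TWO_ROOTS = {
    "leaf1": ["spine1"],
    "spine1": ["tof1", "tof2"],
}
]]></sourcecode>
<t>With these inputs, is_dodag(SINGLE_ROOT) holds and is_dodag(TWO_ROOTS) does not, which separates the case where disaggregation does not arise from the case where it has to be considered.</t>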
</section>
<section anchor="onastick"><name>Reachability of Internal Nodes in the Fabric</name>
<t>RIFT does not require that nodes have reachable addresses in the fabric,
though it is clearly desirable for operational purposes. Under normal operating
conditions, this can be easily achieved by injecting the node's loopback
address into North and South Prefix TIEs or other implementation-specific
mechanisms.
</t>
<t>
Special considerations arise when a node loses all northbound adjacencies
but is not at the top of the fabric. If a spine node loses all northbound links, the spine node doesn't advertise a default route. But if the level of the spine node is auto-determined by ZTP, it will "fall down" as depicted in <xref target='Fallen-spine'/>.
</t>
</section>
</section>
<section><name>Use Cases</name>
<section><name>Data Center Topologies</name>
<section><name>Data Center Fabrics</name>
<t>
<!-- [rfced] May we rephrase the following sentence for ease of the reader?
Original:
RIFT is suited for applying in data center (DC) IP fabrics underlay
routing, vast majority of which seem to be currently (and for the foreseeable
future) Clos architectures.
Perhaps:
RIFT is suited for applying underlay routing in data center (DC) IP
fabrics, with the vast majority of these IP fabrics being Clos architectures
(and will be for the foreseeable future).
-->
RIFT is suited for underlay routing in data center (DC) IP fabrics, the vast
majority of which seem to be currently (and for the foreseeable future)
Clos architectures. It significantly simplifies the operation and deployment
of such fabrics, as described in <xref target='opex'/>, compared to environments
that rely on extensive proprietary provisioning and operational solutions.
</t>
</section>
<section><name>Adaptations to Other Proposed Data Center Topologies</name>
<figure align='center' anchor='levelshortcuts'><name>Level Shortcut</name>
<artwork align='center'><![CDATA[
. +-----+ +-----+
. | | | |
.+-+ S0 | | S1 |
.| ++---++ ++---++
.| | | | |
.| | +------------+ |
.| | | +------------+ |
.| | | | |
.| ++-+--+ +--+-++
.| | | | |
.| | A0 | | A1 |
.| +-+--++ ++---++
.| | | | |
.| | +------------+ |
.| | +-----------+ | |
.| | | | |
.| +-+-+-+ +--+-++
.+-+ | | |
. | L0 | | L1 |
. +-----+ +-----+
]]></artwork>
</figure>
<t>
RIFT is not strictly limited to Clos topologies. The protocol only
requires a sense of "compass rose directionality" either achieved
through configuration or derivation of levels.
So conceptually, shortcuts between levels could be included.
<xref target="levelshortcuts"/> depicts an example of a shortcut
between levels. In this example, suboptimal routing will
occur when traffic is sent from L0 to L1 via S0's
default route and back down through A0 or A1.
In order to avoid that, only default routes from A0 or A1
are used. All leaves would be required to install each other's routes.
</t>
<t>
While various technical and operational challenges may require the use of such modifications,
discussion of those topics is outside the scope of this document.
</t>
</section>
</section>
<section><name>Metro Networks</name>
<t>
The demand for bandwidth is increasing steadily, driven primarily by
environments close to
content producers (server farms connected via DC fabrics) but in
proximity to content consumers as well.
Consumers are often clustered in metro areas with their own network
architectures that can benefit
from simplified, regular Clos structures. Thus, they can also benefit from RIFT.
</t>
</section>
<section><name>Building Cabling</name>
<t>
Commercial edifices are often cabled in topologies that are
either Clos or its isomorphic equivalents. The
Clos can grow rather high with many levels. That presents a challenge
for traditional routing protocols (except BGP <xref target='RFC4271'/> and Private Network-Network Interface (PNNI) <xref target='PNNI'/>, which is largely
phased out by now) that do not support
an arbitrary number of levels, which RIFT does naturally. Moreover, due to the limited sizes of forwarding tables in network elements of building cabling, the minimum FIB size RIFT maintains under normal conditions is cost-effective in terms of hardware and operational costs.
</t>
</section>
<section><name>Internal Router Switching Fabrics</name>
<t>
It is common in high-speed communications switching and routing
devices to use switch fabrics that are interconnection networks inside the devices connecting the input ports to their output ports. For example, a crossbar is one of the switch fabric techniques, even though it is not feasible due to cost, head-of-line blocking, or size trade-offs. Normally, such fabrics are not self-healing or rely on 1:1 or 1+1 protection schemes, but it is conceivable to use RIFT to operate Clos fabrics that can deal effectively with interconnections
or subsystem failures in such a module. RIFT is not IP specific; hence,
any link addressing connecting internal device subnets is
conceivable.
</t>
</section>
<section><name>CloudCO</name>
<t>
The Cloud Central Office (CloudCO) is a new stage of the telecom Central Office. It takes advantage of Software-Defined Networking (SDN) and Network Function Virtualization (NFV) in conjunction with general-purpose hardware to optimize current networks.
The following figure illustrates this architecture at a high level. It describes a single instance or macro-node of CloudCO that provides a number of value-added services (VASes), a Broadband Access Abstraction (BAA), and virtualized network services. An Access I/O module faces a CloudCO access node and the Customer Premises Equipment (CPE) behind it. A Network I/O module faces the core network.
<!-- [rfced] To match [TR-384], may we update "leaf and spine fabric" to
"leaf-spine fabric"?
Original:
The two I/O modules are interconnected by a leaf and spine fabric [TR-384].
Perhaps:
The two I/O modules are interconnected by a leaf-spine fabric [TR-384].
-->
The two I/O modules are interconnected by a leaf and spine fabric <xref target='TR-384'/>.
</t>
<figure align='center' anchor='pic-CloudCO'><name>CloudCO Architecture Example</name>
<artwork align='center'><![CDATA[
+---------------------+ +----------------------+
| Spine | | Spine |
| Switch | | Switch |
+------+---+------+-+-+ +--+-+-+-+-----+-------+
| | | | | | | | | | | |
| | | | | +-------------------------------+ |
| | | | | | | | | | | |
| | | | +-------------------------+ | | |
| | | | | | | | | | | |
| | +----------------------+ | | | | | | | |
| | | | | | | | | | | |
| +---------------------------------+ | | | | | | |
| | | | | | | | | | | |
| | | +-----------------------------+ | | | | |
| | | | | | | | | | | |
| | | | | +--------------------+ | | | |
| | | | | | | | | | | |
+--+ +-+---+--+ +-+---+--+ +--+----+--+ +-+--+--+ +--+
|L | | Leaf | | Leaf | | Leaf | | Leaf | |L |
|S | | Switch | | Switch | | Switch | | Switch| |S |
++-+ +-+-+-+--+ +-+-+-+--+ +--+-+--+--+ ++-+--+-+ +-++
| | | | | | | | | | | | | |
| +-+-+-+--+ +-+-+-+--+ +--+-+--+--+ ++-+--+-+ |
| |Compute | |Compute | | Compute | |Compute| |
| |Node | |Node | | Node | |Node | |
| +--------+ +--------+ +----------+ +-------+ |
| || VAS5 || || vDHCP|| || vRouter|| ||VAS1 || |
| |--------| |--------| |----------| |-------| |
| |--------| |--------| |----------| |-------| |
| || VAS6 || || VAS3 || || v802.1x|| ||VAS2 || |
| |--------| |--------| |----------| |-------| |
| |--------| |--------| |----------| |-------| |
| || VAS7 || || VAS4 || || vIGMP || ||BAA || |
| |--------| |--------| |----------| |-------| |
| +--------+ +--------+ +----------+ +-------+ |
| |
++-----------+ +---------++
|Network I/O | |Access I/O|
+------------+ +----------+
]]></artwork>
</figure>
<t>
The Spine-Leaf architecture deployed inside CloudCO meets the network requirements of being adaptable, agile, scalable, and dynamic.
</t>
</section>
</section>
</section>
<section anchor='opex'><name>Operational Considerations</name>
<t>
RIFT offers organizations that build and operate
IP fabrics features that simplify operation and deployment while achieving
many desirable
properties of a dynamic routing protocol on such a substrate:
</t>
<ul spacing="normal">
<li>
RIFT only floods routing information to the devices that need it.
</li>
<li>
RIFT allows for Zero Touch Provisioning (ZTP) within the protocol.
In its most extreme version, RIFT does not rely on any specific addressing
and can operate using <xref target='RFC4861'>IPv6 Neighbor Discovery (ND)</xref> only for IP fabric.
</li>
<li>
RIFT has provisions to detect common IP fabric miscabling scenarios.
</li>
<li>
RIFT automatically negotiates Bidirectional Forwarding Detection (BFD) per link. This allows for IP and <xref target='RFC7130'>micro-BFD</xref> to replace Link Aggregation Groups (LAGs) that hide bandwidth
imbalances in case of constituent failures. Further automatic link validation
techniques similar to those in <xref target='RFC5357'/> could be supported as well.
</li>
<li>
RIFT inherently solves many problems associated with the use of
traditional routing topologies with dense meshes and high degrees of ECMP by
including automatic bandwidth balancing, flood reduction, and automatic
disaggregation on failures while providing maximum aggregation of prefixes
in default scenarios. ECMP in RIFT eliminates the need for more Loop-Free Alternate (LFA) procedures.
</li>
<li>
RIFT reduces FIB size towards the bottom of the IP fabric, where most nodes
reside. This allows for cheaper hardware on the edges and the introduction
of modern IP fabric architectures that encompass, e.g., server multihoming.
</li>
<li>
RIFT provides valley-free routing and is therefore loop-free. A valley-free path allows for reversal of direction at most once, from a packet heading northbound to southbound, while permitting traversal of horizontal links in the northbound phase. This allows for the use of any such valley-free path in bisectional fabric bandwidth between two destinations irrespective of their metrics, which can be used to balance load on the fabric in different ways. Valley-free routing eliminates the need for any specific micro-loop avoidance procedures for RIFT.
</li>
<li>
RIFT includes a key-value distribution mechanism
that allows for future applications
such as automatic provisioning of basic overlay services or automatic key
rollovers over whole fabrics.
</li>
<li>
RIFT is designed for minimum delay in case of prefix mobility on the fabric. In
conjunction with <xref target='RFC8505'/>, RIFT can differentiate anycast advertisements from mobility events and retain only the most recent advertisement in the latter case.
</li>
<li>
Many further operational and design points collected over many years of
routing protocol deployments have been incorporated in RIFT such as
fast flooding rates, protection of information lifetimes, and operationally
recognizable remote ends of links and node names.
</li>
</ul>
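<t>The valley-free property listed above can be sketched as a check over the sequence of levels that a path visits. This is a minimal illustration and not part of RIFT itself; the function name is_valley_free and the representation of a path as a list of node levels are assumptions made for this example.</t>
<sourcecode type="python"><![CDATA[
```python
def is_valley_free(levels):
    """Return True if a path, given as the list of node levels it
    visits, reverses direction at most once (north, then south).

    Horizontal hops (same level) are tolerated only while the packet
    is still in its northbound phase.
    """
    heading_south = False
    for prev, cur in zip(levels, levels[1:]):
        if cur > prev:            # northbound hop
            if heading_south:     # climbing again after descending: a valley
                return False
        elif cur == prev:         # horizontal (East-West) hop
            if heading_south:
                return False
        else:                     # southbound hop
            heading_south = True
    return True
```
]]></sourcecode>
<t>Under these assumptions, a leaf-spine-ToF-spine-leaf path with levels [0, 1, 2, 1, 0] is accepted, while [0, 1, 0, 1, 0] is rejected because the packet turns north again after it has started to descend.</t>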
<section><name>South Reflection</name>
<t>South reflection is a mechanism where South Node TIEs are "reflected"
back up north to allow nodes in the same level without East-West links to "see"
each other.
</t>
<t>For example, in <xref target='pic-suboptimal'/>, Spine111\Spine112\Spine121\Spine122 reflect Node S-TIEs
from ToF21 to ToF22 separately. Respectively, Spine111\Spine112\Spine121\Spine122 reflect Node
S-TIEs from ToF22 to ToF21 separately, so ToF22 and ToF21 see each other's
node information as level 2 nodes.
</t>
<t>In an equivalent fashion, as the result of the south reflection between Spine121-Leaf121-Spine122
and Spine121-Leaf122-Spine122, Spine121 and Spine122 know each other at
level 1.
</t>
</section>
<section><name>Suboptimal Routing on Link Failures</name>
<figure align='center' anchor='pic-suboptimal'><name>Suboptimal Routing Upon Link Failure Use Case</name>
<artwork align='center'><![CDATA[
+--------+ +--------+
| ToF21 | | ToF22 | LEVEL 2
++--+-+-++ ++-+--+-++
| | | | | | | +
| | | | | | | linkTS8
+------------+ | +-+linkTS3+-+ | | | +-------------+
| | | | | | + |
| +---------------------------+ | linkTS7 |
| | | | + + + |
| | | +-------+linkTS4+------------+ |
| | | + + | | |
| | | +-------------+--+ | |
| | | | | linkTS6 | |
+-+----+-+ +-+----+-+ ++--------+ +-+----+-+
|Spine111| |Spine112| |Spine121 | |Spine122| LEVEL 1
+-+---+--+ +-+----+-+ +-+---+---+ +-+----+-+
| | | | | | | |
| +-------------+ | + ++XX+linkSL6+---+ +
| | | | linkSL5 | | linkSL8
| +-----------+ | | + +---+linkSL7+-+ | +
| | | | | | | |
+-+---+-+ +--+--+-+ +-+---+-+ +--+--+-+
|Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0
+-+-----+ +-+-----+ +-----+-+ +-+-----+
+ + + +
Prefix111 Prefix112 Prefix121 Prefix122
]]></artwork>
</figure>
<t>As shown in <xref target='pic-suboptimal'/>, as the result of the south reflection between
Spine121-Leaf121-Spine122 and Spine121-Leaf122-Spine122, Spine121 and
Spine122 know each other at level 1.</t>
<t>Without disaggregation mechanisms, when linkSL6 fails, a packet from
Leaf121 to prefix122 will probably either go up through linkSL5 to linkTS3 and then
down through linkTS4 and linkSL8 to Leaf122, or go up through linkSL5 to linkTS6
and then down through linkTS8 and linkSL8 to Leaf122, based on the pure default route.
This is the case of suboptimal routing or bow tying.</t>
<t>With disaggregation mechanisms, when linkSL6 fails, Spine122 will detect the
failure according to the reflected node S-TIE from Spine121. Based on the
disaggregation algorithm provided by RIFT, Spine122 will explicitly advertise
prefix122 in Disaggregated Prefix S-TIE PrefixTIEElement(prefix122, cost 1). The packet
from Leaf121 to prefix122 will only be sent to linkSL7 following a longest-prefix
match to prefix122 directly, then it will go down through linkSL8 to Leaf122.
</t>
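<t>The effect of the disaggregated prefix on Leaf121 can be illustrated with a small longest-prefix-match lookup. This is a sketch under stated assumptions: the lookup function, the dictionary-based FIB, and the documentation address 2001:db8:122::/64 standing in for prefix122 are all hypothetical and chosen only for this example.</t>
<sourcecode type="python"><![CDATA[
```python
import ipaddress


def lookup(fib, destination):
    """Longest-prefix match: return the next hops of the most specific
    FIB entry that covers the destination address."""
    addr = ipaddress.ip_address(destination)
    matches = [net for net in fib if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return fib[best]


# Hypothetical addressing: prefix122 is modeled as 2001:db8:122::/64.
default = ipaddress.ip_network("::/0")
prefix122 = ipaddress.ip_network("2001:db8:122::/64")

# Leaf121 before disaggregation: only the default route over both uplinks.
fib = {default: ["linkSL5", "linkSL7"]}
before = lookup(fib, "2001:db8:122::1")

# After Spine122 advertises prefix122 southbound, Leaf121 installs the
# more specific route via linkSL7 only.
fib[prefix122] = ["linkSL7"]
after = lookup(fib, "2001:db8:122::1")
```
]]></sourcecode>
<t>In this sketch, traffic to prefix122 is spread over linkSL5 and linkSL7 before the failure is signaled and is pinned to linkSL7 afterwards, while all other destinations keep following the default route.</t>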
</section>
<section><name>Black-Holing on Link Failures</name>
<figure align='center' anchor='pic-blackhole'><name>Black-Holing Upon Link Failure Use Case</name>
<artwork align='center'><![CDATA[
+--------+ +--------+
| ToF 21 | | ToF 22 | LEVEL 2
++-+--+-++ ++-+--+-++
| | | | | | | +
| | | | | | | linkTS8
+--------------+ | +-+linkTS3+X+ | | | +--------------+
linkTS1 | | | | | + |
+ +-----------------------------+ | linkTS7 |
| | + | + + + |
| | linkTS2 +-------+linkTS4+X+----------+ |
| + + + + | | |
| linkTS5 +-+ +------------+--+ | |
| + | | | linkTS6 | |
+-+----+-+ +-+----+-+ ++-------+ +-+-----++
|Spine111| |Spine112| |Spine121| |Spine122| LEVEL 1
+-+---+--+ ++----+--+ +-+---+--+ +-+----+-+
| | | | | | | |
+ +---------------+ | + +---+linkSL6+---+ +
linkSL1 | | | linkSL5 | | linkSL8
+ +--+linkSL3+--+ | | + +---+linkSL7+-+ | +
| | | | | | | |
+-+---+-+ +--+--+-+ +-+---+-+ +--+--+-+
|Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0
+-+-----+ +-+-----+ +-----+-+ +-----+-+
+ + + +
Prefix111 Prefix112 Prefix121 Prefix122
]]></artwork>
</figure>
<t>This scenario illustrates a case where a double link failure occurs and
black-holing can result.</t>
<t>Without disaggregation mechanisms, when linkTS3 and linkTS4 both fail,
a packet from Leaf111 to prefix122 would suffer 50% black-holing based
on the pure default route. A packet that goes up through linkSL1 to
linkTS1 and is then supposed to go down through linkTS3 or linkTS4 will be dropped. A
packet that goes up through linkSL3 to linkTS2 and is then supposed to go down through
linkTS3 or linkTS4 will be dropped as well. This is the case of black-holing.</t>
<t>With disaggregation mechanisms, when linkTS3 and linkTS4 both fail, ToF22 will
detect the failure according to the reflected node S-TIE of ToF21 from
Spine111\Spine112. Based on the disaggregation algorithm
provided by RIFT, ToF22 will explicitly originate an S-TIE with prefix 121 and
prefix 122 that is flooded to spines 111, 112, 121, and 122.</t>
<t>The packet from leaf111 to prefix122 will not be routed to linkTS1 or
linkTS2. The packet from leaf111 to prefix122 will only be routed to linkTS5
or linkTS7 following a longest-prefix match to prefix122.</t>
</section>
<section><name>Zero Touch Provisioning (ZTP)</name>
<t>
RIFT is designed to require a very minimal configuration to simplify its operation and avoid human errors; based on that minimal information, ZTP autoconfigures the key operational parameters of all the RIFT nodes, including the System ID of the node, which must be unique in the RIFT network, and the level of the node in the Fat Tree, which determines which peers are northward "parents" and which are southward "children".
</t>
<t>
ZTP is always on, but its decisions can be overridden when a network administrator prefers to impose their own configuration. In that case, it is the responsibility of the administrator to ensure that the configured parameters are correct,
i.e., ensure that the System ID of each node is unique and that the administratively set levels truly reflect the relative position of the nodes in the fabric.
It is recommended to let ZTP configure the network. When the administrator configures levels manually instead, it is recommended to
configure the level of all the nodes to avoid an undesirable interaction between ZTP and the manual configuration.
</t>
<t>ZTP requires that the administrator points out the ToF nodes to set the
baseline from which the fabric topology is derived. The ToF nodes are
configured with the TOP_OF_FABRIC flag and serve as the initial 'seeds' needed for
other ZTP nodes to derive their level in the topology. ZTP computes the level
of each node based on the Highest Available Level (HAL) of the potential
parents closest to that baseline, which represents the superspine. In a
way, RIFT can be seen as a distance-vector protocol that computes a set of
feasible successors towards the superspine and autoconfigures the rest of the
topology.
</t>
<t>
The autoconfiguration mechanism computes a global maximum of levels by
diffusion. The derivation of the level of each node then happens based on Link Information Elements (LIEs)
received from its
neighbors, whereas each node (with possible exceptions
of configured leaves) tries to attach at the highest possible point in the
fabric. This guarantees that even if the diffusion front reaches a node from
"below" faster than from "above", it will greedily abandon already negotiated
levels derived from nodes topologically below it and properly peer with nodes
above.
</t>
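<t>The diffusion of levels from the ToF seeds can be sketched as a simple fixed-point iteration. This is an illustrative simplification and not the ZTP state machine: the function derive_levels, the link-list input, and the default top level of 2 (matching the figures in this document) are assumptions, as is the rule that a node settles one level below the highest level offered by any neighbor.</t>
<sourcecode type="python"><![CDATA[
```python
def derive_levels(links, tof_nodes, top_level=2):
    """Diffuse levels from the ToF seeds.

    links: iterable of (node_a, node_b) adjacencies.
    tof_nodes: nodes configured with the TOP_OF_FABRIC flag.
    Each non-ToF node takes the highest level offered by any neighbor
    (its HAL) minus one; this "minus one" rule is an assumption of the
    sketch. Iteration stops when no level changes.
    """
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    levels = {node: top_level for node in tof_nodes}
    changed = True
    while changed:
        changed = False
        for node, peers in neighbors.items():
            if node in tof_nodes:
                continue
            offers = [levels[p] for p in peers if p in levels]
            if not offers:
                continue          # no neighbor has a level yet
            derived = max(offers) - 1
            if levels.get(node) != derived:
                levels[node] = derived   # re-attach at the highest point
                changed = True
    return levels
```
]]></sourcecode>
<t>Because a node always recomputes its level from the best current offer, a level first learned from "below" is replaced as soon as a higher offer arrives from "above", which mirrors the greedy behavior described in the text. A result below zero in this sketch would correspond to a node that cannot derive a valid level.</t>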
<t>
The achieved equilibrium can be disturbed massively by all nodes with the highest level either leaving or entering the domain (with some finer distinctions not explained further).
It is therefore recommended that each node is multihomed towards nodes with respective HAL offerings. Fortunately, this is the natural state of things for the topology variants considered in RIFT.
</t>
<t>
A RIFT node may also be configured to confine it to the leaf role with the LEAF_ONLY flag. A leaf node can also be configured to support leaf-2-leaf procedures with the LEAF_2_LEAF flag. In both cases, the node cannot be TOP_OF_FABRIC and its level cannot be configured. RIFT will fully determine the node's level after it is attached to the topology and ensure that the node is at the "bottom of the hierarchy" (southernmost).
</t>
</section>
<section><name>Miscabling</name>
<section><name>Miscabling Examples</name>
<figure align='center' anchor='single-plane-mis-cabling'><name>A Single-Plane Miscabling Example</name>
<artwork align='center'><![CDATA[
+----------------+ +-----------------+
| ToF21 | +------+ ToF22 | LEVEL 2
+-------+----+---+ | +----+---+--------+
| | | | | | | | |
| | | +----------------------------+ |
| +---------------------------+ | | | |
| | | | | | | | |
| | | | +-----------------------+ | |
| | +------------------------+ | | |
| | | | | | | | |
+-+---+--+ +-+---+--+ | +--+---+-+ +--+---+-+
|Spine111| |Spine112| | |Spine121| |Spine122| LEVEL 1
+-+---+--+ ++----+--+ | +--+---+-+ +-+----+-+
| | | | | | | | |
| +---------+ | link-M | +---------+ |
| | | | | | | | |
| +-------+ | | | | +-------+ | |
| | | | | | | | |
+-+---+-+ +--+--+-+ | +-+---+-+ +--+--+-+
|Leaf111| |Leaf112+-----+ |Leaf121| |Leaf122| LEVEL 0
+-------+ +-------+ +-------+ +-------+
]]></artwork>
</figure>
<t><xref target='single-plane-mis-cabling'/> shows a single-plane miscabling example. It's a perfect Fat Tree fabric except for link-M connecting Leaf112 to ToF22.
</t>
<t>The RIFT control protocol can discover the physical links automatically and is able to detect cabling that violates Fat Tree topology constraints.
It reacts accordingly to such miscabling attempts, at a minimum preventing adjacencies between nodes from being formed and traffic from being forwarded on those miscabled links.
In such a scenario, Leaf112 will use link-M to derive its level (unless it is a leaf) and can report links to Spine111 and Spine112 as miscabled unless the implementation
allows horizontal links.
</t>
<t><xref target='multi-plane-mis-cabling'/> shows a multi-plane miscabling example. Since Leaf112 and Spine121 belong to two different PoDs, the adjacency between Leaf112 and Spine121 cannot be formed. Link-W would be detected and prevented.
</t>
<figure align='center' anchor='multi-plane-mis-cabling'><name>A Multiple Plane Miscabling Example</name>
<artwork align='center'><![CDATA[
+-------+ +-------+ +-------+ +-------+
|ToF A1| |ToF A2| |ToF B1| |ToF B2| LEVEL 2
+-------+ +-------+ +-------+ +-------+
| | | | | | | |
| | | +-----------------+ | | |
| +--------------------------+ | | | |
| +------+ | | | +------+ |
| | +-----------------+ | | | | |
| | | +--------------------------+ | |
| A | | B | | A | | B |
+-----+--+ +-+---+--+ +--+---+-+ +--+-----+
|Spine111| |Spine112| +---+Spine121| |Spine122| LEVEL 1
+-+---+--+ ++----+--+ | +--+---+-+ +-+----+-+
| | | | | | | | |
| +---------+ | | | +---------+ |
| | | | link-W | | | |
| +-------+ | | | | +-------+ | |
| | | | | | | | |
+-+---+-+ +--+--+-+ | +-+---+-+ +--+--+-+
|Leaf111| |Leaf112+------+ |Leaf121| |Leaf122| LEVEL 0
+-------+ +-------+ +-------+ +-------+
+--------PoD#1----------+ +---------PoD#2---------+
]]></artwork>
</figure>
<t>RIFT provides an optional level determination procedure in its ZTP mode. Nodes in the fabric without
their level configured determine it automatically. However, this can have counter-intuitive consequences.
One extreme failure scenario is depicted in <xref target='Fallen-spine'/>: if all northbound links of Spine11 fail at the same time,
Spine11 negotiates a lower level than Leaf111 and Leaf112.
</t>
<t>To prevent such a scenario where leaves are expected to act as switches, the LEAF_ONLY flag can be set for Leaf111 and Leaf112.
Since level -1 is invalid, Spine11 would not derive a valid level from the topology in <xref target='Fallen-spine'/>. It will be isolated from the whole fabric,
and it would be up to the leaves to declare the links towards such a spine as miscabled.
</t>
<figure align='center' anchor='Fallen-spine'><name>Fallen Spine</name>
<artwork align='center'><![CDATA[
+-------+ +-------+ +-------+ +-------+
|ToF A1| |ToF A2| |ToF A1| |ToF A2|
+-------+ +-------+ +-------+ +-------+
| | | | | |
| +-------+ | | |
+ + | | ====> | |
X X +------+ | +------+ |
+ + | | | |
+----+--+ +-+-----+ +-+-----+
|Spine11| |Spine12| |Spine12|
+-+---+-+ ++----+-+ ++----+-+
| | | | | |
| +---------+ | | |
| +-------+ | | +-------+ |
| | | | | |
+-+---+-+ +--+--+-+ +-----+-+ +-----+-+
|Leaf111| |Leaf112| |Leaf111| |Leaf112|
+-------+ +-------+ +-+-----+ +-+-----+
| |
| +--------+
| |
+-+---+-+
|Spine11|
+-------+
]]></artwork>
</figure>
</section>
<section><name>Miscabling Considerations</name>
<t>There are scenarios where operators may want to leverage ZTP and implement additional cabling constraints that go beyond the previously described topology violations. Enforcing cabling down to specific level, node, and port combinations might make it simpler for onsite staff to perform troubleshooting activities or replace optical transceivers and/or cabling as the physical layout will be consistent across the fabric. This is especially true for densely connected fabrics where it is difficult to physically manipulate those components. It is also easy to imagine other models, such as one where the strict port requirement is relaxed.
</t>
<t><xref target='miscalbe-cons'/> illustrates an example where the first port on Leaf1 must connect to the first port on Spine1, the second port on Leaf1 must connect to the first port on Spine2, and so on. Consider a case where (Leaf1, Port1) and (Leaf1, Port2) were reversed. RIFT would not consider this to be miscabled by default; however, an operator might want to.
</t>
<figure align='center' anchor='miscalbe-cons'><name>Port-Level Cabling Constraint Example</name>
<artwork align='center'><![CDATA[
+--------+ +--------+ +--------+ +--------+
| Spine1 | | Spine2 | | Spine3 | | Spine4 |
+-1------+ +-1------+ +-1------+ +-1------+
+ + + +
| +----------+ | |
| | | |
| | +---------------------+ |
| | | |
| | | +--------------------------------+
| | | |
| | | |
| | | |
| | | |
+ + + +
+-1--2--3--4--+
| Leaf1 | ......
+-------------+
]]></artwork>
</figure>
<t>RIFT allows implementations to provide programmable plug-ins that can adjust
ZTP operation or capture information during computation. While defining this
is outside the scope of this document, such a mechanism could be used to
extend the miscabling functionality.
</t>
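<t>As a rough illustration of what such an extension might verify, the following sketch compares observed adjacencies against an operator's cabling plan. It does not represent any RIFT plug-in interface: the function find_port_miscabling and the (node, port) dictionaries are hypothetical and only model the port-level constraint from the example above.</t>
<sourcecode type="python"><![CDATA[
```python
def find_port_miscabling(expected, observed):
    """Compare observed adjacencies against an operator's cabling plan.

    Both arguments map a local (node, port) to the remote (node, port).
    Returns a list of (local, expected_remote, observed_remote) tuples
    for every planned link that is wired differently or missing.
    """
    problems = []
    for local, want in sorted(expected.items()):
        got = observed.get(local)
        if got != want:
            problems.append((local, want, got))
    return problems


# Cabling plan from the example: Leaf1 port N connects to port 1 of SpineN.
expected = {("Leaf1", n): ("Spine%d" % n, 1) for n in range(1, 5)}

# Observed state with (Leaf1, Port1) and (Leaf1, Port2) reversed.
observed = dict(expected)
observed[("Leaf1", 1)], observed[("Leaf1", 2)] = (
    observed[("Leaf1", 2)], observed[("Leaf1", 1)])

problems = find_port_miscabling(expected, observed)
```
]]></sourcecode>
<t>With the two ports reversed, the sketch reports both Leaf1 ports together with the remote end each one should have reached, which is the kind of explicit hint that helps onsite staff move cables to the correct ports.</t>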
<t>For other protocols to achieve this, additional
operational overhead would be required. Consider a fabric that is using unnumbered OSPF links;
it is still very likely that a miscabled link will form an adjacency. Each
attempt to move cables to the correct port may result in the need for
additional troubleshooting, as other links will become miscabled in the
process. Without automation to explicitly tell the operator which ports need
to be moved where, the process becomes manually intensive and error-prone very
quickly. If the problem goes unnoticed, it will result in suboptimal
performance in the fabric.</t>
</section>
</section>
<section><name>Multicast and Broadcast Implementations</name>
<t>RIFT supports both multicast and broadcast implementations. While a
multicast implementation is preferred, there might be cases where a broadcast
implementation is optimal or even required. For example, operating systems on
IoT devices and embedded devices may not have the required multicast
support. Another example is containers, which do support multicast in some
cases but tend to be very CPU-inefficient and difficult to tune.</t>
</section>
<section><name>Positive vs. Negative Disaggregation</name>
<t>
Disaggregation is the procedure whereby <xref target='RFC9692'>RIFT</xref> advertises a more specific route
southwards as an exception to the aggregated fabric-default
north. Disaggregation is useful when a prefix within the aggregation is
reachable via some of the parents but not the others at the same level of
the fabric. It is mandatory when the level is the ToF since a ToF node
that cannot reach a prefix becomes a black hole for that prefix. The hard
problem is to know which prefixes are reachable by whom.
</t>
<t>
In the general case, <xref target='RFC9692'>RIFT</xref> solves
that problem by interconnecting the ToF nodes so that they can
exchange the full list of prefixes that exist in the fabric and figure out
when a ToF node lacks reachability to some prefixes. This requires
additional ports at the ToF, typically two ports per ToF node to form a
ToF-spanning ring. <xref target='RFC9692'>RIFT</xref> also
defines the southbound reflection procedure that enables a parent to
explore the direct connectivity of its peers, meaning their own parents
and children; based on the advertisements received from the shared parents
and children, it may enable the parent to infer the prefixes its peers can
reach.
</t>
<t>
When a parent lacks reachability to a prefix, it may disaggregate the
prefix negatively, i.e., advertise that this parent can be used to reach
any prefix in the aggregation except that one. The Negative Disaggregation
signaling is simple and functions transitively from ToF to Top-of-Pod
(ToP) and then from ToP to Leaf.
However, it is hard for a parent to
figure out which prefix it needs to disaggregate because it does not know
what it does not know; as a result, the use of a spanning ring at the
ToF is required to operate the Negative Disaggregation. Also, though it
is only an implementation problem, the programming of the FIB is complex
compared to normal routes and may incur recursions.
</t>
<t>
The more classical alternative is, for the parents that can reach a prefix
that peers at the same level cannot, to advertise a more specific route to
that prefix. This leverages the normal longest prefix match in the FIB
and does not require a special implementation. As opposed to the
Negative Disaggregation, the Positive Disaggregation is difficult and
inefficient to operate transitively.
</t>
<t>
Transitivity is not needed by a grandchild if all its parents received the
Positive Disaggregation, meaning that they shall all avoid the black hole;
when that is the case, they collectively build a ceiling that protects the
grandchild. Until then, a parent that received the Positive
Disaggregation may believe that some peers are lacking the reachability
and re-advertise too
early or defer and maintain a black hole situation
longer than necessary.
</t>
<t>
In a non-partitioned fabric, all the ToF nodes see one another through the
reflection and can figure out if one is missing a child. In that case, it is
possible to compute the prefixes that the peer cannot reach and
disaggregate positively without a ToF-spanning ring. The ToF nodes can
also ascertain that the ToP nodes are each connected to at least one ToF
node that can still reach the prefix, meaning that the transitive
operation is not required.
</t>
<t>
The bottom line is that in a fabric that is partitioned (e.g., using
multiple planes) and/or where the ToP nodes are not guaranteed to always
form a ceiling for their children, it is mandatory to use the Negative
Disaggregation. On the other hand, in a highly symmetrical and fully
connected fabric (e.g., a canonical Clos Network), the Positive
Disaggregation method saves the complexity and cost associated
with the ToF-spanning ring.
</t>
<t>
Note that in the case of Positive Disaggregation, the first ToF nodes
that announce a more-specific route attract all the traffic for that
route and may suffer from a transient incast. A ToP node that defers
injecting the longer prefix in the FIB, in order to receive more
advertisements and spread the packets better, also keeps on sending a
portion of the traffic to the black hole in the meantime. In the case of
Negative Disaggregation, the last ToF nodes that inject the route may
also incur an incast issue; this problem would occur if a prefix that
becomes totally unreachable is disaggregated.
</t>
</section> <!-- Positive vs. Negative Disaggregation -->
<section><name>Mobile Edge and Anycast</name>
<t>
When a physical or a virtual node changes its point of attachment in the
fabric from a previous-leaf to a next-leaf, new routes must be installed
that supersede the old ones. Since the flooding flows northwards, the
nodes (if any) between the previous-leaf and the common parent are not
immediately aware that the path via the previous-leaf is obsolete, and a stale
route may exist for a while. The common parent needs to select the
freshest route advertisement in order to install the correct route via the
next-leaf. This requires that the fabric determines the sequence of the
movements of the mobile node.
</t>
<t>
On the one hand, a classical sequence counter provides a total order for a
while, but it will eventually wrap. On the other hand, a timestamp provides
a permanent order, but it may miss a movement that happens too quickly relative to the granularity of the timing information.
It is not envisioned that an average fabric will support the <xref
target='IEEEstd1588'>Precision Time Protocol</xref> in the short term, and
the precision available with the <xref target='RFC5905'>Network Time
Protocol</xref> (on the order of 100 to 200 ms) may not be
enough to cover, e.g., the fast mobility of a Virtual Machine (VM).
</t>
<t>
Section <xref target='RFC9692' sectionFormat='bare' section='6.8.4'>"Mobility"</xref> of <xref target='RFC9692'/>
specifies a hybrid method that combines a sequence counter from the mobile
node and a timestamp from the network taken at the leaf when the route is
injected. If the timestamps of the concurrent advertisements are
comparable (i.e., more distant than the precision of the timing protocol),
then the timestamp alone is used to determine the relative freshness of
the routes. Otherwise, the sequence counter from the mobile node is used if available. One caveat is that the sequence counter must not wrap
within the precision of the timing protocol. Another is that the mobile
node may not even provide a sequence counter, in which case the mobility
itself must be slower than the precision of the timing.
</t>
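<t>The hybrid comparison can be sketched as follows. This is a simplified illustration rather than the procedure from the RIFT specification: the Advertisement type, the freshest function, and the 200 ms default precision are assumptions for this example, and sequence counter wrap-around is deliberately ignored.</t>
<sourcecode type="python"><![CDATA[
```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Advertisement:
    leaf: str
    timestamp_ms: int           # taken at the leaf when the route is injected
    sequence: Optional[int]     # from the mobile node, if it provides one


def freshest(a, b, precision_ms=200):
    """Pick the fresher of two advertisements for the same address.

    Returns None when the two cannot be ordered (timestamps within the
    timing precision and no differing sequence counters to compare).
    Counter wrap-around is not handled in this sketch.
    """
    if abs(a.timestamp_ms - b.timestamp_ms) > precision_ms:
        # Timestamps are comparable: the timestamp alone decides.
        return a if a.timestamp_ms > b.timestamp_ms else b
    if (a.sequence is not None and b.sequence is not None
            and a.sequence != b.sequence):
        # Too close in time: fall back to the sequence counter.
        return a if a.sequence > b.sequence else b
    return None
```
]]></sourcecode>
<t>In this sketch, two advertisements that are 500 ms apart are ordered by timestamp regardless of their counters, whereas two advertisements that are 50 ms apart are ordered by the sequence counter, and cannot be ordered at all if the mobile node supplies none.</t>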
<t>
Mobility must not be confused with anycast. In both cases, the same
address is injected in RIFT at different leaves. In the case of mobility,
only the freshest route must be conserved since the mobile node changes its
point of attachment from one leaf to the next. In the case of anycast, the
node may either be multihomed (attached to multiple leaves in parallel) or
reachable beyond the fabric via multiple routes that are redistributed to
different leaves. Either way, the multiple routes are equally valid and
should be conserved in the case of anycast. Without further information
from the redistributed routing protocol, it is impossible to sort out a
movement from a redistribution that happens asynchronously on different
leaves. <xref target='RFC9692'>RIFT</xref> expects that anycast addresses
are advertised within the timing precision, which is typically the case
with a low-precision timing and a multihomed node. Beyond that time
interval, RIFT interprets the lag as a mobility event, and only the freshest
route is retained.
</t>
<t>
When using <xref target='RFC8200'>IPv6</xref>, RIFT suggests leveraging
<xref target='RFC8505'/> as the IPv6 ND interaction between the mobile
node and the leaf. This provides not only a sequence counter but also a
lifetime and a security token that may be used to protect the ownership of
an address <xref target='RFC8928'/>. When using <xref target='RFC8505'/>,
the parallel registration of an anycast address to multiple leaves is done
with the same sequence counter, whereas the sequence counter is
incremented when the point of attachment changes. This way, it is
possible to differentiate a mobile node from a multihomed node, even when
the mobility happens within the timing precision. It is also possible for
a mobile node to be multihomed as well, e.g., to change only one of its
points of attachment.
</t>
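<t>
As an illustration only, the distinction between a multihomed node and a
mobile node can be sketched in a few lines of Python. The names Registration
and classify are hypothetical and are not taken from RIFT or from RFC 8505.
The sequence counter is treated here as a plain integer, which is an
assumption that ignores the wraparound handling a real implementation would
need.
</t>
<sourcecode type="python"><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Registration:
    """One registration of an address as seen by a leaf (hypothetical type)."""
    leaf: str       # leaf that received the registration
    sequence: int   # sequence counter carried in the registration


def classify(previous: Registration, current: Registration) -> str:
    """Compare two registrations of the same address."""
    if current.leaf == previous.leaf:
        return "refresh"      # same point of attachment
    if current.sequence == previous.sequence:
        return "multihomed"   # parallel registration with the same counter
    if current.sequence > previous.sequence:
        return "moved"        # counter incremented: attachment changed
    return "stale"            # older counter: not the freshest registration


first = Registration(leaf="Leaf111", sequence=7)
print(classify(first, Registration(leaf="Leaf121", sequence=7)))  # multihomed
print(classify(first, Registration(leaf="Leaf121", sequence=8)))  # moved
```
]]></sourcecode>
<t>
Because the decision in this sketch relies on the counter rather than on
arrival time, it gives the same answer even when both registrations fall
within the timing precision.
</t>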
</section> <!-- Mobile Edge and Anycast -->
<section anchor='v4ov6'><name>IPv4 over IPv6</name>
<t>RIFT allows advertising IPv4 prefixes over an IPv6 RIFT network. An
IPv6 Address Family (AF) configures via the usual ND mechanisms, and then
V4 can use V6 next hops analogous to <xref target='RFC8950'/>. It is
expected that the whole fabric supports the same type of forwarding of
AFs on all the links. RIFT provides an indication of whether a
node is capable of V4 forwarding, and implementations are possible where
different routing tables are computed per AF as long as the
computation remains loop-free.
</t>
<figure align='center' anchor='IPV4-o-IPV6'><name>IPv4 over IPv6</name>
<artwork align='center'><![CDATA[
+-----+ +-----+
+---+---+ | ToF | | ToF |
^ +--+--+ +-----+
| | | | |
| | +-------------+ |
| | +--------+ | |
+ | | | |
V6 +-----+ +-+---+
Forwarding |Spine| |Spine|
+ +--+--+ +-----+
| | | | |
| | +-------------+ |
| | +--------+ | |
| | | | |
v +-----+ +-+---+
+---+---+ |Leaf | | Leaf|
+--+--+ +--+--+
| |
IPv4 prefixes| |IPv4 prefixes
| |
+---+----+ +---+----+
| V4 | | V4 |
| subnet | | subnet |
+--------+ +--------+
]]></artwork>
</figure>
</section>
<section><name>In-Band Reachability of Nodes</name>
<t>RIFT doesn't precondition that nodes of the fabric have reachable
addresses, but operational reasons to reach the internal nodes may
exist. <xref target='In-band-reach'/> shows an example in which the
network management station (NMS) attaches to Leaf1.
</t>
<figure align='center' anchor='In-band-reach'><name>In-Band Reachability of Nodes</name>
<artwork align='center'><![CDATA[
+-------+ +-------+
| ToF1 | | ToF2 |
++---- ++ ++-----++
| | | |
| +----------+ |
| +--------+ | |
| | | |
++-----++ +--+---++
|Spine1 | |Spine2 |
++-----++ ++-----++
| | | |
| +----------+ |
| +--------+ | |
| | | |
++-----++ +--+---++
| Leaf1 | | Leaf2 |
+---+---+ +-------+
|
|NMS
]]></artwork>
</figure>
<t>If the NMS wants to access Leaf2, it simply works because the loopback address of Leaf2 is flooded in its Prefix North TIE.
</t>
<t>If the NMS wants to access Spine2, it also works because a spine node always advertises its loopback address in the Prefix North TIE. The NMS may reach Spine2 from Leaf1-Spine2 or Leaf1-Spine1-ToF1/ToF2-Spine2.
</t>
<t>If the NMS wants to access ToF2, ToF2's loopback address needs to be injected into its Prefix South TIE. This TIE must be seen by all nodes at the level below (the spine nodes in <xref target='In-band-reach'/>) that must form a ceiling for all the traffic coming from below (south). Otherwise, the traffic from the NMS may follow the default route to the wrong ToF node, e.g., ToF1.
</t>
<t>In the case of failure between ToF2 and spine nodes, ToF2's loopback address must be disaggregated recursively all the way to the leaves. In a partitioned ToF, even with recursive disaggregation, a ToF node is only reachable within its plane.
</t>
<t>A possible alternative to recursive disaggregation is to use a ring that interconnects the ToF nodes to transmit packets between them for their loopback addresses only. The idea is that this is mostly control traffic and should not alter the load-balancing properties of the fabric.
</t>
</section>
<section><name>Dual-Homing Servers</name>
<!-- [rfced] We are unable to parse the following sentence (specifically, we
are unable to determine what "or the must" means). May we rephrase as
follows for clarity and specify "Top-of-Fabric"?
Original:
It has no configuration (unless it is a Top-of-Fabric at the top of
the topology or the must operate in the topology as leaf and/or support
leaf-2-leaf procedures) and it will fully configure itself after being
attached to the topology.
Perhaps:
It has no configuration (unless it is a ToF node at the top of the
topology or if it must operate in the topology as a leaf and/or support
leaf-2-leaf procedures), and it will fully configure itself after being
attached to the topology.
-->
<t>Each RIFT node may operate in ZTP mode. It has no configuration (unless
it is a ToF node at the top of the topology or if it must operate in
the topology as a leaf and/or support leaf-2-leaf procedures), and it will
fully configure itself after being attached to the topology.
</t>
<figure align='center' anchor='dualhoming-servers'><name>Dual-Homing Servers</name>
<artwork align='center'><![CDATA[
+---+ +---+ +---+
|ToF| |ToF| |ToF| ToF
+---+ +---+ +---+
| | | | | |
| +----------------+ | |
| +----------------+ |
| | | | | |
+----------+--+ +--+----------+
| ToR1 | | ToR2 | Spine
+--+------+---+ +--+-------+--+
+---+ | | | | | | +---+
| +-----------------+ | | |
| | | +-------------+ | |
| | | | | +-----------------+ |
| | | | +--------------+ | | |
| | | | | | | |
+---+ +---+ +---+ +---+
| | | | | | | |
+---+ +---+ ............. +---+ +---+
SV(1) SV(2) SV(n-1) SV(n) Leaf
]]></artwork>
</figure>
<!-- [rfced] May we rephrase the sentence below as follows (i.e., specify
"ToR" and update "start on" to "startup")?
Original:
Sometimes, people may prefer to disaggregate from ToR to servers
from start on, i.e. the servers have couple tens of routes in FIB from start
on beside default routes to avoid breakages at rack level.
Perhaps:
Sometimes people may prefer to disaggregate from ToR nodes to
servers from startup, i.e., the servers have multiple routes in the FIB from
startup other than default routes to avoid breakages at the rack level.
-->
<t>Sometimes people may prefer to disaggregate from ToR nodes to servers from startup, i.e., the servers have a couple tens of routes in the FIB from startup, in addition to default routes, to avoid breakages at the rack level. Full disaggregation of the fabric could be achieved by configuration supported by RIFT.
</t>
</section>
<section><name>Fabric with a Controller</name>
<t>There are many different ways to deploy the controller. One possibility is attaching a controller to the RIFT domain from ToF and another possibility is attaching a controller from the leaf.
</t>
<figure align='center' anchor='Fabric-controller'><name>Fabric with a Controller</name>
<artwork align='center'><![CDATA[
+------------+
| Controller |
++----------++
| |
| |
+----++ ++----+
------- | ToF | | ToF |
| +--+--+ +-----+
| | | | |
| | +-------------+ |
| | +--------+ | |
| | | | |
+-----+ +-+---+
RIFT domain |Spine| |Spine|
+--+--+ +-----+
| | | | |
| | +-------------+ |
| | +--------+ | |
| | | | |
| +-----+ +-+---+
------- |Leaf | | Leaf|
+-----+ +-----+
]]></artwork>
</figure>
<section><name>Controller Attached to ToFs</name>
<t>If a controller is attaching to the RIFT domain from ToF, it usually uses dual-homing connections. The loopback prefix of the controller should be advertised down by the ToF and spine to the leaves. If the controller loses the link to ToF, make sure the ToF withdraws the prefix of the controller.</t>
</section>
<section><name>Controller Attached to Leaf</name>
<t>If the controller is attaching from a leaf to the fabric, no special provisions are needed.
</t>
</section>
</section>
<section><name>Internet Connectivity Within Underlay</name>
<t>If global addressing is running without overlay, an external default route needs to be advertised through the RIFT fabric to achieve internet connectivity. For the purpose of forwarding of the entire RIFT fabric, an internal fabric prefix needs to be advertised in the South Prefix TIE by ToF and spine nodes.</t>
<section><name>Internet Default on the Leaf</name>
<t>In the case that the internet gateway is a leaf, the leaf node as the internet gateway needs to advertise a default route in its Prefix North TIE.</t>
</section>
<section><name>Internet Default on the ToFs</name>
<t>In the case that the internet gateway is a ToF, the ToF and spine nodes need to advertise a default route in the Prefix South TIE.</t>
</section>
</section>
<section><name>Subnet Mismatch and Address Families</name>
<figure align='center' anchor='subnet-mismatch'><name>Subnet Mismatch</name>
<artwork align='center'><![CDATA[
+--------+ +--------+
| | LIE LIE | |
| A | +----> <----+ | B |
| +---------------------+ |
+--------+ +--------+
X/24 Y/24
]]></artwork>
</figure><t keepWithPrevious='true'></t>
<t>LIEs are exchanged over all links running RIFT to perform Link (Neighbor) Discovery. A node must NOT originate LIEs on an AF if it does not process received LIEs on that family.
LIEs on the same link are considered part of the same negotiation independent of the AF they arrive on.
An implementation must be ready to accept TIEs on all addresses it used as the source of LIE frames.
</t>
<t>As shown in <xref target='subnet-mismatch'/>, an adjacency of nodes A
and B may form without further checks, but the forwarding between nodes A and B may fail
because subnet X mismatches with subnet Y.
</t>
<t>To prevent this, a RIFT implementation should check for subnet mismatch in a way that is similar to how IS-IS does. This can lead to scenarios where, despite the exchange of LIEs in both
AFs, an adjacency may end up forming in a single AF only.
This is a consideration especially in scenarios relating to <xref target='v4ov6'/>.
</t>
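<t>
A minimal sketch of such a check is given below, assuming that each side knows
the interface addresses (with prefix lengths) used on the link. The function
names are hypothetical, the addresses are documentation examples, and the code
uses only Python's standard ipaddress module. It is not the procedure specified
by RIFT or IS-IS.
</t>
<sourcecode type="python"><![CDATA[
```python
import ipaddress


def same_subnet(local_if: str, remote_if: str) -> bool:
    """True if both interface addresses are in the same AF and subnet."""
    local = ipaddress.ip_interface(local_if)
    remote = ipaddress.ip_interface(remote_if)
    if local.version != remote.version:
        return False
    return local.network == remote.network


def matching_families(local_ifs, remote_ifs):
    """Return the IP versions (4 and/or 6) in which the two ends share a subnet."""
    families = set()
    for local_if in local_ifs:
        for remote_if in remote_ifs:
            if same_subnet(local_if, remote_if):
                families.add(ipaddress.ip_interface(local_if).version)
    return sorted(families)


node_a = ["192.0.2.1/24", "2001:db8::1/64"]
node_b = ["198.51.100.2/24", "2001:db8::2/64"]
print(matching_families(node_a, node_b))  # [6]
```
]]></sourcecode>
<t>
In this example, the IPv4 subnets differ while the IPv6 subnets match, so the
check leaves an adjacency that is usable in a single AF only.
</t>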
</section>
<section><name>Anycast Considerations</name>
<figure align='center' anchor='AnycastTL'><name>Anycast</name>
<artwork align='center'><![CDATA[
+ traffic
|
v
+------+------+
| ToF |
+---+-----+---+
| | | |
+------------+ | | +------------+
| | | |
+---+---+ +-------+ +-------+ +---+---+
| | | | | | | |
|Spine11| |Spine12| |Spine21| |Spine22| LEVEL 1
+-+---+-+ ++----+-+ +-+---+-+ ++----+-+
| | | | | | | |
| +---------+ | | +---------+ |
| +-------+ | | | +-------+ | |
| | | | | | | |
+-+---+-+ +--+--+-+ +-+---+-+ +--+--+-+
| | | | | | | |
|Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0
+-+-----+ ++------+ +-----+-+ +-----+-+
+ + + ^ +
PrefixA PrefixB PrefixA | PrefixC
|
+ traffic
]]></artwork>
</figure>
<t>If the traffic comes from ToF to Leaf111 or Leaf121, which have anycast prefix PrefixA, RIFT can deal with this case well. However, if the traffic comes from Leaf122, it arrives at Spine21 or Spine22 at LEVEL 1. Additionally, Spine21 or Spine22 doesn't know that another PrefixA is attached to Leaf111, so the traffic will always get to Leaf121 and never get to Leaf111. If the intention is that the traffic should be offloaded to Leaf111, then use the policy-guided prefixes defined in <xref target='RFC9692'>RIFT</xref>.
</t>
</section>
<section><name>IoT Applicability</name>
<t>The design of RIFT inherits the anisotropic design of a default route upwards (northwards) from RPL <xref target='RFC6550'/>. It also inherits the capability to inject external host routes at the Leaf level using Wireless ND (WiND) <xref target='RFC8505'/> <xref target='RFC8928'/> between a RIFT-agnostic host and a RIFT router. Both the RPL and the RIFT protocols are meant for a large scale, and WiND enables device mobility at the edge the same way in both cases.</t>
<t>The main difference between RIFT and RPL is that there's a single root with RPL, whereas RIFT has many ToF nodes. This adds huge capabilities for leaf-2-leaf ECMP paths but additional complexity with the need to disaggregate. Also, RIFT uses link-state flooding northwards and is not designed for low-power operation.</t>
<t>Still, nothing prevents the IP devices connected at the Leaf from being IoT devices, which typically expose their address using WiND, which is an upgrade from 6LoWPAN ND <xref target='RFC6775'/>.</t>
<t>A network that serves high-speed / high-power IoT devices should typically provide deterministic capabilities for applications such as high-speed control loops or movement detection. The Fat Tree is highly reliable and, in normal conditions, provides an equivalent multipath operation; however, ECMP doesn't provide hard guarantees for either delivery or latency. As long as the fabric is non-blocking, the result is the same, but there can be load imbalances resulting in incast and possibly congestion loss that will prevent the delivery within bounded latency.</t>
<t>This could be alleviated with Packet Replication, Elimination, and
Ordering Functions (PREOF) <xref target='RFC8655'/> leaf-2-leaf, but PREOF is hard to provide at the scale of all flows and the replication may increase the probability of the overload that it attempts to solve.</t>
<t>Note that the load balancing is not RIFT's problem, but it is key to serve IoT adequately.</t>
</section>
<section anchor='keys'><name>Key Management</name>
<t>
As outlined in Section <xref target='RFC9692' sectionFormat='bare' section='9'>"Security Considerations"</xref> of <xref target="RFC9692"/>, either a private shared key or a public/private key pair is used to authenticate the adjacency.
Both the key distribution and key synchronization methods are out of
scope for this document. Both nodes in the adjacency must share the
same keys, key type, and algorithm for a given key ID. Mismatched
keys will not interoperate as their security envelopes will be unverifiable.
</t>
<t>
Key rollover while the adjacency is active may be supported. The
specific mechanism is well documented in <xref target="RFC6518"/>.
As outlined in Section <xref target='RFC9692' sectionFormat='bare' section='9.9'>"Host Implementations"</xref> of <xref target="RFC9692"/>, hosts as well as VMs acting as RIFT devices are possible. Key Management Protocols (KMPs), such as Key Value (KV) for key rollover in the fabric, use a symmetric key that can be changed easily when compromised; in which case, the symmetric key of a host is more likely to be compromised than that of an in-fabric networking node.
</t>
</section>
<section anchor='TTL-HopLimit'><name>TTL/Hop Limit of 1 vs. 255 on LIEs/TIEs</name>
<t>
The use of a packet's Time to Live (TTL) (IPv4) or Hop Limit (IPv6) to verify whether the packet was originated by an adjacent node on a connected link has been used in RIFT.
RIFT explicitly requires the use of a TTL/HL value of 1 or 255 when sending/receiving LIEs and TIEs so that implementers have a choice between the two.
</t>
<t>
TTL=1 or HL=1 protects against the information disseminating more than 1 hop in the fabric and should be the default unless configured otherwise. TTL=255 or HL=255 can lead to RIFT TIE packets propagating more than one hop (the multicast address is already in local subnetwork range) in case of implementation problems but does protect against a remote attack as well, and the receiving remote router will ignore such a TIE packet unless the remote router is exactly 254 hops away and accepts only TTL=1 or HL=1. <xref target="RFC5082"/> defines a Generalized TTL Security Mechanism (GTSM). The GTSM is applicable to LIE/TIE implementations that use a TTL or HL of 255. It provides a defense from infrastructure attacks based on forged protocol packets from outside the fabric.
</t>
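<t>
The arithmetic behind the two choices can be illustrated with the following
sketch. It assumes that every router that forwards a packet decrements the
TTL/HL by one and discards the packet when the value reaches zero. The function
names are hypothetical, and the code models only the check on the received
value, not an actual RIFT or GTSM implementation.
</t>
<sourcecode type="python"><![CDATA[
```python
from typing import Optional


def ttl_on_arrival(sent_ttl: int, routers_crossed: int) -> Optional[int]:
    """TTL/HL seen by the receiver, or None if the packet was discarded."""
    remaining = sent_ttl - routers_crossed
    return remaining if remaining > 0 else None


def accepts(expected_ttl: int, sent_ttl: int, routers_crossed: int) -> bool:
    """True if a receiver that expects exactly expected_ttl takes the packet."""
    if expected_ttl not in (1, 255):
        raise ValueError("this sketch only models an expected value of 1 or 255")
    return ttl_on_arrival(sent_ttl, routers_crossed) == expected_ttl


print(accepts(255, 255, 0))    # True: adjacent sender, TTL/HL untouched
print(accepts(255, 255, 1))    # False: one forwarding hop is detected
print(accepts(1, 1, 1))        # False: TTL/HL=1 does not survive forwarding
print(accepts(1, 255, 254))    # True: the 254-hop corner case
```
]]></sourcecode>
<t>
The last line corresponds to the corner case noted above, in which a packet
sent with a value of 255 reaches a receiver that accepts only a value of 1
after crossing exactly 254 routers.
</t>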
</section>
</section>
<section anchor='Security'><name>Security Considerations</name>
<t>This document presents applicability of RIFT. As such, it does not
introduce any security considerations. However, there are a number
of security concerns described in <xref target='RFC9692'/>.</t>
</section>
<section anchor="iana-tlv-class-reg-sec" title="IANA Considerations">
<t>This document has no IANA actions.</t>
</section>
</middle>
<back>
<references><name>References</name>
<references><name>Normative References</name>
<!-- [rfced] References
a) Please review the following reference. We have added the following URL:
https://www.iso.org/standard/30932.html and updated the title to reflect the
accurate title of this ISO/IEC standard. Please let us know if you have any
objections.
Original:
[ISO10589-Second-Edition]
International Organization for Standardization,
"Intermediate system to Intermediate system intra-domain
routing information exchange protocol for use in
conjunction with the protocol for providing the
connectionless-mode Network Service (ISO 8473)", November
2002.
Current:
[ISO10589-Second-Edition]
ISO/IEC, "Information technology - Telecommunications and
information exchange between systems - Intermediate System
to Intermediate System intra-domain routeing information
exchange protocol for use in conjunction with the protocol
for providing the connectionless-mode network service (ISO
8473)", ISO/IEC 10589:2002, November 2002,
<https://www.iso.org/standard/30932.html>.
b) Please review the following reference. We found a URL
(https://www.broadband-forum.org/pdfs/tr-384-1-0-0.pdf) that matches the
information provided in the reference and updated the reference as
follows. Please let us know any objections.
Original:
[TR-384] Broadband Forum Technical Report, "TR-384 Cloud Central
Office Reference Architectural Framework", January 2018.
Current:
[TR-384] Broadband Forum Technical Report, "TR-384: Cloud Central
Office Reference Architectural Framework", TR-384, Issue
1, January 2018,
<https://www.broadband-forum.org/pdfs/tr-384-1-0-0.pdf>.
c) Please review the following reference. We found the following URL:
https://ieeexplore.ieee.org/document/6012836. We have added this URL to this
reference. Please let us know if you have any objections.
Original:
[CLOS] Yuan, X., "On Nonblocking Folded-Clos Networks in Computer
Communication Environments", IEEE International Parallel &
Distributed Processing Symposium, 2011.
Current:
[CLOS] Yuan, X., "On Nonblocking Folded-Clos Networks in Computer
Communication Environments", 2011 IEEE International
Parallel & Distributed Processing Symposium,
DOI 10.1109/IPDPS.2011.27, May 2011,
<https://ieeexplore.ieee.org/document/6012836>.
d) Please review. We found the following URL for this reference:
https://ieeexplore.ieee.org/document/6312192. We have added this URL to this
reference. Please let us know if you have any objections
Original:
[FATTREE] Leiserson, C. E., "Fat-Trees: Universal Networks for
Hardware-Efficient Supercomputing", 1985.
Current:
[FATTREE] Leiserson, C. E., "Fat-Trees: Universal Networks for
Hardware-Efficient Supercomputing", IEEE Transactions on
Computers, vol. C-34, no. 10, pp. 892-901,
DOI 10.1109/TC.1985.6312192, October 1985,
<https://ieeexplore.ieee.org/document/6312192>.
e) Please review. We found the following URL for this reference:
https://www.broadband-forum.org/download/af-pnni-0055.001.pdf. We have added
this URL to this reference. Additionally, please note that the original date
for this reference was 2003. We were unable to find a version of this
reference with that date. The version we found at the URL has a date of April
2002 and updated the reference as follows for consistency. Please let us know
if you have any objections.
Original:
[PNNI] ATM Forum Technical Committee, "Private Network-Network
Interface Specification, Version 1.1 (PNNI 1.1), af-pnni-
0055.002", 2003.
Current:
[PNNI] The ATM Forum Technical Committee, "Private Network-
Network Interface - Specification Version 1.1 - (PNNI
1.1)", af-pnni-0055.001, April 2002,
<https://www.broadband-forum.org/download/af-pnni-
0055.001.pdf>.
-->
<reference anchor='ISO10589-Second-Edition' target="https://www.iso.org/standard/30932.html">
<front>
<title>Information technology - Telecommunications and information
exchange between systems - Intermediate System to Intermediate System
intra-domain routeing information exchange protocol for use in conjunction
with the protocol for providing the connectionless-mode network service (ISO
8473)</title>
<author>
<organization>ISO/IEC</organization>
</author>
<date month='November' year='2002'/>
</front>
<seriesInfo name="ISO/IEC" value="10589:2002"/>
</reference>
<reference anchor='TR-384' target="https://www.broadband-forum.org/pdfs/tr-384-1-0-0.pdf">
<front>
<title>TR-384: Cloud Central Office Reference Architectural Framework</title>
<author>
<organization>Broadband Forum Technical Report</organization>
</author>
<date month='January' year='2018'/>
</front>
<refcontent>TR-384, Issue 1</refcontent>
</reference>
<xi:include href='https://bib.ietf.org/public/rfc/bibxml/reference.RFC.2328.xml'/>
<xi:include href='https://bib.ietf.org/public/rfc/bibxml/reference.RFC.4861.xml'/>
<xi:include href='https://bib.ietf.org/public/rfc/bibxml/reference.RFC.5082.xml'/>
<xi:include href='https://bib.ietf.org/public/rfc/bibxml/reference.RFC.5340.xml'/>
<xi:include href='https://bib.ietf.org/public/rfc/bibxml/reference.RFC.5357.xml'/>
<xi:include href='https://bib.ietf.org/public/rfc/bibxml/reference.RFC.6518.xml'/>
<xi:include href='https://bib.ietf.org/public/rfc/bibxml/reference.RFC.6550.xml'/>
<xi:include href='https://bib.ietf.org/public/rfc/bibxml/reference.RFC.6775.xml'/>
<xi:include href='https://bib.ietf.org/public/rfc/bibxml/reference.RFC.7130.xml'/>
<xi:include href='https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8655.xml'/>
<xi:include href='https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8950.xml'/>
<!-- [I-D.ietf-rift-rift] IESG state: RFC Ed queue as of 09/24/24; companion document RFC 9692-->
<reference anchor="RFC9692" target="https://www.rfc-editor.org/info/rfc9692">
<front>
<title>RIFT: Routing in Fat Trees</title>
<author fullname="Tony Przygienda" initials="T." surname="Przygienda" role="editor">
<organization>Juniper Networks</organization>
</author>
<author fullname="Jordan Head" initials="J." surname="Head" role="editor">
<organization>Juniper Networks</organization>
</author>
<author fullname="Alankar Sharma" initials="A." surname="Sharma">
<organization>Hudson River Trading</organization>
</author>
<author fullname="Pascal Thubert" initials="P." surname="Thubert">
<organization>Individual</organization>
</author>
<author fullname="Bruno Rijsman" initials="B." surname="Rijsman">
<organization>Individual</organization>
</author>
<author fullname="Dmitry Afanasiev" initials="D." surname="Afanasiev">
<organization>Yandex</organization>
</author>
<date month="December" year="2024"/>
</front>
<seriesInfo name="RFC" value="9692"/>
<seriesInfo name="DOI" value="10.17487/RFC9692"/>
</reference>
<!--xi:include href='https://xml2rfc.tools.ietf.org/public/rfc/bibxml3/reference.I-D.white-distoptflood.xml'/-->
</references>
<references><name>Informative References</name>
<reference anchor="IEEEstd1588" target="https://ieeexplore.ieee.org/document/9120376">
<front>
<title>IEEE Standard for a Precision Clock Synchronization Protocol for Networked Measurement and Control Systems</title>
<author>
<organization>IEEE</organization>
</author>
<date month="June" year="2020"/>
</front>
<seriesInfo name="IEEE Std" value="1588-2019"/>
<seriesInfo name="DOI" value="10.1109/IEEESTD.2020.9120376"/>
</reference>
<reference anchor="CLOS" target="https://ieeexplore.ieee.org/document/6012836">
<front>
<title>On Nonblocking Folded-Clos Networks in Computer Communication Environments</title>
<author initials="X." surname="Yuan"/>
<date month="May" year="2011"/>
</front>
<refcontent>2011 IEEE International Parallel &amp; Distributed Processing Symposium</refcontent>
<seriesInfo name="DOI" value="10.1109/IPDPS.2011.27"/>
</reference>
<reference anchor="FATTREE" target="https://ieeexplore.ieee.org/document/6312192">
<front>
<title>Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing</title>
<author initials="C. E." surname="Leiserson">
<organization>IEEE Transactions on Computers</organization>
</author>
<date month="October" year="1985"/>
</front>
<refcontent>IEEE Transactions on Computers, vol. C-34, no. 10, pp. 892-901</refcontent>
<seriesInfo name="DOI" value="10.1109/TC.1985.6312192"/>
</reference>
<reference anchor="PNNI" target="https://www.broadband-forum.org/download/af-pnni-0055.001.pdf">
<front>
<title>Private Network-Network Interface - Specification Version 1.1 - (PNNI 1.1)</title>
<author>
<organization>The ATM Forum Technical Committee</organization>
</author>
<date month="April" year="2002"/>
</front>
<refcontent>af-pnni-0055.001</refcontent>
</reference>
<xi:include href='https://bib.ietf.org/public/rfc/bibxml/reference.RFC.3626.xml'/>
<xi:include href='https://bib.ietf.org/public/rfc/bibxml/reference.RFC.4271.xml'/>
<xi:include href='https://bib.ietf.org/public/rfc/bibxml/reference.RFC.5905.xml'/>
<xi:include href='https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8200.xml'/>
<xi:include href='https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8505.xml'/>
<xi:include href='https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8928.xml'/>
</references>
</references>
<section title="Acknowledgments" numbered="false">
<t>
The authors would like to thank <contact fullname="Jaroslaw Kowalczyk"/>, <contact fullname="Alvaro Retana"/>, <contact
fullname="Jim Guichard"/>, and <contact fullname="Jeffrey Zhang"/> for providing invaluable concepts and content for this document.
</t>
</section>
<section anchor='Contributors' numbered='false'><name>Contributors</name>
<t>
The following people contributed substantially to the content of this
document and should be considered coauthors:</t>
<contact fullname="Jordan Head">
<organization>Juniper Networks</organization>
<address>
<email>jhead@juniper.net</email>
</address>
</contact>
<contact fullname="Tom Verhaeg">
<organization>Juniper Networks</organization>
<address>
<email>tverhaeg@juniper.net</email>
</address>
</contact>
</section>
<!-- [rfced] The following terminology appears to be used inconsistently.
Please let us know how we should update for consistency.
North Prefix TIE vs. Prefix North TIE
South Prefix TIE vs. South North TIE -->
<!-- [rfced] Please review the "Inclusive Language" portion of the online
Style Guide <https://www.rfc-editor.org/styleguide/part2/#inclusive_language>
and let us know if any changes are needed. Updates of this nature typically
result in more precise language, which is helpful for readers.
For example, please consider whether the terms "black" and "natively".
In addition, please consider whether "traditional" should be updated for clarity.
While the NIST website
<https://www.nist.gov/nist-research-library/nist-technical-series-publications-author-instructions#table1>
indicates that this term is potentially biased, it is also ambiguous.
"Tradition" is a subjective term, as it is not the same for everyone. -->
<!-- [rfced] FYI - We have added expansions for the following abbreviations
per Section 3.6 of RFC 7322 ("RFC Style Guide"). Please review each
expansion in the document carefully to ensure correctness.
Bidirectional Forwarding Detection (BFD)
Key Management Protocol (KMP)
Mobile Ad Hoc Network (MANET)
Optimized Link State Routing (OLSR)
Private Network-Network Interface (PNNI)
Routing Protocol for Low-Power and Lossy Networks (RPL)
-->
</back>
</rfc>