The SNIA Networking Storage Forum celebrated St. Patrick’s Day by hosting a live webcast, “Ethernet-attached SSDs – Brilliant Idea or Storage Silliness?” Even though we didn’t serve green beer during the event, the response was impressive with hundreds of live attendees who asked many great questions – 25 to be exact. Our expert presenters have answered them all here:
Q. Has a prototype drive been built today that includes the Ethernet controller inside the NVMe SSD?
A. There is an interposer board that extends the length by a small amount. Integrated functionality will come with volume and a business case. Some SSD vendors have plans to offer SSDs with fully integrated Ethernet controllers.
Q. Cost seems to be the initial concern… is this a true apples-to-apples comparison with a JBOF?
A. The difference is between a PCIe switch and an Ethernet switch. Ethernet switches usually cost more but provide more bandwidth than PCIe switches. An EBOF might cost more than a JBOF with the same number of SSDs and the same capacity, but the EBOF is likely to provide more performance than the JBOF.
Q. What are the specification names and numbers? Which standards groups are involved?
A. The Native NVMe-oF Drive Specification from the SNIA is the primary specification. A public review version is here. Within that specification, multiple other standards are referenced from SFF, NVMe, and DMTF.
Q. How is this different from the “Kinetic”/“Object Storage” effort of a few years ago? Is there any true production-quality open source available or planned? If so, when, by whom, and where?
A. Kinetic drives were hard disks and thus did not need high-speed Ethernet; in fact, a new lower-speed Ethernet was developed for that case. The pins chosen for Kinetic would not accommodate the higher Ethernet speeds that SSDs need, so the new standard re-uses the same lanes defined for PCIe for use by Ethernet. Kinetic was also a brand-new protocol and application interface rather than leveraging an existing standard interface such as NVMe-oF.
Q. Can Open-Channel SSDs be used in an EBOF?
A. To the extent that Open-Channel can work over NVMe-oF, it should work.
Q. Define the signal integrity challenges of routing Ethernet at these speeds compared to PCIe.
A. The signal integrity of the SFF-8639 connector is considered good through 25Gb Ethernet. The SFF-TA-1002 connector has been tested to 50Gb speeds with good signal integrity and may go higher. Ethernet is able to carry data with good signal integrity much farther than a PCIe connection of similar speed.
Q. Is there a way to expose Intel Optane DC Persistent Memory through NVMe-oF?
A. For now, it would need to be a block-based NVMe device. Byte addressability might be available in the future.
Q. Will there be an interposer to send block IO directly over the switch?
A. For the Ethernet Drive itself, there is a dongle available that allows standard PCIe SSDs to become Ethernet Drives that support block IO over NVMe-oF.
Q. Do NVMe drives fail? Where is HA implemented? I never saw VROC from Intel adopted. So, does the user add latency when adding their own HA?
A. Drive reliability is not impacted by the fact that the drive uses Ethernet. HA can be implemented with dual-port versions of Ethernet drives; dual-port dongles are available today. For host- or network-based data protection, the fact that Ethernet Drives can act as a secondary location for multiple hosts makes data protection easier.
Q. Ethernet is a contention protocol and TCP has overhead to deliver reliability. Is there any work going on to package something like Fibre Channel/QUIC or other solutions to eliminate the downsides of Ethernet and TCP?
A. FC-NVMe has been an approved standard since 2017 and is available and maturing as a solution. NVMe-oF on Ethernet can run on RoCE or TCP, with the option to use lossless Ethernet and/or congestion management to reduce contention, or to use accelerator NICs to reduce TCP overhead. QUIC is growing in popularity for web traffic, but it is not yet clear whether QUIC will prove popular for storage traffic.
Q. Are Lenovo or other OEMs building standard EBOF storage servers? Does OCP have a work group on EBOF-supporting hardware architecture and specification?
A. Currently, Lenovo does not offer an EBOF. However, many ODMs are offering JBOFs and a few are offering EBOFs. OCP is currently focusing on NVMe SSD specifics, including form factor. While several JBOFs have been introduced into OCP, we are not aware of an OCP EBOF specification per se. There are OCP initiatives to optimize the form factors of SSDs, and there are also OCP storage designs for JBOFs that could probably evolve into an Ethernet SSD enclosure with minimal changes.
Q. Is this an accurate statement on SAS latency? Where are you getting the data you are quoting?
A. SAS is a transaction model, meaning the preceding transaction must complete before the next transaction can start (queue depth does ameliorate this to some degree, but endpoints still have to wait). With the initiator and target having to wait for each step to complete, overall throughput slows. SAS HDD = milliseconds per IO (governed by seek and rotation); SAS SSD = hundreds of microseconds (governed by the transactional nature); NVMe SSD = tens of microseconds (governed by the queuing paradigm).
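As a rough illustration of what those per-IO latencies imply, the short Python sketch below converts each figure into the ceiling on fully serialized IOPS (one outstanding command at a time). The specific latency values are assumed, order-of-magnitude examples chosen from within the ranges quoted above, not measurements from the webcast.

# Assumed example latencies picked from the ranges quoted in the answer above.
# Real devices use queuing/parallelism to do far better, which is exactly the
# NVMe advantage being described.
latencies_s = {
    "SAS HDD (~5 ms per IO)": 5e-3,
    "SAS SSD (~200 us per IO)": 200e-6,
    "NVMe SSD (~20 us per IO)": 20e-6,
}
for device, latency in latencies_s.items():
    print(f"{device}: ~{1 / latency:,.0f} IOPS if commands are fully serialized")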
Q. Regarding performance & scaling, a 50GbE port has less bandwidth than a PCIe Gen3 x4 connection. How is converting to Ethernet helping the performance of the array? Doesn’t it face the same bottleneck of the NICs connecting the JBOF/EBOF to the rest of the network?
A. It eliminates the JBOF’s CPU and NIC(s) from the data path and replaces them with an Ethernet switch. The math: one 50GbE port ≈ 5 GB/s, while one PCIe Gen3 x4 connection ≈ 4 GB/s, because PCIe Gen3 runs at 8 Gb/s per lane. That is why a single 25GbE NIC is usually connected to 4 lanes of PCIe Gen3, and a single 50GbE NIC to 8 lanes of PCIe Gen3 (or 4 lanes of PCIe Gen4). But that is only half of the story; there are two other dimensions to consider. First, getting all this bandwidth out of the enclosure, whether it is a JBOF or an EBOF. Second, at the solution level, all of these ports (connectivity) and scaling (bandwidth) present their own challenges.
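As a back-of-the-envelope check of the figures in that answer, here is a small Python sketch. The 128b/130b encoding factor for PCIe Gen3 is a known part of the spec; treating the answer’s ~5 GB/s as the post-overhead payload rate of a 50GbE port is our assumption for illustration.

# Back-of-the-envelope bandwidth comparison.
pcie_gen3_lane_gbps = 8.0                              # raw signaling rate per PCIe Gen3 lane
pcie_gen3_x4_GBps = pcie_gen3_lane_gbps * 4 * (128 / 130) / 8   # ~3.9 GB/s usable (~4 GB/s)
ethernet_50g_raw_GBps = 50.0 / 8                       # 6.25 GB/s raw line rate for one 50GbE port

print(f"PCIe Gen3 x4 : ~{pcie_gen3_x4_GBps:.1f} GB/s usable")
print(f"50GbE port   : ~{ethernet_50g_raw_GBps:.2f} GB/s raw line rate")
print("After protocol overhead the answer above rounds the 50GbE figure to ~5 GB/s,")
print("hence the ~5 GB/s vs ~4 GB/s comparison.")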
Q. What about Persistent Memory? Can you present Optane DC through NVMe-oF?
A. Interesting idea! Today, persistent memory DIMMs sit on the memory bus, so they would not benefit directly from an Ethernet architecture. But with the advent of CXL and PCIe Gen 5, there may be a place for persistent memory in “bays” for a more NUMA-like architecture.
Q. For those of us that use Ceph, this might be an interesting vertical integration, but it feels like there is more latency in “finding” and “balancing” the data on arrays of Ethernet-attached NVMe. Have any software suites been developed to accompany this hardware change, and are whitepapers published?
A. Ceph nodes are generally slower (for like-to-like hardware) than non-Ceph storage solutions, so Ceph might be less likely to benefit from Ethernet SSDs, especially NVMe-oF SSDs. That said, if the cost model for eSSDs works out (really cheap Ceph nodes to overcome “throwing hardware at the problem”), one could look at Ceph solutions using eSSDs, either via NVMe-oF or by creating eSSDs with a key-value interface that can be accessed directly by Ceph.
Q. Can the traditional array functions be moved to the LAN switch layer, either included in the switch (like the Cisco MDS and IBM SVC “experiment”) or by connecting the controller functionality to the LAN switch backbone with the SSDs in a separate VLAN?
A. Many storage functions are software/firmware driven. Certainly, a LAN switch with a rich x86 complex could do this… or a server with a switch subsystem could. I can see low-level storage functions (RAID XOR, compression, maybe snapshots) translated to switch hardware, but I don’t see a clear path for high-level functions (dedupe, replication, etc.) to be translated to switch hardware. However, since hyperscalers do not perform many high-level storage functions at the storage node, perhaps enough can be moved to switch hardware over time.
Q. ATA over Ethernet has been working for nearly 18 years now. What is the difference?
A. ATA over Ethernet is more of a work-group concept and has never gone mainstream (to be honest, your question is the first time I have heard of it since 2001). In any event, ATA does not take advantage of the queuing nature of NVMe, so it is still held hostage by transaction latency. Also, there is no high availability (HA) in ATA (at least I am not aware of any HA standards for ATA), which presents a challenge because HA at the box or storage-controller level does NOT solve the SPOF problem at the drive level.
Q. Request for comment: with Ethernet at 10G, 25G, 50G, and 100G per lane (all available today) and Ethernet MAC speeds of 10G, 25G, 40G, 50G, 100G, 200G, and 400G (all available today), Ethernet is far more scalable than PCIe. Comparing the relative cost of an Ethernet switch to a PCIe switch, the Ethernet switch is far more economical. Why shouldn’t we switch?
A. Yes, Ethernet is more scalable than PCIe, but three things need to happen. 1) Solution-level orchestration has to happen (putting an EBOF behind an RBOF is okay, but it is only the first step). 2) The Ethernet world has to start understanding how storage works (multipathing, ACLs, baseband drive management, etc.). 3) Lower cost needs to be proven; the jury is still out on cost (on paper it’s a no-brainer, but the cost of the Ethernet switch in the I/O module can rival an x86 complex). Note that Ethernet with 100 Gb/s per lane is not yet broadly available as of Q2 2020.
Q. We’ve seen issues with single network infrastructure from an availability perspective. Why would anyone put their business at risk in this manner? Second question: how will this work with multiple vendor hosts or drive vendors, each having different specifications?
A. Customers already connect their traditional storage arrays to either single or dual fabrics, depending on their need for redundancy, and an Ethernet drive can do the same, so there is no rule that an Ethernet SSD must rely on a single network infrastructure. Some large cloud customers use data protection and recovery at the application level that spans multiple drives (or multiple EBOFs), providing high levels of data availability without needing dual fabric connections to every JBOF or to every Ethernet drive. For the second part of the question, it seems likely that all Ethernet drives will support a standard Ethernet interface and most of them will support the NVMe-oF standard, so multiple host and drive vendors will interoperate using the same specifications. This is already happening through UNH plugfests at the NIC/switch level. Areas where Ethernet SSDs might use different specifications include a key-value or object interface, computational storage APIs, and management tools (if the host or drive maker doesn’t follow one of the emerging SNIA specifications).
Q. Will there be a Plugfest or certification test for Ethernet SSDs?
A. Those Ethernet SSDs that use the NVMe-oF interface will be able to join the existing UNH-IOL plugfests for NVMe-oF. Whether there are plugfests for any other aspects of Ethernet SSDs, such as key-value or computational storage APIs, likely depends on how many customers want to use those aspects and how many SSD vendors support them.
Q. Do you anticipate any issues with mixing control (Redfish/Swordfish) and data over the same ports?
A. No, it should be fine to run control and data over the same Ethernet ports. The only reason to run management outside of the data connection would be to diagnose or power-cycle an SSD that is still alive but not responding on its Ethernet interface. If out-of-band management of power connections is required, it could be done with a separate management Ethernet connection to the EBOF enclosure.
Q. We will require more switch ports; would that mean more investment? Also, how is the management of Ethernet SSDs done?
A. Deploying Ethernet SSDs will require more Ethernet switch ports, though it will likely decrease the needed number of other switch or repeater ports (PCIe, SAS, Fibre Channel, InfiniBand, etc.). Also, there are models showing that Ethernet SSDs have certain cost advantages over traditional storage arrays even after including the cost of the additional Ethernet switch ports. Management of the Ethernet SSDs can be done via standard Ethernet mechanisms (such as SNMP), through NVMe commands (for NVMe-oF SSDs), and through the evolving DMTF Redfish/SNIA Swordfish management frameworks mentioned by Mark Carlson during the webcast. You can find more information on SNIA Swordfish here.
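To give a feel for what Redfish/Swordfish management looks like in practice, here is a minimal Python sketch that polls a Swordfish storage collection over HTTPS. The management address, credentials, and certificate handling are assumptions for illustration, and the exact resource paths can vary by implementation; this is a sketch, not a vendor-specific tool.

# Minimal sketch: query an enclosure's Redfish/Swordfish service for storage health.
import requests

EBOF = "https://192.0.2.10"      # assumed management address of the EBOF enclosure
AUTH = ("admin", "password")     # assumed credentials for the example only

# Fetch the Storage collection and print the health of each member resource.
storage = requests.get(f"{EBOF}/redfish/v1/Storage", auth=AUTH, verify=False).json()
for member in storage.get("Members", []):
    detail = requests.get(f"{EBOF}{member['@odata.id']}", auth=AUTH, verify=False).json()
    print(detail.get("Id"), detail.get("Status", {}).get("Health"))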
Q. Is it assumed that Ethernet-connected SSDs need to implement/support congestion control and management, especially in cases of oversubscription in the EBOF (i.e., the EBOF bandwidth is less than the sum of the underlying SSDs)? If so, is that standardized?
A. Yes, but both the NVMe/TCP and NVMe/RoCE protocols have congestion management as part of the protocol, so it is baked in. The eSSDs can connect either to a switch inside the EBOF enclosure or to an external top-of-rack (ToR) switch. That Ethernet switch may or may not be oversubscribed, but either way the protocol-based congestion management on the individual Ethernet SSDs will kick in if needed. If the application does not access all the eSSDs in the enclosure at the same time, the aggregate throughput from the SSDs being used might not exceed the throughput of the switch. If most or all of the SSDs in the enclosure will be accessed simultaneously, then it could make sense to use a non-blocking switch (one that will not be oversubscribed) or to rely on the protocol congestion management.
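To make the oversubscription point concrete, here is a small Python sketch. The drive count, per-drive bandwidth, and uplink configuration are made-up example numbers, not figures from the webcast.

# Assumed example: 24 NVMe-oF eSSDs at ~3 GB/s each behind 4 x 100GbE uplinks.
drives = 24
per_drive_GBps = 3.0                 # assumed sustained bandwidth per eSSD
uplinks = 4
uplink_GBps = 100.0 / 8              # raw line rate of one 100GbE uplink

drive_total = drives * per_drive_GBps
uplink_total = uplinks * uplink_GBps
print(f"Aggregate drive bandwidth : {drive_total:.0f} GB/s")
print(f"Aggregate uplink bandwidth: {uplink_total:.0f} GB/s")
print(f"Oversubscription ratio    : {drive_total / uplink_total:.2f}:1")
# A ratio above 1:1 means the protocol congestion management (NVMe/TCP or RoCE)
# will throttle the drives whenever enough of them are busy at once, as described above.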
Q. Are the industry/standards groups developing application protocols (OSI layers 5 through 7) to allow customers to use existing OSes/apps without modification? If so, when will these be available, and via what delivery to the market (a new IETF application protocol, a consortium, …)?
A. Applications that can directly use individual SSDs can access an NVMe-oF Ethernet SSD directly as block storage, without modification and without using any other protocols. There are also software-defined storage solutions that already manage and virtualize access to NVMe-oF arrays, and they could be modified to allow applications to access multiple Ethernet SSDs without modifications to the applications. At higher levels of the OSI stack, the computational storage standard under development within SNIA or a key-value storage API could be other solutions to allow applications to access Ethernet SSDs, though in some cases the applications might need to be modified to support the new computational storage and/or key-value APIs.
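As a simple illustration of the "unmodified block storage" point: once an NVMe-oF Ethernet SSD has been connected on the host (for example with the standard nvme-cli connect command), its namespace appears as an ordinary block device. The Python sketch below reads from it exactly as it would from a local disk; the device path is an assumption for illustration and depends on the host's enumeration.

# Minimal sketch: an NVMe-oF namespace, once connected, is just a block device.
import os

DEVICE = "/dev/nvme1n1"              # assumed example path of the NVMe-oF namespace

fd = os.open(DEVICE, os.O_RDONLY)    # typically requires root privileges
try:
    block = os.read(fd, 4096)        # read the first 4 KiB, exactly as with a local SSD
    print(f"Read {len(block)} bytes from {DEVICE}; no application changes needed.")
finally:
    os.close(fd)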
Q. In an eSSD implementation, what system element implements advanced features like data streaming and IO determinism? Maybe a better question: does the standard support this at the drive level?
A. Any features such as these that are already part of NVMe will work on Ethernet drives.