Almost 800 people have already watched our webcast “Optimizing NVMe over Fabrics Performance with Different Ethernet Transports: Host Factors” where SNIA experts covered the factors impacting different Ethernet transport performance for NVMe over Fabrics (NVMe-oF) and provided data comparisons of NVMe over Fabrics tests with iWARP, RoCEv2 and TCP. If you missed the live event, watch it on-demand at your convenience.
The session generated a lot of questions, all answered here in
this blog. In fact, many of the questions have prompted us to continue this
discussion with future webcasts on NVMe-oF performance. Please follow us on Twitter @SNIANSF for upcoming dates.
Q. What factors will affect the performance of NVMe over RoCEv2
and TCP when the network between host and target is longer than in a typical data center environment, i.e., RTT > 100ms?
A. For a large deployment over long distances, congestion
management and flow control are the most critical considerations for
guaranteeing performance. In a very large deployment, network topology,
bandwidth subscription to the storage target, and connection ratio are all
important factors that will impact NVMe-oF performance.
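To illustrate why congestion management matters so much at long distances, a quick bandwidth-delay product calculation (a sketch with assumed link speeds, not figures from the webcast) shows how much in-flight data the network must absorb as RTT grows:

```python
# Bandwidth-delay product: the amount of data in flight that buffers and
# flow control must absorb. Illustrative numbers only, not webcast results.

def bdp_bytes(link_gbps: float, rtt_ms: float) -> float:
    """Bytes in flight on a `link_gbps` link with round-trip time `rtt_ms`."""
    bits_per_sec = link_gbps * 1e9
    rtt_sec = rtt_ms / 1e3
    return bits_per_sec * rtt_sec / 8

# A 100 Gb/s link at 100 ms RTT holds ~1.25 GB in flight,
# versus ~125 KB at a 10 us in-rack RTT.
wan = bdp_bytes(100, 100)
lan = bdp_bytes(100, 0.01)
print(f"WAN BDP: {wan / 1e9:.2f} GB, in-rack BDP: {lan / 1e3:.0f} KB")
```

A single dropped packet at that scale forces a large re-transmission window, which is why congestion management dominates long-haul NVMe-oF performance.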
Q. Were the RoCEv2 tests run on ‘lossless’ Ethernet and the TCP
tests run on ‘lossy’ Ethernet?
A. Both iWARP and RoCEv2 tests were run in a back-to-back
configuration without a switch in the middle, but with Link Flow Control turned
on.
Q. Just to confirm, this is with pure RoCEv2? No TCP, right?
RoCEv2 end to end (initiator to target)?
A. Yes, for the RoCEv2 test, it was a RoCEv2 initiator to a RoCEv2
target.
Q. How are the drives being preconditioned? Is it based on I/O
size or MTU size?
A. Storage is preconditioned by the I/O size and type of the selected
workload; MTU size is not relevant. The
selected workload is applied until performance changes are time invariant,
i.e., until performance stabilizes within a range known as steady state. Generally, the workload is tracked by
specific I/O size and type to remain within a data excursion of 20% and a slope
of 10%.
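As a sketch of how such a steady-state window might be checked (assumed logic and function names, not the actual test-tool implementation): each sample in the measurement window must stay within the data-excursion limit of the window average, and the trend across the window must stay within the slope limit.

```python
# Sketch of a steady-state check in the spirit described above (assumed
# logic, not the Calypso implementation): samples must stay within a 20%
# data excursion of the window average, with a trend slope within 10%.

def is_steady_state(window, excursion_limit=0.20, slope_limit=0.10):
    """window: per-round performance samples (e.g., IOPS) for the tracked I/O."""
    avg = sum(window) / len(window)
    # Data excursion: max deviation of any sample from the window average.
    excursion_ok = all(abs(x - avg) / avg <= excursion_limit for x in window)
    # Slope: least-squares trend over the window, relative to the average.
    n = len(window)
    x_mean = (n - 1) / 2
    slope = sum((i - x_mean) * (y - avg) for i, y in enumerate(window)) / \
            sum((i - x_mean) ** 2 for i in range(n))
    slope_ok = abs(slope * (n - 1)) / avg <= slope_limit
    return excursion_ok and slope_ok

print(is_steady_state([100, 102, 99, 101, 100]))   # stable window -> True
print(is_steady_state([100, 120, 140, 160, 180]))  # still ramping -> False
```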
Q. Are the 6 SSDs off a single Namespace, or multiple? If so, how
many Namespaces used?
A. Single namespace.
Q. What I/O generation tool was used for the test?
A. The Calypso CTS IO Stimulus generator,
which is based on libaio. CTS uses the same engine as fio and applies I/Os at the
block I/O level. Note that vdbench and Iometer
are Java-based, operate at the file system level, and sit higher in the software stack.
Q. Given that NVMe SSD performance is high with low
latency, isn't the performance bottleneck shifted to the storage
controller?
A. Test I/Os are applied to the logical storage seen
by the host on the target server, in an attempt to normalize the host and target in
order to assess NIC-wire-NIC performance. The storage controller is beneath
this layer and not applicable to this test. If we tested the storage directly on
the target – not over the wire – then we could see the impact of the controller and
controller-related issues (such as garbage collection, over-provisioning, table
structures, etc.).
Q. What are the specific characteristics of RoCEv2
that restrict it to ‘rack’ scale deployments?
In other words, what is restricting it from larger scale deployments?
A. RoCEv2 can, and does, scale beyond the rack if you
have one of three things:
- A lossless network with DCB (Priority Flow Control)
- Congestion management with solutions like ECN
- Newer RoCEv2-capable adapters that support out-of-order packet receive and selective re-transmission
Q. How many CPU cores are needed (i.e., how many am I willing to give)?
A. The test used a dual-socket target server with Intel® Xeon® Platinum 8280L processors (28 cores each). The target server used only one processor so that all the workloads ran on a single NUMA node; the 1-4% CPU utilization is the average across 28 cores.
Q. Optane SSD or 3D NAND SSD?
A. SSD-1 is an Optane SSD; SSD-2 is 3D NAND.
Q. How deep should the queue depth be?
A. Normally QD is set to 32.
Q. Why do I need to care about MTU?
A. You do not need to care about MTU; at least in our test, we saw minimal performance differences.
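As a rough illustration of how queue depth relates to throughput, Little's law ties queue depth, average latency, and IOPS together (the latency figure below is assumed for illustration, not a measurement from these tests):

```python
# Little's law applied to storage: IOPS ~= queue_depth / average_latency.
# The 100 us latency is an assumed illustrative value, not a test result.

def iops_from_qd(queue_depth: int, avg_latency_us: float) -> float:
    """Sustainable IOPS for a given queue depth and average I/O latency."""
    return queue_depth / (avg_latency_us / 1e6)

# QD=32 at 100 us average latency sustains roughly 320,000 IOPS per job.
print(f"{iops_from_qd(32, 100):,.0f} IOPS")
```

This is why a deeper queue raises throughput only until the device saturates, after which added depth just adds latency.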