Optimizing NVMe over Fabrics Performance Q&A

Tom Friend

Oct 2, 2020

Almost 800 people have already watched our webcast “Optimizing NVMe over Fabrics Performance with Different Ethernet Transports: Host Factors,” where SNIA experts covered the factors impacting Ethernet transport performance for NVMe over Fabrics (NVMe-oF) and provided data comparisons of NVMe over Fabrics tests with iWARP, RoCEv2 and TCP. If you missed the live event, watch it on-demand at your convenience. The session generated a lot of questions, all answered here in this blog. In fact, many of the questions have prompted us to continue this discussion with future webcasts on NVMe-oF performance. Please follow us on Twitter @SNIANSF for upcoming dates.

Q. What factors will affect the performance of NVMe over RoCEv2 and TCP when the network between host and target is longer than a typical data center environment, i.e., RTT > 100ms?

A. For a large deployment over long distances, congestion management and flow control are the most critical considerations to make sure performance is guaranteed. In a very large deployment, network topology, bandwidth subscription to the storage target, and connection ratio are all important factors that will impact the performance of NVMe-oF.

Q. Were the RoCEv2 tests run on “lossless” Ethernet and the TCP tests run on “lossy” Ethernet?

A. Both the iWARP and RoCEv2 tests were run in a back-to-back configuration without a switch in the middle, but with Link Flow Control turned on.

Q. Just to confirm, this is with pure RoCEv2? No TCP, right? RoCEv2 end to end (initiator to target)?

A. Yes, the RoCEv2 test ran a RoCEv2 initiator to a RoCEv2 target.

Q. How are the drives being preconditioned? Is it based on I/O size or MTU size?

A. Storage is preconditioned by the I/O size and type of the selected workload; MTU size is not relevant. The selected workload is applied until performance changes are time invariant – i.e., until performance stabilizes within a range known as steady state. Generally, the workload is tracked by specific I/O size and type to remain within a data excursion of 20% and a slope of 10%.

Q. Are the 6 SSDs off a single namespace, or multiple? If so, how many namespaces were used?

A. A single namespace.

Q. What I/O generation tool was used for the test?

A. The Calypso CTS IO stimulus generator, which is based on libaio. CTS has the same engine as fio and applies I/Os at the block IO level. Note that vdbench and Iometer are Java-based, operate at the file system level, and sit higher in the software stack.

Q. Given that NVMe SSD performance is high with low latency, isn’t the performance bottleneck shifted to the storage controller?

A. Test I/Os are applied to the logical storage seen by the host on the target server, in an attempt to normalize the host and target and assess NIC-wire-NIC performance. The storage controller is beneath this layer and not applicable to this test. If we test the storage directly on the target – not over the wire – then we can see the impact of the controller and controller-related issues (such as garbage collection, over provisioning, table structures, etc.).

Q. What are the specific characteristics of RoCEv2 that restrict it to “rack” scale deployments? In other words, what is restricting it from larger scale deployments?

A. RoCEv2 can, and does, scale beyond the rack if you have one of three things:
  1. A lossless network with DCB (priority flow control)
  2. Congestion management with solutions like ECN
  3. Newer RoCEv2-capable adapters that support out of order packet receive and selective re-transmission
Your mileage will vary based upon the features of different network vendors.

Q. Is there an option to use some caching mechanism on the host side?

A. The host side has a RAM cache set up per platform, but it is held constant across these tests.

Q. Was there caching in the host?

A. The test used host memory for NVMe over Fabrics.

Q. Were all these topics from the description covered? In particular, #2? The description said we would cover these variables:
  1. How many CPU cores are needed (I’m willing to give)?
  2. Optane SSD or 3D NAND SSD?
  3. How deep should the Q-Depth be?
  4. Why do I need to care about MTU?
A. Cores – see the TC/QD sweep to find the optimal OIO; core usage (and the cores required) can be inferred from this. Note the incongruity of TC/QD to OIO of 8, 16, 32 and 48 in this case. (A small sketch of the OIO = TC x QD relationship follows the list below.)
  1. The test used a dual-socket server on the target with an Intel® Xeon® Platinum 8280L processor with 28 cores. The target server only used one processor so that all the workloads were on a single NUMA node. The 1-4% CPU utilization is the average across the 28 cores.
  2. SSD-1 is Optane SSD, SSD-2 is 3D NAND.
  3. Normally QD is set to 32.
  4. You do not need to care about MTU; at least in our test, we saw minimal performance differences.
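
To make the TC/QD relationship mentioned above concrete, here is a minimal sketch of how outstanding I/O (OIO = TC x QD) grows across a sweep. The specific thread count and queue depth values below are illustrative assumptions, not the exact sweep points used in the webcast testing.

```python
# Illustrative TC/QD sweep: OIO (outstanding I/O) = thread count x queue depth.
# The TC and QD values below are assumed for illustration only; they are not
# the exact sweep points used in the webcast testing.

thread_counts = [1, 2, 4, 8, 16]      # TC: number of worker threads (assumed values)
queue_depths = [1, 2, 4, 8, 16, 32]   # QD: outstanding I/Os per thread (assumed values)

for tc in thread_counts:
    for qd in queue_depths:
        oio = tc * qd                  # total outstanding I/O presented to the target
        print(f"TC={tc:2d}  QD={qd:2d}  OIO={oio:3d}")
```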
Q. The result of 1-4% CPU utilization on the target – is that based on a single SSD? Do you expect to see much higher CPU utilization as the number of SSDs increases?

A. The CPU % is for the target server driving the 6-SSD LUN.

Q. Is there any difference between the different transports in their sensitivity to lost packets?

A. Theoretically, iWARP and TCP are more tolerant of packet loss. iWARP is based on TCP/IP; TCP provides flow control and congestion management and can still perform in a congested environment. In the event of packet loss, iWARP supports selective re-transmission and out-of-order packet receive, which can further improve performance in a lossy network. In contrast, a standard RoCEv2 implementation does not tolerate packet loss, requires a lossless network, and experiences performance degradation when packet loss happens.

Q. 1. When you say offloaded TCP, is this on both the initiator and target side, or just the host initiator side? 2. Do you see any improvement with ADQ on TCP?

A. The iWARP RDMA used in the test has a complete TCP offload engine on the network adapter on both the initiator and target side. Application Device Queues (ADQ) can significantly improve throughput, latency and, most importantly, latency jitter, with dedicated CPU cores allocated for NVMe-oF solutions.

Q. Since the CPU utilization is extremely low on the host, any comments about the CPU’s role in NVMe-oF and the impact of offloading?

A. NVMe-oF was designed to reduce the CPU load on the target, as shown in the test. On the initiator side the CPU load will be a little higher. RDMA, as an offloaded technology, requires fairly minimal CPU utilization. NVMe over TCP still uses the TCP stack in the kernel to do all the work, so the CPU still plays an important role. Also, the test was done with a high-end Intel® Xeon® processor with very powerful processing capability; if a processor with less processing power were used, CPU utilization would be higher.

Q. 1. What should be the ideal encapsulated (in-line) data size for best performance in a real-world scenario? 2. How could one optimize buffer copies at the block level in NVMe-oF?

A. 1. There is no simple answer to this question. The impact of encapsulated data size on performance in a real-world scenario is more complicated, as the switch plays a critical role in the whole network. Whether there is a shallow-buffer or a deep-buffer switch, switch settings like policy, congestion management, etc. all impact the overall performance. 2. There are multiple explorations underway to improve the performance of NVMe-oF by reducing or optimizing buffer copies. One possible option is to use the Controller Memory Buffer introduced in NVMe specification 1.2.

Q. Is it possible to combine any of the NVMe-oF technologies with SPDK – user space processing?

A. SPDK currently supports all of these Ethernet-based transports: iWARP, RoCEv2 and TCP.

Q. You indicated that TCP is non-offloaded, but doesn’t it still use the “pseudo-standard” offloads like checksum, LSO, RSS, etc.? It just doesn’t have the entire TCP stack offloaded?

A. Yes, stateless offloads are supported and used.

Q. What is the real idea in using 4 different SSDs? Why didn’t you use 6 or 8 or 10? What is the message you are trying to relay? I understand that SSD-1 is higher/better performing than SSD-2.

A. We used a six-SSD LUN for both SSD-1 and SSD-2. We compared higher-performance, lower-capacity Optane to lower-performance, higher-capacity 3D NAND NVMe. Note that the 3D NAND is 10X the capacity of the Optane.
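
Returning for a moment to the preconditioning answer near the top of this Q&A (performance must stay within a 20% data excursion and a 10% slope before it is considered steady state), here is a rough sketch of that kind of check applied to a series of per-round measurements. The window length and the exact formulas are illustrative assumptions, not the actual CTS implementation.

```python
# Rough sketch of a steady-state check in the spirit of the preconditioning
# answer above: the measured variable must stay within a 20% excursion of its
# average, and its best-fit slope must stay within 10% over the window.
# Window length and formulas are illustrative assumptions, not the CTS code.

def is_steady_state(samples, excursion_limit=0.20, slope_limit=0.10):
    n = len(samples)
    mean = sum(samples) / n
    # Data excursion: spread of the window relative to its average.
    excursion = (max(samples) - min(samples)) / mean
    # Least-squares slope across the window, normalized by the average.
    xs = range(n)
    x_mean = sum(xs) / n
    slope = sum((x - x_mean) * (y - mean) for x, y in zip(xs, samples)) / \
            sum((x - x_mean) ** 2 for x in xs)
    slope_ratio = abs(slope * n) / mean   # total drift across the window vs. the mean
    return excursion <= excursion_limit and slope_ratio <= slope_limit

# Example: five one-minute rounds of 4K random write IOPS (made-up numbers).
print(is_steady_state([101000, 99000, 100500, 99800, 100200]))  # True
```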
Q. It looks like one of the key takeaways is that SSD specs matter. Can you explain (without naming brands) the main differences between SSD-1 and SSD-2?

A. Manufacturer specs are only a starting point, and actual performance depends on the workload. Large differences are seen for small block random write (RND W) workloads and large block sequential read (SEQ R) workloads.

Q. What is the impact to the host CPU and memory during the tests? Wondering what minimum CPU and memory are necessary to achieve peak NVMe-oF performance, which leads to describing how much application workload one might be able to achieve.

A. The test did not limit CPU cores or memory to find the minimal configuration needed to achieve peak NVMe-oF performance. This might be an interesting topic we can cover in a future presentation. (We measured target server CPU usage, not host/initiator CPU usage.)

Q. Did you let the tests run for 2 hours and then take results (basically, warm up the cache/SSD characterization)?

A. We precondition with the TC/QD sweep test, then run the remaining three tests back to back to take advantage of the preconditioning done in the first test.

Q. How do you check outstanding I/Os?

A. We use OIO = TC x QD in the test settings and populate each thread with the QD jobs. We do not look at in-flight OIO, but wait for all OIOs to complete and measure response times.

Q. Where can we get the performance test specifications as defined by SNIA?

A. You can find the test specification on the SNIA website here.

Q. Have these tests been run using FC-NVMe? If so, how did they fare?

A. We have not yet run these tests with NVMe over Fibre Channel.

Q. What tests did you use? FIO, VDBench, IOzone, or just DD or IOMeter? What was the peak CPU utilization, and what CPUs did you use?

A. The CTS IO generator, which is similar to fio: both are based on libaio and test at the block level. Vdbench, IOzone and Iometer are Java-based and operate at the file system level. DD is direct and lacks complex scripting. Fio allows complex scripting but not multiple variables per loop – i.e., it requires iterative tests and post-test compilation, vs. CTS, which has multi-variable, multi-loop concurrency.

Q. What test suites did you use for testing?

A. Calypso CTS tests.

Q. I heard that iWARP is dead?

A. No, iWARP is not dead. There are multiple Ethernet network adapter vendors supporting iWARP now. The adapter used in the test supports iWARP, RoCEv2 and TCP at the same time.

Q. Can you post some recommendations on switch setup and congestion?

A. The test discussed in this presentation used a back-to-back configuration without a switch. We will have a presentation in the near future that takes switch settings into account and will share more information at that time.

Don’t forget to follow us on Twitter @SNIANSF for dates of upcoming webcasts.


John Kim

Sep 19, 2018

In our RoCE vs. iWARP webcast, experts from the SNIA Ethernet Storage Forum (ESF) had a friendly debate on two commonly known remote direct memory access (RDMA) protocols that run over Ethernet: RDMA over Converged Ethernet (RoCE) and the IETF-standard iWARP. It turned out to be another very popular addition to our “Great Storage Debate” webcast series. If you haven’t seen it yet, it’s now available on-demand along with a PDF of the presentation slides. We received A LOT of questions related to Performance, Scalability and Distance, Multipathing, Error Correction, Windows and SMB Direct, DCB (Data Center Bridging), PFC (Priority Flow Control), lossless networks, Congestion Management, and more. Here are answers to them all.

Q. Are RDMA NICs and TOE NICs the same? What are the differences?

A. No, they are not, though some RNICs include a TOE. An RNIC based on iWARP uses a TOE (TCP Offload Engine) since iWARP itself is fundamentally an upper layer protocol relative to TCP/IP (encapsulated in TCP/IP). The iWARP-based RNIC may or may not expose the TOE. If the TOE is exposed, it can be used for other purposes/applications that require TCP/IP acceleration. However, most of the time, the TOE is hidden under the iWARP verbs API and thus is only used to accelerate TCP for iWARP. An RNIC based on RoCE usually does not have a TOE in the first place and is thus not capable of statefully offloading TCP/IP, though many of them do offer stateless TCP offloads.

Q. Does RDMA use the TCP/UDP/IP protocol stack?

A. RoCE uses UDP/IP while iWARP uses TCP/IP. Other RDMA protocols like OmniPath and InfiniBand don’t use Ethernet.

Q. Can Software Defined Networking features like VxLANs be implemented on RoCE/iWARP NICs?

A. Yes, most RNICs can also support VxLAN. An RNIC combines all the functionality of a regular NIC (like VxLAN offloads, checksum offloads, etc.) with RDMA functionality.

Q. Do the BSD OSes (e.g., FreeBSD) support RoCE and iWARP?

A. FreeBSD supports both iWARP and RoCE.

Q. Any comments on NVMe over TCP?

A. The NVMe over TCP standard is not yet finalized. Once the specification is finalized, SNIA ESF will host a webcast on BrightTALK to discuss NVMe over TCP. Follow us @SNIAESF for notification of all our upcoming webcasts.

Q. What layers in the OSI model would the RDMAP, DDP, and MPA map to for iWARP?

A. RDMAP/DDP/MPA stack on top of TCP, so these protocols sit above Layer 4, the Transport Layer, in the OSI model.

Q. What are the deployment percentages between RoCE and iWARP? Which has bigger market share support, and by how much?

A. SNIA does not have this market share information. Today multiple networking vendors support both RoCE and iWARP. Historically more adapters supporting RoCE have been shipped than adapters supporting iWARP, but not all the iWARP/RoCE-capable Ethernet adapters deployed are used for RDMA.

Q. Who will win: RoCE, iWARP or InfiniBand? What should we as customers choose if we want to have this today?

A. As a vendor-neutral forum, SNIA cannot recommend any specific RDMA technology or vendor. Note that RoCE and iWARP run on Ethernet, while InfiniBand (and OmniPath) do not use Ethernet.

Q. Are there any best practices identified for running higher-level storage protocols (iSCSI/NFS/SMB, etc.) on top of RoCE or iWARP?

A. Congestion caused by dropped packets and retransmissions can degrade performance for higher-level storage protocols, whether using RDMA or regular TCP/IP. To prevent this from happening, a best practice would be to use Explicit Congestion Notification (ECN), or better yet, Data Center Bridging (DCB), to minimize congestion and ensure the best performance. Likewise, designing a fully non-blocking network fabric will also assist in preventing congestion and guarantee the best performance. Finally, by prioritizing the data flows that are using RoCE or iWARP, network administrators can ensure bandwidth is available for the flows that require it the most. iWARP provides RDMA functionality over TCP/IP and inherits the loss resilience and congestion management of the underlying TCP/IP layer. Thus, it does not require specific best practices beyond those in use for TCP/IP, including not requiring any specific host or switch configuration, and offers out-of-the-box support across LAN/MAN/WAN networks.

Q. On slide #14 of the RoCE vs. iWARP presentation, the slide showed SCM being 1,000 times faster than NAND Flash, but the presenter stated 100 times faster. Those are both higher than I have heard. Which is correct?

A. Research on the Internet shows that both Intel and Micron have been boasting that 3D XPoint memory is 1,000 times as fast as NAND flash. However, their tests also compared a standard NAND flash-based PCIe SSD to a similar SSD based on 3D XPoint memory, which was only 7-8 times faster. Due to this, we dug a little further and found a great article by Jim Handy, “Why 3D XPoint SSDs Will Be Slow,” that could help explain the difference.

Q. What is the significance of the BTH+ and GRH headers?

A. BTH+ and GRH are both used within InfiniBand for RDMA implementations. With RoCE implementations of RDMA, packets are marked with an EtherType header that indicates the packets are RoCE, and the ip.protocol_number within the IP header is used to indicate that the packet is UDP. Both of these identify packets as RoCE packets.

Q. What sorts of applications are unique to the workstation market for RDMA, versus the server market?

A. All major OEM vendors are shipping servers with CPU platforms that include integrated iWARP RDMA, as well as offering adapters that support iWARP and/or RoCE. The main applications of RDMA are still in the server area at this moment. At the time of this writing, workstation operating systems such as Windows 10 or Linux can use RDMA when running I/O-intensive applications such as video post-production, oil/gas and computer-aided design applications, for high-speed access to storage.

DCB, PFC, lossless networks, and Congestion Management

Q. Is slide #26 correct? I thought RoCE v1 was PFC/DCB and RoCE v2 was ECN/DCB subset. Did I get it backwards?

A. Sorry for the confusion, you’ve got it correct. With newer RoCE-capable adapters, customers may choose to use ECN or PFC for RoCE v2.

Q. I thought RoCE v2 did not need a DCB-enabled network, so why this DCB congestion management for RoCE v2?

A. RoCEv2 running on modern rNICs is known as Resilient RoCE because it does not need a lossless network. Instead, a RoCE congestion control mechanism is used to minimize packet loss by leveraging Explicit Congestion Notification (ECN). ECN allows switches to notify hosts when congestion is likely to happen, and the end nodes adjust their data transmission speeds to prevent congestion before it occurs. RoCE v2 takes advantage of ECN to avoid congestion and packet loss. ECN-capable switches detect when a port is getting too busy and mark outbound packets from that port with the Congestion Experienced (CE) bit. The receiving NIC sees the CE indication and notifies the sending NIC with a Congestion Notification Packet (CNP). In turn, the sending NIC backs off its sending rate temporarily to prevent congestion from occurring. Once the risk of congestion declines sufficiently, the sender resumes full-speed data transmission (this is referred to as Resilient RoCE).
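
Purely as an illustration of the feedback loop just described (the switch marks CE, the receiver returns a CNP, the sender temporarily backs off and then recovers), here is a toy sketch in Python. The back-off and recovery constants are invented for illustration; this is not DCQCN or any vendor's actual rate-control algorithm.

```python
# Toy model of the ECN/CNP feedback loop described above: when the sender is
# notified of congestion it cuts its rate, otherwise it gradually recovers
# toward line rate. The constants are invented for illustration only and do
# not represent DCQCN or any vendor's congestion-control implementation.

LINE_RATE_GBPS = 100.0

def next_rate(current_rate, cnp_received, backoff=0.5, recovery_step=5.0):
    if cnp_received:
        return current_rate * backoff                          # temporary back-off
    return min(LINE_RATE_GBPS, current_rate + recovery_step)   # gradual recovery

rate = LINE_RATE_GBPS
# One CNP arrives at interval 3, then congestion clears.
cnp_pattern = [False, False, False, True, False, False, False, False]
for i, cnp in enumerate(cnp_pattern):
    rate = next_rate(rate, cnp)
    print(f"interval {i}: {'CNP' if cnp else '   '}  rate = {rate:.1f} Gb/s")
```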
Q. Is iWARP a lossless or lossy protocol?

A. iWARP utilizes the underlying TCP/IP layer for loss resilience. This happens at silicon speeds for iWARP adapters with embedded TCP/IP offload engine (TOE) functionality.

Q. So it looks to me that iWARP can use an existing Ethernet network without modifications and RoCEv2 would need some fine-tuning. Is this correct?

A. Generally, iWARP does not require any modification to the Ethernet switches, and RoCE requires the use of either PFC or ECN (depending on the rNICs used for RoCE). However, all RDMA networking will benefit from a network setup that minimizes latency, packet loss, and congestion. iWARP delivers RDMA on top of the TCP/IP protocol, and thus TCP provides congestion management and loss resilience for iWARP which, as a result, does not require a lossless Ethernet network. This is particularly useful on congested networks or long distance links.

Q. Is this a correct statement? Please clarify – RoCE v1 requires ECN and PFC, but RoCEv2 requires only ECN or PFC?

A. Remember, we called this presentation a “Great Storage Debate”? Here is an area where there are two schools of thought. Answer #1: It’s recommended to deploy RoCE (v1) with PFC, which is part of the Ethernet Data Center Bridging (DCB) specification, to implement a lossless network. With the release of RoCEv2, an alternative mechanism to avoid packet loss was introduced which leverages Explicit Congestion Notification (ECN). ECN allows switches to notify hosts when congestion is likely to happen, and the end nodes adjust their data transmission speeds to prevent congestion before it occurs. Answer #2: Generally this is correct; iWARP does not require any modification to the Ethernet switches, and RoCE requires the use of either PFC or ECN (depending on the rNICs used for RoCE), and DCB. As such, and this is very important, an iWARP installation on a storage or server node is decoupled from the switch infrastructure upgrade. However, all RDMA networking will benefit from a network setup that minimizes latency, packet loss, and congestion, though in the case of an iWARP adapter this benefit is insignificant, since all loss recovery and congestion management happen at the silicon speed of the underlying TOE.

Q. Does RoCE v2 also require PFC, or how will it handle lossy networks?

A. RoCE v2 does not require PFC but performs better with either PFC or ECN activated. See the following question and answer for more details.

Q. Can a RoCEv2 lossless network be achieved with ECN only (no PFC)?

A. RoCE has built-in error correction and retransmission mechanisms, so it does not require a lossless network. With modern RoCE-capable adapters, it only requires the use of ECN. ECN in and of itself does not guarantee a lossless connection, but it can be used to minimize congestion and thus minimize packet loss. However, even with RoCE v2, a lossless connection (using PFC/DCB) can provide better performance and is often implemented with RoCEv2 deployments, either instead of ECN or alongside ECN.

Q. In order to guarantee lossless operation, do ECN and PFC both have to be used?
A. ECN can be used to avoid most packet loss, but PFC (part of DCB) is required for a truly lossless network.

Q. Are there real deployments that use “Resilient RoCE” without PFC configured?

A. To achieve better performance, PFC alone, or both ECN and PFC, are deployed in most iterations of RoCE in real deployments today. However, there is a growing number of deployments using Resilient RoCE with ECN alone that maintain high levels of performance.

Q. For RoCEv2, can ECN be implemented without PFC?

A. Yes, ECN can be implemented on its own within a RoCE v2 implementation without the need for PFC.

Q. RoCE needs to have Converged Ethernet, but iWARP does not, correct?

A. Correct. iWARP was standardized in the IETF and built upon standard TCP/IP over Ethernet, so the “Converged Ethernet” requirement doesn’t apply to iWARP.

Q. It’s not clear from the diagram if TCP/IP is still needed for RoCE and iWARP. Is it?

A. RoCE uses IP (UDP/IP) but not TCP. iWARP uses TCP/IP.

Q. On slide #10, does this require any support on the switch?

A. Yes, an enterprise switch with support for DCB would be required. Most enterprise switches do support DCB today.

Q. Will you cover congestion mechanisms and which of RoCEv2 or iWARP works better for different workloads?

A. With multiple vendors supporting RoCEv2 and iWARP at different speeds (10, 25, 40, 50, and 100Gb/s), we’d likely see a difference in performance from each adapter across different workloads. An apples-to-apples test of the specific workload would be required to provide an answer. If you are working with a specific vendor or OEM, we would suggest you ask the vendor/OEM for comparison data on the workload you plan on deploying.

Performance, Scalability and Distance

Q. For storage-related applications, could you add a performance-based comparison of Ethernet-based RoCE/iWARP to FC-NVMe with similar link speeds (32Gbps FC to 40GbE, for example)?

A. We would like to see the results of this testing as well, and due to the overwhelming requests for data comparing RoCE and iWARP, this is something we will try to provide in the future.

Q. Do you have some performance measurements which compare iWARP and RoCE?

A. Nothing is available from SNIA ESF, but a search on Google should provide you with the information you are looking for. For example, you can find this Microsoft blog.

Q. Are there performance benchmarks between RoCE vs. iWARP?

A. Debating which one is faster is beyond the scope of this webcast.

Q. Can RoCE scale to 1000s of Ceph nodes, assuming each node hosts 36 disks?

A. RoCE has been successfully tested with dozens of Ceph nodes. It’s unknown if RoCE with Ceph can scale to 1000s of Ceph nodes.

Q. Is RoCE limited in the number of hops?

A. No, there is no limit on the number of hops, but as more hops are included, latencies increase and performance may become an issue.

Q. Does RoCEv2 support long distance (100km) operation, or is it only iWARP?

A. Today the practical limit of RoCE while maintaining high performance is about 40km. As different switches and optics come to market, this distance limit may increase in the future. iWARP has no distance limit, but with any high-performance networking solution, increasing distance leads to increasing latency due to the speed of light and/or retransmission hops. Since it is a protocol on top of basic TCP/IP, it can transfer data over wireless links to satellites if need be.
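
As a back-of-the-envelope illustration of the speed-of-light point above: light in optical fiber propagates at roughly 5 microseconds per kilometer (about 200 km per millisecond), so distance alone sets a floor on round-trip latency before any protocol or retransmission effects. The figure used in the sketch below is an approximation, not a measured value.

```python
# Back-of-the-envelope propagation delay: light in optical fiber travels at
# roughly 200 km per millisecond, i.e. about 5 microseconds per kilometer.
# Protocol processing and any retransmissions come on top of this.

US_PER_KM = 5.0  # approximate one-way propagation delay in fiber, microseconds/km

def fabric_rtt_us(distance_km):
    return 2 * distance_km * US_PER_KM

for km in (1, 40, 100):
    print(f"{km:4d} km link: ~{fabric_rtt_us(km):6.0f} us round-trip propagation delay")
```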
Multipathing, Error Correction

Q. Isn’t the Achilles heel of iWARP the handling of congestion on the switch? Sure, TCP/IP doesn’t require lossless, but doesn’t one need DCTCP, PFC and ETS to handle buffers filling up, both point to point as well as from receiver to sender? Some vendors offload any TCP/IP traffic and consider RDMA “limited,” but even if that’s true, don’t they have to deal with the same challenges on the switch in regards to congestion management?

A. TCP itself uses a congestion-avoidance algorithm, like TCP New Reno (RFC 6582), together with slow start and a congestion window, to avoid congestion. These mechanisms are not dependent on switches. So iWARP’s performance under network congestion should closely match that of TCP.

Q. If you are using RoCE v2 with UDP, how is error correction implemented?

A. Error correction is done by the RoCE protocol running on top of UDP.

Q. How does multipathing work with RDMA?

A. For single-port RNICs, multipathing, being network-based (Equal-Cost Multi-Path routing, ECMP), is transparent to the RDMA application. Both RoCE and iWARP transports achieve good network load balancing under ECMP. For multi-port RNICs, the RDMA client application can explicitly load-balance its traffic across multiple local ports. Some multi-port RNICs support link aggregation (a.k.a. bonding), in which case the RNIC transparently spreads connection load amongst physical ports.

Q. Do RoCE and iWARP work with bonded NICs?

A. The short answer is yes, but it will depend on the individual NIC vendor’s implementation.

Windows and SMB Direct

Q. What is SMB Direct?

A. SMB Direct is a special version of the SMB 3 protocol. It supports both RDMA and multiple active-active connections. You can find the official definition of SMB (Server Message Block) in the SNIA Dictionary.

Q. Is there iSER support in Windows?

A. Today iSER is supported in Linux and VMware but not in Windows. Windows does support both iWARP and RoCE for SMB Direct. Chelsio is now providing an iSER (iWARP) initiator for Windows as part of the driver package, which is available at service.chelsio.com. The current driver is considered a beta, but will go GA by the end of September 2018.

Q. When will iWARP or RoCE for NVMe-oF be supported on Windows?

A. Windows does not officially support NVMe-oF yet, but if and when Windows does support it, we believe it will support it over both RoCE and iWARP.

Q. Why is iWARP better for Storage Spaces Direct?

A. iWARP is based on TCP, which deals with flow control and congestion management, so iWARP is scalable and ideal for a hyper-converged storage solution like Storage Spaces Direct. iWARP is also the recommended configuration from Microsoft in some circumstances.

We hope that answers all your questions! We encourage you to check out the other webcasts in our “Great Storage Debate” series. To date, our experts have had friendly, vendor-neutral debates on File vs. Block vs. Object Storage, Fibre Channel vs. iSCSI, FCoE vs. iSCSI vs. iSER, and Centralized vs. Distributed Storage. Happy debating!


RoCE vs. iWARP – The Next “Great Storage Debate”

John Kim

Jul 16, 2018

By now, we hope you’ve had a chance to watch one of the webcasts from the SNIA Ethernet Storage Forum’s “Great Storage Debate” webcast series. To date, our experts have had friendly, vendor-neutral debates on File vs. Block vs. Object Storage, Fibre Channel vs. iSCSI, and FCoE vs. iSCSI vs. iSER. The goal of this series is not to have a winner emerge, but rather educate the attendees on how the technologies work, advantages of each, and common use cases. Our next great storage debate will be on August 22, 2018 where our experts will debate RoCE vs. iWARP. They will discuss these two commonly known RDMA protocols that run over Ethernet: RDMA over Converged Ethernet (RoCE) and the IETF-standard iWARP. Both are Ethernet-based RDMA technologies that can increase networking performance. Both reduce the amount of CPU overhead in transferring data among servers and storage systems to support network-intensive applications, like networked storage or clustered computing. Join us on August 22nd, as we’ll address questions like:
  • Both RoCE and iWARP support RDMA over Ethernet, but what are the differences?
  • What are the use cases for RoCE and iWARP, and what differentiates them?
  • UDP/IP and TCP/IP: which RDMA standard uses which protocol, and what are the advantages and disadvantages?
  • What are the software and hardware requirements for each?
  • What are the performance/latency differences of each?
Get this on your calendar by registering now. Our experts will be on-hand to answer your questions on the spot. We hope to see you there! Visit snia.org to learn about the work SNIA is doing to lead the storage industry worldwide in developing and promoting vendor-neutral architectures, standards, and educational services that facilitate the efficient management, movement, and security of information.  


Ethernet Networked Storage – FAQ

Fred Zhang

Dec 8, 2016


At our SNIA Ethernet Storage Forum (ESF) webcast “Re-Introduction to Ethernet Networked Storage,” we provided a solid foundation on Ethernet networked storage, the move to higher speeds, challenges, use cases and benefits. Here are answers to the questions we received during the live event.

Q. Within the iWARP protocol there is a layer called MPA (Marker PDU Aligned Framing for TCP) inserted for storage applications. What is the point of this protocol?

A. MPA is an adaptation layer between the iWARP Direct Data Placement Protocol and TCP/IP. It provides framing and CRC protection for Protocol Data Units.  MPA enables packing of multiple small RDMA messages into a single Ethernet frame.  It also enables an iWARP NIC to place frames received out-of-order (instead of dropping them), which can be beneficial on best-effort networks. More detail can be found in IETF RFC 5044 and IETF RFC 5041.
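
Purely to illustrate the “framing plus CRC protection” idea, here is a toy Python sketch. It is not the MPA wire format: real MPA uses a CRC32c checksum and periodic markers as defined in RFC 5044, whereas this sketch simply length-prefixes a payload and appends a standard CRC-32.

```python
# Toy illustration of "framing + CRC" over a byte stream. This is NOT the MPA
# wire format (MPA uses CRC32c and periodic markers per RFC 5044); it only
# shows the general idea of delimiting a PDU and protecting it with a checksum.
import struct
import zlib

def frame_pdu(payload: bytes) -> bytes:
    header = struct.pack(">H", len(payload))          # 2-byte length prefix
    crc = struct.pack(">I", zlib.crc32(header + payload) & 0xFFFFFFFF)
    return header + payload + crc

def unframe_pdu(frame: bytes) -> bytes:
    (length,) = struct.unpack(">H", frame[:2])
    payload, crc = frame[2:2 + length], frame[2 + length:]
    expected = struct.pack(">I", zlib.crc32(frame[:2 + length]) & 0xFFFFFFFF)
    if crc != expected:
        raise ValueError("CRC mismatch")
    return payload

print(unframe_pdu(frame_pdu(b"hello, rdma")))  # b'hello, rdma'
```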

Q. What is the API for RDMA network IPC?

A. The general API for RDMA is called verbs. The OpenFabrics Verbs Working Group oversees the development of the verbs definition and functionality in the OpenFabrics Software (OFS) code. You can find the training content from the OpenFabrics Alliance here. General information about RDMA for Ethernet (RoCE) is available at the InfiniBand Trade Association website. Information about the Internet Wide Area RDMA Protocol (iWARP) can be found at the IETF: RFC 5040, RFC 5041, RFC 5042, RFC 5043, RFC 5044.

Q. With respect to NVMe over Fabrics, RDMA requires TCP/IP (iWARP), InfiniBand, or RoCE to operate on. Therefore, what are the advantages and disadvantages of iWARP vs. RoCE?

A. Both RoCE and iWARP support RDMA over Ethernet. iWARP uses TCP/IP while RoCE uses UDP/IP. Debating which one is better is beyond the scope of this webcast, but you can learn more by watching the SNIA ESF webcast, “How Ethernet RDMA Protocols iWARP and RoCE Support NVMe over Fabrics.”

Q. 100Gb Ethernet Optical Data Center solution?

A. 100Gb Ethernet optical interconnect products were first available around 2011 or 2012 in a 10x10Gb/s design (100GBASE-CR10 for copper, 100GBASE-SR10 for optical) which required thick cables and a CXP and a CFP MSA housing. These were generally used only for switch-to-switch links. Starting in late 2015, the more compact 4x25Gb/s design (using the QSFP28 form factor) became available in copper (DAC), optical cabling (AOC), and transceivers (100GBASE-SR4, 100GBASE-LR4, 100GBASE-PSM4, etc.). The optical transceivers allow 100GbE connectivity up to 100m, or 2km and 10km distances, depending on the type of transceiver and fiber used.

Q. Where is FCoE being used today?

A. FCoE is primarily used in blade server deployments where there could be contention for PCI slots and only one built-in NIC. These NICs typically support FCoE at 10Gb/s speeds, passing both FC and Ethernet traffic via a connection to a Top-of-Rack FCoE switch, which forwards traffic to the respective fabrics (FC and Ethernet). However, it has not gained much acceptance outside of the blade server use case.

Q. Why did iSCSI start out mostly in lower-cost SAN markets?

A. When it first debuted, iSCSI packets were processed by software initiators which consumed CPU cycles and showed higher latency than Fibre Channel. Achieving high performance with iSCSI required expensive NICs with iSCSI hardware acceleration, and iSCSI networks were typically limited to 100Mb/s or 1Gb/s while Fibre Channel was running at 4Gb/s. Fibre Channel is also a lossless protocol, while TCP/IP is lossy, which caused concerns for storage administrators. Now, however, iSCSI can run on 25, 40, 50 or 100Gb/s Ethernet with various types of TCP/IP acceleration or RDMA offloads available on the NICs.

Q. What are some of the differences between iSCSI and FCoE?

A. iSCSI runs SCSI protocol commands over TCP/IP (except iSER which is iSCSI over RDMA) while FCoE runs Fibre Channel protocol over Ethernet. iSCSI can run over layer 2 and 3 networks while FCoE is Layer 2 only. FCoE requires a lossless network, typically implemented using DCB (Data Center Bridging) Ethernet and specialized switches.

Q. You pointed out that at least twice that people incorrectly predicted the end of Fibre Channel, but it didn’t happen. What makes you say Fibre Channel is actually going to decline this time?

A. Several things are different this time. First, Ethernet is now much faster than Fibre Channel instead of the other way around. Second, Ethernet networks now support lossless and RDMA options that were not previously available. Third, several new solutions–like big data, hyper-converged infrastructure, object storage, most scale-out storage, and most clustered file systems–do not support Fibre Channel. Fourth, none of the hyper-scale cloud implementations use Fibre Channel and most private and public cloud architects do not want a separate Fibre Channel network–they want one converged network, which is usually Ethernet.

Q. Which storage protocols support RDMA over Ethernet?

A. The Ethernet RDMA options for storage protocols are iSER (iSCSI Extensions for RDMA), SMB Direct, NVMe over Fabrics, and NFS over RDMA. There are also storage solutions that use proprietary protocols supporting RDMA over Ethernet.


Ethernet RDMA Protocols Support for NVMe over Fabrics – Your Questions Answered

David Fair

Mar 21, 2016


Our recent SNIA Ethernet Storage Forum Webcast on How Ethernet RDMA Protocols iWARP and RoCE Support NVMe over Fabrics generated a lot of great questions. We didn’t have time to get to all of them during the live event, so as promised, here are the answers. If you have additional questions, please comment on this blog and we’ll get back to you as soon as we can.

Q. Are there still actual (memory based) submission and completion queues, or are they just facades in front of the capsule transport?

A. On the host side, they’re “facades” as you call them. When running NVMe/F, host reads and writes do not actually use NVMe submission and completion queues. That data just comes from and to RNIC RDMA queues. On the target side, there could be real NVMe submissions and completion queues in play. But the more accurate answer is that it is “implementation dependent.”

Q. Who places the command from NVMe queue to host RDMA queue from software standpoint?

A. This is managed by the kernel host software in code written to the NVMe/F specification. The idea is that any existing application that thinks it is writing to the existing NVMe host software will in fact cause the SQE entry to be encapsulated and placed in an RDMA send queue.
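
As a purely illustrative sketch of the encapsulation just described, the snippet below wraps a single submission queue entry in a command capsule and appends it to a stand-in RDMA send queue. The class and field names are invented for illustration and are not taken from the NVMe over Fabrics specification.

```python
# Illustrative only: the class and field names below are invented and are not
# taken from the NVMe over Fabrics specification. The point is simply that one
# SQE is wrapped in one command capsule and posted to an RDMA send queue,
# rather than being written to a local NVMe submission queue.
from dataclasses import dataclass
from collections import deque

@dataclass
class NvmeSqe:            # hypothetical stand-in for a submission queue entry
    opcode: int
    namespace_id: int
    lba: int
    length_blocks: int

@dataclass
class CommandCapsule:     # hypothetical capsule: exactly one SQE, optional in-capsule data
    sqe: NvmeSqe
    inline_data: bytes = b""

rdma_send_queue = deque()  # stand-in for the RNIC's RDMA send queue

def submit_fabric_command(sqe: NvmeSqe, inline_data: bytes = b"") -> None:
    rdma_send_queue.append(CommandCapsule(sqe, inline_data))

submit_fabric_command(NvmeSqe(opcode=0x02, namespace_id=1, lba=0, length_blocks=8))
print(len(rdma_send_queue))  # 1 capsule queued for the fabric
```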

Q. You say “most enterprise switches” support NVMe/F over RDMA, I guess those are ‘new’ ones, so what is the exact question to ask a vendor about support in an older switch?

A. For iWARP, any switch that can handle Internet traffic will do. Mellanox and Intel have different answers for RoCE / RoCEv2. Mellanox says that for RoCE, it is recommended, but not required, that the switch support Priority Flow Control (PFC). Most new enterprise switches support PFC, but you should check with your switch vendor to be sure. Intel believes RoCE was architected around DCB. The name itself, RoCE, stands for “RDMA over Converged Ethernet,” i.e., Ethernet with DCB. Intel believes RoCE in general will require PFC (or some future standard that delivers equivalent capabilities) for efficient RDMA over Ethernet.

Q. Can you comment on when one should use RoCEv2 vs. iWARP?

A. We gave a high-level overview of some of the deployment considerations on slide 30. We refer you to some of the vendor links on slide 32 for “non-vendor neutral” perspectives.

Q. If you take RDMA out of equation, what is the key advantage of NVMe/F over other protocols? Is it that they are transparent to any application?

A. NVMe/F allows the application to bypass the SCSI stack and uses native NVMe commands across a network. Most other block storage protocols require using the SCSI protocol layer, translating the NVMe commands into SCSI commands. With NVMe/F you also gain parallelism, simplicity of the command set, a separation between administrative sessions and data sessions, and a reduction of latency and processing required for NVMe I/O operations.

Q. Is ROCE v1 compatible with ROCE v2?

A. Yes. Adapters speaking RoCEv2 can also maintain RDMA connections with adapters speaking RoCEv1 because RoCEv2 ports are backwards interoperable with RoCEv1. Most of the currently shipping NICs supporting RoCE support both RoCEv1 and RoCEv2.

Q. Are RoCE and iWARP the only way to use Ethernet as a fabric for NMVe/F?

A. Initially yes; only iWARP and RoCE are supported for NVMe over Ethernet. But the NVM Express Working Group is also targeting FCoE. We should have probably been clearer about that, though it is noted on slide 11.

Q. What about doing NVMe over Fibre Channel? Is anyone looking at, or doing this?

A. Yes. This is not in scope for the first spec release, but the NVMe WG is collaborating with the FCIA on this. So NVMe over Fibre Channel is expected as another standard in the near future, to be promoted by T11.

Q. Do RoCE and iWARP both use just IP addresses for management or is there a higher level addressing mechanism, and management?

A. RoCEv2 uses the RoCE Connection Manager, and iWARP uses TCP connection management. They both use IP for addressing.

Q. Are there other fabrics to run NVMe over fabrics? Can you do this over OmniPath or Infiniband?

A. InfiniBand is in scope for the first spec release. Also, there is a related effort by the FCIA to support NVMe over Fibre Channel in a standard that will be promoted by T11.

Q. You indicated the NVMe stack is in the kernel while RDMA is a user-level verb. How are NVMe SQ/CQ entries transferred from NVMe to RDMA and vice versa? Also, could smaller transfers in NVMe (e.g., an SGL of 512B) be combined into larger sizes before being sent to RDMA entries, and vice versa?

A. NVMe/F supports multiple scatter-gather entries to combine multiple non-contiguous transfers; nevertheless, the protocol doesn’t support chaining multiple NVMe commands in the same command capsule. A command capsule contains only a single NVMe command. Please also refer to slide 18 from the presentation.

Q. 1) How do implementers and adopters today test NVMe deployments? 2) Besides latency, what other key performance indicators do implements and adopters look for to determine whether the NVMe deployment is performing well or not?

A. 1) Like any other datacenter specification, testing is done by debugging, interop testing and plugfests. Local NVMe is well supported and can be tested by anyone. NVMe/F can be tested using pre-standard drivers or solutions from various vendors. UNH-IOL is an organization with an excellent reputation for helping here. 2) Latency, yes. But also sustained bandwidth, IOPS, and CPU utilization, i.e., the “usual suspects.”

Q. If RoCE CM supports ECN, why can’t it be used to implement a full solution without requiring PFC?

A. Explicit Congestion Notification (ECN) is an extension to TCP/IP defined by the IETF. First point is that it is a standard for congestion notification, not congestion management. Second point is that it operates at L3/L4. It does nothing to help make the L2 subnet “lossless.” Intel and Mellanox agree that generally speaking, all RDMA protocols perform better in a “lossless,” engineered fabric utilizing PFC (or some future standard that delivers equivalent capabilities). Mellanox believes PFC is recommended but not strictly required for RoCE, so RoCE can be deployed with PFC, ECN, or both. In contrast, Intel believes that for RoCE / RoCEv2 to deliver the “lossless” performance users expect from an RDMA fabric, PFC is in general required.

Q. How involved are Ethernet RDMA efforts with the SDN/OCP community? Is there a coming example of RoCE or iWarp on an SDN switch?

A. Good question, but neither RoCEv2 nor iWARP look any different to switch hardware than any other Ethernet packets. So they’d both work with any SDN switch. On the other hand, it should be possible to use SDN to provide special treatment with respect to say congestion management for RDMA packets. Regarding the Open Compute Project (OCP), there are various Ethernet NICs and switches available in OCP form factors.

Q. Is there a RoCE v3?

A. No. There is no RoCEv3.

Q. iWARP and RoCE both fall back to TCP/IP in the lowest communication sense? So they are somewhat compatible?

A. They can speak sockets to each other. In that sense they are compatible. However, for the usage model we’re considering here, NVMe/F, RDMA is required. Because of L3/L4 differences, RoCE and iWARP RNICs cannot speak RDMA to each other.

Q. So in case of RDMA (ROCE or iWARP), the NVMe controller’s fabric port is Ethernet?

A. Correct. But it must be RDMA-enabled Ethernet.

Q. What if I am using soft RoCE, do I still need an RNIC?

A. Functionally, soft RoCE or soft iWARP should work on a regular NIC. Whether the performance is sufficient to keep up with NVMe SSDs without the hardware offloads is a different matter.

Q. How would the NVMe controller know that a command is placed in the submission queue by the Fabric host driver? Is the fabric host driver responsible for notifying the NVMe controller through remote doorbell trigger or the Fabric target driver should trigger the doorbell?

A. No separate notification by the host required. The fabric’s host driver simply sends a command capsule to notify its companion subsystem driver that there is a new command to be processed. The way that the subsystem side notifies the backend NVMe drive is out of the scope of the protocol.

Q. I am chair of the ETSI NFV working group on NFV acceleration. We are working on virtual RDMA and how a VM can benefit from hardware-independent RDMA. One cornerstone of this is a virtual-RDMA pseudo device. But there is not yet consensus on the minimal set of verbs to be supported: do you think this minimal verb set can be identified? Lastly, the transport address space is not consistent between InfiniBand and Ethernet. How can transport-independent RDMA be supported?

A. You know, the NVM Express Working Group is working on exactly these questions. They have to define a “minimal verb set” since NVMe/F generates the verbs. Similarly, I’d suggest looking to the spec to see how they resolve the transport address space differences.

Q. What’s the plan for Linux submission of NVMe over Fabric changes? What releases are being targeted?

A. The Linux Driver WG in the NVMe WG expects to submit code upstream within a quarter of the spec being finalized. At this time it looks like the most likely Linux target will be kernel 4.6, but it could end up being kernel 4.7.

Q. Are NVMe SQ/CQ transferred transparently to RDMA Queues or can they be modified?

A. The method defined in the NVMe/F specification entails a transparent transfer. If you wanted to modify an SQE or CQE, do so before initiating an NVMe/F operation.

Q. How common are rNICs for recent servers? i.e. What’s a quick check I can perform to find out if my NIC is an rNIC?

A. rNICs are offered by nearly all major server vendors. The best way to check is to ask your server or NIC vendor if your NIC supports iWARP or RoCE.
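
One rough supplementary check on a Linux host: adapters that have registered with the kernel RDMA subsystem (whether InfiniBand, RoCE or iWARP) appear under /sys/class/infiniband. The sketch below assumes a Linux system; an empty result does not prove the NIC lacks RDMA capability, since the relevant driver may simply not be loaded, so the vendor check above remains the authoritative answer.

```python
# Rough Linux-only check: RDMA-capable adapters (InfiniBand, RoCE, or iWARP)
# that have registered with the kernel RDMA subsystem appear under
# /sys/class/infiniband. An empty list does not prove the NIC is not an rNIC;
# the driver may not be loaded, so confirm with the NIC vendor as noted above.
import os

def list_rdma_devices(sysfs_path="/sys/class/infiniband"):
    try:
        return sorted(os.listdir(sysfs_path))
    except FileNotFoundError:
        return []

devices = list_rdma_devices()
print(devices if devices else "No RDMA devices registered with the kernel")
```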

Q. This is most likely out of the scope of this talk, but could you perhaps share, at a 30,000-foot level, the differences between “NVMe controller” hardware versus “NVMe/F” hardware? It’s most likely a combination of an R-NIC plus an NVMe controller, but it would be great to get your take on this.

A. A goal of the NVMe/F spec is that it works with all existing NVMe controllers and all existing RoCE and iWARP RNICs. So even at a very low level, we can say “no difference.” That said, of course, nothing stops someone from combining NVMe controller and rNIC hardware into one solution.

Q. Are there any example Linux targets in the distros that exercise RDMA verbs? An iWARP or iSER target in a distro?

A. iSER allows this using a LIO or TGT SCSI target.

Q. Is there a standard or IP for RDMA NIC?

A. The various RNICs are based on IBTA, IETF, and IEEE standards, as shown on slide 26.

Q. What is the typical additional latency introduced comparing NVMe over Fabric vs. local NVMe?

A. In the 2014 IDF demo, the prototype NVMe/F stack matched the bandwidth of local NVMe with a latency penalty of only 8µs over a local iWARP connection. Other demonstrations have shown an added fabric latency of 3µs to 15µs. The goal for the final spec is under 10µs.

Q. How well is NVMe over RDMA supported on Windows?

A. It is not currently supported, but then the spec isn’t even finished. Contact Microsoft if you are interested in their plans.

Q. Would RDMA over Ethernet not support Layer 2 switching? How do you deal with the TCP overhead?

A. L2 switching is supported by both iWARP and RoCE. Both flavors of RNICs have MAC addresses, etc. iWARP had to deal with TCP/IP in hardware, a TCP/IP Offload Engine or TOE. The TOE used in an iWARP RNIC is significantly constrained compared to a general purpose TOE and therefore can operate with very high performance. See the Chelsio website for proof points. RoCE does not use TCP so does not need to deal with TCP overhead.

Q. Does RDMA not work with fibre channel?

A. They are totally different Transports (L4) and Networks (L3). That said, the FCIA is working with NVMe, Inc. on supporting NVMe over Fibre Channel in a standard to be promoted by T11.


How Ethernet RDMA Protocols iWARP and RoCE Support NVMe over Fabrics

David Fair

Jan 6, 2016


NVMe (Non-Volatile Memory Express) over Fabrics is of tremendous interest among storage vendors, flash manufacturers, and cloud and Web 2.0 customers. Because it offers efficient remote and shared access to a new generation of flash and other non-volatile memory storage, it requires fast, low latency networks, and the first version of the specification is expected to take advantage of RDMA (Remote Direct Memory Access) support in the transport protocol.

Many customers and vendors are now familiar with the advantages and concepts of NVMe over Fabrics but are not familiar with the specific protocols that support it. Join us on January 26th for this live Webcast that will explore and compare the Ethernet RDMA protocols and transports that support NVMe over Fabrics and the infrastructure needed to use them. You’ll hear:

  • Why NVMe Over Fabrics requires a low-latency network
  • How the NVMe protocol is mapped to the network transport
  • How RDMA-capable protocols work
  • Comparing available Ethernet RDMA transports: iWARP and RoCE
  • Infrastructure required to support RDMA over Ethernet
  • Congestion management methods

The event is live, so please bring your questions. We look forward to answering them.


New ESF Webcast: Benefits of RDMA in Accelerating Ethernet Storage Connectivity

David Fair

Jan 30, 2015


We’re kicking off our 2015 ESF Webcasts on March 4th with what we believe is an intriguing topic – how RDMA technologies can accelerate Ethernet Storage. Remote Direct Memory Access (RDMA) has existed for many years as an interconnect technology, providing low latency and high bandwidth in computing clusters. More recently, RDMA has gained traction as a method for accelerating storage connectivity and interconnectivity on Ethernet. In this Webcast, experts from Emulex, Intel and Microsoft will discuss:

  • Storage protocols that take advantage of RDMA
  • Overview of iSER for block storage
  • Deep dive of SMB Direct for file storage.
  • Benefits of available RDMA technologies to accelerate your Ethernet storage connectivity, both iWARP and RoCE

Register now. This live Webcast will provide attendees with a vendor-neutral look at RDMA technologies and should prove to be an interactive and informative event. I hope you’ll join us!


The Performance Impact of NVMe and NVMe over Fabrics – Q&A

J Metz

Dec 3, 2014


More than 400 people have already seen our recent live ESF Webcast, “The Performance Impact of NVMe and NVMe over Fabrics.” If you missed it, it’s now available on-demand. It was a great session with a lot of questions from attendees. We did not have time to address them all – so here is a complete Q&A courtesy of our experts from Cisco, EMC and Intel. If you think of additional questions, please feel free to comment on this blog.

Q. Are you saying that just by changing the interface to NVMe for any SSD, one would greatly bump up the IOPS?

A. NVMe SSDs have higher IOPs than SAS or SATA SSDs due to several factors, including the low latency of PCIe and the efficiency of the NVMe protocol.

Q. How much of the speed of NVMe you have shown is due to the simpler NVMe protocol vs. using flash? I.e., how would the SAS performance change when you attach an SSD to SAS?

A. The performance differences shown comparing NVMe to SAS to SATA were all using solid-state drives (all NAND Flash). Thus, the difference shown was due to the interface.

Q. Can you comment on the test conditions these results were obtained under and what are the main reasons NVMe outperforms the others?

A. The most important reason NVMe outperforms other interfaces is that it was architected for NVM – rather than inheriting the legacy of HDDs. NVMe is built on the foundation of a very efficient multi-queue model and a simple, hardware-automatable command set that results in very low latency and high performance. Details for the IOPS and bandwidth comparisons are shown in the footnotes of the corresponding foils. For the efficiency tests, the detailed setup information was inadvertently removed from the backup. This will be corrected.

Q. What is the IOPS difference between NVMe and SAS at the same queue depth?

A. At a queue depth of 32, for the particular devices shown with 4K random reads, NVMe delivers ~267K IOPS and SAS ~149K IOPS. SAS does not improve when the queue depth is increased to 128. NVMe performance increases to ~472K IOPS at a queue depth of 128.
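
As a quick sanity check on the figures above, Little's Law (outstanding I/O = IOPS × average latency) lets you back out the implied average latency at each queue depth; the sketch below uses only the numbers quoted in the answer.

```python
# Little's Law sanity check on the figures quoted above:
#   outstanding I/O (queue depth) = IOPS x average latency
# so the implied average latency is queue_depth / IOPS.

cases = [
    ("NVMe, QD 32 ", 32, 267_000),
    ("NVMe, QD 128", 128, 472_000),
    ("SAS,  QD 32 ", 32, 149_000),
]

for name, qd, iops in cases:
    latency_us = qd / iops * 1e6
    print(f"{name}: implied average latency ~{latency_us:.0f} us")
```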

Q. Why not use PCIe directly instead of the NVMe layer on PCIe?

A. PCI Express is used directly. NVM Express is the standard software interface for high-performance PCI Express storage devices. PCI Express does not define a register interface, DMA model, command set, or feature set for PCIe storage devices. NVM Express replaces the proprietary software interfaces and drivers previously used by PCIe SSDs in the market.

Q. Is the Working Group considering adding things like enclosure identification in the transport abstraction so the host/client can identify where the NVMe drives reside?

A. The NVM Express organization is developing a Management Interface specification set for release in Q1’2015 that will enable standardized enclosure management. The intent is that these features could be used regardless of fabric type (PCIe, RDMA, etc).

Q. Are there APIs in the software interface for device query information and device RAID configuration?

A. NVMe includes an Identify Controller and Identify Namespace command that provides information about the NVMe subsystem, controllers, and namespaces. It is possible to create a RAID controller that uses the NVMe interface if desired. Higher level software APIs are typically defined by the OSV.

Q. 1. Are NVMe drivers today multi-threaded? 2. If I were to buy an NVMe device today, can you suggest a list of vendors whose solutions are used today in data centers (i.e., in production, and not proof of concept or prototype)?

A. The NVM Express drivers are designed for multi-threading – each I/O queue may be owned/controlled by one thread without synchronization with other driver threads. A list of devices that have passed NVMe interoperability and conformance testing are on the NVMe Integrator’s List.

Q. When do you think the market will consolidate around NVMe/PCIe-based SSDs, and when will the SATA era end?

A. By 2018, IDC predicts that Enterprise SSDs by Interface will be PCIe=38%, SAS=28%, and SATA=34%. By 2018, Samsung predicts over 70% of client SSDs will be PCIe. Based on forecasts like this, we expect strong growth for NVMe as the standard PCIe SSD interface in both Enterprise and Client segments.

Q. Why can’t it be like a graphics card, which does memory transactions?

A. SSDs of today have much longer latency than memory – where a read from a typical NAND page takes > 50 microseconds. However, as next generation NVM comes to market over the next few years, there may be blurring of the lines between storage and memory, where next generation NVM may be used as very fast storage (like NVMe) or as memory as in NVDIMM type of usage models.

Q. It seems that most NVMe drive vendors supply proprietary drivers for their drives. What’s the value of NVMe over proprietary interfaces given this? Will we eventually converge on the open source driver?

A. As the NVMe ecosystem matures, we would expect most implementations would use inbox drivers that are present in many OSes, like Windows, Linux, and Solaris. However, in some Enterprise applications, a vendor may have a value added feature that could be delivered via their own software driver. OEMs and customers will decide whether to use inbox drivers or a vendor specific driver based on whether the value provided by the vendor is significant.

Q. To create an interconnect to a scale-out storage system with many NVMe drives, does that mean you would need an aggregated fabric link (with multiple RDMA links) to provide enough bandwidth for multiple NVMe drives?

A. It depends on the speed of the fabric links and the number of NVMe drives. Ideally, the target system would be configured such that the front-end fabric and back-end NVMe drives were bandwidth balanced. Scaling out multi-drive subsystems on a fabric may require the use of fat-tree switch topologies, which may be constructed using some form of link aggregation. The performance of the PCIe NVMe drives is expected to put high bandwidth demands on the front-end network interconnect. Each NVMe SFF-8639 2.5” drive has a PCIe Gen3 x4 interconnect with the capability to produce 3+ GB/s (24 Gb/s) of sustained storage bandwidth. There are multiple production server systems with 4-8 NVMe SFF-8639 drive bays, which puts these platforms at roughly 200 Gb/s of capability when used as NVMe over Fabrics storage servers. The combination of PCIe NVMe drives and NVMe over Fabrics targets is going to have a significant impact on datacenter storage performance.
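
A quick arithmetic check of the bandwidth-balancing point above, using the per-drive figure quoted in the answer (roughly 3+ GB/s, i.e. about 24 Gb/s, of sustained bandwidth per PCIe Gen3 x4 NVMe drive):

```python
# Front-end vs. back-end bandwidth balance for an NVMe over Fabrics target,
# using the per-drive figure quoted above (~3+ GB/s, i.e. ~24 Gb/s sustained
# per PCIe Gen3 x4 NVMe drive).

GBPS_PER_DRIVE = 24  # approximate sustained bandwidth per NVMe drive, Gb/s

for drive_count in (4, 6, 8):
    backend_gbps = drive_count * GBPS_PER_DRIVE
    print(f"{drive_count} drives -> ~{backend_gbps} Gb/s of front-end fabric "
          f"bandwidth needed to stay balanced")
```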

Q. In other forums we heard about NVMe extensions to deliver vendor specific value add features. Do we have any updates?

A. Each vendor is allowed to add their own vendor specific features and value. It would be best to discuss any vendor specific features with the appropriate vendor.

Q. Given that PCIe is not a scalable fabric at least from a storage perspective, do you see the need for SAS SSDs to increase or diminish over time? Or is your view that NVMe SSDs will populate the tier between DRAM and say, rotating media like SAS HDDs?

A. NVMe SSDs are the highest-performance SSDs available today. If there is a box of NVMe SSDs, the most appropriate connection to that JBOD may be Ethernet or another fabric that then fans out inside the JBOD to PCIe/NVMe SSDs.

Q. From a storage industry perspective, what deficiencies does NVMe have to displace SAS? Will that transition ever happen?

A. NVMe SSDs are seeing initial broad deployment primarily in server use cases that prize high performance. Storage applications require a robust, high-availability interface. NVMe has defined support for dual port, reservations, and other high-availability features. NVMe will be used in storage applications as these high-availability features mature in products.

Q. Will NVMe over Fabrics allow DMA reads/writes of the NVMe device directly (without going through system memory)?

A. The locality of the NVMe over Fabrics buffers on the target side is target-implementation specific. That said, one could construct a target that uses a pool of PCIe NVMe subsystem controller-resident memory as the source and/or sink buffers for the fabric NIC’s NVMe data exchanges. This type of configuration has the limitation of having to pre-determine fabric-data-to-NVMe-device locality; otherwise the data could end up in the wrong drive’s controller memory.

Q. Intel True Scale Fabric technology was based on the Fulcrum ASIC. Could you please explain how Intel Omni Scale differs from the Intel True Scale fabric?

A. In the context of NVMe over Fabrics, the Intel Omni-Path fabric is a possible future fabric candidate for an NVMe over Fabrics definition. Specifics of the fabric itself are outside the scope of the NVMe over Fabrics definition. For information on Omni-Path, please refer to http://www.intel.com/content/www/us/en/omni-scale/intel-omni-scale-fabric-demo.html?wapkw=omni-scale.

Q. Can the host side NVMe client be in user mode since it is using RDMA?

A. It is possible, since RDMA QP communications allow both user-mode and kernel-mode access to the RDMA verbs. However, there are implications to consider. The NVMe host software currently resides in multiple operating systems as a kernel-level block-storage driver. The goal is for NVMe over Fabrics to share common NVMe code between multiple fabric types in order to provide a consistent and sustainable core NVMe software base. When NVMe over Fabrics is moved to user level, it essentially becomes a separate single-fabric software solution that must be maintained independently of the multi-fabric kernel NVMe software. There are performance advantages to a user-level interface, such as avoiding O/S system calls and being able to poll completions, but these have to be weighed against the loss of kernel-resident functionality, such as upper-level kernel storage software, and the cost of sustaining the software.

Q. What role will or could the InfiniBand protocol play in the NVMe concept?

A. InfiniBand™ is one of the supported RDMA fabrics for NVMe over RDMA. NVMe over RDMA will support the family of RDMA fabrics through use of a common set of RDMA verbs. This will allow users to select the RDMA fabric type based on their fabric requirements and not be limited to any one RDMA fabric type for NVMe over RDMA.

Q. Where is the experimental NVMe-oF driver and FIO code available?

A. FIO is a common Linux storage benchmarking tool and is available from multiple Internet sites. The drivers used in the demo were developed specifically as a proof of concept and demonstration for Intel IDF 2014 and were based on a pre-standard implementation of the NVMe over RDMA wire protocol. The standard NVMe over RDMA wire protocol is currently under definition in NVM Express, Inc. Once the standard is complete, both Host and reference Target drivers for Linux will be developed.
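
As an illustration of the kind of workload such a test generates, the sketch below drives FIO from Python for a 4 KiB random-read run. The device path, queue depth, job count, and runtime are assumed example values, not the parameters used in the demo.

    # Minimal sketch: run a 4 KiB random-read FIO workload against an NVMe namespace.
    # Device path, queue depth, job count, and runtime are illustrative assumptions.
    import subprocess

    subprocess.run([
        "fio",
        "--name=nvme-randread",
        "--filename=/dev/nvme0n1",   # raw NVMe namespace (read-only workload shown here)
        "--ioengine=libaio",
        "--direct=1",
        "--rw=randread",
        "--bs=4k",
        "--iodepth=32",
        "--numjobs=4",
        "--time_based",
        "--runtime=60",
        "--group_reporting",
    ], check=True)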

Q. Was polling on the completion queue used on the target side in the prototype?

A. The target-side PoC implementation used a polling technique for both the NVMe over RDMA CQ and the NVMe CQ. This was done to minimize latency by eliminating interrupt latency on the target for both CQs. Depending on the O/S and the interrupt moderation settings of both the RDMA and NVMe devices, interrupt latency is typically around 2 microseconds. If polling is not the desired model, Intel processors enable another form of event signaling called Monitor/Mwait, where latency is typically in the 500 ns range.
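
To see why that matters at NVMe latencies, here is a rough Python calculation of how much of a short I/O the completion-signaling overhead consumes; the 10 µs device read latency is an assumed round number for illustration only.

    # Rough share of total I/O latency consumed by completion signaling (illustrative)
    device_read_us = 10.0    # assumed low-latency 4K NVMe read, for illustration
    interrupt_us = 2.0       # typical interrupt latency (from the answer above)
    mwait_us = 0.5           # Monitor/Mwait signaling (from the answer above)

    for name, overhead in [("interrupt", interrupt_us), ("monitor/mwait", mwait_us)]:
        print(f"{name}: {overhead / (device_read_us + overhead):.0%} of total latency")
    # -> interrupt: ~17% of total latency, monitor/mwait: ~5%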

Q. In the prototype over iWARP, did the remote device dma write/read the NVMe device directly, or did it go through remote system memory?

A. In the PoC, all NVMe commands and command data went through remote system memory. Only the NVMe commands were accessed by the CPU; the data was not touched.

Q. Are there any dependencies between NVMe-oF using RDMA and iWARP? Can standard software RDMA in Linux distributions be used without the need for iWARP support?

A. As mentioned, NVMe over RDMA will be RDMA-type agnostic and will work with all RDMA providers that support a common set of RDMA verbs.

Q. PCIe doesn’t support multi-host access to devices. Does NVMe over fabric require movement away from PCIe?

A. The NVMe 1.1 specification specifically added features for multi-host support, allowing NVMe subsystems to have multiple NVMe controllers and multiple fabric ports. This model is supported in PCI Express by multi-function/multi-port PCIe drives (typically referred to as dual-port). Depending on the fabric type, NVMe over Fabrics will extend to configurations with many hosts sharing a single NVMe subsystem consisting of multiple NVMe controllers.

Q. In light of the fact that NVMe over Fabrics reintroduces more of the SCSI architecture, can you compare and contrast NVMe over Fabrics with ‘SCSI Express’ (SAM/SPC/SBC/SOP/PQI)?

A. NVMe over Fabrics is not a SCSI model; it extends the NVMe model onto other fabric types. The goal is to maintain the simplicity of the NVMe model, such as the small number of NVMe command types, the multi-queue interface model, and efficient NVM-oriented host and controller implementations. We chose the RDMA fabric as the first fabric because it, too, was architected with a small number of operations, a multi-queue interface model, and efficient low-latency operations.

Q. Is there an open source for NVMe over Fabrics, which was used for the IDF demo? If not, can that be made available to others to see how it was done?

A. Most of the techniques used in the PoC drivers will be implemented in future open-source Host and reference Target drivers. The PoC was both a learning and a demonstration vehicle. Because the PoC drivers used a pre-standard NVMe over RDMA protocol, we feel it is best not to propagate the implementation.

Q. What is the overhead of the protocol? Did you try putting NVMe in front of just DRAM? I’d assume you’ll get much better results and understand the limitations of the protocol much better. In front of DRAM it won’t be NVM, but it will give good data regarding protocol latency.

A. The overhead of the protocol on the host side matched the PCIe NVMe driver. On the target, the PoC protocol efficiency was around 600 ns of compute latency for a complete 4K I/O. For low-latency media, such as DRAM or next-generation NVM, the reduced latency of a solution similar to the PoC will enable effective use of the media’s low-latency characteristics.
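
For a sense of scale, the sketch below converts that 600 ns of per-I/O target compute into a theoretical single-core ceiling. It ignores every other cost (media, fabric, memory), so treat it as an upper bound, not a measured result.

    # Theoretical single-core ceiling implied by ~600 ns of target compute per 4K I/O
    compute_ns_per_io = 600
    ios_per_second = 1e9 / compute_ns_per_io
    print(f"~{ios_per_second / 1e6:.1f}M 4K IOPS per core (compute only, ignoring media and fabric)")
    # -> ~1.7M 4K IOPS per core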

Q. Do you have an FC performance comparison with NVMe?

A. We did implement a 100 GbE FCoE target with NVMe back-end storage for an Intel IDF 2012 demonstration. FCoE is a combination of two models, FCP and SCSI. Our experience with this target implementation showed that we were adding a significant amount of computational latency in both the host (initiator) and target FCoE/FC/SCSI storage stacks, which reduced the performance and efficiency advantages gained in the back-end PCIe NVMe SSDs. A significant component of this computational latency was due to the multiple storage models and associated translations that occurred between the host application and the back-end NVMe drives. This experience led us down the path of enabling an end-to-end NVMe model by expanding the NVMe model onto a range of fabric types.
