Erik Smith

Apr 9, 2025


Have you ever wondered how RDMA (Remote Direct Memory Access) actually works? You’re not alone. That’s why the SNIA Data, Storage & Networking Community (DSN) hosted a live webinar, “Everything You Wanted to Know About RDMA But Were Too Proud to Ask,” where our expert presenters, Michal Kalderon and Rohan Mehta, explained how RDMA works and the essential role it plays for AI/ML workloads thanks to its ability to provide high-speed, low-latency data transfer. The presentation is available on demand, along with the webinar slides, in the SNIA Educational Library.

The live audience was not “too proud” to ask questions, and our speakers have graciously answered all of them here. 

Q: Does the DMA chip ever reside on the newer CPUs or are they a separate chip?

A: Early system designs used an external DMA Controller. Modern systems have moved to an integrated DMA controller design. There is a lot of information about the evolution of the DMA Controller available online. The DMA Wikipedia article on this topic is a good place to start.   

Q: Slide 51: For RoCEv2, are you routing within the rack, across the data center, or off site as well?

A: RoCEv2 operates over Layer 3 networks and typically requires a lossless network environment, achieved through mechanisms like PFC, DCQCN, and the additional congestion control methods discussed in the webinar. These mechanisms are easier to implement and manage within controlled network domains, such as those found in data centers. While RoCEv2 can theoretically span multiple data centers, this is not commonly done due to the complexity of maintaining lossless conditions across such distances.

Q: You said that WQEs have opcodes. What do they specify and how are they defined?

A: The WQE opcodes are the actual RDMA operations referred to on slides 16 and 30: SEND, RECV, WRITE, READ, and ATOMIC. For each of these operations there are additional fields that can be set.
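
To make the opcode list concrete, here is a minimal sketch (not part of the webinar material) of how an application posts an RDMA WRITE work request using the Linux libibverbs API. The queue pair, the registered local buffer, and the peer's address and R-KEY are assumed to have been set up already.

```c
/* Minimal sketch: posting an RDMA WRITE work request with libibverbs.
 * Assumes qp, a registered local buffer, and the peer's address/rkey exist. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int post_rdma_write(struct ibv_qp *qp, void *local_buf, uint32_t len,
                    uint32_t lkey, uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,   /* local source buffer */
        .length = len,
        .lkey   = lkey,                   /* local key from ibv_reg_mr() */
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id      = 1;                    /* returned in the completion */
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_RDMA_WRITE;    /* other opcodes: IBV_WR_SEND,
                                             IBV_WR_RDMA_READ, atomics */
    wr.send_flags = IBV_SEND_SIGNALED;    /* request a completion entry */
    wr.wr.rdma.remote_addr = remote_addr; /* peer's registered address */
    wr.wr.rdma.rkey        = rkey;        /* peer's remote key */

    return ibv_post_send(qp, &wr, &bad_wr);
}
```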

Q: Is there a mistake in slide 27? QP-3 on the receive side (should they be labeled QP-1, QP-2 and QP-3), or am I misunderstanding something?

A: Correct, good catch. The updated deck is here.

Q: Is the latency deterministic after connection?

A: Similar to all network protocols, RDMA is subject to factors such as network congestion, interfering data packets, and other network conditions that can affect latency.

Q: How does the buffering work if a server sends data and the client is unable to receive all of it due to buffer size limitations?

A: We will split the answer for the two groups of RDMA operations.

  • (a) Channel Semantics: SEND and RECV. In this case, the application must make sure that the receiver has posted an RQ buffer before performing the send operation. This is typically done by the client side posting RQ buffers before initiating a connection request. If the send arrives and there is no RQ buffer posted, the packets are dropped and an RNR NAK message is sent to the sender. There is a configurable QP parameter called rnr_retry_cnt, which specifies how many times the RNIC should retry sending the message if it gets an RNR NAK (see the sketch after this list).
  • (b) Memory Semantics: READ, WRITE, and ATOMIC. Here this situation cannot occur, since the data is placed in pre-registered memory at a specific location.
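
As a rough illustration of (a), the sketch below (not from the webinar) shows how an application pre-posts an RQ buffer with libibverbs. The QP and a registered memory region are assumed to exist already, and where the RNR retry count is configured is noted in the closing comment.

```c
/* Minimal sketch: pre-posting one RQ buffer so an incoming SEND has
 * somewhere to land. Assumes qp and a registered memory region (mr)
 * covering buf already exist. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int post_recv_buffer(struct ibv_qp *qp, struct ibv_mr *mr,
                     void *buf, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,
        .lkey   = mr->lkey,          /* local key from ibv_reg_mr() */
    };
    struct ibv_recv_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id   = (uintptr_t)buf;     /* identifies the buffer on completion */
    wr.sg_list = &sge;
    wr.num_sge = 1;

    return ibv_post_recv(qp, &wr, &bad_wr);
}

/* The RNR retry count mentioned above is a QP attribute: with plain verbs it
 * is supplied as qp_attr.rnr_retry when the QP is moved to RTS (7 means
 * "retry indefinitely"); with librdmacm it is rdma_conn_param.rnr_retry_count
 * passed to rdma_connect()/rdma_accept(). */
```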

Q: How/where are the packet headers handled?

A: The packet headers are handled by the R-NIC (RDMA NIC). 

Q: How is Object (S3) over RDMA implemented at a high level? Does it still involve HTTP/S?

A: There are proprietary solutions that provide Object (S3) over RDMA, but there is currently no industry standard available that could be used to create non-vendor specific implementations that would be interoperable with one another.    

Q: How many packets per second can a single x86 core handle versus a single ARM server core over TCP/IP?

A: When measuring RDMA performance, you measure messages per second rather than packets per second; unlike TCP, there is no per-packet processing in the host. The performance depends more on the R-NIC than on the host core, since RDMA bypasses CPU processing. If you’d like a performance comparison between RDMA (RoCE) and TCP, please refer to the “NVMe-oF: Looking Beyond Performance Hero Numbers” webinar.

Q: Could you clarify the reason for higher latency using interrupt method?

A: It depends upon the mode of operation:

Polling Mode: In polling mode, the CPU continuously checks (or "polls") the completion queue for new events. This constant checking eliminates the delay associated with waiting for an interrupt signal, leading to faster detection and handling of events. However, this comes at the cost of higher CPU utilization since the CPU is always active, even when there are no events to process.

Interrupt Mode: In interrupt mode, the CPU is notified of new events via interrupts. When an event occurs, an interrupt signal is sent to the CPU, which then stops its current task to handle the event. This method is more efficient in terms of CPU usage because the CPU can perform other tasks while waiting for events. However, the process of generating, delivering, and handling interrupts introduces additional latency compared to the immediate response of polling.
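
The difference is easy to see in code. The following sketch (not part of the webinar material) contrasts the two modes using libibverbs; it assumes the completion queue was created on a completion channel (ibv_create_comp_channel).

```c
/* Minimal sketch: the two ways to reap completions with libibverbs. */
#include <infiniband/verbs.h>

/* Polling mode: spin on the CQ; lowest latency, but one core stays busy. */
int poll_for_completion(struct ibv_cq *cq, struct ibv_wc *wc)
{
    int n;
    do {
        n = ibv_poll_cq(cq, 1, wc);        /* returns 0 while CQ is empty */
    } while (n == 0);
    return n;                              /* 1 on success, <0 on error */
}

/* Interrupt (event) mode: block until the channel signals a completion. */
int wait_for_completion(struct ibv_comp_channel *ch, struct ibv_cq *cq,
                        struct ibv_wc *wc)
{
    struct ibv_cq *ev_cq;
    void *ev_ctx;

    if (ibv_req_notify_cq(cq, 0))               /* arm the CQ for events */
        return -1;
    if (ibv_get_cq_event(ch, &ev_cq, &ev_ctx))  /* sleeps in the kernel  */
        return -1;
    ibv_ack_cq_events(ev_cq, 1);
    if (ibv_req_notify_cq(ev_cq, 0))            /* re-arm before draining */
        return -1;
    return ibv_poll_cq(ev_cq, 1, wc);           /* still poll to drain CQ */
}
```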

Q: Slide 58: Does RDMA complement or compete with CXL?

A: They are not directly related. RDMA is a protocol used to perform remote DMA operations (i.e., over a network of some kind), while CXL is used to provide high-speed, coherent communication within a single system. CXL.mem allows devices to access memory directly within a single system or a small, tightly coupled group of systems. If this question were specific to DMA, as opposed to RDMA, the answer would be slightly different.

Q: The RDMA demonstration had MTU size on the RDMA NIC set to 1K. Does RDMA traffic benefit from setting the MTU size to a larger setting (3k-9k MTU size) or is that really dependent on the amount of traffic the RoCE application generates over the RDMA NIC?

A: RDMA traffic, like other protocols such as TCP, can benefit from a larger MTU setting. It can reduce packet processing overhead and improve throughput.
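
For connected RDMA QPs, the path MTU used on the wire is capped by the port's active MTU, which in turn reflects the MTU configured on the NIC/interface. Below is a hedged libibverbs sketch of how that value is typically queried and applied; the rest of the QP state transition is omitted.

```c
/* Minimal sketch: choosing the RDMA path MTU for an RC QP. The usable value
 * is capped by the port's active MTU (for RoCE, this follows the Ethernet
 * interface MTU configured on the NIC). */
#include <infiniband/verbs.h>
#include <stdint.h>

enum ibv_mtu pick_path_mtu(struct ibv_context *ctx, uint8_t port_num)
{
    struct ibv_port_attr pattr;

    if (ibv_query_port(ctx, port_num, &pattr))
        return IBV_MTU_1024;       /* conservative fallback, as in the demo */
    return pattr.active_mtu;       /* e.g. IBV_MTU_1024 .. IBV_MTU_4096 */
}

/* Later, during the INIT -> RTR transition (other attributes omitted):
 *     attr.path_mtu = pick_path_mtu(ctx, port_num);
 *     attr_mask    |= IBV_QP_PATH_MTU;
 */
```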

Q: When an app sends a read/write request, how does it get the remote side’s RKEY info? Q2: Is it possible to tweak the LKEY to point to the same buffer for debugging a memory-related issue? Fairly new to the topic, so apologies in advance if any query doesn’t make sense.

A: The RKEYs can be exchanged using channel semantics (SEND/RECV). In that case no R-KEY is needed, because the message arrives in the first buffer posted to the RQ on the peer. LKEY refers to local memory. For every registered memory region, the LKEY and RKEY point to the same location: the LKEY is used by the local RNIC to access the memory, and the RKEY is provided to the remote application so it can access that memory.
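
A small sketch of how a memory registration produces the LKEY/RKEY pair discussed above, using libibverbs. It assumes a protection domain already exists, and error handling is trimmed.

```c
/* Minimal sketch: registering memory and obtaining the LKEY/RKEY pair. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdlib.h>

struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len,
                               uint32_t *rkey_out)
{
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr)
        return NULL;

    /* mr->lkey stays local (used in the SGEs we post ourselves);
     * mr->rkey is handed to the peer, typically inside a small SEND/RECV
     * message along with the buffer's virtual address. */
    *rkey_out = mr->rkey;
    return mr;
}
```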

Q: What does SGE stand for?

A: Scatter Gather Element. An element inside an SGL – Scatter Gather List. Used to reference non-contiguous memory.
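
For illustration only, here is a libibverbs sketch (not from the webinar) of a two-element SGL that gathers a header and a payload from non-contiguous buffers into a single SEND; both buffers are assumed to be registered already.

```c
/* Minimal sketch: gathering two non-contiguous local buffers into one SEND
 * using a two-element SGL. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int post_gathered_send(struct ibv_qp *qp,
                       void *hdr, uint32_t hdr_len, uint32_t hdr_lkey,
                       void *payload, uint32_t pay_len, uint32_t pay_lkey)
{
    struct ibv_sge sgl[2] = {
        { .addr = (uintptr_t)hdr,     .length = hdr_len, .lkey = hdr_lkey },
        { .addr = (uintptr_t)payload, .length = pay_len, .lkey = pay_lkey },
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_SEND;
    wr.sg_list    = sgl;            /* the RNIC gathers both pieces in order */
    wr.num_sge    = 2;
    wr.send_flags = IBV_SEND_SIGNALED;

    return ibv_post_send(qp, &wr, &bad_wr);
}
```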

Thanks to our audience for all these great questions. We encourage you to join us at future SNIA DSN webinars. Follow us @SNIA and on LinkedIn for upcoming webinars. This webinar was part of the “Everything You Wanted to Know But Were Too Proud to Ask” SNIA webinar series. If you found the information in this RDMA webinar helpful, I encourage you to check out the many other ones we have produced. They are all available here on the SNIAVideo YouTube Channel.   


Beyond NVMe-oF Performance Hero Numbers

Erik Smith

Jan 28, 2021

When it comes to selecting the right NVMe over Fabrics™ (NVMe-oF™) solution, one should look beyond test results that demonstrate NVMe-oF’s dramatic reduction in latency and consider the other, more important, questions such as “How does the transport really impact application performance?” and “How does the transport holistically fit into my environment?” To date, the focus has been on specialized fabrics like RDMA (e.g., RoCE) because it provides the lowest possible latency, as well as Fibre Channel because it is generally considered to be the most reliable. However, with the introduction of NVMe-oF/TCP this conversation must be expanded to also include considerations regarding scale, cost, and operations. That’s why the SNIA Networking Storage Forum (NSF) is hosting a webcast series that will dive into answering these questions beyond the standard answer “it depends.” The first in this series will be on March 25, 2021 “NVMe-oF: Looking Beyond Performance Hero Numbers” where SNIA experts with deep NVMe and fabric technology expertise will discuss the thought process you can use to determine pros and cons of a fabric for your environment, including:
  • Use cases driving fabric choices
  • NVMe transports and their strengths
  • Industry dynamics driving adoption
  • Considerations for scale, security, and efficiency
Future webcasts will dive deeper and cover operating and managing NVMe-oF, discovery automation, and securing NVMe-oF. I hope you will register today. Our expert panel will be available on March 25th to answer your questions.


Tim Lustig

Jan 15, 2020

We kicked off our 2020 webcast program by diving into how the Storage Performance Development Kit (SPDK) fits in the NVMe landscape. Our SPDK experts, Jim Harris and Ben Walker, did an outstanding job presenting on this topic. In fact, their webcast, “Where Does SPDK Fit in the NVMe-oF Landscape,” received a 4.9 rating on a scale of 1-5 from the live audience. If you missed the webcast, I highly encourage you to watch it on-demand. We had some great questions from the attendees and here are answers to them all:

Q. Which CPU architectures does SPDK support?
A. SPDK supports x86, ARM and Power CPU architectures.

Q. Are there plans to extend SPDK support to additional architectures?
A. If someone has interest in using SPDK on additional architectures, they may develop the necessary SPDK patches and submit them for review. Please note that SPDK relies on the Data Plane Development Kit (DPDK) for some aspects of CPU architecture support, so DPDK patches would also be required.

Q. Will SPDK NVMe-oF support QUIC? What advantages does it have compared to RDMA and TCP transports?
A. SPDK currently has implementations for all of the transports that are part of the NVMe and related specifications – RDMA, TCP and Fibre Channel (target only). If NVMe added QUIC (a new UDP-based transport protocol for the Internet) as a new transport, SPDK would likely add support. QUIC could be a more efficient transport than TCP, since it is a reliable transport based on multiplexed connections over UDP. On that note, the SNIA Networking Storage Forum will be hosting a webcast on April 2, 2020, “QUIC – Will it Replace TCP/IP?” You can register for it here.

Q. How do I map a locally attached NVMe SSD to an NVMe-oF subsystem?
A. Use the bdev_nvme_attach_controller RPC to create SPDK block devices for the NVMe namespaces. You can then attach those block devices to an existing subsystem using the nvmf_subsystem_add_ns RPC. You can find additional details on SPDK nvmf RPCs here.

Q. How can I present a regular file as a block device over NVMe-oF?

A. Use the bdev_aio_create RPC to create an SPDK block device for the desired file. You can then attach this block device to an existing subsystem using the nvmf_subsystem_add_ns RPC.  You can find additional details on SPDK nvmf RPCs here.


Author of NVMe™/TCP Spec Answers Your Questions

J Metz

Mar 27, 2019


900 people have already watched our SNIA Networking Storage Forum webcast, “What NVMe™/TCP Means for Networked Storage?” where Sagi Grimberg, lead author of the NVMe/TCP specification, and J Metz, Board Member for SNIA, explained what NVMe/TCP is all about. If you haven’t seen the webcast yet, check it out on-demand.

Like any new technology, there’s no shortage of areas for potential confusion or questions. In this FAQ blog, we try to clear up both.

Q. Who is responsible for updating NVMe Host Driver?

A. We assume you are referring to the Linux host driver (independent OS software vendors are responsible for developing their own drivers). Like any device driver and/or subsystem in Linux, the responsibility of maintenance is on the maintainer(s) listed under the MAINTAINERS file. The responsibility of contributing is shared by all the community members.

Q. What is the realistic timeframe to see a commercially available NVME over TCP driver for targets? Is one year from now (2020) fair?

A. Even this year commercial products are coming to market. The work started even before the spec was fully ratified, but now that it has been, we expect wider NVMe/TCP support to become available.

Q. Does NVMe/TCP work with 400GbE infrastructure?
A. As of this writing, there is no reason to believe that upper layer protocols such as NVMe/TCP will not work with faster Ethernet physical layers like 400GbE.

Q. Why is NVMe CQ in the controller and not on the Host?
A. The example that was shown in the webcast assumed that the fabrics controller had an NVMe backend. So the controller backend NVMe device had a local completion queue, and on the host sat the “transport completion queue” (in the NVMe/TCP case this is the TCP stream itself).

Q. So, SQ and CQ streams run asynchronously from each other, with variable ordering depending on the I/O latency of a request?
A. Correct. For a given NVMe/TCP connection, stream delivery is in-order, but commands and completions can arrive (and be processed by the NVMe controller) in any order.

Q. What TCP ports are used? Since we have many NVMe queues, I bet we need a lot of TCP ports.
A. Each NVMe queue will consume a unique source TCP port. Common NVMe host implementations will create a number of NVMe queues in the same order of magnitude as the number of CPU cores.

Q. What is the max size of Data PDU supported? Are there any restrictions in parallel writes?
A. The maximum size of an H2CData PDU (MAXH2CDATA) is negotiated and can be as large as 4GB. It is recommended that it be no less than 4096 bytes.

Q. Is immediate data negotiated between host and target?
A. The in-capsule data size (IOCCSZ) is negotiated on an NVMe level. In NVMe/TCP the admin queue command capsule size is 8K by default. In addition, the maximum size of the H2CData PDU is negotiated during the connection initialization.

Q. Is NVMe/TCP hardware infrastructure cost lower?
A. This can vary widely, but we assume you are referring to Ethernet hardware infrastructure. Plus, NVMe/TCP does not require an RDMA-capable NIC, so the variety of implementations is usually wider, which typically drives down cost.

Q. What are the plans for the major OS suppliers to support NVMe over TCP (Windows, Linux, VMware)?
A. Unfortunately, we cannot comment on their behalf, but Linux already supports NVMe/TCP, which should find its way to the various distributions soon. We are working with others to support NVMe/TCP, but suggest asking them directly.

Q. Where does the overhead occur for NVMe/TCP packetization, is it dependent on the CPU, or does the network adapter offload that heavy lifting? And what is the impact of numerous, but extremely small transfers?
A. Indeed a software NVMe/TCP implementation will introduce an overhead resulting from the TCP stream processing. However, you are correct that common stateless offloads such as Large Receive Offload and TCP Segmentation Offload are extremely useful both for large and for small 4K transfers.

Q. What do you mean Absolute Latency is higher than RDMA by “several” microseconds? <10us, tens of microseconds, or 100s of microseconds?
A. That depends on various aspects such as the CPU model, the network infrastructure, the controller implementation, services running on top, etc. Remote access to raw NVMe devices over TCP was measured to add a range between 20-35 microseconds with Linux in early testing, but the degrees of variability will affect this.

Q. Will Wireshark support NVMe/TCP soon? Is an implementation in progress?
A. We most certainly hope so; it shouldn’t be difficult, but we are not aware of an ongoing implementation in progress.

Q. Are there any NVMe TCP drivers out there?
A. Yes, Linux and SPDK both support NVMe/TCP out-of-the-box, see: https://nvmexpress.org/welcome-nvme-tcp-to-the-nvme-of-family-of-transports/

Q. Do you recommend a dedicated IP network for the storage traffic or can you use the same corporate network with all other LAN traffic?
A. This really depends on the use case, the network utilization and other factors. Obviously if the network bandwidth is fully utilized to begin with, it won’t be very efficient to add the additional NVMe/TCP “load” on the network, but that alone might not be the determining factor. Otherwise it can definitely make sense to share the same network and we are seeing customers choosing this route. It might be useful to consider the best practices for TCP-based storage networks (iSCSI has taught valuable lessons), and we anticipate that many of the same principles will apply to NVMe/TCP. The AQM, buffer, etc. tuning settings are very dependent on the traffic pattern and need to be developed based on the requirements. Base configuration is determined by the vendors.

Q. On slide 28: no, TCP needs the congestion feedback, mustn’t need to be a drop (could be ECN, latency variance etc.)
A. Yes, you are correct. The question refers to how that feedback is received, though, and in the most common (traditional) TCP methods it’s done via drops.

Q. How can you find out/check what TCP stack (drop vs. zero-buffer) your network is using?
A. The use/support of DCTCP is mostly driven by the OS. The network needs to support and have ECN enabled and correctly configured for the traffic of interest. So the best way to figure this out is to talk to the network team. The use of ECN, etc. needs to be worked out between the server and network teams.

Q. On slide 33: drop is a signal of an overloaded network; congestion onset is when there is a standing queue (latency already increases). Current state of the art is to always overload the network (switches).
A. ECN is used to signal before drop happens to make it more efficient.

Q. Is it safe to assume that most current switches on the market today support DCTCP/ECN and that we can mix/match switches from vendors across product families?
A. Most modern ASICs support ECN today. Mixing different product lines needs to be carefully planned and tested. AQM, buffers, etc. need to be fine-tuned across the platforms.

Q. Is there a substantial cost savings by implementing all of what is needed to support NVMe over TCP versus just sticking with RDMA? Much like staying with Fibre Channel instead of risking performance with iSCSI not being and staying implemented correctly. Building the separately supported network just seems the best route.
A. By “sticking with RDMA” you mean that you have already selected RDMA, which means you already made the investments to make it work for your use case. We agree that changing what currently works reliably and meets the targets might be an unnecessary risk. NVMe/TCP brings a viable option for Ethernet fabrics which is easily scalable and allows you to utilize a wide variety of both existing and new infrastructure while still maintaining low latency NVMe access.

Q. It seems that with multiple flavors of TCP and especially congestion management (DCTCP, DCQCN?) is there a plan for commonality in the ecosystem to support a standard way to handle congestion management? Is that required in the switches or also in the HBAs?
A. DCTCP is an approach for L3-based congestion management, whereas DCQCN is a combination of PFC and ECN for RoCEv2 (UDP) based communication. So these are two different approaches.

Q. Who are the major players in terms of marketing this technology among storage vendors?
A. The key organization to find out about NVMe/TCP (or all NVMe-related material, in fact) is NVM Express®.

Q. Can I compare NVMe over TCP to iSCSI?
A. Easy: you can download the upstream kernel and test both of the in-kernel implementations (iSCSI and NVMe/TCP). Alternatively you can reach out to a vendor that supports either of the two to test it as well. You should expect NVMe/TCP to run substantially faster for pretty much any workload.

Q. Is network segmentation crucial as a “go to” architecture with a host-to-storage proximity objective, to accomplish the objective of managed/throttled, close to near-lossless connectivity?
A. There is a lot to unpack in this question. Let’s see if we can break it down a little. Generally speaking, best practice is to keep the storage as close to the host as possible (and is reasonable). Not only does this reduce latency, but it reduces the variability in latency (and bandwidth) that can occur at longer distances. In cases where storage traffic shares bandwidth (i.e., links) with other types of traffic, the variable nature of different applications (some are bursty, others are more long-lived) can create unpredictability. Since storage – particularly block storage – doesn’t “like” unpredictability, different methods are used to regain some of that stability as scales increase. A common and well-understood best practice is to isolate storage traffic from “regular” Ethernet traffic. As different workloads tend to be either “North-South” but increasingly “East-West” across the network topologies, this network segmentation becomes more important. Of course, it’s been used as a typical best practice for many years with protocols such as iSCSI, so this is not new. In environments where the variability of congestion can have a profound impact on the storage performance, network segmentation will, indeed, become crucial as a “go-to” architecture. Proper techniques at L2 and L3 will help determine how close to a “lossless” environment can be achieved, of course, as well as properly configured QoS mechanisms across the network. As a general rule of thumb, though, network segmentation is a very powerful tool to have for reliable storage delivery.

Q. How close are we to shared NVMe storage either over Fiber or TCP?
A. There are several shared storage products available on the market for NVMe over Fabrics, but as of this writing (only 3 months after the ratification of the protocol) no major vendors have announced NVMe over TCP shared storage capabilities. A good place to look for updates is on the NVM Express website for interoperability and compliance products. [https://nvmexpress.org/products/]

Q. AQM -> DualQ work in IETF for coexisting L4S (DCTCP) and legacy TCP. Ongoing work @ chip merchants
A. Indeed there are a lot of advancements around making TCP evolve as the speeds and feeds increase. This is yet another example that shows why NVMe/TCP is, and will remain, relevant in the future.

Q. Are there any major vendors who are pushing products based on these technologies?
A. We cannot comment publicly on any vendor plans. You would need to ask a vendor directly for a concrete timeframe for the technology. However, several startups have made public announcements on supporting NVMe/TCP.
Lightbits Labs, to give one example, will have a high-performance low-latency NVMe/TCP-based software-defined-storage solution out very soon.


Networking Questions for Ethernet Scale-Out Storage

Fred Zhang

Dec 7, 2018

Unlike traditional local or scale-up storage, scale-out storage imposes different and more intense workloads on the network. That’s why the SNIA Networking Storage Forum (NSF) hosted a live webcast “Networking Requirements for Ethernet Scale-Out Storage.” Our audience had some insightful questions. As promised, our experts are answering them in this blog.

Q. How does scale-out flash storage impact Ethernet networking requirements?
A. Scale-out flash storage demands higher bandwidth and lower latency than scale-out storage using hard drives. As noted in the webcast, it’s more likely to run into problems with TCP Incast and congestion, especially with older or slower switches. For this reason it’s more likely than scale-out HDD storage to benefit from higher bandwidth networks and modern datacenter Ethernet solutions – such as RDMA, congestion management, and QoS features.

Q. What are your thoughts on NVMe-oF TCP/IP and availability?
A. The NVMe over TCP specification was ratified in November 2018, so it is a new standard. Some vendors already offer this as a pre-standard implementation. We expect that several of the scale-out storage vendors who support block storage will support NVMe over TCP as a front-end (client connection) protocol in the near future. It’s also possible some vendors will use NVMe over TCP as a back-end (cluster) networking protocol.

Q. Which is better: RoCE or iWARP?
A. SNIA is vendor-neutral and does not directly recommend one vendor or protocol over another. Both are RDMA protocols that run on Ethernet, are supported by multiple vendors, and can be used with Ethernet-based scale-out storage. You can learn more about this topic by viewing our recent Great Storage Debate webcast “RoCE vs. iWARP” and checking out the Q&A blog from that webcast.

Q. How would you compare use of TCP/IP and Ethernet RDMA networking for scale-out storage?
A. Ethernet RDMA can improve the performance of Ethernet-based scale-out storage for the front-end (client) and/or back-end (cluster) networks. RDMA generally offers higher throughput, lower latency, and reduced CPU utilization when compared to using normal (non-RDMA) TCP/IP networking. This can lead to faster storage performance and leave more storage node CPU cycles available for running storage software. However, high-performance RDMA requires choosing network adapters that support RDMA offloads and in some cases requires modifications to the network switch configurations. Some other types of non-Ethernet storage networking also offer various levels of direct memory access or networking offloads that can provide high-performance networking for scale-out storage.

Q. How does RDMA networking enable latency reduction?
A. RDMA typically bypasses the kernel TCP/IP stack and offloads networking tasks from the CPU to the network adapter. In essence it reduces the total path length, which consequently reduces the latency. Most RDMA NICs (rNICs) perform some level of networking acceleration in an ASIC or FPGA including retransmissions, reordering, TCP operations flow control, and congestion management.

Q. Do all scale-out storage solutions have a separate cluster network?
A. Logically all scale-out storage systems have a cluster network. Sometimes it runs on a physically separate network and sometimes it runs on the same network as the front-end (client) traffic. Sometimes the client and cluster networks use different networking technologies.


RDMA for Persistent Memory over Fabrics – FAQ

John Kim

Nov 14, 2018

In our most recent SNIA Networking Storage Forum (NSF) webcast, Extending RDMA for Persistent Memory over Fabrics, our expert speakers, Tony Hurson and Rob Davis, outlined extensions to RDMA protocols that confirm persistence and additionally can order successive writes to different memories within the target system. Hundreds of people have seen the webcast and have given it a 4.8 rating on a scale of 1-5! If you missed it, you can watch it on-demand at your convenience. The webcast slides are also available for download. We had several interesting questions during the live event. Here are answers from our presenters:

Q. For the RDMA Message Extensions, does the client have to qualify a WRITE completion with only Atomic Write Response and not with Commit Response?
A. If an Atomic Write must be confirmed persistent, it must be followed by an additional Commit Request. Built-in confirmation of persistence was dropped from the Atomic Request because it adds latency and is not needed for some application streams.

Q. Why do you need confirmation for writes? From my point of view, the only thing required is ordering.
A. Agreed, but only if the entire target system is non-volatile! Explicit confirmation of persistence is required to cover the “gap” between the Write completing in the network and the data reaching persistence at the target.

Q. Where are these messages being generated? Does the NIC know when the data is flushed or committed?
A. They are generated by the application that has reserved the memory window on the remote node. It can write using RDMA writes to that window all it wants, but to guarantee persistence it must send a flush.

Q. How is RPM presented on the client host?
A. The application using it sees it as memory it can read and write.

Q. Does this RDMA commit response implicitly ACK any previous RDMA sends/writes to the same or a different MR?
A. Yes, the new Commit (and Verify and Atomic Write) Responses have the same acknowledgement coalescing properties as the existing Read Response. That is, a Commit Response is explicit (non-coalesced); but it coalesces/implies acknowledgement of prior Write and/or Send Requests.

Q. Does this one still have the current RDMA Write ACK?
A. See previous general answer. Yes. A Commit Response implicitly acknowledges prior Writes.

Q. With respect to the Race Hazard explained to show the need for an explicit completion response, wouldn’t this be the case even with non-volatile memory, if the data were to be stored in non-volatile memory? Why is this completion status required only in the non-volatile case?
A. Most networked applications that write over the network to volatile memory do not require explicit confirmation at the writer endpoint that data has actually reached there. If so, additional handshake messages are usually exchanged between the endpoint applications. On the other hand, a writer to PERSISTENT memory across a network almost always needs assurance that data has reached persistence, thus the new extension.

Q. What if you are using multiple RNICs with multiple ports to multiple ports on a 100Gb fabric for server-to-server RDMA? How is order kept there…by CPU software or ‘NIC teaming plus’?
A. This would depend on the RNIC vendor and their implementation.

Q. What is the time frame for these new RDMA messages to be available in the verbs API?
A. This depends on the IBTA standards approval process, which is not completely predictable; roughly sometime in the first half of 2019.

Q. Where could I find more details about the three new verbs (what are the arguments)?
A. Please poll/contact/Google the IBTA and IETF organizations towards the end of calendar year 2018, when first drafts of the extension documents are expected to be available.

Q. Do you see this technology used in a way similar to how hyperconverged systems now use storage, or could you see this used as a large shared memory subsystem in the network?
A. High-speed persistent memory, in either NVDIMM or SSD form factor, has enormous potential in speeding up hyperconverged write replication. It will, however, require a substantial re-write of such storage stacks, moving for example from traditional three-phase block storage protocols (command/data/response) to an RDMA write/confirm model. More generally, the RDMA extensions are useful for distributed shared PERSISTENT memory applications.

Q. What would be the most useful performance metrics to debug performance issues in such environments?
A. Within the RNIC, basic counts for the new message types would be a baseline. These plus total stall times encountered by the RNIC awaiting Commit Responses from the local CPU subsystem would be useful. Within the CPU platform, basic counts of device write and read requests targeting persistent memory would be useful.

Q. Do all the RDMA NICs have to update their firmware to support these new verbs? What is the expected performance improvement with the new Commit message?
A. Both answers would depend on the RNIC vendor and their implementation.

Q. Will the three new verbs be implemented in the RNIC alone, or will they require changes in other places (processor, memory controllers, etc.)?
A. The new Commit request requires the CPU platform and its memory controllers to confirm that prior write data has reached persistence. The new Atomic Write and Verify messages however may be executed entirely within the RNIC.

Q. What about the future of NVMe over TCP – this would be much simpler for people to implement. Is this a good option?
A. Again this would depend on the NIC vendor and their implementation. Different vendors have implemented various tests for performance. It is recommended that readers do their own due diligence.


Introducing the Networking Storage Forum

John Kim

Oct 9, 2018


At SNIA, we are dedicated to staying on top of storage trends and technologies to fulfill our mission as a globally recognized and trusted authority for storage leadership, standards, and technology expertise. For the last several years, the Ethernet Storage Forum has been working hard to provide high quality educational and informational material related to all kinds of storage.

From our “Everything You Wanted To Know About Storage But Were Too Proud To Ask” series, to the absolutely phenomenal (and required viewing) “Storage Performance Benchmarking” series, to the “Great Storage Debates” series, we’ve produced dozens of hours of material.

Technologies have evolved and we’ve come to a point where there’s a need to understand how these systems and architectures work – beyond just the type of wire that is used. Today, there are new systems that are bringing storage to completely new audiences. From scale-up to scale-out, from disaggregated to hyperconverged, RDMA, and NVMe-oF – there is more to storage networking than just your favorite transport. For example, when we talk about NVMe™ over Fabrics, the protocol is broader than just one way of accomplishing what you need. When we talk about virtualized environments, we need to examine the nature of the relationship between hypervisors and all kinds of networks. When we look at “Storage as a Service,” we need to understand how we can create workable systems from all the tools at our disposal.

Bigger Than Our Britches

As I said, SNIA’s Ethernet Storage Forum has been working to bring these new technologies to the forefront, so that you can see (and understand) the bigger picture. To that end, we realized that we needed to rethink the way that our charter worked, to be even more inclusive of technologies that were relevant to storage and networking. So… Introducing the Networking Storage Forum. In this group we’re going to continue producing top-quality, vendor-neutral material related to storage networking solutions. We’ll be talking about:
  • Storage Protocols (iSCSI, FC, FCoE, NFS, SMB, NVMe-oF, etc.)
  • Architectures (Hyperconvergence, Virtualization, Storage as a Service, etc.)
  • Storage Best Practices
  • New and developing technologies
… and more! Generally speaking, we’ll continue to do the same great work that we’ve been doing, but now our name more accurately reflects the breadth of work that we do. We’re excited to launch this new chapter of the Forum. If you work for a vendor, are a systems integrator, a university, or someone who manages storage, we welcome you to join the NSF. We are an active group that honestly has a lot of fun. If you’re one of our loyal followers, we hope you will continue to keep track of what we’re doing. And if you’re new to this Forum, we encourage you to take advantage of the library of webcasts, white papers, and published articles that we have produced here. There’s a wealth of unbiased, educational information there that we don’t think you’ll find anywhere else! If there’s something that you’d like to hear about – let us know! We are always looking to hear about headaches, concerns, and areas of confusion within the industry where we can shed some light. Stay current with all things NSF:


Oh What a Tangled Web We Weave: Extending RDMA for PM over Fabrics

John Kim

Oct 8, 2018

For datacenter applications requiring low-latency access to persistent storage, byte-addressable persistent memory (PM) technologies like 3D XPoint and MRAM are attractive solutions. Network-based access to PM, labeled here Persistent Memory over Fabrics (PMoF), is driven by data scalability and/or availability requirements. Remote Direct Memory Access (RDMA) network protocols are a good match for PMoF, allowing direct RDMA data reads or writes from/to remote PM. However, the completion of an RDMA Write at the sending node offers no guarantee that data has reached persistence at the target. Join the Networking Storage Forum (NSF) on October 25, 2018 for our next live webcast, Extending RDMA for Persistent Memory over Fabrics. In this webcast, we will outline extensions to RDMA protocols that confirm such persistence and additionally can order successive writes to different memories within the target system. Learn:
  • Why we can’t treat PM just like traditional storage or volatile memory
  • What happens when you write to memory over RDMA
  • Which programming model and protocol changes are required for PMoF
  • How proposed RDMA extensions for PM would work
We believe this webcast will appeal to developers of low-latency and/or high-availability datacenter storage applications and be of interest to datacenter developers, administrators and users. I encourage you to register today. Our NSF experts will be on hand to answer your questions. We look forward to your joining us on October 25th.


John Kim

Sep 19, 2018

In our RoCE vs. iWARP webcast, experts from the SNIA Ethernet Storage Forum (ESF) had a friendly debate on two commonly known remote direct memory access (RDMA) protocols that run over Ethernet: RDMA over Converged Ethernet (RoCE) and the IETF-standard iWARP. It turned out to be another very popular addition to our “Great Storage Debate” webcast series. If you haven’t seen it yet, it’s now available on-demand along with a PDF of the presentation slides. We received A LOT of questions related to Performance, Scalability and Distance, Multipathing, Error Correction, Windows and SMB Direct, DCB (Data Center Bridging), PFC (Priority Flow Control), lossless networks, and Congestion Management, and more. Here are answers to them all.

Q. Are RDMA NICs and TOE NICs the same? What are the differences?
A. No, they are not, though some RNICs include a TOE. An RNIC based on iWARP uses a TOE (TCP Offload Engine) since iWARP itself is fundamentally an upper layer protocol relative to TCP/IP (encapsulated in TCP/IP). The iWARP-based RNIC may or may not expose the TOE. If the TOE is exposed, it can be used for other purposes/applications that require TCP/IP acceleration. However, most of the time, the TOE is hidden under the iWARP verbs API and thus is only used to accelerate TCP for iWARP. An RNIC based on RoCE usually does not have a TOE in the first place and is thus not capable of statefully offloading TCP/IP, though many of them do offer stateless TCP offloads.

Q. Does RDMA use the TCP/UDP/IP protocol stack?
A. RoCE uses UDP/IP while iWARP uses TCP/IP. Other RDMA protocols like OmniPath and InfiniBand don’t use Ethernet.

Q. Can Software Defined Networking features like VxLANs be implemented on RoCE/iWARP NICs?
A. Yes, most RNICs can also support VxLAN. An RNIC combines all the functionality of a regular NIC (like VxLAN offloads, checksum offloads, etc.) along with RDMA functionality.

Q. Do the BSD OS’s (e.g. FreeBSD) support RoCE and iWARP?
A. FreeBSD supports both iWARP and RoCE.

Q. Any comments on NVMe over TCP?
A. The NVMe over TCP standard is not yet finalized. Once the specification is finalized, SNIA ESF will host a webcast on BrightTALK to discuss NVMe over TCP. Follow us @SNIAESF for notification of all our upcoming webcasts.

Q. What layers in the OSI model would the RDMAP, DDP, and MPA map to for iWARP?
A. RDMAP/DDP/MPA are stacked on top of TCP, so these protocols sit on top of Layer 4, the Transport Layer, in the OSI model.

Q. What are the deployment percentages between RoCE and iWARP? Which has a bigger market share and by how much?
A. SNIA does not have this market share information. Today multiple networking vendors support both RoCE and iWARP. Historically more adapters supporting RoCE have been shipped than adapters supporting iWARP, but not all the iWARP/RoCE-capable Ethernet adapters deployed are used for RDMA.

Q. Who will win: RoCE, iWARP, or InfiniBand? What shall we as customers choose if we want to have this today?
A. As a vendor-neutral forum, SNIA cannot recommend any specific RDMA technology or vendor. Note that RoCE and iWARP run on Ethernet while InfiniBand (and OmniPath) do not use Ethernet.

Q. Are there any best practices identified for running higher-level storage protocols (iSCSI/NFS/SMB etc.) on top of RoCE or iWARP?
A. Congestion caused by dropped packets and retransmissions can degrade performance for higher-level storage protocols whether using RDMA or regular TCP/IP. To prevent this from happening, a best practice would be to use explicit congestion notification (ECN), or better yet, data center bridging (DCB) to minimize congestion and ensure the best performance. Likewise, designing a fully non-blocking network fabric will also assist in preventing congestion and guarantee the best performance. Finally, by prioritizing the data flows that are using RoCE or iWARP, the network administrators can ensure bandwidth is available for the flows that require it the most. iWARP provides RDMA functionality over TCP/IP and inherits the loss resilience and congestion management from the underlying TCP/IP layer. Thus, it does not require specific best practices beyond those in use for TCP/IP, including not requiring any specific host or switch configuration, as well as out-of-the-box support across LAN/MAN/WAN networks.

Q. On slide #14 of the RoCE vs. iWARP presentation, the slide showed SCM being 1,000 times faster than NAND Flash, but the presenter stated 100 times faster. Those are both higher than I have heard. Which is correct?
A. Research on the Internet shows that both Intel and Micron have been boasting that 3D XPoint Memory is 1,000 times as fast as NAND flash. However, their tests also compared a standard NAND flash based PCIe SSD to a similar SSD based on 3D XPoint memory, which was only 7-8 times faster. Due to this, we dug in a little further and found a great article by Jim Handy, “Why 3D XPoint SSDs Will Be Slow,” that could help explain the difference.

Q. What is the significance of the BTH+ and GRH headers?
A. BTH+ and GRH are both used within InfiniBand for RDMA implementations. With RoCE implementations of RDMA, packets are marked with an EtherType header that indicates the packets are RoCE, and ip.protocol_number within the IP header is used to indicate that the packet is UDP. Both of these will identify packets as RoCE packets.

Q. What sorts of applications are unique to the workstation market for RDMA, versus the server market?
A. All major OEM vendors are shipping servers with CPU platforms that include integrated iWARP RDMA, as well as offering adapters that support iWARP and/or RoCE. Main applications of RDMA are still in the server area at this moment. At the time of this writing, workstation operating systems such as Windows 10 or Linux can use RDMA when running I/O-intensive applications such as video post-production, oil/gas and computer-aided design applications, for high-speed access to storage.

DCB, PFC, lossless networks, and Congestion Management

Q. Is slide #26 correct? I thought RoCE v1 was PFC/DCB and RoCE v2 was ECN/DCB subset. Did I get it backwards?
A. Sorry for the confusion, you’ve got it correct. With newer RoCE-capable adapters, customers may choose to use ECN or PFC for RoCE v2.

Q. I thought RoCE v2 did not need any DCB-enabled network, so why this DCB congestion management for RoCE v2?
A. RoCEv2 running on modern rNICs is known as Resilient RoCE due to it not needing a lossless network. Instead a RoCE congestion control mechanism is used to minimize packet loss by leveraging Explicit Congestion Notification (ECN). ECN allows switches to notify hosts when congestion is likely to happen, and the end nodes adjust their data transmission speeds to prevent congestion before it occurs. RoCE v2 takes advantage of ECN to avoid congestion and packet loss. ECN-capable switches detect when a port is getting too busy and mark outbound packets from that port with the Congestion Experienced (CE) bit. The receiving NIC sees the CE indication and notifies the sending NIC with a Congestion Notification Packet (CNP). In turn, the sending NIC backs off its sending rate temporarily to prevent congestion from occurring. Once the risk of congestion declines sufficiently, the sender resumes full-speed data transmission (referred to as Resilient RoCE).

Q. Is iWARP a lossless or lossy protocol?
A. iWARP utilizes the underlying TCP/IP layer for loss resilience. This happens at silicon speeds for iWARP adapters with embedded TCP/IP offload engine (TOE) functionality.

Q. So it looks to me that iWARP can use an existing Ethernet network without modifications and RoCEv2 would need some fine-tuning. Is this correct?
A. Generally iWARP does not require any modification to the Ethernet switches and RoCE requires the use of either PFC or ECN (depending on the rNICs used for RoCE). However, all RDMA networking will benefit from a network setup that minimizes latency, packet loss, and congestion. iWARP delivers RDMA on top of the TCP/IP protocol, and thus TCP provides congestion management and loss resilience for iWARP, which, as a result, does not require a lossless Ethernet network. This is particularly useful in congested networks or long distance links.

Q. Is this a correct statement? Please clarify — RoCE v1 requires ECN and PFC, but RoCEv2 requires only ECN or PFC?
A. Remember, we called this presentation a “Great Storage Debate?” Here is an area where there are two schools of thought. Answer #1: It’s recommended to deploy RoCE (v1) with PFC, which is part of the Ethernet Data Center Bridging (DCB) specification, to implement a lossless network. With the release of RoCEv2, an alternative mechanism to avoid packet loss was introduced which leverages Explicit Congestion Notification (ECN). ECN allows switches to notify hosts when congestion is likely to happen, and the end nodes adjust their data transmission speeds to prevent congestion before it occurs. Answer #2: Generally this is correct, iWARP does not require any modification to the Ethernet switches and RoCE requires the use of either PFC or ECN (depending on the rNICs used for RoCE), and DCB. As such, and this is very important, an iWARP installation of a storage or server node is decoupled from the switch infrastructure upgrade. However, all RDMA networking will benefit from a network setup that minimizes latency, packet loss, and congestion, though in the case of an iWARP adapter, this benefit is insignificant, since all loss recovery and congestion management happen at the silicon speed of the underlying TOE.

Q. Does RoCE v2 also require PFC, or how will it handle lossy networks?
A. RoCE v2 does not require PFC but performs better with either PFC or ECN activated. See the following question and answer for more details.

Q. Can a RoCEv2 lossless network be achieved with ECN only (no PFC)?
A. RoCE has built-in error correction and retransmission mechanisms so it does not require a lossless network. With modern RoCE-capable adapters, it only requires the use of ECN. ECN in and of itself does not guarantee a lossless connection but can be used to minimize congestion and thus minimize packet loss. However, even with RoCE v2, a lossless connection (using PFC/DCB) can provide better performance and is often implemented with RoCEv2 deployments, either instead of ECN or alongside ECN.

Q. In order to guarantee lossless, do ECN and PFC both have to be used?
A. ECN can be used to avoid most packet loss, but PFC (part of DCB) is required for a truly lossless network.

Q. Are there real deployments that use “Resilient RoCE” without PFC configured?
A. To achieve better performance, PFC alone or both ECN and PFC are deployed in most iterations of RoCE in real deployments today. However, there are a growing number of deployments using Resilient RoCE with ECN alone that maintain high levels of performance.

Q. For RoCEv2, can ECN be implemented without PFC?
A. Yes, ECN can be implemented on its own within a RoCE v2 implementation without the need for PFC.

Q. RoCE needs to have converged Ethernet, but not iWARP, correct?
A. Correct. iWARP was standardized in the IETF and built upon standard TCP/IP over Ethernet, so the “Converged Ethernet” requirement doesn’t apply to iWARP.

Q. It’s not clear from the diagram if TCP/IP is still needed for RoCE and iWARP. Is it?
A. RoCE uses IP (UDP/IP) but not TCP. iWARP uses TCP/IP.

Q. On slide #10, does this require any support on the switch?
A. Yes, an enterprise switch with support for DCB would be required. Most enterprise switches do support DCB today.

Q. Will you cover congestion mechanisms and which of RoCEv2 or iWARP works better for different workloads?
A. With multiple vendors supporting RoCEv2 and iWARP at different speeds (10, 25, 40, 50, and 100Gb/s), we’d likely see a difference in performance from each adapter across different workloads. An apples-to-apples test of the specific workload would be required to provide an answer. If you are working with a specific vendor or OEM, we would suggest you ask the vendor/OEM for comparison data on the workload you plan on deploying.

Performance, Scalability and Distance

Q. For storage-related applications, could you add a performance-based comparison of Ethernet-based RoCE/iWARP to FC-NVMe with similar link speeds (32Gbps FC to 40GbE, for example)?
A. We would like to see the results of this testing as well, and due to the overwhelming request for data comparing RoCE vs. iWARP this is something we will try to provide in the future.

Q. Do you have some performance measurements which compare iWARP and RoCE?
A. Nothing is available from SNIA ESF, but a search on Google should provide you with the information you are looking for. For example, you can find this Microsoft blog.

Q. Are there performance benchmarks between RoCE vs. iWARP?
A. Debating which one is faster is beyond the scope of this webcast.

Q. Can RoCE scale to 1000s of Ceph nodes, assuming each node hosts 36 disks?
A. RoCE has been successfully tested with dozens of Ceph nodes. It’s unknown if RoCE with Ceph can scale to 1000s of Ceph nodes.

Q. Is RoCE limited in the number of hops?
A. No, there is no limit on the number of hops, but as more hops are included, latencies increase and performance may become an issue.

Q. Does RoCEv2 support long distance (100km) operation or is it only iWARP?
A. Today the practical limit of RoCE while maintaining high performance is about 40km. As different switches and optics come to market, this distance limit may increase in the future. iWARP has no distance limit, but with any high-performance networking solution, increasing distance leads to increasing latency due to the speed of light and/or retransmission hops. Since it is a protocol on top of basic TCP/IP, it can transfer data over wireless links to satellites if need be.

Multipathing, Error Correction

Q. Isn’t the Achilles heel of iWARP the handling of congestion on the switch? Sure, TCP/IP doesn’t require lossless, but doesn’t one need DCTCP, PFC, ETS to handle buffers filling up both point to point as well as from receiver to sender? Some vendors offload any TCP/IP traffic and consider RDMA “limited,” but even if that’s true don’t they have to deal with the same challenges on the switch in regards to congestion management?
A. TCP itself uses a congestion-avoidance algorithm, like TCP New Reno (RFC 6582), together with slow start and a congestion window to avoid congestion. These mechanisms are not dependent on switches. So iWARP’s performance under network congestion should closely match that of TCP.

Q. If you are using RoCE v2 with UDP, how is error correction implemented?
A. Error correction is done by the RoCE protocol running on top of UDP.

Q. How does multipathing work with RDMA?
A. For single-port RNICs, multipathing, being network-based (Equal-Cost Multi-Path routing, ECMP), is transparent to the RDMA application. Both RoCE and iWARP transports achieve good network load balancing under ECMP. For multi-port RNICs, the RDMA client application can explicitly load-balance its traffic across multiple local ports. Some multi-port RNICs support link aggregation (a.k.a. bonding), in which case the RNIC transparently spreads connection load amongst physical ports.

Q. Do RoCE and iWARP work with bonded NICs?
A. The short answer is yes, but it will depend on the individual NIC vendor’s implementation.

Windows and SMB Direct

Q. What is SMB Direct?
A. SMB Direct is a special version of the SMB 3 protocol. It supports both RDMA and multiple active-active connections. You can find the official definition of SMB (Server Message Block) in the SNIA Dictionary.

Q. Is there iSER support in Windows?
A. Today iSER is supported in Linux and VMware but not in Windows. Windows does support both iWARP and RoCE for SMB Direct. Chelsio is now providing an iSER (iWARP) Initiator for Windows as part of the driver package, which is available at service.chelsio.com. The current driver is considered a beta, but will go GA by the end of September 2018.

Q. When will iWARP or RoCE for NVMe-oF be supported on Windows?
A. Windows does not officially support NVMe-oF yet, but if and when Windows does support it, we believe it will support it over both RoCE and iWARP.

Q. Why is iWARP better for Storage Spaces Direct?
A. iWARP is based on TCP, which deals with flow control and congestion management, so iWARP is scalable and ideal for a hyper-converged storage solution like Storage Spaces Direct. iWARP is also the recommended configuration from Microsoft in some circumstances.

We hope that answers all your questions! We encourage you to check out the other “Great Storage Debate” webcasts in this series. To date, our experts have had friendly, vendor-neutral debates on File vs. Block vs. Object Storage, Fibre Channel vs. iSCSI, FCoE vs. iSCSI vs. iSER, and Centralized vs. Distributed Storage. Happy debating!


RoCE vs. iWARP – The Next “Great Storage Debate”

John Kim

Jul 16, 2018

By now, we hope you’ve had a chance to watch one of the webcasts from the SNIA Ethernet Storage Forum’s “Great Storage Debate” webcast series. To date, our experts have had friendly, vendor-neutral debates on File vs. Block vs. Object Storage, Fibre Channel vs. iSCSI, and FCoE vs. iSCSI vs. iSER. The goal of this series is not to have a winner emerge, but rather educate the attendees on how the technologies work, advantages of each, and common use cases. Our next great storage debate will be on August 22, 2018 where our experts will debate RoCE vs. iWARP. They will discuss these two commonly known RDMA protocols that run over Ethernet: RDMA over Converged Ethernet (RoCE) and the IETF-standard iWARP. Both are Ethernet-based RDMA technologies that can increase networking performance. Both reduce the amount of CPU overhead in transferring data among servers and storage systems to support network-intensive applications, like networked storage or clustered computing. Join us on August 22nd, as we’ll address questions like:
  • Both RoCE and iWARP support RDMA over Ethernet, but what are the differences?
  • Use cases for RoCE and iWARP and what differentiates them?
  • UDP/IP and TCP/IP: which RDMA standard uses which protocol, and what are the advantages and disadvantages?
  • What are the software and hardware requirements for each?
  • What are the performance/latency differences of each?
Get this on your calendar by registering now. Our experts will be on-hand to answer your questions on the spot. We hope to see you there! Visit snia.org to learn about the work SNIA is doing to lead the storage industry worldwide in developing and promoting vendor-neutral architectures, standards, and educational services that facilitate the efficient management, movement, and security of information.  

