Optimizing NVMe over Fabrics Performance Q&A

Tom Friend

Oct 2, 2020


Almost 800 people have already watched our webcast “Optimizing NVMe over Fabrics Performance with Different Ethernet Transports: Host Factors” where SNIA experts covered the factors impacting different Ethernet transport performance for NVMe over Fabrics (NVMe-oF) and provided data comparisons of NVMe over Fabrics tests with iWARP, RoCEv2 and TCP. If you missed the live event, watch it on-demand at your convenience.

The session generated a lot of questions, all answered here in this blog. In fact, many of the questions have prompted us to continue this discussion with future webcasts on NVMe-oF performance. Please follow us on Twitter @SNIANSF for upcoming dates.

Q. What factors will affect the performance of NVMe over RoCEv2 and TCP when the network between host and target is longer than in a typical data center environment, i.e., RTT > 100ms?

A. For a large deployment with long distance, congestion management and flow control will be the most critical considerations to make sure performance is guaranteed. In a very large deployment, network topology, bandwidth subscription to storage target, and connection ratio are all important factors that will impact the performance of NVMe-oF.

Q. Were the RoCEv2 tests run on 'lossless' Ethernet and the TCP tests run on 'lossy' Ethernet?

A. Both iWARP and RoCEv2 tests were run in a back to back configuration without a switch in the middle, but with Link Flow Control turned on.

Q. Just to confirm, this is with pure ROCEv2? No TCP, right? ROCEv2 end 2 end (initiator 2 target)?

A. Yes, the RoCEv2 test was run with a RoCEv2 initiator and a RoCEv2 target.

Q. How are the drives being preconditioned? Is it based on I/O size or MTU size? 

A. Storage is pre-conditioned by I/O size and type of the selected workload. MTU size is not relevant.  The selected workload is applied until performance changes are time invariant - i.e. until performance stabilizes within a range known as steady state.  Generally, the workload is tracked by specific I/O size and type to remain within a data excursion of 20% and a slope of 10%.
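For illustration, that steady-state check can be expressed roughly as follows (a minimal sketch, not the Calypso CTS implementation; the five-round window is an assumption, while the 20%/10% thresholds are the ones quoted above):

def at_steady_state(iops_history, window=5, max_excursion=0.20, max_slope=0.10):
    """Return True once the tracked metric is time invariant (steady state).

    Criteria mirror the answer above: every sample in the measurement window
    stays within +/-20% of the window average (data excursion), and the
    best-fit slope across the window drifts by no more than 10% of that
    average (slope excursion).
    """
    if len(iops_history) < window:
        return False
    recent = iops_history[-window:]
    avg = sum(recent) / window
    if any(abs(x - avg) > max_excursion * avg for x in recent):
        return False
    xs = list(range(window))
    x_mean = sum(xs) / window
    slope = (sum((x - x_mean) * (y - avg) for x, y in zip(xs, recent))
             / sum((x - x_mean) ** 2 for x in xs))
    return abs(slope * (window - 1)) <= max_slope * avg

# Preconditioning loop: run_workload_round() is a hypothetical hook that
# applies the selected I/O size/type for one round and returns measured IOPS.
# results = []
# while not at_steady_state(results):
#     results.append(run_workload_round())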

Q. Are the 6 SSDs off a single Namespace, or multiple? If so, how many Namespaces used?

A. Single namespace.

Q. What I/O generation tool was used for the test?

A. The Calypso CTS I/O stimulus generator, which is based on libaio. CTS has the same engine as fio and applies I/Os at the block I/O level.  Note that vdbench and Iometer are file-system-level tools (vdbench is Java-based) and sit higher in the software stack.

Q. Given that NVMe SSD performance is high with low latency, isn't the performance bottleneck shifted to the storage controller?

A. Test I/Os are applied to the logical storage seen by the host on the target server, in an attempt to normalize the host and target and assess NIC-wire-NIC performance. The storage controller is beneath this layer and not applicable to this test. If we test the storage directly on the target - not over the wire - then we can see the impact of the controller and controller-related issues (such as garbage collection, over-provisioning, table structures, etc.).

Q. What are the specific characteristics of RoCEv2 that restrict it to 'rack' scale deployments?  In other words, what is restricting it from larger scale deployments?

A. RoCEv2 can, and does, scale beyond the rack if you have one of three things:

  1. A lossless network with DCB (priority flow control)
  2. Congestion management with solutions like ECN
  3. Newer RoCEv2-capable adapters that support out of order packet receive and selective re-transmission

Your mileage will vary based upon features of different network vendors.

Q. Is there an option to use some caching mechanism on host side?

A. The host side has a RAM cache as part of the platform setup, but it is held constant across these tests.

Q. Was there caching in the host?

A. The test used host memory for NVMe over Fabrics.

Q. Were all these topics from the description covered?  In particular, #2?
We will cover the variables:

  1. How many CPU cores are needed (I’m willing to give)?
  2. Optane SSD or 3D NAND SSD?
  3. How deep should the Q-Depth be?
  4. Why do I need to care about MTU?

A. Cores: see the TC/QD sweep to find the optimal OIO; the core usage required can be inferred from this. Note the incongruity of TC/QD to OIO of 8, 16, 32 and 48 in this case.

  1. The test used a dual-socket server on the target with an Intel® Xeon® Platinum 8280L processor with 28 cores. The target server used only one processor so that all the workloads were on a single NUMA node. The 1-4% CPU utilization is the average across the 28 cores.
  2. SSD-1 is Optane SSD, SSD-2 is 3D NAND.
  3. Normally QD is set to 32.
  4. You do not need to care much about MTU; at least in our test, we saw minimal performance differences.

Q. The result of 1~4% CPU utilization on the target is based on a single SSD? Do you expect to see much higher CPU utilization if the number of SSDs increases?

A. The CPU % is for the target server with the 6-SSD LUN.

Q. Is there any difference between the different transports and the sensitivity of lost packets?

A. Theoretically, iWARP and TCP are more tolerant of packet loss. iWARP is based on TCP/IP, and TCP provides flow control and congestion management that can still perform in a congested environment. In the event of packet loss, iWARP supports selective re-transmission and out-of-order packet receive, which can further improve performance on a lossy network. A standard RoCEv2 implementation, by contrast, does not tolerate packet loss: it requires a lossless network and experiences performance degradation when packet loss happens.

Q. 1. When you say offloaded TCP, is this both at the initiator and target side or just the host initiator side?
2. Do you see any improvement with ADQ on TCP?

A. The RDMA iWARP adapter in the test has a complete TCP offload engine on the network adapter on both the initiator and target sides. Application Device Queues (ADQ) can significantly improve throughput, latency and, most importantly, latency jitter when dedicated CPU cores are allocated for NVMe-oF solutions.

Q. Since the CPU utilization is extremely low on the host, any comments about the CPU role in NVMe-oF and the impact of offloading?

A. NVMe-oF was designed to reduce the CPU load on the target, as shown in the test. On the initiator side the CPU load will be a little higher. RDMA, as an offloaded technology, requires fairly minimal CPU utilization. NVMe over TCP still uses the TCP stack in the kernel to do all the work, so the CPU still plays an important role. Also, the test was done with a high-end Intel® Xeon® processor with very powerful processing capability; if a processor with less processing power is used, CPU utilization will be higher.

Q. 1. What should be the ideal encapsulated data (inline data) size for best performance in a real-world scenario? 2. How could one optimize buffer copies at the block level in NVMe-oF?

A. 1. There is no simple answer to this question. The impact of encapsulated data size on performance in a real-world scenario is more complicated because the switch plays a critical role in the whole network. Whether the switch has a shallow or deep buffer, and switch settings like policy, congestion management, etc., all impact the overall performance. 2. There are multiple explorations underway to improve NVMe-oF performance by reducing or optimizing buffer copies. One possible option is to use the Controller Memory Buffer (CMB) introduced in NVMe specification 1.2.

Q. Is it possible to combine any of the NVMe-oF technologies with SPDK - user space processing?

A. SPDK currently supports all of these Ethernet-based transports: iWARP, RoCEv2 and TCP.
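As a rough illustration of SPDK's user-space NVMe-oF target (a hedged sketch only: the RPC method names follow SPDK's scripts/rpc.py, but the bdev name, NQN, address and sizes here are made up, and available options vary by SPDK release), a TCP target might be assembled like this once the nvmf_tgt application is running:

import subprocess

RPC = "scripts/rpc.py"  # path inside an SPDK checkout (assumption)

def rpc(*args):
    """Invoke an SPDK JSON-RPC method against a running nvmf_tgt app."""
    subprocess.run([RPC, *args], check=True)

# Create the TCP transport, a RAM-backed test bdev and an NVMe-oF subsystem,
# then expose the bdev as a namespace and listen on a TCP portal.
rpc("nvmf_create_transport", "-t", "TCP")
rpc("bdev_malloc_create", "-b", "Malloc0", "512", "512")  # 512 MB bdev, 512 B blocks
rpc("nvmf_create_subsystem", "nqn.2020-10.io.example:demo", "-a")
rpc("nvmf_subsystem_add_ns", "nqn.2020-10.io.example:demo", "Malloc0")
rpc("nvmf_subsystem_add_listener", "nqn.2020-10.io.example:demo",
    "-t", "tcp", "-a", "192.168.0.10", "-s", "4420")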

Q. You indicated that TCP is non-offloaded, but doesn't it still use the 'pseudo-standard' offloads like Checksum, LSO, RSS, etc?  It just doesn't have the entire TCP stack offloaded?

A. Yes, stateless offloads are supported and used.
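For reference, one way to confirm which stateless offloads are active on a Linux NIC is to parse ethtool -k output (a hedged sketch; the interface name is an assumption and exact feature names can vary by driver):

import subprocess

def stateless_offloads(ifname="eth0"):
    """Return the on/off state of common stateless offloads for an interface."""
    out = subprocess.run(["ethtool", "-k", ifname],
                         capture_output=True, text=True, check=True).stdout
    wanted = ("rx-checksumming", "tx-checksumming",
              "tcp-segmentation-offload", "receive-hashing")
    state = {}
    for line in out.splitlines():
        key, _, value = line.strip().partition(":")
        if key in wanted:
            state[key] = value.split()[0]  # "on" or "off"
    return state

print(stateless_offloads("eth0"))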

Q. What is the real idea in using 4 different SSDs? Why didn't you use 6 or 8 or 10? What is the message you are trying to relay? I understand that SSD1 is higher/better performing than SSD2.

A. We used a six-SSD LUN for both SSD-1 and SSD-2.  We compared the higher-performance, lower-capacity Optane to the lower-performance, higher-capacity 3D NAND NVMe.  Note the 3D NAND NVMe is 10X the capacity of the Optane.

Q. It looks like one of the key takeaways is that SSD specs matter. Can you explain (without naming brands) the main differences between SSD-1 and SSD-2?

A. Manufacturer specs are only a starting point and actual performance depends on the workload.  Large differences are seen for small-block random write (RND W) workloads and large-block sequential read (SEQ R) workloads.

Q. What is the impact to the host CPU and memory during the tests? Wondering what minimum CPU and memory are necessary to achieve peak NVMe-oF performance, which leads to describe how much application workload one might be able to achieve.

A. The test did not limit CPU cores or memory to find the minimal configuration needed to achieve peak NVMe-oF performance. This might be an interesting topic we can cover in a future presentation.  (We measured target server CPU usage, not host/initiator CPU usage.)

Q. Did you let the tests run for 2 hours and then take results? (basically, warm up the cache/SSD characterization)?

A. We precondition with the TC/QD Sweep test then run the remaining 3 tests back to back to take advantage of the preconditioning done in the first test.

Q. How do you check outstanding IOs?

A. We use OIO = TC x QD in test settings and populate each thread with the QD jobs. We do not look at in flight OIO, but wait for all OIOs to complete and measure response times.

Q. Where can we get the performance test specifications as defined by SNIA?

A. You can find the test specification on the SNIA website here.

Q. Have these tests been run using FC-NVMe? If so, how did they fare?

A. We have not yet run tests using NVMe over Fibre Channel.

Q. What tests did you use? FIO, VDBench, IOZone, or just DD or IOMeter? What was the CPU peak utilization? and what CPUs did you use?

A. The CTS I/O generator, which is similar to fio; both are based on libaio and test at the block level.  Vdbench (which is Java-based), IOzone and Iometer operate at the file system level.  DD is direct and lacks complex scripting.  Fio allows complex scripting but not multiple variables per loop - i.e. it requires iterative tests and post-test compilation, vs. CTS which has multi-variable, multi-loop concurrency.

Q. What test suites did you use for testing?

A. Calypso CTS tests

Q. I heard that iWARP is dead?

A. No, iWARP is not dead. There are multiple Ethernet network adapter vendors supporting iWARP now. The adapter used in the test supports iWARP, RoCEv2 and TCP at the same time.

Q. Can you post some recommendation on the switch setup and congestion?

A. The tests discussed in this presentation used a back to back configuration without a switch. We will have a presentation in the near future that takes switch settings into account and will share more information at that time. Don’t forget to follow us on Twitter @SNIANSF for dates of upcoming webcasts.


Optimizing NVMe over Fabrics Performance with Different Ethernet Transports: Host Factors

Tom Friend

Aug 11, 2020


NVMe over Fabrics technology is gaining momentum and getting more traction in data centers, but there are three kinds of Ethernet based NVMe over Fabrics transports: iWARP, RoCEv2 and TCP.

How do we optimize NVMe over Fabrics performance with different Ethernet transports? That will be the discussion topic at our SNIA Networking Storage Forum Webcast, “Optimizing NVMe over Fabrics Performance with Different Ethernet Transports: Host Factors” on September 16, 2020.

Setting aside the considerations of network infrastructure, scalability, security requirements and complete solution stack, this webcast will explore the performance of different Ethernet-based transports for NVMe over Fabrics at the detailed benchmark level. We will show three key performance indicators: IOPs, Throughput, and Latency with different workloads including: Sequential Read/Write, Random Read/Write, 70%Read/30%Write, all with different data sizes. We will compare the result of three Ethernet based transports: iWARP, RoCEv2 and TCP.

Further, we will dig a little bit deeper to talk about the variables that impact the performance of different Ethernet transports. There are a lot of variables that you can tune, but these variables will impact the performance of each transport differently. We will cover the variables including:

  1. How many CPU cores are needed (I’m willing to give)?
  2. Optane SSD or 3D NAND SSD?
  3. How deep should the Queue-Depth be?
  4. Do I need to care about MTU?

This discussion won’t tell you which transport is the best. Instead we unfold the performance of each transport and tell you what it would take for each transport to get the best performance, so that you can make the best choice for your NVMe over Fabrics solutions.

I hope you will join us on September 16th for this live session that is sure to be informative. Register today.


Notable Questions on NVMe-oF 1.1

Tim Lustig

Jul 14, 2020


At our recent SNIA Networking Storage Forum (NSF) webcast, “Notable Updates in NVMe-oF™ 1.1,” we explored the latest features of NVMe over Fabrics (NVMe-oF), discussing what's new in the NVMe-oF 1.1 release, support for CMB and PMR, managing and provisioning NVMe-oF devices with SNIA Swordfish™, and FC-NVMe-2. If you missed the live event, you can watch it here. Our presenters received many interesting questions on NVMe-oF and here are answers to them all:

Q. Is there an implementation of NVMe-oF with direct CMB access?

A. The Controller Memory Buffer (CMB) was introduced in NVMe 1.2 and first supported in the NVMe-oF 1.0 specification. It's supported if the storage vendor has implemented this within the hardware and the network supports it. We recommend that you ask your favorite vendor if they support the feature.

Q. What is the difference between PMR in an NVMe device and persistent memory in general?

A. The Persistent Memory Region (PMR) is a region within the SSD controller and it is reserved for system level persistent memory that is exposed to the host. Just like a Controller Memory Buffer (CMB), the PMR may be used to store command data, but because it's persistent it allows the content to remain even after power cycles and resets. To go further into this answer would require a follow up webinar.

Q. Are any special actions required on the host side over Controller Memory Buffers to maintain the data consistency?

A. To prevent possible disruption and to maintain data consistency, first the control address range must be configured so that addresses will not overlap, as described in the latest specification. There is also a flush command so that persistent memory can be cleared (also described in the specification).

Q. Is there a field to know the size of CMB and PMR supported by the controller? What is the general size of CMB in current devices?

A. The general size of CMB/PMR is vendor-specific, but there is a size register field for each, defined in the specification.
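As an illustration, the CMB size can be decoded from the controller's 32-bit CMBSZ register value. This is a hedged sketch based on the CMBSZ layout in the NVMe base specification; field positions and units should be verified against the specification revision you target, and the example register value is made up:

def decode_cmbsz(cmbsz: int) -> dict:
    """Decode the NVMe CMBSZ controller register (size and supported uses).

    Layout assumed here: bits 0-4 are capability flags (SQS, CQS, LISTS,
    RDS, WDS), bits 8-11 are the size units (SZU) and bits 12-31 the size
    (SZ) in SZU granules, where one granule = 4 KiB * 16**SZU.
    """
    szu = (cmbsz >> 8) & 0xF
    sz = (cmbsz >> 12) & 0xFFFFF
    granule = 4 * 1024 * (16 ** szu)
    return {
        "supports_sq": bool(cmbsz & 0x01),
        "supports_cq": bool(cmbsz & 0x02),
        "supports_prp_sgl_lists": bool(cmbsz & 0x04),
        "supports_read_data": bool(cmbsz & 0x08),
        "supports_write_data": bool(cmbsz & 0x10),
        "size_bytes": sz * granule,
    }

# Example (illustrative register value only):
# decode_cmbsz(0x0040021F) reports a 1 GiB CMB usable for queues and data.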

Q. Does having PMR guarantee that write requests to the PMR region have been committed to media, even though they have not been acknowledged before the power fail? Is there a max time limit in spec, within which NVMe drive should recover after power fail?

A. The implementation must ensure that the previous write has completed and that it is persistent. Time limit is vendor-specific.

Q. What is the average latency of an unladen swallow using NVMe-oF 1.1?

A. Average latency will depend on the media, the network and the way the devices are implemented. It also depends on whether or not the swallow is African or European (African swallows are non-migratory).

Q. Doesn't RDMA provide an 'implicit' queue on the controller side (negating the need for CMB for queues). Can the CMB also be used for data?

A. Yes, the CMB can be used to hold both commands and command data and the queues are managed by RDMA within host memory or within the adapter. By having the queue in the CMB you can gain performance advantages.

Q. What is a ballpark latency difference number between CMB and PMR access, can you provide a number based on assumption that both of these are accessed over RDMA fabric?

A. When using CMB, latency goes down but there are no specific latency numbers available as of this writing.

Q. What is the performance of NVMe/TCP in terms of IOPS as compared to NVMe/RDMA? (Good implementation assumed)

A. This is heavily implementation dependent as the network adapter may provide offloads for TCP. NVMe/RDMA generally will have lower latency.

Q. If there are several sequence-level errors, how can we correct the errors in an appropriate order?

Q. How could we control the right order for the error corrections in FC-NVMe-2?

A. These two questions are related, and the response below applies to both.

As mentioned in the presentation, Sequence-level error recovery provides the ability to detect and recover from lost commands, lost data, and lost status responses. For Fibre Channel, a Sequence consists of one or more frames: e.g., a Sequence containing a NVMe command, a Sequence containing data, or a Sequence containing a status response. 

The order for error correction is based on information returned from the target on the given state of the Exchange compared to the state of the Exchange at the initiator. To do this, and from a high-level overview, upon sending an Exchange containing an NVMe command, a timer is started at the initiator. The default value for this timer is 2 seconds, and if a response is not received for the Exchange before the timer expires, a message is sent from the initiator to the target to determine the status of the Exchange.

Also, a response from the target may be received before the information on the Exchange is obtained from the target. If this occurs the command just continues on as normal, the timer is restarted if the Exchange is still in progress, and all is good. Otherwise, if no response from the target has been received since sending the Exchange information message, then one of two actions usually takes place:

a) If the information returned from the target indicates the Exchange is not known, then the Exchange resources are cleaned up and released, and the Exchange containing the NVMe command is re-transmitted; or

b) If the information returned from the target indicates the Exchange is known and the target is still working on the command, then no error recovery is needed; the timer is restarted, and the initiator continues to wait for a response from the target.

An example of this behavior is a format command, where it may take a while for the command to complete, and the status response to be sent.

For some other typical information returned from the target per the Exchange status query:

  1. If the information returned from the target indicates the Exchange is known, and a ready to receive data message was sent by the target (e.g., a write operation), then the initiator requests the target to re-transmit the ready-to-receive data message, and the write operation continues at the transport level;
  2. If the information returned from the target indicates the Exchange is known, and data was sent by the target (e.g., a read operation), then the initiator requests the target to re-transmit the data and the status response, and the read operation continues at the transport level; and
  3. If the information returned from the target indicates the Exchange is known, and the status response was sent by the target, then the initiator requests the target to re-transmit the status response and the command completes accordingly at the transport level.

For further information, detailed informative Sequence level error recovery diagrams are provided in Annex E of the FC-NVMe-2 standard available via INCITS. 
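For illustration only, the recovery flow described above can be summarized as an initiator-side loop. This is a simplified sketch, not the FC-NVMe-2 state machine itself; the transport object, its method names and the status values are hypothetical stand-ins for real FC transport services:

ERROR_RECOVERY_TIMEOUT = 2.0  # default initiator timer (seconds) noted above

def send_with_sequence_level_recovery(exchange, transport):
    """Illustrative initiator loop for Sequence-level error recovery."""
    transport.send(exchange)  # Exchange carrying the NVMe command
    while True:
        # Wait up to the timer value for the status response.
        if transport.response_ready(exchange, timeout=ERROR_RECOVERY_TIMEOUT):
            return transport.take_response(exchange)
        # Timer expired: ask the target for the state of the Exchange.
        status = transport.query_exchange_status(exchange)
        if status == "unknown":
            # Target never saw it: clean up and re-transmit the command Exchange.
            transport.release(exchange)
            transport.send(exchange)
        elif status == "in_progress":
            # Target is still working (e.g. a long-running format): keep waiting.
            continue
        elif status == "sent_ready_to_receive":
            # Write path: ask for the ready-to-receive message again.
            transport.request_retransmit_xfer_rdy(exchange)
        elif status == "sent_data":
            # Read path: ask for the data and status response again.
            transport.request_retransmit_data_and_status(exchange)
        elif status == "sent_status":
            # Status was lost: ask for the status response again.
            transport.request_retransmit_status(exchange)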


J Metz

Jun 18, 2020


Key management focuses on protecting cryptographic keys from threats and ensuring keys are available when needed. And it’s no small task. That's why the SNIA Networking Storage Forum (NSF) invited key management and encryption expert, Judy Furlong, to present a “Key Management 101” session as part of our Storage Networking Security Webcast Series. If you missed the live webcast, I encourage you to watch it on-demand, as it was highly rated by attendees. Judy answered many key management questions during the live event; here are answers to those, as well as the ones we did not have time to get to.

Q. How are the keys kept safe in local cache?

A. It depends on the implementation. Options include:

  1. Only storing wrapped keys (each key individually encrypted with another key) in the cache.
  2. Encrypting the entire cache content with a separate encryption key.

In either case, one needs to properly protect/manage the wrapping key (KEK) or cache master key.
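As a small illustration of option 1 (a hedged sketch using the Python cryptography package's AES key wrap, which is one common way to wrap keys; the cache here is just a dict, and key IDs are made up):

import os
from cryptography.hazmat.primitives.keywrap import aes_key_wrap, aes_key_unwrap

kek = os.urandom(32)          # key-encryption key, itself protected elsewhere
cache = {}                    # local key cache holds only wrapped keys

def cache_key(key_id: str, plaintext_key: bytes) -> None:
    """Store a key in the cache in wrapped (encrypted) form only."""
    cache[key_id] = aes_key_wrap(kek, plaintext_key)

def fetch_key(key_id: str) -> bytes:
    """Unwrap a cached key just-in-time for use."""
    return aes_key_unwrap(kek, cache[key_id])

dek = os.urandom(32)          # a data-encryption key to protect
cache_key("volume-17", dek)
assert fetch_key("volume-17") == dek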

Q. Rotate key question: a Self-Encrypting Drive (SED) requires a permanent encryption key. How is rotation done?

A. It is the Authentication Encryption Key, used to access the drive and protect the Data (Media) Encryption Key, that can be rotated. If you change/rotate the DEK you destroy the data on the disk.

Q. You may want to point out that many people use “FIPS” for FIPS 140, which isn’t strictly correct, as there are numerous FIPS standards.

A. Yes, it is true that many people refer to FIPS 140 as just “FIPS,” which, as noted, is incorrect. There are many Federal Information Processing Standards (FIPS). That is why when I present/write something I am careful to always add the appropriate FIPS reference number (e.g. FIPS 140, FIPS 186, FIPS 201, etc.).

Q. So is the math for M of N key sharing the same as used for object store?

A. Essentially yes, it’s the same mathematical concepts that are being used. However, the object store approach uses a combination of data splitting and key splitting to allow encrypted data to be stored across a set of cloud providers.

Q. According to the size of the data, this should be the key, so for 1TB should a 1TB key be used? (Slide 12)

A. No, encrypting 1TB of data doesn’t mean that the key has to be that long. Most data encryption (at rest and in flight) uses symmetric encryption like AES, which is a block cipher. In block ciphers the data being encrypted is broken up into blocks of a specific size in order to be processed by that algorithm. For a good overview of block ciphers see the Encryption 101 webcast.
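To make that concrete, here is a minimal sketch (assuming the Python cryptography package; the data sizes are arbitrary) showing a single 256-bit AES key encrypting inputs of any length, with the cipher handling the data block by block internally:

import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)   # 32 bytes, regardless of data size
aesgcm = AESGCM(key)

for size in (4096, 1024 * 1024, 16 * 1024 * 1024):   # 4 KiB .. 16 MiB
    nonce = os.urandom(12)                  # unique per encryption
    plaintext = os.urandom(size)
    ciphertext = aesgcm.encrypt(nonce, plaintext, None)
    assert aesgcm.decrypt(nonce, ciphertext, None) == plaintext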

Q. What is the maximum lifetime of a certificate?

A. Maximum certificate validity (e.g. certificate lifetime) varies based on regulations/guidance, organizational policies, the application or purpose for which the certificate is used, etc. Certificates issued to humans for authentication or digital signature, or to common applications like web browsers, web services, S/MIME email clients, etc., tend to have validities of 1-2 years. CA certificates have slightly longer validities, in the 3-5-year range.

Q. In data center applications, why not just use the AEK as the DEK for an SED?

A. Assuming that AEK is Authentication Encryption Key: a defense-in-depth strategy is taken in the design of SEDs, where the DEK (or MEK) is a key that is generated on the drive and never leaves the drive. The MEK is protected by an AEK. This AEK is externalized from the drive and needs to be provided by the application/product that is accessing the SED in order to unlock the SED and take advantage of its capabilities.

Using separate keys follows the principle of only using a key for one purpose (e.g. encryption vs. authentication). It also reduces the attack surface for each key. If attackers obtain an AEK they also need access to the SED it belongs to as well as the application used to access that SED.

Q. Does NIST require a “timeframe” to rotate keys?

A. NIST recommendations for the cryptoperiod of keys used for a range of purposes may be found in section 5.3.6 of NIST SP 800-57 Part 1 Rev. 5.

Q. Does D@RE use symmetric or asymmetric encryption?

A. There are many Data at Rest (D@RE) implementations, but in the majority of D@RE implementations within the storage industry (e.g. controller-based, Self-Encrypting Drives (SEDs)) symmetric encryption is used. For more information about D@RE implementations, check out the Storage Security Series: Data-at-Rest webcast.

Q. In the TLS example shown, where does the “key management” take place?

A. There are multiple places in the TLS handshake example where different key management concepts discussed in the webinar are leveraged:

  • In steps 3 and 5 the client and server exchange their public key certificates (an example of asymmetric cryptography/certificate management)
  • In steps 4 and 6 the client and server validate each other’s certificates (an example of certificate path validation, part of key management)
  • In step 5 the client creates and sends the pre-master secret (an example of key agreement)
  • In step 7 the client and server use this pre-master secret and other information to calculate the same symmetric key that will be used to encrypt the communication channel (an example of key derivation)

Remember I said this was part of the Storage Networking Security Webcast Series? Check out the other webcasts we’ve done to date as well as what’s coming up.


Alex McDonald

May 27, 2020


Ever wonder how encryption actually works? Experts Ed Pullin and Judy Furlong provided an encryption primer to hundreds of attendees at our SNIA NSF webcast Storage Networking Security: Encryption 101. If you missed it, it's now available on-demand. We promised during the live event to post answers to the questions we received. Here they are:

Q. When using asymmetric keys, how often do the keys need to be changed?

A. How often asymmetric (and symmetric) keys need to be changed is driven by the purpose the keys are used for, the security policies of the organization/environment in which they are used and the length of the key material. For example, the CA/Browser Forum has a policy that certificates used for TLS (secure communications) have a validity of no more than two years.

Q. In earlier slides there was a mention that information can only be decrypted via the private key (not the public key). So, was Bob's public key retrieved using the public key of the signing authority?

A. In asymmetric cryptography the opposite key is needed to reverse the encryption process. So, if you encrypt using Bob's private key (normally referred to as a digital signature) then anyone can use his public key to decrypt. If you use Bob's public key to encrypt, then his private key should be used to decrypt. Bob's public key would be contained in the public key certificate that is digitally signed by the CA and can be extracted from the certificate to be used to verify Bob's signature.
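As a brief illustration of that asymmetry (a hedged sketch using the Python cryptography package; in practice Bob's public key would come from his CA-signed certificate rather than being generated locally as shown here):

from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes

# Bob's key pair: the private key signs, anyone with the public key verifies.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

message = b"message from Bob"
signature = private_key.sign(
    message,
    padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH),
    hashes.SHA256(),
)

# Verification with the public key raises InvalidSignature if it fails.
public_key.verify(
    signature,
    message,
    padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH),
    hashes.SHA256(),
)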

Q. Do you see TCG Opal 2.0 or TCG for Enterprise as requirements for drive encryption? What about the FIPS 140-2 L2 with cryptography validated by 3rd party NIST? As NIST was the key player in selecting AES, their stamp of approval for a FIPS drive seems to be the best way to prove that the cryptographic methods of a specific drive are properly implemented.

A. Yes, the TCG Opal 2.0 and TCG for Enterprise standards are generally recognized in the industry for self-encrypting drives (SEDs)/drive-level encryption. FIPS 140 cryptographic module validation is a requirement for sale into the U.S. Federal market and is also recognized in other verticals. Validation of the algorithm implementation (e.g. AES) is handled by the Cryptographic Algorithm Validation Program (CAVP), the companion to the FIPS 140 Cryptographic Module Validation Program (CMVP).

Q. Can you explain Constructive Key Management (CKM) that allows different keys given to different parties in order to allow levels of credentialed access to components of a single encrypted object?

A. Based on the available descriptions of CKM, this approach is using a combination of key derivation and key splitting techniques. Both of these concepts will be covered in the upcoming Key Management 101 webinar. An overview of CKM can be found in  this Computer World article (box at the top right). 

Q. Could you comment on Zero Knowledge Proofs and Digital Verifiable Credentials based on Decentralized IDs (DIDs)?

A. A Zero Knowledge Proof is a cryptographic-based method for being able to prove you know something without revealing what it is. This is a field of cryptography that has emerged in the past few decades and has only more recently transitioned from a theoretical research to a practical implementation phase with crypto currencies/blockchain and multi-party computation (privacy preservation).

Decentralized IDs (DIDs) are an authentication approach that leverages blockchain/decentralized ledger technology. Blockchains/decentralized ledgers employ cryptographic techniques and are an example of applied cryptography, using several of the underlying cryptographic algorithms described in this 101 webinar.

Q. Is Ed saying every block should be encrypted with a different key?

A. No. We believe the confusion was over the key transformation portion of Ed's diagram. In the AES algorithm a key transformation occurs that uses the initial key as input and provides each AES round with its own round key. This key expansion is part of the AES algorithm itself and is known as the key schedule.

Q. Where can I learn more about storage security?

A. Remember this Encryption 101 webcast was part of the SNIA Networking Storage Forum's Storage Networking Security Webcast Series. You can keep up with additional installments here and by following us on Twitter @SNIANSF.


J Metz

May 12, 2020


There's a lot that goes into effective key management. In order to properly use cryptography to protect information, one has to ensure that the associated cryptographic keys themselves are also protected. Careful attention must be paid to how cryptographic keys are generated, distributed, used, stored, replaced and destroyed in order to ensure that the security of cryptographic implementations is not compromised.

It's the next topic the SNIA Networking Storage Forum is going to cover in our Storage Networking Security Webcast Series. Join us on June 10, 2020 for Key Management 101 where security expert and Dell Technologies distinguished engineer, Judith Furlong, will introduce the fundamentals of cryptographic key management.

Key (see what I did there?) topics will include:

  • Key lifecycles
  • Key generation
  • Key distribution
  • Symmetric vs. asymmetric key management, and
  • Integrated vs. centralized key management models

In addition, Judith will also dive into relevant standards, protocols and industry best practices. Register today to save your spot for June 10th. We hope to see you there.
