Tom Friend

Oct 20, 2021

Q&A: Storage for AI Applications
What types of storage are needed for different aspects of AI? That was one of the many topics covered in our SNIA Networking Storage Forum (NSF) webcast “Storage for AI Applications.” It was a fascinating discussion and I encourage you to check it out on-demand. Our panel of experts answered many questions during the live roundtable Q&A. Here are answers to those questions, as well as the ones we didn’t have time to address.

Q. What are the different data set sizes and workloads in AI/ML in terms of data set size, sequential/random access, and write/read mix?

A. Data sets vary enormously from use case to use case, from GBs to possibly hundreds of PB. In general, the workloads are very read-heavy, often 95%+ reads. While sequential reads would be preferable, the access patterns tend to be closer to random. In addition, different use cases have very different data item sizes: some items may be GBs large, while others may be under 1 KB. These different sizes have a direct impact on storage performance and may change how you decide to store the data.

Q. Can you give more details on the risks associated with the use of online databases?

A. The biggest risk of using an online database is that you will be adding an additional workload to an important central system. In particular, you may find that the load is not as predictable as you think, and it may impact the database performance of the transactional system. In some cases this is not a problem, but when that system is handling actual transactions, you could be hurting your business.

Q. What is the difference between a DPU and a RAID/storage controller?

A. A Data Processing Unit (DPU) is intended to process the actual data passing through it. A RAID/storage controller only handles functions around the data, such as data resiliency, not the data itself. A RAID controller might take a CSV file and break it down for storage across different drives, but it does not actually analyze the data; a DPU might take that same CSV and examine its rows and columns to analyze it. While the distinction may seem small, there is a big difference in the software: a RAID controller does not need to know anything about the data, whereas a DPU must be programmed to deal with it. Another important aspect is whether the data is encrypted. If it is, a DPU will need additional security mechanisms to decrypt the data, whereas a RAID-based system is unaffected.

Q. Is a CPU-bypass device the same as a SmartNIC?

A. Not entirely. They are often discussed together, but a DPU is intended to process the data itself, whereas a SmartNIC may only process how the data is handled (encryption, TCP/IP offload, etc.). It is possible for a SmartNIC to also act as a DPU and process the data itself. There are new NVMe-oF™ technologies that are beginning to give FPGAs, TPUs, DPUs, GPUs and other devices direct access to other servers’ storage over a high-speed local area network without having to involve the CPU of that system.

Q. What work is being done to accelerate S3 performance with regard to AI?

A. A number of companies are working to accelerate the S3 protocol, and Presto and a number of Big Data technologies use it natively. For AI workloads, there are a number of caching technologies that handle the repeated reads of training data on a local system, minimizing the performance penalty.

Q. From a storage perspective, how do I take different types of data from different storage systems to develop a model?

A. Work with your project team to find the data you need and ensure it can be served to the ML/DL training (or inference) environment in a timely manner. You may need to copy (or clone) data onto a faster medium to achieve your goals. But look at the process as a whole, and do not underestimate the data cleansing/normalization steps in your storage analysis, as they can prove to be a bottleneck.

Q. Do I have to “normalize” that data to the same type, or can a model accommodate different data types?

A. In general, yes. Models can be very sensitive: a model trained on one set of data with one set of normalizations may not be accurate if inference is run on data taken from a different set with different normalizations. This does depend on the model, but you should be aware not only of the model itself but also of how the data was prepared prior to training (see the normalization sketch at the end of this post).

Q. If I have to change the data type, do I then need to store it separately?

A. It depends on your data: do other systems need it in the old format?

Q. Are storage solutions that are right for one form of AI also the best for others?

A. No. While it may be possible to use a single solution for multiple AI workloads, in general there are differences in the data that can necessitate different storage. A relatively simple example is large data (MBs) vs. small data (~1 KB). Data that is multiple MBs in size can easily be erasure coded and stored more cost-effectively; for small data, erasure coding is not practical and you will generally have to go with replication (see the overhead comparison at the end of this post).

Q. How do features like CPU bypass impact the performance of storage?

A. CPU bypass is essential for those times when all you need to do is transfer data from one peripheral to another without processing. For example, if you are trying to take data from a NIC and transfer it to a GPU without processing the data in any way, CPU bypass works very well. It prevents the CPU and system memory from becoming a bottleneck. Likewise, on a storage server, if you simply need to take data from an SSD and pass it to a NIC during a read, CPU bypass can really help boost system performance. One important note: if you are well under the limits of the CPU, the benefits of bypass are small, so think carefully about your system design and whether or not the CPU is a bottleneck. In some cases, people will use system memory as a cache, and in those cases bypassing the CPU isn’t possible.

Q. How important is it to use All-Flash storage compared to HDD or hybrid?

A. Of course, it depends on your workloads. For any single model, you may be able to make do with HDD. However, another consideration for many AI/ML systems is that their use can expand quite suddenly. Once there is some success, more people will want access to the data and the system may experience more load. So beware of the success of these early projects: the need to create multiple models from the same data could overload your system.

Q. Will storage for AI/ML necessarily be different from standard enterprise storage today?

A. Not necessarily. It may be possible for enterprise solutions today to meet your requirements. However, a key consideration is that if your current solution is barely able to handle its current requirements, then adding an AI/ML training workload may push it over the edge. In addition, even if your current solution is adequate, the sizes of many ML/DL models are growing exponentially every year, so what you provision today may not be adequate in a year or even a few months. Understanding the direction of the work your data scientists are pursuing is important for capacity and performance planning.
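As a concrete (and purely illustrative) example of the caching idea mentioned in the S3 answer above, the hypothetical sketch below keeps a local copy of each object so that repeated training reads stay off the network; the bucket names, keys, and cache path are placeholders, and it assumes boto3 is installed with credentials already configured.

```python
# Hypothetical sketch of the caching idea: keep a local copy of each S3 object
# so that the many re-reads during training epochs hit fast local storage
# instead of going back over the network. Bucket, keys, and cache path are
# placeholders; assumes boto3 is installed and credentials are configured.
import os
import boto3

CACHE_DIR = "/tmp/s3_training_cache"   # placeholder local cache location
s3 = boto3.client("s3")

def cached_object(bucket: str, key: str) -> str:
    """Return a local path for the object, downloading it only on first use."""
    local_path = os.path.join(CACHE_DIR, bucket, key.replace("/", "_"))
    if not os.path.exists(local_path):
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        s3.download_file(bucket, key, local_path)   # slow path: first epoch only
    return local_path                               # fast path: later epochs read locally

# Example: every epoch re-reads the same objects, but only the first pays the S3 cost.
# for epoch in range(10):
#     for key in ["train/shard-000.tfrecord", "train/shard-001.tfrecord"]:
#         path = cached_object("my-training-bucket", key)
#         ...  # feed `path` to the data loader
```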
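To make the point about normalization concrete, here is a minimal, hypothetical sketch (my own illustration, not material from the webcast) using scikit-learn: the normalization fitted at training time is persisted with the model artifacts and reused unchanged at inference time.

```python
# Hypothetical sketch: persist the training-time normalization so that
# inference uses exactly the same transformation. Assumes numpy,
# scikit-learn, and joblib are available; names and paths are illustrative.
import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# --- Training side ---
train_features = np.random.rand(1000, 16)            # stand-in for real training data
scaler = StandardScaler().fit(train_features)        # learn mean/std from training data
normalized_train = scaler.transform(train_features)  # what the model is actually trained on
joblib.dump(scaler, "scaler.joblib")                  # store alongside the model artifacts

# --- Inference side ---
scaler = joblib.load("scaler.joblib")                 # reuse the *same* normalization
new_sample = np.random.rand(1, 16)
normalized_sample = scaler.transform(new_sample)      # never re-fit on inference data
```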
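The overhead comparison behind the erasure coding vs. replication answer comes down to simple arithmetic. A back-of-the-envelope illustration (my own, with example 3-way replication and 8+3 erasure-code parameters):

```python
# Rough storage-overhead comparison: 3-way replication vs. an 8+3 erasure code.
# Illustrative numbers only; real systems add metadata, padding, and rebuild
# overheads, which is part of why erasure coding tiny (~1 KB) objects is
# rarely practical and replication is used instead.

def replication_overhead(copies: int) -> float:
    """Raw bytes stored per byte of user data with N full copies."""
    return float(copies)

def erasure_overhead(data_frags: int, parity_frags: int) -> float:
    """Raw bytes stored per byte of user data with a k+m erasure code."""
    return (data_frags + parity_frags) / data_frags

print(f"3-way replication: {replication_overhead(3):.2f}x raw capacity")  # 3.00x
print(f"8+3 erasure code:  {erasure_overhead(8, 3):.2f}x raw capacity")   # 1.38x
# A multi-MB object split into 8 data + 3 parity fragments costs ~1.38x raw
# capacity, but a ~1 KB object would yield fragments smaller than typical
# allocation units, so the erasure-coding savings largely disappear.
```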


Automating Discovery for NVMe IP-based SANs

Erik Smith

Oct 6, 2021

NVMe® IP-based SANs (including transports such as TCP, RoCE, and iWARP) have the potential to provide significant benefits in application environments ranging from the Edge to the Data Center. However, before we can fully unlock the potential of the NVMe IP-based SAN, we first need to address the manual and error-prone process that is currently used to establish connectivity between NVMe Hosts and NVM subsystems. This process includes administrators explicitly configuring each Host to access the appropriate NVM subsystems in their environment. In addition, any time an NVM Subsystem interface is added or removed, a Host administrator may need to explicitly update the configuration of impacted hosts to reflect this change.

Due to the decentralized nature of this configuration process, using it to manage connectivity for more than a few Host and NVM subsystem interfaces is impractical and adds complexity when deploying an NVMe IP-based SAN in environments that require a high degree of automation.
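To make the current per-host workflow concrete, here is a hedged sketch (my own illustration, not part of the webcast material) of the kind of configuration an administrator scripts on each host today with the Linux nvme-cli tool; the address, port, and subsystem NQN are placeholders.

```python
# Hypothetical per-host configuration script using the Linux nvme-cli tool.
# The discovery address, port, and subsystem NQN are placeholders; every host
# needs its own copy of this information, which is what makes the manual
# process hard to scale and easy to get wrong.
import subprocess

DISCOVERY_ADDR = "192.0.2.10"                   # placeholder discovery controller IP
PORT = "4420"                                   # conventional NVMe/TCP port
SUBSYS_NQN = "nqn.2021-10.org.example:subsys1"  # placeholder subsystem NQN

# Step 1: query the Discovery controller for subsystems this host may reach.
subprocess.run(
    ["nvme", "discover", "-t", "tcp", "-a", DISCOVERY_ADDR, "-s", PORT],
    check=True,
)

# Step 2: explicitly connect to each subsystem returned in the discovery log
# (here a single, hard-coded one for brevity).
subprocess.run(
    ["nvme", "connect", "-t", "tcp", "-n", SUBSYS_NQN, "-a", DISCOVERY_ADDR, "-s", PORT],
    check=True,
)

# Whenever an NVM subsystem interface is added or removed, an administrator
# has to revisit and re-run this configuration on every affected host.
```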

For these and other reasons, several companies have been collaborating on innovations that simplify and automate the discovery process used with NVMe IP-based SANs. This will be the topic of our live webcast on November 4, 2021 “NVMe-oF: Discovery Automation for IP-based SANs.”

During this session we will explain:

  • The NVMe IP-based SAN discovery problem
  • The types of network topologies that can support the automated discovery of NVMe-oF Discovery controllers
  • Direct Discovery versus Centralized Discovery
  • An overview of the discovery protocol

We hope you will join us. The experts working to address this limitation with NVMe IP-based SANs will be on hand to directly answer your questions on November 4th. Register today.


Demystifying the Fibre Channel SAN Protocol

John Kim

Sep 10, 2021

Ever wonder how Fibre Channel (FC) hosts and targets really communicate? Join the SNIA Networking Storage Forum (NSF) on September 23, 2021 for a live webcast, “How Fibre Channel Hosts and Targets Really Communicate.” This SAN overview will dive into details on how initiators (hosts) and targets (storage arrays) communicate and will address key questions, like:

  • How do FC links activate?
  • Is FC routable?
  • What kind of flow control is present in FC?
  • How do initiators find targets and set up their communication?
  • Finally, how does actual data get transferred between initiators and targets, since that is the ultimate goal?

Each SAN transport has its own way to initialize and transfer data. This is an opportunity to learn how it works in the Fibre Channel world. Storage experts will introduce the concepts and demystify the inner workings of FC SAN. Register today.


Storage for Applications Webcast Series

John Kim

Sep 8, 2021

Everyone enjoys having storage that is fast, reliable, scalable, and affordable. But it turns out different applications have different storage needs in terms of I/O requirements, capacity, data sharing, and security.  Some need local storage, some need a centralized storage array, and others need distributed storage—which itself could be local or networked. One application might excel with block storage while another with file or object storage. For example, an OLTP database might require small amounts of very fast flash storage; a media or streaming application might need vast quantities of inexpensive disk storage with extra security safeguards; while a third application might require a mix of different storage tiers with multiple servers sharing the same data. This SNIA Networking Storage Forum "Storage for Applications" webcast series will cover the storage requirements for specific uses such as artificial intelligence (AI), database, cloud, media & entertainment, automotive, edge, and more. With limited resources, it’s important to understand the storage intent of the applications in order to choose the right storage and storage networking strategy, rather than discovering the hard way that you’ve chosen the wrong solution for your application.

We kick off this series on October 5, 2021 with “Storage for AI Applications.” AI is a technology which itself encompasses a broad range of use cases, largely divided into training and inference. In this webcast, we’ll look at what types of storage are typically needed for different aspects of AI, including different types of access (local vs. networked, block vs. file vs. object) and different performance requirements. And we will discuss how different AI implementations balance the use of on-premises vs. cloud storage. Tune in to this SNIA Networking Storage Forum (NSF) webcast to boost your natural (not artificial) intelligence about application-specific storage. Register today. Our AI experts will be waiting to answer your questions.


Q&A: Security of Data on NVMe-oF

John Kim

Jul 28, 2021

Ensuring the security of data on NVMe® over Fabrics was the topic of our SNIA Networking Storage Forum (NSF) webcast “Security of Data on NVMe over Fabrics, the Armored Truck Way.” During the webcast our experts outlined industry trends, potential threats, security best practices and much more. The live audience asked several interesting questions and here are answers to them.

Q. Does use of strong authentication and network encryption ensure I will be compliant with regulations such as HIPAA, GDPR, PCI, CCPA, etc.?

A. Not by themselves. Proper use of strong authentication and network encryption will reduce the risk of data theft or improper data access, which can help achieve compliance with data privacy regulations. But full compliance also requires establishment of proper processes, employee training, system testing and monitoring. Compliance may also require regular reviews and audits of systems and processes plus the involvement of lawyers and compliance consultants.

Q. Does using encryption on the wire such as IPsec, FC_ESP, or TLS protect against ransomware, man-in-the-middle attacks, or physical theft of the storage system?

A. Proper use of data encryption on the storage network can protect against man-in-the-middle snooping attacks because any data intercepted would be encrypted and very difficult to decrypt. Use of strong authentication such as DH-HMAC-CHAP can reduce the risk of a man-in-the-middle attack succeeding in the first place. However, encrypting data on the wire does not by itself protect against ransomware or against physical theft of the storage systems, because the data is decrypted once it arrives on the storage system or on the accessing server.
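As a rough illustration of that boundary (my own sketch using generic TLS via Python's ssl module, not one of the storage transports named above), data is ciphertext only while it crosses the wire; once either endpoint reads it, it is plaintext again, which is why on-the-wire encryption alone does not address ransomware or physical theft.

```python
# Hypothetical sketch: TLS protects data only while it is in flight.
# Connects to a public HTTPS endpoint purely to illustrate the point.
import socket
import ssl

context = ssl.create_default_context()   # verifies the server certificate chain

with socket.create_connection(("example.com", 443)) as raw_sock:
    with context.wrap_socket(raw_sock, server_hostname="example.com") as tls_sock:
        # Everything sent here is encrypted on the wire, so a man-in-the-middle
        # capturing packets sees only ciphertext.
        tls_sock.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n")
        reply = tls_sock.recv(4096)

# At this point the data is plaintext again in this process's memory. If the
# endpoint itself is compromised by ransomware, or the storage behind it is
# physically stolen, encryption on the wire has not protected the data; that
# requires at-rest encryption, strong authentication, and good backups.
print(reply[:80])
```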

Q. Does "zero trust" mean I cannot trust anybody else on my IT team or trust my family members?

A. Zero Trust does not mean your coworker, mother or cousin is a hacker.  But it does require assuming that any server, user (even your coworker or mother), or application could be compromised and that malware or hackers might already be inside the network, as opposed to assuming all threats are being kept outside the network by perimeter firewalls. As a result, Zero Trust means regular use of security technologies--including firewalls, encryption, IDS/IPS, anti-virus software, monitoring, audits, penetration testing, etc.--on all parts of the data center to detect and prevent attacks in case one of the applications, machines or users has been compromised.

Q. Great information! Is there any reference security practice for eBOF and NVMe-oF™ that you recommend?

A. Generally, security practices with an eBOF using NVMe-oF would be similar to those used with traditional storage arrays (whether they use NVMe-oF, iSCSI, FCP, or a NAS protocol). You should authenticate users, put fine-grained access controls in place, encrypt data, and back up your data regularly. You might also want to physically or logically separate your storage network from the compute traffic or user access networks. Some differences may arise from the fact that with an eBOF, it's likely that multiple servers will access multiple eBOFs directly, instead of each server going to a central storage controller that in turn accesses the storage shelves or JBOFs.

Q. Are there concerns around FC-NVMe security when it comes to Fibre Channel Fabric services? Can a rogue NVMe initiator discover the subsystem controllers during the discovery phase and cause a denial-of-service kind of attack? Under such circumstances can DH-CHAP authentication help?

A. A rogue initiator might be able to discover storage arrays using the FC-NVMe protocol but this may be blocked by proper use of Fibre Channel zoning and LUN masking. If a rogue initiator is able to discover a storage array, proper use of DH-CHAP should prevent it from connecting and accessing data, unless the rogue initiator is able to successfully impersonate a legitimate server. If the rogue server is able to discover an array using FC-NVMe, but cannot connect due to being blocked by strong authentication, it could initiate a denial-of-service attack and DH-CHAP by itself would not block or prevent a denial-of-service attack.

Q. With the recent example of Colonial Pipeline cyber-attack, can you please comment on what are best practice security recommendations for storage with regards to separation of networks for data protection and security?

A. It's a best practice to separate storage networks from the application and/or user networks. This separation can be physical or logical and could include access controls and authentication within each physical or logical network. A separate physical network is often used for management and monitoring. In addition, to protect against ransomware, storage systems should be backed up regularly with some backups kept physically offline, and the storage team should practice restoring data from backups on a regular basis to verify the integrity of the backups and the restoration process.

For those of you who follow the many educational webcasts that the NSF hosts, you may have noticed that we are discussing the important topic of data security a lot. In fact, there is an entire Storage Networking Security Webcast Series that dives into protecting data at rest, protecting data in flight, encryption, key management, and more.

We’ve also been talking about NVMe-oF a lot. I encourage you to watch “NVMe-oF: Looking Beyond Performance Hero Numbers” where our SNIA experts explain why it is important to look beyond test results that demonstrate NVMe-oF’s dramatic reduction in latency. And if you’re ready for more, you can “Geek Out” on NVMe-oF here, where we’ve curated several great basic and advanced educational assets on NVMe-oF.


A Storage Debate Q&A: Hyperconverged vs. Disaggregated vs. Centralized

John Kim

Jul 12, 2021

The SNIA Networking Storage Forum recently hosted another webcast in our Great Storage Debate webcast series. This time, our SNIA experts debated three competing visions about how storage should be done: Hyperconverged Infrastructure (HCI), Disaggregated Storage, and Centralized Storage. If you missed the live event, it’s available on-demand. Questions from the webcast attendees made the panel debate quite lively. As promised, here are answers to those questions.

Q. Can you imagine a realistic scenario where the different storage types are used as storage tiers? How much are they interoperable?

A. Most HCI solutions already have a tiering/caching structure built in. However, a user could use HCI for hot to warm data and also tier less frequently accessed data out to a separate backup/archive. Some HCI solutions have close partnerships with backup/archive vendor solutions just for this purpose.

Q. Does Hyperconverged (HCI) use primarily object storage with erasure coding for managing the distributed storage, such as vSAN for VxRail (from Dell)?

A. That is accurate for vSAN, but other HCI solutions are not necessarily object based. Even if object-based, the object interface is rarely exposed. Erasure coding is a common method of distributing the data across the cluster for increased durability with efficient space sharing.

Q. Is there a possibility that two or more classifications of storage can co-exist or be deployed? Examples please?

A. Often IT organizations have multiple types of storage deployed in their data centers, particularly over time with various types of legacy systems. Also, HCI solutions that support iSCSI can interface with these legacy systems to enable better sharing of data to avoid silos.

Q. How would you classify HPC deployment, given it is more distributed file systems and converged storage? Does it need a new classification?

A. Often HPC storage is deployed on large, distributed file systems (e.g. Lustre), which I would classify as distributed, scale-out storage, but not hyperconverged, as the compute is still on separate servers.

Q. A lot of HCI solutions already allow heterogeneous nodes within a cluster. What about these “new” Disaggregated HCI solutions that use “traditional” storage arrays in the solution (thus not using a software-defined storage solution)? Doesn’t it sound like a step back? It seems most of the innovation comes on the software side.

A. The solutions marketed as disaggregated HCI are not HCI. They are traditional servers and storage combined in a chassis. This would meet the definition of converged, but not hyperconverged.

Q. Why is HCI growing so quickly and why does it seem so popular of late? It seems to be one of the fastest growing “data storage” use cases.

A. HCI has many advantages, as I shared in the slides up front. The #1 reason for the growth and popularity is the ease of deployment and management. Any IT person who is familiar with deploying and managing a VM can now easily deploy and manage the storage with the VM. No specialized storage system skill sets are required, which makes better use of limited IT people resources and reduces OpEx.

Q. Where do you categorize newer deployments like Vast Data? Is that considered NAS since it presents as NFS and CIFS?

A. I would categorize Vast Data as scale-out, software-defined storage. HCI is also a type of scale-out, software-defined storage, but with compute as well, so that is the key difference.

Q. So what happens when HCI works with ANY storage, including centralized solutions? What is HCI then?

A. I believe this question is referencing iSCSI interface support. HCI solutions that support iSCSI can interface with other types of storage systems to enable better sharing of data to avoid silos.

Q. With NVMe/RoCE becoming more available, offering DAS-like performance while massively reducing CPU usage on the hosts and saving license costs (potentially; we are only in the pilot phase), does the ball swing back towards disaggregated?

A. I’m not sure I fully understand the question, but RDMA can be used to streamline the inter-node traffic across the HCI cluster. Network performance becomes more critical as the size of the cluster, and therefore the traffic between nodes, increases, and RDMA can reduce any network bottlenecks. RoCEv2 is popular, and some HCI solutions also support iWARP. Therefore, as HCI solutions adopt RDMA, this is not a driver toward disaggregated.

Q. HCI was initially targeted at SMB and had difficulty scaling beyond 16 nodes. Why would HCI be the choice for large-scale enterprise implementations?

A. HCI has proven itself capable of running a broad range of workloads in small to large data center environments at this point. Each HCI solution can scale to different numbers of nodes, but usage data shows that single clusters rarely exceed about 12 nodes, at which point users start a new cluster. There is a mix of reasons for this: concerns about the size of failure domains, departmental or remote-site deployment size requirements, but often it’s the software license fees for the applications running on the HCI infrastructure that limit typical cluster sizes in practice.

Q. SPC (Storage Performance Council) benchmarks are still the gold standard (maybe?) and my understanding is they typically use an FC SAN. Is that changing? I understand that the underlying hardware is what determines performance, but I’m not aware of SPC benchmarks using anything other than SAN.

A. Myriad benchmarks are used to measure HCI performance across a cluster. I/O benchmarks that are variants of FIO are common for measuring storage performance, and compute performance is often measured using other benchmarks, such as TPC benchmarks for database performance, LoginVSI for VDI performance, etc.

Q. What is the current implementation mix ratio in the industry? What is the long-term projected mix ratio?

A. Today the enterprise is dominated by centralized storage, with HCI in second place and growing more rapidly. Large cloud service providers and hyperscalers are dominated by disaggregated storage, but they also use some centralized storage, and some have their own customized HCI implementations for specific workloads. HPC and AI customers use a mix of disaggregated and centralized storage. In the long term, it’s possible that disaggregated will have the largest overall share, since cloud storage is growing the most, with centralized storage and HCI splitting the rest.

Q. Is the latency high for HCI vs. disaggregated vs. centralized?

A. It depends on the implementation. HCI and disaggregated might have slightly higher latency than centralized storage if they distribute writes across nodes before acknowledging them or if they must retrieve reads from multiple nodes. But HCI and disaggregated storage can also be implemented in a way that offers the same latency as centralized.

Q. What about GPUDirect?

A. GPUDirect Storage allows GPUs to access storage more directly to reduce latency. Currently it is supported by some types of centralized and disaggregated storage. In the future, it might be supported with HCI as well.

Q. Splitting so many hairs here. Each of the three storage types is more about HOW the storage is consumed by the user/application than about the actual architecture.

A. Yes, that is largely correct, but the storage architecture can also affect how it’s consumed.

Q. Besides technical qualities, is there a financial differentiator between solutions? For example, OpEx and CapEx, ROI?

A. For very large-scale storage implementations, disaggregated generally has the lowest CapEx and OpEx because the higher initial cost of managing distributed storage software is amortized across many nodes and many terabytes. For medium to large implementations, centralized storage usually has the best CapEx and OpEx. For small to medium implementations, HCI usually has the lowest CapEx and OpEx because it’s easy and fast to acquire and deploy. However, it always depends on the specific type of storage and the skill set or expertise of the team managing the storage.

Q. Why wouldn’t disaggregating storage, compute and memory be the next trend? The hyperscalers have already done it. What are we waiting for?

A. Disaggregating compute is indeed happening, supported by VMs, containers, and faster network links. However, disaggregating memory across different physical machines is more challenging because even today’s very fast network links have much higher latency than memory. For now, memory disaggregation is largely limited to being done “inside the box” or within one rack with links like PCIe, or to cases where the compute and memory stick together and are disaggregated as a unit.

Q. Storage lends itself as the first choice for disaggregation, as mentioned before. What about disaggregation of other resources (such as networking, GPU, memory) in the future, and how do you believe it will impact the selection of centralized vs. disaggregated storage? Will Ethernet stay the first choice of fabric for disaggregation?

A. See the above answer about disaggregating memory. Networking can be disaggregated within a rack by using a very low-latency fabric, for example PCIe, but usually networking is used to support disaggregation of other resources. GPUs can be disaggregated but normally still travel with some CPU and memory in the same box, though this could change in the near future. Ethernet will indeed remain the first networking choice for disaggregation, but other network types will also be used (InfiniBand, Fibre Channel, Ethernet with RDMA, etc.).

Don’t forget to check out our other great storage debates, including: File vs. Block vs. Object Storage, Fibre Channel vs. iSCSI, FCoE vs. iSCSI vs. iSER, RoCE vs. iWARP, and Centralized vs. Distributed. You can view them all on our SNIAVideo YouTube Channel.
