Ceph Storage for AI/ML Q&A

At our SNIA Cloud Storage Technologies webinar, “Ceph Storage in a World of AI/ML Workloads,” our experts, Kyle Bader from IBM and Philip Williams from Canonical, explained how open source solutions like Ceph can provide almost limitless scaling of both performance and capacity for AI/ML workloads. If you missed the presentation, it’s available at the SNIA Educational Library along with the slides.

The live webinar audience was highly engaged with this timely topic, and they asked several questions. Our presenters have generously taken the time to answer them here. 

Q: What does "checkpoint" mean?

A: Checkpointing is periodically saving the state of the model (weights and optimizer states) to storage so that, if there is an issue with the training cluster, training can resume from the last checkpoint instead of starting from scratch.
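For readers newer to training pipelines, here is a minimal sketch of what checkpointing looks like in PyTorch. It is an illustration, not the presenters' code; the model, optimizer, and the Ceph-backed mount path are hypothetical placeholders.

```python
import torch

CKPT_PATH = "/mnt/ceph/checkpoints/ckpt.pt"  # hypothetical path on a Ceph-backed mount

def save_checkpoint(model, optimizer, epoch, path=CKPT_PATH):
    # Persist model weights and optimizer state so training can resume later.
    torch.save({
        "epoch": epoch,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, path)

def load_checkpoint(model, optimizer, path=CKPT_PATH):
    # Restore the last saved state; training continues from the next epoch.
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"] + 1
```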

Q: Is Ceph a containerized solution?

A: One of the ways to deploy Ceph is as a set of containers. These can be hosted directly on conventional Linux hosts and coordinated with cephadm / systemd / podman, or in a Kubernetes cluster using an operator like Rook.

Q: Any advantages or disadvantages of using Hardware RAID to help with data protection and redundancy along with Ceph?

A: The main downside to using RAID under Ceph is the additional overhead. You can adjust the way Ceph does data protection to compensate, but generally that changes the availability of the pools in a negative way. That said, you could, in theory, construct larger storage systems in terms of total PBs if you were to have an OSD per RAID aggregate instead of per physical disk.

Q: What changes are we seeing to upgrade storage hardware for AI? Only GPUs, or is there other specific hardware to be upgraded?

A: No GPU upgrades are required for the storage hardware. If you’re running training or inference co-resident on the storage hosts, then you could include GPUs, but for standalone storage serving AI workloads, there is no need for GPUs in the storage systems themselves. Off-the-shelf servers configured for high throughput are all Ceph needs.

Q: Does Ceph provide AI features?   

A: In the context of AI, the biggest thing that is needed, and that is provided by Ceph, is scalability in multiple dimensions (capacity, bandwidth, IOPS, etc.). We also have some capabilities to do device failure prediction using a model.
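As a hedged illustration of the device-failure-prediction capability mentioned above, the sketch below shells out to Ceph's device health commands (`ceph device ls`, `ceph device predict-life-expectancy`). It assumes a cluster with a disk-prediction manager module enabled, and the JSON field name is an assumption based on typical `ceph device ls` output rather than anything from this webinar.

```python
import json
import subprocess

def ceph_json(*args):
    # Run a Ceph CLI command and parse its JSON output.
    out = subprocess.check_output(["ceph", *args, "--format", "json"])
    return json.loads(out)

# Enumerate devices known to the cluster, then ask for a life-expectancy
# prediction for each (requires a disk-prediction mgr module to be enabled).
for dev in ceph_json("device", "ls"):
    devid = dev.get("devid", "<unknown>")  # field name assumed from `ceph device ls` JSON
    try:
        prediction = subprocess.check_output(
            ["ceph", "device", "predict-life-expectancy", devid], text=True
        ).strip()
        print(devid, "->", prediction or "no prediction yet")
    except subprocess.CalledProcessError:
        print(devid, "-> prediction not available")
```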

Q: How do you see a storage admin career in the AI industry, and what are the key things to learn?

A: Understanding how to use scale-out storage technologies like Ceph, and understanding the hardware: differences in types of SSDs, networking, and basically "feeds and speeds" type details. It's also essential to learn as much as possible about what AI practitioners are doing, so that you can "meet them in their world" and have constructive conversations.

Q: Any efforts to move the processing into the Ceph infrastructure so the data doesn't have to move? 

A: Yes! At the low level, RADOS (Reliable Autonomic Distributed Object Store) has always had classes that can be executed on objects; they tend to be used to provide the semantics needed for different protocols. So, at the core, Ceph has always been a computational storage technology. More recently, as an example, we’ve seen S3 Select added to the object protocol, which allows pushdown of filtering and aggregation - think pNFS but for tabular data, with storage-side filtering and aggregation.
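To make the S3 Select point concrete, here is a short, hedged example using boto3's `select_object_content` against an S3-compatible endpoint such as a Ceph RGW. The endpoint URL, credentials, bucket, key, and CSV schema are all made-up placeholders; only rows matching the SQL filter cross the network.

```python
import boto3

# Hypothetical RGW endpoint and credentials; any S3-compatible endpoint that
# supports S3 Select looks the same from the client side.
s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.com:8000",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Push the filter down to the storage side: only matching rows are returned.
resp = s3.select_object_content(
    Bucket="training-data",
    Key="events.csv",
    ExpressionType="SQL",
    Expression="SELECT s.user_id, s.score FROM S3Object s WHERE CAST(s.score AS FLOAT) > 0.9",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())
```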

Q: What is the realistic checkpoint frequency?

A: The best thing to do is to checkpoint every round, but that might not be viable depending on the bandwidth of the storage system, the size of the checkpoint, the amount of data parallelization in the training pipeline, and whether or not asynchronous checkpointing is used. The more frequent, the better. As the GPU cluster gets bigger, the need to checkpoint more frequently goes up, because you need to protect against failures in the training environment.
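A rough way to reason about "how frequent is viable" is to compare checkpoint write time against the stall overhead you are willing to tolerate. The back-of-envelope sketch below is not a rule from the webinar; the 5% overhead budget and the example sizes are assumptions.

```python
def min_checkpoint_interval(ckpt_size_gib, write_gib_per_s, max_overhead=0.05):
    # Time to write one synchronous checkpoint.
    write_time_s = ckpt_size_gib / write_gib_per_s
    # Smallest interval that keeps checkpoint stalls under `max_overhead`
    # of wall-clock time (asynchronous checkpointing relaxes this further).
    return write_time_s / max_overhead

# Example: a 2 TiB checkpoint against 50 GiB/s of aggregate write bandwidth.
print(min_checkpoint_interval(2048, 50))  # ~41 s write -> checkpoint every ~14 min
```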

Q: Why train with Ceph storage instead of direct-attached NVMe storage? That would speed up the training by orders of magnitude.

A: When you’re looking at modest data set sizes and ways to do significant levels of data parallelization, yes, you could copy these data sets onto locally attached NVMe storage, and yes, you would get faster results, just because that’s how physics works.

However, for larger recommendation systems, you may be dealing with much larger training data set sizes, and you might not be able to fit all of the necessary data onto the local NVMe storage of the system. In this case, there are a number of trade-offs people make that favor the use of external Ceph storage, including the size of your GPU system, the need for more flexibility, and the need for experimentation to test various ways to accomplish data-level parallelism and data pipelining. All of this is intended to maximize the use of the GPUs, and it causes you to readjust how you partition, pre-load, and use data on local NVMe versus external Ceph storage. Flexibility and experimentation are important, and there are always trade-offs.
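One common pattern behind that trade-off is to keep the full data set in external Ceph object storage and stage objects onto local NVMe on first access. The sketch below illustrates the idea with boto3; the class name, endpoint, bucket, and cache path are hypothetical, and a real pipeline would add sharding, eviction, and decoding of the raw bytes.

```python
import os
import boto3

class NVMeCachedDataset:
    """Fetch objects from an S3-compatible Ceph endpoint, caching them on local NVMe.

    Implements __len__/__getitem__, so it can be used as a map-style dataset."""

    def __init__(self, bucket, keys, cache_dir="/nvme/cache",
                 endpoint="http://rgw.example.com:8000"):
        self.s3 = boto3.client("s3", endpoint_url=endpoint)
        self.bucket, self.keys, self.cache_dir = bucket, keys, cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, idx):
        key = self.keys[idx]
        local_path = os.path.join(self.cache_dir, key.replace("/", "_"))
        if not os.path.exists(local_path):
            # First access: pull from external Ceph storage onto local NVMe.
            self.s3.download_file(self.bucket, key, local_path)
        with open(local_path, "rb") as f:
            return f.read()  # decode/transform as needed by the training pipeline
```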

Q: How can we integrate Ceph into an HCI environment using VMware?

A: It’s possible to use NVMe-oF for that.

Q: Is there a QAT analogue (compression) on EPYCs? 

A: Not today, but you could use one of the legacy PCIe add-in cards.

Solving Cloud Object Storage Incompatibilities in a Multi-Vendor Community

Michael Hoard

Oct 25, 2024

The SNIA Cloud Storage Technologies Initiative (CSTI) conducted a poll early in 2024 during the live webinar “Navigating the Complexities of Object Storage Compatibility,” in which 72% of organizations reported encountering incompatibility issues between various object storage implementations. These results prompted a call to action for SNIA to create an open expert community dedicated to resolving these issues and building best practices for the industry.

Since then, SNIA CSTI has partnered with the SNIA Cloud Storage Technical Work Group (TWG) and successfully organized, hosted, and completed the first SNIA Cloud Object Storage Plugfest (multi-vendor interoperability testing), co-located at the SNIA Storage Developer Conference (SDC) in September 2024 in Santa Clara, CA. Participating Plugfest companies included engineers from Dell, Google, Hammerspace, IBM, Microsoft, NetApp, VAST Data, and Versity Software. Three days of Plugfest testing discovered and resolved issues, and included a Birds of a Feather (BoF) session to gain consensus on next steps for the industry. Plugfest contributors are now planning two 2025 Plugfest events: Denver in April and Santa Clara in September.

It’s a collaborative effort that we’ll discuss in detail on November 21, 2024 at our next live SNIA CSTI webinar, “Building a Community to Tackle Cloud Object Storage Incompatibilities.” At this webinar, we will share insights into industry best practices, explain the benefits your implementation may gain with improved compatibility, and provide an overview of how a wide range of vendors is uniting to address real customer issues, discussing:
  • Implications on client applications
  • Complexity and variety of APIs
  • Access control mechanisms
  • Performance and scalability requirements
  • Real-world incompatibilities found in various object storage implementations
  • Missing or incorrect response headers
  • Unsupported API calls and unexpected behavior
We also plan to have some of the Plugfest engineers on tap to share their Plugfest experience, answer questions, and welcome client and server cloud object storage engineers to join us in building momentum at our upcoming 2025 Plugfests. Register today to join us on November 21st. We look forward to seeing you.

Michael Hoard, SNIA CSTI Chair

Complexities of Object Storage Compatibility Q&A

Gregory Touretsky

Jan 26, 2024

72% of organizations have encountered incompatibility issues between various object storage implementations, according to a poll at our recent SNIA Cloud Storage Technologies Initiative webinar, “Navigating the Complexities of Object Storage Compatibility.” If you missed the live presentation, or you would like to see the answers to the other poll questions we asked the audience, you can view it on-demand at the SNIA Educational Library. The audience was highly engaged during the live event and asked several great questions. Here are answers to them all.

Q. Do you see the need for fast object storage for AI kind of workloads?

A. Yes, the demand for fast object storage in AI workloads is growing. Initially, object storage was mainly used for backup or archival purposes. However, its evolution into Data Lakes and the introduction of features like the S3 SELECT API have made it more suitable for data analytics. The launch of Amazon's S3 Express, a faster yet more expensive tier, is a clear indication of this trend. Other vendors are following suit, suggesting a shift towards object storage as a primary data storage platform for specific workloads.

Q. As Object Storage becomes more prevalent in the primary storage space, could you talk about data protection, especially functionalities like synchronous replication and multi-site deployments - or is your view that this is not needed for object storage deployments?

A. Data protection, including functionalities like synchronous replication and multi-site deployments, is essential for object storage, especially as it becomes more prevalent in primary storage. Various object storage implementations address this differently. For instance, Amazon S3 supports asynchronous replication. Azure ZRS (Zone-redundant storage) offers something akin to synchronous replication within a specific geographical area. Many on-premises solutions provide multi-site deployment and replication capabilities. It's crucial for vendors to offer distinct features and value additions, giving customers a range of choices to best meet their specific requirements. Ultimately, customers must define their data availability and durability needs and select the solution that aligns with their use case.

Q. Regarding polling question #3 during the webinar, why did the question only ask “above 10PB?” We look for multi-PB like 100PB ... does this mean Object Storage is not suitable for multi PB?

A. Object storage is inherently scalable and can support deployments ranging from petabyte to exabyte scale. However, scalability can vary based on specific implementations. Each object storage solution may have its own limits in terms of capacity. It's important to review the details of each solution to ensure it meets your specific needs for multi-petabyte scale deployments.

Q. Is Wasabi 100% compatible with Amazon S3?

A. While we typically avoid discussing specific vendors in a general forum, it's important to note that most 'S3-compatible' object storage implementations have some discrepancies when compared to Amazon S3. These differences can vary in significance. Therefore, we always recommend testing your actual workload against the specific object storage solution to identify any critical issues or incompatibilities.

Q. What are the best ways to see a unified view of different types of storage -- including objects, file and blocks? This may be most relevant for enterprise-wide data tracking and multi-cloud deployments.

A. There are various solutions available from different vendors that offer visibility into multiple types of storage, including object, file, and block storage. These solutions are particularly useful for enterprise-wide data management and multi-cloud deployments. However, this topic extends beyond the scope of our current discussion. SNIA might consider addressing this in a separate, dedicated webinar in the future.

Q. Is there any standard object storage implementation against which the S3 compatibility would be defined?

A. Amazon S3 serves as a de facto standard for object storage implementation. Independent software vendors (ISVs) can decide the degree of compatibility they want to achieve with Amazon S3, including which features to implement and to what extent. The objective isn't necessarily to achieve identical functionality across all implementations, but rather for each ISV to be cognizant of the specific differences and potential incompatibilities in their own solutions. Being aware of these discrepancies is key, even if complete compatibility isn't achieved.

Q. With the introduction of directory buckets, do you anticipate vendors picking up compatibility there as well or maintaining a strictly flat namespace?

A. That's an intriguing question. We are putting together an ongoing object storage forum, which we will delve into more in follow-up calls, and which will serve as a platform for these kinds of discussions. We anticipate addressing not only the concept of directory buckets versus a flat namespace, but also exploring other ideas like performance enhancements and alternate transport layers for S3. This forum is intended to be a collaborative space for discussing future directions in object storage. If you’re interested, contact cloudtwg@snia.org.

Q. How would an incompatibility be categorized as something that is important for clients vs. just something that doesn't meet the AWS spec/behavior?

A. Incompatibilities should be assessed based on the specific needs and priorities of each implementor. While we don't set universal compatibility goals, it's up to every implementor to determine how closely they align with S3 or other protocols. They must decide whether to address any discrepancies in behavior or functionality based on their own objectives and their clients' requirements. Essentially, the significance of an incompatibility is determined by its impact on the implementor's goals and client needs.

Q. Have customers experienced incompatibilities around different SDKs with regard to HA behaviors? Load balancers vs. round robin DNS vs. other HA techniques on-prem and in the cloud?

A. Yes, customers do encounter incompatibilities related to different SDKs, particularly concerning high availability (HA) behaviors. Object storage encompasses more than just APIs; it also involves implementation choices like load balancing decisions and HA techniques. Discrepancies often arise due to these differences, especially when object storage solutions are deployed within a customer's data center and need to integrate with the existing networking infrastructure. These incompatibilities can be due to various factors, including whether load balancing is handled through round robin DNS, dedicated load balancers, or other HA techniques, either on-premises or in the cloud.

Q. Any thoughts on keeping pace with AWS as they evolve the S3 API? I'm specifically thinking about the new Directory Bucket type and the associated API changes to support hierarchy.

A. We at the SNIA Cloud Storage Technical Work Group are in dialogue with Amazon and are encouraging their participation in our planned Plugfest at SDC’24. Their involvement would be invaluable in helping us anticipate upcoming changes and understand new developments, such as the Directory Bucket type and its associated API changes. This new variation of S3 from Amazon, which differs from the original implementation, underscores the importance of compatibility testing. While complete compatibility may not always be achievable, it's crucial for ISVs to be fully aware of how their implementations differ from S3's evolving standards.

Q. When it comes to object store data protection with backup software, do you also see some incompatibilities with recovered data?

A. When data is backed up to an object storage system, there's a fundamental expectation that it can be reliably retrieved later. This reliability is a cornerstone of any storage platform. However, issues can arise when data is initially stored in one specific object storage implementation and later transferred to a different one. If this transfer isn't executed in accordance with the backup software provider's requirements, it could lead to difficulties in accessing the data in the future. Therefore, careful planning and adherence to recommended practices are crucial during any data migration process to prevent such compatibility issues.

The SNIA Cloud Storage Technical Work Group is actively working on this topic. If you want to get involved, reach out at cloudtwg@snia.org and follow us @sniacloud_com.
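Since several of the answers above come back to "test your actual workload against the specific implementation," here is a small, hedged sketch of what such a compatibility probe can look like with boto3: it exercises a few basic S3 calls against an S3-compatible endpoint and reports unsupported operations or missing response headers. The endpoint and bucket names are placeholders, and a real test suite would cover far more of the API surface.

```python
import boto3
from botocore.exceptions import ClientError

# Hypothetical endpoint and bucket; point this at the implementation under test.
s3 = boto3.client("s3", endpoint_url="https://s3.example.internal")
BUCKET = "compat-test"

def check(name, fn):
    # Run one API call and report whether the implementation accepts it and
    # whether a commonly expected response header is present.
    try:
        resp = fn()
        headers = resp["ResponseMetadata"]["HTTPHeaders"]
        has_id = "present" if "x-amz-request-id" in headers else "MISSING"
        print(f"{name}: OK (x-amz-request-id={has_id})")
    except ClientError as e:
        print(f"{name}: FAILED ({e.response['Error'].get('Code')})")

check("put_object", lambda: s3.put_object(Bucket=BUCKET, Key="probe.txt", Body=b"hello"))
check("get_object", lambda: s3.get_object(Bucket=BUCKET, Key="probe.txt"))
check("list_objects_v2", lambda: s3.list_objects_v2(Bucket=BUCKET))
check("get_object_attributes", lambda: s3.get_object_attributes(
    Bucket=BUCKET, Key="probe.txt", ObjectAttributes=["ETag", "ObjectSize"]))
```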

It’s All About Cloud Object Storage Interoperability

Michael Hoard

Dec 11, 2023

Object storage has firmly established itself as a cornerstone of modern data centers and cloud infrastructure. Ensuring API compatibility has become crucial for object storage developers who want to benefit from the wide ecosystem of existing applications. However, achieving compatibility can be challenging due to the complexity and variety of the APIs, access control mechanisms, and performance and scalability requirements. The SNIA Cloud Storage Technologies Initiative, together with the SNIA Cloud Storage Technical Work Group, is working to address the issues of cloud object storage complexity and interoperability. We’re kicking off 2024 with two exciting initiatives: 1) a webinar on January 9, 2024, and 2) a Plugfest in September 2024. Here are the details:

Webinar: Navigating the Complexities of Object Storage Compatibility
In this webinar, we'll highlight real-world incompatibilities found in various object storage implementations. We'll discuss specific examples of existing discrepancies, such as missing or incorrect response headers, unsupported API calls, and unexpected behavior. We’ll also describe the implications these have on actual client applications. This analysis is based on years of experience with implementation, deployment, and evaluation of a wide range of object storage systems on the market. Attendees will leave with a deeper understanding of the challenges around compatibility and how to address them in their own applications. Register here to join us on January 9, 2024.

Plugfest: Cloud Object Storage Plugfest
SNIA is planning an open, collaborative Cloud Object Storage Plugfest, co-located at the SNIA Storage Developer Conference (SDC) scheduled for September 2024, to work on improving cross-implementation compatibility for client and/or server implementations of private and public cloud object storage solutions. This endeavor is designed to be an independent, vendor-neutral effort with broad industry support, focused on a variety of solutions, including on-premises and in the cloud. This Plugfest aims to reduce compatibility issues, thus improving customer experience and increasing the adoption rate of object storage solutions. Click here to let us know if you're interested.

We hope you will consider participating in both of these initiatives!

An Open Standard for Namespace Management

Michael Hoard

Sep 20, 2023

The days of simple, static, self-contained file systems have long passed. Today, we have complex, dynamic namespaces, mixing block, file, object, key-value, queue, and graph-based resources, accessed via multiple protocols, and distributed across multiple systems and geographic sites. These complexities result in new challenges for simplifying management.

There is good news on addressing this issue, and the SNIA Cloud Storage Technologies Initiative (CSTI) will explain how in our live webinar “Simplified Namespace Management – The Open Standards Way” on October 18, 2023. In it, David Slik, Chair of the SNIA Cloud Storage Technical Work Group, will demonstrate how the SNIA Cloud Data Management Interface (CDMI™), an open ISO standard (ISO/IEC 17826:2022) for managing data objects and containers, already includes extensive capabilities for simplifying the management of complex namespaces. In this webinar, you’ll learn the benefits of simplifying namespace management in an open standards way, including namespace discovery, introspection, exports, imports and more, discussing:
  • Challenges and limitations with proprietary namespace management
  • Overview of namespaces and industry evolution
  • Lack of portability between platforms
  • Using CDMI for simplified and consistent namespace management
  • Use cases for namespace management
As one of the key architects of CDMI, David will dive into the details, discuss real-world use cases and answer your questions. We hope you’ll join us on October 18th. Register here.
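For a feel of what CDMI-based namespace introspection looks like on the wire, here is a hedged sketch that lists a container over CDMI's HTTP interface. The endpoint, credentials, and container path are hypothetical; the headers and the children/metadata fields follow the CDMI (ISO/IEC 17826) container representation.

```python
import requests

# Hypothetical CDMI endpoint; any CDMI-conformant server exposes containers this way.
CDMI_ROOT = "https://cdmi.example.com/cdmi"
HEADERS = {
    "Accept": "application/cdmi-container",
    "X-CDMI-Specification-Version": "1.1.1",
}

resp = requests.get(f"{CDMI_ROOT}/datasets/", headers=HEADERS, auth=("user", "pass"))
resp.raise_for_status()
container = resp.json()

print("container:", container.get("objectName"))
print("children:", container.get("children", []))   # namespace discovery
print("metadata:", container.get("metadata", {}))    # introspection (capabilities, exports, ...)
```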

Michael Hoard

Aug 21, 2023

Unification of structured and unstructured data has long been a goal – and a challenge – for organizations. Data Fabric is an architecture, set of services, and platform that standardizes and integrates data across the enterprise regardless of data location (On-Premises, Cloud, Multi-Cloud, Hybrid Cloud), enabling self-service data access to support various applications, analytics, and use cases. The data fabric leaves data where it lives and applies intelligent automation to govern, secure, and bring AI to your data. How a data fabric abstraction layer works and the benefits it delivers was the topic of our recent SNIA Cloud Storage Technologies Initiative (CSTI) webinar, “Data Fabric: Connecting the Dots between Structured and Unstructured Data.” If you missed it, you can watch it on-demand and access the presentation slides at the SNIA Educational Library. We did not have time to answer audience questions at the live session. Here are answers from our expert, Joseph Dain.

Q. What are some of the biggest challenges you have encountered when building this architecture?

A. The scale of unstructured data makes it challenging to build a catalog of this information. With structured data you may have thousands or hundreds of thousands of table assets, but in unstructured data you can have billions of files and objects that need to be tracked at massive scale. Another challenge is masking unstructured data. With structured data you have a well-defined schema, so it is easier to mask specific columns, but in unstructured data you don’t have such a schema, so you need to be able to understand what term needs to be masked in an unstructured document, and you need to know the location of that field without having the luxury of a well-defined schema to guide you.

Q. There can be lots of data access requests from many users. How is this handled?

A. The data governance layer has two aspects that are leveraged to address this. The first aspect is data privacy rules, which are automatically enforced during data access requests and are typically controlled at a group level. The second aspect is the ability to create custom workflows with personas that enable users to initiate data access requests, which are sent to the appropriate approvers.

Q. What are some of the next steps with this architecture?

A. One area of interest is leveraging computational storage to do the classification and profiling of data to identify aspects such as personally identifiable information (PII). In particular, profiling vast amounts of unstructured data for PII is a compute, network, storage, and memory intense operation. By performing this profiling leveraging computational storage close to the data, we gain efficiencies in the rate at which we can process data with less resource consumption.

We continue to offer educational webinars on a wide range of cloud-related topics throughout the year. Please follow us @sniacloud_com to make sure you don’t miss any.
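To illustrate why masking unstructured data is harder than masking a column, here is a deliberately simplified sketch of pattern-based PII masking over free text. It is not the implementation discussed in the webinar; the patterns are illustrative only, and production systems combine classification models, locale-aware formats, and validation.

```python
import re

# Illustrative patterns only; real PII detection is far more involved.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def mask_pii(text: str) -> str:
    # Replace each detected PII span with a tag; unlike structured data, there is
    # no schema or column to anchor the masking rule to, so we must scan the text.
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()} REDACTED]", text)
    return text

print(mask_pii("Contact Jane at jane.doe@example.com or 555-867-5309, SSN 123-45-6789."))
```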

Training Deep Learning Models Q&A

Erin Farr

May 19, 2023

The estimated impact of Deep Learning (DL) across all industries cannot be overstated. In fact, analysts predict deep learning will account for the majority of cloud workloads, and training of deep learning models will represent the majority of server applications in the next few years. It’s the topic the SNIA Cloud Storage Technologies Initiative (CSTI) discussed at our webinar “Training Deep Learning Models in the Cloud.” If you missed the live event, it’s available on-demand at the SNIA Educational Library, where you can also download the presentation slides. The audience asked our expert presenters, Milind Pandit from Habana Labs (an Intel company) and Seetharami Seelam from IBM, several interesting questions. Here are their answers:

Q. Where do you think most of the AI will run, especially training? Will it be in the public cloud, or will it be on-premises, or both?

[Milind:] It’s probably going to be a mix. There are advantages to using the public cloud, especially because it’s pay as you go. So, when experimenting with new models, new innovations, new uses of AI, and when scaling deployments, it makes a lot of sense. But there are still a lot of data privacy concerns. There are increasing numbers of regulations regarding where data needs to reside physically and in which geographies. Because of that, many organizations are deciding to build out their own data centers, and once they have large-scale training or inference successfully underway, they often find it cost effective to migrate their public cloud deployment into a data center where they can control the cost and other aspects of data management.

[Seelam:] I concur with Milind. We are seeing a pattern of dual approaches. There are some small companies that don’t have the capital necessary, nor the expertise or teams necessary, to acquire GPU-based servers and deploy them. They are increasingly adopting public cloud. We are seeing some decent sized companies that are adopting this same approach as well. Keep in mind these GPU servers tend to be very power hungry, so you need the right floor plan, power, cooling, and so forth. So, public cloud definitely helps you have easy access and to pay for only what you consume. We are also seeing trends where certain organizations have constraints that restrict moving certain data outside their walls. In those scenarios, we are seeing customers deploy GPU systems on-premises. I don’t think it’s going to be one or the other. It is going to be a combination of both, but by adopting more of a common platform technology, this will help unify their usage model in public cloud and on-premises.

Q. What is GDR? You mentioned using it with RoCE.

[Seelam:] GDR stands for GPUDirect RDMA. There are at least three different ways a GPU on one node can communicate with a GPU on another node:
  • The GPU can use TCP, where GPU data is copied back into the CPU, which orchestrates the communication to the CPU and GPU on another node. That obviously adds a lot of latency going through the whole TCP protocol.
  • Another way is through RoCEv2 or RDMA, where CPUs, FPGAs and/or GPUs actually talk to each other through industry standard RDMA channels. So, you send and receive data without the added latency of traditional networking software layers.
  • A third method is GDR, where a GPU on one node can talk to a GPU on another node directly. This is done through network interfaces where basically the GPUs are talking to each other, again bypassing traditional networking software layers.

Q. When you are talking about RoCE, do you mean RoCEv2?

[Seelam:] That is correct, I’m talking only about RoCEv2. Thank you for the clarification.

Q. Can you comment on storage needs for DL training, and have you considered the use of scale-out cloud storage services for deep learning training? If so, what are the challenges and issues?

[Milind:] The storage needs are 1) massive and 2) based on the kind of training that you’re doing (data parallel versus model parallel). With different optimizations, you will need parts of your data to be local in many circumstances. It’s not always possible to do efficient training when data is physically remote and there’s a large latency in accessing it. Some sort of a caching infrastructure will be required in order for your training to proceed efficiently. Seelam may have other thoughts on scale-out approaches for training data.

[Seelam:] Yes, absolutely, I agree 100%. Unfortunately, there is no silver bullet to address the data problem with large-scale training. We take a three-pronged approach. Predominantly, we recommend users put their data in object storage, and that becomes the source of where all the data lives. Many training jobs, especially training jobs that deal with text data, don’t tend to be huge in size because these are all characters, so we use object store as a source directly to read the data and feed the GPUs to train. So that’s one model of training, but that only works for relatively smaller data sets; they get cached once you access them the first time because you shard them quite nicely, so you don’t have to go back to the data source many times. There are other data sets where the data volume is larger. So, if you’re dealing with pictures, video, or these kinds of training domains, we adopt a two-pronged approach. In one scenario, we actually have a distributed cache mechanism where the end users have a copy of the data in the file system, and that becomes the source for AI training. In another scenario, we deployed that system with sufficient local storage and asked users to copy the data into that local storage to use it as a local cache. So, as the AI training is continuing, once the data is accessed, it’s actually cached on the local drive, and subsequent iterations of the data come from that cache. This is much bigger than the local memory. It’s about 12 terabytes of local cache storage with 1.5 terabytes of data. So, we could get to data sets that are in the 10-terabyte range per node just from the local storage. If they exceed that, then we go to the distributed cache. If the data sets are small enough, then we just use object storage. So, there are at least three different ways, depending on the use case and the model you are trying to train.

Q. In a fully sharded data parallel model, there are three communication calls when compared to DDP (distributed data parallel). Does that mean it needs about three times more bandwidth?

[Seelam:] Not necessarily three times more, but you will use the network a lot more than you would in DDP. In a DDP, or distributed data parallel, model you will not use the network at all in the forward pass. Whereas in an FSDP (fully sharded data parallel) model, you use the network both in the forward pass and in the backward pass. In that sense you use the network more, but at the same time, because you don’t have parts of the model within your system, you need to get the model from the other neighbors, and so that means you will be using more bandwidth. I cannot give you the 3x number; I haven’t seen the 3x, but it’s more than DDP for sure.

The SNIA CSTI has an active schedule of webinars to help educate on cloud technologies. Follow us on Twitter @sniacloud_com and sign up for the SNIA Matters Newsletter, so that you don’t miss any.
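To make the DDP-versus-FSDP communication point concrete, here is a hedged PyTorch sketch showing how the two wrappers are applied. The toy model, the process-group setup, and the assumption of a torchrun-style launch are placeholders, not part of the presenters' setup.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes a torchrun-style launch that sets RANK, WORLD_SIZE, etc., with GPUs present.
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# Toy model standing in for a real network.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 4096)
).cuda()

use_fsdp = True
if use_fsdp:
    # FSDP shards parameters, gradients, and optimizer state across ranks and
    # all-gathers shards in both the forward and backward pass, so it uses the
    # network in both directions (more traffic than DDP, much less memory per GPU).
    model = FSDP(model)
else:
    # DDP keeps a full replica on every rank and only all-reduces gradients
    # in the backward pass.
    model = DDP(model, device_ids=[local_rank])
```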
