Erin Farr

Nov 14, 2023

At our recent SNIA Cloud Storage Technologies Initiative (CSTI) webinar, "Why Distributed Edge Data is the Future of AI," our expert speakers, Rita Wouhaybi and Heiko Ludwig, explained what's new and different about edge data, highlighted use cases and phases of AI at the edge, covered federated learning, discussed privacy for edge AI, and provided an overview of the many other challenges and complexities created by increasingly large AI models and algorithms. It was a fascinating session. If you missed it, you can access it on-demand, along with a PDF of the slides, at the SNIA Educational Library. Our live audience asked several interesting questions. Here are answers from our presenters.

Q. With the rise of large language models (LLMs), what role will edge AI play?

A. LLMs are very good at predicting events based on previous data, often referred to as the next token in LLMs. Many edge use cases are also about predicting the next event, e.g., a machine going down, an outage, or a security breach on the network. One of the challenges of applying LLMs to these use cases is converting the data (tokens) into text with the right context (a small sketch of this idea follows the Q&A below).

Q. After you create an AI model, how often do you need to update it?

A. That depends heavily on the dataset itself, the use case KPIs, and the techniques used (e.g., the network backbone and architecture). Development used to involve a very long data collection cycle in order to capture outliers and rare events. We are moving away from this kind of development because of its cost and the long time required. Instead, most customers start with a few data points and iterate, updating their models more often. Such a strategy enables a faster return on investment, since you deploy a model as soon as it is good enough. Also, newer AI techniques such as unsupervised learning or selective annotation can enable some use cases to get a model that is self-learning, or at least self-adaptable.

Q. Deploying AI is costly. What use cases tend to be cost effective?

A. Just like any technology, prices drop as scale increases. We are at an inflection point where we will see more use cases become feasible to develop and deploy. But yes, many use cases might not have an ROI. Typically, we recommend starting with use cases that are business critical or have the potential to improve yield, quality, or both.

Q. Do you have any measurements of energy usage for edge AI? Is there an ecological argument for edge AI in addition to the others mentioned?

A. This is a very good question and top of mind for many in the industry. There is no data yet to support sustainability claims; however, running AI at the edge can provide more control and further refinement for making tradeoffs in relation to corporate goals, including sustainability. Of course, compute at the edge reduces data transfer and the environmental impact of those functions.
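To make the point about "converting the data (tokens) into text with the right context" concrete, here is a minimal, hedged sketch of serializing edge telemetry events into a prompt a language model could consume for next-event prediction. The event fields, device names, and prompt wording are illustrative assumptions, not anything shown in the webinar.

```python
# Hypothetical sketch: turning edge telemetry events into a text sequence
# so an LLM can treat the event history as tokens with context.
# The field names and prompt format are invented for illustration.

from dataclasses import dataclass
from typing import List

@dataclass
class EdgeEvent:
    timestamp: str
    device: str
    signal: str      # e.g. "vibration_high", "temp_spike", "link_down"
    value: float

def events_to_prompt(events: List[EdgeEvent]) -> str:
    """Serialize recent events as text for next-event prediction."""
    lines = [
        f"{e.timestamp} {e.device} reported {e.signal}={e.value:.2f}"
        for e in events
    ]
    return (
        "Given the following sequence of machine events, "
        "predict the most likely next event:\n"
        + "\n".join(lines)
        + "\nNext event:"
    )

# Example usage with made-up readings; the resulting prompt would be sent
# to whatever LLM endpoint a given deployment uses.
history = [
    EdgeEvent("2023-11-01T10:00", "press-7", "vibration_high", 0.82),
    EdgeEvent("2023-11-01T10:05", "press-7", "temp_spike", 74.5),
]
prompt = events_to_prompt(history)
```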


Addressing the Hidden Costs of AI

Erik Smith

Nov 9, 2023

The latest buzz around generative AI ignores the massive costs to run and power the technology. Understanding the sustainability and cost impacts of AI, and how to address them effectively, will be the topic of our next SNIA Networking Storage Forum (NSF) webinar, "Addressing the Hidden Costs of AI." On December 12, 2023, our SNIA experts will offer insights on the potentially hidden technical and infrastructure costs associated with generative AI. You'll also learn about best practices and potential solutions as they discuss:
  • Scalability considerations for generative AI in enterprises
  • Significant computational requirements and costs for Large Language Model inferencing
  • Fabric requirements and costs
  • Sustainability impacts due to increased power consumption, heat dissipation, and cooling implications
  • AI infrastructure savings: On-prem vs. Cloud
  • Practical steps to reduce impact, leveraging existing pre-trained models for specific market domains
Register today. Our presenters will be available to answer your questions.


Erin Farr

Sep 14, 2023

Confidential AI is a new collaborative platform for data and AI teams to work with sensitive data sets and run AI models in a confidential environment. It includes infrastructure, software, and workflow orchestration to create a secure, on-demand work environment that meets an organization's privacy requirements and complies with regulatory mandates. It's a topic the SNIA Cloud Storage Technologies Initiative (CSTI) covered in depth at our webinar, "The Rise in Confidential AI." At this webinar, our experts, Parviz Peiravi and Richard Searle, provided a deep and insightful look at how this dynamic technology works to ensure data protection and data privacy. Here are their answers to the questions from our webinar audience.

Q. Are businesses using Confidential AI today?

A. Absolutely. We have seen a big increase in adoption of Confidential AI, particularly in industries such as financial services, healthcare, and government, where Confidential AI is helping organizations enhance risk mitigation, including cybercrime prevention, anti-money laundering, fraud prevention, and more.

Q. With compute capabilities on the edge increasing, how do you see Trusted Execution Environments evolving?

A. One of the important things about Confidential Computing is that although it's a discrete privacy-enhancing technology, it's part of the underlying broader, distributed data center compute hardware. However, the edge is going to be increasingly important as we look ahead to things like 6G communication networks. We see a role for AI at the edge in terms of things like signal processing and data quality evaluation, particularly in situations where the data is being sourced from different endpoints.

Q. Can you elaborate on attestation within a Trusted Execution Environment (TEE)?

A. One of the critical things about Confidential Computing is the need for an attested Trusted Execution Environment. In order to have that reassurance of confidentiality and the isolation and integrity guarantees that we spoke about during the webinar, attestation is the foundational truth of Confidential Computing and is absolutely necessary. In every secure implementation of Confidential AI, attestation provides the assurance that you're working in that protected memory region, that data and software instructions can be secured in memory, and that the AI workload itself is shielded from the other elements of the computing system. If you're starting with hardware-based technology, then you have the utmost security, removing the majority of actors outside the boundary of your trust. However, this also creates a level of isolation that you might not want for an application that doesn't need this high level of security. You must balance utmost security with your application's appetite for risk. (A simplified sketch of an attestation check appears after this Q&A.)

Q. What is your favorite reference for implementing Confidential Computing that bypasses the OS, BIOS, and VMM (Virtual Machine Manager) and uses the root trust certificate?

A. It's important to know that there are different implementations of Trusted Execution Environments, and they are relevant to different purposes. For example, there are process-based TEEs that enable a very discrete definition of a TEE and provide the ability to write specific code and protect very sensitive information because of the isolation from things like the hypervisor and virtual machine manager. There are also technologies available now that have a virtualization basis and include a guest operating system within their trusted computing base; these provide greater flexibility in terms of implementation, so you might want to use them when you have a larger application or a more complex deployment. The Confidential Computing Consortium, which is part of The Linux Foundation, is also a good resource to keep up with Confidential AI guidance.

Q. Can you please give us a picture of the upcoming standards for strengthening security? Do you believe the European Union's AI Act (EU AI Act) is going in the right direction and that it will have a positive impact on the industry?

A. That's a good question. The draft EU AI Act was approved in June 2023 by the European Parliament, but the UN Security Council has also put out a call for international regulation in the same way that we have treaties and conventions. We think what we're going to see is different nation states taking discrete approaches. The UK has taken an open approach to AI regulation in order to stimulate innovation. The EU already has a very prescriptive data protection regulation method, and the EU AI Act takes a similar approach. It's quite prescriptive and designed to complement data privacy regulations that already exist. For a clear overview of the EU's groundbreaking AI legislation, refer to the EU AI Act summary. It breaks down the key obligations, compliance responsibilities, and the broader impact on various AI applications.

Q. Where do you think some of the biggest data privacy issues are within generative AI?

A. There's quite a lot of debate already about how these massive generative AI systems have used data scraped from the web, whether things like copyright provisions have been acknowledged, and whether data privacy in imagery from social media has been respected. At an international level, it's going to be interesting to see whether people can agree on a cohesive framework to regulate AI and whether different countries can agree. There's also the issue of the time required to develop legislation being superseded by technological developments. We saw how disruptive ChatGPT was last year. There are also ethical considerations around this topic, which the SNIA CSTI covered in the webinar "The Ethics of Artificial Intelligence."

Q. Are you optimistic that regulators can come to an agreement on generative AI?

A. In the last four or five years, regulators have become more open to working with financial institutions to better understand the impact of adopting new technologies such as AI and generative AI. This collaboration between regulators and the financial sector is creating momentum. Regulators such as the Monetary Authority of Singapore are leading this strategy, actively working with vendors to understand how the technology applies within financial services and how to guide the rest of the banking industry.
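To make the attestation discussion above a bit more concrete, here is a minimal, hedged sketch of the control flow: verify a TEE's attestation report against an expected workload measurement before releasing a data decryption key. Real attestation relies on hardware-rooted certificate chains and vendor-specific quote formats; the report fields, the symmetric HMAC check, and the release_key helper below are simplifications invented for illustration only.

```python
# Illustrative only: real Confidential Computing attestation verifies a
# hardware-signed quote against a vendor certificate chain. The report
# fields and the symmetric-key signature check are simplified stand-ins.

import hashlib
import hmac
from typing import Optional

EXPECTED_MEASUREMENT = "a3f1c0de..."         # known-good hash of the approved AI workload
TRUSTED_VERIFIER_KEY = b"demo-verifier-key"  # stand-in for a hardware root of trust

def verify_attestation(report: dict) -> bool:
    """Check the report is authentic and matches the approved workload."""
    payload = (report["measurement"] + report["nonce"]).encode()
    expected_sig = hmac.new(TRUSTED_VERIFIER_KEY, payload, hashlib.sha256).hexdigest()
    signature_ok = hmac.compare_digest(expected_sig, report["signature"])
    measurement_ok = report["measurement"] == EXPECTED_MEASUREMENT
    return signature_ok and measurement_ok

def release_key(report: dict, dataset_key: bytes) -> Optional[bytes]:
    """Hand the data decryption key to the enclave only after attestation succeeds."""
    if verify_attestation(report):
        return dataset_key   # the attested TEE may now decrypt the sensitive data
    return None              # untrusted environment: withhold the key
```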


How Edge Data is Impacting AI

Erin Farr

Sep 6, 2023

AI is disrupting many domains and industries, and as it does, AI models and algorithms are becoming increasingly large and complex. This complexity is driven by the proliferation in size and diversity of localized data everywhere, which creates the need for a unified data fabric and/or federated learning. It could be argued that whoever wins the data race will win the AI race, a race inherently built on two premises: 1) data is available in a central location so AI has full access to it, and 2) compute is centralized and abundant.

The impact of edge AI is the topic of our next SNIA Cloud Storage Technologies Initiative (CSTI) live webinar, "Why Distributed Edge Data is the Future of AI," on October 3, 2023. Centralized (or cloud) AI is a single superpower and super expert, whereas edge AI is a community of many smart wizards whose cumulative knowledge can outweigh a central superpower. In this webinar, our SNIA experts will discuss:
  • The value and use cases of distributed edge AI
  • How data fabric on the edge differs from the cloud and its impact on AI
  • Edge device data privacy trade-offs and distributed agency trends
  • Privacy mechanisms for federated learning, inference, and analytics
  • How interoperability between cloud and edge AI can happen
Register here to join us on October 3rd. Our experts will be ready to answer your questions.


Training Deep Learning Models Q&A

Erin Farr

May 19, 2023

The estimated impact of deep learning (DL) across all industries cannot be overstated. In fact, analysts predict deep learning will account for the majority of cloud workloads, and training of deep learning models will represent the majority of server applications in the next few years. It's the topic the SNIA Cloud Storage Technologies Initiative (CSTI) discussed at our webinar "Training Deep Learning Models in the Cloud." If you missed the live event, it's available on-demand at the SNIA Educational Library, where you can also download the presentation slides. The audience asked our expert presenters, Milind Pandit from Habana Labs (an Intel company) and Seetharami Seelam from IBM, several interesting questions. Here are their answers:

Q. Where do you think most of the AI will run, especially training? Will it be in the public cloud, on-premises, or both?

[Milind]: It's probably going to be a mix. There are advantages to using the public cloud, especially because it's pay as you go. So, when experimenting with new models, new innovations, and new uses of AI, and when scaling deployments, it makes a lot of sense. But there are still a lot of data privacy concerns. There are increasing numbers of regulations regarding where data needs to reside physically and in which geographies. Because of that, many organizations are deciding to build out their own data centers, and once they have large-scale training or inference successfully underway, they often find it cost effective to migrate their public cloud deployment into a data center where they can control the cost and other aspects of data management.

[Seelam]: I concur with Milind. We are seeing a pattern of dual approaches. There are some small companies that don't have the capital, expertise, or teams necessary to acquire GPU-based servers and deploy them. They are increasingly adopting public cloud. We are seeing some decent-sized companies adopting this same approach as well. Keep in mind these GPU servers tend to be very power hungry, so you need the right floor plan, power, cooling, and so forth. So, public cloud definitely gives you easy access and lets you pay for only what you consume. We are also seeing trends where certain organizations have constraints that restrict moving certain data outside their walls. In those scenarios, we are seeing customers deploy GPU systems on-premises. I don't think it's going to be one or the other. It is going to be a combination of both, and adopting more of a common platform technology will help unify their usage model in public cloud and on-premises.

Q. What is GDR? You mentioned using it with RoCE.

[Seelam]: GDR stands for GPUDirect RDMA. There are several ways a GPU on one node can communicate with a GPU on another node; here are at least three. First, the GPU can use TCP, where GPU data is copied back into the CPU, which orchestrates the communication to the CPU and GPU on another node. That obviously adds a lot of latency going through the whole TCP protocol. Second, RoCEv2 or RDMA, where CPUs, FPGAs, and/or GPUs actually talk to each other through industry-standard RDMA channels, so you send and receive data without the added latency of traditional networking software layers. Third, GDR, where a GPU on one node can talk to a GPU on another node directly. This is done through network interfaces where the GPUs are essentially talking to each other, again bypassing traditional networking software layers.

Q. When you are talking about RoCE, do you mean RoCEv2?

[Seelam]: That is correct, I'm talking only about RoCEv2. Thank you for the clarification.

Q. Can you comment on storage needs for DL training, and have you considered the use of scale-out cloud storage services for deep learning training? If so, what are the challenges and issues?

[Milind]: The storage needs are 1) massive and 2) based on the kind of training that you're doing (data parallel versus model parallel). With different optimizations, you will need parts of your data to be local in many circumstances. It's not always possible to do efficient training when data is physically remote and there's a large latency in accessing it. Some sort of caching infrastructure will be required in order for your training to proceed efficiently. Seelam may have other thoughts on scale-out approaches for training data.

[Seelam]: Yes, absolutely, I agree 100%. Unfortunately, there is no silver bullet to address the data problem with large-scale training. We take a three-pronged approach. Predominantly, we recommend users put their data in object storage, and that becomes the source of where all the data lives. Many training jobs, especially those that deal with text data, don't tend to be huge in size because these are all characters, so we use the object store as a source directly to read the data and feed the GPUs to train. That's one model of training, but it only works for relatively smaller data sets. They get cached once you access them the first time because you shard them quite nicely, so you don't have to go back to the data source many times. There are other data sets where the data volume is larger. So, if you're dealing with pictures, video, or these kinds of training domains, we adopt a two-pronged approach. In one scenario, we have a distributed cache mechanism where the end users have a copy of the data in the file system, and that becomes the source for AI training. In another scenario, we deployed the system with sufficient local storage and asked users to copy the data into that local storage to use it as a local cache. So, as the AI training continues, once the data is accessed it's cached on the local drive, and subsequent iterations of the data come from that cache. This is much bigger than the local memory; it's about 12 terabytes of local cache storage with the 1.5 terabytes of data. So, we could get to data sets that are in the 10-terabyte range per node just from the local storage. If they exceed that, then we go to the distributed cache. If the data sets are small enough, then we just use object storage. So, there are at least three different approaches, depending on the use case and the model you are trying to train.

Q. In a fully sharded data parallel model, there are three communication calls when compared to DDP (distributed data parallel). Does that mean it needs about three times more bandwidth?

[Seelam]: Not necessarily three times more, but you will use the network a lot more than you would in DDP. In a DDP, or distributed data parallel, model you will not use the network at all in the forward pass. Whereas in an FSDP (fully sharded data parallel) model, you use the network in both the forward pass and the backward pass. In that sense you use the network more, but at the same time, because you don't have all parts of the model within your system, you need to get the model shards from your neighbors, and that means you will be using more bandwidth. I cannot give you the 3x number; I haven't seen the 3x, but it's more than DDP for sure. (A short sketch contrasting DDP and FSDP wrapping follows this Q&A.)

The SNIA CSTI has an active schedule of webinars to help educate on cloud technologies. Follow us on Twitter @sniacloud_com and sign up for the SNIA Matters Newsletter, so that you don't miss any.
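For readers who want to see the DDP versus FSDP distinction in code, here is a minimal, hedged sketch of wrapping a model both ways in PyTorch. It assumes the distributed process group has already been initialized (for example via torchrun) and that each rank owns one GPU; the model and the hard-coded local_rank are throwaway placeholders, not anything from the webinar.

```python
# Minimal sketch: DDP keeps a full parameter replica per rank and communicates
# only gradients in the backward pass, while FSDP shards parameters and must
# all-gather them in BOTH forward and backward passes (plus reduce-scatter
# gradients), which is why FSDP uses the network more than DDP.
# Assumes torch.distributed is already initialized (e.g. launched with torchrun).

import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def build_model() -> nn.Module:
    # Throwaway example model; a real workload would be a large transformer.
    return nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

local_rank = 0  # placeholder; normally provided by the distributed launcher

# Option 1: distributed data parallel (full replica on every rank).
ddp_model = DDP(build_model().cuda(local_rank), device_ids=[local_rank])

# Option 2: fully sharded data parallel (parameters sharded across ranks).
fsdp_model = FSDP(build_model().cuda(local_rank), device_id=local_rank)
```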


Michael Hoard

Mar 9, 2023

A digital twin (DT) is a virtual representation of an object, system, or process that spans its lifecycle, is updated from real-time data, and uses simulation, machine learning, and reasoning to help decision-making. Digital twins can be used to help answer what-if AI-analytics questions, yield insights on business objectives, and make recommendations on how to control or improve outcomes. It's a fascinating technology that the SNIA Cloud Storage Technologies Initiative (CSTI) discussed at our live webcast "Journey to the Center of Massive Data: Digital Twins." If you missed the presentation, you can watch it on-demand and access a PDF of the slides at the SNIA Educational Library. Our audience asked several interesting questions, which are answered here in this blog.

Q. Will a digital twin make the physical twin more or less secure?

A. It depends on the implementation. If DTs are developed with security in mind, a DT can help augment the security of the physical twin. For example, if the physical and digital twins are connected via an encrypted tunnel that carries all the control, management, and configuration traffic, then a firmware update of a simple sensor or actuator can include multi-factor authentication of the admin or strong authentication of the control application via features running in the DT, which augments the constrained environment of the physical twin. However, because DTs are usually hosted on systems that are connected to the internet, ill-protected servers could expose a physical twin to a remote intruder. Therefore, security must be designed in from the start.

Q. What are some of the challenges of deploying digital twins?

A. Without AI frameworks and real-time interconnected pipelines in place, a digital twin's value is limited.

Q. How do you see digital twins evolving in the future?

A. Here is a series of evolutionary steps:
  • From discrete DTs (for both pre- and post-production), to composite DTs (e.g., an assembly line or transportation system), to organization DTs (e.g., supply chains, political parties).
  • From pre-production simulation, to operational dashboards of current state with human decisions and control, to autonomous limited control functions that ultimately eliminate the need for individual device manager software separate from the DT.
  • In parallel, 2D DT content displayed on smartphones, tablets, and PCs will move to 3D-rendered content on the same devices, and then selectively to wearables (AR/VR) as the wearable market matures, leading to visualized live data that can be manipulated by voice and gesture.
  • Over the next 10 years, I believe DTs will become the de facto graphical user interface for machines, buildings, etc., in addition to the GUI for consumer and commercial process management.
Q. Can you expand on your example of data ingestion at the edge? Are you referring to data capture for transfer to a data center, or actual edge data capture and processing for the digital twin? If the latter, what use cases might benefit?

A. Where DTs are hosted and where AI processes are computed, like inference or training on time-series data, don't have to be the same server or even the same location. Nevertheless, the expected time-to-action and time-to-insight, plus how much data needs to be processed and the cost of moving that data, will dictate where digital twins are placed and how they are integrated within the control path and data path. For example, a high-speed robotic arm that must stop if a human puts their hand in the wrong space will likely have an attached or integrated smart camera capable of identifying (inferring) a foreign object. The arm will stop itself, and an associated DT will receive notice of the event after the fact. A digital twin of the entire assembly line may learn of the event from the robotic arm's DT and inject control commands to the rest of the assembly line to gracefully slow down or stop. Both the DT of the discrete robotic arm and the composite DT of the entire assembly line are likely executing on compute infrastructure on the premises in order to react quickly. By contrast, the "what if" capabilities of both types of DTs may run in the cloud or a local data center, as the optional simulation capabilities of the DT are not subject to real or near-real-time round-trip time-to-action constraints and may require more compute and storage capacity than is locally available. The point is that the "edge" is a key part of the calculus for determining where DTs operate. Time-to-actionable-insight, the cost of data movement, governance restrictions on data movement, the availability and cost of compute and storage infrastructure, and access to data scientists, IT professionals, and AI frameworks are increasingly driving more and more automation processing to the "edge," and it's natural for DTs to follow the data. (A simplified sketch of this event flow appears after this Q&A.)

Q. Isn't Google Maps also an example of a digital twin, especially when we use it to drive based on the directions we input and the guidance it gives us?

A. Good question! It is a digital representation of a physical process (a route to a destination) that ingests data from sensors (other vehicles whose operators are using Google Maps driving instructions along some portion of the route). So, yes. DTs are digital representations of physical things, processes, or organizations that share data. But Google Maps is an interesting example of a self-organizing composite DT, meaning lots of users acting as both sensors (aka discrete DTs) and selective digital viewers of the behavior of many physical cars moving through a shared space.

Q. You brought up an interesting subject around regulations and compliance. Considering that some constructions would require approvals from regulatory authorities, how would a digital twin (especially when we have pics that re-construct/re-model soft copies of the blueprints based on modifications identified through the 14-1500 pics) comply with regulatory requirements?

A. Some safety regulations in various regions of the world apply to processes, e.g., worker safety in factories. Time to certify is very slow, as a lot of documentation is compiled and analyzed by humans. DTs could use live data to accelerate documentation, simulation, or replays of real data within digital twins, and could potentially enable self-certification of new or reconfigured processes, assuming that regulatory bodies evolve.

Q. A digital twin captures the state of its partner in real time. What happens to aging data? Do we need to store data indefinitely?

A. Data retention can shrink as DTs and AI frameworks evolve to perform ongoing distributed AI model refreshes. As AI models refresh more dynamically, the increasingly rare anomalous events become the gold used for the next model refresh. In short, DTs should help reduce how much data is retained. Part of what a DT can be built to do is filter out compliance data for long-term archival.

Q. Don't we run a high risk when the model and reality do not align? What if we trust the twin too much?

A. Your question targets more general challenges of AI. There is a small but growing cottage industry evolving in parallel with DTs and AI. Analysts refer to it as Explainable AI, whose intent is to explain to mere mortals how and why an AI model arrives at the predictions and decisions it makes. Your concern is valid, and for this reason we should expect that humans will likely remain in the control loop, wherein the DT doesn't act autonomically for non-real-time control functions.
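The robotic-arm example above describes an event flow between a discrete digital twin and a composite one. Here is a minimal, hedged sketch of that flow in code; the class names, event kinds, and control commands are invented for illustration and are not part of any DT product or standard.

```python
# Hypothetical sketch of the event flow described above: a discrete digital
# twin for a robotic arm reports a safety stop to a composite assembly-line
# twin, which then issues a graceful slow-down to the other stations.

from dataclasses import dataclass, field
from typing import List

@dataclass
class TwinEvent:
    source: str
    kind: str     # e.g. "safety_stop", "temperature_alarm"
    detail: str

@dataclass
class AssemblyLineTwin:
    stations: List[str]
    log: List[TwinEvent] = field(default_factory=list)

    def on_event(self, event: TwinEvent) -> None:
        self.log.append(event)
        if event.kind == "safety_stop":
            # Inject control commands to the rest of the line.
            for station in self.stations:
                if station != event.source:
                    self.send_control(station, "slow_down")

    def send_control(self, station: str, command: str) -> None:
        print(f"[line-twin] {command} -> {station}")

class RoboticArmTwin:
    def __init__(self, name: str, line: AssemblyLineTwin):
        self.name = name
        self.line = line

    def report(self, kind: str, detail: str) -> None:
        # The physical arm has already stopped itself; its twin records the
        # event and notifies the composite twin after the fact.
        self.line.on_event(TwinEvent(self.name, kind, detail))

# Example: the arm's twin reports the stop, and the line twin reacts.
line = AssemblyLineTwin(stations=["arm-1", "conveyor-2", "welder-3"])
arm = RoboticArmTwin("arm-1", line)
arm.report("safety_stop", "foreign object detected by smart camera")
```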
