Ethernet in the Age of AI Q&A

Raguraman Sundaram

Jan 11, 2025


AI is having a transformative impact on networking. It's a topic that the SNIA Data, Storage & Networking Community covered in our live webinar, "Ethernet in the Age of AI: Adapting to New Networking Challenges." The presentation explored various use cases of AI, the nature of traffic for different workloads, the network impact of these workloads, and how Ethernet is evolving to meet these demands. The webinar audience was highly engaged and asked many interesting questions. Here are the answers to them all.

Q. What is the biggest challenge when designing and operating an AI scale-out fabric?

A. The biggest challenge in designing and operating an AI scale-out fabric is achieving low latency and high bandwidth at scale. AI workloads, like training large neural networks, demand rapid, synchronized data transfers between thousands of GPUs or accelerators. This requires specialized interconnects, such as RDMA, InfiniBand, or NVLink, and optimized topologies like fat-tree or dragonfly to minimize communication delays and bottlenecks. Balancing scalability with performance is critical; as the system grows, maintaining consistent throughput and minimizing congestion becomes increasingly complex. Additionally, ensuring fault tolerance, power efficiency, and compatibility with rapidly evolving AI workloads adds to the operational challenges. Unlike standard data center networks, AI fabrics handle intensive east-west traffic patterns that require purpose-built infrastructure. Effective software integration for scheduling and load balancing is equally essential. The need to align performance, cost, and reliability makes designing and managing an AI scale-out fabric a multifaceted and demanding task.

Q. What are the most common misconceptions about AI scale-out fabrics?

A. The most common misconception about AI scale-out fabrics is that they are the same as standard data center networks. In reality, AI fabrics are purpose-built for high-bandwidth, low-latency, east-west communication between GPUs, essential for workloads like large language model (LLM) training and inference. Many believe increasing bandwidth alone solves performance issues, but factors like latency, congestion control, and topology optimization (e.g., fat-tree, dragonfly) are equally critical. Another myth is that scaling out is straightforward: adding GPUs without addressing communication overhead or load balancing often leads to bottlenecks. Similarly, people assume all AI workloads can use a single fabric, overlooking differences between training and inference needs. AI fabrics also aren't plug-and-play; they require extensive tuning of hardware and software for optimal performance.

Q. How do you see the future of AI scale-out fabrics evolving over the next few years?

A. AI scale-out fabrics are going to include more and more Ethernet. Ethernet-based fabrics, enhanced with technologies like RoCE (RDMA over Converged Ethernet), will continue to evolve to deliver the low latency and high bandwidth required for large-scale AI applications, particularly in training and inference of LLMs. Emerging Ethernet standards like 800GbE and beyond will provide the throughput needed for dense, GPU-intensive workloads. Advanced congestion management techniques, such as DCQCN, multipathing, and packet trimming, will improve performance in Ethernet-based fabrics by reducing packet loss and latency.
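To make the congestion-management idea concrete, here is a minimal Python sketch of the general pattern these schemes follow: congestion-marked traffic triggers a multiplicative rate cut at the sender, and the rate recovers additively afterwards. This is a toy illustration for intuition only; the constants are arbitrary and it is not the actual DCQCN algorithm.

```python
# Toy illustration of ECN-driven sender rate control (DCQCN-style idea):
# a congestion notification causes a multiplicative decrease, otherwise the
# sender recovers additively toward line rate. Parameters are arbitrary.

LINE_RATE_GBPS = 400.0
ADDITIVE_STEP_GBPS = 5.0
BETA = 0.5  # fraction of the current rate kept after a congestion notification

def next_rate(current_gbps: float, congestion_notified: bool) -> float:
    if congestion_notified:
        return max(current_gbps * BETA, 1.0)                      # multiplicative decrease
    return min(current_gbps + ADDITIVE_STEP_GBPS, LINE_RATE_GBPS)  # additive recovery

rate = LINE_RATE_GBPS
# Pretend the switch marks packets on steps 3 and 4 (queue building up).
for step, marked in enumerate([False, False, False, True, True] + [False] * 10):
    rate = next_rate(rate, marked)
    print(f"step {step:2d}: send at {rate:6.1f} Gb/s")
```

In a real fabric the marking is done by switches based on queue depth, and the reaction algorithm, its parameters, and its recovery phases are considerably more sophisticated than this sketch.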
Ethernet's cost-effectiveness, ubiquity, and compatibility with hybrid environments will make it a key enabler for AI scale-out fabrics in both cloud and on-premises deployments. The convergence of CXL over Ethernet may eventually enable memory pooling and shared memory access across components within scale-up systems, supporting the increasing memory demands of LLMs. The need for Ethernet in scale-up domains is going to rise as well.

Q. What are the best practices for staying updated with the latest trends and developments? Can you recommend any additional resources or readings for further learning?

A. There are several papers and research articles on the internet; some of them are listed in the webinar slide deck. Following the Ultra Ethernet Consortium and SNIA are the best ways to learn about networking-related updates.

Q. Is NVLink a standard?

A. No, NVLink is not an open standard. It is a proprietary interconnect technology developed by NVIDIA. It is specifically designed to enable high-speed, low-latency communication between NVIDIA GPUs and, in some cases, between GPUs and CPUs in NVIDIA systems.

Q. What's the difference between collections and multicast?

A. It is tempting to think that collections and multicast are similar, for example with collectives like Broadcast. But they are in principle different and address different requirements. Collections are high-level operations for distributed computing, while multicast is a low-level network mechanism for efficient data transmission.

Q. What's the supporting library/tool/kernel module for enabling Node1 GPU1 -> Node2 GPU2 -> GPU fabric -> Node2 GPU2? It seems to require some host-level knowledge, not TOR-level.

A. Yes, topology discovery and the optimal path for routing GPU messages from the source depend on the host software and are not TOR dependent. GPU applications end up using the MPI APIs for communication between the nodes in the cluster. These MPI APIs are made aware of the GPU topologies by the respective extension libraries provided by the GPU vendor. For instance, NVIDIA's NCCL and AMD's RCCL libraries provide an option to specify a static GPU topology through an XML file (via NCCL_TOPO_FILE or RCCL_TOPO_FILE) that can be loaded when initializing the stack. The GPU-aware MPI library extensions from NVIDIA/AMD then leverage this topology information to send messages to the appropriate GPU. An example NCCL topology is here: https://github.com/nebius/nccl-topology/blob/main/nccl-topo-h100-v1.xml. There are utilities such as nvidia-smi/rocm-smi that are used in the initial discovery. Automatic topology detection and calculation of optimal paths for MPI can also be made available as part of the GPU vendor's CCL library. For instance, NCCL provides such functionality by reading /sys on the host and building the PCI topology of GPUs and NICs.

The SNIA Data, Storage & Networking Community provides vendor-neutral education on a wide range of topics. Follow us on LinkedIn and @SNIA for upcoming webinars, articles, and content.
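For readers who want to experiment with the topology-file mechanism described in the last answer, here is a minimal, hedged sketch of a PyTorch/NCCL job that loads a static topology before the communicator is created. The file path is a hypothetical example; RCCL_TOPO_FILE plays the analogous role on AMD systems.

```python
# Minimal sketch (assumptions noted in comments): point NCCL at a static
# GPU/NIC topology file before the communicator is created.
import os
import torch
import torch.distributed as dist

# Hypothetical path; the XML follows NCCL's topology schema, e.g. the
# example file linked above (nccl-topo-h100-v1.xml).
os.environ["NCCL_TOPO_FILE"] = "/etc/nccl/topo-h100.xml"
os.environ.setdefault("NCCL_DEBUG", "INFO")  # log the topology NCCL builds

def main() -> None:
    # Rank and world size are normally injected by the launcher (torchrun, MPI).
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # A tiny all-reduce exercises the paths NCCL selected from the topology.
    t = torch.ones(1, device="cuda")
    dist.all_reduce(t)
    if rank == 0:
        print("all-reduce result:", t.item())
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with a normal launcher such as torchrun, NCCL would then take the declared PCIe/NVLink/NIC paths into account when it builds its rings and trees.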


Hidden Costs of AI Q&A

Erik Smith

Mar 14, 2024

At our recent SNIA Networking Storage Forum webinar, "Addressing the Hidden Costs of AI," our expert team explored the impacts of AI, including sustainability and areas where there are potentially hidden technical and infrastructure costs. If you missed the live event, you can watch it on-demand in the SNIA Educational Library. Questions from the audience ranged from training Large Language Models to fundamental infrastructure changes from AI and more. Here are answers to the audience's questions from our presenters.

Q: Do you have an idea of where the best tradeoff is between high I/O speed cost and GPU working cost? Is it always best to spend the maximum and get the highest I/O speed possible?

A: It depends on what you are trying to do. If you are training a Large Language Model (LLM), then you'll have a large collection of GPUs communicating with one another regularly (e.g., all-reduce) and doing so at throughput rates of up to 900 GB/s per GPU! For this kind of use case, it makes sense to use the fastest network option available. Any money saved by using a cheaper, slightly less performant transport will be more than offset by the cost of GPUs that sit idle while waiting for data. If you are more interested in fine-tuning an existing model or using Retrieval Augmented Generation (RAG), then you won't need quite as much network bandwidth and can choose a more economical connectivity option. It's worth noting that a group of companies has come together to work on the next generation of networking that will be well suited for use in HPC and AI environments. This group, the Ultra Ethernet Consortium (UEC), has agreed to collaborate on an open standard and has wide industry backing. This should allow even large clusters (1,000+ nodes) to utilize a common fabric for all the network needs of a cluster.

Q: We (all industries) are trying to use AI for everything. Is that cost effective? Does it cost fractions of a penny to answer a user question, or is there a high cost that is being hidden or eaten by someone now because the industry is so new?

A: It does not make sense to try to use AI/ML to solve every problem. AI/ML should only be used when a more traditional, algorithmic technique cannot easily be used to solve a problem (and there are plenty of these). Generative AI aside, one example where AI has historically provided an enormous benefit for IT practitioners is multivariate anomaly detection. These models can learn what normal looks like for a given set of telemetry streams and then alert the user when something unexpected happens. A traditional approach (e.g., writing source code for an anomaly detector) would be cost and time prohibitive and probably not be anywhere near as good at detecting anomalies.

Q: Can you discuss typical data access patterns for model training or tuning (sequential/random, block sizes, repeated access, etc.)?

A: There is no simple answer, as the access patterns can vary from one type of training to the next. Assuming you'd like a better answer than that, I would suggest starting to look into two resources:
  1. Meta’s OCP Presentation: “Meta’s evolution of network for AI” includes a ton of great information about AI’s impact on the network.
  2. Blocks and Files article: “MLCommons publishes storage benchmark for AI” includes a table that provides an overview of benchmark results for one set of tests.
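Returning to the first question above (I/O speed versus GPU cost), a rough back-of-envelope calculation shows why idle accelerators dominate the economics. Every figure below is an assumption chosen for illustration, not data from the webinar.

```python
# Back-of-envelope: what does GPU idle time cost while waiting on I/O?
# All numbers are assumptions for illustration only.

gpu_cost_per_hour = 3.00        # assumed amortized cost of one GPU, $/hour
gpus = 256                      # assumed cluster size
idle_fraction_slow_net = 0.30   # GPUs stalled on data 30% of the time
idle_fraction_fast_net = 0.05   # stalls with a faster fabric

def idle_cost_per_day(idle_fraction: float) -> float:
    return gpus * gpu_cost_per_hour * 24 * idle_fraction

wasted_slow = idle_cost_per_day(idle_fraction_slow_net)
wasted_fast = idle_cost_per_day(idle_fraction_fast_net)
print(f"Idle-GPU cost/day, slower fabric: ${wasted_slow:,.0f}")
print(f"Idle-GPU cost/day, faster fabric: ${wasted_fast:,.0f}")
print(f"Daily savings available to spend on networking: ${wasted_slow - wasted_fast:,.0f}")
```

Even with conservative assumptions, the daily cost of stalled GPUs quickly exceeds the incremental cost of a faster fabric, which is the intuition behind the answer above.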
Q: Will this video be available after the talk? I would like to forward it to my co-workers. Great info.

A: Yes. You can access the video and a PDF of the presentation slides here.

Q: Does this mean we're moving to fewer updates, or a write-once (or write-infrequently) read-mostly storage model? I'm excluding dynamic data from end-user inference requests.

A: For the active training and fine-tuning phase of an AI model, the data patterns are very read heavy. There is quite a lot of work done before a training or fine-tuning job begins that is much more balanced between read and write. This is called the "data preparation" phase of an AI pipeline. Data prep takes existing data from a variety of sources (an in-house data lake, a dataset from a public repo, or a database) and performs data manipulation tasks to accomplish data labeling and formatting at a minimum. So, tuning for just reads may not be optimal.

Q: Fibre Channel seems to have a lot of the characteristics required for the fabric. Could a Fibre Channel fabric over NVMe be utilized to handle the data ingestion for the AI component on dedicated adapters for storage (disaggregated storage)?

A: Fibre Channel is not a great fit for AI use cases for a few reasons:
  • With AI, data is typically accessed as either Files or Objects, not Blocks, and FC is primarily used to access block storage.
  • If you wanted to use FC in place of IB (for GPU to GPU traffic) you’d need something like an FC-RDMA to make FC suitable.
  • All of that said, FC currently maxes out at 128GFC and there are two reasons why this matters:
    1. AI-optimized storage starts at 200 Gbps and, based on some end-user feedback, 400 Gbps is already not fast enough.
    2. GPU-to-GPU traffic can require up to 900 GB/s (7,200 Gbps) of throughput per GPU; that's about 56 128GFC interfaces per GPU (see the arithmetic sketch after this list).
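A quick check of the arithmetic in item 2, using the nominal rates quoted above (usable rates will be somewhat lower):

```python
# Nominal-rate arithmetic for the figures above (illustrative only).
gpu_to_gpu_GBps = 900                  # per-GPU traffic cited above, GB/s
gpu_to_gpu_gbps = gpu_to_gpu_GBps * 8  # = 7,200 Gb/s

gfc_128_gbps = 128                     # nominal 128GFC rate used in the text
links_needed = gpu_to_gpu_gbps / gfc_128_gbps
print(f"{gpu_to_gpu_gbps} Gb/s / {gfc_128_gbps} Gb/s ≈ {links_needed:.0f} 128GFC links per GPU")
```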
Q: Do you see something like GPUDirect Storage from NVIDIA becoming the standard? So does this mean NVMe will win (over FC or TCP)? Will other AI chip providers have to adopt their own GPUDirect-like protocol?

A: It's too early to say whether GPUDirect Storage will become a de facto standard or if alternate approaches (e.g., pNFS) will be able to satisfy the needs of most environments. The answer is likely to be "both."

Q: You've mentioned demand for higher throughput for training, and lower latency for inference. Is there a demand for low-cost, high-capacity, archive-tier storage?

A: Not specifically for AI. Depending on what you are doing, training and inference can be latency or throughput sensitive (sometimes both). Training an LLM (which most users will never actually attempt to do) requires massive throughput from storage for reads and writes, literally the faster the better when loading data into the GPUs or when the GPUs are saving checkpoints. An inference workload wouldn't require the same throughput as training, but to the extent that it needs to access storage, it would certainly benefit from low latency. If you are trying to optimize AI storage for anything but performance (e.g., cost), you are probably going to be disappointed with the overall performance of the system.

Q: What are the presenters' views about the industry trend for where to run workloads or train a model? Is it in cloud datacenters like AWS or GCP, or on-prem?

A: It truly depends on what you are doing. If you want to experiment with AI (e.g., an AI version of a "Hello World" program), or even something a bit more involved, there are lots of options that allow you to use the cloud economically. Check out this collection of colab notebooks for an example and give it a try for yourself. Once you get beyond simple projects, you'll find that using cloud-based services can become prohibitively expensive, and you'll quickly want to start running your training jobs on-prem. The downside is the need to manage the infrastructure elements yourself, and it assumes you can even get the right GPUs, although there are reports that supply issues are easing in this space. The bottom line is that whether to run on-prem or in the cloud still comes down to one question: can you realistically get the same ease of use and freedom from hardware maintenance from your own infrastructure as you could from a CSP? Sometimes the answer is yes.

Q: Does an AI accelerator in a PC (recently advertised for new CPUs) have any impact/benefit when using large public AI models?

A: AI accelerators in PCs will be a boon for all of us, as they will enable inference at the edge. They will also allow exploration and experimentation on your local system for building your own AI work. You will, however, want to focus on small or mini models at this time. Without large amounts of dedicated GPU memory to help speed things up, only the small models will run well on your local PC. That being said, we will continue to see improvements in this area, and PCs are a great starting point for AI projects.

Q: Fundamentally, is AI radically changing what is required from storage? Or is it simply accelerating some of the existing trends of reducing power, higher-density SSDs, and pushing faster on the trends in computational storage, new NVMe transport modes (such as RDMA), and ever more file system optimizations?
A: From the point of view of a typical enterprise storage deployment (e.g., block storage being accessed over an FC SAN), AI storage is completely different. Storage is accessed as either files or objects, not as blocks, and the performance requirements already exceed the maximum speeds that FC can deliver today (i.e., 128GFC). This means most AI storage uses either Ethernet or IB as a transport. Raw performance seems to be the primary driver in this space right now, rather than reducing power consumption or increasing density. You can expect protocols such as GPUDirect and pNFS to become increasingly important to meet performance targets.

Q: What are the innovations in HDDs relative to AI workloads? This was mentioned in the SSD + HDD slide.

A: The point of the SSD + HDD slide was that the introduction of SSDs:
  1. dramatically improved overall storage system efficiency, leading to a large performance boost. That boost increased the amount of data a single storage port could transmit onto a SAN, which in turn made it much more important to monitor for congestion and congestion spreading.
  2. didn’t completely displace the need for HDDs, just as GPUs won’t replace the need for CPUs. They provide different functions and excel at different types of jobs.
Q: What is the difference between (1) Peak Inference, (2) Mainstream Inference, (3) Baseline Inference, and (4) Endpoint Inference, specifically from a cost perspective?

A: This question was answered live during the webinar (see timestamp 44:27); the following is a summary of the response. Endpoint inference is inference that happens on client devices (e.g., laptops, smartphones), where much smaller models, optimized for the very constrained power envelope of these devices, are used. Peak inference can be thought of as something like ChatGPT or Bing's AI chatbot, where you need large, specialized infrastructure (e.g., GPUs, specialized AI hardware accelerators). Mainstream and baseline inference are somewhere in between, where you're using much smaller or specialized models. For example, you could have a Mistral 7-billion-parameter model which you have fine-tuned for your enterprise use case of document summarization or finding insights in a sales pipeline; these use cases can employ much smaller models, and hence the requirements can vary. In terms of cost, deploying these models for edge inference would be low compared to peak inference like ChatGPT, which would be much higher. In terms of infrastructure, some baseline and mainstream inference models can be served with just a CPU, with a CPU plus a GPU, with a CPU plus a few GPUs, or with a CPU plus a few AI accelerators. CPUs available today do have built-in AI accelerators, which can provide a cost-optimized solution for baseline and mainstream inference; this will be the typical scenario in many enterprise environments.

Q: You said utilization of network and hardware is changing significantly, but compared to what? Traditional enterprise workloads or HPC workloads?

A: AI workloads will drive network utilization unlike anything the enterprise has ever experienced before. Each GPU (of which there are currently up to 8 in a server) can currently generate 900 GB/s (7,200 Gbps) of GPU-to-GPU traffic. To be fair, this GPU-to-GPU traffic can and should be isolated to a dedicated "AI fabric" that has been specifically designed for this use. Along these lines, new types of network topologies are being used. Rob mentioned one of them during his portion of the presentation (i.e., the rail topology). End users already familiar with HPC will find that many of the same constraints and scalability issues that need to be dealt with in HPC environments also impact AI infrastructure.

Q: What are the key networking considerations for AI deployed at the edge (i.e., stores, branch offices)?

A: AI at the edge is a talk all on its own. Much like we see large differences between training, fine-tuning, and inference in the data center, inference at the edge has many flavors and performance requirements that differ from use case to use case. Some examples: a centralized set of servers ingesting the camera feeds for a large retail store, aggregating them, and making inferences, as compared to a single camera watching an intersection and using an on-chip AI accelerator to make streaming inferences. Devices ranging from medical test equipment to your car or your phone are all edge devices with wildly different capabilities.
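As a rough way to see why mainstream, baseline, and endpoint inference can run on far smaller hardware than peak inference, here is an illustrative estimate of the memory needed just to hold a model's weights at different precisions. The parameter counts and the overhead multiplier are assumptions for the example.

```python
# Very rough memory footprint for serving a model's weights at different
# precisions. Overhead (KV cache, activations, runtime buffers) is folded
# into a single assumed multiplier; real requirements vary widely.

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}
OVERHEAD = 1.3  # assumed 30% extra for KV cache / runtime buffers

def serving_memory_gib(params_billion: float, precision: str) -> float:
    return params_billion * 1e9 * BYTES_PER_PARAM[precision] * OVERHEAD / 2**30

for name, size in [("7B fine-tuned model", 7), ("70B model", 70), ("frontier-scale model (assumed 500B)", 500)]:
    print(name, {p: f"{serving_memory_gib(size, p):.0f} GiB" for p in BYTES_PER_PARAM})
```

A 7B-parameter model quantized to 4 or 8 bits fits comfortably on a single mid-range GPU or a CPU with an AI accelerator, while peak-inference models need multiple high-memory GPUs, which is where the cost difference comes from.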


Storage for Automotive Q&A

Tom Friend

Jan 10, 2022

At our recent SNIA Networking Storage Forum (NSF) webcast "Revving Up Storage for Automotive," our expert presenters, Ryan Suzuki and John Kim, discussed the storage implications as vehicles turn into data centers on wheels. If you missed the live event, it is available on-demand together with the presentation slides. Our audience asked several interesting questions about this quickly evolving industry. Here are John and Ryan's answers to them.

Q: What do you think the current storage landscape is missing to support the future of IoV [Internet of Vehicles]? Are there any identified cases of missing features from storage (edge/cloud) which are preventing certain ideas from being implemented and deployed?

[Ryan] I would have to say no; currently there are no missing features in edge or cloud storage that are preventing ideas from being implemented. If anything, more vehicles need to adopt both wireless connectivity and the associated systems (IVI, ADAS/AD) to truly realize IoV. This will take some time, as these technologies are just beginning to be offered in vehicles today. There are 200 million vehicles on the road in the US, while in a typical year 17 million new vehicles are sold.

[John] My personal opinion is no; the development of the IoV is currently limited by a combination of AI training power in the datacenter, compute power within the vehicles, wireless bandwidth (such as waiting for the broader rollout of 5G), and the development of software for new vehicles. Possibly the biggest limit is the slow rate of replacement of existing non-connected vehicles with IoV-capable ones. The IoV will definitely require more and possibly smarter storage in the datacenter, cloud, and edge, but that storage is not what is limiting or blocking the faster rollout of IoV.

Q: Taking a long-term view, is on-board storage the way to go, or will we shift to storage at the network edge given that high-bandwidth networks like 5G are flourishing?

[Ryan] On-board storage will remain in vehicles and continue to grow, because vehicles must be fully operational from a driving perspective even if a wireless connection (5G or otherwise) cannot be established. For example, systems in the vehicle required for safe driving (ADAS/AD) must operate independent of an outside connection. In addition, data collected during operation may need to be stored in the event of a slow or intermittent connection to avoid loss of data.

Q: What is the anticipated hourly storage needed? At one point this was in the multiple-TB range.

[John] HD video (1080p at 30 frames per second) requires 2-4 GB/hour and 4K video requires 15-20 GB/hour, so if a car has 6 HD cameras and a few additional sensors being recorded, the hourly storage need for a normal ADAS would be 8-30 GB/hour. However, a car being used to train, develop or test ADAS/AD systems would collect multiple video angles, more types of data, and higher-resolution video/audio/radar/lidar/performance data, possibly requiring 1-5 TB per hour.

Q: Do you know of any specific storage requirements, design, etc. in the car or the backend, specifically for meeting UNECE 155/156? It's specifically for software updates, hence the storage question.

[Ryan] Currently, there are no specific automotive requirements for storage products to meet UNECE 155/156. This regulation was developed by a regional commission of the UN focused on Europe.
While security is a concern and will grow as cars become more connected, in my opinion an international regulation/standard needs to be agreed upon to ensure a consistent level of security for all vehicles in all regions.

Q: Does automotive storage need to be ASIL-B or ASIL-D certified?

[Ryan] Individual storage components are not ASIL certified, as the certification is completed at the system level. For example, systems like vision ADAS, anti-lock braking, and power steering (self-steering) require ASIL-D certification, the highest compliance level. Typically, components that mention a specific level of ASIL compliance have been evaluated at a system hardware level.

Q. What type of endurance does automotive storage need, given the average or 99th-percentile lifespan of a modern car?

[Ryan] It depends on how the storage device is being used. If the device is used for code/application storage, such as for AI inference, the endurance requirement will be relatively low, as it only needs to support periodic updates of the code and of high-definition maps. Storage devices used for data logging, on the other hand, require a higher endurance level, as data is written during vehicle operation, uploaded to the cloud later (typically through a Wi-Fi connection), and then erased. This cycle is repeated every time the vehicle is driven.

Q. Will 5G change how much data vehicles can send and receive while driving?

[John] Eventually yes, because 5G allows higher wireless/cellular data rates. However, 5G antennas also have shorter range, so more antennas and base stations are required for coverage. This means 5G will roll out first in urban centers and will take time to reach more rural areas, and vehicles that drive to rural areas will not be able to count on always using the higher 5G data rates. 5G will also be used to connect vehicles in defined environments such as a school campus, bus/truck depot, factory, warehouse, or police station. For example, a robot operating only within a warehouse could count on having 5G access all the time, and a bus, police car, or ADAS/AD training car could store terabytes of data in the vehicle and upload it easily over a local 5G connection once it returns to the garage or station.

Q. In autonomous driving, are all the AI compute capabilities and AI rules or training stored inside each car? Or are AD cars relying somewhat on AI running somewhere in the cloud?

[John] Most of the AI rules for actually driving (AI inferencing) must be stored inside each car, because there isn't enough time to consult a computer (or additional rules) stored in the cloud and use them for real-time driving decisions. The training data and machine learning training algorithms used to create the driving rules are typically stored in the cloud or in a corporate data center. Updated rules, navigation data, and vehicle system software updates can all be stored in the cloud and pushed out to vehicles on a periodic basis. Traffic or weather data can be stored in the cloud and sent to vehicles (or to phones in vehicles) as often as several times each minute.

Q. Does the chip shortage mean car companies are putting less storage inside new cars than they think they should?

[Ryan] Not from what I have seen. For vehicles currently in production, the designs are locked, and with a limited number of vehicles OEMs can produce, they have shifted production to higher-end models to maximize profit.
This means the systems in these vehicles may actually use higher amounts of storage to support the features. For new vehicle development, storage capacities continue to grow in order to enable new applications, including IVI and ADAS.

[John] Generally no; the manufacturers are still putting in whatever amount of storage they originally planned for each vehicle and simply limiting the number of vehicles built based on the supply of semiconductors, and the limitations tend to be across several types of chips, not just memory or storage chips. It's possible that in some cars they are using older, different, or more expensive storage components than originally planned in order to get around chip shortages, but the total amount of storage is unlikely to decrease.

Q. Can typical data storage inside a car be upgraded or expanded?

[Ryan] Due to the shock and vibration vehicles encounter during operation, storage devices typically come in a BGA package and are soldered onto a PCB for higher reliability. Increasing the density would require replacing the PCB with a new board carrying a higher-capacity storage device. Some new vehicles are installing external USB ports that can use USB drives to store non-critical information, such as security camera footage while the vehicle is parked.

Q. Given the critical nature of AD systems or even engine control software, do car makers do anything special with their storage to ensure high availability or high uptime? How does a car deal with storage failure?

[Ryan] In the case of autonomous driving, this is a safety-critical system, and the reliability is examined at a system level. In an AD system, there are typically multiple SoCs, not only to handle the complex computational tasks but also for redundancy. In the event the main SoC fails, another SoC can take over to ensure the vehicle continues to operate safely. From a storage standpoint, each SoC typically uses its own storage device.

Q. You know those black boxes they put in planes (or cars) to record data in case of a crash? Those boxes are designed to survive crashes. Why can't they build the whole car out of the same stuff?

[Ryan] While this would provide an ultimate level of safety for passengers, it is unfortunately not economically feasible. Scaling a black box with the approximate volume of a 2.5" hard drive up to the more than 120 cubic feet (interior passenger and cargo volume) of a standard mid-size vehicle would be cost prohibitive.

[John] It would be too expensive and possibly too heavy to build the entire car like a "black box" data recorder. Also, a black box only needs to make one small component or data store very survivable, while the entire car needs to act as an impact protection and energy absorption system that maximizes the survivability of the occupants during and after an accident.

Q. What prevents hackers from breaching automotive systems and modifying the car's software or deleting critical data?

[John] Automotive systems are typically designed with fewer remote access paths and tighter security to make it harder to breach the system. Usually, the systems require encrypted keys from the vehicle manufacturer for remote access, and some updates or data deletion may be possible only with physical access to the car's data port. Also, certain data may be stored on flash or persistent memory within the vehicle to make it harder to delete.
Still, even with these precautions, a mistake or bug in the vehicle's software or firmware could allow a hacker to gain unauthorized access in rare cases.

Q. Would most automotive storage run as block, file, or object storage?

[John] Most of the local storage inside a vehicle, and anything storing standardized databases or small logs, would probably be block storage, as that is typically easy to use for local storage and/or structured data. Data center storage for AI or ADAS training, vehicle design, or aerodynamic/crash/FEA simulation is usually file-based storage to allow easy sharing and technical computing across multiple servers. Any archived data for vehicle design, training, simulation, videos, or telemetry that is stored outside the vehicle is most likely to be object storage, because these are typically larger files with unstructured data that don't change after creation and need to be retained for a long time.

Q. Does automotive storage need to use redundancy like RAID or erasure coding?

[Ryan] No, current single-device storage solutions with built-in ECC provide the required reliability. Implementing a RAID system or erasure coding would require multiple drives, significantly driving up the cost. Electronics currently account for 40% of a new vehicle's total cost, and that share is expected to continue growing. Switching from an existing solution that meets system requirements to a storage solution that costs multiple times more is not practical.


Revving Up Storage for Automotive

Tom Friend

Nov 8, 2021

Each year cars become smarter and more automated. In fact, the automotive industry is effectively transforming the vehicle into a data center on wheels. Connectedness, autonomous driving, and media & entertainment all bring more and more storage onboard and into networked data centers. But all the storage in (and for) a car is not created equal. There are tens, if not hundreds, of different processors in a car today. Some are attached to storage, some are not, and each application demands different characteristics from the storage device. The SNIA Networking Storage Forum (NSF) is exploring this fascinating topic on December 7, 2021 at our live webcast "Revving Up Storage for Automotive," where industry experts from both the storage and automotive worlds will discuss:
  • What’s driving growth in automotive storage?
  • Special requirements for autonomous vehicles
  • Where is automotive data typically stored?
  • Special use cases
  • Vehicle networking & compute changes and challenges
Start your engines and register today to join us as we drive into the future!


Tom Friend

Oct 20, 2021

What types of storage are needed for different aspects of AI? That was one of the many topics covered in our SNIA Networking Storage Forum (NSF) webcast "Storage for AI Applications." It was a fascinating discussion, and I encourage you to check it out on-demand. Our panel of experts answered many questions during the live roundtable Q&A. Here are answers to those questions, as well as the ones we didn't have time to address.

Q. What are the different data set sizes and workloads in AI/ML in terms of data set size, sequential/random access, and write/read mix?

A. Data sets vary incredibly from use case to use case. They may range from GBs to possibly hundreds of PBs. In general, the workloads are very read-heavy, maybe 95%+ reads. While it would be better to have sequential reads, in general the patterns tend to be closer to random. In addition, different use cases will have very different data sizes. Some may be GBs large, while others may be <1 KB. The different sizes have a direct impact on storage performance and may change how you decide to store the data.

Q. Can you give more details on the risks associated with the use of online databases?

A. The biggest risk with using an online DB is that you will be adding an additional workload to an important central system. In particular, you may find that the load is not as predictable as you think and it may impact the database performance of the transactional system. In some cases, this is not a problem, but when the database is intended for actual transactions, you could be hurting your business.

Q. What is the difference between a DPU and a RAID/storage controller?

A. A Data Processing Unit, or DPU, is intended to process the actual data passing through it. A RAID/storage controller is only intended to handle functions such as data resiliency around the data, not the data itself. A RAID controller might take a CSV file and break it down for storage on different drives. However, it does not actually analyze the data. A DPU might take that same CSV and look at the different rows and columns to analyze the data. While the distinction may seem small, there is a big difference in the software. A RAID controller does not need to know anything about the data, whereas a DPU must be programmed to deal with it. Another important aspect is whether or not the data will be encrypted. If the data will be encrypted, a DPU will have to have additional security mechanisms to deal with decryption of the data. However, a RAID-based system will not be affected.

Q. Is a CPU-bypass device the same as a SmartNIC?

A. Not entirely. They are often discussed together, but a DPU is intended to process data, whereas a SmartNIC may only process how the data is handled (such as encryption, TCP/IP functions, etc.). It is possible for a SmartNIC to also act as a DPU, where the data itself is processed. There are new NVMe-oF™ technologies that are beginning to allow FPGAs, TPUs, DPUs, GPUs, and other devices direct access to other servers' storage over a high-speed local area network without having to access the CPU of that system.

Q. What work is being done to accelerate S3 performance with regard to AI?

A. A number of companies are working to accelerate the S3 protocol. Presto and a number of Big Data technologies use it natively. For AI workloads, there are a number of caching technologies to handle the re-reads of training data on a local system, minimizing the performance penalty.

Q. From a storage perspective, how do I take different types of data from different storage systems to develop a model?
A. Work with your project team to find the data you need and ensure it can be served to the ML/DL training (or inference) environment in a timely manner. You may need to copy (or clone) data onto a faster medium to achieve your goals. But look at the process as a whole. Do not underestimate the data cleansing/normalization steps in your storage analysis, as they can prove to be a bottleneck.

Q. Do I have to "normalize" that data to the same type, or can a model accommodate different data types?

A. In general, yes. Models can be very sensitive. A model trained on one set of data with one set of normalizations may not be accurate if data taken from a different set with different normalizations is used for inference. This does depend on the model, but you should be aware not only of the model but also of the details of how the data was prepared prior to training.

Q. If I have to change the data type, do I then need to store it separately?

A. It depends on your data: do other systems need it in the old format?

Q. Are storage solutions that are right for one form of AI also the best for others?

A. No. While it may be possible to use a single solution for multiple AIs, in general there are differences in the data that can necessitate different storage. A relatively simple example is large data (MBs) vs. small data (~1 KB). Data in the multi-MB example can easily be erasure coded and stored more cost effectively. However, for small data, erasure coding is not practical and you will generally have to go with replication.

Q. How do features like CPU bypass impact the performance of storage?

A. CPU bypass is essential for those times when all you need to do is transfer data from one peripheral to another without processing. For example, if you are trying to take data from a NIC and transfer it to a GPU, but not process the data in any way, CPU bypass works very well. It prevents the CPU and system memory from becoming a bottleneck. Likewise, on a storage server, if you simply need to take data from an SSD and pass it to a NIC during a read, CPU bypass can really help boost system performance. One important note: if you are well under the limits of the CPU, the benefits of bypass are small. So, think carefully about your system design and whether or not the CPU is a bottleneck. In some cases, people will use system memory as a cache, and in those cases bypassing the CPU isn't possible.

Q. How important is it to use all-flash storage compared to HDD or hybrid?

A. Of course, it depends on your workloads. For any single model, you may be able to make do with HDDs. However, another consideration for many AI/ML systems is that their use can expand quite suddenly. Once there is some amount of success, you may find that more people will want access to the data and the system may experience more load. So beware of the success of these early projects, as the need to create multiple models from the same data could overload your system.

Q. Will storage for AI/ML necessarily be different from standard enterprise storage today?

A. Not necessarily. It may be possible for enterprise solutions today to meet your requirements. However, a key consideration is that if your current solution is barely able to handle its current requirements, then adding an AI/ML training workload may push it over the edge. In addition, even if your current solution is adequate, the sizes of many ML/DL models are growing exponentially every year.
So, what you provision today may not be adequate in a year or even several months.  Understanding the direction of the work your data scientists are pursuing is important for capacity and performance planning.
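Along the lines of that capacity-planning advice, here is a small, hedged sketch of projecting storage needs under an assumed growth rate; the starting capacity and monthly growth figure are placeholders to adapt to your own environment.

```python
# Simple compound-growth projection for AI data set / checkpoint capacity.
# Starting size and growth rate are assumptions for illustration only.

start_tb = 200          # assumed current training data + checkpoints, in TB
monthly_growth = 0.15   # assumed 15% growth per month

capacity = start_tb
for month in range(1, 13):
    capacity *= 1 + monthly_growth
    if month % 3 == 0:                       # report once per quarter
        print(f"Month {month:2d}: ~{capacity:,.0f} TB")
```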


Storage for Applications Webcast Series

John Kim

Sep 8, 2021

Everyone enjoys having storage that is fast, reliable, scalable, and affordable. But it turns out different applications have different storage needs in terms of I/O requirements, capacity, data sharing, and security. Some need local storage, some need a centralized storage array, and others need distributed storage, which itself could be local or networked. One application might excel with block storage while another with file or object storage. For example, an OLTP database might require small amounts of very fast flash storage; a media or streaming application might need vast quantities of inexpensive disk storage with extra security safeguards; while a third application might require a mix of different storage tiers with multiple servers sharing the same data. This SNIA Networking Storage Forum "Storage for Applications" webcast series will cover the storage requirements for specific uses such as artificial intelligence (AI), database, cloud, media & entertainment, automotive, edge, and more. With limited resources, it's important to understand the storage intent of the applications in order to choose the right storage and storage networking strategy, rather than discovering the hard way that you've chosen the wrong solution for your application.

We kick off this series on October 5, 2021 with "Storage for AI Applications." AI is a technology which itself encompasses a broad range of use cases, largely divided into training and inference. In this webcast, we'll look at what types of storage are typically needed for different aspects of AI, including different types of access (local vs. networked, block vs. file vs. object) and different performance requirements. And we will discuss how different AI implementations balance the use of on-premises vs. cloud storage. Tune in to this SNIA Networking Storage Forum (NSF) webcast to boost your natural (not artificial) intelligence about application-specific storage. Register today. Our AI experts will be waiting to answer your questions.


Q&A on the Ethics of AI

Jim Fister

Mar 25, 2021

Earlier this month, the SNIA Cloud Storage Technologies Initiative (CSTI) hosted an intriguing discussion on the Ethics of Artificial Intelligence (AI). Our experts, Rob Enderle, Founder of The Enderle Group, and Eric Hibbard, Chair of the SNIA Security Technical Work Group, shared their experiences and insights on what it takes to keep AI ethical. If you missed the live event, it is available on-demand along with the presentation slides in the SNIA Educational Library. As promised during the live event, our experts have provided written answers to the questions from this session, many of which we did not have time to get to.

Q. The webcast cited a few areas where AI as an attacker could make a potential cyber breach worse. Are there also some areas where AI as a defender could make cybersecurity or general welfare more dangerous for humans?

A. Indeed, we addressed several scenarios where an AI operating at the speed of thought reacts much faster than a human can. Some that we didn't address are the impact of AI on general cybersecurity. Phishing attacks using AI are getting more sophisticated, and an AI that can compromise systems with cameras or microphones has the ability to pick up significant amounts of information from users. As we continue to automate responses to attacks, there could be situations where an attacker is misidentified and an innocent person is charged by mistake. AI operates at large scale, sometimes making decisions on data that is not apparent to humans looking at the same data. This might cause an issue where an AI believes a human is in the wrong in ways that we could not otherwise see. An AI might also overreact to an attack; for instance, noticing an attempt to hack into a company's infrastructure and shutting that infrastructure down in an abundance of caution could leave workers with no power, lights, or air conditioning. Some water-cooling systems, if shut down suddenly, will burst, and that could cause both safety issues and severe damage.

Q. What are some of the technical and legal standards currently in place that are trying to regulate AI from an ethics standpoint? Are legal experts actually familiar enough with AI technology and bias training to make informed decisions?

A. The legal community is definitely aware of AI. As an example, the American Bar Association Science and Technology Law Section's (ABA SciTech) Artificial Intelligence & Robotics Committee has been active since at least 2008. ABA SciTech is currently planning its third National Institute on Artificial Intelligence (AI) and Robotics for October 2021, in which AI ethics will figure prominently. That said, case law on AI ethics/bias in the U.S. is still limited, but it is expected to grow as AI becomes more prevalent in business decisions and operations. It is also worth noting that international standards on AI ethics/bias either exist or are under development. For example, the IEEE 7000 Standards Working Groups are already developing standards for the future of ethical intelligent and autonomous technologies. In addition, ISO/IEC JTC 1/SC 42 is developing AI and Machine Learning standards that include ethics/bias as an element.

Q. The webcast talked a lot about automated vehicles and the work done by companies in terms of safety as well as in terms of liability protection. Is there a possibility that these two conflict?
A. In the webcast we discussed the fact that autonomous vehicle safety requires a multi-layered approach that could include connectivity in-vehicle, with other vehicles, with smart city infrastructure, and with individuals' schedules and personal information. This is obviously a complex environment, and the current liability process makes it difficult for companies and municipalities to work together without encountering legal risk. For instance, an autonomous car might see a pedestrian in danger and be able to place itself between the pedestrian and that danger, but not do so because the resulting accident could expose the vehicle to liability. Or, hitting ice on a corner, it might turn control over to the driver so that the driver is clearly responsible for the accident, even though the autonomous system could be more effective at reducing the chance of a fatal outcome.

Q. You didn't discuss much on AI as a teacher. Is there a possibility that AI could be used to educate students, and what are some of the ethical implications of AI teaching humans?

A. An AI can scale to individually focused, custom teaching plans far better than a human could. However, AIs aren't inherently unbiased, and if they're corrupted through their training, they will perform consistently with that training. If the training promotes unethical behavior, that is what the AI will teach.

Q. Could an ethical issue involving AI become unsolvable by current human ethical standards? What is an example of that, and what are some steps to mitigate that circumstance?

A. Certainly. Ethics are grounded in rules, and those rules aren't consistent and are in flux. These two conditions make it virtually impossible to assure the AI is truly ethical because the related standard is fluid. Machines like immutable rules; ethics rules aren't immutable.

Q. I can't believe that nobody's brought up HAL from Arthur C. Clarke's 2001 book. Wasn't this a prototype of AI ethics issues?

A. We spent some time on this at the end of the session, where Jim mentioned that our "Socratic forebears" were some of the early science fiction writers such as Clarke and Isaac Asimov. We spent some time discussing Asimov's Three Laws of Robotics and how Asimov and others later theorized how smart robots could get around the three laws. In truth, there have been decades of thought about the ethics of artificial intelligence, and we're fortunate to be able to build on that as we address what are now real-world problems.


The Effort to Keep Artificial Intelligence Ethical

Jim Fister

Feb 11, 2021

Artificial Intelligence (AI) technologies are possibly the most substantive and meaningful change to modern business. The ability to process large amounts of data with varying degrees of structure and form enables giant leaps in insight to drive revenue and profit. Likewise, governments and society have a significant opportunity to improve the lives of the populace through AI. However, with the power that AI brings come the risks of any technology innovation. The SNIA Cloud Storage Technologies Initiative (CSTI) will explore some of the ethical issues that can arise from AI at our live webcast on March 16, 2021, "The Ethics of Artificial Intelligence." Our expert speakers, Rob Enderle, President and Principal Analyst at The Enderle Group, and Eric Hibbard, Chair of the SNIA Security Technical Work Group, will join me for an interactive discussion on:
  • How making decisions at the speed of AI could be ethically challenging
  • Examples of how companies have structured their approach to AI policy
  • The pitfalls of managing the human side of AI development
  • Potential legal implications of using AI to make decisions
  • Advice for addressing potential ethics issues before they are unsolvable
It’s sure to be an enlightening discussion on an aspect of AI that is seldom explored. Register today. We look forward to seeing you on March 16th.


Keeping Up with 5G, IoT and Edge Computing

Michael Hoard

Oct 1, 2020

The broad adoption of 5G, the Internet of Things (IoT), and edge computing will reshape the nature and role of enterprise and cloud storage over the next several years. What building blocks, capabilities, and integration methods are needed to make this happen? That will be the topic of discussion at our live SNIA Cloud Storage Technologies webcast on October 21, 2020, "Storage Implications at the Velocity of 5G Streaming." Join my SNIA expert colleagues, Steve Adams and Chip Maurer, for a discussion on common questions surrounding this topic, including:
  • With 5G, IoT and edge computing – how much data are we talking about?
  • What will be the first applications leading to collaborative data-intelligence streaming?
  • How can low latency microservices and AI quickly extract insights from large amounts of data?
  • What are the emerging requirements for scalable stream storage – from petabytes to zettabytes?
  • How do yesterday’s object-based batch analytic processing (Hadoop) and today’s streaming messaging capabilities (Apache Kafka and RabbitMQ) work together?
  • What are the best approaches for getting data from the Edge to the Cloud?
I hope you will register today and join us on October 21st. It’s live so please bring your questions!


J Metz

Sep 14, 2020

Last month, the SNIA Cloud Storage Technologies Initiative was fortunate to have artificial intelligence (AI) expert Parviz Peiravi explore the topic of AI Operations (AIOps) at our live webcast, "IT Modernization with AIOps: The Journey." Parviz explained why the journey to cloud native and microservices, and the complexity that comes along with that, requires a rethinking of enterprise architecture. If you missed the live presentation, it's now available on demand together with the webcast slides. We had some interesting questions from our live audience. As promised, here are answers to them all:

Q. Can you please define the data lake and how it differs from other data storage models?

A. A data lake is another form of data repository, with specific capabilities that allow data ingestion from different sources with different data types (structured, unstructured, and semi-structured), storing the data as-is rather than transformed. Its data transformation process, Extract, Load, Transform (ELT), follows a schema-on-read approach, versus the schema-on-write Extract, Transform and Load (ETL) process that has been used in traditional database management systems. See the definition of data lake in the SNIA Dictionary here.

In 2005, Roger Mougalas coined the term Big Data. It refers to the large-volume, high-velocity data generated by the Internet and billions of connected intelligent devices, which was impossible to store, manage, process, and analyze with traditional database management and business intelligence systems. The need for high-performance data management systems and advanced analytics that could deal with a new generation of applications, such as Internet of Things (IoT), real-time, and streaming apps, led to the development of data lake technologies. Initially, the term "data lake" referred to the Hadoop framework and its distributed computing and file system, which bring storage and compute together and allow faster data ingestion, processing, and analysis. In today's environment, "data lake" can refer to both physical and logical forms: a logical data lake could include Hadoop, a data warehouse (SQL/NoSQL), and object-based storage, for instance.

Q. One of the aspects of replacing and enhancing a brownfield environment is that there are different teams in the midst of different budget cycles. This makes greenfield very appealing. On the other hand, greenfield requires a massive capital outlay. How do you see the percentages of either scenario working out in the short term?

A. I do not have an exact percentage, but the majority of enterprises using a brownfield implementation strategy have been in place for a long time. In order to develop and deliver new capabilities with velocity, greenfield approaches are gaining significant traction. Most new application development based on microservices/cloud native is being implemented in greenfield to reduce risk and cost, using the cloud resources available today at a smaller scale at first and adding more resources later.

Q. There is a heavy reliance upon mainframes in banking environments. There's quite a bit of error that has been eliminated through decades of best practices. How do we ensure that we don't build in error because these models are so new?

A. The compelling reasons behind mainframe migration, besides the cost, are the ability to develop and deliver new application capabilities and business services and to make data available to all other applications. There are four methods for mainframe migration:
  • Data migration only
  • Re-platforming
  • Re-architecting
  • Re-factoring
Each approach provides enterprises different degrees of risk and freedom. Applying best practices to both application design/development and operational management is the best way to ensure a smooth application migration from a monolith to a new distributed environment such as microservices/cloud native. Data architecture plays a pivotal role in the design process, in addition to applying a Continuous Integration and Continuous Delivery (CI/CD) process.

Q. With the changes to a monolithic data lake, will we be seeing different data lakes with different security parameters, which just means that each lake is simply another data repository?

A. If we follow a domain-driven design principle, you could have multiple data lakes with specific governance and security policies appropriate to each domain. Multiple data lakes could be accessed through data virtualization to mimic a monolithic data lake; this approach is based on a logical data lake architecture.

Q. What's the difference between multiple data lakes and multiple data repositories? Isn't it just a matter of quantity?

A. Looking from a Big Data perspective, a data lake is not only stored data but also the capability to process and analyze that data (e.g., the Hadoop framework/HDFS). New trends are emerging that separate storage and compute (e.g., disaggregated storage architectures); hence some vendors use the term "data lake" loosely and offer only storage capability, while others provide both storage and data processing capabilities as an integrated solution. What is more important than the definition of a data lake is your usage and specific application requirements, which determine whether a solution is a good fit for your environment.
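To make the ELT / schema-on-read idea from the data lake answer concrete, here is a hedged Python sketch contrasting it with the traditional ETL / schema-on-write pattern. The paths, field names, and record shapes are invented for illustration.

```python
# Schema-on-write (ETL) vs. schema-on-read (ELT), in miniature.
# Field names and record shapes are hypothetical.
import json

# ETL-style: enforce a schema before the data is stored.
def etl_load(record: dict, table: list) -> None:
    row = {"device_id": str(record["id"]),          # transform first...
           "temp_c": float(record["temperature"])}
    table.append(row)                               # ...then load

# ELT-style: land the raw event as-is; apply a schema only when reading.
def elt_land(raw_line: str, lake: list) -> None:
    lake.append(raw_line)                            # load untransformed

def elt_read_temps(lake: list) -> list:
    out = []
    for line in lake:
        rec = json.loads(line)                       # schema applied on read
        if "temperature" in rec:                     # tolerate varying shapes
            out.append(float(rec["temperature"]))
    return out

warehouse, lake = [], []
event = '{"id": 42, "temperature": "21.5", "firmware": "1.0.3"}'
etl_load(json.loads(event), warehouse)
elt_land(event, lake)
print(warehouse)             # structured rows with a fixed schema
print(elt_read_temps(lake))  # interpretation deferred to query time
```

The point is not the code itself but the design choice it illustrates: the warehouse path fails fast on unexpected shapes, while the lake path accepts anything and defers interpretation to read time.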

