What
types of storage are needed for different aspects of AI? That was one of the
many topics covered in our SNIA Networking Storage Forum (NSF) webcast “Storage for AI Applications.” It was a fascinating discussion and I encourage
you to check it out on-demand.
Our panel of experts answered many questions during the live roundtable
Q&A. Here are answers to those questions, as well as the ones we didn’t
have time to address.
Q. What are the typical data sets and workloads in AI/ML in
terms of data set size, sequential/random access, and write/read mix?
A. Data sets vary enormously from use case to use case; they
may range from GBs to possibly hundreds of PBs. In general, the workloads are very
heavily read-oriented, often 95%+ reads. While sequential reads would be preferable, in
practice the access patterns tend to be closer to random. In addition, different use cases
will have very different data item sizes: some items may be GBs large, while others may be
<1 KB. These different sizes have a direct impact on storage performance and may
change how you decide to store the data.
Q. Can you provide more details on the risks associated with the use of online databases?
A. The biggest risk with using an online database is that you will be
adding an additional workload to an important central system. In particular,
you may find that the load is less predictable than you expect, and that it impacts
the transactional performance of the database. In some cases this is
not a problem, but when the database serves actual business transactions, you could be
hurting your business.
Q. What is the difference between a DPU and a RAID / storage
controller?
A. A Data Processing Unit or DPU is intended to process the actual
data passing through it. A RAID/storage controller is only intended to handle
functions such as data resiliency around the data, but not the data itself. A
RAID controller might take a CSV file and break it down for storage in
different drives. However, it does not actually analyze the data. A DPU might
take that same CSV and look at the different rows and columns to analyze the
data. While the distinction may seem small, there is a big difference in the
software. A RAID controller does not need to know anything about the data,
whereas a DPU must be programmed to deal with it. Another important aspect is
whether or not the data will be encrypted. If the data will be encrypted, a DPU
will need additional security mechanisms to handle decryption of the
data. However, a RAID-based system will not be affected.
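To make the distinction concrete, here is an illustrative Python sketch (not real controller firmware, and every name in it is made up): the RAID-style path computes parity over opaque bytes without ever interpreting them, while the DPU-style path parses the CSV content itself.

```python
# Illustrative sketch only: contrasts a RAID-style opaque-bytes path with a
# DPU-style path that understands the data. Names/values are hypothetical.
import csv
import io

def raid_parity(blocks):
    """RAID-style: XOR parity over opaque, equal-size byte blocks."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            parity[i] ^= b
    return bytes(parity)  # the controller never interprets the bytes

def dpu_analyze(csv_bytes):
    """DPU-style: parse the CSV's rows and columns to analyze the data."""
    rows = list(csv.reader(io.StringIO(csv_bytes.decode("utf-8"))))
    header, data = rows[0], rows[1:]
    return {"columns": header, "row_count": len(data)}

sample = b"id,amount\n1,10\n2,20\n"
print(raid_parity([sample[:10], sample[10:20]]))  # opaque parity bytes
print(dpu_analyze(sample))  # {'columns': ['id', 'amount'], 'row_count': 2}
```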
Q. Is a CPU-bypass device the same as a SmartNIC?
A. Not entirely. They are often discussed together, but a DPU is
intended to process the data itself, whereas a SmartNIC may only process how the data is
handled (encryption, TCP/IP offload, etc.). It is possible for a SmartNIC to also act as
a DPU where the data itself is processed. There are new NVMe-oF technologies that
are beginning to allow FPGAs, TPUs, DPUs, GPUs and other devices to access
other servers' storage directly over a high-speed local area network without
having to go through the CPU of that system.
Q. What work is being done to accelerate S3 performance with
regard to AI?
A. A number of companies are working to accelerate the S3 protocol,
and Presto and a number of other Big Data technologies use it natively. For AI workloads,
there are a number of caching technologies that handle the re-reads of training data on a
local system, minimizing the performance penalty of repeatedly fetching from the object store.
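As a rough illustration of that caching approach, here is a minimal Python sketch using boto3; the bucket, key, and cache directory are hypothetical, and production caching layers add eviction, consistency checks, and concurrency handling.

```python
# A minimal sketch of local read caching for S3 training data; assumes
# boto3 credentials are configured. Paths and names are hypothetical.
import hashlib
import os

import boto3

s3 = boto3.client("s3")
CACHE_DIR = "/mnt/nvme/s3cache"  # hypothetical fast local medium

def cached_get(bucket: str, key: str) -> bytes:
    """Return object bytes, serving repeat reads from the local cache."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    name = hashlib.sha256(f"{bucket}/{key}".encode()).hexdigest()
    path = os.path.join(CACHE_DIR, name)
    if os.path.exists(path):                      # re-read during training
        with open(path, "rb") as f:
            return f.read()
    data = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    with open(path, "wb") as f:                   # first epoch populates cache
        f.write(data)
    return data
```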
Q. From a storage perspective, how do I take different types of data from different storage systems to develop a model?
A. Work with your project team to find the data you need and
ensure it can be served to the ML/DL training (or inference) environment in a
timely manner. You may need to copy (or clone) data on to a faster medium to
achieve your goals. But look at the process as a whole, and do not underestimate
the data cleansing/normalization steps in your storage analysis, as they can prove
to be a bottleneck.
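As a sketch of what pulling data from different storage systems can look like in practice, here is a hypothetical pandas example: one source is a transactional database (read via a replica, per the earlier caution about online databases), the other a file cloned onto fast local media. The connection string, table, path, and join key are all made up.

```python
# A minimal sketch of gathering data from two different storage systems
# for model development; DSN, table, path, and join key are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://analytics-replica/db")    # replica, not the primary
orders = pd.read_sql("SELECT * FROM orders", engine)           # transactional source
logs = pd.read_parquet("/mnt/nvme/clone/clickstream.parquet")  # cloned to fast media

# Cleanse/normalize before training -- often the real bottleneck.
df = orders.merge(logs, on="customer_id", how="inner")
df = df.dropna()
```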
Q.
Do I have to “normalize” that data to the same type, or can a model accommodate
different data types?
A. In general, yes. Models can be very sensitive: a model trained
on one set of data with one set of normalizations may not be accurate if
inference uses data taken from a different set with different normalizations.
This does depend on the model, but you should be aware not only of the model,
but also the details of how the data was prepared prior to training.
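Here is a minimal scikit-learn sketch of that sensitivity (the values are made up): reusing the training-time normalization versus re-deriving it from the new data produces very different inputs to the model.

```python
# Why preparation must match between training and inference; values made up.
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[10.0], [20.0], [30.0]])
scaler = StandardScaler().fit(train)          # learns mean=20, std~8.16

new = np.array([[100.0], [200.0], [300.0]])   # prepared from a different set
consistent = scaler.transform(new)            # ~[9.8, 22.0, 34.3]: model sees extreme values
inconsistent = StandardScaler().fit_transform(new)  # ~[-1.22, 0, 1.22]: looks "normal"
print(consistent.ravel(), inconsistent.ravel())
```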
Q.
If I have to change the data type, do I then need to store it separately?
A. It depends on your data. Ask yourself: do other systems need it in the
old format? If they do, you will likely need to store the converted data separately.
Q.
Are storage solutions that are right for one form of AI also the best for
others?
A. No. While it may be possible to use a single solution for
multiple AI applications, in general there are differences in the data that can necessitate
different storage. A relatively simple example is large data (multiple MBs) vs. small
data (~1 KB). The large, multi-MB data can easily be erasure
coded and stored more cost-effectively. However, for small data, erasure coding
is not practical and you will generally have to go with replication, as the sketch below illustrates.
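A back-of-envelope sketch of the overhead arithmetic, assuming an 8+2 erasure code, 3x replication, and a purely hypothetical 512-byte per-shard metadata cost:

```python
# Storage-overhead sketch: erasure coding vs. replication. The scheme
# parameters and per-shard metadata cost are illustrative assumptions.
def ec_effective_overhead(object_bytes, data_shards=8, parity_shards=2,
                          per_shard_metadata=512):
    shards = data_shards + parity_shards
    stored = object_bytes * shards / data_shards + shards * per_shard_metadata
    return stored / object_bytes

replication_overhead = 3.0  # three full copies
print(f"64 MB object: EC {ec_effective_overhead(64 * 2**20):.2f}x "
      f"vs replication {replication_overhead}x")   # ~1.25x: EC wins
print(f" 1 KB object: EC {ec_effective_overhead(1024):.2f}x "
      f"vs replication {replication_overhead}x")   # ~6.25x: worse than 3x copies
```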
Q.
How do features like CPU bypass impact performance of storage?
A. CPU bypass is essential for those times when all you need to do
is transfer data from one peripheral to another without processing. For
example, if you are trying to take data from a NIC and transfer it to a GPU,
but not process the data in any way, CPU bypass works very well. It prevents
the CPU and system memory from becoming a bottleneck. Likewise, on a storage
server, if you simply need to take data from an SSD and pass it to a NIC during
a read, CPU bypass can really help boost system performance. One important
note: if you are well under the limits of the CPU, the benefits of bypass are
small. So, think carefully about your system design and whether or not the CPU
is a bottleneck. In some cases, people use system memory as a cache, and in
those cases bypassing the CPU isn't possible.
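For intuition, here is a back-of-envelope sketch of the bottleneck argument; the bandwidth figures are illustrative assumptions, not measurements.

```python
# Why bypass helps: bounced traffic crosses system memory twice.
# All figures below are illustrative assumptions.
nic_gbps = 100          # NIC line rate
mem_bw_gbps = 400       # usable system-memory bandwidth

bounce_load = 2 * nic_gbps  # without bypass: NIC -> RAM, then RAM -> GPU
print(f"Memory-bus load without bypass: {bounce_load} Gb/s "
      f"({bounce_load / mem_bw_gbps:.0%} of {mem_bw_gbps} Gb/s)")
# With peer-to-peer bypass (NIC DMA direct to the GPU), that load is ~0;
# but if bounce_load is already a small fraction, the benefit is small.
```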
Q.
How important is it to use All-Flash storage compared to HDD or hybrid?
A. It depends on your workloads, of course. For any single model,
you may be able to make do with HDDs. However, another consideration for many
AI/ML systems is that their use can expand quite suddenly. Once there is
some amount of success, you may find that more people want access to the
data and that the system experiences more load. So be aware that the success of these
early projects may create a need to build multiple models from
the same data, which could overload your system.
Q.
Will storage for AI/ML necessarily be different from standard enterprise
storage today?
A. Not necessarily. It may be possible for enterprise solutions
today to meet your requirements. However, a key consideration is that if your
current solution is barely able to handle its current requirements, then adding
an AI/ML training workload may push it over the edge. In addition, even if your
current solution is adequate, the sizes of many ML/DL models are growing
exponentially every year. So, what you
provision today may not be adequate in a year or even several months. Understanding the direction of the work your
data scientists are pursuing is important for capacity and performance
planning.
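As a toy illustration of that provisioning problem, here is a projection under an assumed doubling period; both the starting footprint and the growth rate are hypothetical.

```python
# Toy capacity projection; starting footprint and doubling rate are assumptions.
start_tb = 100          # today's training-data footprint
doubling_months = 12    # assumed exponential growth rate

for month in (0, 6, 12, 18, 24):
    projected = start_tb * 2 ** (month / doubling_months)
    print(f"month {month:2d}: ~{projected:7.1f} TB")
```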
