Meeting the data needs of artificial intelligence

22 May, 2018
Rick Janowski
IBM

Artificial intelligence (AI) is playing an increasingly critical role in business. One report predicts that by 2020, 30 percent of organizations that fail to apply AI will no longer be operationally and economically viable[1]. And in one survey, 91 percent of infrastructure and operations leaders cited "data" as a main inhibitor of their AI initiatives[2]. What does a data professional need to know about AI and its data requirements to support their organization's AI efforts?

Many factors have converged in recent years to make AI viable, including the growth of processing power and advances in AI techniques, notably in the area of deep learning (DL). Unlike traditional programming, in which a programmer specifies each step the computer must take to accomplish a task, deep learning has the computer learn for itself. In visual object recognition, for example, there is no practical way to program the steps needed to recognize a given object, which may appear in different locations, at different angles, under different lighting, or partially obscured by other objects. Instead, the computer is trained on thousands of example images containing the object until it can recognize it consistently.
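To make the contrast with traditional programming concrete, a training loop of this kind might look like the minimal sketch below. It assumes PyTorch and torchvision, a folder of labeled example images named train/, and a standard ResNet-18 network; these specifics are illustrative assumptions, not anything described in this post.

```python
# Minimal sketch: learning from labeled examples instead of hand-coded rules.
import torch
from torch import nn, optim
from torchvision import datasets, models, transforms

# Thousands of example images, organized as train/<class_name>/*.jpg
data = datasets.ImageFolder(
    "train",
    transform=transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ]),
)
loader = torch.utils.data.DataLoader(data, batch_size=64, shuffle=True)

# A standard convolutional network; the final layer matches our classes.
model = models.resnet18(num_classes=len(data.classes))
opt = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

# Each pass over the data adjusts the model based on its mistakes.
for epoch in range(10):
    for images, labels in loader:
        opt.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        opt.step()
```

Nowhere in this loop does a programmer describe what the object looks like; the model infers that from the examples themselves.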

This kind of training requires lots of data. One recommendation is to start with at least 100,000 examples, and each example can be large: an image or a voice recording, for instance. The training and deployment stages of a deep learning system also have different data and processing requirements. Training may involve years of accumulated data and can take weeks or even months to complete. Once deployed, by contrast, the system may need to respond in seconds.
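As a rough, hypothetical illustration of what that volume means for storage, the calculation below assumes the 100,000-example starting point and an average example size of 2 MB; real sizes vary widely by data type.

```python
# Back-of-envelope sizing with assumed figures: 100,000 examples (the
# recommendation above) at roughly 2 MB per image.
examples = 100_000
avg_size_mb = 2            # assumption; real sizes vary widely by data type

total_gb = examples * avg_size_mb / 1024
print(f"~{total_gb:.0f} GB for a single labeled training set")   # ~195 GB
```

And that is a single dataset for a single model; years of historical data, multiple models, and multiple experiments multiply the figure quickly.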

Given the data volumes involved, storage capacity is clearly an important consideration during training. The data may also be held in different formats across different systems, so multi-protocol capability may be needed, and it may be geographically dispersed, which the storage system must handle as well. Once the system is deployed, fast access to the data becomes particularly important, because users and applications typically expect answers in seconds.

A system such as IBM Spectrum Scale is well suited to these requirements. It is a high-performance system that can scale out to handle petabytes or even exabytes of data, and it supports a wide variety of protocols for accessing files and objects. For Hadoop applications, it provides direct access to data without first copying it into HDFS, as is usually required. Avoiding that copy saves space, lowers costs, and speeds time to results.
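As a hypothetical sketch of what in-place access can look like from the application side, the PySpark snippet below reads training metadata straight from a cluster file system mounted at /gpfs rather than staging a copy into HDFS first; the mount point, file format, and job are illustrative assumptions, not Spectrum Scale specifics.

```python
# Hypothetical illustration only: paths, format, and schema are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("training-data-prep").getOrCreate()

# With the cluster file system mounted at /gpfs, the job reads the data
# where it already lives, rather than first copying it into HDFS
# (for example with `hdfs dfs -put /gpfs/training ...`) and reading the copy.
df = spark.read.parquet("file:///gpfs/training/metadata")
df.groupBy("label").count().show()
```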

IBM Spectrum Scale is a software-defined solution that can be deployed on a customer's choice of platform, or it can be delivered as a complete solution in the form of IBM Elastic Storage Server (ESS). The capacity and performance capabilities of IBM Spectrum Scale and ESS are well illustrated by the US Department of Energy CORAL project, currently on track to build the world's fastest supercomputer. ESS will provide the 250 PB of storage the system requires, with performance requirements that include 2.5 TB/second single-stream IOR throughput and the creation of 2.6 million 32 KB files per second.
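Some rough arithmetic, using only the figures quoted above, helps put those requirements in perspective:

```python
# Rough arithmetic on the CORAL figures quoted above (no other assumptions).
capacity_pb = 250          # total storage
throughput_tb_s = 2.5      # single-stream IOR throughput

seconds = capacity_pb * 1024 / throughput_tb_s   # PB -> TB, then divide
print(f"Reading all {capacity_pb} PB at {throughput_tb_s} TB/s would take "
      f"~{seconds / 3600:.0f} hours")            # roughly 28 hours
```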

IBM Spectrum Scale and IBM Elastic Storage Server undergo constant improvement. The latest version of IBM Spectrum Scale incorporates enhancements to the install and upgrade process, the GUI, and system health capabilities, along with Transparent Cloud Tiering scalability and performance tuning for up to one billion files and improvements to file audit logging.

Meanwhile, ESS now offers models that incorporate the improved performance of IBM Spectrum Scale version 5.0, with enhancements designed to meet the requirements of the CORAL supercomputer. ESS is also introducing its first hybrid models, which combine flash and disk storage in a single unit to better handle different kinds of data, such as video and analytics, within a single environment.

Constant improvements, along with decades of experience in the most challenging customer environments, ensure that IBM Spectrum Scale and IBM Elastic Storage Server will continue to lead the way in managing the data that is a key element in the success of any deep learning project. Visit our website to learn more about IBM Spectrum Scale and IBM Elastic Storage Server.

[1] Gartner, Predicts 2018: Compute Infrastructure

[2] Gartner, AI State of the Market – and Where HPC Intersects