What frustrates data scientists working on a Hadoop project?

04 April, 2017
Chandra Mukhyala
IBM

Recently I talked to a data scientist who was experimenting with various Hadoop distributions, and I learned that the first step in any Hadoop project, before any analytics jobs can be run, is to copy the data to be analyzed into HDFS storage. HDFS is the Hadoop Distributed File System, purpose-built for Hadoop-based analytics. It is a scalable file system that runs on a cluster of storage-rich commodity servers.
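
As a rough illustration of that first step, here is a minimal sketch that uses Hadoop's standard FileSystem API to copy a local data set into HDFS. The NameNode address, paths and file names are hypothetical, and real ingest at this scale is usually scripted or handled by bulk-copy tools.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyIntoHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; in practice this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        // Obtain a handle to the cluster's distributed file system.
        FileSystem fs = FileSystem.get(conf);

        // Copy a local data set into HDFS so analytics jobs can read it.
        fs.copyFromLocalFile(new Path("/data/sales/2017-q1.csv"),
                             new Path("/warehouse/sales/2017-q1.csv"));

        fs.close();
    }
}
```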

But that copy process can take days, depending on the size of the data set. It is not uncommon to copy hundreds of terabytes into HDFS, because a project typically pulls together multiple data sets from different business functions; after all, the whole point of big data analytics is to find insights by analyzing data across the various departments and functions of an enterprise. By the time such large data sets finish copying into HDFS, they are already stale. This is what frustrates data scientists: first the time spent copying the data, and then the analysis of data that is no longer current.
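
To put rough, purely illustrative numbers on it: moving 300 TB over a network path that sustains about 1 GB/s of effective throughput takes roughly 300,000 GB ÷ 1 GB/s ≈ 300,000 seconds, or close to three and a half days, and that is before counting verification, retries or re-runs, and before the next refresh of the source data makes the copy out of date again.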

So how do we solve this problem? What if there were no need to copy your enterprise data into an isolated storage silo? That is exactly what IBM Spectrum Scale storage allows you to do: you can run Hadoop analytics directly on IBM Spectrum Scale storage and avoid the whole copy-to-HDFS headache. IBM Spectrum Scale is an industry-proven, high-performance, scalable storage system based on IBM's General Parallel File System (GPFS). If you store all of your enterprise data on IBM Spectrum Scale, there is no need to copy it into HDFS to run Hadoop analytics, because Spectrum Scale supports the HDFS APIs (a brief sketch of what that looks like in practice follows the list below). Not having to copy your enterprise data to an isolated storage silo improves your productivity, and Spectrum Scale offers some other key efficiencies over HDFS:

Unified storage: Analytics results are immediately available to any enterprise application through industry-standard file and block protocols such as NFS, SMB and iSCSI, as well as to modern web applications that use object interfaces such as S3 or Swift.
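
For example, a downstream web application could pick up a Hadoop job's output directly through the object interface. The sketch below uses the AWS S3 SDK for Java against an S3-compatible endpoint; the endpoint, credentials, bucket and object names are all invented for illustration.

```java
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.client.builder.AwsClientBuilder;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.S3Object;

public class FetchAnalyticsResult {
    public static void main(String[] args) throws Exception {
        // Hypothetical S3-compatible object endpoint exposed by the storage cluster.
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withEndpointConfiguration(new AwsClientBuilder.EndpointConfiguration(
                        "http://object.scale.example.com", "us-east-1"))
                .withCredentials(new AWSStaticCredentialsProvider(
                        new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY")))
                .withPathStyleAccessEnabled(true)
                .build();

        // Read a result file written by a Hadoop job, with no copy step in between.
        S3Object result = s3.getObject("analytics-results", "sales/2017-q1/part-00000");
        System.out.println("Result size in bytes: "
                + result.getObjectMetadata().getContentLength());
        result.close();
    }
}
```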

Erasure coding: Spectrum Scale uses erasure coding for data protection and availability, whereas HDFS keeps three copies of the data, which quickly adds up as the Hadoop cluster grows, a very common problem in large enterprises. With erasure coding the capacity overhead is around 20 percent, compared to 200 percent with three copies of the data.
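
To make those overhead figures concrete, assume a 10+2 erasure code purely as an illustration: every 10 data fragments carry 2 additional parity fragments, which is 2/10 = 20 percent extra capacity, while three-way replication stores two extra full copies, which is 200 percent extra. For 100 TB of data, that works out to roughly 120 TB of raw capacity versus 300 TB.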

Shared storage: HDFS uses a shared-nothing (SN) architecture, in which the cluster grows by adding nodes that each contribute both compute and storage. Spectrum Scale can be deployed either shared-nothing or in shared-storage mode, which decouples storage from compute for workloads that demand high read/write throughput.

Tier to the cloud: Spectrum Scale has built-in policy-based tiering, which automatically moves your older or cold data to cost-effective cloud storage.

Last but not least, Spectrum Scale is going to be supported by Hortonworks, a key Hadoop distribution vendor. See the press release here.
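
As noted above, because Spectrum Scale supports the HDFS APIs, existing Hadoop code generally needs a configuration change rather than a rewrite. The minimal sketch below assumes the Spectrum Scale HDFS Transparency connector is configured on the cluster; the endpoint and path are hypothetical, and the FileSystem calls are the same ones used against native HDFS.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromSpectrumScale {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical endpoint: an HDFS-compatible service in front of Spectrum Scale
        // answers the same RPCs, so the familiar hdfs:// URI scheme still applies.
        conf.set("fs.defaultFS", "hdfs://scale-hdfs.example.com:8020");

        FileSystem fs = FileSystem.get(conf);

        // The same FileSystem calls used against native HDFS work unchanged;
        // here we simply check that the enterprise data is already visible.
        boolean present = fs.exists(new Path("/warehouse/sales/2017-q1.csv"));
        System.out.println("Data set visible to Hadoop jobs: " + present);

        fs.close();
    }
}
```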

Those are some of the key advantages of Spectrum Scale over HDFS. They not only eliminate the headaches of copying your enterprise data to isolated storage but also improve the efficiency of Hadoop storage and the productivity of your data scientists.

To learn more about how IBM Spectrum Scale storage can help improve your Hadoop jobs, join us at the IBM booth at the DataWorks Summit in Munich.
