Optimizing data lake infrastructure

06 December, 2018
Anandakumar Mohan
IBM

In today’s world, data is the new oil, and there’s a great need to preserve that data for exploration and to derive value. A “data lake” acts as a repository that consolidates an organization’s data into a governed and well-managed environment that supports both analytics and production workloads. It embraces multiple data platforms, such as relational data warehouses, Apache Hadoop clusters and analytical appliances, and manages them together.

Most companies aspire to become more data-driven, but many organizations are struggling to deliver on their strategy. This is primarily because of how data has been traditionally managed, with several point-to-point connections, new interfaces built over time as quick solutions, and a large amount of data that needs to be stored and moved across these systems. Addressing these challenges requires innovative solutions.

In this blog post, we’ll talk about infrastructure solutions from IBM Spectrum Scale and IBM Power Systems that can help you address these challenges and optimize your data lake infrastructure.

The challenge with traditional Hadoop

Hadoop technology is the basis for many data lake solutions. Traditional Hadoop uses a shared-nothing architecture: nodes with direct-attached disks are clustered together, the compute on each node runs Hadoop jobs, and the local disks are pooled into a Hadoop Distributed File System (HDFS). The YARN resource manager reschedules jobs when a node fails, and HDFS maintains redundancy by typically keeping three copies of each block across nodes. If you need more storage capacity, you must add more nodes, whether or not you need the extra compute. Directly attaching storage to compute in this way typically results in underutilized compute farms and makes it difficult to scale compute and storage independently.
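
To make that coupling concrete, here's a minimal back-of-the-envelope sketch. The node specifications are illustrative assumptions rather than any particular product: because every node contributes a fixed slice of both disk and cores, growing storage alone drags compute along with it.

    # Illustrative shared-nothing sizing; node specs are assumed values.
    import math

    NODE_CORES = 32        # cores per worker node (assumption)
    NODE_RAW_TB = 48       # direct-attached raw disk per node (assumption)
    HDFS_REPLICATION = 3   # default HDFS block replication factor

    def nodes_for_capacity(usable_tb: float) -> int:
        """Nodes needed to hold usable_tb of data with 3x replication."""
        raw_needed = usable_tb * HDFS_REPLICATION
        return math.ceil(raw_needed / NODE_RAW_TB)

    # Doubling usable storage doubles the provisioned cores, needed or not.
    for usable_tb in (1000, 2000):
        nodes = nodes_for_capacity(usable_tb)
        print(f"{usable_tb} TB usable -> {nodes} nodes, {nodes * NODE_CORES} cores")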

IBM Spectrum Scale offers the capability to independently scale compute and storage. Here’s how it works:

  • IBM Spectrum Scale emulates the HDFS API through its HDFS Transparency connector, allowing Hadoop ecosystem applications to run seamlessly without any application-level code changes. As software-defined storage, IBM Spectrum Scale enables an efficient shared storage model in which Hadoop workloads scale storage independently of compute. It can be deployed either as a horizontally scalable appliance (IBM Elastic Storage Server) or as a customized IBM Spectrum Scale implementation with varied storage servers in the back end.
  • IBM Elastic Storage Server includes IBM Spectrum Scale RAID software, eliminating the need for three-way replication and reducing storage capacity requirements by up to 60 percent compared to traditional Hadoop architecture (see the sizing sketch after this list).
  • IBM Spectrum Scale's storage tiering capability seamlessly moves data to the right storage tier.
  • IBM Spectrum Scale is certified with the popular Hadoop distribution from Hortonworks and is compatible with the Hortonworks Data Platform (HDP) and Hortonworks DataFlow (HDF) products.
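
As a rough illustration of where that figure comes from, here is a minimal sizing sketch comparing raw capacity under three-way HDFS replication with a declustered erasure-coded layout. The 8+2p stripe width is an assumption chosen for illustration; actual savings depend on the ESS configuration.

    # Rough raw-capacity comparison: 3x replication vs. erasure coding.
    # The 8+2p stripe width is an illustrative assumption; real layouts vary.
    USABLE_TB = 1000

    replicated_raw = USABLE_TB * 3            # HDFS keeps three full copies
    erasure_raw = USABLE_TB * (8 + 2) / 8     # 8 data strips + 2 parity strips

    savings = 1 - erasure_raw / replicated_raw
    print(f"3x replication: {replicated_raw:.0f} TB raw")
    print(f"8+2p erasure coding: {erasure_raw:.0f} TB raw")
    print(f"Raw capacity saved: {savings:.0%}")  # roughly 58% in this example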

Simplified infrastructure for data movement 

Organizations typically need to maintain separate copies of the same data for traditional and analytics applications due to disparate storage technologies and interfaces. This results in additional storage requirements and time spent copying data across systems. What's desired is a unified storage layer that can virtualize storage across different technologies and support multiple interfaces such as a POSIX file system, HDFS and an object interface.

IBM Spectrum Scale can virtualize different storage technologies such as flash, SSD, spinning disk and tape drives into a unified file system namespace, providing advanced tiering and replication capabilities. Its comprehensive support for data access protocols and APIs, including NFS, SMB, Object, POSIX file system and HDFS, enables in-place analytics that simplify the infrastructure and reduce data movement.

As an example, one could host common data on the IBM Spectrum Scale file system, mount it as a POSIX file system on an RDBMS server and access the same data from Hadoop through the HDFS API. With this approach, in-place analytics brings the analytics to where the data is rather than moving the data around, reducing the infrastructure footprint and speeding up data ingest.
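
As a minimal sketch of that dual access, the snippet below reads the same file once through the POSIX mount point and once through the HDFS API (here via WebHDFS, using the open-source hdfs Python client). The mount path, endpoint and file name are hypothetical, and an HDFS Transparency connector is assumed to be configured on the cluster.

    # Hypothetical locations; adjust to your Spectrum Scale deployment.
    from hdfs import InsecureClient   # WebHDFS client: pip install hdfs

    POSIX_PATH = "/gpfs/datalake/sales/2018/orders.csv"
    HDFS_PATH = "/datalake/sales/2018/orders.csv"

    # 1) POSIX access, e.g. from an RDBMS host or an ETL script.
    with open(POSIX_PATH, "r", encoding="utf-8") as f:
        posix_data = f.read()

    # 2) HDFS access through the HDFS Transparency connector's WebHDFS endpoint.
    client = InsecureClient("http://hdfs-transparency-node:9870", user="hdfs")
    with client.read(HDFS_PATH, encoding="utf-8") as reader:
        hdfs_data = reader.read()

    # Same bytes either way; no copy between traditional and analytics silos.
    assert posix_data == hdfs_data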

Lack of cognitive infrastructure poses roadblocks

An organization's existing infrastructure can pose roadblocks to scalability and efficiency. This new era of data requires a different approach to computing: Moore's law no longer delivers the processor technology gains needed to keep up with the demands of big data, and traditional infrastructure cannot handle high data volumes or the compute-intensive AI algorithms used to derive insights from them. Building cognitive systems requires a collaborative approach, with innovation at every layer: the processor/chip, system boards, I/O interconnects, accelerators, I/O adapters and software.

IBM Power Systems addresses various requirements for such workloads with cognitive systems. Here are some key highlights:

  • The OpenPOWER Foundation brings together collaborators from across the industry to optimize and innovate on the POWER processor and system platform, building custom systems for large-scale data centers and evolving from compute systems to cognitive systems
  • High-speed PCIe Gen4 interconnect on POWER9 systems, among the first in the industry, to move data efficiently across the system
  • Innovative Coherent Accelerator Processor Interface (CAPI) removes the overhead and complexity of the I/O subsystem, allowing an accelerator (such as an FPGA) to operate as an extension of an application
  • High-speed NVIDIA NVLink connectivity (600 GB/s total) between processor and GPU accelerators, catering to high-performance deep learning workloads (a quick bandwidth check follows this list)
  • Workload-optimized processor and server architecture, with separate server lineups for scale-up and scale-out workloads, supporting the AIX, IBM i and Linux operating systems
  • Superior price-performance over competing platforms, resulting in lower infrastructure costs
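
If you want to see what CPU-to-GPU bandwidth a given system actually delivers, one rough way is to time a pinned host-to-device copy. The sketch below uses PyTorch purely for illustration and assumes a CUDA-capable system; it is a quick check, not a rigorous benchmark.

    # Rough host-to-GPU bandwidth check; assumes PyTorch and a CUDA-capable GPU.
    import time
    import torch

    SIZE_GB = 2  # transfer size in GB; adjust to fit host and GPU memory

    # Pinned (page-locked) host memory allows fast, asynchronous transfers.
    host = torch.empty(int(SIZE_GB * 1024**3) // 4, dtype=torch.float32).pin_memory()

    torch.cuda.synchronize()
    start = time.perf_counter()
    device = host.to("cuda", non_blocking=True)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    print(f"Host-to-device bandwidth: {SIZE_GB / elapsed:.1f} GB/s")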

Together, IBM Spectrum Scale and IBM Power Systems technologies bring the innovation needed to build efficient, optimized data lake infrastructure for new-age workloads.

With constant pressure to optimize IT costs, large investments such as data lakes are under scrutiny, and alternate architectures and optimization options are being evaluated. Building a data lake requires an end-to-end approach, from infrastructure to software stack. IBM Spectrum Scale and IBM Power Systems provide a strong infrastructure alternative for clients looking to build and optimize their data lake solutions.

IBM Systems Lab Services offers a wide range of services on Cognitive Solutions. If you’re interested in talking to Lab Services about your data lake infrastructure optimization, contact us today.
