The incredible shrinking data center

17 May, 2017
Dr. Vincent Natoli
Stone Ridge Technology

What if you could shrink your data center to a tenth of its size while increasing the performance of your most critical number-crunching business applications? A few weeks ago, Stone Ridge Technology and IBM demonstrated just that when they announced the results of a massive simulation run on 30 IBM Power Systems NVLink servers housing 120 NVIDIA Tesla P100 GPUs. Stone Ridge used ECHELON, its commercial reservoir simulator, and the IBM cluster to run a one billion cell model in 1.5 hours. For context, previously published benchmarks on models of this size required on the order of 500 server nodes and took 20 hours.

Reservoir simulation: Compute bits guiding drill bits

Reservoir simulators are number-crunching technical software applications that energy companies use to evaluate different development strategies for their oil fields. Simulators let them explore many "what-if" scenarios and find the strategy that optimizes cost and minimizes environmental impact before a drill bit ever hits the earth. Drilling wells is expensive, up to $10 million on land and $100 million offshore, so the cost savings from simulation are critical.

In reservoir simulation, a petroleum reservoir is treated as a collection of many small three-dimensional cubes, much as pixels make up a two-dimensional screen. Those cubes are referred to as "cells". Engineers are interested in the physical properties of each cell, such as pressure and the amounts of oil, gas and water it contains. These and many other properties change from cell to cell and from moment to moment as the fluids flow underground. The physics of this time evolution is captured mathematically and implemented in software like ECHELON.
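
To make the idea of cells and per-cell properties concrete, here is a minimal, hypothetical sketch of how such a grid might be stored and updated on a GPU. The grid size, the property name and the simple diffusion-style update are illustrative assumptions only; they are not ECHELON's actual data structures or numerics.

```
#include <cuda_runtime.h>
#include <cstdio>

// Toy grid dimensions (illustrative only; real models reach a billion cells).
constexpr int NX = 64, NY = 64, NZ = 64;
constexpr int NCELLS = NX * NY * NZ;

__host__ __device__ inline int idx(int i, int j, int k) {
    return (k * NY + j) * NX + i;   // flatten the 3D cell index
}

// One array per physical property (pressure here), one value per cell.
// A simplistic, diffusion-like update between neighboring cells; a real
// simulator solves coupled, nonlinear flow equations instead.
__global__ void update_pressure(const double* p_old, double* p_new, double alpha) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i <= 0 || j <= 0 || k <= 0 || i >= NX - 1 || j >= NY - 1 || k >= NZ - 1) return;

    double lap = p_old[idx(i-1,j,k)] + p_old[idx(i+1,j,k)]
               + p_old[idx(i,j-1,k)] + p_old[idx(i,j+1,k)]
               + p_old[idx(i,j,k-1)] + p_old[idx(i,j,k+1)]
               - 6.0 * p_old[idx(i,j,k)];
    p_new[idx(i,j,k)] = p_old[idx(i,j,k)] + alpha * lap;
}

int main() {
    double *p_old, *p_new;
    cudaMallocManaged(&p_old, NCELLS * sizeof(double));
    cudaMallocManaged(&p_new, NCELLS * sizeof(double));
    for (int c = 0; c < NCELLS; ++c) p_old[c] = 0.0;
    p_old[idx(NX/2, NY/2, NZ/2)] = 1.0;            // perturb a single cell

    dim3 block(8, 8, 8), grid(NX/8, NY/8, NZ/8);
    update_pressure<<<grid, block>>>(p_old, p_new, 0.1);
    cudaDeviceSynchronize();
    printf("center cell after one step: %f\n", p_new[idx(NX/2, NY/2, NZ/2)]);
    cudaFree(p_old); cudaFree(p_new);
    return 0;
}
```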

Typical reservoir models used in the industry range in size from a few hundred thousand to a few million cells. If we think of one of these models as a standard HD television, then the billion-cell example would have 250 times the resolution of a 4K television. That resolution provides enormous clarity, but it comes at a high computational cost. Figure 1 shows the reservoir model simulated by Stone Ridge and IBM. The simulator solves billions of equations over and over again as it marches forward in time, and each solve yields a billion values, one for each of the billion cells. It is a very time-consuming calculation, and previous attempts took upwards of 20 hours to simulate 60 years of production. That's where GPUs, IBM Power and ECHELON come in.

Figure 1: One-billion-cell Bactrian model simulated by Stone Ridge and IBM. Color indicates variation in porosity.
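
As a rough picture of what "marching forward in time" involves, the skeleton below sketches the outer loop of an implicit simulator: at every time step the flow equations are assembled and a large linear system is solved to update every cell. The function names, step size and iteration count are placeholders, not ECHELON's implementation.

```
#include <cstdio>

// Hypothetical stand-ins for the work done every time step; in a GPU simulator
// each of these would launch kernels that touch all of the model's cells.
void assemble_equations()  { /* build the flow equations for every cell         */ }
void solve_linear_system() { /* e.g. a preconditioned iterative linear solve    */ }
void update_cell_state()   { /* apply the solution to pressure, saturations ... */ }

int main() {
    const double years = 60.0;   // simulate decades of production
    const double dt    = 0.05;   // time step in years (illustrative)

    for (double t = 0.0; t < years; t += dt) {
        // A handful of nonlinear (Newton-style) iterations per step, each of
        // which requires a full linear solve producing one value per cell.
        for (int it = 0; it < 8; ++it) {
            assemble_equations();
            solve_linear_system();
            update_cell_state();
        }
    }
    printf("simulated %.0f years of production\n", years);
    return 0;
}
```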

GPUs, IBM Power and ECHELON: A perfect storm of computing awesomeness

Over the last decade, GPUs have grown steadily more capable than CPUs for compute-intensive workloads; today, a state-of-the-art GPU outperforms a state-of-the-art CPU by roughly a factor of 10 on these workloads. ECHELON is unique in that it is the only reservoir simulator that runs all of its numerically intensive calculations on GPUs, and because of this its performance scales nearly linearly with GPU capability.

When a reservoir model gets big, on the order of a billion cells, it is too large for a single GPU and must be spread across multiple GPUs, and those GPUs need to communicate with each other. This is where the IBM Power Systems NVLink platform brings everything together: each NVLink server node houses four GPUs connected by a very high bandwidth data highway, and it is the only commercial server that offers this fast interconnect between the GPUs and the server's CPU.
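
To illustrate the kind of GPU-to-GPU traffic involved, here is a minimal sketch of two GPUs exchanging a boundary ("halo") layer of cells with a CUDA peer-to-peer copy, the sort of transfer that benefits from a fast interconnect such as NVLink. The partitioning and buffer sizes are assumptions; a production simulator would typically overlap these transfers with computation and use MPI for exchanges between nodes.

```
#include <cuda_runtime.h>
#include <cstdio>

// Each GPU owns a slab of the cell grid plus a "halo" layer copied from its neighbor.
constexpr int HALO_CELLS = 64 * 64;   // one face of a 64 x 64 x N slab (illustrative size)

int main() {
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);
    if (ngpus < 2) { printf("this sketch needs at least 2 GPUs\n"); return 0; }

    // Enable direct GPU-to-GPU access; over NVLink such copies bypass host memory.
    cudaSetDevice(0); cudaDeviceEnablePeerAccess(1, 0);
    cudaSetDevice(1); cudaDeviceEnablePeerAccess(0, 0);

    double *face0 = nullptr, *halo1 = nullptr;
    cudaSetDevice(0); cudaMalloc(&face0, HALO_CELLS * sizeof(double)); // boundary cells owned by GPU 0
    cudaSetDevice(1); cudaMalloc(&halo1, HALO_CELLS * sizeof(double)); // ghost copy held by GPU 1

    // After each solver iteration, neighbors swap boundary layers so the next
    // iteration sees up-to-date values across the partition boundary.
    cudaMemcpyPeer(halo1, 1, face0, 0, HALO_CELLS * sizeof(double));
    cudaDeviceSynchronize();

    printf("exchanged %d halo cells between GPU 0 and GPU 1\n", HALO_CELLS);
    cudaFree(face0);
    cudaFree(halo1);
    return 0;
}
```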

The incredible shrinking data center

Each IBM Power Systems NVLink server delivers four NVIDIA Tesla P100 GPUs and is so computationally powerful that it takes about 18 standard x86 nodes to match its performance (see Figure 2). Because the nodes are so dense and because ECHELON can efficiently harness the power of their GPUs, Stone Ridge and IBM were able to execute this billion-cell calculation in 1.5 hours on just 30 servers. In short, fewer server nodes were used to achieve faster results. Even factoring in the additional cost of the GPUs, this amounts to nearly a 75 percent reduction in hardware costs. ECHELON can simulate up to 32 million cells on a single IBM Power NVLink server node, delivering startlingly fast performance to an industry hungry for new ways to drive down costs.


Figure 2: It takes 18 standard x86 servers to match the 2.88 TB/s aggregate memory bandwidth of a single IBM Power Systems NVLink server node. ECHELON can simulate up to 32 million cells on a single node.
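
The 18-to-1 figure follows from simple memory-bandwidth arithmetic, assuming the commonly quoted numbers for this generation of hardware: each Tesla P100 provides roughly 720 GB/s of memory bandwidth, so the four GPUs in a node supply about 4 × 720 GB/s = 2.88 TB/s, while a typical two-socket x86 node of the same era offers on the order of 160 GB/s of DDR4 bandwidth, and 2,880 / 160 = 18. For memory-bandwidth-bound workloads, which is how Figure 2 frames the comparison, that ratio maps almost directly onto node count.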

Reservoir simulation is mathematically and algorithmically similar to other business-critical engineering and scientific simulation disciplines such as computational fluid dynamics, climate modeling and structural dynamics. Not long ago, conventional wisdom held that applications like these, with hundreds of diverse computational kernels, were too complicated for GPUs. ECHELON, IBM Power and the NVIDIA Tesla P100 refute that argument by demonstration. They show clearly that very large performance and efficiency gains can be realized by designing software from inception around the most advanced algorithms and state-of-the-art GPUs and hardware infrastructure.

Prime the pump with IBM Power and NVIDIA NVLink

GPUs connected to Power processors with NVIDIA NVLink have immense potential for tackling data-intensive challenges. You can read more about other popular supported applications here. In addition, you can see how IBM's NVLink server is well suited to distributed computing in large HPC clusters by reading this whitepaper.

What other workloads do you think you can address with NVLink and GPUs on Power? Let us know in the comments below.
