Following the initial rise of Hadoop, data teams across industries have adopted Apache Spark as the go-to framework for distributed big data processing. The open-source platform has largely replaced Hadoop’s MapReduce by enabling faster in-memory processing of datasets and handling use cases that Hadoop could not manage. Spark also offers more accessible APIs and solid fault tolerance.
However, with the amount of data in the world predicted to grow to 221 zettabytes by 2026, it’s difficult for organizations to get a grip on the information they have. At current processing speeds, companies will face latency in business applications like analytics, and scaling up CPU capacity to cut that latency drives up costs.
That’s why teams should look at the option of accelerating Spark with GPUs, via Rapids, said Sameer Raheja, senior director of engineering at Nvidia, at the ongoing GTC 2023 conference.
GPU-accelerated Apache Spark
To handle future data demands with Spark, Raheja suggested running the framework on Nvidia GPUs. The Rapids Accelerator for Apache Spark plugin jar, he said, lets Spark batch processing run on GPUs without any code changes.
This, he said, will not only enable teams to run massive data jobs faster and at a lower cost than is possible with CPUs, but will also drive power savings.
Rapids Accelerator for Apache Spark combines the power of the Rapids cuDF library with the scale of the Spark distributed computing framework. The library also includes a built-in accelerated shuffle based on UCX that can be configured to leverage GPU-to-GPU communication and remote direct memory access (RDMA).
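As a concrete sketch of the "no code changes" claim, enabling the plugin happens entirely at submit time. The jar version, cluster master and resource amounts below are illustrative placeholders, not values from the article; the plugin class name follows the Rapids Accelerator documentation.

```shell
# Hypothetical sketch: turning on the Rapids Accelerator for an existing
# Spark batch job at submit time -- the application code itself is untouched.
# Jar filename, master URL and resource amounts are placeholders to adapt.
spark-submit \
  --master yarn \
  --jars rapids-4-spark_2.12-<version>.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=0.25 \
  existing_batch_job.py
```

The UCX-based accelerated shuffle mentioned above needs additional, version-specific shuffle-manager settings on top of this; the plugin’s documentation covers those.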
Using the Nvidia decision support benchmark, an adaptation of the industry-standard TPC-DS benchmark with 100 modified queries, the company compared a Rapids-based, GPU-accelerated Google Cloud Dataproc Spark distribution with a CPU-based one. The GPU nodes completed a power run of all 100 queries in just 31 minutes, versus 176 minutes for the CPU nodes.
Since the GPU run took less time, it also cost less: just $7.20, versus $32.52 for the CPU run. The GPU run was also five times more power-efficient.
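The ratios follow directly from the figures above; a quick derivation, using only the numbers quoted in this article:

```shell
# Derive the speedup and cost ratio from the benchmark figures quoted above
# (31 min / $7.20 for the GPU run vs. 176 min / $32.52 for the CPU run).
awk 'BEGIN {
  printf "speedup: %.1fx\n", 176 / 31          # roughly 5.7x faster
  printf "cost ratio: %.1fx\n", 32.52 / 7.20   # roughly 4.5x cheaper
}'
```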
“For anyone who’s running big data workloads and managing a budget … performance, cost and efficiency are key factors, and Rapids Accelerator for Spark addresses all three,” Raheja emphasized.
He added that similar benchmark results were seen on other clouds and Spark distributions with configurations closely matching Dataproc’s. For example, a Rapids-accelerated AWS EMR distribution saw 42% cost savings, while AWS Databricks Photon and Azure Databricks Photon delivered 39% and 34% cost savings, respectively.
How it works
The key to these benefits is Apache Spark 3, which brings columnar processing and resource-aware scheduling of custom resources. This lets teams schedule tasks on accelerator resources like GPUs.
“You can continue to write your application in the APIs you’re familiar with — SQL, Python, R, Java and Scala. Spark provides distributed and scale-up compute power; Spark 3.x provides resource-aware scheduling; and the Rapids Accelerator for Apache Spark plugin provides transparency for applications to run on Nvidia GPUs, enabling acceleration in cooperation with [the] Spark core engine’s built-in processor,” Raheja said.
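The resource-aware scheduling Raheja describes is driven by Spark 3.x resource properties. A minimal sketch, assuming the example GPU discovery script that ships with Apache Spark (the install path is an assumption to adjust):

```shell
# Illustrative Spark 3.x accelerator-aware scheduling settings.
# The discovery script path assumes a standard Spark install location.
spark-submit \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.executor.resource.gpu.discoveryScript=/opt/spark/examples/src/main/scripts/getGpusResources.sh \
  --conf spark.task.resource.gpu.amount=0.5 \
  my_app.py
```

Here the discovery script tells Spark which GPU addresses each executor can claim, and a task amount of 0.5 lets two tasks share each executor GPU.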
Currently, the Rapids Spark accelerator is built into Amazon EMR, Cloudera CDP, Databricks ML Runtime, Azure Synapse Analytics and Google Cloud Dataproc, and is available for open-source Apache Spark 3.x distributions, whether on-premises or in the cloud.
The 2023 Nvidia GTC event runs through March 23.