Comparing performance of Spark DataFrames API to Spark RDD

At adsquare we use state of the art, big data technology to tackle the huge amount of mobile advertising data that we need to service, and store for off-line analysis. In this post the performance of the new Apache Spark DataFrames API is compared with the standard Spark RDD API using real data. The benchmarks provide interesting insights into which kind of operations are more suitable for the DataFrames API.

1. adsquare’s data challenge
At adsquare we handle huge influx of data – that is redirected into two data stores – 1) A real-time Spark-streaming aggregation pipeline that sends a subset of data to a Cassandra cluster, & 2) a KafkaFlume pipeline that packs all geo-data into parquet-files (columnar scheme) on a HDFS.

On average more than half a billion of geo-events flow in every day, which corresponds to almost 1 TB of AVRO serialized data. On HDFS this amounts to around 120 GB of compressed Parquet files per day.

For analyses that truly require large data sets – we use Apache Spark, which provides very efficient, fast, distributed, in-memory analytics capabilities. At adsquare we use the standard RDD framework of Spark in the streaming pipeline, as well as in off-line analysis. In addition, for exploratory analysis, we also used the SparkSQL API. Interesting features of the new Sparks DataFrames API, such as SQL table-like functionalities, compatibility with Python Pandas DataFrames (which we regularly use), and the distributed queries, were a hit with the Data Scientist and Engineers at adsquare. We have enthusiastically started to try our hands on this.

2. What is a Spark DataFrame?
Spark provides a number of different analysis approaches on a cluster environment. The DataFrame API introduced in version 1.3 provides a table-like abstraction (in the way of named columns) for storing data in-memory, and provides a mechanism for distributed SQL engine. The ability to create simple SQL like queries, and an API that provides function calls – such as select, filter, and groupBy – is especially useful for exploratory analysis. More ever, the ease of use also lets more Business Intelligence professionals to get their hands dirty playing with big-data silos on non-SQL stores like HDFS.

3. Comparing DataFrames to RDD API
Though SQL-like query engines on non-SQL data stores is not a new concept (c.f., Hive, Shark, etc.), we were intrigued by the reports that the optimizations built into the DataFrames make it comparable in speed to the usual Spark RDD API, which in turn is well known to be much faster than Hadoop map-reduce.

I decided to put these new technologies to our own tests and run it on a sample of moderate-sized real-world data.

3.1 Data Model
Before I explain the test cases and share my results – a few points about our data model and read-out scheme are in order. Firstly, the decision to store data in parquet files on HDFS predated the emergence of Spark DataFrames. It was primarily chosen to enable fast read-out of only a subset of fields in the files. Nevertheless, it is a lucky coincidence that the named-column abstraction in DataFrames uses parquet files as its default format for on-disk data. Secondly, for loading the data into regular RDDs – we use AVRO serialization framework. The choice to use AVRO for abstracting our data model, was due to its rich set of features such as – support for schema evolution, compression mechanism, ‘splittability’, multi-language support, and a good code generation mechanism (small things to make our lives a bit easier).

3.2 Building the Tests
In the following tests, for the sake of simplicity, I selected only three atomic fields, viz., created-time, Publisher ID, and App ID, out of more than 30 fields (some of which are complex). Furthermore, I chose enough data (around 120 Million events) to require around a minute of run-time on our cluster, while ensuring that the result sets were small enough to allow collect() to be called (see Table for details). In the RDD version schema projection and predicate pushdown are used explicitly. In DataFrames a select is used to restrict the fields read, and the optimizations come entirely out of the box with the DataFrames Catalyst optimizer.

So without much ado let me define the aggregations done in the different test cases we constructed, and present the result of our benchmarks:


3.3 Results
The elapsed-times were calculated by averaging over multiple iterations, and the error bars are 1 standard deviation. At 1 standard-deviation level there is a significant difference in the performance between the DF and the RDD API for all test cases.

Among these, only in Case 2, the RDD implementation outperforms the DataFrames implementation. For Cases 3, and 4 the DataFrames version is clearly much faster than the RDD.

Especially interesting are:

Case2: Creation of a new column in DF seems to be slower than a corresponding map to transform a field in the RDD.

Case 4: In RDD, map(…) makes a tuple out of two atomic fields. In DataFrames, a GroupBy on 2 fields seems very fast in comparison. Note that the events with null values in the App ID field are removed by the filter(…).

4. Concluding remarks
It is very interesting to see that the new DataFrame API is faster than the RDD API for simple grouping and aggregations. Case 2, however shows that when a complex operation requires a new named-column to be generated – the RDD API wins, but not by much. More complex test, perhaps in a follow-up post, could show where plain RDDs are clearly better than DataFrames. For exploratory analysis, and generating aggregated statistics on large datasets, the DataFrame API is indeed very easy to use and also faster.

5. Scala code sample
View the full Scala code sample

6. Cluster Hardware and run-time resources allocated to Spark
The Spark Cluster resources used were:
Deploy mode: Yarn, client
Executors: 3
Executors Cores: 2
Driver Memory: 3 GB

Acknowledgements: I would like to thank my colleagues at adsquare – in particular Olaf Krische, Senior Developer, and Matthias Kirmse, Data Scientist, for their useful contributions to this post.

About adsquare
adsquare is Europe’s leading platform for mobile audience data. We leverage consumers’ local context and mobile behaviour for programmatic advertising and help advertisers and agencies to pinpoint their target group to make ads more relevant. Through partnerships with ad networks, publishers and global exchange platforms, adsquare offers clients access to audiences segments and attributes at scale. Our platform works completely without cookies and has been awarded the ePrivacyseal for complying with strict European data protection laws. Our biggest asset is infrastructure, and we’re always on the look out for new tech talent. Click here to check out our current openings.

Back to the blog