Leveraging Spark for Simple Mapping

This post discusses how the standalone Apache Spark framework can be leveraged to batch-process large files. The specific use case adds two columns to a large flat file.

Preparation

Generate Rows

To begin the use case, tRowGenerator was used to generate 100,000,000 rows using the schema below.

Leveraging Spark for Simple Mapping - tRowGenerator

A simple tMap was added to convert the date formats, and a tFileOutputDelimited to output the 100,000,000 rows to a flat CSV file.

Leveraging Spark for Simple Mapping - Job 1

The generated file was 6.10 GB.
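Outside of Talend, this preparation step can be sketched in plain Java. This is only an illustrative sketch: the real schema and output path come from the tRowGenerator and tFileOutputDelimited settings shown above, so the columns (id, name, eventDate), the date format, and the path C:/data/generated.csv are assumptions.

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Random;

public class GenerateRows {
    public static void main(String[] args) throws Exception {
        // Target date format after the tMap conversion; the format itself is an assumption.
        SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        Random random = new Random();
        try (BufferedWriter writer = new BufferedWriter(new FileWriter("C:/data/generated.csv"))) {
            for (long i = 0; i < 100_000_000L; i++) {
                // Assumed schema: id;name;eventDate
                String row = i + ";name_" + random.nextInt(1000) + ";" + dateFormat.format(new Date());
                writer.write(row);
                writer.newLine();
            }
        }
    }
}
```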

Prepare Jobs

There were three approaches tested for the use case:

  1. Local/Java – Use a standard job to add two columns from a local flat file and output to a local flat file.
  2. Local/Spark – Use a Big Data Batch job with the Spark standalone framework to add two columns from a local flat file and output to a local flat file.
  3. Local/S3/Spark – Use a Big Data Batch job with the Spark standalone framework to add two columns from a flat file in an S3 bucket and output a flat file back to an S3 bucket.

The mapping for each of these jobs added two columns and did not make any other transformations:

Leveraging Spark for Simple Mapping - tMap
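Conceptually, the tMap does nothing more than pass every incoming field through and append two new fields to each record. A minimal sketch of that mapping in Java is below; the two added columns (a load timestamp and a constant batch flag) are assumptions, since the actual columns are defined in the tMap shown above.

```java
public class AddTwoColumns {

    // tMap equivalent: keep the incoming record as-is and append two new columns.
    // The columns below are assumptions; the real ones are defined in the tMap screenshot.
    public static String addColumns(String line) {
        String loadTimestamp = java.time.LocalDateTime.now().toString();
        String batchFlag = "BATCH";
        return line + ";" + loadTimestamp + ";" + batchFlag;
    }
}
```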

Local/Java

In the standard job, tFileInputDelimited, tMap, and tFileOutputDelimited components were used:

Leveraging Spark for Simple Mapping - Job 2
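A rough plain-Java equivalent of this standard job is sketched below: read the delimited file line by line, apply the same two-column mapping, and write the result back out. The file paths and the appended columns are assumptions carried over from the earlier sketches.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;

public class LocalJavaJob {
    public static void main(String[] args) throws Exception {
        // Paths are assumptions; the job reads the 6.10 GB file generated earlier.
        try (BufferedReader reader = new BufferedReader(new FileReader("C:/data/generated.csv"));
             BufferedWriter writer = new BufferedWriter(new FileWriter("C:/data/output_java.csv"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Same mapping as the tMap sketch: append the two assumed columns.
                writer.write(line + ";" + java.time.LocalDateTime.now() + ";BATCH");
                writer.newLine();
            }
        }
    }
}
```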

Local/Spark

The Big Data Batch job with the Spark standalone framework processing the file locally was simply duplicated from the standard job:

Leveraging Spark for Simple Mapping - Duplicate Dialog
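Conceptually, this Spark job boils down to reading the file into an RDD, applying the mapping to each line, and writing the result. A minimal hand-written sketch using the Spark 1.4 Java API in local mode is shown below; the paths and the two appended columns are again assumptions, not the values from the Talend job.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class LocalSparkJob {
    public static void main(String[] args) {
        // local[*] runs standalone Spark inside the local JVM, using all available cores.
        SparkConf conf = new SparkConf().setAppName("AddTwoColumnsSpark").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        sc.textFile("file:///C:/data/generated.csv")                           // assumed input path
          .map(line -> line + ";" + java.time.LocalDateTime.now() + ";BATCH")  // same two assumed columns
          .saveAsTextFile("file:///C:/data/output_spark");                     // Spark writes a directory of part files

        sc.stop();
    }
}
```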

Local/S3/Spark

Lastly, in the Big Data Batch job using the Spark standalone framework, a connection was made to the S3 bucket using the tS3Configuration component:

Leveraging Spark for Simple Mapping - S3 Job
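tS3Configuration supplies the S3 credentials and bucket details to the job. A rough hand-written equivalent of the same idea, using the Hadoop s3n filesystem properties, is sketched below; the bucket name and credentials are placeholders, and the sketch assumes the Hadoop S3 connector is on the classpath.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class S3SparkJob {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("AddTwoColumnsS3").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Roughly what tS3Configuration supplies: credentials for the Hadoop S3 filesystem.
        // Bucket name and credential values are placeholders.
        sc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
        sc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");

        sc.textFile("s3n://my-bucket/input/generated.csv")
          .map(line -> line + ";" + java.time.LocalDateTime.now() + ";BATCH")
          .saveAsTextFile("s3n://my-bucket/output/with_columns");

        sc.stop();
    }
}
```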

Environment

The environment used in this test was a single AWS EC2 m3.2xlarge instance with 8 vCPUs (Intel Xeon E5-2670), 30 GB of memory, and 2×80 GB SSDs, running Talend Data Fabric 6.1.1 on Windows Server 2012.

The Spark 1.4 framework was used in local mode, and the tuning parameters were set as follows:

Leveraging Spark for Simple Mapping - Spark Config
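For reference, local-mode settings of this kind can also be expressed directly on a SparkConf. The sketch below is illustrative only: the actual values used in the test are the ones in the configuration above, and the property shown here is an assumption.

```java
import org.apache.spark.SparkConf;

public class SparkLocalTuning {
    public static void main(String[] args) {
        // Illustrative only: the actual tuning values are the ones in the configuration above.
        SparkConf conf = new SparkConf()
                .setAppName("AddTwoColumnsSpark")
                .setMaster("local[8]")                      // one worker thread per vCPU on the m3.2xlarge
                .set("spark.driver.maxResultSize", "1g");   // example property; value is an assumption
        // In local mode the driver and executors share one JVM, so heap size is governed by
        // that JVM's -Xmx setting rather than spark.executor.memory.
        System.out.println(conf.toDebugString());
    }
}
```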

Results

Two metrics were captured from the jobs: processing time and rows processed per second. The Local/Java job finished in 17.56 minutes at 94,895 rows per second. Once Spark was introduced, processing times improved and, with them, the rows processed per second. Leveraging Spark with local files produced the best result: 10.7 minutes at 155,085 rows per second.

Leveraging Spark for Simple Mapping - Rows Processed per Second (higher is better)

Leveraging Spark for Simple Mapping - Time to Process (lower is better)

Conclusion

Even when reading from and writing to S3, the Spark job finished faster than processing the file locally without Spark. Once Spark was applied to a local file, it outperformed the other two jobs. The process could have been faster still with a Hadoop cluster running Spark to distribute the processing across nodes. Even without a Hadoop cluster, leveraging standalone Spark improves the performance of processing large files.
