This post discusses how the standalone Apache Spark framework can be leveraged to process large files in batch. The specific use case examined here adds two columns to a large flat file.
To begin the use case, tRowGenerator was used to generate 100,000,000 rows using the schema below.
A simple tMap was added to convert the date formats, and a tFileOutputDelimited to output the 100,000,000 rows to a flat CSV file.
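The tMap date conversion amounts to a simple per-row reformat. A minimal sketch in plain Python follows; the input and output date formats are assumptions, since the original schema is not reproduced here:

```python
from datetime import datetime

def reformat_date(value, in_fmt="%d-%m-%Y", out_fmt="%Y-%m-%d"):
    """Reformat one date string. Both format strings are assumed,
    not taken from the original tRowGenerator schema."""
    return datetime.strptime(value, in_fmt).strftime(out_fmt)
```

In the Talend job this logic lives inside a tMap expression; the sketch only illustrates the shape of the transformation.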
The generated file had a file size of 6.10 GB.
There were three approaches tested for the use case:
- Local/Java – Use a standard job to add two columns from a local flat file and output to a local flat file.
- Local/Spark – Use a Big Data Batch job with the Spark standalone framework to add two columns from a local flat file and output to a local flat file.
- Local/S3/Spark – Use a Big Data Batch job with the Spark standalone framework to add two columns from a flat file in an S3 bucket and output a flat file back to an S3 bucket.
The mapping for each of these jobs added two columns and made no other transformations.
In the standard job, tFileInputDelimited, tMap, and tFileOutputDelimited components were used.
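Outside Talend, the standard-job pipeline (read a delimited file, map each row, write a delimited file) can be sketched as a streaming transformation in Python. The added column names and their derivations are hypothetical, since the post does not specify how the two columns were computed:

```python
import csv

def add_two_columns(reader, writer, new_cols=("col_a", "col_b")):
    """Stream rows from reader to writer, appending two derived columns.
    Mirrors tFileInputDelimited -> tMap -> tFileOutputDelimited.
    The derivations below are illustrative only."""
    rows = csv.reader(reader)
    out = csv.writer(writer)
    header = next(rows)
    out.writerow(header + list(new_cols))
    for row in rows:
        # Hypothetical derivations: uppercase of the first field,
        # and the original field count.
        out.writerow(row + [row[0].upper(), len(row)])
```

Because rows are streamed rather than loaded into memory, this pattern handles a 6 GB file in constant memory, which is essentially what the single-threaded Local/Java job does.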
The Big Data Batch job processing the file locally with the Spark standalone framework was simply duplicated from the standard job.
Lastly, in the Big Data Batch job reading from and writing to S3, a connection was made to the bucket using the tS3Configuration component.
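Under the hood, tS3Configuration supplies the s3a connection properties to Spark. Outside Talend, an equivalent setup might look like the following spark-submit invocation; the bucket, key paths, credentials, and script name are all placeholders:

```shell
# Hypothetical equivalent of tS3Configuration for a standalone Spark job:
# pass s3a credentials via Hadoop configuration properties, and point the
# input/output paths at s3a:// URIs instead of the local filesystem.
spark-submit \
  --master "local[8]" \
  --conf spark.hadoop.fs.s3a.access.key=YOUR_ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.secret.key=YOUR_SECRET_KEY \
  process_file.py s3a://your-bucket/input.csv s3a://your-bucket/output/
```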
The environment used in this test was a single AWS EC2 m3.2xlarge instance with 8 vCPUs (Intel Xeon E5-2670), 30 GB of memory, and 2×80 GB SSDs, running Talend Data Fabric 6.1.1 on Windows Server 2012.
The Spark 1.4 framework was used in local mode, and the tuning parameters were set as follows:
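The exact tuning values used in the test do not survive in text form. For a machine of this size, a representative local-mode configuration might look like the following; the values are illustrative, not the ones from the original job:

```
# Illustrative local-mode tuning for an 8-vCPU / 30 GB instance.
# In local mode all executors run inside the driver process, so the
# driver memory setting governs the whole job.
spark.master         local[8]
spark.driver.memory  20g
```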
Two metrics were captured from the jobs: time to process and rows processed per second. The Local/Java job finished in a mediocre 17.56 minutes, processing 94,895 rows per second. Once Spark was introduced, the timings improved and, with them, the rows processed per second. Leveraging Spark with local files delivered a solid time of 10.7 minutes at 155,085 rows per second.
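The two metrics are linked by simple arithmetic, which makes for an easy sanity check; the small gaps between computed and reported throughput come from rounding in the reported times:

```python
def rows_per_second(total_rows, minutes):
    """Throughput implied by a wall-clock time given in minutes."""
    return total_rows / (minutes * 60)

# 100,000,000 rows in 17.56 minutes is roughly 94,900 rows/s,
# consistent with the reported 94,895 for the Local/Java job.
local_java = rows_per_second(100_000_000, 17.56)

# 100,000,000 rows in 10.7 minutes is roughly 155,800 rows/s,
# consistent with the reported 155,085 for the Local/Spark job.
local_spark = rows_per_second(100_000_000, 10.7)
```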
Rows Processed (per second): the more rows processed per second, the better.
Time to Process: the lower the time, the better.
Even when reading the file from S3, the Spark job beat the non-Spark local job on time. Once Spark was used on a local file, however, it outperformed the other two jobs. Processing could have been faster still on a Hadoop cluster running Spark, which would distribute the work across the cluster's nodes. Even without a Hadoop cluster, leveraging standalone Spark increases the performance of processing large files.