This brief will outline volume and scalability testing beginning with a classic ETL design and then leveraging the MapReduce and Spark frameworks natively with Talend.
In the classic ETL design, a tab-separated value file was parsed, processed against a lookup file, and a simple transformation was applied. The source file was picked up from either an S3 bucket or an on-premises SFTP server. The main dataset contained over 4 million rows at 39 GB, and the lookup contained 35 million rows at 6.7 GB.
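The classic design can be sketched in plain Python. The file layouts, column positions, and the upper-casing transformation below are hypothetical stand-ins, since the brief does not show the actual job's schema.

```python
import csv

def build_lookup(lookup_path):
    """Load the lookup file into a dict keyed on its first column.
    (Column positions are hypothetical; the real job's schema is not shown.)"""
    lookup = {}
    with open(lookup_path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            lookup[row[0]] = row[1]
    return lookup

def run_etl(source_path, lookup_path, out_path):
    """Parse the tab-separated source, enrich each row from the lookup,
    and apply a simple transformation (upper-casing one field here)."""
    lookup = build_lookup(lookup_path)
    with open(source_path, newline="") as src, \
         open(out_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        for row in csv.reader(src, delimiter="\t"):
            enriched = lookup.get(row[0], "UNKNOWN")
            writer.writerow([row[0], row[1].upper(), enriched])
```

At the benchmark's volumes the lookup alone is 6.7 GB, so holding it in a single process's memory is the bottleneck this brief goes on to address with distribution and Spark.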
The environment used in this benchmark was a single AWS EC2 m4.10xlarge instance with 40 vCPU, 160 GB of memory and EBS-Optimized storage.
Overall, parallelization alone did not make a drastic impact on performance. Distributing the processing improved results, but Spark delivered the best performance.
| Execution Mode | Observed Run Times |
|---|---|
| Standard | 13 min, 13 min, 16 min, 20 min, 22.3 min |
| Standard with Parallelization | 14 min, 16 min, 17 min, 18 min, 18 min |
| Standard with Distributed Execution | 7.15 min, 7.26 min, 7.6 min, 8 min, 8.4 min |
| Local Spark | 3 min |
- Standard with Parallelization – Scaling vertically is only efficient up to a certain data volume, and is highly dependent on how the data is distributed and what type of computations are required.
- Standard with Distributed Processing – Performance can be substantially improved if multiple machines can be leveraged into a grid and work off a shared location.
- Local Spark – Without any additional machines or a Spark cluster, the library is simply embedded in the job and can be deployed locally.
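The distributed-execution idea above — independent workers each processing a partition of the source against a shared lookup — can be approximated on one machine with a thread pool. The lookup contents and partitioning scheme below are hypothetical illustrations, not Talend's actual grid mechanism.

```python
from multiprocessing.pool import ThreadPool

# Hypothetical shared lookup; in the benchmark this was a 35-million-row file
# on a shared location that every grid node could read.
LOOKUP = {"k1": "enrich1", "k2": "enrich2"}

def process_partition(rows):
    """Enrich and transform one partition of (key, value) source rows."""
    return [(key, value.upper(), LOOKUP.get(key, "UNKNOWN"))
            for key, value in rows]

def run_distributed(rows, workers=4):
    """Split the input into roughly equal partitions and process them in
    parallel; pool.map preserves partition order, so output order matches
    input order."""
    chunk = max(1, len(rows) // workers)
    partitions = [rows[i:i + chunk] for i in range(0, len(rows), chunk)]
    with ThreadPool(workers) as pool:
        results = pool.map(process_partition, partitions)
    return [row for part in results for row in part]
```

The thread pool here stands in for grid nodes: each worker only needs its own slice of the source plus read access to the shared lookup, which is what makes the workload partitionable in the first place.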
The MapReduce/Spark design began as the classic ETL design and was refactored into each framework with just a few clicks in Talend. The full dataset used in this design contained 10.5 million rows at 101 GB.
The environment used in this benchmark was a four-node AWS EMR cluster of r3.2xlarge instances. Each instance had 8 vCPU, 61 GB of memory, and 160 GB of SSD storage.
Testing the initial 4-million-row/39 GB dataset from the classic ETL design with MapReduce and Spark gave respectable results on the EMR cluster.
| Rows Processed | 4,045,881 (~39 GB) | 6,468,093 (~62 GB) | 10,513,974 (~101 GB) |
|---|---|---|---|
| MapReduce | 6 min | 8 min | 14 min |
| Spark | 4 min | 5 min | 9 min |
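A quick throughput calculation over these figures shows why Spark wins: both frameworks scale roughly linearly with volume, but Spark sustains a higher GB-per-minute rate throughout. The numbers below come directly from the benchmark results.

```python
# (size in GB, minutes) pairs taken from the benchmark table above.
RESULTS = {
    "MapReduce": [(39, 6), (62, 8), (101, 14)],
    "Spark": [(39, 4), (62, 5), (101, 9)],
}

def throughput(results):
    """Return GB processed per minute for each run, rounded to one decimal."""
    return {name: [round(gb / minutes, 1) for gb, minutes in runs]
            for name, runs in results.items()}

# throughput(RESULTS)["MapReduce"] → [6.5, 7.8, 7.2]
# throughput(RESULTS)["Spark"]     → [9.8, 12.4, 11.2]
```

The per-run rates stay in a narrow band for each framework even as the volume grows 2.6x, which is the near-linear scaling behavior the results suggest.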
Without requiring MapReduce or Spark skills, a classic ETL design was refactored in Talend to leverage these powerful frameworks. Should additional nodes be added to the cluster, Talend can easily scale due to its native Hadoop support.