Horizontal and Vertical Scalability Brief

This brief outlines volume and scalability testing, beginning with a classic ETL design and then using the MapReduce and Spark frameworks natively with Talend.

Classic ETL

In the classic ETL design, a tab-separated value file was parsed, joined against a lookup file, and passed through a simple transformation. The source file was picked up from either an S3 bucket or an on-premises SFTP server. The main dataset contained over 4 million rows at 39 GB, and the lookup contained 35 million rows at 6.7 GB.
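The parse, lookup, and transform flow described above can be sketched in plain Python. This is only an illustration, not the Talend job itself: the column names (`id`, `country_code`, `amount`), the sample rows, and the cents conversion are all hypothetical stand-ins, since the brief does not describe the actual schema or transformation.

```python
import csv
import io

# Hypothetical sample data; the real files were 39 GB and 6.7 GB.
MAIN_TSV = "id\tcountry_code\tamount\n1\tUS\t10.5\n2\tFR\t7.25\n3\tZZ\t3.0\n"
LOOKUP_TSV = "country_code\tcountry_name\nUS\tUnited States\nFR\tFrance\n"


def load_lookup(handle):
    """Read the lookup file into a dict keyed by the join column."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["country_code"]: row["country_name"] for row in reader}


def run_etl(main_handle, lookup):
    """Stream the main file, enrich each row from the lookup, and transform it."""
    reader = csv.DictReader(main_handle, delimiter="\t")
    for row in reader:
        yield {
            "id": int(row["id"]),
            # Unmatched keys fall back to a placeholder instead of being dropped.
            "country_name": lookup.get(row["country_code"], "UNKNOWN"),
            # Assumed "simple transformation": convert the amount to integer cents.
            "amount_cents": round(float(row["amount"]) * 100),
        }


lookup = load_lookup(io.StringIO(LOOKUP_TSV))
results = list(run_etl(io.StringIO(MAIN_TSV), lookup))
for record in results:
    print(record)
```

The lookup is held in memory while the main file is streamed row by row. That pattern fits a single large instance like the one used in this benchmark, but it is a design choice of this sketch rather than something the brief states about the Talend job.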


The environment used in this benchmark was a single AWS EC2 m4.10xlarge instance with 40 vCPUs, 160 GB of memory, and EBS-optimized storage.


Overall, parallelization within a single job had little effect on performance. Distributing the processing across machines performed better, and Spark performed best.

Rows Processed                         500,000    1,000,000  2,000,000  3,000,000  4,045,881
Standard                               13 min     13 min     16 min     20 min     22.3 min
Standard with Parallelization          14 min     16 min     17 min     18 min     18 min
Standard with Distributed Execution    7.15 min   7.26 min   7.6 min    8 min      8.4 min
Local Spark                            3 min (single value reported)

  • Standard with Parallelization – Scaling vertically is efficient only above a certain volume of data (in this test, parallelization beat the standard job only from 3 million rows upward) and is highly dependent on how the data is distributed and what type of computations are required.
  • Standard with Distributed Execution – Performance can be substantially improved if multiple machines can be combined into a grid and work off a shared location.
  • Local Spark – Without any additional machines or a Spark cluster, the library is simply embedded in the job and can be deployed locally.
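To make the comparison concrete, the short sketch below computes the speedup of each approach relative to the standard job, using the run times from the table above. It assumes the timings line up with the row counts in the order shown. The single Local Spark figure is left out because the table does not say which row count it applies to.

```python
# Run times in minutes, copied from the table above.
ROWS = [500_000, 1_000_000, 2_000_000, 3_000_000, 4_045_881]
STANDARD = [13, 13, 16, 20, 22.3]
PARALLEL = [14, 16, 17, 18, 18]
DISTRIBUTED = [7.15, 7.26, 7.6, 8, 8.4]


def speedups(baseline, candidate):
    """Return baseline/candidate per column; values above 1.0 mean the candidate is faster."""
    return [round(b / c, 2) for b, c in zip(baseline, candidate)]


parallel_speedup = speedups(STANDARD, PARALLEL)
distributed_speedup = speedups(STANDARD, DISTRIBUTED)

for rows, p, d in zip(ROWS, parallel_speedup, distributed_speedup):
    print(f"{rows:>9,} rows  parallelization x{p:.2f}  distributed x{d:.2f}")
```

On these numbers, parallelization is slower than the standard job up to 2 million rows and only modestly faster beyond that, while distributed execution is roughly 1.8 to 2.7 times faster across the whole range.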


MapReduce/Spark

The MapReduce/Spark design began as a classic ETL design and was refactored into the respective framework with just a few clicks in Talend. The full dataset used in this design contained 10.5 million rows with a 101 GB file size.


The environment used in this benchmark was a 4-node AWS EMR cluster using r3.2xlarge instances. Each instance had 8 vCPUs, 61 GB of memory, and 160 GB of SSD storage.


Running the initial 4 million row, 39 GB dataset from the Classic ETL design through MapReduce and Spark on the EMR cluster gave respectable results: 6 minutes and 4 minutes, respectively.

Rows Processed    4,045,881 (~39 GB)    6,468,093 (~62 GB)    10,513,974 (~101 GB)
MapReduce         6 min                 8 min                 14 min
Spark             4 min                 5 min                 9 min
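A quick way to read the cluster results is as throughput. The sketch below derives GB processed per minute for each framework, plus the ratio of MapReduce to Spark run time, from the table above. The input sizes are the approximate figures given in the table, so the derived rates are approximate as well.

```python
# Run times in minutes and approximate input sizes in GB, copied from the table above.
SIZES_GB = [39, 62, 101]
MAPREDUCE_MIN = [6, 8, 14]
SPARK_MIN = [4, 5, 9]


def throughput(sizes_gb, minutes):
    """Return GB processed per minute for each run."""
    return [round(size / mins, 2) for size, mins in zip(sizes_gb, minutes)]


mapreduce_rate = throughput(SIZES_GB, MAPREDUCE_MIN)
spark_rate = throughput(SIZES_GB, SPARK_MIN)

# Ratio of MapReduce time to Spark time; above 1.0 means Spark finished sooner.
spark_vs_mapreduce = [round(m / s, 2) for m, s in zip(MAPREDUCE_MIN, SPARK_MIN)]

for size, mr, sp, ratio in zip(SIZES_GB, mapreduce_rate, spark_rate, spark_vs_mapreduce):
    print(f"~{size} GB  MapReduce {mr} GB/min  Spark {sp} GB/min  Spark x{ratio} faster")
```

Throughput stays in a fairly narrow band as the input grows from about 39 GB to about 101 GB, and Spark finishes roughly 1.5 to 1.6 times faster than MapReduce at every size.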


Without the need for MapReduce or Spark skills, a classic ETL design was refactored in Talend to leverage these powerful frameworks. Should additional nodes be added to a cluster, Talend can easily scale due to its native Hadoop support.
