Organizations running legacy ETL solutions are accustomed to spending egregious amounts of money on cores, connectors, and sometimes even on large data volumes. Since the introduction of Hadoop, the need to lower costs has not fazed many of these legacy ETL solutions, which were designed for a batch world. Up until Spark, high-scale processing power was only available with MapReduce. Talend was the first data integration provider to natively leverage Spark for batch and streaming processes. Now, using these frameworks allows for powerful processing and impressive speeds without proprietary software or hardware.
Talend was benchmarked during a proof of concept that required comparing MapReduce with Spark. The data set was approximately 500 million rows, at 15.9 GB compressed and 47.6 GB raw. The process performed a mapping, transformation, aggregation, sort, and limit to retrieve the top 50 results. The jobs ran in a Hadoop environment. Since Talend deploys code natively for MapReduce or Spark through YARN, there is no software to install on Hadoop and no licensing for each Hadoop node.
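To make the shape of that job concrete, here is a minimal single-machine sketch of the same logical steps (map, transform, aggregate, sort, limit) in plain Python. It is not the code Talend generates and it does not use Spark or MapReduce. The input schema (`customer_id,amount_in_cents`), the sample rows, and the function names are all hypothetical, since the benchmark's actual schema is not described.

```python
from collections import defaultdict

# Hypothetical sample input; the real benchmark schema is not given.
RAW_LINES = [
    "c1,1000",
    "c2,250",
    "c1,500",
    "c3,4000",
    "c2,250",
]


def parse(line):
    """Mapping: split a raw text line into typed fields."""
    customer_id, cents = line.split(",")
    return customer_id, int(cents)


def to_dollars(record):
    """Transformation: convert the amount from cents to dollars."""
    customer_id, cents = record
    return customer_id, cents / 100


def top_n(lines, n):
    """Aggregate amounts per key, sort descending, and keep the first n."""
    totals = defaultdict(float)
    for customer_id, dollars in map(to_dollars, map(parse, lines)):
        totals[customer_id] += dollars  # aggregation: sum per key
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)  # sort
    return ranked[:n]  # limit


print(top_n(RAW_LINES, 2))  # prints [('c3', 40.0), ('c1', 15.0)]
```

In the benchmark the limit was 50 rather than 2, and the same steps ran distributed across the cluster instead of in a single loop.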
Beginning with MapReduce, processing on this data set took 4.97 minutes. Enter Spark: processing completed in 2.06 minutes.
Both results were obtained with default settings; neither run was optimized.
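For a sense of scale, the two timings can be turned into a speedup ratio and a rough throughput figure. The short calculation below uses only the numbers reported above; the row count is the approximate 500 million stated for the data set, so the rows-per-second values are estimates, and the helper names are illustrative.

```python
# Figures reported in the benchmark.
MAPREDUCE_MINUTES = 4.97
SPARK_MINUTES = 2.06
APPROX_ROWS = 500_000_000  # "approximately 500 million rows"


def speedup(baseline, candidate):
    """How many times faster the candidate run is than the baseline."""
    return baseline / candidate


def time_saved_fraction(baseline, candidate):
    """Fraction of the baseline run time that the candidate saves."""
    return (baseline - candidate) / baseline


def rows_per_second(rows, minutes):
    """Average throughput over the whole run."""
    return rows / (minutes * 60)


print(f"speedup: {speedup(MAPREDUCE_MINUTES, SPARK_MINUTES):.2f}x")
# prints "speedup: 2.41x"
print(f"time saved: {time_saved_fraction(MAPREDUCE_MINUTES, SPARK_MINUTES):.1%}")
# prints "time saved: 58.6%"
print(f"MapReduce: {rows_per_second(APPROX_ROWS, MAPREDUCE_MINUTES) / 1e6:.2f}M rows/s")
# prints "MapReduce: 1.68M rows/s"
print(f"Spark: {rows_per_second(APPROX_ROWS, SPARK_MINUTES) / 1e6:.2f}M rows/s")
# prints "Spark: 4.05M rows/s"
```

In other words, with default settings Spark finished the same job roughly 2.4 times faster than MapReduce.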