ETL Offload to Hadoop

In the Big Data space, a common pattern is offloading a traditional data warehouse to a Hadoop environment. Whether Hadoop becomes the primary store or holds only “cold” data, Talend makes the offload painless.

Many organizations looking to optimize their data architecture use Hadoop to hold their cold data or to maintain an archive. Talend's native code generation for Hadoop makes this process easy.

The offload process starts at the source database, which is typically an expensive appliance but can be virtually any JDBC-compliant database. Next, the list of tables is pulled and loaded into the job's context, along with the list of columns in each table. Once this is done, the job iterates through the tables, preparing a simple statement for each one and using it to offload that table to Hadoop.
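The metadata step described above can be sketched outside of Talend as well. The following is a minimal illustration, not the code Talend generates: it uses Python's built-in `sqlite3` module as a stand-in for the JDBC source, and the function names (`list_tables`, `list_columns`, `build_offload_plan`), the sample tables, and the `/warehouse/offload` target path are all hypothetical. It pulls the table list, pulls the columns for each table, and prepares one extraction statement per table.

```python
import sqlite3


def list_tables(conn):
    """Return the names of the user tables in the source database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def list_columns(conn, table):
    """Return the column names of one table, in declaration order."""
    # PRAGMA table_info yields rows of (cid, name, type, notnull, dflt_value, pk).
    quoted = table.replace('"', '""')
    rows = conn.execute(f'PRAGMA table_info("{quoted}")').fetchall()
    return [row[1] for row in rows]


def build_offload_plan(conn, target_root="/warehouse/offload"):
    """Pair every table with an extraction statement and a target directory.

    target_root is a hypothetical Hadoop path used only for illustration.
    """
    plan = []
    for table in list_tables(conn):
        columns = list_columns(conn, table)
        plan.append({
            "table": table,
            "columns": columns,
            "select": "SELECT {} FROM {}".format(", ".join(columns), table),
            "target": f"{target_root}/{table}",
        })
    return plan


def demo():
    """Build a plan for a small in-memory database with two sample tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    conn.execute(
        "CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL)"
    )
    plan = build_offload_plan(conn)
    conn.close()
    return plan


if __name__ == "__main__":
    for step in demo():
        print(step["table"], "->", step["target"])
        print("   ", step["select"])
```

Against a real warehouse, the two metadata queries would be replaced by whatever catalog access the source database offers (over JDBC, typically the driver's metadata calls), while the per-table loop would stay the same.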

This becomes a straightforward process when visualized as a Talend job:

The final Offload step executes the prepared statement natively on Hadoop using either MapReduce or Spark. Because Talend provides a graphical design environment, a job can easily be switched between batch processing frameworks such as MapReduce and Spark. The generated code can be deployed directly through YARN or run as a standalone process.

This generic ingestion framework, and others like it, can be built easily with Talend Studio thanks to its open source design: no proprietary runtimes or engines are required to harness the full power of a Hadoop environment.

One thought on “ETL Offload to Hadoop”

  • July 25, 2017 at 5:43 am

    How about schema for each table?

