Data Integration tools: ETL Tools

The main objective of an "ETL tool" is to gather the content of various databases or operational systems across your organization and move/copy/transfer all this data into a centralized database. This "centralized database" is often named a "Data warehouse". Alternatively ETL tools are also used to perform complex data transformations that are difficult to make using simple SQL statements.

Differences between tools

The differences between all these tools are:

  • Speed & Scalability

    Usually ETL tools are transferring (or processing) huge amount of data between different servers. Let's assume that, each day, you need to process 100 TB of data but, due to the large volume of data, you require 28h of computing time. You see that there is a problem here: you need 28 hours of computing time every day. This is why, most of the time, Scalability comes hand-in-hand with Speed. The slowest ETL's are Talend, Pentaho and Rapid Insight (this last one is very very very slow). Talend has some small "memory issues" that prevent it to process large volume of data. The technology behind Anatella is C++; it is thus very fast.

    To achieve very high throughput, one solution is to use several computers "in parallel" (Anatella, Datastage and Ab Initio, CloverETL offers this functionality), but then you need to hire more IT staff to manage your whole "fleet of computers".

    Scalability is not only about the raw volume of data that you can process. All processed data always ends-up in (large) tables. When you do "advanced predictive analytics" (like we do with the "TIMi suite"), you usually end-up with tables from 500 to 50000 columns. No other ETL tool than Anatella is able to handle such a large volume of columns.

  • Versatility

    Although the "SAS base" system is not really an ETL, many companies still use it as an ETL because of the large versatility of the data-transfomation that you can accomplish. However, this versatility has a price: very low speed and a horrible interface based on a "scripting technology from the late seventies".

    Anatella also integrates a scripting language, similar (but more versatile) to the one used in a SAS. Thanks to its advanced scripting language (based on JavaScript), Anatella is even more flexible than SAS.
  • Cost of ownership

    Datastage, Ab Initio and SAS are extremely expensive when you buy them. The interface of Datastage and SAS is based on a "scripting language", so you need to hire expensive, highly training consultants to use these software (because they need to know the particular, uncommon "scripting language" of these application).

    Slow ETL's also tend to become more expensive because you must hire more IT staff to manage your whole "fleet of computers".

  • User Interface

    There are basically two families of ETL tools:
    1. Script-Based ETL tool: Examples: SAS base, Datastage, Scriptella, ...
    2. Graph-based ETL tool: Examples: Kettle, CloverETL, Talend,...
    There is currently a real "war" between ETL practitioners to decide which ETL family is the "best one".

    Out ETL tool (Anatella) offers both a Script-based interface AND a Graph-Based interface.

    Script-based ETL tools are more versatile and offer to the final user a greater flexibility when defining new transformations. On the other hand, typically, it's very difficult to develop new data-transformations-script using the currently commercially available "script-based-ETL tools". The main "barrier" when using a script-based-ETL is that you have to learn still another new language that is possibly completely different from any other "common languages" known by most programmers (In particular, the "SAS language" is a really difficult to learn & use because this language has a terrible syntax with dozens of exceptions (I can never remember if I have to put a ";" character or a "," character at the end of the line, between the instructions).

    Anatella is one of the very few ETL tool that uses a "hybrid techlonology":
    • The simple transformations are described using "little boxes" (that is the most intuitive way to represent a data transformation and is a "de facto" standard for all the modern ETL tools).
    • Complex transformations are programmed using a scripting language based on JavaScript (standard ECMA-262) which is simple, complete and extremely versatile. JavaScript is one of the most widely used programming language currently used in the industry (it's used in every webpage!). You can leverage your already-existing-JavaScript-skills to become an ETL expert instantaneously!

Differences in results between tools

The differences in result between all these tools are:

  • None: given enough time and enough human resources, you can achieve the same result with all the above tools. The only exception that "I can think of" might be: When you are manipulating tables with a high number of columns (above 10000) (...and this situation is quite common when working on predictive analytics projects), then only Anatella can do it properly.

Available ETL tools: Anatella (our tool), Datastage, Ab initio, SAS, Talend, CloverETL, Pentaho, Rapid Insight ETL,...

Next: Review of Data Warehousing tools

Test the TIMi Suite for Free!