Nowadays, one of the biggest problems most companies face is how to integrate data from various sources, then process and expose it to satisfy business needs and obtain the information required to make the right decisions for improving existing products.
Yet many companies are still unaware of the benefits of data integration tools. Using such a tool is not uncommon at all; it is rather a MUST.
How do these tools help us? They:
Increase the speed of implementation, testing and deployment
Decrease development estimates
Ensure the accuracy of reports
What is ETL? ETL is a process used for databases, and data warehouses in particular, that extracts data from various sources, transforms it into the desired format and structure, and loads the transformed data into a final destination.
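As a minimal illustration of the three steps, here is a hand-written Java sketch; the JDBC URLs, the src_orders source table, the dwh_orders target table and the currency conversion rule are all hypothetical:

    import java.sql.*;

    public class EtlSketch {
        public static void main(String[] args) throws SQLException {
            // Extract: read rows from the source database (hypothetical URL and table)
            try (Connection src = DriverManager.getConnection("jdbc:postgresql://src-host/sales");
                 Connection tgt = DriverManager.getConnection("jdbc:postgresql://dwh-host/dwh");
                 Statement read = src.createStatement();
                 ResultSet rs = read.executeQuery("SELECT id, amount, currency FROM src_orders");
                 PreparedStatement write = tgt.prepareStatement(
                         "INSERT INTO dwh_orders (id, amount_eur) VALUES (?, ?)")) {
                while (rs.next()) {
                    // Transform: normalize every amount to EUR (illustrative rule only)
                    double amount = rs.getDouble("amount");
                    double amountEur = "EUR".equals(rs.getString("currency")) ? amount : amount * 0.9;
                    // Load: write the transformed row into the warehouse table
                    write.setLong(1, rs.getLong("id"));
                    write.setDouble(2, amountEur);
                    write.executeUpdate();
                }
            }
        }
    }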
To choose a reliable tool for data integration, research is done and proofs of concept are built. After that, the differences are weighed and the final decision is made: the tool to go with. The minimum requirements for accepting a tool in the investigation phase are: integration processes that can move data from sources to a target, error handling with the freedom to intervene in it, metadata manipulation and process monitoring.
I have chosen two ETL tools, more or less at random: Oracle Data Integrator 12c (ODI 12c) and Talend Data Integration, and I will make a short introduction to each of them.
ODI 12c is a complex data integration platform that satisfies all the needs: large volumes of data, good performance and parallelism. ODI 12c uses an E-LT architecture to guarantee higher performance at a reduced cost. The main difference between the ETL and E-LT architectures is the presence of a dedicated ETL server: in E-LT, the transformations are pushed down to the source or target database instead, as in the image below:
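To give a rough feel for the pushdown idea (a simplified sketch, not ODI's actual generated code), an E-LT style load issues one set-based statement on the target database instead of streaming rows through a middle tier; the connection URL and the staging.src_orders table are hypothetical, and the data is assumed to have been bulk-loaded into the staging schema beforehand:

    import java.sql.*;

    public class EltPushdownSketch {
        public static void main(String[] args) throws SQLException {
            // E-LT: connect to the target database and let its engine do the work.
            try (Connection tgt = DriverManager.getConnection("jdbc:oracle:thin:@dwh-host:1521/dwh");
                 Statement stmt = tgt.createStatement()) {
                // One set-based statement: transform and load happen inside the
                // target database, so no row passes through an ETL server.
                int rows = stmt.executeUpdate(
                    "INSERT INTO dwh_orders (id, amount_eur) " +
                    "SELECT id, CASE WHEN currency = 'EUR' THEN amount ELSE amount * 0.9 END " +
                    "FROM staging.src_orders");
                System.out.println("Loaded " + rows + " rows inside the target database");
            }
        }
    }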
The management pack of ODI 12c is illustrated below:
The ODI 12c interface is straightforward and easy to use. It has four main components (tabs): the Designer, Operator, Topology and Security navigators, all connected to a repository, as shown below:
Designer is used to create and modify the metadata and to define data transformations and integrity rules. The main components used here are the data models (which keep all the details about the data, such as tables and columns) and the projects (which hold the rules for loading and transforming the data, such as mappings, procedures, variables etc.). The Designer tab components are: Projects, Models, Load Plans and Scenarios, Global Objects and Solutions.
Operator handles the execution and monitoring part, helping developers debug and check the generated code. Its components are: Session List (which keeps all the sessions, grouped by date, agent etc.), Hierarchical Sessions, Load Plan Executions, Scheduler, Load Plans and Scenarios, and Solutions.
Topology is used to define the logical and physical architecture of the infrastructure.
Security manages the users, their profiles and their rights on objects.
Setting the source and the destination of the data, as well as the transformations applied to it, is done in the Mappings component, found under the Designer tab. Using drag and drop, we can build the entire data flow here: define the source, the aggregations (transformations) or changes on the data, and then populate the destination. Moreover, within this component, Knowledge Modules can be used to load and manipulate the data.
A Knowledge Module (KM) incorporates code templates, each used for one task in the overall data integration process. Impact analysis can be performed on a KM; a KM cannot be run without metadata, and it has the advantage of not referencing physical objects (datastores, columns, physical paths) directly; see the sketch after the list below.
There are 6 KM categories:
Reverse-engineering: retrieves metadata from various sources
Check: as the name says, checks the consistency of the data
Loading: loads data from the sources into the staging area
Integration: transforms the data and loads it into the destination
Journalizing: captures changes in the source data (changed data capture)
Service: exposes data through web services
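To illustrate the template idea behind KMs (a simplified Java sketch, not ODI's actual substitution API), the code below renders a loading template whose placeholders are resolved from repository metadata at run time; this is why the template itself never names a physical table:

    import java.util.Map;

    public class KmTemplateSketch {
        // A simplified "loading" template: only logical placeholders, no physical names.
        static final String TEMPLATE =
                "INSERT INTO ${TARGET_TABLE} (${TARGET_COLUMNS}) " +
                "SELECT ${SOURCE_COLUMNS} FROM ${SOURCE_TABLE}";

        // Resolve each placeholder from metadata, the way a KM relies on the repository.
        static String render(String template, Map<String, String> metadata) {
            for (Map.Entry<String, String> e : metadata.entrySet()) {
                template = template.replace("${" + e.getKey() + "}", e.getValue());
            }
            return template;
        }

        public static void main(String[] args) {
            // Hypothetical metadata, as it would be read from the repository for one mapping.
            String sql = render(TEMPLATE, Map.of(
                    "TARGET_TABLE", "DWH.ORDERS",
                    "TARGET_COLUMNS", "ID, AMOUNT_EUR",
                    "SOURCE_COLUMNS", "ID, AMOUNT",
                    "SOURCE_TABLE", "STG.ORDERS"));
            System.out.println(sql);
        }
    }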
The Package component is used to automate the data integration flow. Here, we can create a diagram describing the interaction among subcomponents.
The Scenario component is built from the generated code saved in the Work Repository (mappings, packages, procedures, variables etc.). Even if the underlying mappings, packages or procedures are modified, a Scenario is affected only when we regenerate it. A Scenario can be run from ODI, from the command line or through a web service.
Talend was the first provider of open source data integration software. It is built on the Eclipse RCP platform, which makes it easy to use, since many of us are familiar with that interface. Talend also uses the E-LT architecture.
Talend Data Integration architecture is illustrated below:
The interface components are: the Repository, the Design workspace, the Palette, various configuration views and the Outline/Code viewer. The Repository includes the Business Models, Job Designs, Contexts, Code, SQL Templates, Metadata, Documentation and the Recycle bin. The Design workspace is used to create or modify Business Models or Job Designs using the objects from the Palette.
Creating a Business Model has a huge advantage: it helps you identify problems before the implementation itself starts and lets you correct them while still in the design phase. A Business Model can be created not only by developers, but also by the management team, using the various shapes from the Palette. To clarify the Business Model, we can add comments with details to the shapes.
A Job Design is made up of one or more interconnected components and implements a part of a Business Model, or even an entire one.
Data Mapping includes the input data, the list of variables, the expressions and the output data, as well as the connections between them. One such component is tMap, which enables various operations on the data: transformations, concatenations, filters etc. For mappings over large volumes of data, tMap offers a solution that consists of saving temporary data on disk rather than in memory. A rough Java equivalent of typical tMap expressions is sketched below.
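Since Talend generates Java code, a tMap filter plus a concatenation correspond roughly to the expressions below; the row shape and field names are hypothetical, and this is hand-written Java rather than Talend's generated output:

    import java.util.List;
    import java.util.stream.Collectors;

    public class TMapStyleSketch {
        // Hypothetical input row shape for the mapping.
        record CustomerRow(String firstName, String lastName, String country, double revenue) {}

        public static void main(String[] args) {
            List<CustomerRow> input = List.of(
                    new CustomerRow("Ana", "Pop", "RO", 1200.0),
                    new CustomerRow("John", "Doe", "US", 80.0));

            // Filter expression (like a tMap filter): keep only high-revenue rows.
            // Output expression (like a tMap column expression): concatenate the names.
            List<String> output = input.stream()
                    .filter(row -> row.revenue() > 100.0)
                    .map(row -> row.firstName() + " " + row.lastName() + " (" + row.country() + ")")
                    .collect(Collectors.toList());

            output.forEach(System.out::println);
        }
    }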
A Job Design can be run in several ways. It can be exported in various forms (an sh or bat script, a WAR or a ZIP archive) and it can be run from the command line, from a Java application or from the Talend Data Integration tool itself.
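For the Java application case, an exported job is a jar that exposes the job as a class; assuming a hypothetical job named LoadOrders in a project named demo, the call could look roughly like the sketch below (the package layout, the runJobInTOS entry point and the --context_param convention follow the usual pattern of exported Talend jobs, but should be checked against your own export):

    // Hypothetical import: exported Talend jobs live in a package derived from
    // the project name, the job name and its version, e.g. demo.loadorders_0_1.
    import demo.loadorders_0_1.LoadOrders;

    public class RunExportedJob {
        public static void main(String[] args) {
            LoadOrders job = new LoadOrders();
            // Context parameters are passed as --context_param key=value pairs.
            String[] jobArgs = {"--context_param", "dbHost=dwh-host"};
            // runJobInTOS executes the job and returns its exit code (0 on success).
            int exitCode = job.runJobInTOS(jobArgs);
            System.out.println("Job finished with exit code " + exitCode);
        }
    }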
ODI 12c and Talend Data Integration are similar: they share the E-LT architecture, they are based on Java, they have an easy-to-use interface and they can be installed on any operating system. Perhaps one of the great advantages ODI 12c brings is that it integrates well with all Oracle products, especially with Oracle Exadata (a compute and storage system), ensuring very good performance. A recent study conducted by a research company also pointed out that the recovery-after-failure feature is better implemented in ODI 12c than in Talend.
Taking into consideration the expertise and experience Oracle has gathered over the years, this might give us more confidence in using ODI 12c. That said, it is worth looking into a tool such as Talend Data Integration, which was developed more recently and will probably be real competition for ODI 12c in the near future.