WHAT DOES ETL MEAN?
ETL stands for Extract, Transform, Load, and means reading (or extracting) data from one or more data sources before processing (or transforming) them and then loading them into a sink – normally a database. The sources are also typically databases in the majority of cases. ELT does the same thing, but in a slightly different order.
With ELT, we are usually dealing with mass data; so (for example) the task might be to regularly push the inventory in the ERP system into the database of the web shop. This can quickly add up to several thousand records – or even several tens of thousands. And these massive volumes of data pose a special challenge to an ETL tool.
Incidentally, we are talking specifically about ETL tools here, since we shouldn’t waste any time on custom-developed programs. Even if these are tailored precisely to perform a specific task or to be ever so slightly faster, standard professional tools have so many other advantages (flexible configurations, monitoring, faster implementation of new tasks) that the DIY method simply isn’t worthwhile. The same goes for all the other concepts we have discussed, by the way – EDI, EAI, and so on
What counts when it comes to ETL?
Speed, speed, and did we mention speed? No in all seriousness: of course there are other important factors at play, but don’t forget that we are dealing with mass data. To explain this, let’s take another look at the example of transferring the stock list from the ERP system into the web shop database.
In the simplest case, this can be done by simply reading the entire ERP database and pushing it into that of the web shop. Yet the tables at least will probably look different, so we will need to modify the structure of the data slightly. Then the ERP system might have special codes for the different countries, their respective tax rates, and so on, so the web shop will need country codes such as DE, AT, CH as well as direct percentage values for tax. Those are just a few simple examples of possible transformations.
Now we have a problem: we want to have a clean data set – or in other words, a snapshot of the stock list in the ERP where everything matches up precisely. That means we can’t simply fire off a select statement to retrieve a few tens or hundreds of thousands of items from the source database over the course of a few minutes, since that database may be
constantly changing. The result would be inconsistent data. A simple solution to this would be to run our process in the middle of the night when the inventory won’t change. But the problem there would be that the data behind the web shop would become less accurate as the day went on. So maybe we should make the reconciliation multiple times per day instead – perhaps every three hours. Yet that would mean we wouldn’t be able to make any changes to the ERP system while it was being read, as the tables would be locked. That wouldn’t be ideal either, if it took a few minutes and happened multiple times per day. This is why our motto here is ‘need for speed’!
We also have the same problem in the target database. If we replace the entire database each time, the tables will have to be emptied and refilled for each reconciliation, leaving the web shop unusable while the process is running.
There are a few different possible approaches here. For example, we could use a time stamp for the most recent change to ensure that we only retrieve records from the source that have changed since the last reconciliation. That would reduce the quantity of data, but could also slow down the reading process due to the comparison contained in the where-clause in the statement. Alternatively, we could tell the ERP system to track what has changed since the last transfer and to automatically make these data available at regular intervals. Finally, if the data in both databases need to be synchronised at all times, we could immediately hand every change over to the ETL tool, which would then pass it on to the web shop database. That would spare lengthy shutdowns on both sides, but would result in more work. For instance, if an item changed multiple times between two full reconciliations, it would only be transferred once; whereas with instant handovers, each change would be passed on every single time it happened.
We should make sure we don’t forget about the “T” part of the process – the transformation. Speed plays a role here too, albeit a less important one than when we are reading and writing the source and target databases. After all, we aren’t blocking any other systems during this phase. Yet it also shouldn’t take a whole hour to process hundreds of thousands
of records, since otherwise the data will be obsolete by the time they reach their destination.
In the simplest case (as mentioned above), a transformation of this kind might be the replacement of country or tax codes. Yet it might also involve eliminating duplicates or otherwise cleaning or enriching the data. As such, we often need to access files with conversion tables, additional databases and other third-party systems. We may also need to call up web services or other SOA services. As you can see, there are similarities here with EAI processes.
That brings us to a very important point. As mentioned above, specialist ELT tools are generally better than generic branded solutions (and for more reasons and examples of this kind of software, take a look at Wikipedia). However, because this kind of process often involves steps that appear in EAI (and EDI) processes, it is worth considering in each case whether you really need to buy a dedicated piece of ETL software, or whether it would be better to acquire an EAI tool that can carry out the same tasks. That certainly won’t let you achieve the extreme speeds that specialist ETL tools can reach, but the performance of the EAI software might well be more than enough for your requirements, and you can then go on to use the system for other purposes too.
Not forgetting that you will be able to keep all your important processes in one central location and monitor them too, instead of keeping an eye on two different systems. Or on three, if you also want to deploy an EDI system – which is unavoidable for almost every company nowadays. In other words, the ideal solution in many cases is a single system that is at home in all three worlds and can meet your requirements across the board – even though it may not be able to operate at the optimal speeds that specialist tools can achieve in extreme situations such as ETL (and SOA-type functionality might come into play here too…).