Joy Mundy's 10 Favorite Steps for Developing Solid ETL

Tuesday, 9 February, 2016 - 12:50

1. Start from a solid dimensional model

As a member of the (now defunct) Kimball Group, I am perhaps obligated to state this as a foundational principle. However, I firmly believe the dimensional model and the Kimball conformed Bus Matrix architecture are the best design for a DW/BI system. Get the model right, build solid ETL to populate it with integrity, and 95% of the battle is won.
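
To make the shape concrete, here is a minimal sketch of a star schema (the table and column names are my own invented illustration, not from any published Kimball design): a fact table holds the measures plus surrogate keys that join to conformed dimension tables.

```python
import pandas as pd

# Hypothetical conformed dimension: one row per product, keyed by a surrogate key.
dim_product = pd.DataFrame({
    "product_key": [1, 2],            # surrogate key assigned by the ETL
    "product_name": ["Widget", "Gadget"],
    "category": ["Hardware", "Hardware"],
})

# Hypothetical fact table: measures plus foreign keys to the dimensions.
fact_sales = pd.DataFrame({
    "date_key": [20160209, 20160209],
    "product_key": [1, 2],
    "quantity": [10, 3],
    "sales_amount": [100.00, 45.00],
})

# BI queries join facts to dimensions on the surrogate key.
report = fact_sales.merge(dim_product, on="product_key")
print(report.groupby("category")["sales_amount"].sum())
```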

2. Develop a scalable ETL architecture

Stop! Take your hands off your keyboard and do some thinking before you begin development. Whether you plan your architecture in advance or simply write code until you're done, you will end up with an architecture. Surely we all agree that a planned architecture is better than one that has evolved organically. Key questions include:

  • Should you purchase and use an ETL tool? (Hint: most organizations do.) How will you use that ETL tool?
  • What is your strategy for the Cloud? Will you put the data warehouse in a cloud (internal or external)? What about ETL? How will you move data?
  • How will your ETL system be instrumented? In other words, how will jobs be launched, monitored, logged, and managed? (A minimal sketch follows this list.)
  • Who will design the templates for your ETL modules? Where will those templates be stored and what processes will ensure that your developers adhere to consistent standards?
  • How often will you stage data within the ETL flow?
  • How big is one ETL module? (Hint: generally accepted best practice is to make each module small, usually one module for each target table.)
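
On the instrumentation question, here is a minimal sketch, assuming a simple Python harness (the function and module names are hypothetical, not from any particular ETL tool): every module runs through one wrapper that uniformly logs start time, duration, row counts, and failures.

```python
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("etl")

def run_module(name, module_fn):
    """Run one ETL module with uniform logging of timing, row counts, and errors."""
    start = time.monotonic()
    log.info("module %s: started", name)
    try:
        rows = module_fn()  # by convention, each module returns its row count
    except Exception:
        log.exception("module %s: FAILED after %.1fs", name, time.monotonic() - start)
        raise  # let the scheduler see the failure and decide what to do next
    log.info("module %s: loaded %d rows in %.1fs", name, rows, time.monotonic() - start)
    return rows

# Hypothetical module: in practice it would extract, transform, and load one table.
def load_dim_product():
    return 42

run_module("load_dim_product", load_dim_product)
```

With a uniform wrapper like this, operations can monitor every module the same way, regardless of what the module does internally.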

3. Plan before you begin each table

Even once you know how your ETL code will look, there are some table-specific planning tasks that you should finish before writing a line of code, or clicking a single widget in an ETL tool:

  • Profile the source data thoroughly. You need to know what the primary keys are (not always a trivial task if your source isn't well behaved), the foreign key relationships, the min/max/average lengths and values for each source column, the preponderance of nulls, and so on. You need to know your data! If you think ETL is primarily about moving data from point A to point B, you will be unhappily surprised during acceptance testing. (A profiling sketch follows this list.)
  • Create a complete source to target map, including pseudocode for complex transforms. I feel silly writing this bullet point, but I can’t tell you how many times I see ETL modules begun without this step.
  • Document unusual considerations (if any) for each table. Truth be told, the vast majority of ETL is just here to there. But it's the exceptions that create challenges.
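
As a rough sketch of the profiling bullet above (the file name is an invented stand-in; real profiling would run against the source system itself), the snippet below computes per-column null counts, distinct counts, value ranges, and string lengths, and flags candidate primary keys.

```python
import pandas as pd

# Hypothetical extract file; in practice the data comes from the source system.
src = pd.read_csv("source_extract.csv")

for col in src.columns:
    s = src[col]
    print(f"--- {col} ({s.dtype}) ---")
    print(f"  nulls:    {s.isna().sum()} of {len(s)}")
    print(f"  distinct: {s.nunique()}")
    if s.dtype == object:
        lengths = s.dropna().astype(str).str.len()
        if not lengths.empty:
            print(f"  length:   min={lengths.min()} max={lengths.max()} "
                  f"avg={lengths.mean():.1f}")
    else:
        print(f"  range:    min={s.min()} max={s.max()}")

# A candidate primary key has one distinct, non-null value per row.
for col in src.columns:
    if src[col].nunique() == len(src) and src[col].notna().all():
        print(f"candidate primary key: {col}")
```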


4. Always stage immediately upon extract

Do this for auditability and restartability. This is another widely accepted best practice, which I repeat in this list because I so often see the step skipped.
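
A minimal sketch of what this looks like in practice (paths and table names are invented): persist the raw extract, untransformed and tagged with a batch identifier, before any cleansing begins. A failed transform can then restart from the staged copy instead of re-hitting the source.

```python
import datetime
import pathlib

import pandas as pd

def extract_orders() -> pd.DataFrame:
    """Hypothetical extract; in practice this queries the source system."""
    return pd.DataFrame({"order_id": [1, 2], "amount": [10.0, 20.0]})

# 1. Extract.
raw = extract_orders()

# 2. Stage immediately: write the untouched extract to disk with a batch id,
#    before any transformation runs.
batch_id = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
stage_dir = pathlib.Path("staging/orders")
stage_dir.mkdir(parents=True, exist_ok=True)
staged_file = stage_dir / f"orders_{batch_id}.csv"
raw.to_csv(staged_file, index=False)

# 3. Transform only from the staged copy, so a failed run can restart here
#    (and auditors can see exactly what was extracted) without re-querying the source.
clean = pd.read_csv(staged_file)
```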

5. Avoid row-by-row wherever possible

We need our ETL to finish before the next ice age. ETL tools generally avoid row-by-row for inserts, but may take the slower path for updates and deletes. (If your update/delete volume is modest, it’s not the end of the world.) Hand-coded ETL is more inclined to use row-by-row, as in cursors, which is a performance killer.
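
To illustrate the difference (using SQLite and invented table names purely as a stand-in for a real warehouse), the sketch below contrasts a row-by-row update loop with a single set-based UPDATE against staged data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE stage_customer (customer_id INTEGER PRIMARY KEY, city TEXT);
    INSERT INTO dim_customer VALUES (1, 'Oslo'), (2, 'Bergen');
    INSERT INTO stage_customer VALUES (1, 'Stavanger'), (2, 'Bergen');
""")

# Row-by-row: one statement per row -- the pattern to avoid at volume.
for customer_id, city in conn.execute(
        "SELECT customer_id, city FROM stage_customer").fetchall():
    conn.execute("UPDATE dim_customer SET city = ? WHERE customer_id = ?",
                 (city, customer_id))

# Set-based: one statement updates every changed row at once.
conn.execute("""
    UPDATE dim_customer
    SET city = (SELECT s.city FROM stage_customer s
                WHERE s.customer_id = dim_customer.customer_id)
    WHERE EXISTS (SELECT 1 FROM stage_customer s
                  WHERE s.customer_id = dim_customer.customer_id
                    AND s.city <> dim_customer.city)
""")
conn.commit()
```

The set-based statement does the same work in one pass, and it is the version that scales.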

6. Don’t be afraid to improve data

This recommendation is as much about the dimensional design as the ETL. But I have talked with many teams who seem religiously devoted to maintaining the sanctity of the data as it’s stored in the source system. So many source systems are junk… please let’s take this opportunity to improve things for the user community. Simple improvements include adding explanatory descriptions, groupings, and alternative hierarchies to dimensional data. I almost always prefer to implement improvements as new columns, rather than updating existing data in place.
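
Here is a small sketch of the new-column approach (the grouping values are invented): the source attribute is stored verbatim, and the improvement arrives as an additional, ETL-maintained column alongside it.

```python
import pandas as pd

# Dimension rows as extracted; 'source_category' is kept exactly as the source has it.
dim_product = pd.DataFrame({
    "product_key": [1, 2, 3],
    "source_category": ["HW-01", "HW-02", "SW-11"],  # cryptic source codes
})

# Hypothetical mapping maintained by the DW/BI team, not the source system.
category_group = {"HW-01": "Hardware", "HW-02": "Hardware", "SW-11": "Software"}

# Improve the data in a NEW column; the original source value stays untouched.
dim_product["category_group"] = (
    dim_product["source_category"].map(category_group).fillna("Unknown")
)
print(dim_product)
```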

7. Push complex improvements upstream

A lot of bandwidth in ETL fora centers on complex topics such as de-duplication. I say, push that work upstream! You (the DW / ETL team) will never win by tackling these problems. We want the ETL job to start at, say, midnight and finish at 5am without intervention. The thorny de-duplication problems fundamentally require human intervention. They are much better addressed during the work day, by a team that is tasked with conflict resolution. The answer is to push problems up to the source system, likely a Master Data Management (MDM) system.

8. Set high standards for automation

I spoke with someone over the New Year’s holidays who was on data warehouse load duty for the weekend, and couldn’t attend a party. Our ETL system should require intervention from the ETL team only monthly or even quarterly. Let’s design the system so the common problems are solved automatically; less common issues are handled by the system operations team; and only the truly unforeseen requires us to get out of bed. Or abstain from champagne.
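
One hedged sketch of those tiers (the retry thresholds and alert hook are hypothetical): transient failures retry automatically, known-but-persistent failures page the operations team, and only the truly unforeseen escalates to the ETL team.

```python
import time

def alert(team: str, message: str) -> None:
    """Hypothetical notification hook; a real system would page, email, or chat."""
    print(f"[ALERT -> {team}] {message}")

class TransientError(Exception):
    """A known, self-healing failure, e.g. a dropped source connection."""

def run_with_escalation(job, max_retries: int = 3, wait_seconds: int = 60):
    for _ in range(max_retries):
        try:
            return job()
        except TransientError:
            # Tier 1: common problem, handled automatically by waiting and retrying.
            time.sleep(wait_seconds)
        except Exception as exc:
            # Tier 3: truly unforeseen -- wake the ETL team.
            alert("etl-team", f"{job.__name__} raised {exc!r}")
            raise
    # Tier 2: a known failure that did not self-heal -- operations takes over.
    alert("operations", f"{job.__name__} still failing after {max_retries} retries")
```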

9. Don’t sweat the technology

All the popular tools are widely used and successful. Unless you have extreme data volumes, adopt whatever tool you already have a license for, expertise with, or fondness for. ETL versus ELT… who cares? Only the vendors’ marketing teams.

10. Test continuously

If you can find the bandwidth to invest in an infrastructure – and processes – to test continuously, from developers' unit tests to system integration testing to user acceptance testing, you will win. This isn't a small task, but the ROI is off the charts. It's the right thing to do.
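
At the unit-test end of that spectrum, a sketch (the transform and expected values are invented): when each ETL transform is a plain function, an automated test can exercise it on every change.

```python
import pandas as pd

def standardize_city(df: pd.DataFrame) -> pd.DataFrame:
    """Example ETL transform: trim whitespace and title-case city names."""
    out = df.copy()
    out["city"] = out["city"].str.strip().str.title()
    return out

def test_standardize_city():
    raw = pd.DataFrame({"city": ["  new york ", "OSLO"]})
    result = standardize_city(raw)
    assert list(result["city"]) == ["New York", "Oslo"]

if __name__ == "__main__":
    test_standardize_city()
    print("all tests passed")
```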

