News

Joy Mundy's 10 Favorite Steps for Developing Solid ETL

1. Start from a solid dimensional model

As a member of the (now defunct) Kimball Group, I am perhaps obligated to state this as a foundational principle. However, I firmly believe the dimensional model and the Kimball conformed Bus Matrix architecture are the best design for a DW/BI system. Get the model right, build solid ETL to populate it with integrity, and 95% of the battle is won.

2. Develop a scalable ETL architecture

Stop! Take your hands off your keyboard and do some thinking before you begin development. Whether you plan your architecture in advance, or simply write code until you’re done, you will end up with an architecture. Surely we all agree that a planned architecture is better than one that has evolved organically. Key questions include:
 

  • Should you purchase and use an ETL tool? (Hint: most organizations do.) How will you use that ETL tool?
  • What is your strategy for the Cloud? Will you put the data warehouse in a cloud (internal or external)? What about ETL? How will you move data?
  • How will your ETL system be instrumented? In other words, how will jobs be launched, monitored, logged, and managed?
  • Who will design the templates for your ETL modules? Where will those templates be stored and what processes will ensure that your developers adhere to consistent standards?
  • How often will you stage data within the ETL flow?
  • How big is one ETL module? (Hint: generally accepted best practice is to make each module small, usually one module for each target table.)
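
Instrumentation is much easier to standardize up front than to retrofit. As a minimal sketch of the idea (the `run_module` wrapper and the `dim_customer` module are hypothetical names, not features of any particular ETL tool), every module can be launched through one wrapper so that launching, logging, and failure capture look identical across the system:

```python
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("etl")

def run_module(name, fn, **kwargs):
    """Launch one ETL module with uniform logging and error capture."""
    start = time.monotonic()
    log.info("module %s: started", name)
    try:
        rows = fn(**kwargs)
        log.info("module %s: finished, %s rows, %.1fs",
                 name, rows, time.monotonic() - start)
        return rows
    except Exception:
        log.exception("module %s: FAILED after %.1fs",
                      name, time.monotonic() - start)
        raise

# A trivial stand-in module: one small module per target table.
def load_dim_customer():
    return 42  # pretend we loaded 42 rows

run_module("dim_customer", load_dim_customer)
```

Because every module goes through the same wrapper, a scheduler or operations dashboard only needs to understand one log format.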

3. Plan before you begin each table

Even once you know how your ETL code will look, there are some table-specific planning tasks that you should finish before writing a line of code, or clicking a single widget in an ETL tool:
 

  • Profile the source data thoroughly. You need to know the primary keys (not always trivial to identify if your source isn’t well behaved), the foreign key relationships, the min/max/average lengths and values for each source column, the preponderance of nulls, and so on. You need to know your data! If you think ETL is primarily about moving data from point A to point B, you will be unhappily surprised during acceptance testing.
  • Create a complete source to target map, including pseudocode for complex transforms. I feel silly writing this bullet point, but I can’t tell you how many times I see ETL modules begun without this step.
  • Document any unusual considerations for each table. Truth be told, the vast majority of ETL is just here to there. But it’s the exceptions that create challenges.
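
A useful profiling pass can be as simple as a few counts per column. The sketch below (toy rows and invented column names, purely for illustration) checks null counts, distinct values, maximum lengths, and whether a column could even serve as the primary key:

```python
def profile(rows, columns):
    """Basic column profile: null count, distinct count, max length."""
    out = {}
    for col in columns:
        values = [r[col] for r in rows]
        non_null = [v for v in values if v is not None]
        out[col] = {
            "nulls": values.count(None),
            "distinct": len(set(non_null)),
            "max_len": max((len(str(v)) for v in non_null), default=0),
        }
    return out

def is_candidate_key(rows, col):
    """A column can be the primary key only if it is unique and non-null."""
    values = [r[col] for r in rows]
    return None not in values and len(set(values)) == len(values)

# Toy "source extract"; in practice this comes from the source system.
rows = [
    {"customer_id": 1, "name": "Acme", "region": "EU"},
    {"customer_id": 2, "name": "Bolt", "region": None},
    {"customer_id": 3, "name": "Acme", "region": "US"},
]

stats = profile(rows, ["customer_id", "name", "region"])
print(stats["region"]["nulls"])               # 1
print(is_candidate_key(rows, "customer_id"))  # True
print(is_candidate_key(rows, "name"))         # False (duplicate 'Acme')
```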


4. Always stage immediately upon extract

Stage the extracted data for auditability and restartability. This is another widely accepted best practice, which I repeat in this list because I so often see the step skipped.
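
A minimal illustration of the staging idea, using a CSV file as the staging area (the file layout and names here are illustrative only): the raw extract is persisted untouched, keyed by batch, so a failed downstream transform can restart from the staged copy instead of re-hitting the source system.

```python
import csv
import os
import tempfile

def stage_extract(rows, batch_id, stage_dir):
    """Persist the raw extract untouched, keyed by batch id."""
    path = os.path.join(stage_dir, f"customers_{batch_id}.csv")
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
    return path

def restart_from_stage(path):
    """Re-read a staged extract for a restarted transform run."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

rows = [{"id": "1", "name": "Acme"}, {"id": "2", "name": "Bolt"}]
stage_dir = tempfile.mkdtemp()
path = stage_extract(rows, batch_id="20240101", stage_dir=stage_dir)
assert restart_from_stage(path) == rows  # staged copy round-trips intact
```

The same staged files double as an audit trail: you can always answer "what exactly did we receive in batch X?"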

5. Avoid row-by-row wherever possible

We need our ETL to finish before the next ice age. ETL tools generally avoid row-by-row for inserts, but may take the slower path for updates and deletes. (If your update/delete volume is modest, it’s not the end of the world.) Hand-coded ETL is more inclined to use row-by-row, as in cursors, which is a performance killer.
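
The difference is easy to see with any database API; the sketch below uses SQLite purely for illustration. The commented-out loop is the row-by-row pattern to avoid; `executemany` hands the whole batch to the engine at once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT)")

rows = [(i, f"product {i}") for i in range(1000)]

# Row-by-row: one round trip per row -- the performance killer.
# for key, name in rows:
#     conn.execute("INSERT INTO dim_product VALUES (?, ?)", (key, name))

# Set-based: one call for the whole batch.
conn.executemany("INSERT INTO dim_product VALUES (?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM dim_product").fetchone()[0]
print(count)  # 1000
```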

6. Don’t be afraid to improve data

This recommendation is as much about the dimensional design as the ETL. But I have talked with many teams who seem religiously devoted to maintaining the sanctity of the data as it’s stored in the source system. So many source systems are junk… please let’s take this opportunity to improve things for the user community. Simple improvements include adding explanatory descriptions, groupings, and alternative hierarchies to dimensional data. I almost always prefer to implement improvements as new columns, rather than updating existing data in place.
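
For example, enriching a cryptic source status code with a descriptive column and a coarser grouping might look like this (the codes and mappings are invented for illustration). Note that the source value itself is preserved, and the improvements arrive as new columns:

```python
# Hypothetical mappings from a cryptic source code to user-friendly columns.
STATUS_DESCRIPTIONS = {"A": "Active", "S": "Suspended", "C": "Closed"}
STATUS_GROUPS = {"A": "Open", "S": "Open", "C": "Closed"}

def enrich(row):
    """Add description and grouping columns; never overwrite the source."""
    code = row["status_code"]
    return {
        **row,  # original source value preserved untouched
        "status_description": STATUS_DESCRIPTIONS.get(code, "Unknown"),
        "status_group": STATUS_GROUPS.get(code, "Unknown"),
    }

row = enrich({"customer_id": 7, "status_code": "S"})
print(row["status_code"])         # 'S' -- source value intact
print(row["status_description"])  # 'Suspended'
print(row["status_group"])        # 'Open'
```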

7. Push complex improvements upstream

A lot of bandwidth in ETL fora centers on complex topics such as de-duplication. I say, push that work upstream! You (the DW / ETL team) will never win by tackling these problems. We want the ETL job to start at, say, midnight and finish at 5am without intervention. The thorny de-duplication problems fundamentally require human intervention. They are much better addressed during the work day, by a team that is tasked with conflict resolution. The answer is to push problems up to the source system, likely a Master Data Management (MDM) system.

8. Set high standards for automation

I spoke with someone over the New Year’s holidays who was on data warehouse load duty for the weekend, and couldn’t attend a party. Our ETL system should require intervention from the ETL team only monthly or even quarterly. Let’s design the system so the common problems are solved automatically; less common issues are handled by the system operations team; and only the truly unforeseen requires us to get out of bed. Or abstain from champagne.
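
One tier of that automation is retrying transient failures before escalating to a human. A sketch, with hypothetical exception classes standing in for whatever your scheduler and operations process actually use:

```python
import time

class TransientError(Exception):
    """Failure expected to clear on its own (e.g. a dropped connection)."""

class EscalateToOperator(Exception):
    """Failure that genuinely needs a human."""

def run_with_retries(job, max_retries=3, delay_seconds=0):
    """Retry common transient failures automatically; escalate only
    after retries are exhausted."""
    for attempt in range(1, max_retries + 1):
        try:
            return job()
        except TransientError:
            time.sleep(delay_seconds)
    raise EscalateToOperator(f"job failed after {max_retries} attempts")

# A job that fails twice, then succeeds -- no human needed.
attempts = {"n": 0}
def flaky_job():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientError
    return "loaded"

print(run_with_retries(flaky_job))  # 'loaded'
```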

9. Don’t sweat the technology

All the popular tools are widely used and successful. Unless you have extreme data volumes, adopt whatever tool you already have a license for, expertise with, or fondness for. ETL versus ELT… who cares? Only the vendors’ marketing teams.

10. Test continuously

If you can find the bandwidth to invest in an infrastructure – and processes – to test continuously, from developers’ unit tests to system integration testing to user acceptance testing, you will win. This isn’t a small task, but the ROI is off the charts. It’s the right thing to do.
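
Even a handful of unit tests on individual transforms pays off. A sketch of a developer-level test (the `clean_name` transform and its rules are invented for illustration), runnable on every commit:

```python
def clean_name(raw):
    """Trim and title-case an incoming name; map blanks to 'Unknown'."""
    name = (raw or "").strip()
    return name.title() if name else "Unknown"

def test_clean_name():
    assert clean_name("  acme corp ") == "Acme Corp"
    assert clean_name("") == "Unknown"
    assert clean_name(None) == "Unknown"

test_clean_name()
print("all tests passed")
```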


Kimball Resources


The Resources section includes our most popular articles and Design Tips, the Kimball techniques and dimensional modeling glossary, our Toolkits’ tools and utilities, and more. Check out the Resources tab on our website.

Additionally, we have compiled a second edition of The Kimball Group Reader (Wiley 2016) containing a fully remastered library of our published content.

Business Analytics for All



Business Analytics for All (ba4all) manages a community where business and technical professionals come together to share ideas, connect with peers and get must-have insights in order to maximize the business impact of information management and business analytics initiatives.

Connect

Establish contact, not only through the omnipresent social media, but also in real life. We may forget it all too easily but social contact is essential for the happiness of any human being. At Business Analytics for All, you get the chance to connect with your peers from all over Belgium. Expanding your network will not get easier than this!

Share

Sharing our ideas, that is what we are doing constantly, both in our professional and personal lives. If you want to make progress in any field, sharing is the name of the game. An inspiring keynote, a presentation, a round table ... Business Analytics for All offers lots of opportunities to share best practices and to get insights from the best public speakers. Expect your career development to get a serious boost!

Insight

If you manage to Connect and to Share ideas and best practices, then you can get Insight. Business Analytics for All provides must-have insights and best practices for maximizing the business impact of information management and business analytics initiatives. You will gain the insight necessary to achieve analytic excellence in today's challenging business environment.

You can find us on:

Belgium: www.ba4all.be
Netherlands: www.ba4all.nl
Sweden: www.ba4all.se 

Kimball Design Tip #161 Managing Large Dimensional Design Teams


Including business representatives on the design team obviously increases the size of the group. In many organizations, the resulting team will be a small group of four to eight individuals. In these situations, managing the design process is relatively straightforward: the team needs to gather on a regular basis, focus on the effort at hand, and follow a defined process to complete the modeling effort.

However, in larger organizations, especially when the scope includes tackling enterprise-wide conformed dimensions, the design team may be considerably larger. In recent years we’ve participated in design projects with over 20 participants representing different departments. Large design teams introduce several additional complications that need to be overcome.

The first obstacle is ensuring the consistent participation of team members in all the design sessions. Everyone involved has normal day-to-day job responsibilities, in addition to their design team involvement. Inevitably participants will face pressing issues outside the design process that require their presence. The larger the group, the more frequently these absences will occur. When an individual misses important deliberations regarding key issues, the team will need to circle back with the individual, revisit the discussion and design options, and then perhaps reconsider earlier decisions. This discourse may be important to the overall design, but it negatively impacts the team’s productivity. Excessive backtracking and rehashing is frustrating and draining to the group.

With large design teams, you should avoid overly aggressive scheduling to ensure the highest level of consistent participation. Don’t schedule full weeks or even full days of design sessions. We suggest limiting design sessions to no more than three days in a week; Tuesday through Thursday seems to work best. Instead of full-day sessions, schedule two sessions per day, each two and a half hours in duration. Start the morning session a little later than the normal start time, take a two-hour midday break, and finish before the normal end time. Besides leaving Mondays and Fridays free, this schedule gives participants time at the beginning, middle, and end of each day to schedule meetings, deal with email, and handle other daily responsibilities. Each participant needs to allocate only fifteen hours per week to the design sessions. In exchange, each participant is expected to firmly commit this time to the design team. The goal is full participation in all design sessions, resulting in greater overall productivity and minimal backtracking.

When the focus of the design effort is on core conformed dimensions, it is important that all the business representatives participate, since the goal is enterprise agreement on the key attributes that must be conformed across the organization. However, once the team’s attention turns to specific business processes and the associated fact table designs, it is often possible to excuse, for several design sessions, those business representatives who are neither involved in nor interested in a particular business process.

Remember that any effort to define core conformed dimensions across business processes requires a clear and urgent message from senior management that they expect the effort to produce results. IT by itself cannot “herd the cats through the door.” Make sure you have clear and visible guidance from senior management before you start the dimension conforming process or you will be wasting your time.

Occasionally, gnarly design challenges will arise. Often these issues are relevant to, or only thoroughly understood by, a small group of participants. Trying to work through these very specific issues may be counterproductive for most members of the larger design team. In these situations, it makes more sense to table the discussions during the general design sessions and assign a smaller workgroup to work through the issues and then bring its conclusions and recommendations back to the larger group.

Effective facilitation is often another large design team challenge. Ideally, the lead dimensional data modeler has the required skills to facilitate the group. However, it is sometimes necessary to team a skilled facilitator alongside the dimensional data modeler. In either case, make sure the facilitator and/or modeler possess the key skills required to guide a large team effort:
 

  • Deep knowledge of dimensional data modeling concepts and techniques, including the pros and cons of design alternatives.
  • Understanding of the organization’s business processes and the associated business requirements surrounding those processes within the design effort’s scope.
  • Self-confidence to appreciate when to remain neutral on an issue and when to push back. Occasionally, the facilitator/modeler needs to take a contrary position to help participants clearly articulate their requirements and concerns.
  • Keen listening skills. Some participants will not be well versed in dimensional modeling, yet will be communicating key requirements which they’re unable to express in modeling terms.
  • Strong facilitation skills to draw out participants, adequately debate key issues, control wandering discussions, retain focus on the goal, and ultimately, ensure success.

We also suggest assigning one team member to document the design and outstanding issues during the sessions. In large group designs, the facilitator/modeler is primarily focused on understanding the requirements and translating those requirements into an optimal dimensional model. Their work requires considerable discussion and evaluation of design options. It’s a productivity gain if the facilitator/modeler doesn’t need to slow down the process to capture design decisions.
 

Bob Becker

Bob Becker is a member of Kimball Group. He has worked with business managers and IT professionals to prioritize, justify and implement large-scale decision support and data warehousing systems since 1990. Regardless of the industry, he is highly skilled at identifying business requirements, facilitating organizational consensus and designing dimensional data models.

November 5, 2013
© Kimball Group. All rights reserved. 

Newly updated edition of Ralph Kimball's groundbreaking book!

 

Chapter 2 of the new Toolkit provides a concise guide to our dimensional modeling techniques. We think this guide is so important that we have placed the content on our website in its entirety. Of course, if you want to follow the links for the full explanations and example use cases of each technique, you will need to read the Toolkit book!

The techniques are organized into nine major groupings. Here’s a teaspoon sip of the techniques in each section:
 

  • Fundamental concepts – business requirements, collaborative design workshops, business processes, facts, dimensions, grain, and final deployment in star schemas or OLAP cubes
  • Basic fact table techniques – fact table structure, three types of fact tables, factless fact tables, aggregate fact tables, and consolidated fact tables
  • Basic dimension table techniques – dimension table structure, surrogate keys, durable and supernatural keys, drilling down, degenerate dimensions, denormalized dimensions, hierarchies in dimensions, calendar dimensions, role-playing dimensions, junk dimensions, snowflaked dimensions, and outrigger dimensions
  • Integration via conformed dimensions – conformed dimensions, shrunken dimensions, drilling across, enterprise bus architecture and matrix, detailed implementation bus matrix, and opportunity/stakeholder matrix
  • Dealing with slowly changing dimension (SCD) attributes – the original three SCD techniques: type 1 (overwrite), type 2 (add row), type 3 (add attribute), plus five additional techniques covering advanced SCD situations
  • Dealing with dimension hierarchies – fixed depth positional hierarchies, slightly ragged hierarchies, fully ragged hierarchies with bridge tables or pathstring attributes
  • Advanced fact table techniques – fact table surrogate keys, centipede fact tables, fact/dimension dilemma, lag and duration facts, allocated facts, multiple currencies, multiple units of measure, and late arriving facts
  • Advanced dimension techniques – dimension to dimension joins, multivalued dimensions and bridge tables, behavior tag time series, aggregated facts as dimension attributes, dynamic value bands, text comments, step dimensions in sequential processes, hot swappable dimensions, abstract dimensions, audit dimensions, and late arriving dimensions
  • Special purpose schemas – supertype and subtype schemas, real-time fact tables, and error event schemas
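
As one concrete taste of the SCD grouping above, a type 2 change expires the current dimension row and adds a new one, preserving history. The in-memory sketch below uses illustrative column names (`row_is_current`, `row_effective_date`) in the spirit of the Toolkit’s audit-column conventions; a real implementation would of course operate on database tables:

```python
from datetime import date

def apply_scd2(dim_rows, incoming, today, tracked_attrs):
    """Type 2 SCD: on a tracked-attribute change, expire the current
    row and append a new current row for the same natural key."""
    current = next(r for r in dim_rows
                   if r["customer_id"] == incoming["customer_id"]
                   and r["row_is_current"])
    if all(current[a] == incoming[a] for a in tracked_attrs):
        return dim_rows  # nothing changed; no new row
    current["row_is_current"] = False
    current["row_end_date"] = today
    new_row = {**current,
               **{a: incoming[a] for a in tracked_attrs},
               "row_is_current": True,
               "row_effective_date": today,
               "row_end_date": date(9999, 12, 31)}
    return dim_rows + [new_row]

dim = [{"customer_id": 1, "city": "Ghent", "row_is_current": True,
        "row_effective_date": date(2020, 1, 1),
        "row_end_date": date(9999, 12, 31)}]
dim = apply_scd2(dim, {"customer_id": 1, "city": "Brussels"},
                 today=date(2024, 6, 1), tracked_attrs=["city"])
print(len(dim))                  # 2 -- history preserved
print(dim[0]["row_is_current"])  # False
print(dim[1]["city"])            # 'Brussels'
```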

In many ways, the Kimball dimensional modeling techniques are the heart of our approach. Anyone who understands the full set of over 80 techniques surely has the foundation for designing dimensional data warehouse and business intelligence systems. Try testing yourself: can you describe each of the techniques and where they should be used?

We hope you find this concise reference for the “official” Kimball dimensional modeling techniques to be useful and actionable. Now it’s time to “go forth and be dimensional!”
 

Ralph Kimball

Ralph Kimball is the founder of the Kimball Group and Kimball University, where he has taught data warehouse design to more than 10,000 students. He is known for the best-selling series of Toolkit books. He received a Ph.D. in man-machine systems from Stanford in 1973 and has spent nearly four decades designing systems for users that are simple and fast.

September 3, 2013
© Kimball Group. All rights reserved.
