Ralph Kimball, et al.
Your data is coming from an operational system. So what kind of data entry problems can you expect? This tip from Ralph Kimball, Laura Reeves, Margy Ross, and Warren Thornthwaite's book, The Data Warehouse Lifecycle Toolkit
The actual content of the data presents a substantial challenge to the warehouse. The data coming out of the operational systems is often surprisingly messy. Problems like the following are common:
Inconsistent or incorrect use of codes and special characters. The Gender field might contain values like "M," "F," "m," "f," "y," "n," "u," and blank. It is common to find values that can't be right when you begin to explore the data.
A single field is used for unofficial or undocumented purposes. Comment fields and extra address lines often contain information gold, but it's usually free-form entry and needs to be cleaned up.
Overloaded codes. Many systems, especially older ones built in the days when saving bits was important, use a single code for multiple purposes. For example, "when the catalog is mailed to one of our existing customers, the code refers to their computed likelihood-to-purchase score; when the catalog is mailed to someone on a purchase list, the code refers to which vendor sold us the list." We're not making this up.
Evolving data. Systems that have evolved over time may have old data and new data that use the same fields for different purposes or where the meaning of the codes have changed over time.
- Missing, incorrect, or duplicate values. Names and addresses are the classic example of this problem. Transaction systems do not need to collect information that is useful for business analysis to send out the invoices. In other words, the exact name of a customer isn't necessary, as long as the customer is not offended (it could be IBM or I.B.M. or Tom Jones or Thomas Jones). But they can end up looking like two different customers in the warehouse.
The historical responsibilities and incentive structure around the operational systems are at the root of this confusion. These systems were created to accomplish specific tasks, like take orders, ship products, or send out invoices. The name of the customer doesn't have to be exactly the same for them to successfully receive the product--the ABC company will still receive shipment with most any spelling variation. (This is not to say that correct spelling isn't important. There is another rumor in our little industry about a well-known actress who closed her sizable investment account when they couldn't spell her name right--Barbra with two a's.) If the orders are shipped and the invoices are sent out on time, nothing else matters--until the data warehouse project comes along.
Adding the data validation checks in the entry system can add unacceptable overhead to the transaction. In a job function where transactions per second is the sole measure of success, you have to be pretty persuasive to add to the transaction time. Fixing the source system so they deliver accurate information is the right thing to do, but we advise you to choose your battles carefully and muster strong steering committee support for these initiatives.
Click on the book title to learn more about The Data Warehouse Lifecycle Toolkit.
Did you find this tip helpful? E-mail and let us know.
This was first published in July 2001