The "cleanliness" of data is very important to data mining and analysis projects. Data can be cleaned through a number of methods, but it is best when the data is entered or acquired properly to begin with. The following excerpt from Kimball, Reeves, Ross and Thornthwaite's book The Data Warehouse Lifecycle Toolkit
- Where there are alternatives, identify the highest quality source system: the organization's system of record.
Examine the source to see how bad it is. Our favorite, and rather low-tech, approach is to perform a frequency count on each attribute:
SELECT my_attribute, count(*) FROM source_table GROUP BY my_attribute ORDER BY 1;
- Upon scanning this list, you will immediately find minor variations in spelling. The attributes with low frequencies can be checked, by hand, if necessary, and corrected. Beyond that, tools exist to help comb through the data to derive the implicit business rules and identify problems.
- Raise problems with the steering committee. Data quality is a business problem, but the warehouse will suffer if the users believe the data is bad. They don't understand the difference between the warehouse and the source systems. All they know is the data in the warehouse is wrong. Clarify the magnitude of the problem and its source, but take care not to blindside the source systems folks in the process. Work with them to identify the causes and potential solutions, and include them in the presentation to the steering committee. They don't want to provide bad data--it just hasn't been their job to improve the data the business users are most interested in.
- Fix problems at the source if at all possible. The ideal solution is to work with the source systems groups to correct the causes of data problems. Fixing the data downstream does not solve the problems in the long run. You will create a complex set of rules that need to be watched and maintained to account for new oddities in the data. In fact, some problems should not be fixed in the warehouse. If the source system is assigning incorrect state codes, it needs to be fixed. Usually, the source system group does not have the resources available immediately, so we agree to fix the data in the data staging process until the next release of the source system. Don't count on support from the source system folks. As we've said before, it's just not their job to provide this kind of information (yet).
- Fix some problems during the data staging. For the problems you agree to fix, work with your business users to determine the appropriate business rules to clean these up in the transformation process. Often, there is a single pass during the initial load to clean up and align history with current data. Codes, descriptions and IDs change when the business changes (or when the system runs out of new product numbers).
- Don't fix all the problems. Some problems will simply need to pass through the warehouse and out to the business users. This can help management understand the breadth and severity of the issue when they see spurious entries appearing in their reports. It can also shoot you in the foot. Make sure they understand the nature and source of these problems.
- Use data cleansing tools against the data, and use trusted sources for correct values like address.
- Work with the source system owners to help them institute regular examination and cleansing of the source systems.
- If it's politically feasible, make the source systems team responsible for a clean extract. If they are on the hook for an extract of clean data, they may be more eager to revise the transaction system to fix the data problems.
For More Information
- Click here to learn more about "The Data Warehouse Lifecycle Toolkit"
- The best data warehousing and business intelligence web links: tips, tutorials, scripts, reviews, and much more.
- Have a data warehousing or BI tip to offer your fellow developers and administrators? The best tips submitted will receive a cool prize--submit your tip today!
This was first published in May 2001