Data quality management: Handling violations, part 1

A workable way to approach data quality management is to define data quality as the "absence of intolerable defects." Here's a look at some potential intolerable defects in your environment and what can be done about them.

Data types constrain the values in a column ... to a degree. Any mix of 0 to 20 characters can go into a character(20) column. However, if this is a name field, there are some characters that you would not expect to find in it, such as % and $. These would be "red flags" that the field contains inappropriate data. There are also often numerous misspellings and incorrect alternative spellings of last names. A manual review of the column contents, with counts of each unique value, will often bring to light the one correct spelling.
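As a rough illustration of that kind of review, the sketch below flags values with unexpected characters and counts each unique value. The allowed character set, the function names and the sample data are all assumptions made for the example; a real name column may legitimately need accented letters or other characters.

```python
import re
from collections import Counter

# Assumption: names may contain ASCII letters, spaces, periods,
# apostrophes and hyphens. Anything else is treated as a red flag.
ALLOWED_NAME = re.compile(r"[A-Za-z .'\-]*")

def red_flags(names):
    """Return the values that contain characters outside the allowed set."""
    return [n for n in names if not ALLOWED_NAME.fullmatch(n)]

def value_counts(names):
    """Return (value, count) pairs, most frequent first, for manual review."""
    return Counter(names).most_common()

# Made-up sample column contents
sample = ["McKnight", "McKnight", "McNight", "McKnight", "Sm$th", "100% Jones"]

print(red_flags(sample))
for value, count in value_counts(sample):
    print(f"{count:>3}  {value}")
```

In the counts, a rare variant sitting next to a frequent one (McNight next to McKnight here) is the sort of pattern a reviewer would look at first.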

There are two approaches to handling the violations, and usually a combination is best. The first is to "generalize" into rules the various formatting errors that are found in the field. Typical formatting errors found in name columns include:

  • Space in front of name
  • Two spaces between first and last name and/or middle initial
  • No period after middle initial
  • Inconsistent use of middle initial (sometimes used, sometimes not)
  • Use of all caps
  • Use of "&" instead of "and" when indicating plurality
  • Use of slash instead of hyphen

On and on it goes, especially in environments where original data entry is "free form," unconstrained and without the use of master data as a reference.
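Several of the formatting errors listed above can be expressed as simple rules. The following is a minimal sketch, not a complete cleansing routine: the function name `normalize_name` and the specific regular expressions are illustrative assumptions, and inconsistent use of a middle initial is deliberately left out because no rule can supply a missing initial.

```python
import re

def normalize_name(raw):
    """Apply generalized formatting rules to a free-form name value."""
    name = raw.strip()                         # space in front of name
    name = re.sub(r"\s+", " ", name)           # two spaces between name parts
    name = name.replace("/", "-")              # slash instead of hyphen
    name = re.sub(r"\s*&\s*", " and ", name)   # "&" instead of "and"
    if name.isupper():                         # use of all caps
        # Caveat: title-casing loses internal capitals (MCKNIGHT -> Mcknight),
        # so such values still need per-value handling.
        name = name.title()
    # No period after a middle initial: a lone capital followed by a space
    name = re.sub(r"\b([A-Z])(?=\s)", r"\1.", name)
    return name

print(normalize_name("  JOHN  Q PUBLIC"))
print(normalize_name("Bob & Mary Smith"))
print(normalize_name("Ann Smith/Jones"))
```

The order of the rules matters: whitespace is collapsed before the other substitutions so that later patterns only have to deal with single spaces.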

Things like use of initials and misspellings (e.g., William McNight instead of William McKnight) cannot be generalized into rules, so they need to be handled separately. You can map the incorrect data to the correct data in your data warehouse's staging area. As new data is discovered (e.g., is Bill McKnight the same as William McKnight?), it is held out until it has been reviewed, after which it can either be mapped to the correct value or confirmed as correct itself, and then re-routed through the ETL process.

If adopting this approach, be sure the reviews are held quickly as a matter of procedure, because data will be held out of the data warehouse until it is accounted for. The mapping takes place after the rules are applied.
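The mapping and hold-out flow might look something like the sketch below. Everything here is hypothetical: the `CORRECTIONS` and `CONFIRMED` structures stand in for tables in the staging area, and `apply_rules` is a placeholder for whatever formatting rules run before the mapping.

```python
# Hypothetical reviewed mapping: known-bad value -> correct value
CORRECTIONS = {"William McNight": "William McKnight"}
# Hypothetical set of values already reviewed and confirmed correct
CONFIRMED = {"William McKnight"}

def apply_rules(raw):
    """Placeholder for the generalized formatting rules, which run first."""
    return " ".join(raw.split())

def route(raw_values):
    """Split values into those ready to load and those held out for review."""
    loadable, held_out = [], []
    for raw in raw_values:
        cleaned = apply_rules(raw)
        if cleaned in CORRECTIONS:
            loadable.append(CORRECTIONS[cleaned])   # known bad, mapped
        elif cleaned in CONFIRMED:
            loadable.append(cleaned)                # known good
        else:
            held_out.append(cleaned)                # new value, needs review
    return loadable, held_out

loadable, held_out = route(["William  McNight", "William McKnight", "Bill McKnight"])
print(loadable)
print(held_out)

# After a reviewer decides Bill McKnight is William McKnight,
# the held-out value is re-routed through the same step.
CORRECTIONS["Bill McKnight"] = "William McKnight"
reloaded, still_held = route(held_out)
print(reloaded, still_held)
```

Until the review happens, the held-out rows never reach the warehouse, which is why a slow review cycle translates directly into missing data.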

With either or both approaches, I recommend bringing the original "bad" value over as well, since users will often want to know what the source data actually contained.
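One way to carry the source value along is to keep it next to the cleaned value in the same record. This is only a sketch under assumed names (`NameRecord`, `source_name`, `clean_name`, `clean_with_lineage`), and the cleaning inside it is reduced to whitespace collapsing plus a lookup for brevity.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NameRecord:
    source_name: str   # value exactly as it arrived from the source system
    clean_name: str    # value after rules and mapping

    @property
    def was_changed(self):
        """True when cleansing altered the source value."""
        return self.source_name != self.clean_name

def clean_with_lineage(raw, corrections):
    """Clean a value but keep the untouched source value alongside it."""
    cleaned = " ".join(raw.split())              # simplified stand-in for the rules
    cleaned = corrections.get(cleaned, cleaned)  # apply a reviewed mapping, if any
    return NameRecord(source_name=raw, clean_name=cleaned)

rec = clean_with_lineage(" William McNight", {"William McNight": "William McKnight"})
print(rec)
print(rec.was_changed)
```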

Read part two of this tip.

For more information, check out this Learning Guide for Data Quality.

This was first published in May 2002
