Coding data cleansing in AWK

A common objective of ETL jobs is to enforce data quality on the data extracted from the operational system during transformation. One major problem is that the same value shows up in different formats, e.g.:

  1. Prof., Prof, Professor
  2. IBM, I.B.M., International Business Machines
  3. Mr, Mr. / Mrs, Mrs. / Dr, Dr. etc

The same data can appear in many variants. Data from the operational systems may also be moving across platforms, for example from SQL Server to Oracle on Solaris. One solution is to use scripts -- Korn Shell and AWK -- either applied directly or as exits in ETL tools.

The steps are:

  1. Extract the data to a flat file using BCP out.
  2. FTP the file from the operational system server to the Solaris data warehouse server.
  3. Build a parameter file (also a tab-delimited flat file) with one rule per line. Each line contains three columns -- the column number to be compared, the bad data (i.e., from), and the good data (i.e., changed to).
  4. Develop one AWK script to transform the data from bad to good.
  5. Build another shell script (Korn Shell in this case) to run the AWK script, so that the process is generalized, i.e., works on any extracted flat file and parameter file.
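
Steps 3 to 5 could be sketched roughly as follows. This is an illustration only, not the author's script: the function name cleanse and the file names in the usage note are hypothetical, and it assumes both files are tab-delimited and that a rule applies only when a field matches the bad value exactly (case-sensitive, whole field). On Solaris, the default awk may be too old for the -v option, so nawk or /usr/xpg4/bin/awk may be needed instead.

```shell
#!/bin/sh
# Hypothetical sketch: apply "column / bad value / good value" rules
# from a parameter file to a tab-delimited extract.
# Assumption: both files use a single tab as the field separator.

cleanse() {
    # $1 = parameter file (column number, bad value, good value)
    # $2 = extracted data file
    awk -F'\t' -v OFS='\t' '
        # First file (parameter file): remember each replacement rule,
        # keyed by column number and bad value.
        NR == FNR {
            if (NF >= 3) good[$1 + 0, $2] = $3
            next
        }
        # Second file (data): replace every field that has a matching rule.
        {
            for (i = 1; i <= NF; i++)
                if ((i, $i) in good)
                    $i = good[i, $i]
            print
        }
    ' "$1" "$2"
}
```

With a rules file containing lines such as "1<TAB>Prof.<TAB>Professor", a call like `cleanse rules.txt extract.txt > extract.clean.txt` would write the cleansed copy to a new file. Keeping the rules in a separate file means the same script can be reused for any extract, which is the generalization step 5 describes. One caveat of this sketch: an empty parameter file would cause the data file to be read as rules, so a real wrapper would probably check for that first.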

On Thursday, I will pass on the Shell script code.

This was first published in April 2002
