Coding data cleansing in AWK

A common objective of ETL jobs is to enforce data quality on the data extracted from the operational system during the transformation step. One major problem is that the same value shows up in different formats, e.g.:

  1. Prof., Prof, Professor
  2. IBM, I.B.M., International Business Machines
  3. Mr, Mr. / Mrs, Mrs. / Dr, Dr., etc.

The same data can appear in a wide range of variants. Data from the operational systems may be moving, for example, from SQL Server to Oracle on Solaris. One solution is to use scripts -- Korn Shell and AWK -- either run directly or called as exits from ETL tools.

The Steps Are:

  1. Extract the data with BCP out to a flat file.
  2. FTP the file from the operational system server to the Solaris data warehouse server.
  3. Build a parameter file (also a tab-delimited flat file) that contains one rule per line. Each line contains 3 columns -- Column # to be compared, Bad Data (i.e., From), Good Data (i.e., Changed to).
  4. Develop one AWK script to transform the data from Bad to Good.
  5. Build another shell script (Korn Shell in this case) to run the AWK script so that the process is generalized, i.e., it works on any extracted flat file and parameter file.
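
Steps 3 to 5 can be sketched as follows. This is only a minimal illustration, not the author's script: the function name cleanse, the variable names, and the file names in the usage note are all hypothetical. It assumes the extracted file is tab delimited like the parameter file, and that a rule applies only when a field matches the Bad Data value exactly. It sticks to plain POSIX shell and AWK, so it should also run under Korn Shell.

```shell
#!/bin/sh
# cleanse PARAM_FILE DATA_FILE   (hypothetical helper name)
# Writes DATA_FILE to standard output with every field that exactly matches
# a rule in PARAM_FILE replaced by that rule's Good Data value.
# Assumption: both files are tab delimited.
cleanse() {
    awk -F'\t' -v OFS='\t' -v params="$1" '
        BEGIN {
            # Load the rules: column number, bad value, good value.
            while ((getline line < params) > 0) {
                if (split(line, f, "\t") >= 3)
                    map[f[1], f[2]] = f[3]
            }
            close(params)
        }
        {
            # Check each field against the rules for its column number.
            for (i = 1; i <= NF; i++)
                if ((i, $i) in map)
                    $i = map[i, $i]
            print
        }
    ' "$2"
}
```

With a parameter file containing the line 2<TAB>Professor<TAB>Prof., a call such as cleanse params.txt extract.txt > clean.txt would rewrite "Professor" to "Prof." in column 2 only and leave every other column untouched. Keeping the rules in a separate file means the same script can be reused for any extract without editing the AWK code.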

On Thursday, I will pass on the Shell script code.

For more information, check out SearchCRM's Best Web Links on Business Intelligence and Data Analysis.

Have a question about this strategy? Ask William now.


This was first published in April 2002
