Coding data cleansing in AWK

One challenge of implementing data quality on data extracted from operational systems is that the same values arrive in different formats.

A common objective of ETL jobs is to implement data quality on the data extracted from the operational system during the transformation process. One major challenge is the presence of the same data in different formats, e.g.:

  1. Prof., Prof, Professor
  2. IBM, I.B.M., International Business Machine
  3. Mr, Mr. / Mrs, Mrs. / Dr, Dr. etc

The same data can appear in a wide variety of forms. The data may also move between platforms -- for example, from SQL Server on the operational system to Oracle on Solaris. One solution is to use scripts -- Korn Shell and AWK -- either applied directly or invoked as exits in ETL tools.

The steps are:

  1. Extract the data to a flat file using BCP out.
  2. FTP the file from the operational system server to the Solaris data warehouse server.
  3. Build a parameter file (also a flat file, tab delimited) containing one line per substitution. Each line has three columns: the column number to be compared, the bad data (i.e., the "from" value), and the good data (i.e., the "changed to" value).
  4. Develop an AWK script to transform the data from bad to good.
  5. Build another shell script (Korn Shell in this case) to run the AWK script, generalizing the process so it works on any extracted flat file and parameter file.
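Steps 3 and 4 can be sketched as follows. This is a minimal, hypothetical example -- the file names (extract.txt, params.txt, cleansed.txt) and sample values are illustrative, not the author's actual script. The AWK program loads the parameter file into a lookup table keyed on (column number, bad value), then rewrites any matching field in the data file:

```shell
#!/bin/sh
# Sample extracted data (tab-delimited) and parameter file.
# Parameter file columns: column # to compare, bad data, good data.
printf 'Prof.\tSmith\nProf\tJones\n'              > extract.txt
printf '1\tProf.\tProfessor\n1\tProf\tProfessor\n' > params.txt

# Pass 1 (NR == FNR): read params.txt into the "fix" table.
# Pass 2: for each field of extract.txt, replace (column, value)
# pairs that match a "bad" entry with the corresponding "good" value.
awk -F'\t' -v OFS='\t' '
    NR == FNR { fix[$1 SUBSEP $2] = $3; next }
    {
        for (i = 1; i <= NF; i++)
            if ((i SUBSEP $i) in fix)
                $i = fix[i SUBSEP $i]
        print
    }
' params.txt extract.txt > cleansed.txt

# Both "Prof." and "Prof" in column 1 are now "Professor".
cat cleansed.txt
```

Because the column number, bad value, and good value all come from the parameter file, the same AWK program works unchanged for any extract -- only the parameter file differs, which is what makes the step-5 wrapper generic.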

On Thursday, I will pass on the Shell script code.



This was first published in April 2002
