A common objective of ETL jobs is to enforce data quality on the data extracted from the operational system during the transformation step. One major problem is that the same value often appears in different formats, e.g.:
- Prof., Prof, Professor
- IBM, I.B.M., International Business Machines
- Mr, Mr. / Mrs, Mrs. / Dr, Dr., etc.
The same data can appear in a wide range of variants. In this example, the data moves from an operational system on SQL Server to a data warehouse on Oracle on Solaris. One solution is to use scripts (Korn Shell and AWK), either applied directly or as exits in ETL tools.
The Steps Are:
- Extract the data to a flat file using BCP out.
- FTP the file from the operational system server to the Solaris data warehouse server.
- Build a parameter file (also a tab-delimited flat file) with one line per correction. Each line contains three columns: the number of the column to be compared, the bad data (i.e., From), and the good data (i.e., Changed to).
- Develop one AWK script to transform the data from bad to good.
- Build another shell script (Korn Shell in this case) to run the AWK script, so that the process is generalized, i.e., it works on any extracted flat file and parameter file.
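The last three steps can be sketched roughly as follows. This is only an illustration and not the author's script: the function name clean_columns is hypothetical, the extract is assumed to be tab-delimited like the parameter file, and a field is replaced only when its whole value matches the bad data exactly. It is written in plain POSIX shell syntax, so it should also run under Korn Shell.

```shell
#!/bin/sh
# Illustrative sketch only; clean_columns is a hypothetical name.
# Usage: clean_columns PARAM_FILE DATA_FILE > CLEANED_FILE
#   PARAM_FILE: tab-delimited lines of  column-number <TAB> bad-data <TAB> good-data
#   DATA_FILE:  tab-delimited extract (assumed delimiter)
clean_columns() {
    awk -F'\t' -v OFS='\t' '
        # Parameter file (first operand): remember each correction rule,
        # keyed by column number and bad value.
        FILENAME == ARGV[1] {
            if (NF >= 3) good[$1, $2] = $3
            next
        }
        # Data file: replace every field whose exact value has a rule
        # for that column, then print the record.
        {
            for (i = 1; i <= NF; i++)
                if ((i, $i) in good) $i = good[i, $i]
            print
        }
    ' "$1" "$2"
}
```

With a parameter line such as 2, Prof., Professor (separated by tabs), a call like clean_columns params.txt extract.txt > extract.clean.txt would rewrite Prof. to Professor in column 2 only and leave every other column untouched. Keeping the rules in the parameter file rather than in the script is what lets the same code serve any extract.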
On Thursday, I will pass on the Shell script code.
This was first published in April 2002