So you've built your data warehouse and now you're ready to mine it...Not so fast! Depending on the data mining tool, you may have to do a significant amount of work to prepare your data for mining. The following tip from Ralph Kimball and Richard Mertz's The Data Webhouse Toolkit
Data Transformations for Webhouses in General
There is a set of basic data transformations you are probably already doing if you have a data warehouse. You are performing these transformations in your data extract system that pulls data from your legacy system, and you may be housing the data in your operational data store for cleaning and reformatting. The cleaned data is exported from the operational data store into one or more of your subject-oriented data marts. Although there may be many transformation steps in your data extract system, the ones of particular interest to data mining are:
- Resolving inconsistent legacy data formats, such as ACSII and EBCDIC, and resolving inconsistent data encoding, geographic spellings, abbreviations, and punctuation. Hopefully, you are doing this already.
- Stripping out unwanted fields. Legacy data contains fields that are useless from an analysis point of view. A data mining tool may waste cycles trying to find patterns these fields.
- Interpreting codes into text. Hopefully, these codes are already in your dimension tables so that adding the explanatory text is an easy change to the dimension table.
- Combining data such as customer data from multiple sources under a common key.
- Finding multiply used fields in your legacy data, when you must interpret the field's value based on the context of the legacy record. In some cases you may not even realize that you have a legacy field that is hiding multiple uses. Your data mining tool might go crazy figuring this out.
Data Transformations for all Forms of Data Mining
This group of data transformations may not be needed for standard reporting and analysis functions in a data warehouse but is required for just about every data mining application. Many of these transformations affect the numeric, additive facts in the central fact table of your dimensional data mart:
- Flag normal, abnormal, out of bounds or impossible facts. For each of these circumstances, it is better to mark the data with a status flag so that it can be optionally constrained into or out of the analysis than it is to delete the unusual value.
- Recognize random or noise values from context and mask out. When this case can be recognized, the random number should be replaced with a null value.
- Apply a uniform treatment to null values. Null values can often cause a data mining tool to hiccup. In many cases the null value is represented by a special value of what should be a legitimate fact. Null values are tricky because philosophically there are at least two kinds of nulls: there was no data at the time of measurement or the measurement process failed to deliver the data.
- Flag fact records with changed status. A very helpful data transformation is to add a special status indicator to a fact table record to show that the status of the account has just changed or is about to change. Without this a canceled account would be noticed by its absence in the next billing cycle, which is impractical in most database applications.
- Aggregate the data. Aggregations may be tempting either to reduce the sheer volume of data being analyzed or to provide analysis buckets that are statistically significant.
Special Data Transformations Depending on the Data Mining Tool
- Divide data into training, test, and evaluation sets. Nearly all data mining applications require that the raw input data be separated into three groups. Perhaps the data should be separated randomly into three control groups. The first group is used to train the data mining tool, the second is used to test the parameters established by the first and the third it the evaluation set of data.
- Adding computing fields as input or as targets. You may not have to modify your base schemas with these computed values depending on how complicated the added information is to compute at query time.
- Mapping continuous values into ranges. Some data mining tools like decision trees encourage you to "band" continuous values into discrete ranges.
- Normalizing values between 0 and 1. Neural network data mining tools usually require that all numeric values be mapped into a range of zero to one. Some recommend that this range should be a little larger than the observed data to accommodate new data that falls outside the actual data.
- Converting from textual to numeric or numeral category. Some data mining tools may operate only on numeric input. In these cases, discrete text values need to be assigned code.
- Emphasizing the unusual case abnormally to drive recognition. Many times a data mining tool is used to describe and recognize unusual cases, fraud, for instance. The problem is your training data may not contain any instances of fraud so you have to replicate or seed the training data with the desired target.
Although many of the data transformations we have described probably should be done in a general data warehouse environment, the demands of data mining really force the data cleaning issue. The purpose of data mining should be to discover meaningful patterns in your data, not to stumble over data cleanliness problems.
This was first published in April 2001