Processes and techniques in data mining

Can you please explain the differences between the process and the techniques used in data mining? Which technique is best suited for what type of data?

A wide variety of data mining techniques are currently available. They include neural networks, decision trees, support vector machines, Bayesian networks, nearest neighbor classification, and many others. In addition, each technique typically comes in a number of flavors, such as the CHAID, CART, and C4.5 types of decision trees.

The underlying process used to build models is the same regardless of the specific technique chosen (apart from minor variations, primarily in the pre- and post-processing steps). The basic idea is that you split your data into two parts. The first part is fed to the data mining system to build the model. Once the model is complete, the second part is fed to it to evaluate the model's quality. After the model builder is satisfied with the quality of the model, it can be applied to new data in a process called scoring. The model is then reused on new data until its predictions are no longer of sufficient quality (this determination involves comparing the behavior predicted by the model with the behavior actually observed).
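The process described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy data, the function names (`split_data`, `build_model`, `accuracy`), the 30% holdout fraction, and the 0.8 rebuild threshold are all hypothetical, and a simple nearest neighbor classifier stands in for whatever technique a real data mining system would use.

```python
import random

def split_data(records, holdout_fraction=0.3, seed=0):
    """Shuffle the records and split them into a build set and a holdout set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    cut = len(shuffled) - n_holdout
    return shuffled[:cut], shuffled[cut:]

def build_model(build_set):
    """Build a 1-nearest-neighbor model, which simply memorizes the build set."""
    def predict(features):
        def sq_dist(record):
            return sum((a - b) ** 2 for a, b in zip(record[0], features))
        # Return the label of the closest record in the build set.
        return min(build_set, key=sq_dist)[1]
    return predict

def accuracy(model, labeled_records):
    """Fraction of labeled records for which the model predicts the right label."""
    hits = sum(1 for features, label in labeled_records if model(features) == label)
    return hits / len(labeled_records)

# Hypothetical toy data: two well-separated groups of (features, label) records.
rng = random.Random(42)
records = (
    [((rng.uniform(0, 1), rng.uniform(0, 1)), "no") for _ in range(50)]
    + [((rng.uniform(5, 6), rng.uniform(5, 6)), "yes") for _ in range(50)]
)

# 1. Split the data, 2. build on the first part, 3. evaluate on the second part.
build_set, holdout_set = split_data(records)
model = build_model(build_set)
holdout_accuracy = accuracy(model, holdout_set)

# 4. Scoring: apply the accepted model to new, unlabeled data.
new_customers = [(0.5, 0.5), (5.5, 5.5)]
scores = [model(features) for features in new_customers]

# 5. Later, compare predictions with what actually happened and decide
#    whether the model still performs well enough (threshold is an assumption).
observed = ["no", "yes"]
live_accuracy = sum(p == o for p, o in zip(scores, observed)) / len(observed)
needs_rebuild = live_accuracy < 0.8
```

The same five steps apply if `build_model` is swapped for a decision tree or a neural network; only the model-building step changes.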

When it comes to selecting the best technique for a particular type of data, there are no reliable rules. Some techniques might not be a good match for the raw data, but most data mining systems can cope by transforming it (e.g., neural networks typically need numeric inputs between 0 and 1, so categorical data must be converted before a neural network model can be created). The best approach to model creation is to try a number of different techniques and then statistically compare the results. Setting up the experiments involves some work, but it is the only way to consistently generate good models.
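As an illustration of the kind of transformation mentioned above, the following minimal sketch rescales a numeric column to the range 0 to 1 and converts a categorical column into 0/1 indicator columns. The function names and the sample columns (`ages`, `regions`) are hypothetical; real data mining systems usually perform these steps internally and may use different schemes.

```python
def min_max_scale(values):
    """Rescale numeric values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map every value to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(values):
    """Turn a categorical column into one 0/1 indicator column per category."""
    categories = sorted(set(values))
    encoded = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return encoded, categories

# Hypothetical raw columns for three records.
ages = [20, 35, 50]
regions = ["east", "west", "east"]

scaled_ages = min_max_scale(ages)
encoded_regions, categories = one_hot(regions)

# Combine into input rows that contain only numbers between 0 and 1.
rows = [[age] + indicators for age, indicators in zip(scaled_ages, encoded_regions)]
```

In practice the minimum, maximum, and category list would be taken from the data used to build the model and then reused unchanged when scoring new data, so that both are transformed in the same way.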


This was first published in September 2002
