Can you please explain the differences between the process and techniques used in data mining?
Which technique is best suited to which type of data?
A wide variety of data mining techniques is currently available. They include neural networks, decision trees, support vector machines, Bayesian networks, nearest neighbor classification, and many others. In addition, each technique typically comes in a number of flavors, such as the CHAID, CART, and C4.5 types of decision trees.
The underlying process used to build models is the same regardless of the specific technique chosen (modulo some minor variations, primarily in the pre- and post-processing steps). The basic idea is that you take your data and split it into two parts. The first part is fed to the data mining system so that the model can be built. Once the completed model is available, the second part is fed to the model to evaluate its quality. After the model builder is satisfied with the quality of the model, it can be applied to new data in a process called scoring. The model is then reused over and over on new data until the predictions it makes are no longer of sufficient quality (this determination involves comparing the behavior predicted by the model with the behavior actually observed).
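The split, build, evaluate, and score steps described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the toy data, the function names (split_holdout, build_model, score, accuracy), and the 30% hold-out fraction are all hypothetical, and a simple nearest neighbor classifier stands in for whatever technique a real data mining system would use.

```python
import random

def split_holdout(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows, then split them into a build part and an evaluation part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def build_model(train_rows):
    """Build a 1-nearest-neighbor 'model', which simply stores the training rows."""
    return list(train_rows)

def score(model, features):
    """Score one record: return the label of the closest stored training row."""
    def squared_distance(row):
        return sum((a - b) ** 2 for a, b in zip(row[0], features))
    return min(model, key=squared_distance)[1]

def accuracy(model, rows):
    """Fraction of rows whose known label matches the model's prediction."""
    hits = sum(score(model, features) == label for features, label in rows)
    return hits / len(rows)

# Hypothetical toy data: two well-separated clusters of (features, label) rows.
rng = random.Random(42)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), "low") for _ in range(50)]
data += [((rng.gauss(10, 1), rng.gauss(10, 1)), "high") for _ in range(50)]

train, holdout = split_holdout(data)          # 1. split the data into two parts
model = build_model(train)                    # 2. build the model on the first part
holdout_accuracy = accuracy(model, holdout)   # 3. evaluate it on the second part
new_prediction = score(model, (9.5, 10.2))    # 4. score a new, unlabeled record

print("hold-out accuracy:", holdout_accuracy)
print("prediction for new record:", new_prediction)
```

Calling accuracy again later, on recent records whose actual outcomes have become known, is one simple way to check whether the model's predictions are still of sufficient quality.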
In terms of selecting the best technique for a particular type of data, there are no rules that can be relied upon. Some techniques might not be a good match for the raw data, but most data mining systems can cope by transforming the data (e.g., neural networks typically need the inputs to be numbers between 0 and 1, so categorical data needs to be converted before a neural network model can be created). The best approach to model creation is to try a number of different techniques and then statistically compare the results. Setting up the experiments involves some work, but it is the only way to consistently generate good models.
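As a rough sketch of both ideas in this paragraph, the Python below first converts a record with a categorical field into inputs between 0 and 1, and then compares two techniques on the same cross-validation folds. Everything specific here is an assumption made for illustration: the age and region fields, the synthetic target, the choice of nearest neighbor versus a majority-class baseline, and the use of five folds with a paired t-statistic as the statistical comparison.

```python
import random
import statistics

REGIONS = ["north", "south", "west"]          # hypothetical categorical field

def encode(age, region):
    """Map a raw record to inputs between 0 and 1 (min-max scaling plus one-hot)."""
    scaled_age = (age - 18) / (90 - 18)       # assumes ages fall in the range 18..90
    return [scaled_age] + [1.0 if region == r else 0.0 for r in REGIONS]

def nearest_neighbor(train, features):
    """Technique A: predict the label of the closest training row."""
    def squared_distance(row):
        return sum((a - b) ** 2 for a, b in zip(row[0], features))
    return min(train, key=squared_distance)[1]

def majority_class(train, features):
    """Technique B: always predict the most common training label (a baseline)."""
    labels = [label for _, label in train]
    return max(sorted(set(labels)), key=labels.count)

def fold_accuracies(technique, rows, k=5):
    """k-fold cross-validation: accuracy of the technique on each held-out fold."""
    scores = []
    for i in range(k):
        test = rows[i::k]
        train = [row for j, row in enumerate(rows) if j % k != i]
        hits = sum(technique(train, features) == label for features, label in test)
        scores.append(hits / len(test))
    return scores

# Synthetic records with a made-up target (age over 50), for illustration only.
rng = random.Random(0)
rows = []
for _ in range(200):
    age, region = rng.randint(18, 90), rng.choice(REGIONS)
    rows.append((encode(age, region), age > 50))

acc_a = fold_accuracies(nearest_neighbor, rows)
acc_b = fold_accuracies(majority_class, rows)

# Paired comparison: both techniques saw the same folds, so compare fold by fold.
diffs = [a - b for a, b in zip(acc_a, acc_b)]
mean_diff = statistics.mean(diffs)
std_err = statistics.stdev(diffs) / len(diffs) ** 0.5
t_stat = mean_diff / std_err if std_err > 0 else None  # None if folds do not vary

print("mean accuracy A:", statistics.mean(acc_a))
print("mean accuracy B:", statistics.mean(acc_b))
print("mean difference:", mean_diff, "t statistic:", t_stat)
```

With five folds, a t-statistic well above roughly 2.78 (the two-sided 95% critical value for four degrees of freedom) would suggest the difference is not just noise. Because cross-validation folds share training data, this should be read as a rough guide rather than an exact test.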
This was first published in September 2002