Data mining is a young technology, and it can sometimes be counterintuitive, so it is important to understand the basics of this still-developing field. This tip from Berson, Smith, and Thearling's book Building Data Mining Applications for CRM (McGraw-Hill) describes some of data mining's emerging technologies and offers insight into them:
The common goals for all data mining applications and techniques include the detection, interpretation, and prediction of qualitative and/or quantitative patterns in data. To achieve these goals, data mining solutions employ a wide variety of techniques drawn from machine learning, artificial intelligence, statistics, and database query processing. These techniques are also based on mathematical approaches such as multi-valued logic, approximation theory, and dynamic systems.
Let's take a closer look at the approaches that underlie the most contemporary research in data mining.
The induction approach is based on the principle of proceeding from the specific to the general (a reversal of the deduction process), and has its roots in machine learning and artificial intelligence (AI). It is usually implemented as a search through the space of possible scenarios or hypotheses, and employs some special criteria to arrive at a good generalization.
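As a minimal sketch of induction as a search through a hypothesis space, the Python example below generalizes from a handful of specific labeled cases to a single rule. The data, the rule family (`value >= t`), and the criterion (fewest training errors) are all illustrative assumptions, not something taken from the book.

```python
from typing import List, Optional, Tuple

# Hypothetical specific observations: (monthly_spend, bought_upgrade)
examples: List[Tuple[float, bool]] = [
    (12.0, False), (18.0, False), (25.0, False),
    (31.0, True), (40.0, True), (55.0, True),
]

def induce_threshold_rule(
    examples: List[Tuple[float, bool]],
) -> Tuple[Optional[float], int]:
    """Search hypotheses of the form `value >= t`; keep the one with fewest errors."""
    values = sorted({value for value, _ in examples})
    # Hypothesis space: midpoints between adjacent observed values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_t: Optional[float] = None
    best_errors = len(examples) + 1
    for t in candidates:
        # Selection criterion: number of training examples the rule gets wrong.
        errors = sum((value >= t) != label for value, label in examples)
        if errors < best_errors:
            best_t, best_errors = t, errors
    return best_t, best_errors

threshold, errors = induce_threshold_rule(examples)
print(f"Induced rule: spend >= {threshold} (training errors: {errors})")
```

Real induction systems search far richer hypothesis spaces (decision trees, rule sets) and use criteria that guard against overfitting, but the shape of the search is the same.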
The database querying approach has its roots in database management systems. Because most corporate data stores are implemented as some form of data warehouse built on top of a relational database, the process of data mining can be viewed as a form of database query processing (albeit a quite sophisticated one). Research in this area is focused on two tacks:
- Enhancements of the semantic expressiveness of query languages such as SQL (structured query language) and OQL (object query language) to allow a data mining question (e.g., find all customers that have a propensity to buy more of a given product) to be defined within the constraints of the language grammar
- Enhancements of the underlying data model; this work addresses the question of whether the relational data model, which is a good vehicle for data abstraction, is also a good model for data mining
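To make the idea of a mining question posed as a query more concrete, here is a small sketch using Python's built-in `sqlite3` module. Standard SQL has no notion of "propensity", so the example substitutes a crude proxy (at least two separate purchases of the product); the `purchases` table, its columns, and the proxy itself are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (customer TEXT, product TEXT, quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [
        ("alice", "widget", 2),
        ("alice", "widget", 3),
        ("bob", "widget", 1),
        ("carol", "gadget", 4),
        ("carol", "widget", 1),
        ("carol", "widget", 1),
    ],
)

# Assumed proxy for "propensity to buy more": repeat purchases of the product.
rows = conn.execute(
    """
    SELECT customer, COUNT(*) AS orders, SUM(quantity) AS units
    FROM purchases
    WHERE product = ?
    GROUP BY customer
    HAVING COUNT(*) >= 2
    ORDER BY customer
    """,
    ("widget",),
).fetchall()

for customer, orders, units in rows:
    print(customer, orders, units)
```

The gap between this hand-written proxy and a genuine predictive question is the kind of limitation that motivates the query-language research described above.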
The compression approach is based on computational learning theory and on the feasibility of models based on the Minimum Description Length principle. Several commercial data mining systems use this approach to determine the effectiveness of uncovered patterns. The essence of the approach is as follows:
- Several data mining techniques can be applied to the same data set to potentially yield similar results.
- Rather than use several techniques and perform an exhaustive analysis of all patterns, the idea is to compress the entire answer space to a smaller but stronger set of discovered patterns that are easier to describe; thus the approach is called compression.
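One way to picture the compression idea is the sketch below, which greedily keeps only those candidate patterns that shorten the total description of the data. The cost function is a deliberately crude assumption (it counts symbols rather than bits), and the transactions and candidate patterns are made up; it is an illustration of the Minimum Description Length principle, not the method used by any particular commercial system.

```python
from typing import FrozenSet, List, Tuple

Pattern = FrozenSet[str]

# Hypothetical market-basket data and candidate patterns from earlier mining runs.
transactions: List[Pattern] = [
    frozenset({"bread", "butter", "jam"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "butter", "milk"}),
    frozenset({"milk", "jam"}),
    frozenset({"bread", "butter", "jam"}),
]
candidates: List[Pattern] = [
    frozenset({"bread", "butter"}),
    frozenset({"bread", "butter", "jam"}),
    frozenset({"milk", "jam"}),
]

def description_length(patterns: List[Pattern], data: List[Pattern]) -> int:
    """Crude cost: symbols needed to state the patterns plus the data given them."""
    # Model cost: list each pattern's items, plus one symbol for its code.
    model_cost = sum(len(p) + 1 for p in patterns)
    data_cost = 0
    for transaction in data:
        remaining = set(transaction)
        for p in sorted(patterns, key=len, reverse=True):  # longer patterns first
            if p <= remaining:
                remaining -= p
                data_cost += 1  # one code symbol replaces the whole pattern
        data_cost += len(remaining)  # uncovered items are spelled out one by one
    return model_cost + data_cost

def select_patterns(
    candidates: List[Pattern], data: List[Pattern]
) -> Tuple[List[Pattern], int]:
    """Greedily add the pattern that shrinks the description most; stop when none does."""
    chosen: List[Pattern] = []
    best = description_length(chosen, data)
    improved = True
    while improved:
        improved = False
        best_pattern = None
        for p in candidates:
            if p in chosen:
                continue
            cost = description_length(chosen + [p], data)
            if cost < best:
                best, best_pattern, improved = cost, p, True
        if best_pattern is not None:
            chosen.append(best_pattern)
    return chosen, best

chosen, total = select_patterns(candidates, transactions)
print(sorted(sorted(p) for p in chosen), total)
```

Patterns that do not pay for their own description are discarded, which is how the answer space shrinks to a smaller set that is easier to describe.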
The approach of approximation and searching may appear counterintuitive because it starts with the exact model and intentionally introduces approximations on the assumption that some hidden structures and patterns can be uncovered. This approach applies to text searching and text mining as well. For instance, a technique called Latent Semantic Indexing (patented by Bellcore) uses linear algebraic matrix transformations and approximations to identify hidden patterns in word usage, thus enabling searches beyond simple keyword matching. This and similar approaches can improve the efficiency of search algorithms by contracting the space of possible patterns.
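The following sketch shows the general flavor of Latent Semantic Indexing with NumPy: an exact term-document matrix is replaced by a low-rank approximation from a truncated singular value decomposition, and queries are compared with documents in the reduced space. The toy corpus, the choice of two latent dimensions, and the projection used for queries are assumptions made for illustration, not details of the patented method.

```python
import numpy as np

terms = ["car", "automobile", "engine", "flower", "petal"]
documents = [
    "car engine",
    "automobile engine",
    "car automobile",
    "flower petal",
    "flower petal petal",
]

def term_vector(text: str) -> np.ndarray:
    """Count how often each vocabulary term occurs in the text."""
    words = text.split()
    return np.array([words.count(t) for t in terms], dtype=float)

# Exact model: term-document count matrix (rows are terms, columns are documents).
A = np.column_stack([term_vector(d) for d in documents])

# Deliberate approximation: keep only the k strongest singular directions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # assumed number of latent "topics"
U_k = U[:, :k]

def to_latent(vec: np.ndarray) -> np.ndarray:
    """Project a term-count vector onto the k latent dimensions."""
    return vec @ U_k

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

query = to_latent(term_vector("car"))
similarities = [
    cosine(query, to_latent(A[:, j])) for j in range(len(documents))
]

for doc, sim in zip(documents, similarities):
    print(f"{sim:6.3f}  {doc}")
```

In this toy corpus the document "automobile engine" does not contain the word "car", yet it scores as highly similar to the query in the latent space, which plain keyword matching would miss.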
In short, data mining, although it has its roots in several established areas of science and technology, is still an emerging science and is a subject of continuous research.
For More Information
- Building Data Mining Applications for CRM