# Noisy data

Real-world data tends to be incomplete, noisy, and inconsistent. Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. This tip, from Jiawei Han and Micheline Kamber's book "Data Mining: Concepts and Techniques," concentrates on how to smooth out noise.

First of all, what is noise? Noise is random error or variance in a measured variable. Given a numeric attribute such as, say, price, how can we "smooth" out the data to remove the noise? Let's look at the following data smoothing techniques:

1. Binning: Binning methods smooth a sorted data value by consulting its "neighborhood," that is, the values around it. The sorted values are distributed into a number of "buckets," or bins. Because binning methods consult the neighborhood of values, they perform local smoothing. Here are several common binning techniques:

Sorted data for price (in dollars):
4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into (equidepth) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34

Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29

Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34

In these examples, the data for price are first sorted and then partitioned into equidepth bins of depth 3 (i.e., each bin contains three values). In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value in this bin is replaced by the value 9. Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value. In general, the larger the width, the greater the effect of the smoothing. Alternatively, bins may be equiwidth, where the interval range of values in each bin is constant. Binning is also used as a discretization technique.
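The binning steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than code from the book: the function names are hypothetical, and the tie-breaking rule in boundary smoothing (ties go to the lower boundary) is an assumption, since the text does not say how to treat a value that is equally close to both boundaries.

```python
from statistics import mean, median


def equidepth_bins(values, depth):
    """Sort the values and split them into consecutive bins of `depth` items."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]


def equiwidth_bins(values, n_bins):
    """Split the value range into `n_bins` intervals of equal width."""
    ordered = sorted(values)
    low, high = ordered[0], ordered[-1]
    width = (high - low) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in ordered:
        # The maximum value would compute index n_bins, so clamp it into
        # the last bin. If all values are equal, everything goes in bin 0.
        index = min(int((v - low) / width), n_bins - 1) if width else 0
        bins[index].append(v)
    return bins


def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_medians(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = min(b), max(b)
        # Assumption: a value equally close to both boundaries goes low.
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equidepth_bins(prices, depth=3)
print("means:     ", smooth_by_means(bins))
print("medians:   ", smooth_by_medians(bins))
print("boundaries:", smooth_by_boundaries(bins))
print("equiwidth: ", equiwidth_bins(prices, n_bins=3))
```

Run on the price data above, the mean and boundary results match the bins listed in the example. The equiwidth partition puts a different number of values in each bin, because it fixes the interval width rather than the bin depth.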

2. Clustering: Outliers may be detected by clustering, where similar values are organized into groups, or "clusters." Intuitively, values that fall outside of the set of clusters may be considered outliers.
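As a rough illustration of the idea, the sketch below groups one-dimensional values into clusters wherever neighboring sorted values are close together, then reports values that end up in very small clusters. The text does not name a clustering algorithm, so this gap-based grouping is a deliberately simple stand-in. The function names, the `max_gap` and `min_cluster_size` parameters, and the extra price of 210 are all hypothetical.

```python
def gap_clusters(values, max_gap):
    """Group sorted values; start a new cluster when the gap exceeds max_gap."""
    ordered = sorted(values)
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


def find_outliers(values, max_gap, min_cluster_size):
    """Return values belonging to clusters smaller than min_cluster_size."""
    outliers = []
    for cluster in gap_clusters(values, max_gap):
        if len(cluster) < min_cluster_size:
            outliers.extend(cluster)
    return outliers


# The price data from the binning example plus one made-up extreme value.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34, 210]
print(gap_clusters(prices, max_gap=10))
print(find_outliers(prices, max_gap=10, min_cluster_size=2))
```

With these settings the nine original prices form a single cluster and the made-up value 210 sits alone, so it is the only value flagged. Both thresholds would need tuning for real data.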

3. Combined computer and human inspection: Outliers may be identified through a combination of computer and human inspection. In one application, for example, an information-theoretic measure was used to help identify outlier patterns in a handwritten character database for classification. The measure's value reflected the "surprise" content of the predicted character label with respect to the known label. Outlier patterns may be informative (e.g., identifying useful data exceptions, such as different versions of the characters "0" or "7") or "garbage" (e.g., mislabeled characters). Patterns whose surprise content is above a threshold are output to a list. A human can then sort through the patterns in the list to identify the actual garbage ones. This is much faster than having to manually search through the entire database. The garbage patterns can then be excluded from use in subsequent data mining.
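The text does not say which information-theoretic measure was used, so the sketch below substitutes one common choice as an assumption: the surprisal, in bits, of the known label under a classifier's predicted probabilities, computed as the negative base-2 logarithm of that probability. The record IDs, probabilities, and the 3-bit threshold are made up for illustration.

```python
import math


def surprisal_bits(probability):
    """Surprisal of an event with the given probability, in bits."""
    if probability <= 0.0:
        return math.inf  # the known label was considered impossible
    return -math.log2(probability)


def flag_for_review(records, threshold_bits):
    """Return (id, known_label, bits) for records above the threshold,
    most surprising first, so a human can inspect them."""
    flagged = []
    for record_id, known_label, predicted_probs in records:
        p_known = predicted_probs.get(known_label, 0.0)
        bits = surprisal_bits(p_known)
        if bits > threshold_bits:
            flagged.append((record_id, known_label, bits))
    return sorted(flagged, key=lambda item: item[2], reverse=True)


# Hypothetical records: (id, known label, predicted label probabilities).
records = [
    ("img-001", "7", {"7": 0.95, "1": 0.05}),
    ("img-002", "0", {"0": 0.50, "6": 0.50}),
    ("img-003", "7", {"1": 0.97, "7": 0.03}),
]

for record_id, label, bits in flag_for_review(records, threshold_bits=3.0):
    print(f"{record_id}: known label {label!r}, surprise {bits:.2f} bits")
```

Only the third record is flagged here, because the classifier gave its known label a probability of just 0.03. Whether such a pattern is an informative exception or a mislabeled "garbage" one is still left to the human reviewer.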

4. Regression: Data can be smoothed by fitting the data to a function, such as with regression. Linear regression involves finding the "best" line to fit two variables, so that one variable can be used to predict the other. Multiple linear regression is an extension of linear regression, where more than two variables are involved and the data are fit to a multidimensional surface. Using regression to find a mathematical equation to fit the data helps smooth out the noise.
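For the two-variable case, a minimal sketch of regression-based smoothing is to fit a least-squares line and replace each observed value with the value the line predicts. The function names and the sample data below are hypothetical, and ordinary least squares is assumed as the meaning of the "best" line.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth_with_line(xs, ys):
    """Replace each observed y with the fitted value on the regression line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Made-up noisy observations that roughly follow y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(xs, ys)
print(f"fitted line: y = {slope:.2f} * x + {intercept:.2f}")
print([round(y, 2) for y in smooth_with_line(xs, ys)])
```

The smoothed values lie exactly on the fitted line, so the point-to-point noise in `ys` is gone. The cost is that any real structure the line cannot express is removed as well.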

Many methods for data smoothing are also methods for data reduction involving discretization. For example, the binning techniques described above reduce the number of distinct values per attribute. This acts as a form of data reduction for logic-based data mining methods, such as decision tree induction, which repeatedly make value comparisons on sorted data. Concept hierarchies are a form of data discretization that can also be used for data smoothing. A concept hierarchy for price, for example, may map real price values into inexpensive, moderately_priced, and expensive, thereby reducing the number of data values to be handled by the mining process. Some methods of classification, such as neural networks, have built-in data smoothing mechanisms.
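A one-level concept hierarchy for price can be sketched as a simple mapping from numeric values to the three labels named above. The cut-off values of 10 and 25 dollars are hypothetical and chosen only for illustration, as is the function name.

```python
def price_concept(price, inexpensive_below=10, expensive_from=25):
    """Map a numeric price to a coarse concept label.

    The two cut-offs are illustrative assumptions, not values from the text.
    """
    if price < inexpensive_below:
        return "inexpensive"
    if price < expensive_from:
        return "moderately_priced"
    return "expensive"


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
concepts = [price_concept(p) for p in prices]

print(concepts)
print(len(set(prices)), "distinct prices reduced to",
      len(set(concepts)), "concepts")
```

With these cut-offs, the eight distinct prices in the earlier example collapse to three labels, which is the reduction in distinct values that the paragraph describes.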

This was first published in February 2001
