# Noisy data

Real-world data tends to be incomplete, noisy, and inconsistent. Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. This tip, from Jiawei Han and Micheline Kamber's book "Data Mining: Concepts and Techniques," concentrates on how to smooth out noise:

First of all, what is noise? Noise is random error or variance in a measured variable. Given a numeric attribute such as price, how can we "smooth" out the data to remove the noise? Let's look at the following data smoothing techniques:

1. Binning: Binning methods smooth a sorted data value by consulting its "neighborhood," that is, the values around it. The sorted values are distributed into a number of "buckets," or bins. Because binning methods consult the neighborhood of values, they perform local smoothing. Here are several common binning techniques:

Sorted data for price (in dollars):
4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into (equidepth) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34

Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29

Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34

In these examples, the data for price are first sorted and then partitioned into equidepth bins of depth 3 (i.e., each bin contains three values). In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value in this bin is replaced by the value 9. Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value. In general, the larger the width, the greater the effect of the smoothing. Alternatively, bins may be equiwidth, where the interval range of values in each bin is constant. Binning is also used as a discretization technique.
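The binning steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the book's own code: the function names are made up for this example, and the rule that a value exactly halfway between two boundaries goes to the lower one is an assumption, since the text does not say how ties are broken.

```python
def equidepth_bins(values, depth):
    """Sort the values and split them into consecutive bins of `depth` items."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]


def smooth_by_means(bins):
    """Replace every value in a bin with the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value in a bin with the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = min(b), max(b)
        # Assumption: a value equally close to both boundaries goes to the lower one.
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equidepth_bins(prices, depth=3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

Run on the price data, `by_means` and `by_boundaries` reproduce the two smoothed listings shown earlier.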

2. Clustering: Outliers may be detected by clustering, where similar values are organized into groups, or "clusters." Intuitively, values that fall outside of the set of clusters may be considered outliers.
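As a rough illustration of the idea, the sketch below groups sorted one-dimensional values into clusters wherever neighboring values are close together, then flags values that end up in very small clusters. The gap-based grouping, the parameters `max_gap` and `min_size`, and the extra price of 95 are all assumptions chosen for this example; the text does not prescribe a particular clustering algorithm.

```python
def gap_clusters(values, max_gap):
    """Group sorted values; start a new cluster when the gap exceeds max_gap."""
    ordered = sorted(values)
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


def find_outliers(values, max_gap, min_size):
    """Treat values in clusters smaller than min_size as outlier candidates."""
    return [v for c in gap_clusters(values, max_gap) if len(c) < min_size for v in c]


# The price data from the binning example plus one hypothetical extreme value.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34, 95]
suspects = find_outliers(prices, max_gap=10, min_size=2)
```

With these settings only the hypothetical value 95 is flagged; different thresholds would draw the line elsewhere, so the result is a candidate list rather than a verdict.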

3. Combined computer and human inspection: Outliers may be identified through a combination of computer and human inspection. In one application, for example, an information-theoretic measure was used to help identify outlier patterns in a handwritten character database for classification. The measure's value reflected the "surprise" content of the predicted character label with respect to the known label. Outlier patterns may be informative (e.g., identifying useful data exceptions, such as different versions of the characters "0" or "7") or "garbage" (e.g., mislabeled characters). Patterns whose surprise content is above a threshold are output to a list. A human can then sort through the patterns in the list to identify the actual garbage ones. This is much faster than having to manually search through the entire database. The garbage patterns can then be excluded from use in subsequent data mining.
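The text does not say which information-theoretic measure was used. One common choice, assumed here purely for illustration, is the self-information of the known label: the negative base-2 logarithm of the probability a classifier assigned to it. The record format, pattern ids, probabilities, and threshold below are all hypothetical.

```python
import math


def surprise(prob_of_known_label):
    """Self-information in bits; assumed here as the 'surprise' measure."""
    if prob_of_known_label <= 0:
        return math.inf  # the classifier ruled the known label out entirely
    return -math.log2(prob_of_known_label)


def flag_for_review(records, threshold):
    """Return (pattern_id, surprise) pairs above the threshold, most surprising first."""
    scored = [(pid, surprise(p)) for pid, p in records]
    flagged = [item for item in scored if item[1] > threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical records: (pattern id, probability assigned to the known label).
records = [("img_001", 0.95), ("img_002", 0.50), ("img_003", 0.02)]
review_list = flag_for_review(records, threshold=3.0)
```

Only the flagged patterns go to a person, who then decides which are informative exceptions and which are garbage.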

4. Regression: Data can be smoothed by fitting the data to a function, such as with regression. Linear regression involves finding the "best" line to fit two variables, so that one variable can be used to predict the other. Multiple linear regression is an extension of linear regression, where more than two variables are involved and the data are fit to a multidimensional surface. Using regression to find a mathematical equation to fit the data helps smooth out the noise.
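A minimal sketch of smoothing with simple linear regression follows, using the ordinary least-squares formulas for slope and intercept. The function names and the sample readings are hypothetical; a library such as NumPy or scikit-learn would normally be used for anything beyond a toy case.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the x values are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth_with_line(xs, ys):
    """Replace each observed y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Hypothetical noisy readings that roughly follow a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]
smoothed = smooth_with_line(xs, ys)
```

The smoothed values lie exactly on the fitted line, so the point-to-point jitter in `ys` is gone while the overall trend is kept.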

Many methods for data smoothing are also methods for data reduction involving discretization. For example, the binning techniques described above reduce the number of distinct values per attribute. This acts as a form of data reduction for logic-based data mining methods, such as decision tree induction, which repeatedly make value comparisons on sorted data. Concept hierarchies are a form of data discretization that can also be used for data smoothing. A concept hierarchy for price, for example, may map real price values into inexpensive, moderately_priced, and expensive, thereby reducing the number of data values to be handled by the mining process. Some methods of classification, such as neural networks, have built-in data smoothing mechanisms.
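A concept hierarchy of this kind can be sketched as a simple mapping from price ranges to labels. The cut-off values 15 and 25 are assumptions picked to suit the small price example; the text does not specify where the categories begin or end.

```python
def price_concept(price, cheap_below=15, expensive_from=25):
    """Map a numeric price to a higher-level concept (cut-offs are assumed)."""
    if price < cheap_below:
        return "inexpensive"
    if price < expensive_from:
        return "moderately_priced"
    return "expensive"


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
labels = [price_concept(p) for p in prices]

# Eight distinct prices collapse into three distinct concepts.
distinct_before = len(set(prices))
distinct_after = len(set(labels))
```

The mining process then works with three values instead of eight, which is the data reduction described above.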

This was first published in February 2001
