Week 3 — Statistics and Mathematical model
Data pre-processing is an essential part of data analytics. It is needed because data can be noisy, inconsistent, and incomplete.
Lesson learned: we cannot underestimate any stage where data is produced (data entry, data transmission, or data collection). There may be discrepancies in naming conventions, duplicated records, and even contradictions within the incoming data itself.
Methods for data pre-processing:
- Data cleaning
- Dealing with missing values
- Resolving missing data
- Data smoothing using (a smoothing-by-bin-means sketch follows this list):
  - Binning: replace each data value with a representative of its bin ("bin reps")
  - Clustering: replace each data value with a representative of its cluster
  - Regression: replace each data value with the value predicted by a fitted regression
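A minimal sketch of smoothing by bin means, assuming pandas is available; the values and the choice of 4 equal-width bins are illustrative only:

```python
# Smoothing by bin means: every value is replaced by the mean of its bin.
import pandas as pd

values = pd.Series([12, 13, 15, 16, 19, 20, 22, 25, 28, 30, 33, 36])

bins = pd.cut(values, bins=4)                      # assign each value to a bin
smoothed = values.groupby(bins).transform("mean")  # replace with the bin mean

print(pd.DataFrame({"value": values, "bin": bins, "smoothed": smoothed}))
```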
Binning methods do not capture the semantics of interval data. Distance-based partitioning may give a more meaningful discretization because it considers:
- Density (the number of points in an interval)
- "Closeness" of the points within an interval
(A clustering-based sketch follows.)
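One way to sketch distance-based partitioning is to cluster the values and treat each cluster as a bin; using scikit-learn's KMeans here is an assumption, not something the notes prescribe:

```python
# Distance-based discretization: cluster 1-D values so that close/dense points
# share a bin, then replace each value with its cluster centre.
import numpy as np
from sklearn.cluster import KMeans

values = np.array([12, 13, 14, 15, 20, 21, 22, 30, 33, 36], dtype=float)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(values.reshape(-1, 1))
smoothed = km.cluster_centers_[km.labels_].ravel()

print(smoothed)  # each value replaced by its cluster centre
```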
Online lectures present the theoretical aspects of data mining. Today's workshop focuses on hands-on experience with data analytics tools and on understanding and interpreting the results. We are working with abalone-small.csv.
EW (equal width) means each bin spans the same range, e.g. 0–9, 10–19, etc. ED (equal depth) means each bin contains (approximately) the same number of data points.

Worked example (4 bins):
- Equal width: bin width = (max - min) / 4 = (36 - 12) / 4 = 6, so every bin spans an equal width.
- Equal depth: each of the 4 bins holds roughly the same number of points (6–7 points per bin in this example).
Workshop task: bin Height with EW = 4 and ED = 4 (see the sketch below).
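A sketch of that workshop step with pandas, assuming abalone-small.csv has a Height column as the notes suggest; pd.cut gives equal-width bins and pd.qcut gives equal-depth bins:

```python
# Equal-width (EW) vs equal-depth (ED) binning of Height into 4 bins.
import pandas as pd

df = pd.read_csv("abalone-small.csv")  # file name taken from the notes above

df["Height_EW4"] = pd.cut(df["Height"], bins=4)                   # equal width
df["Height_ED4"] = pd.qcut(df["Height"], q=4, duplicates="drop")  # equal depth

print(df["Height_EW4"].value_counts().sort_index())  # counts vary per bin
print(df["Height_ED4"].value_counts().sort_index())  # counts roughly equal
```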
Each value in a bin is then replaced with either:
- the bin name (discretization), or
- the bin mean or the bin boundary (smoothing).
Q: How do we make the choice? (In the workshop, equi-depth was used.)
For smoothing by bin means, each value is replaced by the average of its bin; for smoothing by bin boundaries, each value is replaced by whichever boundary it is closer to (e.g. closer to 12 or to 17?). A boundary-smoothing sketch follows.
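A small sketch of smoothing by bin boundaries, where each value is replaced by the nearer boundary of its bin (the "closer to 12 or 17" question above); the values are illustrative:

```python
# Smoothing by bin boundaries: replace each value with the nearer bin boundary.
import numpy as np
import pandas as pd

values = pd.Series([12.0, 13.0, 15.0, 16.0, 19.0, 22.0, 25.0, 30.0, 33.0, 36.0])
bins = pd.cut(values, bins=4)

lower = bins.apply(lambda iv: iv.left).astype(float)
upper = bins.apply(lambda iv: iv.right).astype(float)

# Pick whichever boundary is closer to the original value.
smoothed = np.where((values - lower) <= (upper - values), lower, upper)
print(pd.DataFrame({"value": values, "smoothed": smoothed}))
```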
Data integration combines data from multiple sources into a coherent common data store. Challenges include:
- Schema integration, e.g. C_Number = Cust_ID = Cust# (the same key named differently in different sources; see the merge sketch after this list)
- Semantic heterogeneity
- Data value conflicts (different representations or scales, etc.)
- Synchronization (especially important in sequence mining, e.g. web usage)
- Metadata is often necessary for successful data integration
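As an illustration of schema integration (tiny hypothetical tables; only the column names C_Number and Cust_ID come from the example above), metadata tells us the two keys mean the same thing, so the sources can be joined:

```python
# Schema integration: the same customer key is called C_Number in one source
# and Cust_ID in another; merge them once the correspondence is known.
import pandas as pd

orders    = pd.DataFrame({"C_Number": [1, 2, 3], "amount": [10.0, 25.5, 7.2]})
customers = pd.DataFrame({"Cust_ID": [1, 2, 4], "name": ["Ann", "Bob", "Cho"]})

combined = orders.merge(customers, left_on="C_Number", right_on="Cust_ID",
                        how="inner")
print(combined)  # unmatched keys and value conflicts still need checking
```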
Redundant attributes
- An attribute is redundant if it can be derived from other attributes.
  e.g. Body Mass Index = mass in kg / (height in m)²
Correlation analysis helps to identify redundancies.

- = 1: the attributes are independent
- > 1: positively correlated
- < 1: negatively correlated
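These thresholds match a lift-style correlation measure between two attributes (or events) A and B; assuming that is the measure intended here, it can be written as:

$$
\mathrm{corr}(A, B) = \frac{P(A \cap B)}{P(A)\,P(B)}
$$

For numeric attributes, the Pearson correlation coefficient is typically used instead, with 0, > 0, and < 0 playing the corresponding roles.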
Normalisation (attribute normalisation) takes values that span one range and re-expresses them in another range.
Usual target ranges are -1 to +1 and 0 to 1.
Issues: this might introduce distortions or biases into the data. So, you need to understand the properties and potential weaknesses of the methods. Depending on the data mining tool you use, normalising the attributes can be helpful or even required.
Min-Max Normalisation
- Advantage: min-max normalisation preserves all relationships among the data values exactly.
- Disadvantage: if a future input case falls outside the original data range, an "out of bounds" error will occur.
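For reference, the standard min-max formula, which maps attribute A from its observed range [min_A, max_A] onto a chosen new range [new_min_A, new_max_A]:

$$
v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A
$$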
Normalisation: dealing with out-of-range values
- Ignore that the range has been exceeded
- Ignore the out-of-range instances
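A tiny sketch of those two options, using illustrative numbers (the original range 12 to 36 is borrowed from the worked binning example above):

```python
# Handling a future value that falls outside the original min-max range.
import numpy as np

orig_min, orig_max = 12.0, 36.0
new_values = np.array([10.0, 20.0, 40.0])
scaled = (new_values - orig_min) / (orig_max - orig_min)

print(scaled)                                     # option 1: ignore the overflow
print(scaled[(scaled >= 0.0) & (scaled <= 1.0)])  # option 2: drop out-of-range cases
```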
z-score
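z-score normalisation standardises each value using the mean and standard deviation of the attribute:

$$
v' = \frac{v - \bar{A}}{\sigma_A}
$$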
Softmax & sigmoid: transforms the input data nonlinearly into the range [-1, 1] using a sigmoid function. It first calculates the mean and standard deviation (SD) of the input data.
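One common formulation (assumed here; the exact scaling used by the course tool may differ) first takes the z-score and then pushes it through a sigmoid. The logistic form maps to (0, 1), while the tanh-style variant maps to (-1, 1), matching the range quoted above:

$$
z = \frac{v - \bar{A}}{\sigma_A}, \qquad
v'_{\text{logistic}} = \frac{1}{1 + e^{-z}}, \qquad
v'_{\text{tanh}} = \frac{1 - e^{-z}}{1 + e^{-z}}
$$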
Aggregation