Week 3 — Statistics and Mathematical model
Data pre-processing is an essential part of data analytics. It is needed because data can be noisy, inconsistent, and incomplete.
Lesson learned: we cannot underestimate any stage where data is produced (data entry, data transmission, or data collection). There may be discrepancies in naming conventions, duplicated records, and even contradictions within the incoming data itself.
Methods for data pre-processing:
- Data cleaning
- Dealing with missing values
- Resolving missing data
- Data smoothing using (a smoothing-by-bin-means sketch follows this list):
  - Binning: replace each data value with a representative of its bin ("bin reps")
  - Clustering: replace each data value with a representative of its cluster
  - Regression: replace each data value with the value predicted by a fitted regression
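A minimal sketch of smoothing by bin means, assuming pandas is available; the values and the choice of 4 equal-width bins are illustrative only:

```python
# Smoothing by bin means: every value is replaced by the mean of its bin.
import pandas as pd

values = pd.Series([12, 13, 15, 16, 19, 20, 22, 25, 28, 30, 33, 36])

bins = pd.cut(values, bins=4)                      # assign each value to a bin
smoothed = values.groupby(bins).transform("mean")  # replace with the bin mean

print(pd.DataFrame({"value": values, "bin": bins, "smoothed": smoothed}))
```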
Binning methods do not capture the semantics of interval data. Distance-based partitioning may give a more meaningful discretization because it considers:
- Density (the number of points in an interval)
- "Closeness" of the points within an interval
(A clustering-based sketch follows.)
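One way to sketch distance-based partitioning is to cluster the values and treat each cluster as a bin; using scikit-learn's KMeans here is an assumption, not something the notes prescribe:

```python
# Distance-based discretization: cluster 1-D values so that close/dense points
# share a bin, then replace each value with its cluster centre.
import numpy as np
from sklearn.cluster import KMeans

values = np.array([12, 13, 14, 15, 20, 21, 22, 30, 33, 36], dtype=float)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(values.reshape(-1, 1))
smoothed = km.cluster_centers_[km.labels_].ravel()

print(smoothed)  # each value replaced by its cluster centre
```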
Online lectures present the theoretical aspects of data mining. Today's workshop focuses on hands-on experience with data analytics tools and on understanding and interpreting the results. We are working with abalone-small.csv.
EW (equal width) means each bin spans the same range, e.g. 0–9, 10–19, etc. ED (equal depth) means each bin contains (approximately) the same number of data points.

Worked example (4 bins):
- Equal width: bin width = (max - min) / 4 = (36 - 12) / 4 = 6, so every bin spans an equal width.
- Equal depth: each of the 4 bins holds roughly the same number of points (6–7 points per bin in this example).
Workshop task: bin Height with EW = 4 and ED = 4 (see the sketch below).
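A sketch of that workshop step with pandas, assuming abalone-small.csv has a Height column as the notes suggest; pd.cut gives equal-width bins and pd.qcut gives equal-depth bins:

```python
# Equal-width (EW) vs equal-depth (ED) binning of Height into 4 bins.
import pandas as pd

df = pd.read_csv("abalone-small.csv")  # file name taken from the notes above

df["Height_EW4"] = pd.cut(df["Height"], bins=4)                   # equal width
df["Height_ED4"] = pd.qcut(df["Height"], q=4, duplicates="drop")  # equal depth

print(df["Height_EW4"].value_counts().sort_index())  # counts vary per bin
print(df["Height_ED4"].value_counts().sort_index())  # counts roughly equal
```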
Each value in a bin is then replaced with either:
- the bin name (discretization), or
- the bin mean or the bin boundary (smoothing).
Q: How do we make the choice? (In the workshop, equi-depth was used.)
For smoothing by bin means, each value is replaced by the average of its bin; for smoothing by bin boundaries, each value is replaced by whichever boundary it is closer to (e.g. closer to 12 or to 17?). A boundary-smoothing sketch follows.
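A small sketch of smoothing by bin boundaries, where each value is replaced by the nearer boundary of its bin (the "closer to 12 or 17" question above); the values are illustrative:

```python
# Smoothing by bin boundaries: replace each value with the nearer bin boundary.
import numpy as np
import pandas as pd

values = pd.Series([12.0, 13.0, 15.0, 16.0, 19.0, 22.0, 25.0, 30.0, 33.0, 36.0])
bins = pd.cut(values, bins=4)

lower = bins.apply(lambda iv: iv.left).astype(float)
upper = bins.apply(lambda iv: iv.right).astype(float)

# Pick whichever boundary is closer to the original value.
smoothed = np.where((values - lower) <= (upper - values), lower, upper)
print(pd.DataFrame({"value": values, "smoothed": smoothed}))
```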
Data integration combines data from multiple sources into a coherent common data store. Challenges include:
- Schema integration, e.g. C_Number = Cust_ID = Cust# (the same key named differently in different sources; see the merge sketch after this list)
- Semantic heterogeneity
- Data value conflicts (different representations or scales, etc.)
- Synchronization (especially important in sequence mining, e.g. web usage)
- Metadata is often necessary for successful data integration
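As an illustration of schema integration (tiny hypothetical tables; only the column names C_Number and Cust_ID come from the example above), metadata tells us the two keys mean the same thing, so the sources can be joined:

```python
# Schema integration: the same customer key is called C_Number in one source
# and Cust_ID in another; merge them once the correspondence is known.
import pandas as pd

orders    = pd.DataFrame({"C_Number": [1, 2, 3], "amount": [10.0, 25.5, 7.2]})
customers = pd.DataFrame({"Cust_ID": [1, 2, 4], "name": ["Ann", "Bob", "Cho"]})

combined = orders.merge(customers, left_on="C_Number", right_on="Cust_ID",
                        how="inner")
print(combined)  # unmatched keys and value conflicts still need checking
```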
Redundant attributes
- An attribute is redundant if it can be derived from other attributes.
  e.g. Body Mass Index = mass in kg / (height in m)²
Correlation analysis helps to identify redundancies.

- = 1: the attributes are independent
- > 1: positively correlated
- < 1: negatively correlated
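These thresholds match a lift-style correlation measure between two attributes (or events) A and B; assuming that is the measure intended here, it can be written as:

$$
\mathrm{corr}(A, B) = \frac{P(A \cap B)}{P(A)\,P(B)}
$$

For numeric attributes, the Pearson correlation coefficient is typically used instead, with 0, > 0, and < 0 playing the corresponding roles.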
Normalisation (attribute normalisation) takes values that span one range and re-expresses them in another range.
Usual target ranges are -1 to +1 and 0 to 1.
Issues: this might introduce distortions or biases into the data. So, you need to understand the properties and potential weaknesses of the methods. Depending on the data mining tool you use, normalising the attributes can be helpful or even required.
Min-Max Normalisation
- Advantage: min-max normalisation preserves all relationships among the data values exactly.
- Disadvantage: if a future input case falls outside the original data range, an "out of bounds" error will occur.
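For reference, the standard min-max formula, which maps attribute A from its observed range [min_A, max_A] onto a chosen new range [new_min_A, new_max_A]:

$$
v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A
$$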
Normalisation: dealing with out-of-range values
- Ignore that the range has been exceeded
- Ignore the out-of-range instances
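A tiny sketch of those two options, using illustrative numbers (the original range 12 to 36 is borrowed from the worked binning example above):

```python
# Handling a future value that falls outside the original min-max range.
import numpy as np

orig_min, orig_max = 12.0, 36.0
new_values = np.array([10.0, 20.0, 40.0])
scaled = (new_values - orig_min) / (orig_max - orig_min)

print(scaled)                                     # option 1: ignore the overflow
print(scaled[(scaled >= 0.0) & (scaled <= 1.0)])  # option 2: drop out-of-range cases
```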
z-score
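z-score normalisation standardises each value using the mean and standard deviation of the attribute:

$$
v' = \frac{v - \bar{A}}{\sigma_A}
$$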
Softmax & sigmoid: transforms the input data nonlinearly into the range [-1, 1] using a sigmoid function. It first calculates the mean and standard deviation (SD) of the input data.
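One common formulation (assumed here; the exact scaling used by the course tool may differ) first takes the z-score and then pushes it through a sigmoid. The logistic form maps to (0, 1), while the tanh-style variant maps to (-1, 1), matching the range quoted above:

$$
z = \frac{v - \bar{A}}{\sigma_A}, \qquad
v'_{\text{logistic}} = \frac{1}{1 + e^{-z}}, \qquad
v'_{\text{tanh}} = \frac{1 - e^{-z}}{1 + e^{-z}}
$$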
Aggregation