Week 3 — Statistics and Mathematical Models

Novia Pratiwi - est.2021
3 min read · Apr 13, 2020


Data pre-processing is an essential part of data analytics. It is needed because data can be noisy, inconsistent, and incomplete.

Lesson learned: we cannot underestimate any step where data is handled, whether data entry, data transmission, or data collection. Each can introduce discrepancies in naming conventions, duplicated records, and even contradictions within the incoming data itself.

Methods for data pre-processing:

  1. Data cleaning
  2. Dealing with missing values
  3. Resolving missing data
  4. Data smoothing using:
    4a. Binning — replace each value with a representative of its bin (a ‘bin rep’)
    4b. Clustering — replace each value with its cluster’s representative
    4c. Regression — replace each value with the value predicted by a regression (sketched below)
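
As a rough sketch of 4c, here is smoothing by regression in Python: fit a simple linear model and replace each noisy value with the fitted value. The numbers are made up for illustration.

```python
import numpy as np

# Made-up noisy measurements: y depends roughly linearly on x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.3, 7.8, 10.2, 11.9])

# Fit a degree-1 (linear) regression, then replace every value
# with the value the fitted line predicts for it.
slope, intercept = np.polyfit(x, y, deg=1)
y_smoothed = slope * x + intercept

print(y_smoothed)  # the regressed (smoothed) values
```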

Binning methods do not capture the semantics of interval data. Distance-based partitioning may give a more meaningful discretisation because it considers:

  • Density (the number of points in an interval) and the “closeness” of points within an interval

Online lectures present the theoretical aspects of data mining, while today’s workshop focuses on hands-on experience with data analytics tools and on understanding and interpreting the results. We are working with abalone-small.csv.

EW (equal-width) means each bin spans the same range, e.g. 0–9, 10–19, etc. ED (equal-depth) means there are (approximately) the same number of data points in each bin.

Worked example on values ranging from 12 to 36:

Equal-width, 4 bins: width = (max − min) / 4 = (36 − 12) / 4 = 6

Equal-depth, 4 bins: roughly the same count in each bin (here, 6–7 points per bin)
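
A minimal pandas sketch of both schemes, assuming abalone-small.csv has a numeric Height column (the column name is an assumption):

```python
import pandas as pd

df = pd.read_csv("abalone-small.csv")  # the workshop file

# Equal-width (EW): pd.cut splits the range (max - min) into 4 same-size intervals.
df["height_ew4"] = pd.cut(df["Height"], bins=4)

# Equal-depth (ED): pd.qcut puts (approximately) the same
# number of points in each bin.
df["height_ed4"] = pd.qcut(df["Height"], q=4)

# Compare counts per bin: EW counts vary, ED counts are near-equal.
print(df["height_ew4"].value_counts().sort_index())
print(df["height_ed4"].value_counts().sort_index())
```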

(Figures: histograms of the Height attribute binned with EW = 4 and ED = 4.)

When bins have an equal width, each bin value is replaced with either:

  • the bin name (discretisation), or
  • the bin mean or the bin boundary (smoothing).

Q: How do we make the choice? The same replacements apply to equi-depth bins. Smoothing by means replaces every value with the average of its bin; smoothing by boundaries asks whether a value is closer to the lower or upper boundary of its bin (e.g. closer to 12 or to 17) and snaps it there. A sketch of both follows.
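
A small sketch of both smoothing options for a single bin whose boundaries are 12 and 17 (the values inside the bin are illustrative):

```python
import numpy as np

bin_values = np.array([12.0, 13.0, 16.0, 17.0])  # values that fell into bin [12, 17]
lo, hi = 12.0, 17.0

# Smoothing by bin means: every value becomes the bin average.
by_mean = np.full_like(bin_values, bin_values.mean())

# Smoothing by bin boundaries: each value moves to whichever boundary is closer.
by_boundary = np.where(bin_values - lo <= hi - bin_values, lo, hi)

print(by_mean)      # [14.5 14.5 14.5 14.5]
print(by_boundary)  # [12. 12. 17. 17.]
```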

Data integration combines data from multiple sources into a coherent common data store. Challenges include:

  • Schema integration, e.g. the same key appearing as C_Number = Cust_ID = Cust# (see the sketch below)
  • Semantic heterogeneity
  • Data value conflicts (different representations or scales, etc.)
  • Synchronisation (especially important in sequence mining, e.g. web usage)
  • Metadata is often necessary for successful data integration
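
A tiny sketch of the schema-integration step: the same customer key arrives as C_Number in one source and Cust_ID in another, so we rename both to one convention before merging (the frames are made up):

```python
import pandas as pd

sales = pd.DataFrame({"C_Number": [1, 2], "amount": [100, 250]})
crm = pd.DataFrame({"Cust_ID": [1, 2], "name": ["Ana", "Ben"]})

# Map every source's key onto one agreed column name ...
sales = sales.rename(columns={"C_Number": "cust_id"})
crm = crm.rename(columns={"Cust_ID": "cust_id"})

# ... then combine into a coherent common store.
combined = sales.merge(crm, on="cust_id")
print(combined)
```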

Redundant attributes

  • Redundant — an attribute is redundant if it can be derived from other attributes.

ex: Body mass index = mass in kg / (height in m)²

Correlation analysis helps to identify redundancies.

The correlation coefficient r (visualised with a regression line chart):

r = 0 means independent

r > 0 means positively correlated

r < 0 means negatively correlated
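
A quick sketch of correlation analysis for spotting redundancy: b is derived from a, so their correlation coefficient is 1 (the arrays are made up):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = 2 * a + 3                            # derivable from a, hence redundant
c = np.array([5.0, 1.0, 4.0, 2.0, 8.0])  # unrelated attribute

# Pearson correlation coefficient r:
# r > 0 positive, r < 0 negative, r = 0 uncorrelated.
print(np.corrcoef(a, b)[0, 1])  # 1.0 -> b is redundant given a
print(np.corrcoef(a, c)[0, 1])  # much weaker correlation
```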

Normalisation: attribute normalisation takes values spanning a specific range and represents them in another range.

The usual target ranges are −1 to +1 and 0 to 1.

Issues: this might introduce distortions or biases into the data. So, you need to understand the properties and potential weaknesses of the methods. Depending on the data mining tool you use, normalising the attributes can be helpful or even required.

Min-Max Normalisation

  • Positive: it preserves all relationships of the data values exactly
  • Negative: if a future input case falls outside the original data range, an “out of bounds” error will occur

Dealing with out-of-range values during normalisation:

  • Ignore that the range has been exceeded
  • Ignore the out-of-range instances
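
A min-max sketch mapping values to [0, 1]; a third common option for out-of-range inputs (an assumption here, not from the lecture) is to clamp them to the boundary:

```python
import numpy as np

train = np.array([12.0, 20.0, 28.0, 36.0])
lo, hi = train.min(), train.max()   # the original data range

def min_max(v, new_min=0.0, new_max=1.0):
    # v' = (v - min) / (max - min) * (new_max - new_min) + new_min
    scaled = (v - lo) / (hi - lo) * (new_max - new_min) + new_min
    # Clamp instead of raising an out-of-bounds error for unseen values.
    return np.clip(scaled, new_min, new_max)

print(min_max(train))  # [0. 0.333... 0.666... 1.]
print(min_max(44.0))   # outside the original range -> clamped to 1.0
```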

z-score — v′ = (v − mean) / SD, i.e. centre each value on the mean and scale by the standard deviation.

Softmax & sigmoid — softmax scaling transforms the input data nonlinearly into the range [−1, 1] using a sigmoid function. It first calculates the mean and SD of the input data.
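
A sketch of both, assuming the usual definitions: the z-score centres on the mean and scales by the SD, then softmax scaling squashes z through a sigmoid. The logistic sigmoid lands in (0, 1); a tanh variant gives the (−1, 1) range mentioned above:

```python
import numpy as np

x = np.array([12.0, 20.0, 28.0, 36.0])

# z-score: v' = (v - mean) / SD
z = (x - x.mean()) / x.std()

# Softmax scaling via the logistic sigmoid -> range (0, 1).
soft = 1.0 / (1.0 + np.exp(-z))

# tanh variant -> range (-1, 1).
soft_pm1 = np.tanh(z)

print(z)
print(soft)
print(soft_pm1)
```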

Aggregation
