What does it take to extract insights with data mining and create a machine learning model?
Predictive Analytics and Data Mining: Understanding the technical ins and outs of predictive analytics and data mining is important. However, these projects often consist of many data scientists working together for many weeks at a time. Successful projects need to be managed efficiently. And that means keeping control of the big picture objectives of individual projects.
Adult Income Census
This report illustrates the effectiveness of several classification methods, tested on 10,000 adult income census records. We began by obtaining the dataset and importing it into KNIME and Microsoft Excel, removed records with missing data, and partitioned the rest into training and test sets. The earlier data exploration gave us an overview of the data and general information for getting started. The process described here helps to determine the most suitable classifier; we also present the issues we came across and how we solved them by preprocessing and cleaning the data. Classification lets us predict an adult’s income class, and the area under the ROC curve helps us see how sensitive the classification is to the chosen predictors. Using the Naive Bayes algorithm, we also test whether our assumptions turn out to be true.
Data mining problem
After the data has been processed, we test a few classification methods to determine the most suitable one. In KNIME, models are learned and used for prediction, then visualized with the ROC Curve and Scorer (confusion matrix) nodes. The statistics nodes on our training data show that 76% of the adults earn less than or equal to 50k per annum, whereas the rest earn more. This shows that the training data set is imbalanced. In addition, about two thirds of the recorded adults are male.
With many more rows of data than in the data exploration, I concluded that 13 attributes should not be removed, since each of them is a factor in the adult salary range. We removed final weight because we could not establish whether it is a determinant of income level. The main procedure for analysing the problem is to form hypotheses and test whether our assumptions hold up as real insights: first, whether age determines higher earners, and whether people who studied longer end up in higher-paying occupations; second, whether marital status is decisive for income level. After exploring the dataset, partitioning it into training and testing sets, and testing a few parameters, we could decide whether outlier values such as the oldest age group (e.g. retirees above 80) should be removed or changed to bring them within a range.
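The two options for age outliers mentioned above (removing the records or bringing the value within a range) can be sketched in a few lines of Python. This is only an illustration: the cut-off of 80 comes from the example in the text, while the function names and the dictionary layout of a record are assumptions.

```python
def cap_age(rows, max_age=80):
    """Return copies of the rows with age limited to max_age (hypothetical helper)."""
    return [{**row, "age": min(row["age"], max_age)} for row in rows]

def drop_over_age(rows, max_age=80):
    """Return only the rows whose age does not exceed max_age (hypothetical helper)."""
    return [row for row in rows if row["age"] <= max_age]

# Two made-up records, not taken from the census data.
records = [{"age": 34, "salary": "<=50K"}, {"age": 90, "salary": "<=50K"}]
capped = cap_age(records)
kept = drop_over_age(records)
print(capped[1]["age"])  # 80
print(len(kept))         # 1
```

Capping keeps the record for training, while dropping shrinks the data set; which one is preferable depends on how many such records there are.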
In supervised learning, the machine learning task is to infer a function from labeled training data, and we use it here to determine the best classifier. After the data exploration, the data is partitioned: 70% is fed as training data to the model’s learner, and the remaining 30% is fed as testing data to the model’s predictor. The binarisation, which is done in Excel, is read into KNIME through the Excel Reader node, and we can then test 4 classification methods that are suitable for predicting the values in the remaining testing dataset.
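As a rough illustration of the 70/30 partition, here is a minimal Python sketch. It assumes a simple random draw with a fixed seed; the function name is hypothetical, and the sampling option actually chosen in the KNIME Partitioning node is not stated here.

```python
import random

def partition(rows, train_fraction=0.7, seed=42):
    """Shuffle the rows and split them into training and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

rows = list(range(10000))  # stand-in for 10,000 census records
train, test = partition(rows)
print(len(train), len(test))  # 7000 3000
```

With an imbalanced target, a stratified draw (sampling each salary class separately) would keep the class proportions the same in both parts.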
With classification, we need to determine which category an object belongs to.
Note that the dataset is made up of categorical and continuous features. It also contains missing values. The categorical columns are: work class, education, marital_status, occupation, relationship, race, gender, native_country. The continuous columns are: age, education_num, capital_gain, capital_loss, hours_per_week.
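The preprocessing of such mixed data (dropping records with missing values and binarising a categorical column) can be sketched as follows. This is an illustrative Python version of what was done in Excel and KNIME, not the actual workflow; the "?" marker for missing values and both function names are assumptions.

```python
MISSING = "?"  # assumed marker for a missing value

def drop_missing(rows):
    """Keep only the records that have no missing value in any column."""
    return [row for row in rows if MISSING not in row.values()]

def one_hot(rows, column):
    """Replace a categorical column by one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}={category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded

# Made-up records for illustration only.
rows = [
    {"age": 39, "gender": "Male", "occupation": "Sales"},
    {"age": 50, "gender": "Female", "occupation": "?"},
    {"age": 28, "gender": "Female", "occupation": "Tech-support"},
]
clean = drop_missing(rows)
encoded = one_hot(clean, "gender")
print(encoded[0])
# {'age': 39, 'occupation': 'Sales', 'gender=Female': 0, 'gender=Male': 1}
```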
The majority of the training data is covered by the main continuous variables, such as age and working hours per week. To avoid over-fitting, where the model looks too good to be true in real life, the predicted class attribute should be sent to the Scorer node. The Scorer shows the confusion matrix and the accuracy statistics derived from it: error rate, accuracy, specificity, sensitivity, precision and recall. The ROC curve is used alongside it. Once the classification methods have been compared and the one with the best area under the curve and accuracy statistics (e.g. F-measure, recall, and precision) has been chosen, the test set (another 10,000 rows) can be used to evaluate the performance of the trained machine learning model. Cross-validation is applied during the training phase (i.e. on a validation data set) to reduce errors such as overfitting. In simple terms, the learner and predictor nodes use the training set to fit the parameters that map inputs to outputs, i.e. salary and predicted salary, and the test set is used to assess the performance of the model.
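To make the cross-validation idea concrete, here is a minimal Python sketch of how row indices can be divided into folds, so that each fold serves once as the validation set while the others are used for training. The number of folds (5) and the function name are assumptions, and the rows are assumed to have been shuffled beforehand.

```python
def k_fold_indices(n_rows, k=5):
    """Yield (training, validation) index lists, one pair per fold."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

splits = list(k_fold_indices(10, k=5))
for training, validation in splits:
    print(len(training), len(validation))  # 8 2 on each of the five lines
```

The model is trained and scored once per fold, and the scores are averaged; a large gap between training and validation scores is a sign of overfitting.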
Data mining procedure
The following steps were carried out to extract insights from the data and make predictions:
The first step for every data scientist is to understand the business problem, explore the data and become familiar with the analytics project, then prepare the data for modelling. Data cleaning plays a vital role here: raw input data comes from multiple sources, and transforming it into a format that data analysts or data scientists can work with helps to increase the accuracy of the classification models.
In the second step, the data scientist detects outliers, treats missing values, and changes variables if necessary.
Decide on the classification model evaluation
This is not the last step, but rather part of a cycle that leads back to the first step. After data modelling we run the tested model, analyze the outcome and tweak the classification method. This is repeated until the best possible outcome is achieved.
Validate the model using a new dataset
The classifiers are evaluated on several metrics, such as F-measure, accuracy and area under the ROC curve, and the best classifier is determined on that basis. The data scientist then implements the model and tracks the prediction results against the performance of the training model.
1. Random forest
Random forest is a common ensemble technique in machine learning for regression and classification tasks; it is not a clustering technique. It should not be confused with k-means, which is a clustering method where the number of clusters k is fixed in advance (we chose a minimum of 4): k random points are generated as the initial cluster centers, and each point is then assigned to the nearest cluster center. In a random forest, the ‘forest structure’ chooses the classification having the most votes among its trees. The Random Forest Learner node, with all its free parameters, is available when training on a dataset, and can be used in both classification and regression settings.
Steps to build a random forest model:
- Select the target column (we chose ‘salary’) and randomly select ‘k’ features from the total of ‘m’ features, where k << m
- Among the ‘k’ features, calculate the node using the best split criterion, here information gain ratio
- Split the node into child nodes using the best split, with each decision tree grown on a bootstrapped sample of the training data
- Repeat steps two and three until the leaf nodes are finalized
- Build the forest by repeating steps one to four ’n’ times to create ’n’ trees
- Connect the learner and predictor output to the Scorer and ROC Curve nodes to visualize the predictions
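The steps above can be sketched in plain Python. This is a deliberately simplified illustration, not the KNIME Random Forest Learner: each "tree" is a single split (a stump), the split is chosen by lowest weighted entropy (equivalent to highest information gain, not gain ratio), and the toy rows of (age, education_num) are made up. All function names are hypothetical.

```python
import math
import random
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels, features):
    """One-split tree: choose the (feature, threshold) with the lowest weighted entropy."""
    best = None
    for f in features:
        for t in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t, majority(left), majority(right))
    if best is None:  # no valid split: always predict the majority class
        m = majority(labels)
        return (features[0], math.inf, m, m)
    return best[1:]

def fit_forest(rows, labels, n_trees=25, k=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        features = rng.sample(range(len(rows[0])), k)      # k of the m features
        forest.append(fit_stump([rows[i] for i in sample],
                                [labels[i] for i in sample], features))
    return forest

def predict(forest, row):
    """Each tree votes; the class with the most votes wins."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return majority(votes)

rows = [(25, 9), (30, 10), (22, 9), (45, 14), (50, 13), (38, 15), (28, 10), (52, 16)]
labels = ["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K", "<=50K", ">50K"]
forest = fit_forest(rows, labels)
print(predict(forest, (60, 16)), predict(forest, (20, 8)))
```

A real random forest grows full trees rather than stumps, but the three ingredients (bootstrapped rows, a random subset of features, and a majority vote) are the same.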
Ensemble learning basically combines several individual models in the learner node, using methods such as bagging and boosting to improve the predictive power of the model. Methods that are commonly used are:
The idea of boosting methods is to combine a few weak tree learners to form a stronger one and improve accuracy. The main ones are summed up in the workflow below:
3. Decision tree
The decision tree is our next classification method. It is mostly used to predict a categorical or a continuous target: a ‘tree’ structure of rules over the input variables is used to classify or predict cases according to the target variable (salary). When the tree is viewed, the most informative variable is chosen as the first split. Removing the sub-nodes of a decision node is called pruning, the opposite of splitting.
How to build a decision tree classification method:
1. Take the partitioned training data as the input.
2. Choose the Gini index as the quality measure, or use the ‘information gain ratio’, and pick the split that maximizes the separation of the classes. A split is any test that divides the data into two sets. We also try gain ratio.
3. Apply the split to the input data (divide step). A minimum of 2 records per node would produce a big tree, with leaf nodes holding as few as 2 records.
4. Re-apply steps two and three to the divided data.
5. Stop when any stopping criterion is met.
6. Clean up the tree if you went too far with the splits. This step is called pruning.
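Steps one to five can be sketched as a small recursive function in Python. This is an illustrative toy, not the KNIME Decision Tree Learner: it uses the Gini index, stops when a node is pure or when a split would leave fewer than `min_size` records on a side, and does no pruning. The rows of (age, education_num) are made up and the function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels (0 means a pure node)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(rows, labels, min_size=2):
    """Return a leaf label, or a tuple (feature, threshold, left_subtree, right_subtree)."""
    leaf = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or len(rows) < 2 * min_size:
        return leaf  # stopping criteria
    best = None
    for f in range(len(rows[0])):
        for t in sorted({row[f] for row in rows}):
            left = [i for i, row in enumerate(rows) if row[f] <= t]
            right = [i for i, row in enumerate(rows) if row[f] > t]
            if len(left) < min_size or len(right) < min_size:
                continue
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right])) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:
        return leaf
    _, f, t, left, right = best
    return (f, t,
            build_tree([rows[i] for i in left], [labels[i] for i in left], min_size),
            build_tree([rows[i] for i in right], [labels[i] for i in right], min_size))

def predict(tree, row):
    """Follow the splits until a leaf label is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

rows = [(22, 9), (25, 13), (28, 14), (30, 10), (40, 9), (44, 10), (48, 9),
        (38, 14), (45, 13), (50, 15), (52, 16)]
labels = ["<=50K"] * 7 + [">50K"] * 4
tree = build_tree(rows, labels)
print(tree)  # (1, 10, '<=50K', (0, 28, '<=50K', '>50K'))
```

On this toy data the first split is on education_num and the second on age, so a record is predicted as ">50K" only when both values are high.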
How do we decide how to grow the decision tree, and how do we evaluate the partitions? These questions can be answered using a ‘Contingency Table’, available in KNIME as the Crosstab (local) node. The null hypothesis is that the split and the income class are independent; the lower the significance or p-value, the more likely we are to reject this hypothesis, meaning that the split is a discriminating factor for income.
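For a 2x2 contingency table, the test behind this reasoning can be sketched in Python as a Pearson chi-square test with one degree of freedom. The counts below are made up for illustration and are not results from the census data; the function name is hypothetical, and no continuity correction is applied.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    chi2 = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom the upper tail of the chi-square
    # distribution equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Rows: two sides of a candidate split; columns: <=50K, >50K (made-up counts).
chi2, p_value = chi_square_2x2([[30, 10], [10, 30]])
print(chi2, p_value)  # chi2 is 20.0 and the p-value is far below 0.05
```

A table whose rows have the same class proportions gives a statistic of 0 and a p-value of 1, i.e. the split tells us nothing about income.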
Tree-based and ensemble methods can be grouped under Classification and Regression Trees (CART).
We got higher areas under the ROC curve and F-measure using information gain, which is based on the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding attributes that return the highest information gain. With imbalanced data, low entropy and a high accuracy rate do not mean the model is realistic in the real world. We followed these steps with a decision tree classifier. Using a naive Bayesian classifier, the attributes with the highest association to salary are “marital-status” and “education-num” (years of education), which predict income level.
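For reference, the standard definitions behind entropy, information gain and gain ratio are written out below. The notation is an assumption of this sketch: S is the set of records at a node, p_c the share of class c in S, and S_v the subset of S sent to branch v by a split on attribute A. The last line applies the entropy formula to the 76% / 24% class split reported for the training data.

```latex
\begin{align*}
H(S) &= -\sum_{c} p_c \log_2 p_c \\
\mathrm{IG}(S, A) &= H(S) - \sum_{v} \frac{|S_v|}{|S|}\, H(S_v) \\
\mathrm{SplitInfo}(S, A) &= -\sum_{v} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|} \\
\mathrm{GainRatio}(S, A) &= \frac{\mathrm{IG}(S, A)}{\mathrm{SplitInfo}(S, A)} \\
H(S_{\text{train}}) &= -\left(0.76 \log_2 0.76 + 0.24 \log_2 0.24\right) \approx 0.795 \text{ bits}
\end{align*}
```

A balanced two-class set has an entropy of 1 bit, so the imbalance already lowers the starting entropy before any split is made.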
4. K-nearest neighbor (K-NN)
The k-nearest neighbor algorithm can be used even when a record has a missing value: in that case the nearest neighbors are computed from all the other features.
When you’re dealing with K-means clustering or linear regression, you need to handle missing values in your pre-processing, otherwise they’ll crash. Decision trees also have the same problem, although this varies between implementations.
The k-nearest neighbor algorithm has low bias and high variance, but the trade-off can be changed by increasing the value of k, which increases the number of neighbors that contribute to the prediction and in turn increases the bias of the model.
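A minimal Python sketch of k-NN classification shows where k enters: the k closest training records vote on the class. It assumes Euclidean distance on unscaled (age, hours_per_week) values and made-up records; in practice the features would normally be normalised first, and the function name is hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_rows, train_labels, query, k=3):
    """Predict the majority class among the k nearest training rows."""
    distances = sorted(
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

train_rows = [(25, 40), (30, 38), (45, 50), (50, 60), (23, 20), (48, 45)]
train_labels = ["<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K"]
print(knn_predict(train_rows, train_labels, (47, 48), k=3))  # >50K
print(knn_predict(train_rows, train_labels, (26, 39), k=3))  # <=50K
```

With k=1 the prediction follows a single neighbor (high variance); with k equal to the size of the training set it always returns the majority class (high bias).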
Chosen classification model
The following table summarizes the accuracy, F-measure, precision, recall, Cohen’s kappa, and area under the ROC curve, which help to determine the performance of each classifier.
A ROC curve can only be used for binary classification, so only one class at a time can be treated as the positive class. The following factors determine not only the accuracy from the confusion matrix, but also the ROC curve of the trained model:
- F-Measure (on positive — <=50k)
- Area Under Curve
- Cohen’s Kappa
- K-Nearest Neighbor
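Two of these measures, area under the curve and Cohen's kappa, can be sketched in Python as follows. The AUC is computed as the share of (positive, negative) pairs in which the positive record gets the higher score, with ties counted as half; kappa compares the observed agreement with the agreement expected by chance. The scores and counts are made up, and both function names are hypothetical.

```python
def roc_auc(scores, labels, positive):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def cohens_kappa(tp, fp, fn, tn):
    """Cohen's kappa from the four cells of a binary confusion matrix."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return (observed - expected) / (1 - expected)

# Predicted probability of the positive class (<=50K) for six made-up records.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = ["<=50K", "<=50K", ">50K", "<=50K", ">50K", ">50K"]
auc = roc_auc(scores, labels, positive="<=50K")
kappa = cohens_kappa(tp=40, fp=5, fn=10, tn=45)
print(round(auc, 3), round(kappa, 3))  # 0.889 0.7
```

Unlike accuracy, both measures are less flattering to a model that simply predicts the majority class, which matters for an imbalanced target.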
Basic measures derived from the confusion matrix:
1. Error Rate = (FP+FN)/(P+N)
2. Accuracy = (TP+TN)/(P+N)
3. Sensitivity (Recall or True positive rate) = TP/P
4. Specificity (True negative rate) = TN/N
5. Precision = (True Positive) / (True Positive + False Positive)
6. Recall Rate = (True Positive) / (True Positive + False Negative)
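As a worked example, the measures can be computed from the four cells of a confusion matrix in a few lines of Python. The counts are made up, the function name is hypothetical, and the F-measure is added as the harmonic mean of precision and recall.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Basic measures derived from a binary confusion matrix."""
    p = tp + fn  # all actual positives
    n = fp + tn  # all actual negatives
    precision = tp / (tp + fp)
    recall = tp / p
    return {
        "error_rate": (fp + fn) / (p + n),
        "accuracy": (tp + tn) / (p + n),
        "sensitivity": recall,
        "specificity": tn / n,
        "precision": precision,
        "f_measure": 2 * precision * recall / (precision + recall),
    }

metrics = confusion_metrics(tp=40, fp=5, fn=10, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For these counts the accuracy is 0.850, the sensitivity 0.800, the specificity 0.900, the precision about 0.889 and the F-measure about 0.842.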
The decision tree and the Naive Bayes association rules confirm the finding that marital status and education are good predictors of income level and of whether or not an adult earns more than 50k. The best classifier was observed to be the Decision Tree classification model, as it showed higher area under the curve, F-measure and accuracy than the other classification models.
Validation curves, which plot scores to evaluate models, are useful in practice. Processing and computation time for executing the learner and predictor is also a consideration: the support vector machine (SVM) consumed a lot of computation time, which was taken into account when cross-validation and parameter loop optimization were adopted. The test dataset was pre-processed in the same way as the training dataset, passed through the Decision Tree Predictor, and fed to the CSV Writer after filtering to the columns “Predicted” and “ID”. The goal is to train a binary classifier on the training dataset to predict the salary, which has two possible values, >50k and <=50k, and to evaluate the accuracy of the classifier with the test dataset. The CSV file was later uploaded to Kaggle.