Detecting Fraud and Non-Fraud Transactions with Machine Learning Tools and Neural Networks using Python
Introduction
This is a detailed walk-through of fraud detection on imbalanced data using Keras with TensorFlow as the backend. The purpose of this document is to design a custom model. Let's start with the requirements.
Requirements:
- Python 3
- Jupyter Notebook
- Intermediate Knowledge of Python Libraries
- Hands-on experience with Machine Learning Libraries
- Hands-on experience with Deep Learning Libraries Tensorflow and Keras
- Labeled Dataset
Guidelines to Read the document
1. I tried to make this article as simple as possible, so most of the code is not written down.
2. Code is written in italic script in blue.
3. Steps are written in maroon.
4. The output of the code is written under Output [].
5. All figures are generated by the algorithms.
Goal of the Project
To detect fraudulent transactions using several classifiers and a neural network, and to find the best one.
#Install TensorFlow in Jupyter Notebook
pip install --upgrade tensorflow
Let's start with Importing Libraries.
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA, TruncatedSVD
import matplotlib.patches as mpatches
import time
#Classifier Libraries
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import collections
#Install imbalanced-learn (imported as imblearn) to deal with the sampling problem
pip install imbalanced-learn
Importing More Libraries
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NearMiss
from imblearn.metrics import classification_report_imbalanced
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, accuracy_score, classification_report
from collections import Counter
from sklearn.model_selection import KFold, StratifiedKFold
import warnings
warnings.filterwarnings("ignore")
Step [1]:
Let's load the dataset from a file with pandas and check the first rows.
df = pd.read_csv('/Users/sushiladhikari/Projects/fraud.csv')
df.head()
Output [1]
 | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
1 | 0.0 | 1.191857 | 0.266151 | 0.166480 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.167170 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.379780 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | -0.108300 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.50 | 0 |
4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.206010 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |
Step [2]:
It is better to know the dataset before moving forward, so let's describe it.
Output [2]
 | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 284807.000000 | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | ... | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | 284807.000000 | 284807.000000 |
mean | 94813.859575 | 3.919560e-15 | 5.688174e-16 | -8.769071e-15 | 2.782312e-15 | -1.552563e-15 | 2.010663e-15 | -1.694249e-15 | -1.927028e-16 | -3.137024e-15 | ... | 1.537294e-16 | 7.959909e-16 | 5.367590e-16 | 4.458112e-15 | 1.453003e-15 | 1.699104e-15 | -3.660161e-16 | -1.206049e-16 | 88.349619 | 0.001727 |
std | 47488.145955 | 1.958696e+00 | 1.651309e+00 | 1.516255e+00 | 1.415869e+00 | 1.380247e+00 | 1.332271e+00 | 1.237094e+00 | 1.194353e+00 | 1.098632e+00 | ... | 7.345240e-01 | 7.257016e-01 | 6.244603e-01 | 6.056471e-01 | 5.212781e-01 | 4.822270e-01 | 4.036325e-01 | 3.300833e-01 | 250.120109 | 0.041527 |
min | 0.000000 | -5.640751e+01 | -7.271573e+01 | -4.832559e+01 | -5.683171e+00 | -1.137433e+02 | -2.616051e+01 | -4.355724e+01 | -7.321672e+01 | -1.343407e+01 | ... | -3.483038e+01 | -1.093314e+01 | -4.480774e+01 | -2.836627e+00 | -1.029540e+01 | -2.604551e+00 | -2.256568e+01 | -1.543008e+01 | 0.000000 | 0.000000 |
25% | 54201.500000 | -9.203734e-01 | -5.985499e-01 | -8.903648e-01 | -8.486401e-01 | -6.915971e-01 | -7.682956e-01 | -5.540759e-01 | -2.086297e-01 | -6.430976e-01 | ... | -2.283949e-01 | -5.423504e-01 | -1.618463e-01 | -3.545861e-01 | -3.171451e-01 | -3.269839e-01 | -7.083953e-02 | -5.295979e-02 | 5.600000 | 0.000000 |
50% | 84692.000000 | 1.810880e-02 | 6.548556e-02 | 1.798463e-01 | -1.984653e-02 | -5.433583e-02 | -2.741871e-01 | 4.010308e-02 | 2.235804e-02 | -5.142873e-02 | ... | -2.945017e-02 | 6.781943e-03 | -1.119293e-02 | 4.097606e-02 | 1.659350e-02 | -5.213911e-02 | 1.342146e-03 | 1.124383e-02 | 22.000000 | 0.000000 |
75% | 139320.500000 | 1.315642e+00 | 8.037239e-01 | 1.027196e+00 | 7.433413e-01 | 6.119264e-01 | 3.985649e-01 | 5.704361e-01 | 3.273459e-01 | 5.971390e-01 | ... | 1.863772e-01 | 5.285536e-01 | 1.476421e-01 | 4.395266e-01 | 3.507156e-01 | 2.409522e-01 | 9.104512e-02 | 7.827995e-02 | 77.165000 | 0.000000 |
max | 172792.000000 | 2.454930e+00 | 2.205773e+01 | 9.382558e+00 | 1.687534e+01 | 3.480167e+01 | 7.330163e+01 | 1.205895e+02 | 2.000721e+01 | 1.559499e+01 | ... | 2.720284e+01 | 1.050309e+01 | 2.252841e+01 | 4.584549e+00 | 7.519589e+00 | 3.517346e+00 | 3.161220e+01 | 3.384781e+01 | 25691.160000 | 1.000000 |
Step [3]:
Now we check the dataset for null values.
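The checks from Step [2] and Step [3] can be sketched as below. This is only a minimal sketch: the five-row `demo` frame is a stand-in built from the Time, Amount and Class values shown in Output [1], so the block runs without fraud.csv. With the real data, the same two calls would be made on df.

```python
import pandas as pd

# Stand-in for the real dataset (assumption): a few values copied from Output [1].
demo = pd.DataFrame({
    "Time": [0.0, 0.0, 1.0, 1.0, 2.0],
    "Amount": [149.62, 2.69, 378.66, 123.50, 69.99],
    "Class": [0, 0, 0, 0, 0],
})

summary = demo.describe()               # count, mean, std, min, quartiles and max per column
max_nulls = demo.isnull().sum().max()   # largest number of missing values in any column

print(summary)
print(max_nulls)
```

A result of 0 for the second print means no column has missing values.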
Output [3]
0
Step [4 ]:
The output is zero, which is good for us: there are no missing values. Here are the columns and their dtype.
Output [4]
Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount', 'Class'], dtype='object')
Step [5 ]:
Here the classes are heavily skewed, and this needs to be fixed. If we continue to use the dataset as it is, the results will be misleading and the model can overfit. I will fix this later on.
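A minimal sketch of how the class shares can be computed is shown here. The `labels` series is made-up data with a similar kind of skew (an assumption, not the real Class column), so the percentages differ from the real output.

```python
import pandas as pd

# Hypothetical labels (assumption): 998 non-fraud rows and 2 fraud rows.
labels = pd.Series([0] * 998 + [1] * 2, name="Class")

# Share of each class as a percentage of all rows.
shares = labels.value_counts(normalize=True) * 100

print("No Frauds", round(shares.loc[0], 2), "% of the dataset")
print("Frauds", round(shares.loc[1], 2), "% of the dataset")
```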
Output [5]
No Frauds 99.83 % of the dataset
Frauds 0.17 % of the dataset
Step [6 ]:
Here is a visualization of the class distribution; we can see that the dataset is not balanced.
Output [6]
Text(0.5, 1.0, 'Class Distributions \n (0: No Fraud || 1: Fraud)')
Step [7 ]:
Now, let's visualize the distributions of 'Amount' and 'Time', which are not yet scaled.
Output [7]
Step [8 ]:
As we can see from the figures above, 'Amount' and 'Time' are not scaled, so we have to scale them. For scaling we use the following code.
from sklearn.preprocessing import StandardScaler, RobustScaler
#RobustScaler is less prone to outliers than StandardScaler, which matters for this project. After this step both 'Time' and 'Amount' are scaled and replaced by the scaled values.
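The scaling step could look roughly like this. It is a sketch under assumptions: the `demo` rows are stand-in values, and the column names scaled_amount and scaled_time are taken from Output [9]. RobustScaler subtracts the median and divides by the interquartile range, so the single very large amount does not distort the other values.

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Stand-in rows (assumption): Amount contains one extreme value.
demo = pd.DataFrame({
    "Time": [0.0, 1.0, 2.0, 3.0, 4.0],
    "Amount": [2.69, 69.99, 123.50, 149.62, 25691.16],
})

rob_scaler = RobustScaler()  # centers on the median, scales by the IQR

# fit_transform expects a 2D array, hence the reshape; ravel() flattens the result again.
demo["scaled_amount"] = rob_scaler.fit_transform(demo["Amount"].values.reshape(-1, 1)).ravel()
demo["scaled_time"] = rob_scaler.fit_transform(demo["Time"].values.reshape(-1, 1)).ravel()

# Replace the original columns with their scaled versions.
demo = demo.drop(columns=["Time", "Amount"])
print(demo)
```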
Output [8]
No Frauds 99.83 % of the dataset
Frauds 0.17 % of the dataset
Train: [ 30473 30496 31002 ... 284804 284805 284806] Test: [ 0 1 2 ... 57017 57018 57019]
Train: [ 0 1 2 ... 284804 284805 284806] Test: [ 30473 30496 31002 ... 113964 113965 113966]
Train: [ 0 1 2 ... 284804 284805 284806] Test: [ 81609 82400 83053 ... 170946 170947 170948]
Train: [ 0 1 2 ... 284804 284805 284806] Test: [150654 150660 150661 ... 227866 227867 227868]
Train: [ 0 1 2 ... 227866 227867 227868] Test: [212516 212644 213092 ... 284804 284805 284806]
----------------------------------------------------------------------------------------------------
Label Distributions:
[0.99827076 0.00172924]
[0.99827952 0.00172048]
Step [9 ]:
The next thing we need to do is take a sample and build an equally distributed dataset, since the dataset we have is heavily skewed.
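One simple way to build such a balanced subsample is random undersampling: keep every fraud row and draw the same number of non-fraud rows. The sketch below uses made-up data (an assumption); in the real dataset there are 492 fraud rows, so the subsample would hold 492 rows of each class.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed data (assumption): 1,000 rows, 20 of them fraud.
rng = np.random.default_rng(42)
demo = pd.DataFrame({"V1": rng.normal(size=1000), "Class": [1] * 20 + [0] * 980})

# Shuffle first so the non-fraud rows we keep are a random draw.
demo = demo.sample(frac=1, random_state=42)

fraud_df = demo.loc[demo["Class"] == 1]
non_fraud_df = demo.loc[demo["Class"] == 0].iloc[: len(fraud_df)]  # as many non-fraud rows as fraud rows

# Combine and shuffle again so the two classes are interleaved.
new_df = pd.concat([fraud_df, non_fraud_df]).sample(frac=1, random_state=42)

print(new_df["Class"].value_counts(normalize=True))
```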
Output [9]
 | scaled_amount | scaled_time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | ... | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Class |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
106162 | 5.791798 | -0.174368 | -1.082758 | -0.508941 | 1.445683 | 1.971222 | -1.202002 | 0.523035 | 1.565089 | 0.032811 | ... | 0.770887 | 0.566066 | 0.909191 | 1.098293 | 0.340178 | -0.917628 | -0.260358 | 0.099337 | 0.285882 | 0 |
77099 | -0.237546 | -0.326660 | -0.075483 | 1.812355 | -2.566981 | 4.127549 | -1.628532 | -0.805895 | -3.390135 | 1.019353 | ... | 0.338598 | 0.794372 | 0.270471 | -0.143624 | 0.013566 | 0.634203 | 0.213693 | 0.773625 | 0.387434 | 1 |
93398 | -0.182771 | -0.238454 | -0.296012 | 0.968025 | 1.460175 | -0.082492 | 0.188890 | -0.621817 | 0.812876 | -0.131799 | ... | 0.189499 | -0.264155 | -0.573313 | -0.051576 | 0.270170 | -0.224096 | 0.026865 | 0.074772 | -0.135223 | 0 |
143731 | 3.056941 | 0.010385 | -2.207631 | 3.259076 | -5.436365 | 3.684737 | -3.066401 | -0.671323 | -3.696178 | 1.822272 | ... | 0.808336 | 0.920899 | 0.037675 | 0.026754 | -0.791489 | 0.176493 | -0.136312 | 1.087585 | 0.373834 | 1 |
96994 | -0.202194 | -0.219164 | 0.286302 | 1.399345 | -1.682503 | 3.864377 | -1.185373 | -0.341732 | -2.539380 | 0.768378 | ... | 0.270360 | 0.352456 | -0.243678 | -0.194079 | -0.172201 | 0.742237 | 0.127790 | 0.569731 | 0.291206 | 1 |
5 rows × 31 columns
Step [10 ]:
Let's see how the sample looks after fixing the skew.
Output [10]
Distribution of the Classes in the subsample dataset
1    0.5
0    0.5
Name: Class, dtype: float64
Step [11 ]:
In the heatmaps we can see that the correlations are much more visible in the subsample matrix than in the imbalanced matrix. This suggests that if we had used the imbalanced data, there would be a high chance of overfitting.
Output [11]
Step [12]:
From the heatmap above we can see that V10, V12, V14 and V17 are negatively correlated with the class. Let's draw box plots to visualize the correlation. Here we can see how the fraud cases are distributed for those features (Green = no fraud, Red = fraud). Note: the lower the feature value, the more likely the transaction is a fraud.
Output [12]
Step [13 ]:
In the same way, from the heatmap we can see that columns V2, V4, V11 and V19 are positively correlated with the class. Let's draw box plots to visualize the correlation and the outliers. Note: the higher the feature value, the higher the probability of a fraud transaction.
Output [13]
Step [14 ]:
To dive deeper into how the fraudulent transactions are distributed, I am using norm from scipy.stats. Let's visualize this.
Output [14]
Step [15 ]:
Now we remove the outliers. The cut-off is 1.5 times the interquartile range (q75 - q25); values below q25 minus the cut-off or above q75 plus the cut-off are treated as outliers. We need to be very careful when choosing this range: if we remove too many points we may lose information, and if we keep too many extreme values the model may overfit.
Here, we can see the range of outliers for different columns.
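The removal rule can be sketched as follows for a single feature. The V14 values below are made up (an assumption) so that the arithmetic is easy to follow; the 1.5 multiplier and the q25/q75 bounds are the ones described in this step.

```python
import numpy as np
import pandas as pd

# Hypothetical fraud-only values for one feature (assumption), with one extreme point.
new_df = pd.DataFrame({
    "V14": [-19.0, -10.0, -9.0, -8.0, -7.0, -6.0, -5.0, -4.0],
    "Class": [1] * 8,
})

# Quartiles are computed on the fraud cases only.
v14_fraud = new_df.loc[new_df["Class"] == 1, "V14"].values
q25, q75 = np.percentile(v14_fraud, 25), np.percentile(v14_fraud, 75)
iqr = q75 - q25

cut_off = iqr * 1.5
lower, upper = q25 - cut_off, q75 + cut_off

# Values outside [lower, upper] count as outliers and their rows are dropped.
outliers = [x for x in v14_fraud if x < lower or x > upper]
new_df = new_df.drop(new_df[(new_df["V14"] > upper) | (new_df["V14"] < lower)].index)

print("Lower:", lower, "Upper:", upper)
print("Outliers:", outliers)
print("Number of Instances after outliers removal:", len(new_df))
```

A larger multiplier than 1.5 would widen the bounds and remove fewer points; a smaller one would remove more.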
Output [15]
Quartile 25: -9.692722964972385 | Quartile 75: -4.282820849486866
iqr: 5.409902115485519
Cut Off: 8.114853173228278
V14 Lower: -17.807576138200663
V14 Upper: 3.8320323237414122
Feature V14 Outliers for Fraud Cases: 4
V10 outliers:[-19.2143254902614, -18.8220867423816, -18.4937733551053, -18.049997689859396]
----------------------------------------------------------------------------------------------------
V12 Lower: -17.3430371579634
V12 Upper: 5.776973384895937
V12 outliers: [-18.683714633344298, -18.553697009645802, -18.047596570821604, -18.4311310279993]
Feature V12 Outliers for Fraud Cases: 4
Number of Instances after outliers removal: 976
----------------------------------------------------------------------------------------------------
V10 Lower: -14.89885463232024
V10 Upper: 4.920334958342141
V10 outliers: [-15.1237521803455, -15.346098846877501, -18.9132433348732, -22.1870885620007, -14.9246547735487, -16.6496281595399, -15.563791338730098, -22.1870885620007, -14.9246547735487, -16.2556117491401, -20.949191554361104, -19.836148851696, -18.2711681738888, -24.5882624372475, -15.563791338730098, -22.1870885620007, -22.1870885620007, -15.2399619587112, -17.141513641289198, -23.2282548357516, -24.403184969972802, -15.2318333653018, -16.3035376590131, -16.7460441053944, -15.124162814494698, -15.2399619587112, -16.6011969664137]
Feature V10 Outliers for Fraud Cases: 27
Number of Instances after outliers removal: 945
Step [16 ]:
Now let's draw the box plots again after removing the outliers. We can observe that the outliers are reduced and only a few extreme values remain in the subsample dataset.
Output [16]
Step [17 ]:
We use t-SNE, PCA and TruncatedSVD to reduce the dimensionality of the high-dimensional dataset.
t-SNE helps to find clusters in the data by projecting it to two dimensions without losing a lot of information.
PCA is defined as an orthogonal linear transformation that transforms the data to a new coordinate system.
TruncatedSVD performs linear dimensionality reduction by means of truncated singular value decomposition (SVD).
Here is how long each of them took to execute.
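A rough sketch of how the three reductions can be run and timed is given here. The data comes from make_classification (an assumption, standing in for the balanced subsample), and the timings will differ from machine to machine.

```python
import time

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.manifold import TSNE

# Synthetic stand-in for the balanced subsample (assumption).
X, y = make_classification(n_samples=200, n_features=30, n_informative=10, random_state=42)

reducers = {
    "T-SNE": TSNE(n_components=2, random_state=42),
    "PCA": PCA(n_components=2, random_state=42),
    "Truncated SVD": TruncatedSVD(n_components=2, algorithm="randomized", random_state=42),
}

reduced = {}
for name, reducer in reducers.items():
    t0 = time.time()
    reduced[name] = reducer.fit_transform(X)   # 2D embedding of the 30 features
    t1 = time.time()
    print("{} took {:.2} s".format(name, t1 - t0))
```

Each entry of `reduced` is a two-column array that can be passed to a scatter plot, colored by y.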
Output [17]
T-SNE took 5.3 s
PCA took 0.059 s
Truncated SVD took 0.0041 s
Step [18 ]:
Here is the Subsample Visualization for t-SNE, PCA and Truncated SVD
Output [18]
Step [19 ]:
The next step is to use classifiers. We use sklearn to split the data and train four classifiers on it, to see which of them gives the highest accuracy on fraud detection.
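The comparison can be sketched like this. The data is synthetic (an assumption), the 80/20 split and 5-fold cross-validation are illustrative choices, and the classifiers use mostly default settings, so the scores will not match the output below.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the balanced subsample (assumption).
X, y = make_classification(n_samples=400, n_features=30, n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

classifiers = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "SVC": SVC(),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
}

training_scores = {}
for name, classifier in classifiers.items():
    classifier.fit(X_train, y_train)
    # Mean accuracy over 5 cross-validation folds of the training data.
    training_scores[name] = cross_val_score(classifier, X_train, y_train, cv=5).mean()
    print("Classifiers:", name, "Has a training score of",
          round(training_scores[name], 2) * 100, "% accuracy score")
```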
Output [19]
Classifiers: LogisticRegression Has a training score of 95.0 % accuracy score
Classifiers: KNeighborsClassifier Has a training score of 93.0 % accuracy score
Classifiers: SVC Has a training score of 94.0 % accuracy score
Classifiers: DecisionTreeClassifier Has a training score of 92.0 % accuracy score
Step [20 ]:
Let's see if we can improve our score further. For that, we need to identify the best parameters for each classifier, and we use GridSearchCV to find them.
After running the code with the parameters from the grid search, we can see that the accuracy of most classifiers has improved. The point to note here is that even though the accuracy has improved, the chance of overfitting remains high.
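For one classifier, the grid search could look roughly like this. The parameter grid is a hypothetical example (an assumption; the article does not list the grids), and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in for the training split (assumption).
X_train, y_train = make_classification(n_samples=400, n_features=30, n_informative=10, random_state=42)

# Hypothetical search grid (assumption): only the regularization strength C is tuned.
log_reg_params = {"C": [0.001, 0.01, 0.1, 1, 10, 100, 1000]}

grid_log_reg = GridSearchCV(LogisticRegression(max_iter=1000), log_reg_params, cv=5)
grid_log_reg.fit(X_train, y_train)

# The estimator refitted with the best parameter combination found.
log_reg = grid_log_reg.best_estimator_

log_reg_score = cross_val_score(log_reg, X_train, y_train, cv=5)
print("Best parameters:", grid_log_reg.best_params_)
print("Logistic Regression Cross Validation Score:", round(log_reg_score.mean() * 100, 2), "%")
```

The same pattern applies to the other classifiers, each with its own grid.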
Output [20]
Logistic Regression Cross Validation Score: 94.72%
Knears Neighbors Cross Validation Score 93.39%
Support Vector Classifier Cross Validation Score 94.71%
DecisionTree Classifier Cross Validation Score 93.65%
Step [21 ]:
Before moving forward, we need to apply the NearMiss technique to undersample during cross-validation. Undersampling refers to a group of techniques designed to balance the class distribution of a classification dataset that has a skewed class distribution.
Cross-Validation, please go to the link to know more
Output [21]
Train: [ 53548 53844 53924 ... 284804 284805 284806] Test: [ 0 1 2 ... 56963 56964 56965]
Train: [ 0 1 2 ... 284804 284805 284806] Test: [ 53548 53844 53924 ... 115164 117282 117955]
Train: [ 0 1 2 ... 284804 284805 284806] Test: [113920 113921 113922 ... 179536 180189 180353]
Train: [ 0 1 2 ... 284804 284805 284806] Test: [170870 170871 170872 ... 238445 239634 239935]
Train: [ 0 1 2 ... 238445 239634 239935] Test: [227826 227827 227828 ... 284804 284805 284806]
NearMiss Label Distribution: Counter({0: 492, 1: 492})
Step [22 ]:
Now we shuffle the undersampled data and look at the learning curves. For this we use ShuffleSplit and learning_curve; please see the sklearn documentation if you want to know more about them.
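A minimal sketch of computing the points of one learning curve follows. The data is synthetic, and the split settings (20 shuffles, 20% test size, five training sizes) are illustrative choices (assumptions).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, learning_curve

# Synthetic stand-in for the undersampled data (assumption).
X, y = make_classification(n_samples=400, n_features=30, n_informative=10, random_state=42)

# 20 random train/test splits, each holding out 20% of the rows.
cv = ShuffleSplit(n_splits=20, test_size=0.2, random_state=42)

train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y, cv=cv, train_sizes=np.linspace(0.1, 1.0, 5)
)

# Average over the 20 splits to get one point per training size.
train_scores_mean = train_scores.mean(axis=1)
test_scores_mean = test_scores.mean(axis=1)

for size, tr, te in zip(train_sizes, train_scores_mean, test_scores_mean):
    print(size, round(tr, 3), round(te, 3))
```

Plotting the two mean curves against train_sizes gives the learning curve; a large gap between them is a sign of overfitting.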
Output [22]
<module 'matplotlib.pyplot' from '/Users/sushiladhikari/opt/anaconda3/lib/python3.7/site-packages/matplotlib/pyplot.py'>
Step [23 ]:
Let's check whether the score has increased after using cross_val_predict.
Output [23]
Logistic Regression: 0.9804539457061846
KNears Neighbors: 0.9319923640549115
Support Vector Classifier: 0.9787274360629965
Decision Tree Classifier: 0.9342312119255496
Step [24]:
Let's check how the score is achieved after each iteration; we can visualize the training performance with the ROC curve.
Output [24]
Step [25]:
Let's compare the scores we got with what they should have been.
Output [25]
----------------------------------------------------------------------------------------------------
Overfitting:
Recall Score: 0.93
Precision Score: 0.83
F1 Score: 0.88
Accuracy Score: 0.88
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
How it should be:
Accuracy Score: 0.74
Precision Score: 0.00
Recall Score: 0.21
F1 Score: 0.00
----------------------------------------------------------------------------------------------------
Step [26]:
Precision, as the name says, measures how precise (how sure) our model is when it flags a transaction as fraud, while recall is the share of fraud cases our model is able to detect.
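As a small worked example, the two measures can be computed by hand from the first confusion matrix printed in Output [35]. The reading of the matrix is an assumption that the row sums support: rows are the true class and columns the predicted class, since the rows add up to 56863 non-fraud and 98 fraud test transactions.

```python
# Counts read from the first confusion matrix in Output [35]
# (rows: true class, columns: predicted class).
tn, fp = 55244, 1619   # true non-fraud: predicted non-fraud / predicted fraud
fn, tp = 11, 87        # true fraud:     predicted non-fraud / predicted fraud

precision = tp / (tp + fp)   # of the transactions flagged as fraud, how many really are fraud
recall = tp / (tp + fn)      # of the real frauds, how many were flagged

print(round(precision, 3))   # 0.051
print(round(recall, 3))      # 0.888
```

So that model catches most frauds (high recall) but the large majority of its fraud alerts are false alarms (low precision).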
Output [26]
Average precision-recall score: 0.05
Step [27]:
Visualisation of Precision & Recall
Output [27]
Text(0.5, 1.0, 'UnderSampling Precision-Recall curve: \n Average Precision-Recall Score =0.05')
Step [28]:
Now we implement the SMOTE technique to oversample the dataset.
Synthetic Minority Oversampling TEchnique (SMOTE) is a very popular oversampling method that was proposed to improve on random oversampling.
We use RandomizedSearchCV to find the best parameters for fitting the model.
Output [28]
Length of X (train): 227846 | Length of y (train): 227846
Length of X (test): 56961 | Length of y (test): 56961
----------------------------------------------------------------------------------------------------
accuracy: 0.9705855710313749
precision: 0.0658000139635098
recall: 0.9137617656604998
f1: 0.1215697736609885
----------------------------------------------------------------------------------------------------
              precision    recall  f1-score   support

    No Fraud       1.00      0.99      0.99     56863
       Fraud       0.11      0.86      0.20        98

    accuracy                           0.99     56961
   macro avg       0.56      0.92      0.60     56961
weighted avg       1.00      0.99      0.99     56961
Average precision-recall score: 0.75
Step [29]:
Visualisations of Precision & Recall After using SMOTE
Output [29]
Text(0.5, 1.0, 'OverSampling Precision-Recall curve: \n Average Precision-Recall Score =0.75')
Fitting oversample data took :3.6500542163848877 sec
Step [30]:
The heatmaps show the confusion matrix of each classifier.
Output [30]
Logistic Regression:
              precision    recall  f1-score   support

           0       0.89      0.97      0.93        90
           1       0.97      0.89      0.93        99

    accuracy                           0.93       189
   macro avg       0.93      0.93      0.93       189
weighted avg       0.93      0.93      0.93       189

KNears Neighbors:
              precision    recall  f1-score   support

           0       0.86      0.99      0.92        90
           1       0.99      0.85      0.91        99

    accuracy                           0.92       189
   macro avg       0.92      0.92      0.92       189
weighted avg       0.93      0.92      0.92       189

Support Vector Classifier:
              precision    recall  f1-score   support

           0       0.90      0.98      0.94        90
           1       0.98      0.90      0.94        99

    accuracy                           0.94       189
   macro avg       0.94      0.94      0.94       189
weighted avg       0.94      0.94      0.94       189

Support Vector Classifier:
              precision    recall  f1-score   support

           0       0.87      0.94      0.90        90
           1       0.95      0.87      0.91        99

    accuracy                           0.90       189
   macro avg       0.91      0.91      0.90       189
weighted avg       0.91      0.90      0.90       189
Step [31]:
Let's check the final accuracy scores for undersampling and oversampling. The accuracy is so high that it can be misleading, so now we introduce a neural network to test the oversampling and undersampling scores.
Output [31]
 | Technique | Score |
---|---|---|
0 | Random UnderSampling | 0.925926 |
1 | Oversampling (SMOTE) | 0.988080 |
Step [32]:
Let's install Keras in the notebook.
pip install Keras
Output [32]
Collecting Keras
Using cached https://files.pythonhosted.org/packages/ad/fd/6bfe87920d7f4fd475acd28500a42482b6b84479832bdc0fe9e589a60ceb/Keras-2.3.1-py2.py3-none-any.whl
Requirement already satisfied: keras-preprocessing>=1.0.5 in ./opt/anaconda3/lib/python3.7/site-packages (from Keras) (1.1.0)
Requirement already satisfied: numpy>=1.9.1 in ./opt/anaconda3/lib/python3.7/site-packages (from Keras) (1.17.2)
Requirement already satisfied: six>=1.9.0 in ./opt/anaconda3/lib/python3.7/site-packages (from Keras) (1.12.0)
Requirement already satisfied: scipy>=0.14 in ./opt/anaconda3/lib/python3.7/site-packages (from Keras) (1.4.1)
Requirement already satisfied: h5py in ./opt/anaconda3/lib/python3.7/site-packages (from Keras) (2.9.0)
Requirement already satisfied: pyyaml in ./opt/anaconda3/lib/python3.7/site-packages (from Keras) (5.1.2)
Requirement already satisfied: keras-applications>=1.0.6 in ./opt/anaconda3/lib/python3.7/site-packages (from Keras) (1.0.8)
Installing collected packages: Keras
Successfully installed Keras-2.3.1
Note: you may need to restart the kernel to use updated packages.
Step [33]:
Now we import all the necessary libraries. Here I have used a Sequential model, the Adam optimizer, ReLU as the activation function for the first two dense layers, and softmax for the last one.
import keras
from keras import backend as K
from keras.models import Sequential
from keras.layers import Activation, Dense
from keras.optimizers import Adam
from keras.metrics import categorical_crossentropy
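The model described in this step could be defined roughly as follows. This is a sketch under assumptions: it uses the Keras bundled with TensorFlow (tensorflow.keras) rather than the standalone package, the layer sizes are read from the model summary in Output [33], and the learning rate, the loss function and the random training data are illustrative choices that the article does not state.

```python
import numpy as np
from tensorflow import keras

n_inputs = 30  # number of feature columns, matching the (None, 30) first layer in Output [33]

undersample_model = keras.Sequential([
    keras.Input(shape=(n_inputs,)),
    keras.layers.Dense(n_inputs, activation="relu"),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(2, activation="softmax"),   # one probability per class
])

# Assumed settings: the learning rate and the loss are not given in the article.
undersample_model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Random stand-in data (assumption), only to show the training and prediction calls.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, n_inputs)).astype("float32")
y_train = rng.integers(0, 2, size=200)

undersample_model.fit(X_train, y_train, validation_split=0.2, batch_size=25,
                      epochs=2, shuffle=True, verbose=0)
probs = undersample_model.predict(X_train[:8], verbose=0)
print(undersample_model.count_params())
```

With these layer sizes the model has 930 + 992 + 66 = 1,988 trainable parameters, the same total as in the summary.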
Output [33]
Using TensorFlow backend.
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_1 (Dense)              (None, 30)                930
_________________________________________________________________
dense_2 (Dense)              (None, 32)                992
_________________________________________________________________
dense_3 (Dense)              (None, 2)                 66
=================================================================
Total params: 1,988
Trainable params: 1,988
Non-trainable params: 0
Step [34]:
We can see that, with a batch size of 25, the accuracy increases with each epoch for the undersampled model. We trained for only 20 epochs.
undersample_model.fit(X_train, y_train, validation_split=0.2, batch_size=25, epochs=20, shuffle=True, verbose=2)
Output [34]
Train on 604 samples, validate on 152 samples
Epoch 1/20 - 1s - loss: 0.4678 - accuracy: 0.7964 - val_loss: 0.3206 - val_accuracy: 0.9079
Epoch 2/20 - 0s - loss: 0.3046 - accuracy: 0.8974 - val_loss: 0.2382 - val_accuracy: 0.9211
.
.
.
Epoch 20/20 - 0s - loss: 0.0602 - accuracy: 0.9785 - val_loss: 0.1083 - val_accuracy: 0.9605
<keras.callbacks.callbacks.History at 0x147c7aa90>
Step [35]:
Let's plot confusion matrix for Random UnderSample
Output [35]
Confusion matrix, without normalization
[[55244  1619]
 [   11    87]]
Confusion matrix, without normalization
[[56863     0]
 [    0    98]]
Step [36]:
Now let's do the same steps as above for the oversample model.
Output [36]
Confusion matrix, without normalization
[[56853    10]
 [   31    67]]
Confusion matrix, without normalization
[[56863     0]
 [    0    98]]
Conclusion:
Using SMOTE on the imbalanced dataset mitigated the problem arising from the imbalance, i.e. far more non-fraud transactions than fraud transactions. Going by the confusion matrix in Step [36], the oversample model classifies 56,920 of the 56,961 test transactions correctly, an accuracy of about 99.93%, while still missing 31 of the 98 fraud cases. This is how we can use an algorithm to identify fraud and non-fraud transactions.