Detect Fraud & Non Fraud Transaction with Machine Learning Tools and Neural Networks using Python



Introduction
This is a detailed walkthrough of detecting fraud on imbalanced data using Keras with TensorFlow as the backend. The purpose of this document is to design a personalized module. Let's start with the requirements.

Requirements:
  1. Python 3
  2. Jupyter Notebook
  3. Intermediate Knowledge of Python Libraries
  4. Hands-on experience with Machine Learning Libraries
  5. Hands-on experience with Deep Learning Libraries Tensorflow and Keras
  6. Labeled Dataset

Guidelines for Reading the Document

   1. I tried to keep this article as simple as possible, so most of the code is not written out.
   2. Code is written in italic script in blue.
   3. Steps are written in maroon.
   4. The output of the code is shown under Output [].
   5. All figures are generated by the algorithms.


Goal of the Project
To detect fraudulent transactions using different classifiers and a neural network, and to find which performs best.

#Install TensorFlow in Jupyter Notebook
pip install --upgrade tensorflow

Let's start with Importing Libraries.

import numpy as np 
import pandas as pd 
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA, TruncatedSVD
import matplotlib.patches as mpatches
import time

#Classifier Libraries

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import collections


#Install imblearn to deal with sampling problem

pip install imblearn


Importing More Libraries 

from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NearMiss
from imblearn.metrics import classification_report_imbalanced
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, accuracy_score, classification_report
from collections import Counter
from sklearn.model_selection import KFold, StratifiedKFold
import warnings
warnings.filterwarnings("ignore")


Step [1]:

Let's load the dataset from a file using pandas and check the first rows of the dataset.

df = pd.read_csv('/Users/sushiladhikari/Projects/fraud.csv')
df.head()

Output [1]
   Time        V1        V2        V3        V4        V5        V6        V7        V8        V9  ...       V21       V22       V23       V24       V25       V26       V27       V28  Amount  Class
0   0.0 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388  0.239599  0.098698  0.363787  ... -0.018307  0.277838 -0.110474  0.066928  0.128539 -0.189115  0.133558 -0.021053  149.62      0
1   0.0  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361 -0.078803  0.085102 -0.255425  ... -0.225775 -0.638672  0.101288 -0.339846  0.167170  0.125895 -0.008983  0.014724    2.69      0
2   1.0 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499  0.791461  0.247676 -1.514654  ...  0.247998  0.771679  0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752  378.66      0
3   1.0 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203  0.237609  0.377436 -1.387024  ... -0.108300  0.005274 -0.190321 -1.175575  0.647376 -0.221929  0.062723  0.061458  123.50      0
4   2.0 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921  0.592941 -0.270533  0.817739  ... -0.009431  0.798278 -0.137458  0.141267 -0.206010  0.502292  0.219422  0.215153   69.99      0




Step [2]:

It is better to understand the dataset before moving forward, so let's describe it.

Output [2]
                Time             V1             V2             V3             V4             V5             V6             V7             V8             V9  ...            V21            V22            V23            V24            V25            V26            V27            V28         Amount          Class
count  284807.000000   2.848070e+05   2.848070e+05   2.848070e+05   2.848070e+05   2.848070e+05   2.848070e+05   2.848070e+05   2.848070e+05   2.848070e+05  ...   2.848070e+05   2.848070e+05   2.848070e+05   2.848070e+05   2.848070e+05   2.848070e+05   2.848070e+05   2.848070e+05  284807.000000  284807.000000
mean    94813.859575   3.919560e-15   5.688174e-16  -8.769071e-15   2.782312e-15  -1.552563e-15   2.010663e-15  -1.694249e-15  -1.927028e-16  -3.137024e-15  ...   1.537294e-16   7.959909e-16   5.367590e-16   4.458112e-15   1.453003e-15   1.699104e-15  -3.660161e-16  -1.206049e-16      88.349619       0.001727
std     47488.145955   1.958696e+00   1.651309e+00   1.516255e+00   1.415869e+00   1.380247e+00   1.332271e+00   1.237094e+00   1.194353e+00   1.098632e+00  ...   7.345240e-01   7.257016e-01   6.244603e-01   6.056471e-01   5.212781e-01   4.822270e-01   4.036325e-01   3.300833e-01     250.120109       0.041527
min         0.000000  -5.640751e+01  -7.271573e+01  -4.832559e+01  -5.683171e+00  -1.137433e+02  -2.616051e+01  -4.355724e+01  -7.321672e+01  -1.343407e+01  ...  -3.483038e+01  -1.093314e+01  -4.480774e+01  -2.836627e+00  -1.029540e+01  -2.604551e+00  -2.256568e+01  -1.543008e+01       0.000000       0.000000
25%     54201.500000  -9.203734e-01  -5.985499e-01  -8.903648e-01  -8.486401e-01  -6.915971e-01  -7.682956e-01  -5.540759e-01  -2.086297e-01  -6.430976e-01  ...  -2.283949e-01  -5.423504e-01  -1.618463e-01  -3.545861e-01  -3.171451e-01  -3.269839e-01  -7.083953e-02  -5.295979e-02       5.600000       0.000000
50%     84692.000000   1.810880e-02   6.548556e-02   1.798463e-01  -1.984653e-02  -5.433583e-02  -2.741871e-01   4.010308e-02   2.235804e-02  -5.142873e-02  ...  -2.945017e-02   6.781943e-03  -1.119293e-02   4.097606e-02   1.659350e-02  -5.213911e-02   1.342146e-03   1.124383e-02      22.000000       0.000000
75%    139320.500000   1.315642e+00   8.037239e-01   1.027196e+00   7.433413e-01   6.119264e-01   3.985649e-01   5.704361e-01   3.273459e-01   5.971390e-01  ...   1.863772e-01   5.285536e-01   1.476421e-01   4.395266e-01   3.507156e-01   2.409522e-01   9.104512e-02   7.827995e-02      77.165000       0.000000
max    172792.000000   2.454930e+00   2.205773e+01   9.382558e+00   1.687534e+01   3.480167e+01   7.330163e+01   1.205895e+02   2.000721e+01   1.559499e+01  ...   2.720284e+01   1.050309e+01   2.252841e+01   4.584549e+00   7.519589e+00   3.517346e+00   3.161220e+01   3.384781e+01   25691.160000       1.000000




Step [3]:

Now, we have to check the dataset for null values.
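The code for this step is not shown, so here is a minimal sketch of one way to do the check. The small DataFrame is only a stand-in so the snippet runs on its own; in the notebook, df would be the frame loaded in Step [1].

```python
import pandas as pd

# Stand-in for the real dataset (assumption: the real df comes from pd.read_csv in Step [1]).
df = pd.DataFrame({
    "Time": [0.0, 0.0, 1.0, 1.0, 2.0],
    "V1": [-1.359807, 1.191857, -1.358354, -0.966272, -1.158233],
    "Amount": [149.62, 2.69, 378.66, 123.50, 69.99],
    "Class": [0, 0, 0, 0, 0],
})

# Count missing values per column, then take the largest count.
# A result of 0 means that no column contains a null value.
max_nulls = df.isnull().sum().max()
print(max_nulls)
```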

Output [3]
 
0




Step [4 ]:

The output here is zero, which is good for us: there are no null values. Here are the columns and their dtype.

Output [4]

Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Class'],
      dtype='object')



Step [5 ]:

Here, the classes are heavily skewed, and this needs to be fixed. If we continue to use this dataset as it is, it will cause a lot of errors and can lead to overfitting. I will fix this later on.
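As a rough sketch, the percentages can be computed with value_counts. The label column below is rebuilt from figures reported in this article (284807 rows in Output [2], and 492 frauds, which is consistent with the Class mean of 0.001727); it is a stand-in for df['Class'], not the original code.

```python
import pandas as pd

# Stand-in label column rebuilt from counts reported in this article (assumption).
n_total, n_fraud = 284807, 492
labels = pd.Series([0] * (n_total - n_fraud) + [1] * n_fraud, name="Class")

# Share of each class as a percentage, rounded to two decimals.
shares = (labels.value_counts(normalize=True) * 100).round(2)
print("No Frauds", shares[0], "% of the dataset")
print("Frauds", shares[1], "% of the dataset")
```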

Output [5]

No Frauds 99.83 % of the dataset
Frauds 0.17 % of the dataset



Step [6 ]:

Here is a visualization of the class distribution; we can see that the dataset is not balanced.

Output [6]


Text(0.5, 1.0, 'Class Distributions \n (0: No Fraud || 1: Fraud)')





Step [7 ]:

Now, let's visualize the distributions of 'Amount' and 'Time', which are not yet scaled.

Output [7]









Step [8 ]:

As we can see from the above figures, 'Amount' and 'Time' are not scaled, so we have to scale them. For scaling we use the following code.
from sklearn.preprocessing import StandardScaler, RobustScaler
#RobustScaler is less prone to outliers than StandardScaler because it uses the median and the interquartile range, which is important for this project. After scaling, both 'Time' and 'Amount' are replaced by their scaled values.
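
A minimal sketch of the scaling step, assuming RobustScaler is the scaler applied to both columns. The six rows are stand-in values, and the names scaled_amount and scaled_time follow the column names that appear in Output [9].

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Stand-in data (assumption): a few 'Amount' and 'Time' values, including one large outlier.
df = pd.DataFrame({
    "Amount": [149.62, 2.69, 378.66, 123.50, 69.99, 25691.16],
    "Time": [0.0, 0.0, 1.0, 1.0, 2.0, 172792.0],
})

# RobustScaler subtracts the median and divides by the interquartile range.
df["scaled_amount"] = RobustScaler().fit_transform(df[["Amount"]]).ravel()
df["scaled_time"] = RobustScaler().fit_transform(df[["Time"]]).ravel()

# Replace the raw columns with their scaled versions.
df = df.drop(columns=["Time", "Amount"])
print(df)
```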

Output [8]


No Frauds 99.83 % of the dataset
Frauds 0.17 % of the dataset
Train: [ 30473  30496  31002 ... 284804 284805 284806] Test: [    0     1     2 ... 57017 57018 57019]
Train: [     0      1      2 ... 284804 284805 284806] Test: [ 30473  30496  31002 ... 113964 113965 113966]
Train: [     0      1      2 ... 284804 284805 284806] Test: [ 81609  82400  83053 ... 170946 170947 170948]
Train: [     0      1      2 ... 284804 284805 284806] Test: [150654 150660 150661 ... 227866 227867 227868]
Train: [     0      1      2 ... 227866 227867 227868] Test: [212516 212644 213092 ... 284804 284805 284806]
----------------------------------------------------------------------------------------------------
Label Distributions: 

[0.99827076 0.00172924]
[0.99827952 0.00172048]




Step [9 ]:

The next thing we need to do is take a sample and build an equally distributed dataset, since the dataset we have is heavily skewed.
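
One way to build such a subsample is random undersampling: keep every fraud row and an equally sized random pick of non-fraud rows. The sketch below uses a stand-in frame with a single feature, and the name new_df is only illustrative.

```python
import numpy as np
import pandas as pd

# Stand-in imbalanced data (assumption): 1000 rows, 20 of them fraud.
rng = np.random.default_rng(0)
df = pd.DataFrame({"V1": rng.normal(size=1000), "Class": [1] * 20 + [0] * 980})

# Shuffle first so the non-fraud rows we keep are a random pick.
df = df.sample(frac=1, random_state=42)

fraud_df = df[df["Class"] == 1]
non_fraud_df = df[df["Class"] == 0].iloc[:len(fraud_df)]  # as many non-frauds as frauds

# Combine both halves and shuffle again so the classes are interleaved.
new_df = pd.concat([fraud_df, non_fraud_df]).sample(frac=1, random_state=42)
print(new_df["Class"].value_counts(normalize=True))
```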

Output [9]
        scaled_amount  scaled_time        V1        V2        V3        V4        V5        V6        V7        V8  ...       V20       V21       V22       V23       V24       V25       V26       V27       V28  Class
106162       5.791798    -0.174368 -1.082758 -0.508941  1.445683  1.971222 -1.202002  0.523035  1.565089  0.032811  ...  0.770887  0.566066  0.909191  1.098293  0.340178 -0.917628 -0.260358  0.099337  0.285882      0
 77099      -0.237546    -0.326660 -0.075483  1.812355 -2.566981  4.127549 -1.628532 -0.805895 -3.390135  1.019353  ...  0.338598  0.794372  0.270471 -0.143624  0.013566  0.634203  0.213693  0.773625  0.387434      1
 93398      -0.182771    -0.238454 -0.296012  0.968025  1.460175 -0.082492  0.188890 -0.621817  0.812876 -0.131799  ...  0.189499 -0.264155 -0.573313 -0.051576  0.270170 -0.224096  0.026865  0.074772 -0.135223      0
143731       3.056941     0.010385 -2.207631  3.259076 -5.436365  3.684737 -3.066401 -0.671323 -3.696178  1.822272  ...  0.808336  0.920899  0.037675  0.026754 -0.791489  0.176493 -0.136312  1.087585  0.373834      1
 96994      -0.202194    -0.219164  0.286302  1.399345 -1.682503  3.864377 -1.185373 -0.341732 -2.539380  0.768378  ...  0.270360  0.352456 -0.243678 -0.194079 -0.172201  0.742237  0.127790  0.569731  0.291206      1
5 rows × 31 columns




Step [10 ]:

Let's see how the sample looks after fixing the skewed class distribution.

Output [10]


Distribution of the Classes in the subsample dataset
1    0.5
0    0.5
Name: Class, dtype: float64



Step [11 ]:


In the heatmaps we can see that the subsample matrix shows stronger correlations than the imbalanced matrix, which suggests that if we had used the imbalanced data, there would have been a high chance of overfitting.

Output [11]






Step [12 ]:

From the heatmap above we can see that V10, V12, V14 and V17 are negatively correlated with the class. Let's draw box plots to visualize the correlation. Here, we can see that more frauds exist at the low end of these features (Green = no fraud, Red = fraud). Note: the lower the feature value, the more likely the transaction is a fraud.

Output [12]





Step [13 ]:

In the same way, from the heatmap we can see that columns V2, V4, V11 and V19 are positively correlated with the class. Let's draw box plots to visualize the correlation and the outliers. Note: the higher the feature value, the higher the probability of a fraudulent transaction.

Output [13]





Step [14 ]:

To dive deeper into how fraudulent transactions are distributed, I am using norm from scipy.stats. Let's visualize this.

Output [14]






Step [15 ]:

Now we remove the outliers. The cut-off is 1.5 times the interquartile range (q75 - q25): values below q25 minus the cut-off or above q75 plus the cut-off are treated as outliers. We need to be very careful when choosing this threshold: if we remove too many outliers we may lose information, whereas if we keep too many extreme values the model may overfit.

Here we can see the outlier bounds for the different columns.
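
The arithmetic behind the bounds can be sketched as follows, using the V14 quartiles printed in Output [15]. The helper is_outlier is a hypothetical name used only for illustration.

```python
# Quartiles of V14, copied from Output [15].
q25 = -9.692722964972385
q75 = -4.282820849486866

iqr = q75 - q25          # interquartile range
cut_off = 1.5 * iqr      # how far beyond the quartiles a value may lie
lower = q25 - cut_off
upper = q75 + cut_off
print("iqr:", iqr, "| Cut Off:", cut_off)
print("V14 Lower:", lower, "| V14 Upper:", upper)

def is_outlier(value):
    """Return True when a value falls outside [lower, upper]."""
    return value < lower or value > upper

# The first value is the first outlier listed in Output [15]; the second is hypothetical.
print(is_outlier(-19.2143254902614), is_outlier(-6.0))
```

Rows for which is_outlier is true would then be dropped from the subsample before moving on to the next feature.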

Output [15]


Quartile 25: -9.692722964972385 | Quartile 75: -4.282820849486866
iqr: 5.409902115485519
Cut Off: 8.114853173228278
V14 Lower: -17.807576138200663
V14 Upper: 3.8320323237414122
Feature V14 Outliers for Fraud Cases: 4
V10 outliers:[-19.2143254902614, -18.8220867423816, -18.4937733551053, -18.049997689859396]
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
V12 Lower: -17.3430371579634
V12 Upper: 5.776973384895937
V12 outliers: [-18.683714633344298, -18.553697009645802, -18.047596570821604, -18.4311310279993]
Feature V12 Outliers for Fraud Cases: 4
Number of Instances after outliers removal: 976
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
V10 Lower: -14.89885463232024
V10 Upper: 4.920334958342141
V10 outliers: [-15.1237521803455, -15.346098846877501, -18.9132433348732, -22.1870885620007, -14.9246547735487, -16.6496281595399, -15.563791338730098, -22.1870885620007, -14.9246547735487, -16.2556117491401, -20.949191554361104, -19.836148851696, -18.2711681738888, -24.5882624372475, -15.563791338730098, -22.1870885620007, -22.1870885620007, -15.2399619587112, -17.141513641289198, -23.2282548357516, -24.403184969972802, -15.2318333653018, -16.3035376590131, -16.7460441053944, -15.124162814494698, -15.2399619587112, -16.6011969664137]
Feature V10 Outliers for Fraud Cases: 27
Number of Instances after outliers removal: 945



Step [16 ]:

Now let's draw the box plots again after removing the outliers. We can observe that the outliers are reduced and only a few extreme values remain in the subsample dataset.

Output [16]






Step [17 ]:

We are using t-SNE, PCA and TruncatedSVD to reduce the dimensionality of the high-dimensional dataset.
t-SNE helps to find clusters of similar data points in two dimensions without losing a lot of information.
PCA is defined as an orthogonal linear transformation that transforms the data to a new coordinate system.
TruncatedSVD is a transformer that performs linear dimensionality reduction by means of truncated singular value decomposition (SVD).
All three were executed, with the timings shown below.
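
A minimal sketch of running and timing the three reducers, assuming a random stand-in feature matrix in place of the subsample features. The settings (two components, random_state=42) are illustrative choices, not necessarily the ones used for the figures.

```python
import time
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.manifold import TSNE

# Stand-in feature matrix (assumption): 200 samples with 10 features.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))

reducers = {
    "T-SNE": TSNE(n_components=2, random_state=42),
    "PCA": PCA(n_components=2, random_state=42),
    "Truncated SVD": TruncatedSVD(n_components=2, random_state=42),
}

embeddings = {}
for name, reducer in reducers.items():
    t0 = time.time()
    embeddings[name] = reducer.fit_transform(X)  # project down to 2 dimensions
    print(f"{name} took {time.time() - t0:.2} s")
```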


Output [17]


T-SNE took 5.3 s
PCA took 0.059 s
Truncated SVD took 0.0041 s



Step [18 ]:

Here is the Subsample Visualization for t-SNE, PCA and Truncated SVD

Output [18]




Step [19 ]:

Next we use classifiers. We use sklearn to split the data and train four classifiers on it to see which of them gives the highest accuracy on fraud detection.
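
A sketch of how the four classifiers could be compared, assuming 5-fold cross-validation on the training split. The data comes from make_classification as a stand-in for the balanced subsample, so the scores it prints will differ from Output [19].

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the balanced subsample (assumption): 400 rows, 10 features.
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

classifiers = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "SVC": SVC(),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
}

training_scores = {}
for name, clf in classifiers.items():
    # Mean accuracy over 5 cross-validation folds of the training split.
    training_scores[name] = cross_val_score(clf, X_train, y_train, cv=5).mean()
    print(f"Classifiers: {name} has a training score of {training_scores[name]:.2%}")
```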

Output [19]


Classifiers: LogisticRegression Has a training score of 95.0 % accuracy score
Classifiers: KNeighborsClassifier Has a training score of 93.0 % accuracy score
Classifiers: SVC Has a training score of 94.0 % accuracy score
Classifiers: DecisionTreeClassifier Has a training score of 92.0 % accuracy score



Step [20 ]:

Let's see if we can further improve our scores. For that, we need to identify the best parameters for each classifier, and we use GridSearchCV to do so.

After running the code, we can see that the accuracy of the classifiers has improved with the parameters from the grid search. Note that even though the accuracy has improved, the chance of overfitting remains high.
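
For illustration, a grid search over one classifier might look like the sketch below. The parameter grid is hypothetical (the grids actually used are not listed in this article), and the data is again a stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Stand-in for the balanced subsample (assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=42)

# Hypothetical grid: candidate values for the regularization strength C.
param_grid = {"C": [0.001, 0.01, 0.1, 1, 10, 100]}

grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid.fit(X, y)

best_log_reg = grid.best_estimator_   # model refit on all data with the best parameters
print("Best parameters:", grid.best_params_)
print(f"Logistic Regression Cross Validation Score: {grid.best_score_:.2%}")
```

The same pattern applies to the other three classifiers, each with its own grid.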

Output [20]


Logistic Regression Cross Validation Score:  94.72%
Knears Neighbors Cross Validation Score 93.39%
Support Vector Classifier Cross Validation Score 94.71%
DecisionTree Classifier Cross Validation Score 93.65%



Step [21 ]:


Before moving forward we need to implement the NearMiss technique to undersample during cross-validation. Undersampling refers to a group of techniques designed to balance the class distribution of a classification dataset that has a skewed class distribution.

To learn more about cross-validation, please follow the link.
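
To give a feel for what NearMiss does, here is a from-scratch NumPy sketch of the idea behind its first variant: keep the majority samples whose average distance to their nearest minority samples is smallest. This is an illustration under simplifying assumptions (two classes, Euclidean distance, a hypothetical function name), not the imblearn implementation used in the notebook.

```python
import numpy as np
from collections import Counter

def near_miss_1(X, y, k=3):
    """Keep every minority sample plus the majority samples that sit closest to them."""
    counts = Counter(y.tolist())
    minority = min(counts, key=counts.get)
    X_min, X_maj = X[y == minority], X[y != minority]

    # Distance from every majority sample to every minority sample.
    dists = np.linalg.norm(X_maj[:, None, :] - X_min[None, :, :], axis=2)
    # Mean distance to the k nearest minority samples.
    mean_k = np.sort(dists, axis=1)[:, :k].mean(axis=1)
    # Keep as many majority samples as there are minority samples.
    keep = np.argsort(mean_k)[:len(X_min)]

    X_res = np.vstack([X_maj[keep], X_min])
    y_res = np.concatenate([y[y != minority][keep], y[y == minority]])
    return X_res, y_res

# Stand-in data (assumption): 500 rows, 25 of them fraud.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = np.array([1] * 25 + [0] * 475)

X_res, y_res = near_miss_1(X, y)
print("NearMiss Label Distribution:", Counter(y_res.tolist()))
```

When undersampling is combined with cross-validation, the resampling should be applied to the training part of each fold only, so that the validation part keeps the original class distribution.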

Output [21]

Train: [ 53548 53844 53924 ... 284804 284805 284806] Test: [ 0 1 2 ... 56963 56964 56965]
Train: [     0      1      2 ... 284804 284805 284806] Test: [ 53548  53844  53924 ... 115164 117282 117955]
Train: [     0      1      2 ... 284804 284805 284806] Test: [113920 113921 113922 ... 179536 180189 180353]
Train: [     0      1      2 ... 284804 284805 284806] Test: [170870 170871 170872 ... 238445 239634 239935]
Train: [     0      1      2 ... 238445 239634 239935] Test: [227826 227827 227828 ... 284804 284805 284806]
NearMiss Label Distribution: Counter({0: 492, 1: 492})



Step [22 ]:

Now we shuffle the undersampled data and plot the learning curves. For this we use ShuffleSplit and learning_curve; please see the sklearn documentation to learn more about them.
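
A small sketch of how ShuffleSplit and learning_curve fit together, with stand-in data and assumed settings (10 random 80/20 splits, five training-set sizes). It prints the averaged scores instead of plotting them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, learning_curve

# Stand-in for the undersampled data (assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=42)

# 10 random 80/20 splits (assumed settings for illustration).
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)

train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    cv=cv, train_sizes=np.linspace(0.1, 1.0, 5),
)

# Average over the splits: one training and one validation score per training-set size.
for n, tr, te in zip(train_sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print(f"{int(n):4d} samples | train {tr:.3f} | validation {te:.3f}")
```

A large, persistent gap between the two curves is the usual sign of overfitting.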

Output [22]

<module 'matplotlib.pyplot' from '/Users/sushiladhikari/opt/anaconda3/lib/python3.7/site-packages/matplotlib/pyplot.py'>





Step [23 ]:

Check whether the scores have increased after using cross_val_predict.
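
A minimal sketch of scoring out-of-fold predictions with cross_val_predict. It assumes the numbers in Output [23] are ROC AUC values, which the article does not state explicitly, and it uses stand-in data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

# Stand-in for the undersampled training data (assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=42)

# Each sample is scored by a model that did not see it during training.
log_reg_pred = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=5, method="decision_function"
)

# Assumption: ROC AUC is the metric being reported.
auc = roc_auc_score(y, log_reg_pred)
print("Logistic Regression: ", auc)
```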

Output [23]


Logistic Regression:  0.9804539457061846
KNears Neighbors:  0.9319923640549115
Support Vector Classifier:  0.9787274360629965
Decision Tree Classifier:  0.9342312119255496



Step [24]:

Check how the score evolves after each iteration; we can visualize the training performance through the ROC curve.

Output [24]








Step [25]:

Let's see what the scores are and what they should have been.

Output [25]


---------------------------------------------------------------------------------------------------------------------------------------
Overfitting: 

Recall Score: 0.93
Precision Score: 0.83
F1 Score: 0.88
Accuracy Score: 0.88
---------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------
How it should be:

Accuracy Score: 0.74
Precision Score: 0.00
Recall Score: 0.21
F1 Score: 0.00
---------------------------------------------------------------------------------------------------------------------------------------



Step [26]:

Precision, as the name says, is how precise (how sure) our model is when it flags a transaction as fraud, while recall is the share of fraud cases our model is able to detect.
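
As a worked example, the definitions can be applied to the counts in the first confusion matrix of Step [35] (undersample model) and of Step [36] (oversample model). The matrices are read as [[tn, fp], [fn, tp]], which is the scikit-learn layout; the function name summarize is only illustrative.

```python
def summarize(tn, fp, fn, tp):
    """Precision, recall, F1 and accuracy from the four cells of a confusion matrix."""
    precision = tp / (tp + fp)   # of the transactions flagged as fraud, how many really are
    recall = tp / (tp + fn)      # of the real frauds, how many were flagged
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return precision, recall, f1, accuracy

# Counts copied from the first confusion matrix of Step [35] and of Step [36].
under = summarize(tn=55244, fp=1619, fn=11, tp=87)
over = summarize(tn=56853, fp=10, fn=31, tp=67)

for name, (p, r, f1, acc) in [("UnderSample", under), ("OverSample", over)]:
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f1:.3f} accuracy={acc:.4f}")
```

The undersample model catches more frauds (higher recall) but raises many false alarms (low precision), while the oversample model does the opposite. Accuracy alone hides this trade-off on imbalanced data.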

Output [26]


Average precision-recall score: 0.05



Step [27]:

Visualisation of Precision & Recall

Output [27]


Text(0.5, 1.0, 'UnderSampling Precision-Recall curve: \n Average Precision-Recall Score =0.05')




Step [28]:


Now we implement the SMOTE technique to oversample the dataset.
Synthetic Minority Oversampling TEchnique (SMOTE) is a very popular oversampling method that was proposed to improve on random oversampling.
We use RandomizedSearchCV to get the best parameters to fit the model.
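
The core idea of SMOTE is to create new minority samples on the line segment between an existing minority sample and one of its nearest minority neighbours. The NumPy sketch below illustrates that interpolation step only; the function name and settings are hypothetical, and in the notebook the SMOTE class imported from imblearn does this work.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise distances between minority samples; a sample is not its own neighbour.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]   # k nearest minority neighbours

    base = rng.integers(0, len(X_min), size=n_new)             # random starting samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour for each
    gap = rng.random((n_new, 1))                               # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Stand-in minority class (assumption): 20 fraud rows with 4 features.
rng = np.random.default_rng(42)
X_fraud = rng.normal(loc=3.0, size=(20, 4))
X_new = smote_sketch(X_fraud, n_new=80)
print(X_new.shape)
```

As with undersampling, the synthetic samples should be generated from the training folds only during cross-validation, never from the validation or test data.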

Output [28]


Length of X (train): 227846 | Length of y (train): 227846
Length of X (test): 56961 | Length of y (test): 56961
---------------------------------------------------------------------------------------------------------------------------------------

accuracy: 0.9705855710313749
precision: 0.0658000139635098
recall: 0.9137617656604998
f1: 0.1215697736609885
---------------------------------------------------------------------------------------------------------------------------------------


              precision    recall  f1-score   support

    No Fraud       1.00      0.99      0.99     56863
       Fraud       0.11      0.86      0.20        98

    accuracy                           0.99     56961
   macro avg       0.56      0.92      0.60     56961
weighted avg       1.00      0.99      0.99     56961



Average precision-recall score: 0.75



Step [29]:

Visualisations of Precision & Recall After using SMOTE

Output [29]


Text(0.5, 1.0, 'OverSampling Precision-Recall curve: \n Average Precision-Recall Score =0.75')


Fitting oversample data took :3.6500542163848877 sec


Step [30]:

The heatmaps show the confusion matrix of each classifier.

Output [30]





Logistic Regression:
              precision    recall  f1-score   support

           0       0.89      0.97      0.93        90
           1       0.97      0.89      0.93        99

    accuracy                           0.93       189
   macro avg       0.93      0.93      0.93       189
weighted avg       0.93      0.93      0.93       189

KNears Neighbors:
              precision    recall  f1-score   support

           0       0.86      0.99      0.92        90
           1       0.99      0.85      0.91        99

    accuracy                           0.92       189
   macro avg       0.92      0.92      0.92       189
weighted avg       0.93      0.92      0.92       189

Support Vector Classifier:
              precision    recall  f1-score   support

           0       0.90      0.98      0.94        90
           1       0.98      0.90      0.94        99

    accuracy                           0.94       189
   macro avg       0.94      0.94      0.94       189
weighted avg       0.94      0.94      0.94       189

Decision Tree Classifier:
              precision    recall  f1-score   support

           0       0.87      0.94      0.90        90
           1       0.95      0.87      0.91        99

    accuracy                           0.90       189
   macro avg       0.91      0.91      0.90       189
weighted avg       0.91      0.90      0.90       189



Step [31]:

Let's check the final accuracy scores for undersampling and oversampling. We can see that the accuracy is very high, which can be misleading. So now we introduce a neural network to test the oversampling and undersampling scores.

Output [31]
              Technique     Score
0  Random UnderSampling  0.925926
1  Oversampling (SMOTE)  0.988080




Step [32]:

Let's install Keras in the notebook.

    pip install Keras


Output [32]


Collecting Keras
  Using cached https://files.pythonhosted.org/packages/ad/fd/6bfe87920d7f4fd475acd28500a42482b6b84479832bdc0fe9e589a60ceb/Keras-2.3.1-py2.py3-none-any.whl
Requirement already satisfied: keras-preprocessing>=1.0.5 in ./opt/anaconda3/lib/python3.7/site-packages (from Keras) (1.1.0)
Requirement already satisfied: numpy>=1.9.1 in ./opt/anaconda3/lib/python3.7/site-packages (from Keras) (1.17.2)
Requirement already satisfied: six>=1.9.0 in ./opt/anaconda3/lib/python3.7/site-packages (from Keras) (1.12.0)
Requirement already satisfied: scipy>=0.14 in ./opt/anaconda3/lib/python3.7/site-packages (from Keras) (1.4.1)
Requirement already satisfied: h5py in ./opt/anaconda3/lib/python3.7/site-packages (from Keras) (2.9.0)
Requirement already satisfied: pyyaml in ./opt/anaconda3/lib/python3.7/site-packages (from Keras) (5.1.2)
Requirement already satisfied: keras-applications>=1.0.6 in ./opt/anaconda3/lib/python3.7/site-packages (from Keras) (1.0.8)
Installing collected packages: Keras
Successfully installed Keras-2.3.1
Note: you may need to restart the kernel to use updated packages.


Step [33]:

Now we have to import all the necessary libraries. Here I have used a Sequential model, the Adam optimizer, relu as the activation function for the first two dense layers and softmax for the last one.

import keras
from keras import backend as K
from keras.models import Sequential
from keras.layers import Activation
from keras.layers.core import Dense
from keras.optimizers import Adam
from keras.metrics import categorical_crossentropy

Output [33]


Using TensorFlow backend.


Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 30)                930       
_________________________________________________________________
dense_2 (Dense)              (None, 32)                992       
_________________________________________________________________
dense_3 (Dense)              (None, 2)                 66        
=================================================================
Total params: 1,988
Trainable params: 1,988
Non-trainable params: 0
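
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. In the sketch below, the input width of 30 is inferred from dense_1 having 930 parameters, and the helper name is only illustrative.

```python
def dense_params(n_inputs, n_units):
    """A Dense layer has one weight per input-unit pair plus one bias per unit."""
    return n_inputs * n_units + n_units

# Layer widths read off the summary above; the input width of 30 is inferred
# from dense_1 having 930 parameters (30 * 30 + 30).
n_inputs = 30
layer_units = [30, 32, 2]

params = []
for units in layer_units:
    params.append(dense_params(n_inputs, units))
    n_inputs = units   # this layer's outputs feed the next layer

print(params, sum(params))
```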


Step [34]:

We can see that in each epoch, with a batch size of 25, the accuracy increases for the undersample model. We only ran 20 epochs.

undersample_model.fit(X_train, y_train, validation_split=0.2, batch_size=25, epochs=20, shuffle=True, verbose=2)

Output [34]

Train on 604 samples, validate on 152 samples
Epoch 1/20
 - 1s - loss: 0.4678 - accuracy: 0.7964 - val_loss: 0.3206 - val_accuracy: 0.9079
Epoch 2/20
 - 0s - loss: 0.3046 - accuracy: 0.8974 - val_loss: 0.2382 - val_accuracy: 0.9211
.
.
.
Epoch 20/20
 - 0s - loss: 0.0602 - accuracy: 0.9785 - val_loss: 0.1083 - val_accuracy: 0.9605


<keras.callbacks.callbacks.History at 0x147c7aa90>




Step [35]:

Let's plot the confusion matrix for the random undersample model.

Output [35]




Confusion matrix, without normalization
[[55244  1619]
 [   11    87]]
Confusion matrix, without normalization
[[56863     0]
 [    0    98]]



Step [36]:

Now let's repeat the same steps for the oversample model.

Output [36]




Confusion matrix, without normalization
[[56853    10]
 [   31    67]]
Confusion matrix, without normalization
[[56863     0]
 [    0    98]]




Conclusion:

Using SMOTE on the imbalanced dataset rectified the issue arising from the imbalance, i.e. far more non-fraud transactions than fraud transactions. The confusion matrix in Step [36] shows the oversample model classifying 56,920 of 56,961 test transactions correctly, an accuracy of about 99.93%. This is how we can use an algorithm to identify fraud and non-fraud transactions.













