This article builds on a previous post, Increasing Platform Engagement with Machine Learning. Recommendations made included the use of a confusion matrix to better evaluate model performance.
Thus, this article hopes to achieve two objectives:
- To explore statistical criteria that can be used as a performance baseline for evaluating, selecting and fine tuning chosen models.
- To compare the performance of models using these criteria with optimal and sub-optimal features to highlight the importance of feature engineering.
A Blight ticket, within the City of Detroit, is a fine issued to property owners who do not properly maintain their property, subject to this and other conditions. Non-compliance with fine payments is a costly administration problem for the city.
In order to minimize the costs associated with Blight ticket compliance, the city needed to determine whether particular property owners are at risk of defaulting as a first step to reducing related administration costs.
A supervised machine learning approach was used to identify property owners who are likely to default in future, in order to reduce the above mentioned costs. The process of selecting and applying the correct algorithm was implemented in Python as follows:
- Receive information on historically compliant and non-compliant property owners. The problem is thus defined as a binary supervised classification problem
- Identify strong and weak features for comparison of relative performance
- Train three machine learning algorithms on the data, and compare their performance using Receiver Operating Characteristics curves and a Dummy Classifier
- Integrate a 5 fold cross validation in the training and testing phase, as well as testing models on 10% of original data outside the cross validation process
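The splitting scheme in the last step can be sketched in plain Python. This is a minimal illustration that assumes a shuffled index split; the helper names `holdout_split` and `k_fold` are hypothetical and are not the functions used in the project, where scikit-learn would normally do this work.

```python
import random


def holdout_split(n_rows, holdout_fraction=0.10, seed=0):
    """Shuffle the row indices and set aside a fraction as unseen data."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_holdout = int(round(n_rows * holdout_fraction))
    return indices[n_holdout:], indices[:n_holdout]


def k_fold(indices, k=5):
    """Yield (train, validation) index lists, one pair per fold."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


# Hypothetical data set of 100 rows: 10 held out, 90 cross validated
train_pool, holdout = holdout_split(n_rows=100)
splits = list(k_fold(train_pool, k=5))
```

Each of the five splits trains on 72 rows and validates on 18, while the 10 held-out rows are never seen during cross validation.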
Model Evaluation Theory
Relative Operating Characteristic Curve
The Receiver or Relative Operating Characteristic Curve is "a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied" (Wikipedia, 2019).
Why is this curve important? Because it provides a criterion for separating optimal machine learning algorithms from sub-optimal ones. How is this achieved?
The plot is generated by comparing two metrics (scaled from 0 to 1) for a given supervised application,
- The True Positive Rate. This is the ratio of true positives to all actual positives, TP / (TP + FN). Often referred to as Recall or Sensitivity. Simplifying this equation yields one that is only dependent on the False Negative Rate, TPR = 1 - FNR
- The False Positive Rate. This is the ratio of false positives (Type 1 errors) to all actual negatives, FP / (FP + TN). Often referred to as Fall-out, and not to be confused with Precision. Simplifying this equation yields one that is only dependent on the True Negative Rate, FPR = 1 - TNR
Plotting these two metrics against each other substantiates the relative nature of the curve, as implied by its name. This relationship also points out an important characteristic of handling Type 1 and Type 2 errors: both cannot be minimized together, because reducing the likelihood of one increases the likelihood of the other.
And so finding the threshold at which both metrics yield an optimal model means finding the point on the curve closest to the top left corner of the graph. Statistically, this is where the False Negative and False Positive Rates are both 0 (equivalently, the True Positive and True Negative Rates are both 1), yielding a machine capable of perfect predictions.
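Both rates follow directly from the four cells of a confusion matrix. The sketch below uses the standard definitions with made-up counts; the function name `rates` and the numbers are assumptions for illustration only.

```python
def rates(tp, fn, fp, tn):
    """Return (TPR, FPR) from confusion matrix counts."""
    tpr = tp / (tp + fn)  # true positives over all actual positives
    fpr = fp / (fp + tn)  # false positives over all actual negatives
    return tpr, fpr


# Hypothetical counts, not taken from the Blight data
tp, fn, fp, tn = 40, 10, 30, 120
tpr, fpr = rates(tp, fn, fp, tn)

# Complementary rates, confirming TPR = 1 - FNR and FPR = 1 - TNR
fnr = fn / (tp + fn)
tnr = tn / (fp + tn)
```

With these counts the True Positive Rate is 0.8 and the False Positive Rate is 0.2.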
The exact mathematical relationship that defines this distance is given below. Refer to this article for more information.
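One common way to write this distance, given here as an assumption since other selection criteria exist, is the Euclidean distance from a point on the curve to the perfect-classifier corner at (FPR, TPR) = (0, 1):

```latex
d = \sqrt{\mathrm{FPR}^{2} + \left(1 - \mathrm{TPR}\right)^{2}}
```

The threshold that minimizes d is the one whose operating point lies closest to a perfect classifier, and d = 0 only when FPR = 0 and TPR = 1.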
Finally, observing the AUC or Area Under Curve is an indication of model performance. According to Kaggle,
- A perfect classifier has AUC = 1 and a completely random classifier has AUC = 0.5. Usually, your model will score somewhere in between.
- The range of possible AUC values is [0, 1].
- If your AUC is below 0.5, that means you can invert all the outputs of your classifier and get a better score, so you did something wrong.
The third point is worth noting, because it introduces the idea of a Dummy Classifier.
According to GeeksForGeeks, this classifier is one that intentionally generates a random prediction for any type of information supplied to it. This characteristic serves as an important benchmark, because we don’t want our model to perform in this manner, i.e. have its performance be comparable to one that at best makes random predictions.
A dummy classifier has an AUC score of 0.5, and is simply a diagonal line from the origin to the top right corner of the ROC plot. Hint: half of a unit square is 0.5 😉
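The AUC claims above can be checked by hand. The sketch below is a minimal pure-Python version of the calculation, not the scikit-learn implementation; the function names, labels and scores are assumptions made for illustration. The dummy baseline is modelled as a classifier that gives every sample the same score.

```python
def roc_points(y_true, scores):
    """Sweep the decision threshold and collect (FPR, TPR) points."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for y, p in zip(y_true, predicted) if p and y == 1)
        fp = sum(1 for y, p in zip(y_true, predicted) if p and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal rule over consecutive (FPR, TPR) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical labels and scores, for illustration only
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

model_auc = area_under_curve(roc_points(y_true, scores))                  # 0.75
inverted_auc = area_under_curve(roc_points(y_true, [-s for s in scores]))  # 0.25
constant_auc = area_under_curve(roc_points(y_true, [0.5] * len(y_true)))   # 0.5
```

Negating the scores mirrors the result around 0.5, which is why an AUC below 0.5 can be improved by inverting the outputs, and the constant scorer lands exactly on the diagonal.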
Finally, the use of weak and strong features during the training process can have significant ramifications on model performance. This will be explored and visualized later.
Data Mining Implementation
Find the complete implementation here.
Understanding feature distributions
Before the training process begins, it’s important to understand the statistical distributions of various features being considered. This is important because passing redundant features into the training process is,
- A waste of time and processing power
- A source of sub-optimal model performance, and the impact of this will be visualized later
The above multi-plot was generated using the following Python code,
    import matplotlib.pyplot as plt
    import seaborn as sns

    # df_perfobj and df_perfagg are DataFrames prepared in the complete implementation

    # Initializing the plot and figures
    fig2, axs2 = plt.subplots(2, 3, figsize=(9, 5), sharey=False)

    # Disposition plot (totals per disposition, computed once)
    disp_counts = df_perfobj.groupby(by='disposition').sum()
    dispplot = sns.barplot(x=disp_counts.values.flatten(), y=disp_counts.index, ax=axs2[1, 0])
    dispplot.set_ylabel('')
    dispplot.set_xlabel('Count')
    dispplot.set_title('Disposition by compliance', fontsize=9)
    dispplot.tick_params(axis='y', labelsize=6)

    # Count plot
    compcountplot = sns.countplot(x="compliance", data=df_perfagg, ax=axs2[1, 1])
    compcountplot.set_ylabel('Count')  # the original set the x label twice
    compcountplot.set_xlabel('Compliance Type')
    compcountplot.set_xticklabels(['Non-compliant', 'Compliant'], fontsize=6)
    compcountplot.set_title('Imbalanced Binary Classification', fontsize=8)

    # Compliance vs fine_amount
    compplot = sns.barplot(x='compliance', y='fine_amount', data=df_perfagg, ax=axs2[1, 2])
    compplot.set_xlabel('Compliance Type')
    compplot.set_ylabel('Fine Amount ($)')
    compplot.set_xticklabels(['Non-compliant', 'Compliant'], fontsize=6)
    compplot.set_title('Fee Amount vs Compliance Type', fontsize=9)
The figure above provides the following value/insight from left to right,
- Effective features with respect to the target variable, and attributes that account for high observations of Compliance Type
- An imbalanced data set with respect to the target variable, Compliance Type
- The distribution of fees charged to property owners both compliant and non-compliant
This simple analysis has already revealed two features with a high likelihood of predicting the state of Compliance Type.
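The imbalance in the target variable has a practical consequence for evaluation: a model that always predicts the majority class can look accurate while identifying no one in the minority class. The sketch below shows this with a hypothetical 93/7 split, an assumption chosen for illustration and not the actual class counts in the data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted positive."""
    actual = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in actual) / len(actual)


# Hypothetical labels: 93 non-compliant (0) and 7 compliant (1)
y_true = [0] * 93 + [1] * 7
y_pred = [0] * 100  # always predict the majority class

majority_accuracy = accuracy(y_true, y_pred)  # 0.93
minority_recall = recall(y_true, y_pred)      # 0.0
```

This is the reason accuracy alone is a weak yardstick on this kind of data, and why a threshold-independent measure such as the ROC curve is useful alongside it.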
Model Training and Optimization
Four methods were defined,
- blight_model. Return clean train and test data sets that account for categorical variables by generating dummy matrices, and retain 10% of completely unseen data for model evaluation. The other 90% was used for model training.
- filterByCol. A support method to merge dummy matrices and respective data sets.
- generateAuc. Return the False Positive and True Positive distribution points, as well as the AUC score.
- getdfDummies. A support method to generate dummy matrices.
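A dummy matrix is a one-hot encoding: each category becomes its own 0/1 column. The sketch below shows the idea in plain Python; the function `one_hot` and the category labels are hypothetical and are not the project's getdfDummies method, which would more typically rely on `pandas.get_dummies`.

```python
def one_hot(values):
    """Return the sorted category names and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[int(value == category) for category in categories]
            for value in values]
    return categories, rows


# Hypothetical category labels, not taken from the Blight data set
agency = ["A", "B", "A", "C"]
columns, matrix = one_hot(agency)
# columns -> ['A', 'B', 'C']
# matrix  -> [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```

Because the columns depend on which categories are present, the train and test matrices need to be restricted to a shared set of columns before fitting.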
Here is an overview of the data acquisition, cleaning and training process on the optimal and sub-optimal features previously identified. The models chosen for this application were a Logistic Regression, a Gradient Boosted Tree Classifier, a Support Vector Machine and a Dummy Classifier.
    from sklearn.dummy import DummyClassifier
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC

    # Train/test splits for the optimal and sub-optimal feature sets
    x_train, x_test, y_train, y_test, df_test, out_test, df_otrain = blight_model(
        ['disposition'], ['fine_amount'], ['compliance'])
    x_train_p, x_test_p, y_train_p, y_test_p, df_test_p, out_test_p, df_otrain = blight_model(
        ['agency_name', 'violation_code', 'graffiti_status'], ['fine_amount'], ['compliance'])

    # Keep only the columns shared with the test set and fill missing values
    X = x_train.loc[:, df_test.columns].fillna(0)
    X_p = x_train_p.loc[:, df_test_p.columns].fillna(0)

    # Classification training - optimal features
    clflr = LogisticRegression().fit(X, y_train)
    clfrft = GradientBoostingClassifier().fit(X, y_train)
    clflnr = SVC().fit(X, y_train)
    dummy_majority = DummyClassifier(strategy='most_frequent').fit(X, y_train)

    # Classification training - sub-optimal features
    clflr_p = LogisticRegression().fit(X_p, y_train_p)
    clfrft_p = GradientBoostingClassifier().fit(X_p, y_train_p)
    clflnr_p = SVC().fit(X_p, y_train_p)
    dummy_majority_p = DummyClassifier(strategy='most_frequent').fit(X_p, y_train_p)
Once trained, the two sets of models were put to test using cross validated data and the 10% of original data isolated from the training environment. Furthermore, the models were trained with optimal and sub-optimal features.
The performance of each type of model is given below using two metrics,
- The vanilla scoring method provided by scikit-learn (mean accuracy for classifiers)
- The ROC score (area under the ROC curve)
    Test data used during cross validation:
    ----------------------------------------
    Logistic Regression (optimal features): 0.9284185361243155, ROC: 0.7593334889172573
    Logistic Regression (sub-optimal features): 0.9295582798209768, ROC: 0.6275645804221298
    SVC (optimal features): 0.9284741333778112, ROC: 0.38254876252356274
    SVC (sub-optimal features): 0.9308926139048731, ROC: 0.4873286567138604
    Gradient Boosting (optimal features): 0.9284185361243155, ROC: 0.7650462148864489
    Gradient Boosting (sub-optimal features): 0.9308370166513774, ROC: 0.6306016399993081

    Test data outside of cross validation set:
    -------------------------------------------
    Logistic Regression (optimal features): 0.9321991493620215, ROC: 0.7641725410792254
    Logistic Regression (sub-optimal features): 0.9272579434575932, ROC: 0.6221755099259199
    SVC (optimal features): 0.9321991493620215, ROC: 0.3981806476215124
    SVC (sub-optimal features): 0.9280085063797848, ROC: 0.46790645811915743
    Gradient Boosting (optimal features): 0.9321991493620215, ROC: 0.7679273126608908
    Gradient Boosting (sub-optimal features): 0.9280085063797848, ROC: 0.6234201250183061
The data provides crucial insight into how performance varies for the selected models given different data sets and trained features. Accuracy is almost identical (roughly 0.93) for every model and feature set, so the ROC score is the more informative of the two metrics. Judged by ROC score, and with the exception of the Support Vector Machine, performance is better with optimal features.
More generally, the same observations hold for the data that was held outside of the cross validation data set, indicating consistent performance across models. The ROC curves for each model and feature set are illustrated below.
The figure above illustrates the ROC curves for three models: Support Vector Machine, Logistic Regression and a Gradient Boosted Tree Classifier. The following observations are noted:
- ROC curves in orange are those trained with sub-optimal features, and those in blue with optimal features.
- Optimized curves are closer to the top left corner of the graph
The key observation noted here is the relative performance of the Logistic Regression and Gradient Boosting (GB) Classifier.
Taken together with the figure and performance data above, the Logistic Regression and GB Classifier perform the best, with negligible differences between their AUC scores on the isolated data for both optimal and sub-optimal features,
- Optimal features, 0.764 and 0.768 respectively
- Sub-optimal features, 0.622 and 0.623 respectively
This project considered the performance of three Supervised Learning Models, and used statistical metrics derived from a confusion matrix: the True Positive and False Positive rates, also known as Recall and Fall-out.
The only recommendation at this point is to consider fine-tuning model parameters to extract better performance.