Machine learning classification is a type of supervised learning in which an algorithm maps a set of inputs to a discrete output. Classification models have a wide range of applications across disparate industries and are one of the mainstays of supervised learning. This is because, across industries, many analytical questions can be framed as mapping a set of inputs to a discrete output set. The simplicity of defining a classification problem makes classification models versatile and industry agnostic.

An important part of building classification models is evaluating model performance. In short, data scientists need a reliable way to test approximately how well a model correctly predicts an outcome. Many tools are available for evaluating model performance; depending on the problem you are trying to solve, some may be more useful than others.

For example, if all outcomes are represented evenly in your data, accuracy may suffice as a performance measure. Conversely, if your data is imbalanced, meaning one or more outcomes are significantly underrepresented, you may want to use a metric like precision. If you want to understand how robust your model is across decision thresholds, metrics such as the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC) may be more appropriate.

Given that choosing the appropriate evaluation metric depends on the question you are trying to answer, every data scientist should be familiar with the suite of classification performance metrics. Python's Scikit-learn library has a metrics module that makes computing accuracy, precision, AUROC, and AUPRC easy. Further, it is equally important to know how to visualize model performance using ROC curves, PR curves, and confusion matrices.

Here, we will consider the task of building a simple classification model that predicts the probability of customer churn. Churn is defined as the event of a customer leaving a company, canceling a subscription, or no longer making a purchase after a certain period of time. We will work with the Telco Churn data set, which contains information about a fictional telecom company. Our task is to predict whether or not a customer will leave the company and to evaluate how well our model performs this task.

Building a classification model

Let's start by reading the Telco Churn data into a Pandas data frame:

import pandas as pd

df = pd.read_csv('telco_churn.csv')

Now, let's display the first five rows of data:

df.head()


We see that the data set contains 21 columns with both categorical and numeric values. The data also includes 7043 rows, corresponding to 7043 individual customers.
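A quick way to check a data frame's dimensions and column types is with .shape and .dtypes. Since we can't bundle the real CSV here, this sketch uses a tiny stand-in frame containing three of the dataset's actual columns (tenure, MonthlyCharges, Churn); the values are illustrative:

```python
import pandas as pd

# A tiny stand-in for the Telco Churn data (the real file has 7,043 rows
# and 21 columns); the column names follow the actual data set.
df = pd.DataFrame({
    'tenure': [1, 34, 2],
    'MonthlyCharges': [29.85, 56.95, 53.85],
    'Churn': ['No', 'No', 'Yes'],
})

# .shape gives (rows, columns); .dtypes shows which columns are numeric
print(df.shape)
print(df.dtypes)
```

On the real data set, df.shape would report (7043, 21), matching the figures above.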

Let's build a simple model that takes tenure, the length of time a customer has been with the company, and MonthlyCharges as inputs and predicts the probability of the customer churning. The output is the Churn column, which has a value of either yes or no.

First, let's modify our target column to have machine-readable binary values. We will give the Churn column a value of one for yes and zero for no. We can achieve this using numpy's where() method:

import numpy as np

df['Churn'] = np.where(df['Churn'] == 'Yes', 1, 0)

Next, let's define our inputs and output:

X = df[['tenure', 'MonthlyCharges']]
y = df['Churn']

We can then split our data for training and testing. To do this, we need to import the train_test_split method from the model_selection module in Sklearn. Let's generate a training set that makes up 67 percent of our data and use the remainder for testing. The test set then consists of 2,325 data points:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
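If you are wondering where the figure of 2,325 test points comes from: train_test_split rounds the test-set size up. A minimal sketch of the arithmetic:

```python
import numpy as np

# train_test_split computes the test size as ceil(test_size * n_samples),
# which is how 7,043 rows with test_size=0.33 yields 2,325 test points
n_samples, test_size = 7043, 0.33
n_test = int(np.ceil(n_samples * test_size))
n_train = n_samples - n_test
print(n_test, n_train)  # 2325 4718
```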

For our classification model, we'll use a simple logistic regression model. Let's import the LogisticRegression class from Sklearn's linear_model module:

from sklearn.linear_model import LogisticRegression

Now, let's define an instance of the logistic regression class and store it in a variable called clf_model. We then fit our model to our training data:

clf_model = LogisticRegression()
clf_model.fit(X_train, y_train)

Finally, we can make predictions on the test data and store the predictions in a variable called y_pred:

y_pred = clf_model.predict(X_test)

Now that we've trained our model and made predictions on the test data, we need to evaluate how well it performed.

Accuracy and confusion matrices

A simple and widely used performance metric is accuracy. This is simply the total number of correct predictions divided by the number of data points in the test set. We can import the accuracy_score method from the metrics module in Sklearn and calculate the accuracy. The first argument of accuracy_score is the actual labels, which are stored in y_test. The second argument is the predictions, which are stored in y_pred:

from sklearn.metrics import accuracy_score

print("Accuracy: ", accuracy_score(y_test, y_pred))

We see that our model has a prediction accuracy of 79 percent. While this is helpful, we don't really know anything specific about how well our model predicts either churn or no churn. Confusion matrices can give us a bit more information about how well our model does for each outcome.

This is important to consider if your data is imbalanced. For example, if our test data contained 95 no-churn labels and five churn labels, a model that naively guesses "no churn" for every customer could misleadingly show 95 percent accuracy.
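We can see this effect with a quick sketch: score an all-"no churn" guess against a hypothetical 95/5 label split (these labels are made up for illustration, not drawn from the Telco data):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced test set: 95 "no churn" (0) and 5 "churn" (1)
y_true = np.array([0] * 95 + [1] * 5)

# A useless model that predicts "no churn" for every customer
y_guess = np.zeros(100, dtype=int)

# 0.95 accuracy despite catching zero churners
print(accuracy_score(y_true, y_guess))
```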

Let's now generate a confusion matrix from our predictions. First, import the confusion_matrix package from the metrics module in Sklearn:

from sklearn.metrics import confusion_matrix

Let's generate our confusion matrix array and store it in a variable called conmat:

conmat = confusion_matrix(y_test, y_pred)

Next, let's create a data frame from the confusion matrix array, called df_cm:

val = np.mat(conmat)
classnames = list(set(y_train))
df_cm = pd.DataFrame(val, index=classnames, columns=classnames)
print(df_cm)

Now, let's use Seaborn's heatmap method to visualize the confusion matrix:

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure()
heatmap = sns.heatmap(df_cm, annot=True, cmap="Blues")
heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0, ha='right')
heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=45, ha='right')
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.title('Churn Logistic Regression Model Results')

So, what does this plot tell us about the performance of our model? Looking along the diagonal of the confusion matrix, we see the numbers 1,553 and 289. The number 1,553 corresponds to the number of customers the model correctly predicted would not churn (i.e., they stay with the company). The number 289 corresponds to the number of customers the model correctly predicted would churn.
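If you would rather read the four cells programmatically than off the heatmap, scikit-learn's confusion_matrix can be unpacked directly. This is a sketch with toy labels; the counts here are illustrative, not the 1,553 and 289 from our model:

```python
from sklearn.metrics import confusion_matrix

# Toy labels standing in for y_test and y_pred
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

# For binary labels, ravel() unpacks the matrix in this fixed order:
# true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 4 1 1 2
```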

It would be more useful if we could display these as percentages of a total. For example, it would be helpful to know what percentage of all churned customers the 289 correctly predicted customers make up. We can display percentages for each outcome by adding the following line of code before the heatmap plot:

df_cm = df_cm.astype('float') / df_cm.sum(axis=1).to_numpy()[:, np.newaxis]

As we can see, our model correctly predicts 91 percent of the customers who don't churn and 46 percent of the customers who do. This nicely illustrates the limitation of using accuracy, which gave us no information about the percentage of each outcome predicted correctly.

ROC curve and AUROC

Companies often want to work with predicted probabilities instead of discrete labels. This allows them to select a threshold for labeling an outcome as either negative or positive. When dealing with probabilities, we need a way of measuring how well the model generalizes across probability thresholds. Up to this point, our algorithm has assigned binary labels using a default threshold of 0.5, but perhaps the ideal probability threshold is higher or lower, depending on the use case.

In the case of balanced data, the ideal threshold is 0.5. When our data is imbalanced, the ideal threshold is often lower. Further, companies sometimes prefer to work with probabilities rather than discrete labels altogether. Given the importance of prediction probabilities, it is useful to understand which metrics to use to evaluate them.
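As a sketch of what moving the threshold looks like in code, we can binarize hypothetical predicted probabilities at two different cutoffs (the probabilities and the 0.3 cutoff are illustrative assumptions, not values from our model):

```python
import numpy as np

# Hypothetical predicted probabilities, as from predict_proba(...)[:, 1]
y_pred_proba = np.array([0.10, 0.35, 0.48, 0.52, 0.80])

# predict() applies a 0.5 cutoff; with imbalanced data a lower one may fit better
labels_default = (y_pred_proba >= 0.5).astype(int)
labels_lower = (y_pred_proba >= 0.3).astype(int)

print(labels_default)  # [0 0 0 1 1]
print(labels_lower)    # [0 1 1 1 1]
```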

The AUROC is a way of measuring how robust your model is across decision thresholds. It is the area under the curve of the true positive rate versus the false positive rate. The true positive rate (TPR) is (true positives) / (true positives + false negatives). The false positive rate (FPR) is (false positives) / (false positives + true negatives).
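These two formulas are easy to verify by hand. Here is a minimal sketch using made-up confusion-matrix counts (not the counts from our model):

```python
# Hypothetical confusion-matrix counts, purely to illustrate the formulas
tp, fn, fp, tn = 40, 10, 20, 80

tpr = tp / (tp + fn)  # true positive rate (also called recall)
fpr = fp / (fp + tn)  # false positive rate
print(tpr, fpr)  # 0.8 0.2
```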

In the context of our churn problem, the AUROC measures how well our model captures customers who churn across different probability thresholds.

Let's start by calculating the AUROC. First, import the roc_curve and roc_auc_score methods from the metrics module:

from sklearn.metrics import roc_curve, roc_auc_score

Next, let's generate predicted probabilities for our test set using our trained model:

y_pred_proba = clf_model.predict_proba(np.array(X_test))[:,1]

We can then calculate the false positive rate (fpr) and true positive rate (tpr) for different probability thresholds:

fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)

Finally, we can plot our ROC curve:

sns.set()
plt.plot(fpr, tpr)
plt.plot(fpr, fpr, linestyle='--', color='k')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
AUROC = np.round(roc_auc_score(y_test, y_pred_proba), 2)
plt.title(f'Logistic Regression Model ROC curve; AUROC: {AUROC}');

The more quickly the true positive rate approaches one, the better the behavior of our ROC curve, so our model performs fairly well by this measure. Further, an AUROC of 0.82 is quite good, since the AUROC for a perfect model would be 1.0. We saw earlier that our model correctly predicted 91 percent of the negative cases (i.e., no churn) using the default threshold of 0.5, so this shouldn't come as too much of a surprise.
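Note that roc_curve also returns the thresholds themselves, so they can be used to pick an operating point. One common heuristic, which goes beyond what this article's model uses, is Youden's J statistic: choose the threshold that maximizes TPR minus FPR. A sketch with toy labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy labels and scores standing in for y_test and y_pred_proba
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.2, 0.7])

fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Youden's J: the threshold where TPR - FPR is largest
best = thresholds[np.argmax(tpr - fpr)]
print(best)
```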

AUPRC (average precision)

The area under the precision-recall curve gives us a good idea of our model's precision across different decision thresholds. Precision is (true positives) / (true positives + false positives). Recall is another name for the true positive rate.
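Scikit-learn also exposes these two quantities directly, via precision_score and recall_score. A quick sketch with toy labels (illustrative, not our model's predictions):

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels standing in for y_test and y_pred
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# precision = tp / (tp + fp); recall = tp / (tp + fn)
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))  # 0.75 0.75
```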

In the case of churn, the AUPRC (or average precision) measures how well our model correctly predicts that a customer will leave the company, as opposed to staying, across decision thresholds. Generating the precision-recall curve and calculating the AUPRC is similar to what we did for the AUROC:

from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score

average_precision = average_precision_score(y_test, y_pred_proba)
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba)

plt.plot(recall, precision, marker='.', label='Logistic')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend()
plt.title(f'Precision Recall Curve. AUPRC: {average_precision}')

We can see that, with an AUPRC of 0.63 and precision falling off rapidly along the precision-recall curve, our model does a poorer job of predicting whether a customer will leave as the probability threshold changes. This outcome is to be expected, since we saw that only 46 percent of the churn labels were predicted correctly when we used the default threshold of 0.5. For those interested in working with the data and code, the Python script is available here.


Data scientists across domains and industries must have a strong understanding of classification performance metrics. Knowing which metrics to use for balanced or imbalanced data is important for clearly communicating your model's performance. Naively using accuracy to communicate the results of a model trained on imbalanced data can mislead clients into thinking the model performs better than it actually does.

Further, it is important to have a strong understanding of how predictions will be used in practice. It may be that a company wants discrete prediction labels it can use for decision-making. In other cases, companies are more interested in using probabilities for their decisions, in which case we need to evaluate probabilities. Being well versed in the many facets and approaches to evaluating model performance is essential to the success of a machine learning project.

Python's Scikit-learn package conveniently provides tools for most of the performance metrics used across industries. This allows you to get a view of model performance from many angles in a short amount of time and with relatively few lines of code. Being able to quickly generate confusion matrices, ROC curves, and precision/recall curves lets data scientists iterate faster on projects.

Whether you want to quickly build and evaluate a machine learning model for a problem, compare ML models, select model features, or tune your machine learning model, a good knowledge of these classification performance metrics is an invaluable skill set.

If you are interested in learning the fundamentals of Python programming, data manipulation with Pandas, and machine learning in Python, check out Python for Data Science and Machine Learning: Python Programming, Pandas and Scikit-Learn Guides for Beginners. I hope you found this post useful/interesting.

This post was originally published on the Built In blog. The original piece can be found here.

