An introduction to this modern gradient boosting library

If you’ve worked as a data scientist, participated in Kaggle contests, or even browsed data science articles on the Internet, there’s a good chance you’ve heard of XGBoost. Even today, it is often a go-to algorithm for many Kagglers and data scientists working on general machine learning tasks.

While XGBoost is popular for good reasons, it has some limitations that I mentioned in my article below.

Odds are that you’ve heard of XGBoost, but have you ever heard of CatBoost? CatBoost is another open-source gradient boosting library, created by researchers at Yandex. While it may be slower than XGBoost, it has a number of interesting features and can be used as an alternative to XGBoost or incorporated into an ensemble model alongside it. On some benchmark datasets, CatBoost has even outperformed XGBoost.

This article compares CatBoost to XGBoost and demonstrates how to train a CatBoost model on a simple dataset.

Like XGBoost, CatBoost is a gradient boosting framework. However, CatBoost has several features, listed below, that set it apart from XGBoost:

  • CatBoost is a different implementation of gradient boosting that uses a concept called ordered boosting, which is covered thoroughly in the CatBoost paper.
  • Because CatBoost is a different implementation of gradient boosting, it has the potential to outperform other implementations on certain tasks.
  • CatBoost includes visualization widgets for cross-validation and grid search that can be viewed in Jupyter notebooks.
  • CatBoost’s Pool module supports preprocessing of categorical and text features.

For a complete list of features, be sure to check the CatBoost documentation. Although CatBoost has these additional features, its main drawback is that it is generally slower than XGBoost. If you are willing to sacrifice some speed, though, this trade-off can be justified in certain situations.

To install CatBoost using pip, run the command listed below.

pip install catboost

Alternatively, you can install CatBoost with Conda using the following commands.

conda config --add channels conda-forge
conda install catboost
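
A quick, optional way to verify the installation (an extra step, not part of the original setup instructions) is to import the package and check its version.

import catboost
print(catboost.__version__)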

In this tutorial, I demonstrate how to train a classification model with CatBoost using a simple dataset generated with Scikit-learn. You can find the full code for this tutorial on GitHub.

Import libraries

In the code below, I imported basic libraries such as NumPy and pandas, as well as some modules from CatBoost.

import numpy as np
import pandas as pd
from catboost import CatBoostClassifier, Pool, cv

Creating a data set

In the code below, I use Scikit-learn’s make_classification function to create a dataset.

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=50000,
                           n_features=20,
                           n_informative=15,
                           n_redundant=5,
                           n_clusters_per_class=5,
                           class_sep=0.7,
                           flip_y=0.03,
                           n_classes=2)
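
As a quick sanity check (an extra step, not in the original walkthrough), we can inspect the shape of the generated data and the class balance.

print(X.shape)         # (50000, 20)
print(np.bincount(y))  # number of samples in class 0 and class 1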

Next, we can split the dataset into training and testing sets using the code below.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

Model training

CatBoost has a very simple, Scikit-learn-style API for training models. We can instantiate a CatBoostClassifier object and fit it to the training data, as shown in the code below. Note that the iterations argument corresponds to the number of boosting iterations (or the number of trees).

model = CatBoostClassifier(iterations=100,
                           depth=2,
                           learning_rate=1,
                           loss_function='Logloss',
                           verbose=True)
model.fit(X_train, y_train)

When the verbose argument is set to True, training writes the training loss at each iteration to standard output.

Note how the total elapsed time and the remaining time are also written to standard output.
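
If the per-iteration log is too noisy, the verbose argument also accepts an integer, in which case metrics are printed only every that many iterations. A small variation on the call above, not part of the original walkthrough:

model.fit(X_train, y_train, verbose=10)  # print training metrics every 10 iterations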

Calculating feature statistics

We can also calculate detailed feature statistics from the training data set using the function calc_feature_statistics, as shown below.

model.calc_feature_statistics(X_train, y_train, feature=1, plot=True)

Note that the feature argument indicates which feature’s statistics are being calculated. This argument can be an integer index, a string specifying the feature name, or a list of strings or integers specifying multiple features.

The plot above helps us understand how the model behaves when predicting on objects whose feature values fall into different bins. These bins correspond to different value ranges for the specified feature and are used to build the decision trees in the CatBoost model.
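
For example, to plot statistics for several features at once, the feature argument can be given a list of indices, as noted above. A minimal sketch:

model.calc_feature_statistics(X_train, y_train, feature=[0, 1], plot=True)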

Getting the importance of features

We can also calculate the importance of features with a trained CatBoost model. To do this, we must first take the training data and convert it to a preprocessed CatBoost dataset using the Pool module. After that, we can simply use the get_feature_importance function, as shown below.

train_data = Pool(data=X_train, label=y_train)

model.get_feature_importance(train_data)

The function returns a NumPy array of feature importances, as shown below.

array([3.01594829, 7.75329451, 5.20064972, 4.43992429, 4.30243392,
8.32023227, 9.08359773, 2.73403973, 7.11605088, 2.31413571,
7.76344028, 1.95471762, 6.66177812, 7.78073865, 1.63636954,
4.66399329, 4.33191962, 1.836554 , 1.96756493, 7.12261691])
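
To make this array easier to interpret, we can pair each importance score with its feature index and sort the result. This is just a convenience sketch using NumPy, not part of the CatBoost API:

importances = model.get_feature_importance(train_data)

# print features from most to least important
for idx in np.argsort(importances)[::-1]:
    print(f"feature {idx}: {importances[idx]:.2f}")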

Cross-validation

In order to perform cross-validation with CatBoost, we need to complete the following steps:

  1. Create a preprocessed dataset with the Pool module.
  2. Create a dictionary of CatBoost model parameters.
  3. Generate cross-validation scores for the model with the cv function.

Note that the Pool module also has optional arguments for preprocessing text and categorical features, but since all of the features in our dataset are numeric, I did not need to use any of these arguments in this example.
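
For reference, here is a sketch of what those arguments might look like. The DataFrame and column names below (df, color, city, description, target) are purely hypothetical and do not apply to the numeric dataset used in this tutorial:

# Hypothetical example: 'color' and 'city' are categorical columns and
# 'description' is a free-text column in a pandas DataFrame named df
pool_with_cats = Pool(data=df.drop(columns=['target']),
                      label=df['target'],
                      cat_features=['color', 'city'],
                      text_features=['description'])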

cv_dataset = Pool(data=X_train,
                  label=y_train)

params = {"iterations": 100,
          "depth": 2,
          "loss_function": "Logloss",
          "verbose": False}

scores = cv(cv_dataset,
            params,
            fold_count=5,
            plot=True)

Executing the code above with the plot argument set to True gives us a nice widget, shown below, in the Jupyter notebook.

On the left, we see the cross-validation results for each fold, and on the right, we see a graph with the model’s mean learning curve as well as the standard deviation. The x-axis shows the number of iterations, and the y-axis corresponds to the validation loss values.
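
In addition to the widget, the cv function returns the per-iteration results as a pandas DataFrame, so they can also be inspected programmatically. The column names below follow the test-<loss>-mean pattern that I believe CatBoost uses, so treat them as an assumption:

print(scores.columns)
print(scores[['iterations', 'test-Logloss-mean', 'test-Logloss-std']].tail())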

Grid search

We can also perform a grid search where the library compares the performance of different hyperparameter combinations to find the best model, as shown below.

model = CatBoostClassifier(loss_function='Logloss')

grid = {'learning_rate': [0.03, 0.1],
        'depth': [4, 6, 10]}

grid_search_result = model.grid_search(grid,
                                       X=X_train,
                                       y=y_train,
                                       cv=3,
                                       plot=True)

Executing the code above will produce the widget shown in the GIF below.

We can access the best parameters through the params key of the grid search result, as shown below.

print(grid_search_result['params'])

The print statement above gives us the best parameters, shown below.

{'depth': 10, 'learning_rate': 0.1}
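
One way to use this result is to train a final model with the best parameters found. A minimal sketch, assuming we simply unpack the returned params dictionary into a new classifier:

best_model = CatBoostClassifier(loss_function='Logloss',
                                iterations=100,
                                **grid_search_result['params'])
best_model.fit(X_train, y_train, verbose=False)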

Model testing

We can generate predictions from a trained CatBoost model by calling the predict function.

model.predict(X_test)

Running the predict function above produces a NumPy array of class labels, as shown below.

array([0, 1, 0, ..., 0, 1, 1])

If we want to evaluate the performance of the model on the test data, we can use the score function, as shown below.

model.score(X_test, y_test)

Running the code above produces the following output.

0.906

Based on the above result, we can see that the model achieved a test accuracy of 90.6 percent.
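
If class probabilities are more useful than hard labels, CatBoostClassifier also provides a predict_proba function, similar to Scikit-learn estimators:

model.predict_proba(X_test)[:5]  # predicted class probabilities for the first five test samples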

Saving a model

You can also save a CatBoost model in various formats, such as PMML, as shown in the code below.

model.save_model(
    "catboost.pmml",
    format="pmml",
    export_parameters={
        'pmml_copyright': 'my copyright (c)',
        'pmml_description': 'test model for BinaryClassification',
        'pmml_model_version': '1'
    }
)
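
PMML is only one of several export formats. A common pattern, sketched below, is to save the model in CatBoost's native binary format and load it back later with load_model; the file name is arbitrary:

model.save_model("catboost_model.cbm")

loaded_model = CatBoostClassifier()
loaded_model.load_model("catboost_model.cbm")
loaded_model.score(X_test, y_test)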

CatBoost is a newer gradient boosting framework with additional features that make it worth considering as a possible alternative to XGBoost. It may not be as fast, but it has useful features and, because of its different implementation of gradient boosting, it can outperform XGBoost on certain tasks.

As usual, you can find the code used in this article at GitHub.

  1. L. Prokhorenkova, G. Gusev, et al., CatBoost: unbiased boosting with categorical features, (2019), arXiv.org.
