Faster Than XGBoost: Using CatBoost with C++

Integrating machine learning models into production environments often requires a balance between performance, compatibility, and ease of deployment. While Python is the go-to language for developing and training machine learning models, deploying these models in a C++ environment can offer performance benefits and better integration with existing codebases.

In this tutorial, we'll walk through the process of training a CatBoost model for binary classification in Python, exporting it as standalone C++ code, and integrating it into a C++ application. This end-to-end guide is designed for engineers looking to leverage CatBoost's powerful algorithms within a native C++ environment.

Introduction to CatBoost

CatBoost is an open-source gradient boosting library developed by Yandex. It handles categorical features natively and often produces strong results with little parameter tuning. Because it accepts both numerical and categorical features without extensive preprocessing, it is a good fit for real-world datasets.

Gradient Boosting Models in C++

One of CatBoost's standout features is the ability to export trained models as standalone C++ code. This lets engineers compile a model directly into a C++ application and run inference without calling into Python at prediction time.

Prerequisites

Before we begin, ensure you have the following installed:

Python 3.x
CatBoost library: Install using pip install catboost
scikit-learn: Install using pip install scikit-learn
NumPy: Install using pip install numpy
C++ Compiler: A compiler supporting C++11 or higher (e.g., GCC, Clang)
CityHash Library: Required dependency for the exported C++ code

Overview of the Workflow

We'll follow these main steps:

Train a CatBoost binary classification model in Python.
Export the trained model to standalone C++ code.
Compile the C++ code, including necessary dependencies like the CityHash library.
Invoke the model from a C++ application, applying the sigmoid function to obtain probabilities.

By the end of this tutorial, you'll have a working C++ application that utilizes a CatBoost model trained in Python.

Step 1: Training a CatBoost Binary Classification Model in Python

We'll use the well-known Iris dataset from scikit-learn, adjusting it for binary classification to comply with CatBoost's C++ export limitations.

1.1 Preparing the Dataset

The Iris dataset contains three classes of flowers. We'll filter the dataset to include only two classes for binary classification.

# train_catboost_model.py

from catboost import CatBoostClassifier, Pool
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Filter to include only two classes (e.g., class 0 and class 1)
binary_indices = np.where(y != 2)
X = X[binary_indices]
y = y[binary_indices]

1.2 Splitting the Data

We split the data into training and validation sets to evaluate the model's performance.

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

1.3 Creating CatBoost Pool Objects

CatBoost uses Pool objects to handle datasets efficiently.

# Create Pool objects for CatBoost
train_pool = Pool(X_train, y_train)
val_pool = Pool(X_val, y_val)

1.4 Training the CatBoost Model

We initialize and train the CatBoost classifier with appropriate parameters.

# Initialize the CatBoostClassifier
model = CatBoostClassifier(
    iterations=100,
    learning_rate=0.1,
    depth=6,
    loss_function='Logloss',  # For binary classification
    verbose=False
)

# Train the model
model.fit(train_pool, eval_set=val_pool)

1.5 Evaluating the Model (Optional)

It's good practice to evaluate the model's performance before exporting it.

# Make predictions on the validation set
y_pred_proba = model.predict_proba(X_val)[:, 1]

# Calculate accuracy or other metrics
from sklearn.metrics import accuracy_score, roc_auc_score

# Convert probabilities to binary predictions
y_pred = (y_pred_proba > 0.5).astype(int)

# Compute accuracy
accuracy = accuracy_score(y_val, y_pred)
print(f"Validation Accuracy: {accuracy:.4f}")

# Compute ROC AUC score
roc_auc = roc_auc_score(y_val, y_pred_proba)
print(f"Validation ROC AUC: {roc_auc:.4f}")
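To make the thresholding and accuracy steps concrete, here is a standard-library-only sketch of what they compute. The helper names and the four sample values are made up for illustration and do not come from the Iris model; scikit-learn's accuracy_score remains the tool used in the tutorial.

```python
def threshold_predictions(probabilities, threshold=0.5):
    """Map each probability to 1 if it is strictly above the threshold, else 0."""
    return [1 if p > threshold else 0 for p in probabilities]


def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up illustrative values, not output from the trained model.
example_probabilities = [0.92, 0.08, 0.61, 0.47]
example_labels = [1, 0, 0, 0]

example_predictions = threshold_predictions(example_probabilities)
example_accuracy = accuracy(example_labels, example_predictions)
print(example_predictions, example_accuracy)
```

The third sample is predicted as class 1 but labeled 0, so three of the four predictions match.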

1.6 Saving the Model

Finally, we save the trained model for later use and export.

# Save the model in CatBoost binary format (optional)
model.save_model("catboost_model.cbm")

Step 2: Exporting the Model to Standalone C++ Code

CatBoost allows exporting the trained model to C++ code using the save_model method with the format="cpp" parameter.

# Export the model to standalone C++ code
model.save_model("catboost_model.cpp", format="cpp")

This command generates a catboost_model.cpp file containing:

Model Data: All necessary data for the model to make predictions.
ApplyCatboostModel Function: A function that applies the model to input features.

Step 3: Compiling the C++ Code with Dependencies

The exported C++ code depends on the CityHash library. We'll compile the code and include all necessary dependencies.

3.1 Installing the CityHash Library

The CityHash library is used for hashing categorical features. Even if your model doesn't use categorical features, the exported code may still require it.

# Clone the CityHash repository
git clone https://github.com/google/cityhash.git
cd cityhash

# Checkout the required revision (replace with the correct revision if specified)
git checkout 9b5bd90

# Build and install CityHash
./configure
make
sudo make install

# Return to the project directory
cd ..

3.2 Creating the C++ Application

Create a file named main.cpp for your C++ application:

// main.cpp

#include <iostream>
#include <vector>
#include <cmath>  // For std::exp

// Include the generated model code
#include "catboost_model.cpp"

// Sigmoid function to convert raw scores to probabilities
double Sigmoid(double x) {
    return 1.0 / (1.0 + std::exp(-x));
}

int main() {
    // Example input features (replace with your own data)
    std::vector<float> features = {5.1f, 3.5f, 1.4f, 0.2f};

    // Apply the model to get the raw prediction (log-odds)
    double raw_prediction = ApplyCatboostModel(features);

    // Convert raw prediction to probability
    double probability = Sigmoid(raw_prediction);

    // Output the results
    std::cout << "Raw Prediction (Log-Odds): " << raw_prediction << std::endl;
    std::cout << "Predicted Probability: " << probability << std::endl;

    return 0;
}

3.3 Compiling the Application

Use the following command to compile your application:

g++ -std=c++11 main.cpp -o catboost_app -lcityhash -lm

-std=c++11: Specifies the C++ standard.
-lcityhash: Links the CityHash library.
-lm: Links the math library that provides std::exp. The g++ driver already links it by default, so the flag is redundant but harmless.
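If the export and the build are driven from the same Python script, the compile step can be wrapped in a small helper. The sketch below is one possible way to do that: the function names are hypothetical, and the file names and flags are simply the ones used in the command above.

```python
import shutil
import subprocess


def build_compile_command(source="main.cpp", output="catboost_app",
                          standard="c++11", libraries=("cityhash", "m")):
    """Assemble the g++ argument list used in this tutorial."""
    command = ["g++", f"-std={standard}", source, "-o", output]
    command.extend(f"-l{name}" for name in libraries)
    return command


def compile_app(command):
    """Run the compiler; raises CalledProcessError if the build fails."""
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]} was not found on PATH")
    subprocess.run(command, check=True)


compile_command = build_compile_command()
print(" ".join(compile_command))
# Call compile_app(compile_command) once main.cpp and catboost_model.cpp exist.
```

Keeping the command as a list rather than a single string avoids shell quoting issues if a path contains spaces.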

Step 4: Invoking the Model from a C++ Application

Run the compiled application:

./catboost_app

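The program prints two lines in the following form, with the placeholders replaced by numbers:

Raw Prediction (Log-Odds): <raw score>
Predicted Probability: <probability>

Note: The actual values depend on the trained model and the input features.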
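When the application is called from a test or a script, its two output lines can be read back into numbers. The following is a minimal sketch that assumes the exact labels printed by main.cpp above; the function names are hypothetical and the sample text contains made-up values.

```python
import subprocess


def parse_app_output(text):
    """Extract the raw score and probability from the app's stdout text."""
    values = {}
    for line in text.splitlines():
        label, separator, number = line.rpartition(":")
        if not separator:
            continue  # skip lines that do not look like "label: value"
        if label.startswith("Raw Prediction"):
            values["raw"] = float(number)
        elif label.startswith("Predicted Probability"):
            values["probability"] = float(number)
    return values


def run_app(executable="./catboost_app"):
    """Run the compiled application and parse what it prints."""
    completed = subprocess.run(
        [executable], capture_output=True, text=True, check=True
    )
    return parse_app_output(completed.stdout)


# Made-up sample text in the same shape as the program's output.
sample_output = (
    "Raw Prediction (Log-Odds): -2.5\n"
    "Predicted Probability: 0.0758582\n"
)
parsed = parse_app_output(sample_output)
print(parsed)
```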

Understanding the Output

Raw Prediction (Log-Odds): The unprocessed output from the model, representing the log-odds of the positive class.
Predicted Probability: The probability of the input belonging to the positive class, obtained by applying the sigmoid function to the raw prediction.

Explanation

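In binary classification, the model outputs a raw score (log-odds) that can be any real number. To convert this score x to a probability between 0 and 1, we apply the sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

This conversion makes the predictions interpretable as probabilities: a raw score of 0 maps to 0.5, large positive scores approach 1, and large negative scores approach 0.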
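For checking C++ results by hand, the same conversion can be sketched in Python. One difference is worth noting: Python's math.exp raises OverflowError for large arguments instead of returning infinity, so this sketch branches on the sign of the input. The logit helper is an addition for illustration and is simply the inverse of the sigmoid.

```python
import math


def sigmoid(x):
    """Convert a raw score (log-odds) to a probability without overflowing."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, exp(x) is at most 1, so it cannot overflow.
    z = math.exp(x)
    return z / (1.0 + z)


def logit(p):
    """Inverse of sigmoid: convert a probability in (0, 1) back to log-odds."""
    return math.log(p / (1.0 - p))


print(sigmoid(0.0))  # 0.5
print(sigmoid(2.0), sigmoid(-2.0))
```

The two values on the last line sum to 1, which reflects the symmetry of the sigmoid around a raw score of 0.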

Handling Categorical Features

If your model includes categorical features, the ApplyCatboostModel function signature changes:

double ApplyCatboostModel(
    const std::vector<float>& floatFeatures,
    const std::vector<std::string>& catFeatures
);

Order Matters: Pass features in the same order as during training.
CityHash Version: Ensure the CityHash library version matches the one used by CatBoost to avoid hash mismatches.
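One way to keep the feature order consistent is to store the training-time order once and build both input lists from it. The sketch below assumes that numerical and categorical features are passed as two separate lists, each in training order, as the signature above suggests. The feature names, the record, and the helper are all hypothetical.

```python
def split_features(record, float_names, cat_names):
    """Build the float and categorical lists in the stored training order."""
    float_features = [float(record[name]) for name in float_names]
    cat_features = [str(record[name]) for name in cat_names]
    return float_features, cat_features


# Hypothetical feature order, assumed to be saved at training time.
FLOAT_FEATURE_ORDER = ["age", "income"]
CAT_FEATURE_ORDER = ["city", "device"]

# The keys of the incoming record can arrive in any order.
record = {"device": "mobile", "income": 52000, "city": "Berlin", "age": 37}

floats, cats = split_features(record, FLOAT_FEATURE_ORDER, CAT_FEATURE_ORDER)
print(floats, cats)
```

A missing key raises KeyError here, which is usually preferable to silently passing features in the wrong position.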

Compiler Requirements

C++11 or Higher: Ensure your compiler supports the required C++ standard.
Math Library: Link the math library using -lm if you use functions like std::exp.

Common Issues

Different Predictions Between Environments:
- Check data preprocessing and ensure consistency.
- Confirm that the CityHash version matches.
Compilation Errors:
- Verify that all dependencies are correctly linked.
- Ensure that include paths are correctly specified.
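To track down differing predictions, it helps to score the same rows in both environments and compare the raw scores within a tolerance. In the sketch below, the helper name, the tolerance, and the sample scores are assumptions for illustration. On the Python side, raw scores can be obtained with model.predict(X, prediction_type="RawFormulaVal").

```python
import math


def find_mismatches(python_scores, cpp_scores, abs_tol=1e-5):
    """Return the indices of rows whose raw scores differ by more than abs_tol."""
    if len(python_scores) != len(cpp_scores):
        raise ValueError("both environments must score the same rows")
    return [
        index
        for index, (expected, actual) in enumerate(zip(python_scores, cpp_scores))
        if not math.isclose(expected, actual, rel_tol=0.0, abs_tol=abs_tol)
    ]


# Made-up scores: row 0 differs only by rounding, row 2 genuinely disagrees.
python_raw = [-2.5, 0.75, 3.1]
cpp_raw = [-2.5000004, 0.75, 3.4]

mismatched_rows = find_mismatches(python_raw, cpp_raw)
print(mismatched_rows)
```

The tolerance should be chosen with the output format in mind: std::cout prints about six significant digits by default, so values parsed from the application's text output are already rounded.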

Limitations

Multiclass Classification Not Supported: Currently, exporting multiclass models to standalone C++ code is not supported. Stick to binary classification or regression models.
Performance Considerations: The exported code may not be as optimized as CatBoost's native implementations.

Conclusion

By following this tutorial, you've successfully:

Trained a CatBoost binary classification model in Python.
Exported the model to standalone C++ code.
Compiled the C++ code with necessary dependencies.
Invoked the model from a C++ application, obtaining probabilities.

Integrating machine learning models into C++ applications allows for high-performance inference and better integration with existing systems. While there are some limitations, CatBoost's ability to export models to C++ code provides a valuable tool for engineers.