SNOWFLAKE CERTIFIED SOLUTION
Train an XGBoost Model with GPUs using Snowflake Notebooks
!pip install plotnine
# Import python packages
import streamlit as st
import pandas as pd
import sys
import seaborn as sns
import matplotlib.pyplot as plt
# xgboost libraries
import xgboost
from xgboost import XGBRegressor
# Snowpark libraries & session
from snowflake.snowpark import DataFrame
from snowflake.snowpark.functions import col
from snowflake.snowpark.context import get_active_session
session = get_active_session()
session
import torch
# Check for available GPUs
if torch.cuda.is_available():
    # Get the number of GPUs
    num_gpus = torch.cuda.device_count()
    print(f'{num_gpus} GPU Device(s) Found')
    # Print the list of GPUs
    for i in range(num_gpus):
        print("Name:", torch.cuda.get_device_name(i), " Index:", i)
else:
    print("No GPU available")
# Load data from a Snowflake table into a Snowpark DataFrame
table = "XGB_GPU_DATABASE.XGB_GPU_SCHEMA.VEHICLES_TABLE"
df = session.table(table)
df.count(), len(df.columns)
# Note the maximum price - a $3B car must be quite a spectacle, but we don't want to use that for our model
df.select('PRICE').describe()
# Let's filter down to cars priced $100k or less - note that this removes only ~1% of our data
df = df.filter(col('PRICE') < 100000)
df.select('PRICE').describe()
# View the data schema
list(df.schema)
# Drop columns that won't be helpful for modeling
drop_cols = ["ID", "URL", "REGION_URL", "IMAGE_URL", "DESCRIPTION", "VIN", "POSTING_DATE", "COUNTY"]
df = df.drop(drop_cols)
# Fill NULL values with "NA" for string columns and 0 for numerical columns
from snowflake.snowpark.types import StringType
string_cols = df.select([field.name for field in df.schema if field.datatype == StringType()]).columns
non_string_cols = df.drop(string_cols).columns
df = df.fillna("NA", subset=string_cols)
df = df.fillna(0, subset=non_string_cols)
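With the data cleaned up, the first model can be trained with open-source XGBoost on a single GPU. The sketch below is a minimal illustration of that step, not the exact notebook code: the ordinal encoding of string columns and the hyperparameters are assumptions, and it reuses the string_cols list computed above.
# Minimal single-GPU training sketch (assumes XGBoost >= 2.0 for device="cuda")
pdf = df.to_pandas()
# Ordinal-encode string columns so XGBoost can consume them (illustrative choice)
for c in string_cols:
    pdf[c] = pdf[c].astype("category").cat.codes
X = pdf.drop(columns=["PRICE"])
y = pdf["PRICE"]
# tree_method="hist" with device="cuda" runs histogram-based training on one GPU
model = XGBRegressor(tree_method="hist", device="cuda", n_estimators=500)
model.fit(X, y)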
Overview
In this solution, we'll explore how to harness the power of containers to build models at scale in Snowflake ML, using GPUs from Snowflake Notebooks on the Container Runtime. Specifically, we'll train an XGBoost model and walk through a workflow that involves inspecting GPU resources, loading data from a Snowflake table, and preparing that data for modeling. In the notebook, we train two XGBoost models: one with open-source XGBoost on a single GPU, and one distributed across the full GPU cluster. Finally, we log the model to Snowflake's Model Registry and test the built-in inference and explainability capabilities on the model object.
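For the distributed variant, recent versions of snowflake-ml-python expose a distributor API for XGBoost on the Container Runtime. The sketch below shows roughly what that looks like; class names and arguments may vary by library version, and input_cols is a hypothetical feature list.
# Sketch: distributed XGBoost training across the notebook's GPU cluster
from snowflake.ml.data.data_connector import DataConnector
from snowflake.ml.modeling.distributors.xgboost import XGBEstimator, XGBScalingConfig

data_connector = DataConnector.from_dataframe(df)  # stream the Snowpark DataFrame to trainers
scaling_config = XGBScalingConfig(use_gpu=True)    # fan training out over all available GPUs
estimator = XGBEstimator(
    n_estimators=500,                              # illustrative hyperparameter
    objective="reg:squarederror",
    scaling_config=scaling_config,
)
input_cols = [c for c in df.columns if c != "PRICE"]  # hypothetical feature list
xgb_model = estimator.fit(data_connector, input_cols=input_cols, label_col="PRICE")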
Snowflake Notebooks let you quickly tap into the GPU compute power you need to build ML models scalably with any open-source Python framework of your choice. This solution showcases how to:
- Use Snowflake Notebooks with GPUs to speed up model training jobs with distributed processing
- Build with a set of pre-installed ML packages or pip install any of your favorite open-source packages
- Run ML workloads at scale without any data movement
Fig 1: Reference Architecture for Train an XGBoost Model with GPUs using Snowflake Notebooks
About the Architecture
- Data in Snowflake Tables: Source data is stored in Snowflake tables. In the notebook, we perform feature engineering on this data using Snowpark Python and use it to train our models.
- Snowflake Notebook on Container Runtime: The solution leverages the Container Runtime with GPUs to train the models efficiently. We show how you can use a single GPU or a full GPU cluster to distribute training.
- Snowflake Model Registry: Once our models are trained, we log them to the Model Registry and perform inference with the logged models, as sketched below.
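A minimal sketch of logging and using a trained model with the Model Registry follows; the model name, version, and sample data are illustrative assumptions, and explainability must be enabled when the model is logged.
# Sketch: log the trained model, then run inference and explainability on it
from snowflake.ml.registry import Registry

reg = Registry(session=session)
mv = reg.log_model(
    model,                                   # trained XGBoost model from above
    model_name="vehicle_price_model",        # illustrative name
    version_name="v1",
    sample_input_data=X.head(),              # used to infer the model signature
    options={"enable_explainability": True}, # needed for the explain function
)
predictions = mv.run(X.head(100), function_name="predict")
explanations = mv.run(X.head(100), function_name="explain")  # SHAP-based explanations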
This solution was created by an in-house Snowflake expert and has been verified to work with current Snowflake instances as of the date of publication.
Solution not working as expected? Contact our team for assistance.