@JGalego
Last active June 6, 2025 00:54
A simple, yet mostly vibe-coded demo of Amazon SageMaker with Scikit-Learn using the California Housing dataset 🏘️💸
{
"cells": [
{
"cell_type": "markdown",
"id": "dfbf3a79-1f54-450f-95eb-1094146b71a9",
"metadata": {},
"source": [
"# California Housing\n",
"\n",
"A simple, yet mostly vibe-coded demo of [Amazon SageMaker with Scikit-Learn](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html) using the [California Housing](https://inria.github.io/scikit-learn-mooc/python_scripts/datasets_california_housing.html) dataset.\n",
"\n",
"> *\"Ever tried. Ever failed. No matter. Try again. Fail again. Fail better.\"* - Samuel Beckett\n",
"\n",
"> *\"The best time to buy a house is always 5 years ago.\"* - Ray Brown\n",
"\n",
"<img src=\"https://raw.githubusercontent.com/amansingh9097/California-housing-price-prediction/master/california-house-price-trends.PNG\" width=\"50%\"/>"
]
},
{
"cell_type": "markdown",
"id": "c2ba7296-1b2b-41aa-b92b-510850e17611",
"metadata": {},
"source": [
"## ✅ Prerequisites\n",
"\n",
"Make sure the SageMaker SDK is installed"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2135a82b-2798-4fc9-971f-cd15b02717c5",
"metadata": {},
"outputs": [],
"source": [
"!pip install -q sagemaker==2.246.0"
]
},
{
"cell_type": "markdown",
"id": "7d6381cf-1257-4ddb-a8c6-e883b5d49300",
"metadata": {},
"source": [
"and let's start by importing some Python libraries and defining some helper functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9420e87a-7782-491b-9b0b-c272d1295883",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import logging\n",
"import warnings\n",
"\n",
"# Suppress all warnings\n",
"warnings.filterwarnings(\"ignore\")\n",
"\n",
"# SageMaker keeps logging config messages, so we'll silence anything below WARNING\n",
"logging.getLogger(\"sagemaker.config\").setLevel(logging.WARNING)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9fc1b8c8-fde4-4286-a254-cc6fad30d537",
"metadata": {},
"outputs": [],
"source": [
"import boto3 # AWS SDK for Python\n",
"import sagemaker # Amazon SageMaker SDK for Python\n",
"\n",
"import numpy as np # Matrix multiplication and numerical processing\n",
"import pandas as pd # Munging tabular data\n",
"import matplotlib.pyplot as plt # Charts and visualizations\n",
"\n",
"from IPython import get_ipython\n",
"\n",
"from IPython.core.magic import register_cell_magic\n",
"\n",
"from IPython.display import ( # Display tools in IPython\n",
"    display,\n",
"    Markdown\n",
")\n",
"\n",
"\n",
"@register_cell_magic\n",
"def skip(line, cell):\n",
"    \"\"\"Cell magic that skips the cell unconditionally.\"\"\"\n",
"    return\n",
"\n",
"\n",
"@register_cell_magic\n",
"def skip_if(line, cell):\n",
"    \"\"\"Cell magic that skips the cell when the expression on the magic line is truthy.\"\"\"\n",
"    if eval(line):\n",
"        return\n",
"    get_ipython().run_cell(cell)\n",
"\n",
"\n",
"def printmd(text):\n",
"    \"\"\"Display a Markdown-formatted string.\"\"\"\n",
"    display(Markdown(text))\n",
"\n",
"\n",
"# Debug\n",
"printmd(f\"Numpy: `{np.__version__}`\")\n",
"printmd(f\"Pandas: `{pd.__version__}`\")\n",
"printmd(f\"Boto3: `{boto3.__version__}`\")\n",
"printmd(f\"SageMaker: `{sagemaker.__version__}`\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "26b10bd9-287d-4822-b93d-e94ccc2df477",
"metadata": {},
"outputs": [],
"source": [
"# Initialize SageMaker client\n",
"client = boto3.client(\"sagemaker\")\n",
"\n",
"# Manages interactions with the Amazon SageMaker APIs\n",
"session = sagemaker.session.Session()\n",
"\n",
"# The AWS Region that we're using\n",
"region = session.boto_region_name\n",
"\n",
"# The IAM execution role assumed by SageMaker\n",
"role = sagemaker.get_execution_role()\n",
"\n",
"# The S3 bucket to be used by this session\n",
"bucket = session.default_bucket()\n",
"\n",
"# Where we'll store our data and model artifacts\n",
"prefix = \"california-housing\"\n",
"\n",
"printmd(f\"Region 🌎: `{region}`\")\n",
"printmd(f\"Bucket 🪣: `{bucket}`\")\n",
"printmd(f\"Role 👷: `{role}`\")"
]
},
{
"cell_type": "markdown",
"id": "6d9b3118-e03c-4048-ab79-8fca5e6d65c7",
"metadata": {},
"source": [
"## ⚙️ Prepare"
]
},
{
"cell_type": "markdown",
"id": "8007897f-e29f-426f-9b53-acc502d5c561",
"metadata": {},
"source": [
"Next, we'll load our data"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6eeb4aee-0c39-4f91-8360-5998908d5b69",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import fetch_california_housing\n",
"\n",
"california = fetch_california_housing()\n",
"X = pd.DataFrame(california.data, columns=california.feature_names)\n",
"y = pd.Series(california.target, name='MedHouseVal')\n",
"\n",
"X.hist(figsize=(12, 10))"
]
},
{
"cell_type": "markdown",
"id": "b5623259-9126-46e0-90c1-675187460dc4",
"metadata": {},
"source": [
"Split it into training and test sets"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2f3cc470-f9de-44b8-a58b-d9becd29603a",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)"
]
},
{
"cell_type": "markdown",
"id": "5e5592f5-afd3-4f05-a4a9-0543fe02c675",
"metadata": {},
"source": [
"Save to CSV for SageMaker"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4c272b34-a923-41ac-a966-376797d5ee83",
"metadata": {},
"outputs": [],
"source": [
"train_df = pd.concat([X_train, y_train], axis=1)\n",
"test_df = pd.concat([X_test, y_test], axis=1)\n",
"\n",
"os.makedirs('data', exist_ok=True)\n",
"train_df.to_csv('data/train.csv', index=False)\n",
"test_df.to_csv('data/test.csv', index=False)\n",
"!ls -lisah data"
]
},
{
"cell_type": "markdown",
"id": "b5f06c7e-849d-4ed8-a40a-3172c9bc97df",
"metadata": {},
"source": [
"and upload the final result to the S3 bucket"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "92745d07-88bb-4e90-b4ac-6dfa05c09e7b",
"metadata": {},
"outputs": [],
"source": [
"train_input = session.upload_data('data/train.csv', bucket=bucket, key_prefix=f'{prefix}/train')\n",
"test_input = session.upload_data('data/test.csv', bucket=bucket, key_prefix=f'{prefix}/test')"
]
},
{
"cell_type": "markdown",
"id": "ca2d5000-f1c9-4c62-9777-a22a647241cb",
"metadata": {},
"source": [
"## 🏗️ Build\n",
"\n",
"Now it's time to define our training workflow"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fa12ab5b-cb94-4df1-a1c1-ef93184809e2",
"metadata": {},
"outputs": [],
"source": [
"%%writefile train_and_deploy.py\n",
"\n",
"import os\n",
"import argparse\n",
"import joblib\n",
"\n",
"import pandas as pd\n",
"\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.model_selection import GridSearchCV\n",
"\n",
"\n",
"def model_fn(model_dir):\n",
"    \"\"\"Load the model from the model_dir.\"\"\"\n",
"    model_path = os.path.join(model_dir, \"model.joblib\")\n",
"    return joblib.load(model_path)\n",
"\n",
"\n",
"if __name__ == '__main__':\n",
"    parser = argparse.ArgumentParser()\n",
"    parser.add_argument('--output-data-dir', type=str, default='/opt/ml/output/data')\n",
"    parser.add_argument('--model-dir', type=str, default='/opt/ml/model')\n",
"    parser.add_argument('--train', type=str, default='/opt/ml/input/data/train')\n",
"    args = parser.parse_args()\n",
"\n",
"    # Load data\n",
"    df = pd.read_csv(os.path.join(args.train, 'train.csv'))\n",
"    X = df.drop('MedHouseVal', axis=1)\n",
"    y = df['MedHouseVal']\n",
"\n",
"    # Train model with GridSearchCV (a single-candidate grid; add values to actually search)\n",
"    param_grid = {\n",
"        'n_estimators': [100],\n",
"        'max_depth': [None],\n",
"        'max_features': ['sqrt']\n",
"    }\n",
"    model = GridSearchCV(\n",
"        RandomForestRegressor(random_state=42),\n",
"        param_grid,\n",
"        cv=5,\n",
"        scoring='neg_mean_squared_error'\n",
"    )\n",
"    model.fit(X, y)\n",
"\n",
"    # Save the best estimator (refit on the full training set) to the model directory\n",
"    joblib.dump(model.best_estimator_, os.path.join(args.model_dir, 'model.joblib'))"
]
},
{
"cell_type": "markdown",
"id": "3b88a176-b1cd-4eb4-859b-3c8e70382d8e",
"metadata": {},
"source": [
"In the training script above, we've defined a special function called `model_fn`, which loads the saved model from the model directory. The SageMaker Scikit-Learn serving container calls this function when the model is deployed."
]
},
{
"cell_type": "markdown",
"id": "330367e8-df36-4cbb-9389-8be6f570e0a8",
"metadata": {},
"source": [
"**Optional:** feel free to run the script locally to check if it's working"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8134d43e-5a16-4fef-b268-2c5516b137ff",
"metadata": {},
"outputs": [],
"source": [
"!python train_and_deploy.py --train data --model-dir ."
]
},
{
"cell_type": "markdown",
"id": "1ee267ae-d91c-485b-b06c-2c4a50399c1c",
"metadata": {},
"source": [
"Let's initialize our estimator (think of it as a high-level interface for SageMaker training)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a5fe7718-73dc-4062-8b6a-a615b57fc5db",
"metadata": {},
"outputs": [],
"source": [
"from sagemaker.sklearn.estimator import SKLearn\n",
"\n",
"sklearn_estimator = SKLearn(\n",
"    entry_point='train_and_deploy.py',\n",
"    role=role,\n",
"    instance_type='ml.m5.large',\n",
"    framework_version='1.2-1',\n",
"    py_version='py3',\n",
"    sagemaker_session=session,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "aa3f4393-c9bc-4a24-bbd3-9d33fe21b5ca",
"metadata": {},
"source": [
"and kick off our training run"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "106cb4ad-1d91-4348-98a9-7d30b8b2c1f3",
"metadata": {},
"outputs": [],
"source": [
"sklearn_estimator.fit({'train': train_input})"
]
},
{
"cell_type": "markdown",
"id": "826d8560-2aef-4807-8b21-f5c68a65b841",
"metadata": {},
"source": [
"**Important:** use the next cells if you want to test this *locally*\n",
"\n",
"**Requirements:** in order to use `local` mode, you'll need [Docker Compose](https://docs.docker.com/compose/install/) 🐳 and some AWS credentials to pull images from the [public ECR repositories managed by SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/ecr-eu-west-1.html)\n",
"\n",
"> ✨ For more information, see [Configuring Local Mode Execution in Sagemaker Studio](https://community.aws/content/2kWVSWbVdUpWHupaVIMLi7Vq12x/configuring-local-mode-execution-in-sagemaker-studio)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "05190a46-685a-41ba-89f0-cc9884ca9ebf",
"metadata": {},
"outputs": [],
"source": [
"# Set to False to run in local mode\n",
"LOCAL_MODE_DISABLED = True"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cb00ef1c-3b3b-49ce-bf63-db5716a290b1",
"metadata": {},
"outputs": [],
"source": [
"%%skip_if LOCAL_MODE_DISABLED\n",
"\n",
"from getpass import getpass\n",
"\n",
"os.environ['AWS_DEFAULT_REGION'] = input(\"Region: \") or \"eu-west-1\"\n",
"os.environ['AWS_ACCESS_KEY_ID'] = getpass(\"Access key: \")\n",
"os.environ['AWS_SECRET_ACCESS_KEY'] = getpass(\"Secret access key: \")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5b2b55f3-cd7b-480e-b387-731f03ff2bb3",
"metadata": {},
"outputs": [],
"source": [
"%%skip_if LOCAL_MODE_DISABLED\n",
"\n",
"from sagemaker.local import LocalSession\n",
"\n",
"# Set up local session\n",
"local_session = LocalSession()\n",
"local_session.config = {'local': {'local_code': True}}\n",
"\n",
"# Dummy role for local mode\n",
"DUMMY_IAM_ROLE = 'arn:aws:iam::111111111111:role/service-role/AmazonSageMaker-ExecutionRole-20200101T000001'\n",
"\n",
"sklearn_local_estimator = SKLearn(\n",
"    entry_point='train_and_deploy.py',\n",
"    role=DUMMY_IAM_ROLE,\n",
"    instance_type='local',\n",
"    instance_count=1,\n",
"    framework_version='1.2-1',\n",
"    py_version='py3',\n",
"    sagemaker_session=local_session\n",
")\n",
"\n",
"sklearn_local_estimator.fit({'train': 'file://data'})"
]
},
{
"cell_type": "markdown",
"id": "d5f13cdc-a92d-4557-8bc5-61a0aad88272",
"metadata": {},
"source": [
"## 🚀 Deploy"
]
},
{
"cell_type": "markdown",
"id": "8afe426a-51aa-4e55-953d-6f74a3810241",
"metadata": {},
"source": [
"Once the training job finishes, we can start the deployment"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4ade4445-1442-48ed-8597-b8ceba652f93",
"metadata": {},
"outputs": [],
"source": [
"predictor = sklearn_estimator.deploy(\n",
"    instance_type='ml.m5.large',\n",
"    initial_instance_count=1\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f08cb8e5-7224-45ec-ab98-3e49bb5e4a2d",
"metadata": {},
"outputs": [],
"source": [
"%%skip_if LOCAL_MODE_DISABLED\n",
"\n",
"local_predictor = sklearn_local_estimator.deploy(\n",
"    instance_type='local',\n",
"    initial_instance_count=1\n",
")"
]
},
{
"cell_type": "markdown",
"id": "9f21d610-9cfd-4e4e-b388-b2a065e6fe72",
"metadata": {},
"source": [
"This will create an endpoint that we can call directly"
]
},
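{
"cell_type": "code",
"execution_count": null,
"id": "c07d1c2d-b3a4-44bd-b71f-a65520c2f864",
"metadata": {},
"outputs": [],
"source": [
"# Feature order: MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude\n",
"new_house = [[3.2, 15, 6, 1.5, 300, 3, 34.05, -118.25]]"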
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ddddc3e8-f313-4125-b22e-9cc11061863d",
"metadata": {},
"outputs": [],
"source": [
"prediction = predictor.predict(new_house)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "af002280-54cf-4266-acc0-135ebe3ea934",
"metadata": {},
"outputs": [],
"source": [
"%%skip_if LOCAL_MODE_DISABLED\n",
"\n",
"prediction = local_predictor.predict(new_house)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2cfa575c-1084-4a07-91fd-9458499bdbff",
"metadata": {},
"outputs": [],
"source": [
"print(f\"💸 Predicted median house value: {prediction[0]:.2f} (in units of $100,000)\")"
]
},
{
"cell_type": "markdown",
"id": "4bef00a3-f689-4564-90e6-35fb6860e45b",
"metadata": {},
"source": [
"## 🧹 Clean"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "14f859b2-e84c-4ac2-beb0-7d0f6cf9ef8d",
"metadata": {},
"outputs": [],
"source": [
"predictor.delete_endpoint()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "967ecc8c-a0f3-4f5a-bbe6-1cbc0416733b",
"metadata": {},
"outputs": [],
"source": [
"%%skip_if LOCAL_MODE_DISABLED\n",
"\n",
"local_predictor.delete_endpoint()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}