Last active
June 6, 2025 00:54
A simple, yet mostly vibe-coded demo of Amazon SageMaker with Scikit-Learn using the California Housing dataset
| { | |
| "cells": [ | |
| { | |
| "cell_type": "markdown", | |
| "id": "dfbf3a79-1f54-450f-95eb-1094146b71a9", | |
| "metadata": {}, | |
| "source": [ | |
| "# California Housing\n", | |
| "\n", | |
| "A simple, yet mostly vibe-coded demo of [Amazon SageMaker with Scikit-Learn](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html) using the [California Housing](https://inria.github.io/scikit-learn-mooc/python_scripts/datasets_california_housing.html) dataset.\n", | |
| "\n", | |
| "> *\"Ever tried. Ever failed. No matter. Try again. Fail again. Fail better.\"* - Samuel Beckett\n", | |
| "\n", | |
| "> *\"The best time to buy a house is always 5 years ago.\"* - Ray Brown\n", | |
| "\n", | |
| "<img src=\"https://raw.githubusercontent.com/amansingh9097/California-housing-price-prediction/master/california-house-price-trends.PNG\" width=\"50%\"/>" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "c2ba7296-1b2b-41aa-b92b-510850e17611", | |
| "metadata": {}, | |
| "source": [ | |
| "## Prerequisites\n", | |
| "\n", | |
| "Make sure the SageMaker Python SDK is installed:" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "id": "2135a82b-2798-4fc9-971f-cd15b02717c5", | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "!pip install -q sagemaker==2.246.0" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "7d6381cf-1257-4ddb-a8c6-e883b5d49300", | |
| "metadata": {}, | |
| "source": [ | |
| "Next, import the Python libraries we'll need and define a few helper functions:" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "id": "9420e87a-7782-491b-9b0b-c272d1295883", | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "import os\n", | |
| "import logging\n", | |
| "import warnings\n", | |
| "\n", | |
| "# Suppress all warnings\n", | |
| "warnings.filterwarnings(\"ignore\")\n", | |
| "\n", | |
| "# The SageMaker SDK logs its config lookups at INFO level, so raise the threshold to hide them\n", | |
| "logging.getLogger(\"sagemaker.config\").setLevel(logging.WARNING)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "id": "9fc1b8c8-fde4-4286-a254-cc6fad30d537", | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "import boto3 # AWS SDK for Python\n", | |
| "import sagemaker # Amazon SageMaker SDK for Python\n", | |
| "\n", | |
| "import numpy as np # Matrix multiplication and numerical processing\n", | |
| "import pandas as pd # Munging tabular data\n", | |
| "import matplotlib.pyplot as plt # Charts and visualizations\n", | |
| "\n", | |
| "from IPython import get_ipython\n", | |
| "\n", | |
| "from IPython.core.magic import register_cell_magic\n", | |
| "\n", | |
| "from IPython.display import ( # Display tools in IPython\n", | |
| "    display,\n", | |
| "    Markdown\n", | |
| ")\n", | |
| "\n", | |
| "\n", | |
| "@register_cell_magic\n", | |
| "def skip(line, cell):\n", | |
| "    \"\"\"Cell magic that never runs the cell body.\"\"\"\n", | |
| "    return\n", | |
| "\n", | |
| "\n", | |
| "@register_cell_magic\n", | |
| "def skip_if(line, cell):\n", | |
| "    \"\"\"Cell magic that runs the cell only when the expression in `line` is falsy.\"\"\"\n", | |
| "    if eval(line):\n", | |
| "        return\n", | |
| "    get_ipython().run_cell(cell)\n", | |
| "\n", | |
| "\n", | |
| "def printmd(text):\n", | |
| "    \"\"\"Renders a Markdown string in the cell output\"\"\"\n", | |
| "    display(Markdown(text))\n", | |
| "\n", | |
| "\n", | |
| "# Debug\n", | |
| "printmd(f\"Numpy: `{np.__version__}`\")\n", | |
| "printmd(f\"Pandas: `{pd.__version__}`\")\n", | |
| "printmd(f\"Boto3: `{boto3.__version__}`\")\n", | |
| "printmd(f\"SageMaker: `{sagemaker.__version__}`\")" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "id": "26b10bd9-287d-4822-b93d-e94ccc2df477", | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "# Initialize SageMaker client\n", | |
| "client = boto3.client(\"sagemaker\")\n", | |
| "\n", | |
| "# Manages interactions with the Amazon SageMaker APIs\n", | |
| "session = sagemaker.session.Session()\n", | |
| "\n", | |
| "# The AWS Region that we're using\n", | |
| "region = session.boto_region_name\n", | |
| "\n", | |
| "# The IAM execution role assumed by SageMaker\n", | |
| "role = sagemaker.get_execution_role()\n", | |
| "\n", | |
| "# The S3 bucket to be used by this session\n", | |
| "bucket = session.default_bucket()\n", | |
| "\n", | |
| "# Where we'll store our data and model artifacts\n", | |
| "prefix = \"california-housing\"\n", | |
| "\n", | |
| "printmd(f\"Region: `{region}`\")\n", | |
| "printmd(f\"Bucket: `{bucket}`\")\n", | |
| "printmd(f\"Role: `{role}`\")" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "6d9b3118-e03c-4048-ab79-8fca5e6d65c7", | |
| "metadata": {}, | |
| "source": [ | |
| "## Prepare" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "8007897f-e29f-426f-9b53-acc502d5c561", | |
| "metadata": {}, | |
| "source": [ | |
| "Next, we'll load our data" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "id": "6eeb4aee-0c39-4f91-8360-5998908d5b69", | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "from sklearn.datasets import fetch_california_housing\n", | |
| "\n", | |
| "california = fetch_california_housing()\n", | |
| "X = pd.DataFrame(california.data, columns=california.feature_names)\n", | |
| "y = pd.Series(california.target, name='MedHouseVal')\n", | |
| "\n", | |
| "X.hist(figsize=(12, 10))" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "b5623259-9126-46e0-90c1-675187460dc4", | |
| "metadata": {}, | |
| "source": [ | |
| "Split it into training and test sets" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "id": "2f3cc470-f9de-44b8-a58b-d9becd29603a", | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "from sklearn.model_selection import train_test_split\n", | |
| "\n", | |
| "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "5e5592f5-afd3-4f05-a4a9-0543fe02c675", | |
| "metadata": {}, | |
| "source": [ | |
| "Save to CSV for SageMaker" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "id": "4c272b34-a923-41ac-a966-376797d5ee83", | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "train_df = pd.concat([X_train, y_train], axis=1)\n", | |
| "test_df = pd.concat([X_test, y_test], axis=1)\n", | |
| "\n", | |
| "os.makedirs('data', exist_ok=True)\n", | |
| "train_df.to_csv('data/train.csv', index=False)\n", | |
| "test_df.to_csv('data/test.csv', index=False)\n", | |
| "!ls -lisah data" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "b5f06c7e-849d-4ed8-a40a-3172c9bc97df", | |
| "metadata": {}, | |
| "source": [ | |
| "and upload the CSV files to the S3 bucket" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "id": "92745d07-88bb-4e90-b4ac-6dfa05c09e7b", | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "train_input = session.upload_data('data/train.csv', bucket=bucket, key_prefix=f'{prefix}/train')\n", | |
| "test_input = session.upload_data('data/test.csv', bucket=bucket, key_prefix=f'{prefix}/test')" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "ca2d5000-f1c9-4c62-9777-a22a647241cb", | |
| "metadata": {}, | |
| "source": [ | |
| "## Build\n", | |
| "\n", | |
| "Now it's time to define our training workflow" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "id": "fa12ab5b-cb94-4df1-a1c1-ef93184809e2", | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "%%writefile train_and_deploy.py\n", | |
| "\n", | |
| "import os\n", | |
| "import argparse\n", | |
| "import joblib\n", | |
| "\n", | |
| "import pandas as pd\n", | |
| "\n", | |
| "from sklearn.ensemble import RandomForestRegressor\n", | |
| "from sklearn.model_selection import GridSearchCV\n", | |
| "\n", | |
| "\n", | |
| "def model_fn(model_dir):\n", | |
| "    \"\"\"Load the model from the model_dir.\"\"\"\n", | |
| "    model_path = os.path.join(model_dir, \"model.joblib\")\n", | |
| "    return joblib.load(model_path)\n", | |
| "\n", | |
| "\n", | |
| "if __name__ == '__main__':\n", | |
| "    parser = argparse.ArgumentParser()\n", | |
| "    parser.add_argument('--output-data-dir', type=str, default='/opt/ml/output/data')\n", | |
| "    parser.add_argument('--model-dir', type=str, default='/opt/ml/model')\n", | |
| "    parser.add_argument('--train', type=str, default='/opt/ml/input/data/train')\n", | |
| "    args = parser.parse_args()\n", | |
| "\n", | |
| "    # Load data\n", | |
| "    df = pd.read_csv(os.path.join(args.train, 'train.csv'))\n", | |
| "    X = df.drop('MedHouseVal', axis=1)\n", | |
| "    y = df['MedHouseVal']\n", | |
| "\n", | |
| "    # Train model with GridSearch (single-point grid; add values to widen the search)\n", | |
| "    param_grid = {\n", | |
| "        'n_estimators': [100],\n", | |
| "        'max_depth': [None],\n", | |
| "        'max_features': ['sqrt']\n", | |
| "    }\n", | |
| "    model = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=5, scoring='neg_mean_squared_error')\n", | |
| "    model.fit(X, y)\n", | |
| "    print(f'Best CV MSE: {-model.best_score_:.4f}')\n", | |
| "\n", | |
| "    # Save model\n", | |
| "    joblib.dump(model.best_estimator_, os.path.join(args.model_dir, 'model.joblib'))" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "3b88a176-b1cd-4eb4-859b-3c8e70382d8e", | |
| "metadata": {}, | |
| "source": [ | |
| "The training script above also defines a special function called `model_fn`. SageMaker's Scikit-learn serving container calls it at deployment time to load the saved model from `model_dir`." | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "330367e8-df36-4cbb-9389-8be6f570e0a8", | |
| "metadata": {}, | |
| "source": [ | |
| "**Optional:** feel free to run the script locally first to check that it works" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "id": "8134d43e-5a16-4fef-b268-2c5516b137ff", | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "!python train_and_deploy.py --train data --model-dir ." | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "1ee267ae-d91c-485b-b06c-2c4a50399c1c", | |
| "metadata": {}, | |
| "source": [ | |
| "Let's initialize our estimator (think of it as a high-level interface for SageMaker training)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "id": "a5fe7718-73dc-4062-8b6a-a615b57fc5db", | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "from sagemaker.sklearn.estimator import SKLearn\n", | |
| "\n", | |
| "sklearn_estimator = SKLearn(\n", | |
| "    entry_point='train_and_deploy.py',\n", | |
| "    role=role,\n", | |
| "    instance_type='ml.m5.large',\n", | |
| "    framework_version='1.2-1',\n", | |
| "    py_version='py3',\n", | |
| "    sagemaker_session=session,\n", | |
| ")" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "aa3f4393-c9bc-4a24-bbd3-9d33fe21b5ca", | |
| "metadata": {}, | |
| "source": [ | |
| "and kick off our training run" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "id": "106cb4ad-1d91-4348-98a9-7d30b8b2c1f3", | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "sklearn_estimator.fit({'train': train_input})" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "826d8560-2aef-4807-8b21-f5c68a65b841", | |
| "metadata": {}, | |
| "source": [ | |
| "**Important:** use the following cells only if you want to test this *locally*.\n", | |
| "\n", | |
| "**Requirements:** to use `local` mode, you'll need AWS credentials to pull images from the [public ECR repositories managed by SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/ecr-eu-west-1.html), plus [Docker Compose](https://docs.docker.com/compose/install/).\n", | |
| "\n", | |
| "> For more information, see [Configuring Local Mode Execution in Sagemaker Studio](https://community.aws/content/2kWVSWbVdUpWHupaVIMLi7Vq12x/configuring-local-mode-execution-in-sagemaker-studio)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "id": "05190a46-685a-41ba-89f0-cc9884ca9ebf", | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "# Set to False to run the local-mode cells\n", | |
| "LOCAL_MODE_DISABLED = True" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "id": "cb00ef1c-3b3b-49ce-bf63-db5716a290b1", | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "%%skip_if LOCAL_MODE_DISABLED\n", | |
| "\n", | |
| "from getpass import getpass\n", | |
| "\n", | |
| "os.environ['AWS_DEFAULT_REGION'] = input(\"Region:\") or \"eu-west-1\"\n", | |
| "os.environ['AWS_ACCESS_KEY_ID'] = getpass(\"Access key:\")\n", | |
| "os.environ['AWS_SECRET_ACCESS_KEY'] = getpass(\"Secret access key:\")" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "id": "5b2b55f3-cd7b-480e-b387-731f03ff2bb3", | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "%%skip_if LOCAL_MODE_DISABLED\n", | |
| "\n", | |
| "from sagemaker.local import LocalSession\n", | |
| "\n", | |
| "# Set up local session\n", | |
| "local_session = LocalSession()\n", | |
| "local_session.config = {'local': {'local_code': True}}\n", | |
| "\n", | |
| "# Dummy role for local mode\n", | |
| "DUMMY_IAM_ROLE = 'arn:aws:iam::111111111111:role/service-role/AmazonSageMaker-ExecutionRole-20200101T000001'\n", | |
| "\n", | |
| "sklearn_local_estimator = SKLearn(\n", | |
| "    entry_point='train_and_deploy.py',\n", | |
| "    role=DUMMY_IAM_ROLE,\n", | |
| "    instance_type='local',\n", | |
| "    instance_count=1,\n", | |
| "    framework_version='1.2-1',\n", | |
| "    py_version='py3',\n", | |
| "    sagemaker_session=local_session\n", | |
| ")\n", | |
| "\n", | |
| "sklearn_local_estimator.fit({'train': 'file://data'})" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "d5f13cdc-a92d-4557-8bc5-61a0aad88272", | |
| "metadata": {}, | |
| "source": [ | |
| "## Deploy" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "8afe426a-51aa-4e55-953d-6f74a3810241", | |
| "metadata": {}, | |
| "source": [ | |
| "Once the training job finishes, we can start the deployment" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "id": "4ade4445-1442-48ed-8597-b8ceba652f93", | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "predictor = sklearn_estimator.deploy(\n", | |
| "    instance_type='ml.m5.large',\n", | |
| "    initial_instance_count=1\n", | |
| ")" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "id": "f08cb8e5-7224-45ec-ab98-3e49bb5e4a2d", | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "%%skip_if LOCAL_MODE_DISABLED\n", | |
| "\n", | |
| "local_predictor = sklearn_local_estimator.deploy(\n", | |
| "    instance_type='local',\n", | |
| "    initial_instance_count=1\n", | |
| ")" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "9f21d610-9cfd-4e4e-b388-b2a065e6fe72", | |
| "metadata": {}, | |
| "source": [ | |
| "This will create an endpoint that we can call directly" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "id": "c07d1c2d-b3a4-44bd-b71f-a65520c2f864", | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
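| "# Feature order: MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude\n", | |
| "new_house = [[3.2, 15, 6, 1.5, 300, 3, 34.05, -118.25]]" | |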
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "id": "ddddc3e8-f313-4125-b22e-9cc11061863d", | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "prediction = predictor.predict(new_house)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "id": "af002280-54cf-4266-acc0-135ebe3ea934", | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "%%skip_if LOCAL_MODE_DISABLED\n", | |
| "\n", | |
| "prediction = local_predictor.predict(new_house)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "id": "2cfa575c-1084-4a07-91fd-9458499bdbff", | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "# MedHouseVal is expressed in units of $100,000\n", | |
| "print(f\"Predicted Price: {prediction[0]:.2f} (in units of $100,000)\")" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "4bef00a3-f689-4564-90e6-35fb6860e45b", | |
| "metadata": {}, | |
| "source": [ | |
| "## Clean" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "id": "14f859b2-e84c-4ac2-beb0-7d0f6cf9ef8d", | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "predictor.delete_endpoint()" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "id": "967ecc8c-a0f3-4f5a-bbe6-1cbc0416733b", | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "%%skip_if LOCAL_MODE_DISABLED\n", | |
| "\n", | |
| "local_predictor.delete_endpoint()" | |
| ] | |
| } | |
| ], | |
| "metadata": { | |
| "kernelspec": { | |
| "display_name": "Python 3 (ipykernel)", | |
| "language": "python", | |
| "name": "python3" | |
| }, | |
| "language_info": { | |
| "codemirror_mode": { | |
| "name": "ipython", | |
| "version": 3 | |
| }, | |
| "file_extension": ".py", | |
| "mimetype": "text/x-python", | |
| "name": "python", | |
| "nbconvert_exporter": "python", | |
| "pygments_lexer": "ipython3", | |
| "version": "3.12.9" | |
| } | |
| }, | |
| "nbformat": 4, | |
| "nbformat_minor": 5 | |
| } |
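The notebook writes `data/test.csv` and uploads it to S3, but it never scores the trained model against those held-out rows. A minimal sketch of how that check could look is given below, using only the standard library so it runs anywhere. The helper names `rmse` and `to_dollars` are hypothetical (they are not part of the notebook or of any SDK), and the numbers are made up for illustration rather than real model output.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("need at least one observation")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def to_dollars(med_house_val: float) -> float:
    """Convert a MedHouseVal figure (units of $100,000) to dollars."""
    return med_house_val * 100_000


# Made-up values for illustration only (hypothetical, not real model output)
actual = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 5.0]

error = rmse(actual, predicted)
print(f"RMSE: {error:.3f} (about ${to_dollars(error):,.0f})")
```

In the notebook itself, something along the lines of `rmse(y_test.tolist(), predictor.predict(X_test.values).tolist())` might work before the endpoint is deleted, assuming the endpoint accepts the whole test set in one request and returns one prediction per row as a NumPy array; neither assumption is verified here.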