JGalego · June 6, 2025 00:54
diff --git a/california_vibe.ipynb b/california_vibe.ipynb
 {
 "cells": [
  {
   "cell_type": "markdown",
   "id": "dfbf3a79-1f54-450f-95eb-1094146b71a9",
   "metadata": {},
   "source": [
    "# California Housing\n",
    "\n",
    "A simple, yet mostly vibe-coded demo of [Amazon SageMaker with Scikit-Learn](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html) using the [California Housing](https://inria.github.io/scikit-learn-mooc/python_scripts/datasets_california_housing.html) dataset.\n",
    "\n",
    "> *\"Ever tried. Ever failed. No matter. Try again. Fail again. Fail better.\"* —Samuel Beckett\n",
    "\n",
    "> *\"The best time to buy a house is always 5 years ago.\"* —Ray Brown\n",
    "\n",
    "<img src=\"https://raw.githubusercontent.com/amansingh9097/California-housing-price-prediction/master/california-house-price-trends.PNG\" width=\"50%\"/>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c2ba7296-1b2b-41aa-b92b-510850e17611",
   "metadata": {},
   "source": [
    "## ✅ Prerequisites\n",
    "\n",
    "Make sure the SageMaker SDK is installed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2135a82b-2798-4fc9-971f-cd15b02717c5",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -q sagemaker==2.246.0"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7d6381cf-1257-4ddb-a8c6-e883b5d49300",
   "metadata": {},
   "source": [
    "and let's start by importing some Python libraries and defining some helper functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9420e87a-7782-491b-9b0b-c272d1295883",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import logging\n",
    "import warnings\n",
    "\n",
    "# Suppress all warnings\n",
    "warnings.filterwarnings(\"ignore\")\n",
    "\n",
    "# SageMaker repeatedly logs config messages, so we'll suppress those too\n",
    "logging.getLogger(\"sagemaker.config\").setLevel(logging.WARNING)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9fc1b8c8-fde4-4286-a254-cc6fad30d537",
   "metadata": {},
   "outputs": [],
   "source": [
    "import boto3                         # AWS SDK for Python\n",
    "import sagemaker                     # Amazon SageMaker SDK for Python\n",
    "\n",
    "import numpy as np                   # Matrix multiplication and numerical processing\n",
    "import pandas as pd                  # Munging tabular data\n",
    "import matplotlib.pyplot as plt      # Charts and visualizations\n",
    "\n",
    "from IPython import get_ipython\n",
    "\n",
    "from IPython.core.magic import register_cell_magic\n",
    "\n",
    "from IPython.display import (        # Display tools in IPython\n",
    "    display,\n",
    "    Markdown\n",
    ")\n",
    "\n",
    "\n",
    "@register_cell_magic\n",
    "def skip(line, cell):\n",
    "    return\n",
    "\n",
    "\n",
    "@register_cell_magic\n",
    "def skip_if(line, cell):\n",
    "    if eval(line):\n",
    "        return\n",
    "    get_ipython().run_cell(cell)\n",
    "\n",
    "\n",
    "def printmd(text):\n",
    "    \"\"\"Displays a string rendered as Markdown\"\"\"\n",
    "    display(Markdown(text))\n",
    "\n",
    "\n",
    "# Debug\n",
    "printmd(f\"Numpy: `{np.__version__}`\")\n",
    "printmd(f\"Pandas: `{pd.__version__}`\")\n",
    "printmd(f\"Boto3: `{boto3.__version__}`\")\n",
    "printmd(f\"SageMaker: `{sagemaker.__version__}`\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "26b10bd9-287d-4822-b93d-e94ccc2df477",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize SageMaker client\n",
    "client = boto3.client(\"sagemaker\")\n",
    "\n",
    "# Manages interactions with the Amazon SageMaker APIs\n",
    "session = sagemaker.session.Session()\n",
    "\n",
    "# The AWS Region that we're using\n",
    "region = session.boto_region_name\n",
    "\n",
    "# The IAM execution role assumed by SageMaker\n",
    "role = sagemaker.get_execution_role()\n",
    "\n",
    "# The S3 bucket to be used by this session\n",
    "bucket = session.default_bucket()\n",
    "\n",
    "# Where we'll store our data and model artifacts\n",
    "prefix = \"california-housing\"\n",
    "\n",
    "printmd(f\"Region 🌎: `{region}`\")\n",
    "printmd(f\"Bucket 🪣: `{bucket}`\")\n",
    "printmd(f\"Role 👷: `{role}`\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6d9b3118-e03c-4048-ab79-8fca5e6d65c7",
   "metadata": {},
   "source": [
    "## ⚙️ Prepare"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8007897f-e29f-426f-9b53-acc502d5c561",
   "metadata": {},
   "source": [
    "Next, we'll load our data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6eeb4aee-0c39-4f91-8360-5998908d5b69",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import fetch_california_housing\n",
    "\n",
    "california = fetch_california_housing()\n",
    "X = pd.DataFrame(california.data, columns=california.feature_names)\n",
    "y = pd.Series(california.target, name='MedHouseVal')\n",
    "\n",
    "X.hist(figsize=(12, 10))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b5623259-9126-46e0-90c1-675187460dc4",
   "metadata": {},
   "source": [
    "Split it into training and test sets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2f3cc470-f9de-44b8-a58b-d9becd29603a",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5e5592f5-afd3-4f05-a4a9-0543fe02c675",
   "metadata": {},
   "source": [
    "Save to CSV for SageMaker"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4c272b34-a923-41ac-a966-376797d5ee83",
   "metadata": {},
   "outputs": [],
   "source": [
    "train_df = pd.concat([X_train, y_train], axis=1)\n",
    "test_df = pd.concat([X_test, y_test], axis=1)\n",
    "\n",
    "os.makedirs('data', exist_ok=True)\n",
    "train_df.to_csv('data/train.csv', index=False)\n",
    "test_df.to_csv('data/test.csv', index=False)\n",
    "!ls -lisah data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b5f06c7e-849d-4ed8-a40a-3172c9bc97df",
   "metadata": {},
   "source": [
    "and upload the final result to the S3 bucket"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "92745d07-88bb-4e90-b4ac-6dfa05c09e7b",
   "metadata": {},
   "outputs": [],
   "source": [
    "train_input = session.upload_data('data/train.csv', bucket=bucket, key_prefix=f'{prefix}/train')\n",
    "test_input = session.upload_data('data/test.csv', bucket=bucket, key_prefix=f'{prefix}/test')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ca2d5000-f1c9-4c62-9777-a22a647241cb",
   "metadata": {},
   "source": [
    "## 🏗️ Build\n",
    "\n",
    "Now it's time to define our training workflow"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fa12ab5b-cb94-4df1-a1c1-ef93184809e2",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile train_and_deploy.py\n",
    "\n",
    "import os\n",
    "import argparse\n",
    "import joblib\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "from sklearn.ensemble import RandomForestRegressor\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "from sklearn.metrics import mean_squared_error\n",
    "\n",
    "\n",
    "def model_fn(model_dir):\n",
    "    \"\"\"Load the model from the model_dir.\"\"\"\n",
    "    model_path = os.path.join(model_dir, \"model.joblib\")\n",
    "    return joblib.load(model_path)\n",
    "\n",
    "\n",
    "if __name__ == '__main__':\n",
    "    parser = argparse.ArgumentParser()\n",
    "    parser.add_argument('--output-data-dir', type=str, default='/opt/ml/output/data')\n",
    "    parser.add_argument('--model-dir', type=str, default='/opt/ml/model')\n",
    "    parser.add_argument('--train', type=str, default='/opt/ml/input/data/train')\n",
    "    args = parser.parse_args()\n",
    "\n",
    "    # Load data\n",
    "    df = pd.read_csv(f'{args.train}/train.csv')\n",
    "    X = df.drop('MedHouseVal', axis=1)\n",
    "    y = df['MedHouseVal']\n",
    "\n",
    "    # Train model with GridSearch\n",
    "    param_grid = {\n",
    "        'n_estimators': [100],\n",
    "        'max_depth': [None],\n",
    "        'max_features': ['sqrt']\n",
    "    }\n",
    "    model = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=5, scoring='neg_mean_squared_error')\n",
    "    model.fit(X, y)\n",
    "\n",
    "    # Save model\n",
    "    joblib.dump(model.best_estimator_, f'{args.model_dir}/model.joblib')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3b88a176-b1cd-4eb4-859b-3c8e70382d8e",
   "metadata": {},
   "source": [
    "The training script above also defines `model_fn`, a special function that loads the saved model from `model_dir`. SageMaker calls this function during deployment."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "330367e8-df36-4cbb-9389-8be6f570e0a8",
   "metadata": {},
   "source": [
    "**Optional:** feel free to run the script locally to check if it's working"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8134d43e-5a16-4fef-b268-2c5516b137ff",
   "metadata": {},
   "outputs": [],
   "source": [
    "!python train_and_deploy.py --train data --model-dir ."
   ]
  },
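  {
   "cell_type": "markdown",
   "id": "sketch-model-fn-local-check",
   "metadata": {},
   "source": [
    "As a rough sketch of what the endpoint is expected to do with the artifact, the cell below loads it through `model_fn` and scores a few held-out rows. It assumes the optional local run above has already written `model.joblib` to the current directory; `local_model` and `sample` are names introduced here for illustration only."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sketch-model-fn-local-check-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch only: assumes ./model.joblib exists (see the optional local run above)\n",
    "from train_and_deploy import model_fn\n",
    "\n",
    "# Load the artifact the same way the script's own loader does\n",
    "local_model = model_fn('.')\n",
    "\n",
    "# Score a few held-out rows; X_test comes from the split defined earlier\n",
    "sample = X_test.head(3)\n",
    "print(local_model.predict(sample))"
   ]
  },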
  {
   "cell_type": "markdown",
   "id": "1ee267ae-d91c-485b-b06c-2c4a50399c1c",
   "metadata": {},
   "source": [
    "Let's initialize our estimator (think of it as a high-level interface for SageMaker training)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a5fe7718-73dc-4062-8b6a-a615b57fc5db",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.sklearn.estimator import SKLearn\n",
    "\n",
    "sklearn_estimator = SKLearn(\n",
    "    entry_point='train_and_deploy.py',\n",
    "    role=role,\n",
    "    instance_type='ml.m5.large',\n",
    "    framework_version='1.2-1',\n",
    "    py_version='py3',\n",
    "    sagemaker_session=session,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aa3f4393-c9bc-4a24-bbd3-9d33fe21b5ca",
   "metadata": {},
   "source": [
    "and kick off our training run"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "106cb4ad-1d91-4348-98a9-7d30b8b2c1f3",
   "metadata": {},
   "outputs": [],
   "source": [
    "sklearn_estimator.fit({'train': train_input})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "826d8560-2aef-4807-8b21-f5c68a65b841",
   "metadata": {},
   "source": [
    "**Important:** run the next cells only if you want to test this *locally*\n",
    "\n",
    "**Requirements:** to use `local` mode, you'll need AWS credentials to pull images from the [public ECR repositories managed by SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/ecr-eu-west-1.html), as well as [Docker Compose](https://docs.docker.com/compose/install/) 🐳\n",
    "\n",
    "> ✨ For more information, see [Configuring Local Mode Execution in Sagemaker Studio](https://community.aws/content/2kWVSWbVdUpWHupaVIMLi7Vq12x/configuring-local-mode-execution-in-sagemaker-studio)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "05190a46-685a-41ba-89f0-cc9884ca9ebf",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set to False to run the local mode cells\n",
    "LOCAL_MODE_DISABLED = True"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cb00ef1c-3b3b-49ce-bf63-db5716a290b1",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%skip_if LOCAL_MODE_DISABLED\n",
    "\n",
    "from getpass import getpass\n",
    "\n",
    "os.environ['AWS_DEFAULT_REGION'] = input(\"Region:\") or \"eu-west-1\"\n",
    "os.environ['AWS_ACCESS_KEY_ID'] = getpass(\"Access key:\")\n",
    "os.environ['AWS_SECRET_ACCESS_KEY'] = getpass(\"Secret access key:\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5b2b55f3-cd7b-480e-b387-731f03ff2bb3",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%skip_if LOCAL_MODE_DISABLED\n",
    "\n",
    "from sagemaker.local import LocalSession\n",
    "\n",
    "# Set up local session\n",
    "local_session = LocalSession()\n",
    "local_session.config = {'local': {'local_code': True}}\n",
    "\n",
    "# Dummy role for local mode\n",
    "DUMMY_IAM_ROLE = 'arn:aws:iam::111111111111:role/service-role/AmazonSageMaker-ExecutionRole-20200101T000001'\n",
    "\n",
    "sklearn_local_estimator = SKLearn(\n",
    "    entry_point='train_and_deploy.py',\n",
    "    role=DUMMY_IAM_ROLE,\n",
    "    instance_type='local',\n",
    "    instance_count=1,\n",
    "    framework_version='1.2-1',\n",
    "    py_version='py3',\n",
    "    sagemaker_session=local_session\n",
    ")\n",
    "\n",
    "sklearn_local_estimator.fit({'train': 'file://data'})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d5f13cdc-a92d-4557-8bc5-61a0aad88272",
   "metadata": {},
   "source": [
    "## 🚀 Deploy"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8afe426a-51aa-4e55-953d-6f74a3810241",
   "metadata": {},
   "source": [
    "Once the training job finishes, we can start the deployment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4ade4445-1442-48ed-8597-b8ceba652f93",
   "metadata": {},
   "outputs": [],
   "source": [
    "predictor = sklearn_estimator.deploy(\n",
    "    instance_type='ml.m5.large',\n",
    "    initial_instance_count=1\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f08cb8e5-7224-45ec-ab98-3e49bb5e4a2d",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%skip_if LOCAL_MODE_DISABLED\n",
    "\n",
    "local_predictor = sklearn_local_estimator.deploy(\n",
    "    instance_type='local',\n",
    "    initial_instance_count=1\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9f21d610-9cfd-4e4e-b388-b2a065e6fe72",
   "metadata": {},
   "source": [
    "This will create an endpoint that we can call directly"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c07d1c2d-b3a4-44bd-b71f-a65520c2f864",
   "metadata": {},
   "outputs": [],
   "source": [
    "new_house = [[3.2, 15, 6, 1.5, 300, 3, 34.05, -118.25]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ddddc3e8-f313-4125-b22e-9cc11061863d",
   "metadata": {},
   "outputs": [],
   "source": [
    "prediction = predictor.predict(new_house)"
   ]
  },
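  {
   "cell_type": "markdown",
   "id": "sketch-invoke-endpoint-boto3",
   "metadata": {},
   "source": [
    "The `predictor` object is a convenience wrapper; the same endpoint can also be reached through the low-level runtime API. The cell below is a minimal sketch, assuming the serving container accepts `text/csv` input and can return `application/json` (believed to be defaults of the Scikit-Learn container, but worth verifying). The `runtime` and `response` names are introduced here for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sketch-invoke-endpoint-boto3-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch only: assumes the endpoint accepts text/csv and can answer with JSON\n",
    "import json\n",
    "\n",
    "runtime = boto3.client('sagemaker-runtime')\n",
    "\n",
    "response = runtime.invoke_endpoint(\n",
    "    EndpointName=predictor.endpoint_name,\n",
    "    ContentType='text/csv',\n",
    "    Accept='application/json',\n",
    "    # Serialize the single row of features as one CSV line\n",
    "    Body=','.join(str(value) for value in new_house[0]),\n",
    ")\n",
    "\n",
    "print(json.loads(response['Body'].read()))"
   ]
  },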
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "af002280-54cf-4266-acc0-135ebe3ea934",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%skip_if LOCAL_MODE_DISABLED\n",
    "\n",
    "prediction = local_predictor.predict(new_house)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2cfa575c-1084-4a07-91fd-9458499bdbff",
   "metadata": {},
   "outputs": [],
   "source": [
    "# MedHouseVal is expressed in units of $100,000\n",
    "print(f\"💸 Predicted Price: ${prediction[0] * 100_000:,.2f}\")"
   ]
  },
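  {
   "cell_type": "markdown",
   "id": "sketch-endpoint-holdout-rmse",
   "metadata": {},
   "source": [
    "A single prediction says little about model quality. As a hedged sketch, the cell below sends a small batch of held-out rows to the endpoint and computes the root mean squared error against `y_test`. The batch size of 100 is an arbitrary choice made here, and `batch`, `batch_predictions` and `rmse` are illustrative names. The target is in units of $100,000, so the RMSE is too."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sketch-endpoint-holdout-rmse-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch only: scores 100 held-out rows (arbitrary batch size) against the live endpoint\n",
    "from sklearn.metrics import mean_squared_error\n",
    "\n",
    "batch = X_test.head(100)\n",
    "\n",
    "# Send plain arrays, matching the shape used for new_house above\n",
    "batch_predictions = predictor.predict(batch.values)\n",
    "\n",
    "# RMSE in the target's own units ($100,000)\n",
    "rmse = np.sqrt(mean_squared_error(y_test.head(100), batch_predictions))\n",
    "print(f'RMSE on {len(batch)} held-out rows: {rmse:.3f}')"
   ]
  },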
  {
   "cell_type": "markdown",
   "id": "4bef00a3-f689-4564-90e6-35fb6860e45b",
   "metadata": {},
   "source": [
    "## 🧹 Clean"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "14f859b2-e84c-4ac2-beb0-7d0f6cf9ef8d",
   "metadata": {},
   "outputs": [],
   "source": [
    "predictor.delete_endpoint()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "967ecc8c-a0f3-4f5a-bbe6-1cbc0416733b",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%skip_if LOCAL_MODE_DISABLED\n",
    "\n",
    "local_predictor.delete_endpoint()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
 }