@benmccown
Last active January 31, 2024 18:02
Test Azure Hybrid Storage Connection

Testing a Hybrid Connection using Gretel Workflows

Step 1 - Upload some sample data to your Azure Storage Container

Download the sample CSV file (the URL is in the wget command below) and upload it to your Azure Storage Container. The CLI example below does both, then removes the local copy.

wget https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/sample_data/sample-synthetic-healthcare.csv
az storage blob upload --account-name "<your-storage-account>" \
    -c "<your-storage-container>" \
    -f "sample-synthetic-healthcare.csv"
rm -f sample-synthetic-healthcare.csv

Step 2 - Get your existing Gretel Project ID

You can list your available projects with the command below.

gretel projects search --query "<your-query-here>"

I receive the following output.

$ gretel projects search --query "connections"
[
    {
        "name": "Test-Hybrid-Connections",
        "project_id": "65a966d474a8a05f5447fb7c",
        "display_name": "",
        "desc": "",
        "console_url": "https://console.gretel.ai/proj_2b8ddF0STqrvMHUABaOGQ5CKZ7A"
    },
    {
        "name": "Test-Hybrid-Connections-2",
        "project_id": "65a96e5474a8a05f5447fb96",
        "display_name": "Hybrid Connections",
        "desc": "",
        "console_url": "https://console.gretel.ai/proj_2b8hWUCM6rIntF0gWrhim8XGkFY"
    }
]

The project ID we need to reference is actually the tail end of the console_url property, not the project_id field. In this example I want to use the "Test-Hybrid-Connections-2" project, so the project ID I need to make note of is proj_2b8hWUCM6rIntF0gWrhim8XGkFY.
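
The tail of the URL can be pulled out with plain shell parameter expansion. This is a minimal sketch, assuming the console_url value has been copied into a variable by hand; the variable names are illustrative and not part of the Gretel CLI.

```shell
# Assumption: console_url was copied from the search output above.
console_url="https://console.gretel.ai/proj_2b8hWUCM6rIntF0gWrhim8XGkFY"

# Strip everything up to and including the last "/" to keep only the ID.
project_id="${console_url##*/}"

echo "$project_id"
```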

Step 3 - Get your existing Gretel Connection ID

You can list available connections with the command below.

gretel connections list

Running this command gives me the output below, which shows my Azure Storage connection.

$ gretel connections list
{
    "id": "c_2bPmuxR3v2nP6Y0uvoQGbqDGVzQ",
    "type": "azure",
    "name": "test-azure-storage",
    "validation_status": "VALIDATION_STATUS_VALID",
    "config": {
        "account_name": "bengretelsandbox",
        "default_container": "hybrid-bucket"
    },
    "customer_managed_credentials_encryption": true,
    "created_at": "2024-01-24 19:41:54.942000+00:00",
    "project_id": "proj_2b8hWUCM6rIntF0gWrhim8XGkFY",
    "created_by": "user_2XDTKB3iNGwBabq2q6geiLJSyml",
    "connection_target_type": ""
}

Make note of the "id" for your desired connection.
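
If several connections are listed, it can help to print each ID next to its name. The helper below is a hypothetical sketch, not a Gretel CLI feature. It assumes the pretty-printed layout shown above, with one field per line and "id" appearing before "name" in each object; a real JSON parser would be more robust.

```shell
# Hypothetical helper: print "<id> <name>" for each connection on stdin.
# Splits each line on double quotes, so the field value is the 4th piece.
list_connection_ids() {
    awk -F'"' '
        /^[ \t]*"id":/   { id = $4 }
        /^[ \t]*"name":/ { print id, $4 }
    '
}

# With the real CLI (not run here):
#   gretel connections list | list_connection_ids

# Demonstration on a fragment of the output shown above.
sample='{
    "id": "c_2bPmuxR3v2nP6Y0uvoQGbqDGVzQ",
    "type": "azure",
    "name": "test-azure-storage",
    "config": {
        "account_name": "bengretelsandbox"
    }
}'
printf '%s\n' "$sample" | list_connection_ids
```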

Step 4 - Create an example Gretel Workflow

First, define your workflow in a YAML file. I will create azure-workflow.yaml for this example. Make sure you review each property and update the values as needed, e.g. connection, container, and project_id.

azure-workflow.yaml

name: test-azure-workflow
actions:
  # The name for an individual action can be anything you want. I'm using azure-crawl here.
  - name: azure-crawl
    # Type has to be azure_source for an azure storage container that we want to read from
    type: azure_source
    # This is where we reference our connection ID
    connection: c_2bPmuxR3v2nP6Y0uvoQGbqDGVzQ
    config:
      # If you already set the default container in your connection creation you don't need this. I'm being explicit here and defining it anyway.
      container: hybrid-bucket
      # We want to ingest all CSV files, which should be the single demo healthcare CSV dataset we uploaded
      glob_filter: "*.csv"
  # Let's train a synthetic model on this data
  - name: model-train-run
    type: gretel_model
    # This input has to match the name you passed in for the previous action
    input: azure-crawl
    config:
      model: synthetics/default
      # Your project ID from step 2 goes here.
      project_id: proj_2b8hWUCM6rIntF0gWrhim8XGkFY
      run_params:
        params: {}
      training_data: "{azure-crawl.outputs.data}"
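
Because each input has to match the name of an earlier action, a quick check before creating the workflow can catch typos. The function below is a hypothetical, line-based sketch. It assumes the file is laid out like the example above (actions written as "- name: ..." and inputs as "input: ...", with no quotes or trailing comments on those lines) and it does not parse YAML properly.

```shell
# Hypothetical pre-flight check: every "input:" value must equal some
# "- name:" value in the same workflow file.
check_workflow_inputs() {
    file="$1"
    # Action names come from list items such as "  - name: azure-crawl".
    names=$(sed -n 's/^[[:space:]]*-[[:space:]]*name:[[:space:]]*//p' "$file")
    status=0
    for input in $(sed -n 's/^[[:space:]]*input:[[:space:]]*//p' "$file"); do
        # -x matches the whole line, -F treats the name as a fixed string.
        if ! printf '%s\n' "$names" | grep -qxF -- "$input"; then
            echo "input '$input' does not match any action name" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage (not run here):
#   check_workflow_inputs azure-workflow.yaml && echo "inputs look consistent"
```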

Then run the below command to create the workflow.

gretel workflows create --config azure-workflow.yaml --runner_mode hybrid

My example output is shown below. Make note of the ID that is returned.

$ gretel workflows create --config azure-workflow.yaml --runner_mode hybrid

INFO: Created workflow:
{
    "id": "w_2bQHYiWcVfPNx7p3XEfXFBTx4eq",
    "name": "test-azure-workflow",
    "project_id": "proj_2b8hWUCM6rIntF0gWrhim8XGkFY",
    "config": {
        "name": "test-azure-workflow",
        "actions": [
            {
                "name": "azure-crawl",
                "type": "azure_source",
                "connection": "c_2bPmuxR3v2nP6Y0uvoQGbqDGVzQ",
                "config": {
                    "container": "hybrid-bucket",
                    "glob_filter": "*.csv"
                }
            },
            {
                "name": "model-train-run",
                "type": "gretel_model",
                "input": "azure-crawl",
                "config": {
                    "model": "synthetics/default",
                    "project_id": "proj_2b8hWUCM6rIntF0gWrhim8XGkFY",
                    "run_params": {
                        "params": {}
                    },
                    "training_data": "{azure-crawl.outputs.data}"
                }
            }
        ]
    },
    "config_text": "actions:\n- config:\n    container: hybrid-bucket\n    glob_filter: '*.csv'\n  connection: c_2bPmuxR3v2nP6Y0uvoQGbqDGVzQ\n  name: azure-crawl\n  type: azure_source\n- config:\n    model: synthetics/default\n    project_id: proj_2b8hWUCM6rIntF0gWrhim8XGkFY\n    run_params:\n      params: {}\n    training_data: '{azure-crawl.outputs.data}'\n  input: azure-crawl\n  name: model-train-run\n  type: gretel_model\nname: test-azure-workflow\n",
    "runner_mode": "RUNNER_MODE_HYBRID",
    "created_by": "user_2XDTKB3iNGwBabq2q6geiLJSyml",
    "updated_by": "",
    "created_at": "2024-01-24 23:53:51.739054+00:00",
    "updated_at": "2024-01-24 23:53:51.739054+00:00"
}

Step 5 - Run your workflow

Use the workflow ID returned in the previous step and run the command below to execute the workflow. If the workflow succeeds, the Azure Storage source connector is working.

gretel workflows run --workflow-id w_2bQHYiWcVfPNx7p3XEfXFBTx4eq
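
To avoid copying the ID by hand, the create output could be filtered for the workflow ID and passed straight to the run command. This is a sketch under assumptions: the helper name is hypothetical, it relies on the pretty-printed "id" line shown in the example output, and it assumes the JSON is written to standard output.

```shell
# Hypothetical helper: print the first workflow ID (w_...) found on stdin.
workflow_id_from_output() {
    sed -n 's/^[[:space:]]*"id":[[:space:]]*"\(w_[^"]*\)".*/\1/p' | head -n 1
}

# Demonstration on a fragment of the create output shown earlier.
sample='INFO: Created workflow:
{
    "id": "w_2bQHYiWcVfPNx7p3XEfXFBTx4eq",
    "name": "test-azure-workflow"
}'
workflow_id=$(printf '%s\n' "$sample" | workflow_id_from_output)
echo "$workflow_id"

# With the real CLI (not run here):
#   workflow_id=$(gretel workflows create --config azure-workflow.yaml \
#       --runner_mode hybrid | workflow_id_from_output)
#   gretel workflows run --workflow-id "$workflow_id"
```
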
@mckornfield

@benmccown We need to remove the double curlies from the example {{azure-crawl.outputs.data}} should be {azure-crawl.outputs.data}

@benmccown

Fixed, thanks for catching!