Download the CSV file from this location and then upload it to your Azure Storage container. Below is a CLI example that accomplishes this.
wget https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/sample_data/sample-synthetic-healthcare.csv
az storage blob upload --account-name "<your-storage-account>" \
-c "<your-storage-container>" \
-f "sample-synthetic-healthcare.csv"
rm -f sample-synthetic-healthcare.csv
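Optionally, you can confirm the file landed in your container before moving on. The listing command below is a sketch using the standard Azure CLI and the same account and container placeholders as above; like the upload command, it assumes you are already authenticated to the storage account.
az storage blob list --account-name "<your-storage-account>" \
-c "<your-storage-container>" \
--output table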
You can list your available projects using the command below.
gretel projects search --query "<your-query-here>"
When I run it with the query "connections", I receive the following output.
$ gretel projects search --query "connections"
[
{
"name": "Test-Hybrid-Connections",
"project_id": "65a966d474a8a05f5447fb7c",
"display_name": "",
"desc": "",
"console_url": "https://console.gretel.ai/proj_2b8ddF0STqrvMHUABaOGQ5CKZ7A"
},
{
"name": "Test-Hybrid-Connections-2",
"project_id": "65a96e5474a8a05f5447fb96",
"display_name": "Hybrid Connections",
"desc": "",
"console_url": "https://console.gretel.ai/proj_2b8hWUCM6rIntF0gWrhim8XGkFY"
}
]
The project ID we need to reference is the tail end of the console_url property. In this example, I want to use the "Test-Hybrid-Connections-2" project, so the project ID to make note of is proj_2b8hWUCM6rIntF0gWrhim8XGkFY.
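If you prefer to grab that ID programmatically, a small jq filter over the search output works. This is only a sketch: it assumes the gretel projects search command prints the JSON array shown above to stdout, that jq is installed, and it hard-codes this example's project name.
gretel projects search --query "connections" \
| jq -r '.[] | select(.name == "Test-Hybrid-Connections-2") | .console_url | split("/") | last'
For the example output above, this prints proj_2b8hWUCM6rIntF0gWrhim8XGkFY.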
You can list available connections with the command below.
gretel connections list
I see the following output when running this command. This is my Azure Storage connection.
$ gretel connections list
{
"id": "c_2bPmuxR3v2nP6Y0uvoQGbqDGVzQ",
"type": "azure",
"name": "test-azure-storage",
"validation_status": "VALIDATION_STATUS_VALID",
"config": {
"account_name": "bengretelsandbox",
"default_container": "hybrid-bucket"
},
"customer_managed_credentials_encryption": true,
"created_at": "2024-01-24 19:41:54.942000+00:00",
"project_id": "proj_2b8hWUCM6rIntF0gWrhim8XGkFY",
"created_by": "user_2XDTKB3iNGwBabq2q6geiLJSyml",
"connection_target_type": ""
}
Make note of the "id" for your desired connection.
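The connection ID can be extracted the same way if you are scripting these steps. The sketch below assumes gretel connections list writes the JSON objects shown above to stdout and that jq is installed; if your CLI version returns a JSON array instead, prepend .[] | to the filter.
gretel connections list | jq -r 'select(.name == "test-azure-storage") | .id'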
First, define your workflow in a YAML file. I will create azure-workflow.yaml for this example. Make sure you review each property and update the values as needed, e.g. connection, container, and project_id.
azure-workflow.yaml
name: test-azure-workflow
actions:
  # The name for an individual action can be anything you want. I'm using azure-crawl here.
  - name: azure-crawl
    # The type has to be azure_source for an Azure Storage container that we want to read from.
    type: azure_source
    # This is where we reference our connection ID.
    connection: c_2bPmuxR3v2nP6Y0uvoQGbqDGVzQ
    config:
      # If you already set a default container when creating your connection, you don't need this. I'm being explicit here and defining it anyway.
      container: hybrid-bucket
      # We want to ingest all CSV files, which should be the single demo healthcare CSV dataset we uploaded.
      glob_filter: "*.csv"
  # Let's train a synthetic model on this data.
  - name: model-train-run
    type: gretel_model
    # This input has to match the name you gave the previous action.
    input: azure-crawl
    config:
      model: synthetics/default
      # Your project ID from step 2 goes here.
      project_id: proj_2b8hWUCM6rIntF0gWrhim8XGkFY
      run_params:
        params: {}
      training_data: "{azure-crawl.outputs.data}"
Then run the command below to create the workflow.
gretel workflows create --config azure-workflow.yaml --runner_mode hybrid
My example output is shown below. Make note of the ID that is returned.
$ gretel workflows create --config azure-workflow.yaml --runner_mode hybrid
INFO: Created workflow:
{
"id": "w_2bQHYiWcVfPNx7p3XEfXFBTx4eq",
"name": "test-azure-workflow",
"project_id": "proj_2b8hWUCM6rIntF0gWrhim8XGkFY",
"config": {
"name": "test-azure-workflow",
"actions": [
{
"name": "azure-crawl",
"type": "azure_source",
"connection": "c_2bPmuxR3v2nP6Y0uvoQGbqDGVzQ",
"config": {
"container": "hybrid-bucket",
"glob_filter": "*.csv"
}
},
{
"name": "model-train-run",
"type": "gretel_model",
"input": "azure-crawl",
"config": {
"model": "synthetics/default",
"project_id": "proj_2b8hWUCM6rIntF0gWrhim8XGkFY",
"run_params": {
"params": {}
},
"training_data": "{azure-crawl.outputs.data}"
}
}
]
},
"config_text": "actions:\n- config:\n container: hybrid-bucket\n glob_filter: '*.csv'\n connection: c_2bPmuxR3v2nP6Y0uvoQGbqDGVzQ\n name: azure-crawl\n type: azure_source\n- config:\n model: synthetics/default\n
project_id: proj_2b8hWUCM6rIntF0gWrhim8XGkFY\n run_params:\n params: {}\n training_data: '{azure-crawl.outputs.data}'\n input: azure-crawl\n name: model-train-run\n type: gretel_model\nname: test-azure-workf
low\n",
"runner_mode": "RUNNER_MODE_HYBRID",
"created_by": "user_2XDTKB3iNGwBabq2q6geiLJSyml",
"updated_by": "",
"created_at": "2024-01-24 23:53:51.739054+00:00",
"updated_at": "2024-01-24 23:53:51.739054+00:00"
}
Use the workflow ID from the output of the previous step, then run this command to execute the workflow. If the workflow succeeds, the Azure Storage source connector is working as expected.
gretel workflows run --workflow-id w_2bQHYiWcVfPNx7p3XEfXFBTx4eq
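If you are scripting the whole flow, the create and run steps above can be chained so you never copy the workflow ID by hand. This is a sketch that assumes the create command emits the JSON shown above on stdout (with the INFO line going to stderr) and that jq is installed.
WORKFLOW_ID=$(gretel workflows create --config azure-workflow.yaml --runner_mode hybrid | jq -r '.id')
gretel workflows run --workflow-id "$WORKFLOW_ID"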