@shawngraham
Created July 11, 2025 15:45
For use with https://shawngraham.github.io/homecooked-history/hm-generator-site/enhanced.html; talk to your archaeological contexts! Import this into Google Colab to run.
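
For orientation: the notebook below turns each context record into an embedding vector and retrieves contexts by ranking those vectors against a query vector with cosine similarity (it calls scikit-learn's `cosine_similarity` for this). What follows is a minimal pure-Python sketch of that ranking step, not the notebook's own code. The function names `cosine` and `top_k_similar` and the three-dimensional toy vectors are invented for illustration; real all-MiniLM-L6-v2 embeddings have 384 dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_similar(query, vectors, k=3):
    """Return (index, score) pairs for the k vectors most similar to the query."""
    scored = [(i, cosine(query, v)) for i, v in enumerate(vectors)]
    # Highest similarity first; Python's sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy "embeddings" standing in for context records.
toy_vectors = [
    [1.0, 0.0, 0.0],  # e.g. a pit cut
    [0.9, 0.1, 0.0],  # e.g. the fill of that pit
    [0.0, 1.0, 0.0],  # e.g. a floor surface
    [0.0, 0.0, 0.0],  # an all-zero fallback vector
]

results = top_k_similar([1.0, 0.0, 0.0], toy_vectors, k=2)
print(results)
```

The zero-vector guard matters here because the notebook falls back to an all-zero vector when an embedding call fails; in this sketch such a record simply scores 0.0 instead of raising a division error.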
{
"cells": [
{
"cell_type": "markdown",
"id": "378aee38-9bf1-4824-9143-0b24b4addf4c",
"metadata": {
"id": "378aee38-9bf1-4824-9143-0b24b4addf4c"
},
"source": [
"This notebook represents a prototype of a RAG system for archaeological context data with the following capabilities:\n",
"\n",
"1. **Data Processing**: Loads and prepares archaeological context data with stratigraphic relationships\n",
"2. **Embedding Generation**: Creates semantic embeddings using Simon Willison's LLM package\n",
"3. **Similarity Search**: Finds contextually similar archaeological contexts\n",
"4. **RAG Query System**: Answers natural language questions about the archaeological record\n",
"5. **Visualization**: Provides insights into the embedding space and stratigraphic relationships\n",
"6. **Persistence**: Saves embeddings and results for future use\n",
"\n",
"### Features:\n",
"- **Stratigraphic Understanding**: Incorporates complex archaeological relationships\n",
"- **Semantic Search**: Find contexts by meaning, not just keywords\n",
"- **Interactive Querying**: Natural language interface for archaeological research\n",
"- **Visualization**: 2D projections of the embedding space\n",
"- **Extensible**: Easy to add new contexts and query types\n",
"\n",
"...well, that's the theory, anyway...\n",
"\n",
"### Potential Extensions:\n",
"- Implement temporal sequence analysis\n",
"- Include geographic/spatial relationships\n",
"- Add support for multiple sites and comparative analysis\n",
"- Integrate with archaeological databases and standards\n",
"- use Ollama, faster models?\n",
"\n",
"### Usage Tips:\n",
"1. Ensure your CSV has the expected columns or modify the `prepare_context_text` function\n",
"2. Set up your OpenAI API key: `llm keys set openai` if you use those models. By default we're not.\n",
"3. Adjust the embedding model based on your needs (quality vs. cost)\n",
"4. Experiment with different query formulations for best results\n",
"5. Use the visualization tools to understand your data's structure\n",
"\n",
"\n",
"The text generation by default is using Orca, and we're in a GPU environment, so that should be reasonably fast. But groq (nb NOT GROK) is very fast and if you have api key for that, swap that in instead."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "80288799-9f24-4d4a-a621-9ee75c22fec2",
"metadata": {
"id": "80288799-9f24-4d4a-a621-9ee75c22fec2"
},
"outputs": [],
"source": [
"# Install required packages ; takes a few moments\n",
"%%capture\n",
"!pip install llm pandas numpy scikit-learn matplotlib seaborn\n",
"!llm install llm-sentence-transformers #which gives us the all-MiniLM-L6-v2 embedding model as default\n",
"!llm install llm-gpt4all #which gives us a variety of other models; we'll want orca-mini-3b-gguf2-q4_0 because it's small ... but it is slow."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "a9f42c79-1fc1-444a-9fdc-9899c71ce44a",
"metadata": {
"id": "a9f42c79-1fc1-444a-9fdc-9899c71ce44a"
},
"outputs": [],
"source": [
"# Import required libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"import json\n",
"import sqlite3\n",
"from pathlib import Path\n",
"from typing import List, Dict, Any, Optional\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"from sklearn.metrics.pairwise import cosine_similarity\n",
"from sklearn.decomposition import PCA\n",
"import llm\n",
"from typing import List, Dict, Any, Optional\n",
"\n",
"# Configure pandas display options\n",
"pd.set_option('display.max_columns', None)\n",
"pd.set_option('display.width', None)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "8eda3d11-7eaf-4f93-ac16-d4a25e69dae6",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "8eda3d11-7eaf-4f93-ac16-d4a25e69dae6",
"outputId": "2257e2d2-b183-48f3-fc6a-ed99bc5e9118"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Created sample archaeological dataset with proper structure\n",
"\n",
"Sample of the data:\n",
" Context_ID Context_Type Description \\\n",
"0 C001 Layer Dark brown silty clay layer with frequent char... \n",
"1 C002 Cut Circular pit cut with steep sides and flat bot... \n",
"2 C003 Fill Light grey sandy fill of pit C002, containing ... \n",
"3 C004 Layer Compact yellow clay natural subsoil layer \n",
"4 C005 Layer Medieval mortar floor surface with tile fragments \n",
"\n",
" Earliest_Date_Year Earliest_Date_Era Latest_Date_Year Latest_Date_Era \\\n",
"0 100.0 AD 400.0 AD \n",
"1 50.0 AD 200.0 AD \n",
"2 50.0 AD 200.0 AD \n",
"3 NaN None NaN None \n",
"4 1200.0 AD 1400.0 AD \n",
"\n",
" Date_Type Phase_ID Phase_Name Group_ID Group_Name \\\n",
"0 Ceramic P1 Romano-British G1 Occupation deposits \n",
"1 Stratigraphic P1 Romano-British G2 Pit group \n",
"2 Stratigraphic P1 Romano-British G2 Pit group \n",
"3 Geological P0 Natural G0 Natural \n",
"4 Architectural P2 Medieval G3 Domestic structures \n",
"\n",
" Sub-Group_ID Sub-Group_Name Relationship_Type Related_Context_ID \\\n",
"0 SG1 General occupation overlies C004 \n",
"1 SG2 Pit cutting and filling cuts C004 \n",
"2 SG2 Pit cutting and filling fills C002 \n",
"3 None None cut by C002 \n",
"4 SG3 Floor surfaces overlies C004 \n",
"\n",
" Temporal_Conflict Is_Redundant \n",
"0 False False \n",
"1 False False \n",
"2 False False \n",
"3 False False \n",
"4 False False \n"
]
}
],
"source": [
"## 1. Data Loading and Preprocessing\n",
"\n",
"## We're assuming input data created using Graham's Harris Matrix Tool\n",
"## at https://shawngraham.github.io/homecooked-history/hm-generator-site/enhanced.html\n",
"\n",
"def load_archaeological_data(csv_path: str) -> pd.DataFrame:\n",
" \"\"\"\n",
" Load archaeological context data from CSV file.\n",
"\n",
" Expected columns:\n",
" - Context_ID: Unique identifier for each archaeological context\n",
" - Context_Type: Type of context (layer, cut, fill, etc.)\n",
" - Description: Detailed description of the context\n",
" - Earliest_Date_Year: Earliest possible date (year)\n",
" - Earliest_Date_Era: Era for earliest date (BC/AD)\n",
" - Latest_Date_Year: Latest possible date (year)\n",
" - Latest_Date_Era: Era for latest date (BC/AD)\n",
" - Date_Type: Type of dating evidence\n",
" - Phase_ID: Phase identifier\n",
" - Phase_Name: Phase name/description\n",
" - Group_ID: Group identifier\n",
" - Group_Name: Group name/description\n",
" - Sub-Group_ID: Sub-group identifier\n",
" - Sub-Group_Name: Sub-group name/description\n",
" - Relationship_Type: Type of stratigraphic relationship\n",
" - Related_Context_ID: ID of related context\n",
" - Temporal_Conflict: Whether there's a temporal conflict\n",
" - Is_Redundant: Whether the context is redundant\n",
" \"\"\"\n",
" try:\n",
" df = pd.read_csv(csv_path)\n",
" print(f\"Loaded {len(df)} archaeological contexts from {csv_path}\")\n",
" print(f\"Columns: {list(df.columns)}\")\n",
"\n",
" # Clean up any whitespace in column names\n",
" df.columns = df.columns.str.strip()\n",
"\n",
" return df\n",
" except Exception as e:\n",
" print(f\"Error loading data: {e}\")\n",
" return pd.DataFrame()\n",
"\n",
"def create_sample_data() -> pd.DataFrame:\n",
" \"\"\"\n",
" Create sample archaeological data matching the actual CSV structure.\n",
" This represents typical stratigraphic recording with proper temporal and hierarchical data.\n",
" \"\"\"\n",
" sample_data = {\n",
" 'Context_ID': ['C001', 'C002', 'C003', 'C004', 'C005', 'C006', 'C007', 'C008', 'C009', 'C010'],\n",
" 'Context_Type': ['Layer', 'Cut', 'Fill', 'Layer', 'Layer', 'Feature', 'Fill', 'Layer', 'Cut', 'Fill'],\n",
" 'Description': [\n",
" 'Dark brown silty clay layer with frequent charcoal flecks and pottery sherds',\n",
" 'Circular pit cut with steep sides and flat bottom, diameter 1.2m',\n",
" 'Light grey sandy fill of pit C002, containing burnt bone and flint tools',\n",
" 'Compact yellow clay natural subsoil layer',\n",
" 'Medieval mortar floor surface with tile fragments',\n",
" 'Stone-lined hearth with evidence of burning and ash deposits',\n",
" 'Ash and charcoal fill of hearth C006, rich in pottery and animal bone',\n",
" 'Post-medieval demolition layer with brick and tile rubble',\n",
" 'Rectangular foundation trench for stone wall',\n",
" 'Stone and mortar fill of foundation trench C009'\n",
" ],\n",
" 'Earliest_Date_Year': [100, 50, 50, None, 1200, 1250, 1250, 1600, 1200, 1200],\n",
" 'Earliest_Date_Era': ['AD', 'AD', 'AD', None, 'AD', 'AD', 'AD', 'AD', 'AD', 'AD'],\n",
" 'Latest_Date_Year': [400, 200, 200, None, 1400, 1350, 1350, 1700, 1400, 1400],\n",
" 'Latest_Date_Era': ['AD', 'AD', 'AD', None, 'AD', 'AD', 'AD', 'AD', 'AD', 'AD'],\n",
" 'Date_Type': ['Ceramic', 'Stratigraphic', 'Stratigraphic', 'Geological', 'Architectural', 'Radiocarbon', 'Stratigraphic', 'Ceramic', 'Architectural', 'Stratigraphic'],\n",
" 'Phase_ID': ['P1', 'P1', 'P1', 'P0', 'P2', 'P2', 'P2', 'P3', 'P2', 'P2'],\n",
" 'Phase_Name': ['Romano-British', 'Romano-British', 'Romano-British', 'Natural', 'Medieval', 'Medieval', 'Medieval', 'Post-Medieval', 'Medieval', 'Medieval'],\n",
" 'Group_ID': ['G1', 'G2', 'G2', 'G0', 'G3', 'G4', 'G4', 'G5', 'G6', 'G6'],\n",
" 'Group_Name': ['Occupation deposits', 'Pit group', 'Pit group', 'Natural', 'Domestic structures', 'Hearth activity', 'Hearth activity', 'Demolition', 'Wall construction', 'Wall construction'],\n",
" 'Sub-Group_ID': ['SG1', 'SG2', 'SG2', None, 'SG3', 'SG4', 'SG4', 'SG5', 'SG6', 'SG6'],\n",
" 'Sub-Group_Name': ['General occupation', 'Pit cutting and filling', 'Pit cutting and filling', None, 'Floor surfaces', 'Hearth construction and use', 'Hearth construction and use', 'Site abandonment', 'Foundation construction', 'Foundation construction'],\n",
" 'Relationship_Type': ['overlies', 'cuts', 'fills', 'cut by', 'overlies', 'built into', 'fills', 'overlies', 'cuts', 'fills'],\n",
" 'Related_Context_ID': ['C004', 'C004', 'C002', 'C002', 'C004', 'C005', 'C006', 'C005', 'C005', 'C009'],\n",
" 'Temporal_Conflict': [False, False, False, False, False, False, False, False, False, False],\n",
" 'Is_Redundant': [False, False, False, False, False, False, False, False, False, False]\n",
" }\n",
"\n",
" df = pd.DataFrame(sample_data)\n",
" print(\"Created sample archaeological dataset with proper structure\")\n",
" return df\n",
"\n",
"# Load or create data\n",
"# Uncomment the next line to load your own CSV file\n",
"# df = load_archaeological_data('your_archaeological_data.csv')\n",
"\n",
"# For demonstration, we'll use sample data\n",
"df = create_sample_data()\n",
"print(\"\\nSample of the data:\")\n",
"print(df.head())"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "6ab82d66-0d6e-472b-8ae8-b85a295acdc2",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "6ab82d66-0d6e-472b-8ae8-b85a295acdc2",
"outputId": "f65f0147-4b86-4e1c-cacb-fa3ae719d5f5"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Sample prepared text:\n",
"Context C001, Type: Layer | Description: Dark brown silty clay layer with frequent charcoal flecks and pottery sherds | Dated 100.0 AD to 400.0 AD | Dating method: Ceramic | Phase: Romano-British | Group: Occupation deposits | Sub-Group: General occupation | Stratigraphic relationship: overlies C004\n",
"\n",
"Average text length: 273 characters\n"
]
}
],
"source": [
"## 2. Text Preparation for Embeddings\n",
"def prepare_context_text(row: pd.Series) -> str:\n",
" \"\"\"\n",
" Combine all relevant fields into a single text string for embedding.\n",
" This creates a comprehensive representation of each archaeological context.\n",
" \"\"\"\n",
" text_parts = []\n",
"\n",
" # Add context identification\n",
" text_parts.append(f\"Context {row['Context_ID']}, Type: {row['Context_Type']}\")\n",
"\n",
" # Add description\n",
" if pd.notna(row['Description']):\n",
" text_parts.append(f\"Description: {row['Description']}\")\n",
"\n",
" # Add dating information\n",
" dating_info = []\n",
" if pd.notna(row['Earliest_Date_Year']) and pd.notna(row['Latest_Date_Year']):\n",
" earliest = f\"{row['Earliest_Date_Year']} {row['Earliest_Date_Era']}\"\n",
" latest = f\"{row['Latest_Date_Year']} {row['Latest_Date_Era']}\"\n",
" dating_info.append(f\"Dated {earliest} to {latest}\")\n",
" elif pd.notna(row['Earliest_Date_Year']):\n",
" dating_info.append(f\"Earliest date: {row['Earliest_Date_Year']} {row['Earliest_Date_Era']}\")\n",
" elif pd.notna(row['Latest_Date_Year']):\n",
" dating_info.append(f\"Latest date: {row['Latest_Date_Year']} {row['Latest_Date_Era']}\")\n",
"\n",
" if pd.notna(row['Date_Type']):\n",
" dating_info.append(f\"Dating method: {row['Date_Type']}\")\n",
"\n",
" if dating_info:\n",
" text_parts.append(\" | \".join(dating_info))\n",
"\n",
" # Add hierarchical organization\n",
" hierarchy_parts = []\n",
" if pd.notna(row['Phase_Name']):\n",
" hierarchy_parts.append(f\"Phase: {row['Phase_Name']}\")\n",
" if pd.notna(row['Group_Name']):\n",
" hierarchy_parts.append(f\"Group: {row['Group_Name']}\")\n",
" if pd.notna(row['Sub-Group_Name']):\n",
" hierarchy_parts.append(f\"Sub-Group: {row['Sub-Group_Name']}\")\n",
"\n",
" if hierarchy_parts:\n",
" text_parts.append(\" | \".join(hierarchy_parts))\n",
"\n",
" # Add stratigraphic relationship\n",
" if pd.notna(row['Relationship_Type']) and pd.notna(row['Related_Context_ID']):\n",
" text_parts.append(f\"Stratigraphic relationship: {row['Relationship_Type']} {row['Related_Context_ID']}\")\n",
"\n",
" # Add conflict and redundancy flags\n",
" flags = []\n",
" if row.get('Temporal_Conflict', False):\n",
" flags.append(\"Temporal conflict present\")\n",
" if row.get('Is_Redundant', False):\n",
" flags.append(\"Marked as redundant\")\n",
"\n",
" if flags:\n",
" text_parts.append(\" | \".join(flags))\n",
"\n",
" return \" | \".join(text_parts)\n",
"\n",
"def add_prepared_text(df: pd.DataFrame) -> pd.DataFrame:\n",
" \"\"\"\n",
" Add prepared text column to dataframe for embedding generation.\n",
" \"\"\"\n",
" df_copy = df.copy()\n",
" df_copy['prepared_text'] = df_copy.apply(prepare_context_text, axis=1)\n",
"\n",
" print(\"Sample prepared text:\")\n",
" print(df_copy['prepared_text'].iloc[0])\n",
" print(f\"\\nAverage text length: {df_copy['prepared_text'].str.len().mean():.0f} characters\")\n",
"\n",
" return df_copy\n",
"\n",
"# Prepare text for embeddings\n",
"df = add_prepared_text(df)"
]
},
{
"cell_type": "code",
"source": [
"# Default when I built this was the Orca model, because it is small and can be run locally\n",
"# But if we're willing to pass our materials to someone else's servers, then\n",
"# groq (the French company, not the Elmo's awful thing) returns generated\n",
"# text very quickly. Install and set keys here:\n",
"!llm install llm-groq\n",
"!llm keys set groq"
],
"metadata": {
"id": "vJixxmExBHzr"
},
"id": "vJixxmExBHzr",
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# The following models are now available; in the set_models() you can change to your desired one by name.\n",
"!llm models\n"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "0jEsDbaXFQxL",
"outputId": "2dffcc8c-9e07-41ac-cf1e-f461cfe04916"
},
"id": "0jEsDbaXFQxL",
"execution_count": 43,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"OpenAI Chat: gpt-4o (aliases: 4o)\n",
"OpenAI Chat: chatgpt-4o-latest (aliases: chatgpt-4o)\n",
"OpenAI Chat: gpt-4o-mini (aliases: 4o-mini)\n",
"OpenAI Chat: gpt-4o-audio-preview\n",
"OpenAI Chat: gpt-4o-audio-preview-2024-12-17\n",
"OpenAI Chat: gpt-4o-audio-preview-2024-10-01\n",
"OpenAI Chat: gpt-4o-mini-audio-preview\n",
"OpenAI Chat: gpt-4o-mini-audio-preview-2024-12-17\n",
"OpenAI Chat: gpt-4.1 (aliases: 4.1)\n",
"OpenAI Chat: gpt-4.1-mini (aliases: 4.1-mini)\n",
"OpenAI Chat: gpt-4.1-nano (aliases: 4.1-nano)\n",
"OpenAI Chat: gpt-3.5-turbo (aliases: 3.5, chatgpt)\n",
"OpenAI Chat: gpt-3.5-turbo-16k (aliases: chatgpt-16k, 3.5-16k)\n",
"OpenAI Chat: gpt-4 (aliases: 4, gpt4)\n",
"OpenAI Chat: gpt-4-32k (aliases: 4-32k)\n",
"OpenAI Chat: gpt-4-1106-preview\n",
"OpenAI Chat: gpt-4-0125-preview\n",
"OpenAI Chat: gpt-4-turbo-2024-04-09\n",
"OpenAI Chat: gpt-4-turbo (aliases: gpt-4-turbo-preview, 4-turbo, 4t)\n",
"OpenAI Chat: gpt-4.5-preview-2025-02-27\n",
"OpenAI Chat: gpt-4.5-preview (aliases: gpt-4.5)\n",
"OpenAI Chat: o1\n",
"OpenAI Chat: o1-2024-12-17\n",
"OpenAI Chat: o1-preview\n",
"OpenAI Chat: o1-mini\n",
"OpenAI Chat: o3-mini\n",
"OpenAI Chat: o3\n",
"OpenAI Chat: o4-mini\n",
"OpenAI Completion: gpt-3.5-turbo-instruct (aliases: 3.5-instruct, chatgpt-instruct)\n",
"gpt4all: orca-mini-3b-gguf2-q4_0 - Mini Orca (Small), 1.84GB download, needs 4GB RAM (installed)\n",
"gpt4all: all-MiniLM-L6-v2-f16 - SBert, 43.76MB download, needs 1GB RAM\n",
"gpt4all: all-MiniLM-L6-v2 - SBert, 43.82MB download, needs 1GB RAM\n",
"gpt4all: nomic-embed-text-v1 - Nomic Embed Text v1, 261.58MB download, needs 1GB RAM\n",
"gpt4all: nomic-embed-text-v1 - Nomic Embed Text v1.5, 261.58MB download, needs 1GB RAM\n",
"gpt4all: Llama-3 - Llama 3.2 1B Instruct, 737.21MB download, needs 2GB RAM\n",
"gpt4all: qwen2-1_5b-instruct-q4_0 - Qwen2-1.5B-Instruct, 894.10MB download, needs 3GB RAM\n",
"gpt4all: DeepSeek-R1-Distill-Qwen-1 - DeepSeek-R1-Distill-Qwen-1.5B, 1019.29MB download, needs 3GB RAM\n",
"gpt4all: Llama-3 - Llama 3.2 3B Instruct, 1.79GB download, needs 4GB RAM\n",
"gpt4all: replit-code-v1_5-3b-newbpe-q4_0 - Replit, 1.82GB download, needs 4GB RAM\n",
"gpt4all: Phi-3-mini-4k-instruct - Phi-3 Mini Instruct, 2.03GB download, needs 4GB RAM\n",
"gpt4all: mpt-7b-chat - MPT Chat, 3.54GB download, needs 8GB RAM\n",
"gpt4all: orca-2-7b - Orca 2 (Medium), 3.56GB download, needs 8GB RAM\n",
"gpt4all: rift-coder-v0-7b-q4_0 - Rift coder, 3.56GB download, needs 8GB RAM\n",
"gpt4all: mpt-7b-chat-newbpe-q4_0 - MPT Chat, 3.64GB download, needs 8GB RAM\n",
"gpt4all: em_german_mistral_v01 - EM German Mistral, 3.83GB download, needs 8GB RAM\n",
"gpt4all: mistral-7b-instruct-v0 - Mistral Instruct, 3.83GB download, needs 8GB RAM\n",
"gpt4all: ghost-7b-v0 - Ghost 7B v0.9.1, 3.83GB download, needs 8GB RAM\n",
"gpt4all: Nous-Hermes-2-Mistral-7B-DPO - Nous Hermes 2 Mistral DPO, 3.83GB download, needs 8GB RAM\n",
"gpt4all: mistral-7b-openorca - Mistral OpenOrca, 3.83GB download, needs 8GB RAM\n",
"gpt4all: gpt4all-falcon-newbpe-q4_0 - GPT4All Falcon, 3.92GB download, needs 8GB RAM\n",
"gpt4all: qwen2 - Reasoner v1, 4.13GB download, needs 8GB RAM\n",
"gpt4all: DeepSeek-R1-Distill-Qwen-7B-Q4_0 - DeepSeek-R1-Distill-Qwen-7B, 4.14GB download, needs 8GB RAM\n",
"gpt4all: Meta-Llama-3 - Llama 3.1 8B Instruct 128k, 4.34GB download, needs 8GB RAM\n",
"gpt4all: Meta-Llama-3-8B-Instruct - Llama 3 8B Instruct, 4.34GB download, needs 8GB RAM\n",
"gpt4all: DeepSeek-R1-Distill-Llama-8B-Q4_0 - DeepSeek-R1-Distill-Llama-8B, 4.35GB download, needs 8GB RAM\n",
"gpt4all: gpt4all-13b-snoozy-q4_0 - Snoozy, 6.86GB download, needs 16GB RAM\n",
"gpt4all: wizardlm-13b-v1 - Wizard v1.2, 6.86GB download, needs 16GB RAM\n",
"gpt4all: orca-2-13b - Orca 2 (Full), 6.86GB download, needs 16GB RAM\n",
"gpt4all: nous-hermes-llama2-13b - Hermes, 6.86GB download, needs 16GB RAM\n",
"gpt4all: DeepSeek-R1-Distill-Qwen-14B-Q4_0 - DeepSeek-R1-Distill-Qwen-14B, 7.96GB download, needs 16GB RAM\n",
"gpt4all: starcoder-newbpe-q4_0 - Starcoder, 8.37GB download, needs 4GB RAM\n",
"LLMGroq: groq/whisper-large-v3\n",
"LLMGroq: groq/playai-tts\n",
"LLMGroq: groq/meta-llama/llama-4-scout-17b-16e-instruct\n",
"LLMGroq: groq/gemma2-9b-it (aliases: groq-gemma2)\n",
"LLMGroq: groq/qwen/qwen3-32b\n",
"LLMGroq: groq/llama-3.3-70b-versatile (aliases: groq-llama-3.3-70b)\n",
"LLMGroq: groq/meta-llama/llama-prompt-guard-2-22m\n",
"LLMGroq: groq/meta-llama/llama-prompt-guard-2-86m\n",
"LLMGroq: groq/qwen-qwq-32b\n",
"LLMGroq: groq/playai-tts-arabic\n",
"LLMGroq: groq/llama3-8b-8192 (aliases: groq-llama3)\n",
"LLMGroq: groq/llama3-70b-8192 (aliases: groq-llama3-70b)\n",
"LLMGroq: groq/compound-beta-mini\n",
"LLMGroq: groq/mistral-saba-24b\n",
"LLMGroq: groq/distil-whisper-large-v3-en\n",
"LLMGroq: groq/compound-beta\n",
"LLMGroq: groq/llama-3.1-8b-instant (aliases: groq-llama3.1-8b)\n",
"LLMGroq: groq/allam-2-7b\n",
"LLMGroq: groq/whisper-large-v3-turbo\n",
"LLMGroq: groq/meta-llama/llama-4-maverick-17b-128e-instruct\n",
"LLMGroq: groq/meta-llama/llama-guard-4-12b\n",
"LLMGroq: groq/deepseek-r1-distill-llama-70b\n",
"Default: gpt-4o-mini\n"
]
}
]
},
{
"cell_type": "code",
"execution_count": 50,
"id": "67da3265-9330-4a6b-aa47-c95e48fc7a13",
"metadata": {
"id": "67da3265-9330-4a6b-aa47-c95e48fc7a13"
},
"outputs": [],
"source": [
"## 3. Initialize LLM and Generate Embeddings\n",
"\n",
"\n",
"def generate_embeddings_via_cli(text: str, model_name: str) -> List[float]:\n",
" \"\"\"\n",
" Generate embeddings using the LLM command line interface.\n",
" This works around API issues with sentence-transformers models.\n",
" \"\"\"\n",
" import subprocess\n",
" import json\n",
"\n",
" try:\n",
" # Use subprocess to call llm embed\n",
" cmd = [\"llm\", \"embed\", \"-c\", text, \"-m\", model_name]\n",
" result = subprocess.run(cmd, capture_output=True, text=True, check=True)\n",
"\n",
" # Parse the output - it should be a JSON array of floats\n",
" embedding = json.loads(result.stdout.strip())\n",
" return embedding\n",
"\n",
" except subprocess.CalledProcessError as e:\n",
" print(f\"CLI embedding failed: {e.stderr}\")\n",
" return None\n",
" except json.JSONDecodeError as e:\n",
" print(f\"Failed to parse embedding output: {e}\")\n",
" print(f\"Raw output: {result.stdout}\")\n",
" return None\n",
" except Exception as e:\n",
" print(f\"Unexpected error generating embedding: {e}\")\n",
" return None\n",
"\n",
"def set_chat(text: str, model_name: str) -> List[float]:\n",
" \"\"\"\n",
" Set up cmd line use of the chat model.\n",
" \"\"\"\n",
" import subprocess\n",
" cmd = [\"llm\", \"-m\", model_name, text]\n",
" result = subprocess.run(cmd, capture_output=True, text=True, check=True)\n",
" return result\n",
"\n",
"def setup_llm_models():\n",
" \"\"\"\n",
" Set up both embedding and chat models using local LLM plugins.\n",
" Uses sentence-transformers for embeddings and orca-mini for text generation.\n",
" Both models use CLI interface for consistency.\n",
" \"\"\"\n",
" embedding_model = None\n",
" chat_model = None\n",
"\n",
" try:\n",
" # List available models using CLI\n",
" print(\"Available models:\")\n",
" import subprocess\n",
" try:\n",
" result = subprocess.run([\"llm\", \"models\"], capture_output=True, text=True, check=True)\n",
" print(result.stdout)\n",
" except subprocess.CalledProcessError:\n",
" print(\"Could not list models via CLI\")\n",
"\n",
" # For embeddings, we'll use the command line interface since the Python API\n",
" # sometimes has issues with sentence-transformers models\n",
" print(f\"\\nUsing embedding model: sentence-transformers/all-MiniLM-L6-v2 (via CLI)\")\n",
" embedding_model = \"sentence-transformers/all-MiniLM-L6-v2\" # Store as string for CLI usage\n",
"\n",
" # Initialize chat model (orca-mini) - store as string for CLI usage\n",
" # use orca if you don't have a groq key; otherwise groq is faster\n",
" #chat_model_name = \"orca-mini-3b-gguf2-q4_0\"\n",
" chat_model_name = \"groq-llama-3.3-70b\"\n",
" try:\n",
" # Test if the chat model is available by trying to use it\n",
" test_result = set_chat(\"Hello\", chat_model_name)\n",
" if test_result.returncode == 0:\n",
" chat_model = chat_model_name # Store as string, not model object\n",
" print(f\"Using chat model: {chat_model_name} (via CLI)\")\n",
" else:\n",
" raise Exception(f\"Chat model test failed: {test_result.stderr}\")\n",
" except Exception as e:\n",
" print(f\"Warning: Could not load chat model: {e}\")\n",
" print(\"Make sure you have installed the gpt4all plugin and downloaded orca-mini:\")\n",
" print(\" llm install llm-gpt4all\")\n",
" print(\" llm gpt4all download orca-mini-3b-gguf2-q4_0\")\n",
" chat_model = None # Ensure it's None if there's an error\n",
"\n",
" return embedding_model, chat_model\n",
"\n",
" except Exception as e:\n",
" print(f\"Error setting up LLM models: {e}\")\n",
" return None, None\n",
"\n",
"\n",
"def generate_embeddings(df: pd.DataFrame, embedding_model) -> pd.DataFrame:\n",
" \"\"\"\n",
" Generate embeddings for all prepared context texts using sentence-transformers via CLI.\n",
" \"\"\"\n",
" if embedding_model is None:\n",
" print(\"No embedding model available for embedding generation\")\n",
" print(\"Creating sample embeddings for demonstration...\")\n",
" # Create sample embeddings with correct dimension for all-MiniLM-L6-v2 (384)\n",
" np.random.seed(42)\n",
" sample_embeddings = [np.random.normal(0, 1, 384).tolist() for _ in range(len(df))]\n",
" df_copy = df.copy()\n",
" df_copy['embedding'] = sample_embeddings\n",
" return df_copy\n",
"\n",
" embeddings = []\n",
"\n",
" print(\"Generating embeddings using sentence-transformers via CLI...\")\n",
" for i, text in enumerate(df['prepared_text']):\n",
" try:\n",
" # Generate embedding using CLI interface\n",
" embedding = generate_embeddings_via_cli(text, embedding_model)\n",
"\n",
" if embedding is not None:\n",
" embeddings.append(embedding)\n",
" else:\n",
" # Use zero vector as fallback (384 dimensions for all-MiniLM-L6-v2)\n",
" print(f\" Failed to generate embedding for context {i+1}, using fallback\")\n",
" embeddings.append([0.0] * 384)\n",
"\n",
" if (i + 1) % 5 == 0: # Progress indicator\n",
" print(f\" Processed {i + 1}/{len(df)} contexts\")\n",
"\n",
" except Exception as e:\n",
" print(f\"Error generating embedding for context {i}: {e}\")\n",
" # Use zero vector as fallback (384 dimensions for all-MiniLM-L6-v2)\n",
" embeddings.append([0.0] * 384)\n",
"\n",
" # Add embeddings to dataframe\n",
" df_copy = df.copy()\n",
" df_copy['embedding'] = embeddings\n",
"\n",
" print(f\"\\nGenerated {len(embeddings)} embeddings\")\n",
" print(f\"Embedding dimension: {len(embeddings[0]) if embeddings else 0}\")\n",
"\n",
" return df_copy\n",
"\n"
]
},
{
"cell_type": "code",
"source": [
"def test_embedding_model(embedding_model):\n",
" \"\"\"\n",
" Test the embedding model with a sample text to verify it's working.\n",
" \"\"\"\n",
" if embedding_model is None:\n",
" print(\"No embedding model to test\")\n",
" return False\n",
"\n",
" try:\n",
" test_text = \"This is a test archaeological context for embedding generation.\"\n",
"\n",
" # Use CLI interface since embedding_model is now a string\n",
" test_embedding = generate_embeddings_via_cli(test_text, embedding_model)\n",
"\n",
" if test_embedding is not None:\n",
" print(f\"Embedding model test successful!\")\n",
" print(f\"Test embedding dimension: {len(test_embedding)}\")\n",
" print(f\"Sample values: {test_embedding[:5]}...\")\n",
" return True\n",
" else:\n",
" print(f\"Embedding model test failed: No embedding returned\")\n",
" return False\n",
"\n",
" except Exception as e:\n",
" print(f\"Embedding model test failed: {e}\")\n",
" return False\n",
"\n",
"def test_chat_model(chat_model):\n",
" \"\"\"\n",
" Test the chat model with a simple prompt to verify it's working.\n",
" Uses set_chat function for consistency.\n",
" \"\"\"\n",
" if chat_model is None:\n",
" print(\"No chat model to test\")\n",
" return False\n",
"\n",
" try:\n",
" test_prompt = \"What is archaeology?\"\n",
" print(\"Testing chat model with sample prompt...\")\n",
"\n",
" # Use set_chat function to test the model\n",
" result = set_chat(test_prompt, chat_model)\n",
"\n",
" if result.returncode == 0:\n",
" response_text = result.stdout.strip()\n",
" print(f\"Chat model test successful!\")\n",
" print(f\"Sample response: {response_text[:200]}...\")\n",
" return True\n",
" else:\n",
" print(f\"Chat model test failed: {result.stderr}\")\n",
" return False\n",
"\n",
" except Exception as e:\n",
" print(f\"Chat model test failed with unexpected error: {e}\")\n",
" return False"
],
"metadata": {
"id": "WlGVTItP5t-V"
},
"id": "WlGVTItP5t-V",
"execution_count": 51,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Set up both models\n",
"embedding_model, chat_model = setup_llm_models()\n",
"\n",
"# Test the models\n",
"print(\"\\n\" + \"=\"*50)\n",
"print(\"TESTING MODELS\")\n",
"print(\"=\"*50)\n",
"\n",
"embedding_works = test_embedding_model(embedding_model)\n",
"chat_works = test_chat_model(chat_model)\n",
"\n",
"# Generate embeddings\n",
"print(\"\\n\" + \"=\"*50)\n",
"print(\"GENERATING EMBEDDINGS\")\n",
"print(\"=\"*50)\n",
"\n"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "6yi7e2sK5z_9",
"outputId": "44b1c396-93c2-4c17-88c4-4f8f113ec799"
},
"id": "6yi7e2sK5z_9",
"execution_count": 11,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Available models:\n",
" - gpt-4o\n",
" - chatgpt-4o-latest\n",
" - gpt-4o-mini\n",
" - gpt-4o-audio-preview\n",
" - gpt-4o-audio-preview-2024-12-17\n",
" - gpt-4o-audio-preview-2024-10-01\n",
" - gpt-4o-mini-audio-preview\n",
" - gpt-4o-mini-audio-preview-2024-12-17\n",
" - gpt-4.1\n",
" - gpt-4.1-mini\n",
" - gpt-4.1-nano\n",
" - gpt-3.5-turbo\n",
" - gpt-3.5-turbo-16k\n",
" - gpt-4\n",
" - gpt-4-32k\n",
" - gpt-4-1106-preview\n",
" - gpt-4-0125-preview\n",
" - gpt-4-turbo-2024-04-09\n",
" - gpt-4-turbo\n",
" - gpt-4.5-preview-2025-02-27\n",
" - gpt-4.5-preview\n",
" - o1\n",
" - o1-2024-12-17\n",
" - o1-preview\n",
" - o1-mini\n",
" - o3-mini\n",
" - o3\n",
" - o4-mini\n",
" - gpt-3.5-turbo-instruct\n",
" - all-MiniLM-L6-v2-f16\n",
" - all-MiniLM-L6-v2\n",
" - nomic-embed-text-v1\n",
" - nomic-embed-text-v1\n",
" - Llama-3\n",
" - qwen2-1_5b-instruct-q4_0\n",
" - DeepSeek-R1-Distill-Qwen-1\n",
" - Llama-3\n",
" - replit-code-v1_5-3b-newbpe-q4_0\n",
" - orca-mini-3b-gguf2-q4_0\n",
" - Phi-3-mini-4k-instruct\n",
" - mpt-7b-chat\n",
" - orca-2-7b\n",
" - rift-coder-v0-7b-q4_0\n",
" - mpt-7b-chat-newbpe-q4_0\n",
" - em_german_mistral_v01\n",
" - mistral-7b-instruct-v0\n",
" - ghost-7b-v0\n",
" - Nous-Hermes-2-Mistral-7B-DPO\n",
" - mistral-7b-openorca\n",
" - gpt4all-falcon-newbpe-q4_0\n",
" - qwen2\n",
" - DeepSeek-R1-Distill-Qwen-7B-Q4_0\n",
" - Meta-Llama-3\n",
" - Meta-Llama-3-8B-Instruct\n",
" - DeepSeek-R1-Distill-Llama-8B-Q4_0\n",
" - gpt4all-13b-snoozy-q4_0\n",
" - wizardlm-13b-v1\n",
" - orca-2-13b\n",
" - nous-hermes-llama2-13b\n",
" - DeepSeek-R1-Distill-Qwen-14B-Q4_0\n",
" - starcoder-newbpe-q4_0\n",
"\n",
"Using embedding model: sentence-transformers/all-MiniLM-L6-v2 (via CLI)\n",
"Using chat model: orca-mini-3b-gguf2-q4_0\n",
"\n",
"==================================================\n",
"TESTING MODELS\n",
"==================================================\n",
"Embedding model test successful!\n",
"Test embedding dimension: 384\n",
"Sample values: [-0.030803386121988297, 0.06299223005771637, 0.015342089347541332, 0.0027869814075529575, -0.01052077952772379]...\n",
"Testing chat model with sample prompt...\n",
"Chat model test successful!\n",
"Sample response: Archaeology is the scientific study of human history and prehistory through excavation, analysis, and interpretation of artifacts, structures, and other physical remains. It involves the examination ...\n",
"\n",
"==================================================\n",
"GENERATING EMBEDDINGS\n",
"==================================================\n",
"Generating embeddings using sentence-transformers via CLI...\n",
" Processed 5/10 contexts\n",
" Processed 10/10 contexts\n",
"\n",
"Generated 10 embeddings\n",
"Embedding dimension: 384\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"# If the model tests above succeeded, everything is set up,\n",
"# so we can now turn each context into an embedding.\n",
"# This might take some time.\n",
"df = generate_embeddings(df, embedding_model)"
],
"metadata": {
"id": "WQsSW7dm6flW"
},
"id": "WQsSW7dm6flW",
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": 54,
"id": "55bb2012-adb6-450c-97e3-d2f2f47df33d",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 843
},
"id": "55bb2012-adb6-450c-97e3-d2f2f47df33d",
"outputId": "74527bbb-63b0-4d70-dcae-3bff44d03816"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Initialized embedding store with 10 contexts\n"
]
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<Figure size 1200x800 with 1 Axes>"
],
"image/png": "\n"
},
"metadata": {}
},
{
"output_type": "stream",
"name": "stdout",
"text": [
"PCA explains 46.94% of total variance\n"
]
}
],
"source": [
"## 4. Embedding Storage and Retrieval System\n",
"\n",
"class ArchaeologicalEmbeddingStore:\n",
"    \"\"\"\n",
"    A class to store and retrieve archaeological context embeddings.\n",
"    \"\"\"\n",
"\n",
"    def __init__(self, df: pd.DataFrame):\n",
"        self.df = df.copy()\n",
"        self.embeddings_matrix = np.array(df['embedding'].tolist())\n",
"        print(f\"Initialized embedding store with {len(df)} contexts\")\n",
"\n",
"    def search_similar_contexts(self, query_embedding: np.ndarray, top_k: int = 5) -> List[Dict]:\n",
"        \"\"\"\n",
"        Find the most similar contexts to a query embedding.\n",
"        \"\"\"\n",
"        # Cosine similarity between the query and every stored context\n",
"        similarities = cosine_similarity([query_embedding], self.embeddings_matrix)[0]\n",
"\n",
"        # Indices of the top_k highest scores, best match first\n",
"        top_indices = np.argsort(similarities)[-top_k:][::-1]\n",
"\n",
"        results = []\n",
"        for idx in top_indices:\n",
"            context_data = self.df.iloc[idx].to_dict()\n",
"            context_data['similarity_score'] = similarities[idx]\n",
"            results.append(context_data)\n",
"\n",
"        return results\n",
"\n",
"    def visualize_embeddings(self, sample_size: int = 100):\n",
"        \"\"\"\n",
"        Create a 2D visualization of the embeddings using PCA.\n",
"        \"\"\"\n",
"        # Sample data if too large\n",
"        if len(self.df) > sample_size:\n",
"            sample_idx = np.random.choice(len(self.df), sample_size, replace=False)\n",
"            embeddings_sample = self.embeddings_matrix[sample_idx]\n",
"            contexts_sample = self.df.iloc[sample_idx]\n",
"        else:\n",
"            embeddings_sample = self.embeddings_matrix\n",
"            contexts_sample = self.df\n",
"\n",
"        # Reduce to 2D using PCA\n",
"        pca = PCA(n_components=2, random_state=42)\n",
"        embeddings_2d = pca.fit_transform(embeddings_sample)\n",
"\n",
"        # Create visualization\n",
"        plt.figure(figsize=(12, 8))\n",
"\n",
"        # Color by context type\n",
"        context_types = contexts_sample['Context_Type'].unique()\n",
"        colors = plt.cm.Set3(np.linspace(0, 1, len(context_types)))\n",
"\n",
"        for i, context_type in enumerate(context_types):\n",
"            # Plain boolean array, so it indexes the PCA rows by position\n",
"            mask = (contexts_sample['Context_Type'] == context_type).to_numpy()\n",
"            plt.scatter(embeddings_2d[mask, 0], embeddings_2d[mask, 1],\n",
"                        c=[colors[i]], label=context_type, alpha=0.7, s=100)\n",
"\n",
"        plt.title('Archaeological Context Embeddings (2D PCA Projection)')\n",
"        plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.2%} variance)')\n",
"        plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.2%} variance)')\n",
"        plt.legend()\n",
"        plt.grid(True, alpha=0.3)\n",
"\n",
"        # Add context labels. Use positional indices so each label stays\n",
"        # aligned with its PCA row even when the DataFrame was sampled or\n",
"        # its index does not run 0..n-1.\n",
"        for pos, context_id in enumerate(contexts_sample['Context_ID']):\n",
"            plt.annotate(context_id,\n",
"                         (embeddings_2d[pos, 0], embeddings_2d[pos, 1]),\n",
"                         xytext=(5, 5), textcoords='offset points',\n",
"                         fontsize=8, alpha=0.7)\n",
"\n",
"        plt.tight_layout()\n",
"        plt.show()\n",
"\n",
"        print(f\"PCA explains {pca.explained_variance_ratio_.sum():.2%} of total variance\")\n",
"\n",
"# Initialize the embedding store\n",
"embedding_store = ArchaeologicalEmbeddingStore(df)\n",
"\n",
"# Visualize the embeddings\n",
"embedding_store.visualize_embeddings()"
]
},
{
"cell_type": "code",
"execution_count": 58,
"id": "63ca011a-9447-4082-9390-94b1d5c3608d",
"metadata": {
"id": "63ca011a-9447-4082-9390-94b1d5c3608d"
},
"outputs": [],
"source": [
"## 5. RAG Query System\n",
"\n",
"class ArchaeologicalRAGSystem:\n",
"    \"\"\"\n",
"    A complete RAG system for querying archaeological context data.\n",
"    \"\"\"\n",
"\n",
"    def __init__(self, embedding_store, embedding_model, chat_model):\n",
"        self.embedding_store = embedding_store\n",
"        self.embedding_model = embedding_model\n",
"        self.chat_model = chat_model\n",
"\n",
"\n",
"    def query(self, user_query: str, top_k: int = 5, include_context: bool = True) -> Dict[str, Any]:\n",
"        \"\"\"\n",
"        Process a natural language query about archaeological contexts.\n",
"\n",
"        top_k defaults to 5 retrieved contexts. A useful improvement would be\n",
"        for the code to determine the appropriate number of contexts itself.\n",
"        \"\"\"\n",
"        results = {\n",
"            'query': user_query,\n",
"            'retrieved_contexts': [],\n",
"            'answer': '',\n",
"            'sources': []\n",
"        }\n",
"\n",
"        try:\n",
"            # Generate embedding for the query\n",
"            if self.embedding_model:\n",
"                if isinstance(self.embedding_model, str):\n",
"                    # Using CLI interface\n",
"                    query_embedding = generate_embeddings_via_cli(user_query, self.embedding_model)\n",
"                    if query_embedding is None or len(query_embedding) == 0:\n",
"                        # Nothing came back from the CLI. Fall back to a random vector so\n",
"                        # the demo keeps running, but the retrieved contexts are meaningless.\n",
"                        print(\"Warning: query embedding failed; using a random vector\")\n",
"                        query_embedding = np.random.normal(0, 1, 384)\n",
"                    else:\n",
"                        query_embedding = np.array(query_embedding)\n",
"                else:\n",
"                    # Using Python API\n",
"                    query_embedding = np.array(self.embedding_model.embed(user_query))\n",
"            else:\n",
"                # Fallback: use random embedding for demonstration\n",
"                query_embedding = np.random.normal(0, 1, 384)  # all-MiniLM-L6-v2 dimension\n",
"\n",
"            # Retrieve similar contexts\n",
"            similar_contexts = self.embedding_store.search_similar_contexts(\n",
"                query_embedding, top_k=top_k\n",
"            )\n",
"\n",
"            results['retrieved_contexts'] = similar_contexts\n",
"\n",
"            # Prepare context for generation\n",
"            context_text = self._prepare_context_for_generation(similar_contexts)\n",
"\n",
"            # Generate answer using RAG\n",
"            if self.chat_model:\n",
"                answer = self._generate_answer(user_query, context_text)\n",
"                results['answer'] = answer\n",
"            else:\n",
"                results['answer'] = \"Chat model not available. Here are the most relevant contexts found:\"\n",
"\n",
"            # Add sources\n",
"            results['sources'] = [ctx['Context_ID'] for ctx in similar_contexts]\n",
"\n",
"        except Exception as e:\n",
"            results['answer'] = f\"Error processing query: {e}\"\n",
"\n",
"        return results\n",
"\n",
"    def _prepare_context_for_generation(self, contexts: List[Dict]) -> str:\n",
"        \"\"\"\n",
"        Prepare retrieved contexts for the generation step.\n",
"        \"\"\"\n",
"        context_parts = []\n",
"\n",
"        for i, ctx in enumerate(contexts, 1):\n",
"            # Format dating information\n",
"            dating_info = \"\"\n",
"            if pd.notna(ctx.get('Earliest_Date_Year')) and pd.notna(ctx.get('Latest_Date_Year')):\n",
"                earliest = f\"{ctx['Earliest_Date_Year']} {ctx['Earliest_Date_Era']}\"\n",
"                latest = f\"{ctx['Latest_Date_Year']} {ctx['Latest_Date_Era']}\"\n",
"                dating_info = f\"Dated {earliest} to {latest}\"\n",
"\n",
"            context_part = f\"\"\"\n",
"Context {i} (ID: {ctx['Context_ID']}, Similarity: {ctx['similarity_score']:.3f}):\n",
"- Type: {ctx['Context_Type']}\n",
"- Description: {ctx['Description']}\n",
"- Dating: {dating_info} (Method: {ctx.get('Date_Type', 'Unknown')})\n",
"- Phase: {ctx.get('Phase_Name', 'Unknown')}\n",
"- Group: {ctx.get('Group_Name', 'Unknown')}\n",
"- Sub-Group: {ctx.get('Sub-Group_Name', 'Unknown')}\n",
"- Stratigraphic relationship: {ctx.get('Relationship_Type', 'Unknown')} {ctx.get('Related_Context_ID', '')}\n",
"- Temporal conflict: {ctx.get('Temporal_Conflict', False)}\n",
"- Is redundant: {ctx.get('Is_Redundant', False)}\n",
"\"\"\"\n",
"            context_parts.append(context_part)\n",
"\n",
"        return \"\\n\".join(context_parts)\n",
"\n",
"    def _generate_answer(self, query: str, context: str) -> str:\n",
"        \"\"\"\n",
"        Generate an answer using the retrieved context via CLI.\n",
"        \"\"\"\n",
"        prompt = f\"\"\"You are an expert archaeologist analyzing stratigraphic data. Based on the following archaeological context information, please answer the user's question accurately and comprehensively.\n",
"\n",
"ARCHAEOLOGICAL CONTEXTS:\n",
"{context}\n",
"\n",
"USER QUESTION: {query}\n",
"\n",
"Please provide a detailed answer based only on the archaeological evidence provided. Include references to specific context IDs when relevant, and explain any stratigraphic relationships or dating evidence that supports your answer.\n",
"\n",
"ANSWER:\"\"\"\n",
"\n",
"        try:\n",
"            # Use set_chat function to generate response via CLI\n",
"            if isinstance(self.chat_model, str):\n",
"                # chat_model is a string (model name), use CLI\n",
"                result = set_chat(prompt, self.chat_model)\n",
"                if result.returncode == 0:\n",
"                    return result.stdout.strip()\n",
"                else:\n",
"                    return f\"Error generating answer: {result.stderr}\"\n",
"            else:\n",
"                # Fallback for old API (shouldn't happen with updated code)\n",
"                response = self.chat_model.prompt(prompt)\n",
"                return response.text()\n",
"        except Exception as e:\n",
"            return f\"Error generating answer: {e}\"\n",
"\n",
"    def display_results(self, results: Dict[str, Any]):\n",
"        \"\"\"\n",
"        Display query results in a formatted way.\n",
"        \"\"\"\n",
"        print(\"=\" * 80)\n",
"        print(f\"QUERY: {results['query']}\")\n",
"        print(\"=\" * 80)\n",
"\n",
"        print(\"\\nANSWER:\")\n",
"        print(results['answer'])\n",
"\n",
"        print(f\"\\nSOURCES: {', '.join(results['sources'])}\")\n",
"\n",
"        print(\"\\nRETRIEVED CONTEXTS:\")\n",
"        for i, ctx in enumerate(results['retrieved_contexts'], 1):\n",
"            # Format dating for display\n",
"            dating_display = \"\"\n",
"            if pd.notna(ctx.get('Earliest_Date_Year')) and pd.notna(ctx.get('Latest_Date_Year')):\n",
"                earliest = f\"{ctx['Earliest_Date_Year']} {ctx['Earliest_Date_Era']}\"\n",
"                latest = f\"{ctx['Latest_Date_Year']} {ctx['Latest_Date_Era']}\"\n",
"                dating_display = f\"{earliest} to {latest}\"\n",
"\n",
"            print(f\"\\n{i}. Context {ctx['Context_ID']} (Similarity: {ctx['similarity_score']:.3f})\")\n",
"            print(f\" Type: {ctx['Context_Type']}\")\n",
"            print(f\" Description: {ctx['Description'][:100]}...\")\n",
"            print(f\" Dating: {dating_display}\")\n",
"            print(f\" Phase: {ctx.get('Phase_Name', 'Unknown')}\")\n",
"            print(f\" Relationship: {ctx.get('Relationship_Type', '')} {ctx.get('Related_Context_ID', '')}\")\n",
"\n",
"\n",
"# Initialize the RAG system\n",
"rag_system = ArchaeologicalRAGSystem(embedding_store, embedding_model, chat_model)"
]
},
{
"cell_type": "code",
"execution_count": 59,
"id": "6b98a9f8-0cb1-4e8f-b456-d9611d32d449",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"id": "6b98a9f8-0cb1-4e8f-b456-d9611d32d449",
"outputId": "1985fb7a-09d6-4f4c-b185-3a03b124fc65"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Archaeological Context RAG System\n",
"==================================================\n",
"Ask questions about the archaeological contexts!\n",
"Type 'quit' to exit, 'help' for sample queries\n",
"\n",
"\n",
"Enter your query: When was the site abandoned?\n",
"\n",
"Processing query...\n",
"================================================================================\n",
"QUERY: When was the site abandoned?\n",
"================================================================================\n",
"\n",
"ANSWER:\n",
"Based on the archaeological context information provided, the site abandonment can be inferred to have occurred during the Post-Medieval phase, specifically between 1600.0 AD and 1700.0 AD. This conclusion is supported by Context 1 (ID: C008), which is described as a \"Post-medieval demolition layer with brick and tile rubble\" and is dated to this time period using ceramic dating methods.\n",
"\n",
"The stratigraphic relationship of Context 1 (ID: C008) also provides valuable information. It is stated that C008 \"overlies C005\", which means that the demolition layer (C008) was deposited on top of the medieval mortar floor surface (C005). This relationship suggests that the medieval floor surface (C005) was already in place and was subsequently covered by the demolition debris, indicating that the site was abandoned and demolished during the Post-Medieval phase.\n",
"\n",
"Context 3 (ID: C005) provides additional context, as it describes the medieval mortar floor surface with tile fragments, dated to between 1200.0 AD and 1400.0 AD. The fact that C008 overlies C005 implies that the medieval occupation of the site (represented by C005) had ceased, and the site was subsequently abandoned and demolished, as evidenced by the demolition layer (C008).\n",
"\n",
"It is worth noting that Context 2 (ID: C010) and Context 3 (ID: C005) both date to the Medieval phase, between 1200.0 AD and 1400.0 AD. However, these contexts provide information about the construction and occupation of the site during the Medieval phase, rather than the abandonment of the site.\n",
"\n",
"In conclusion, based on the archaeological evidence, specifically the dating and stratigraphic relationship of Context 1 (ID: C008) and its relationship to Context 3 (ID: C005), the site abandonment is inferred to have occurred between 1600.0 AD and 1700.0 AD, during the Post-Medieval phase.\n",
"\n",
"SOURCES: C008, C010, C005\n",
"\n",
"RETRIEVED CONTEXTS:\n",
"\n",
"1. Context C008 (Similarity: 0.260)\n",
" Type: Layer\n",
" Description: Post-medieval demolition layer with brick and tile rubble...\n",
" Dating: 1600.0 AD to 1700.0 AD\n",
" Phase: Post-Medieval\n",
" Relationship: overlies C005\n",
"\n",
"2. Context C010 (Similarity: 0.154)\n",
" Type: Fill\n",
" Description: Stone and mortar fill of foundation trench C009...\n",
" Dating: 1200.0 AD to 1400.0 AD\n",
" Phase: Medieval\n",
" Relationship: fills C009\n",
"\n",
"3. Context C005 (Similarity: 0.154)\n",
" Type: Layer\n",
" Description: Medieval mortar floor surface with tile fragments...\n",
" Dating: 1200.0 AD to 1400.0 AD\n",
" Phase: Medieval\n",
" Relationship: overlies C004\n",
"\n",
"Enter your query: Are there logical inconsistencies in the stratigraphy?\n",
"\n",
"Processing query...\n",
"================================================================================\n",
"QUERY: Are there logical inconsistencies in the stratigraphy?\n",
"================================================================================\n",
"\n",
"ANSWER:\n",
"To assess logical inconsistencies in the stratigraphy, we need to examine the stratigraphic relationships and dating evidence provided for each context. \n",
"\n",
"1. **Stratigraphic Relationships**: \n",
" - Context 1 (C008) overlies Context 5 (C005), indicating that C008 is more recent than C005.\n",
" - Context 2 (C006) is built into Context 5 (C005), suggesting that C006 is contemporary with or slightly earlier than C005, but since it's a feature built into C005, it must predate the deposition of C005.\n",
" - Context 3 (C003) fills Context 2 (C002), implying C003 is later than C002. However, since C002 is not described, we can only consider the information provided for C003 and its relation to other contexts through dating and phase information.\n",
"\n",
"2. **Dating Evidence**:\n",
" - Context 1 (C008) is dated to 1600.0 AD to 1700.0 AD.\n",
" - Context 2 (C006) is dated to 1250.0 AD to 1350.0 AD.\n",
" - Context 3 (C003) is dated to 50.0 AD to 200.0 AD.\n",
"\n",
"Given these points, we can analyze potential inconsistencies:\n",
"- **Temporal Order**: C003 (50.0 AD to 200.0 AD) is the earliest based on its dating. C006 (1250.0 AD to 1350.0 AD) follows, and then C008 (1600.0 AD to 1700.0 AD) is the latest. This temporal order does not inherently suggest any inconsistencies since the events can be sequential.\n",
" \n",
"- **Stratigraphic Order**: The stratigraphic relationship of C006 being built into C005 and C008 overlying C005 indicates that C006 must be earlier than C008, which aligns with their dating (C006 is from the Medieval period, and C008 is from the Post-Medieval period). This relationship does not show any inconsistencies.\n",
"\n",
"However, without explicit information on Context 5 (C005) and Context 2 (C002), we must rely on the provided contexts for analysis. Given the available data:\n",
"- The dating of C003 to the Romano-British phase (50.0 AD to 200.0 AD) and its relationship to other contexts are not directly comparable due to the lack of information on C002 and C005's dating.\n",
"- The sequence from C006 (Medieval) to C008 (Post-Medieval) is consistent with expected chronological sequences.\n",
"\n",
"**Conclusion**: Based on the provided archaeological evidence, there do not appear to be logical inconsistencies in the stratigraphy that can be directly inferred from the relationships and dating of Contexts 1, 2, and 3. The temporal and stratigraphic relationships provided for these contexts are consistent with expected sequences, considering the available information. Any potential inconsistencies might arise from the missing details of Contexts 5 and 2, which are crucial for a comprehensive understanding of the site's stratigraphy.\n",
"\n",
"SOURCES: C008, C006, C003\n",
"\n",
"RETRIEVED CONTEXTS:\n",
"\n",
"1. Context C008 (Similarity: 0.376)\n",
" Type: Layer\n",
" Description: Post-medieval demolition layer with brick and tile rubble...\n",
" Dating: 1600.0 AD to 1700.0 AD\n",
" Phase: Post-Medieval\n",
" Relationship: overlies C005\n",
"\n",
"2. Context C006 (Similarity: 0.367)\n",
" Type: Feature\n",
" Description: Stone-lined hearth with evidence of burning and ash deposits...\n",
" Dating: 1250.0 AD to 1350.0 AD\n",
" Phase: Medieval\n",
" Relationship: built into C005\n",
"\n",
"3. Context C003 (Similarity: 0.362)\n",
" Type: Fill\n",
" Description: Light grey sandy fill of pit C002, containing burnt bone and flint tools...\n",
" Dating: 50.0 AD to 200.0 AD\n",
" Phase: Romano-British\n",
" Relationship: fills C002\n"
]
}
],
"source": [
"## 6. Interactive Query Interface\n",
"## This is mostly for someone using Google Colab to power this notebook\n",
"\n",
"def interactive_query_session():\n",
"    \"\"\"\n",
"    Run an interactive query session for archaeological data.\n",
"    \"\"\"\n",
"    print(\"Archaeological Context RAG System\")\n",
"    print(\"=\" * 50)\n",
"    print(\"Ask questions about the archaeological contexts!\")\n",
"    print(\"Type 'quit' to exit, 'help' for sample queries\")\n",
"    print()\n",
"\n",
"    sample_queries = [\n",
"        \"What Romano-British contexts were found?\",\n",
"        \"Tell me about pit contexts and their fills\",\n",
"        \"What evidence is there for domestic activity?\",\n",
"        \"Describe the stratigraphic sequence in Phase 2\",\n",
"        \"What contexts show evidence of burning or hearth activity?\",\n",
"        \"Compare the medieval contexts across different groups\",\n",
"        \"Which contexts have dating conflicts?\",\n",
"        \"What construction activities are represented?\",\n",
"        \"Show me contexts dated using radiocarbon methods\",\n",
"        \"What are the latest contexts in the sequence?\"\n",
"    ]\n",
"\n",
"    while True:\n",
"        user_input = input(\"\\nEnter your query: \").strip()\n",
"\n",
"        if user_input.lower() == 'quit':\n",
"            print(\"Goodbye!\")\n",
"            break\n",
"\n",
"        elif user_input.lower() == 'help':\n",
"            print(\"\\nSample queries you can try:\")\n",
"            for i, query in enumerate(sample_queries, 1):\n",
"                print(f\"{i}. {query}\")\n",
"            continue\n",
"\n",
"        elif not user_input:\n",
"            continue\n",
"\n",
"        # Process the query\n",
"        print(\"\\nProcessing query...\")\n",
"        results = rag_system.query(user_input)\n",
"        rag_system.display_results(results)\n",
"\n",
"\n",
"\n",
"# Start an interactive session (comment out the following line to skip it)\n",
"interactive_query_session()"
]
},
{
"cell_type": "code",
"execution_count": 60,
"id": "c5d77b52-551a-4f0d-9b2a-9e1ac18df5d7",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "c5d77b52-551a-4f0d-9b2a-9e1ac18df5d7",
"outputId": "1a112387-0047-41f3-d89a-0123aed8f639"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"DEMONSTRATION QUERIES\n",
"==================================================\n",
"\n",
"Running query: 'What contexts were found dating to the occupation by Rome and what do they tell us?'\n",
"================================================================================\n",
"QUERY: What contexts were found dating to the occupation by Rome and what do they tell us?\n",
"================================================================================\n",
"\n",
"ANSWER:\n",
"Based on the provided archaeological context information, the occupation by Rome, also known as the Romano-British period, is represented by Context 2 (ID: C001). This context is a layer of dark brown silty clay with frequent charcoal flecks and pottery sherds, which has been dated to 100.0 AD to 400.0 AD through ceramic analysis. This dating method suggests that the layer contains ceramic materials that are characteristic of the Romano-British period, allowing for a relatively precise chronology.\n",
"\n",
"The stratigraphic relationship of Context 2 (C001) indicates that it overlies Context 4 (C004), although the details of Context 4 are not provided in the given information. This relationship implies that the occupation deposit represented by Context 2 (C001) is situated above an earlier layer, which could potentially be related to an even earlier phase of occupation or a natural deposit.\n",
"\n",
"It is worth noting that there are no other contexts provided that date specifically to the Romano-British period, suggesting that Context 2 (C001) is currently the primary archaeological evidence for this time frame at the site. The absence of temporal conflict (indicated by \"Temporal conflict: False\" in the context information) and the context not being redundant (\"Is redundant: False\") support the significance and reliability of Context 2 (C001) as a source of information for the Romano-British occupation.\n",
"\n",
"In contrast, Contexts 1 (C009) and 3 (C005) are dated to the Medieval period, spanning from 1200.0 AD to 1400.0 AD, based on architectural methods. These contexts, therefore, relate to a much later period of occupation and construction at the site, with Context 1 (C009) representing a cut for a stone wall foundation and Context 3 (C005) describing a medieval mortar floor surface with tile fragments. The stratigraphic relationship between Context 1 (C009) and Context 3 (C005) shows that the foundation trench (C009) cuts into the layer containing the medieval mortar floor surface (C005), indicating a sequence of construction activities during the Medieval period.\n",
"\n",
"In summary, Context 2 (C001) provides the archaeological evidence for the occupation by Rome, offering insights into the material culture and deposition processes during the Romano-British period through its ceramic content and stratigraphic position. Further excavation and analysis, particularly of Context 4 (C004) and other potential underlying layers, could provide additional details about the site's occupation history during and preceding the Romano-British period.\n",
"\n",
"SOURCES: C009, C001, C005\n",
"\n",
"RETRIEVED CONTEXTS:\n",
"\n",
"1. Context C009 (Similarity: 0.337)\n",
" Type: Cut\n",
" Description: Rectangular foundation trench for stone wall...\n",
" Dating: 1200.0 AD to 1400.0 AD\n",
" Phase: Medieval\n",
" Relationship: cuts C005\n",
"\n",
"2. Context C001 (Similarity: 0.327)\n",
" Type: Layer\n",
" Description: Dark brown silty clay layer with frequent charcoal flecks and pottery sherds...\n",
" Dating: 100.0 AD to 400.0 AD\n",
" Phase: Romano-British\n",
" Relationship: overlies C004\n",
"\n",
"3. Context C005 (Similarity: 0.323)\n",
" Type: Layer\n",
" Description: Medieval mortar floor surface with tile fragments...\n",
" Dating: 1200.0 AD to 1400.0 AD\n",
" Phase: Medieval\n",
" Relationship: overlies C004\n",
"\n",
"--------------------------------------------------------------------------------\n",
"\n",
"Running query: 'Describe any hearths and there associated deposits'\n",
"================================================================================\n",
"QUERY: Describe any hearths and there associated deposits\n",
"================================================================================\n",
"\n",
"ANSWER:\n",
"Based on the provided archaeological context information, there is evidence of a hearth and its associated deposits. The hearth is described in Context 1 (ID: C006), which is a stone-lined hearth featuring evidence of burning and ash deposits. This hearth is dated to the Medieval period, specifically between 1250.0 AD and 1350.0 AD, as determined by radiocarbon dating.\n",
"\n",
"The associated deposit with this hearth is described in Context 2 (ID: C007), which is an ash and charcoal fill of hearth C006. This fill is rich in pottery and animal bone, indicating a high level of activity and occupation around the hearth. The dating of Context 2, between 1250.0 AD and 1350.0 AD, is based on stratigraphic relationships, which supports the contemporaneity of the hearth and its fill. The stratigraphic relationship between Context 1 (C006) and Context 2 (C007) is clearly defined, with C007 filling C006, which suggests a direct association between the hearth and its deposit.\n",
"\n",
"It's worth noting that the hearth (C006) is built into another context (C005), although the details of C005 are not provided. This stratigraphic relationship indicates that the hearth was constructed at a specific point in time, potentially reusing or modifying an existing feature.\n",
"\n",
"In contrast, Context 3 (ID: C001) describes a layer from the Romano-British period, dated between 100.0 AD and 400.0 AD, based on ceramic evidence. While this context mentions charcoal flecks and pottery sherds, it does not describe a hearth or a direct association with one. Therefore, it is not directly relevant to the description of hearths and their associated deposits in the Medieval period.\n",
"\n",
"In summary, the archaeological evidence supports the existence of a stone-lined hearth (Context 1, C006) and its associated ash and charcoal fill (Context 2, C007) during the Medieval period, with both dated to between 1250.0 AD and 1350.0 AD. The stratigraphic relationship between these contexts confirms the direct association between the hearth and its deposit.\n",
"\n",
"SOURCES: C006, C007, C001\n",
"\n",
"RETRIEVED CONTEXTS:\n",
"\n",
"1. Context C006 (Similarity: 0.685)\n",
" Type: Feature\n",
" Description: Stone-lined hearth with evidence of burning and ash deposits...\n",
" Dating: 1250.0 AD to 1350.0 AD\n",
" Phase: Medieval\n",
" Relationship: built into C005\n",
"\n",
"2. Context C007 (Similarity: 0.658)\n",
" Type: Fill\n",
" Description: Ash and charcoal fill of hearth C006, rich in pottery and animal bone...\n",
" Dating: 1250.0 AD to 1350.0 AD\n",
" Phase: Medieval\n",
" Relationship: fills C006\n",
"\n",
"3. Context C001 (Similarity: 0.437)\n",
" Type: Layer\n",
" Description: Dark brown silty clay layer with frequent charcoal flecks and pottery sherds...\n",
" Dating: 100.0 AD to 400.0 AD\n",
" Phase: Romano-British\n",
" Relationship: overlies C004\n",
"\n",
"--------------------------------------------------------------------------------\n",
"\n",
"Running query: 'What is the stratigraphic relationship between pit contexts and surrounding layers?'\n",
"================================================================================\n",
"QUERY: What is the stratigraphic relationship between pit contexts and surrounding layers?\n",
"================================================================================\n",
"\n",
"ANSWER:\n",
"The stratigraphic relationship between pit contexts and surrounding layers can be analyzed based on the provided archaeological context information. \n",
"\n",
"From the given data, we have a pit context, specifically Context 1 (ID: C002), which is a circular pit cut with steep sides and flat bottom. This pit is cut into another context, C004, as indicated by the stratigraphic relationship \"cuts C004\". However, since the details of Context 4 are not provided, we cannot determine the nature of the context that the pit cuts into.\n",
"\n",
"On the other hand, we know that the pit (C002) is filled by Context 2 (ID: C003), a light grey sandy fill containing burnt bone and flint tools, as shown by the stratigraphic relationship \"fills C002\". Both the pit (C002) and its fill (C003) are dated to the Romano-British period, specifically between 50.0 AD and 200.0 AD, based on stratigraphic dating methods.\n",
"\n",
"Regarding the surrounding layers, we have information on Context 3 (ID: C008), which is a post-medieval demolition layer consisting of brick and tile rubble, dated to between 1600.0 AD and 1700.0 AD through ceramic dating. This demolition layer overlies another context, C005, as indicated by the stratigraphic relationship \"overlies C005\". However, since the details of Context 5 are not provided, we cannot determine the nature of the context that the demolition layer overlies.\n",
"\n",
"In conclusion, the stratigraphic relationship between the pit context (C002) and its surrounding fill (C003) is well-defined, with C003 filling C002. However, the relationships between the pit and other surrounding layers, such as the context it cuts into (C004) and the post-medieval demolition layer (C008), are not fully clear due to the lack of information on contexts C004 and C005. Based on the provided data, we can only confirm that the pit and its fill are part of the Romano-British phase, while the demolition layer is part of the post-medieval phase, indicating a significant time gap between these two sets of contexts.\n",
"\n",
"SOURCES: C002, C003, C008\n",
"\n",
"RETRIEVED CONTEXTS:\n",
"\n",
"1. Context C002 (Similarity: 0.491)\n",
" Type: Cut\n",
" Description: Circular pit cut with steep sides and flat bottom, diameter 1.2m...\n",
" Dating: 50.0 AD to 200.0 AD\n",
" Phase: Romano-British\n",
" Relationship: cuts C004\n",
"\n",
"2. Context C003 (Similarity: 0.455)\n",
" Type: Fill\n",
" Description: Light grey sandy fill of pit C002, containing burnt bone and flint tools...\n",
" Dating: 50.0 AD to 200.0 AD\n",
" Phase: Romano-British\n",
" Relationship: fills C002\n",
"\n",
"3. Context C008 (Similarity: 0.426)\n",
" Type: Layer\n",
" Description: Post-medieval demolition layer with brick and tile rubble...\n",
" Dating: 1600.0 AD to 1700.0 AD\n",
" Phase: Post-Medieval\n",
" Relationship: overlies C005\n",
"\n",
"--------------------------------------------------------------------------------\n",
"\n",
"Running query: 'What evidence exists for domestic activities across different phases?'\n",
"================================================================================\n",
"QUERY: What evidence exists for domestic activities across different phases?\n",
"================================================================================\n",
"\n",
"ANSWER:\n",
"The archaeological evidence provided suggests that domestic activities were present across different phases, particularly during the Medieval and Post-Medieval periods. The primary evidence for domestic activities comes from Context 1 (ID: C006) and Context 2 (ID: C007), which are related to hearth activity.\n",
"\n",
"Context 1 (ID: C006) is a stone-lined hearth with evidence of burning and ash deposits, dated to 1250.0 AD to 1350.0 AD through radiocarbon dating. This hearth is associated with the Medieval phase and is categorized under the sub-group \"Hearth construction and use\" (Group: Hearth activity). The presence of a hearth, a common feature of domestic spaces, suggests that domestic activities such as cooking and warmth provision were taking place during this period.\n",
"\n",
"Further evidence for domestic activities during the Medieval phase is provided by Context 2 (ID: C007), which is the ash and charcoal fill of hearth C006. This fill is rich in pottery and animal bone, indicating food preparation and consumption activities. The dating of Context 2, which is based on stratigraphic relationships, also falls within the 1250.0 AD to 1350.0 AD range, supporting the idea that domestic activities were ongoing during this time.\n",
"\n",
"The stratigraphic relationship between Context 1 (ID: C006) and Context 2 (ID: C007) is significant, as Context 2 fills Context 1. This relationship suggests that the hearth was in use for a period, accumulating ash and charcoal, before being filled. The presence of pottery and animal bone in the fill provides additional evidence for domestic activities such as cooking and eating.\n",
"\n",
"While the evidence from Context 1 and Context 2 primarily pertains to the Medieval phase, the site's occupation and activities extend into the Post-Medieval period, as indicated by Context 3 (ID: C008). Context 3 is a post-medieval demolition layer with brick and tile rubble, dated to 1600.0 AD to 1700.0 AD through ceramic dating. Although this context does not directly provide evidence for domestic activities, its stratigraphic relationship to earlier contexts (it overlies Context 5, which is associated with the hearth activity of Context 1) suggests that the site underwent changes and possibly abandonment, which could be related to shifts in domestic activities or the site's use over time.\n",
"\n",
"In summary, the evidence from Context 1 (ID: C006) and Context 2 (ID: C007) demonstrates that domestic activities, specifically those related to hearth use and food preparation, were present during the Medieval phase. The stratigraphic relationships and dating evidence support the interpretation that these activities were part of the site's use during this period. The transition into the Post-Medieval phase, marked by Context 3 (ID: C008), indicates changes at the site, which may reflect alterations in domestic activities or the site's purpose, but direct evidence for domestic activities during this later phase is not provided within the given archaeological contexts.\n",
"\n",
"SOURCES: C006, C007, C008\n",
"\n",
"RETRIEVED CONTEXTS:\n",
"\n",
"1. Context C006 (Similarity: 0.246)\n",
" Type: Feature\n",
" Description: Stone-lined hearth with evidence of burning and ash deposits...\n",
" Dating: 1250.0 AD to 1350.0 AD\n",
" Phase: Medieval\n",
" Relationship: built into C005\n",
"\n",
"2. Context C007 (Similarity: 0.243)\n",
" Type: Fill\n",
" Description: Ash and charcoal fill of hearth C006, rich in pottery and animal bone...\n",
" Dating: 1250.0 AD to 1350.0 AD\n",
" Phase: Medieval\n",
" Relationship: fills C006\n",
"\n",
"3. Context C008 (Similarity: 0.218)\n",
" Type: Layer\n",
" Description: Post-medieval demolition layer with brick and tile rubble...\n",
" Dating: 1600.0 AD to 1700.0 AD\n",
" Phase: Post-Medieval\n",
" Relationship: overlies C005\n",
"\n",
"--------------------------------------------------------------------------------\n",
"\n",
"Running query: 'Which contexts have temporal conflicts and why?'\n",
"================================================================================\n",
"QUERY: Which contexts have temporal conflicts and why?\n",
"================================================================================\n",
"\n",
"ANSWER:\n",
"Based on the provided archaeological context information, none of the contexts have temporal conflicts. According to the data, the \"Temporal conflict\" field is set to \"False\" for all contexts: C008, C009, and C005.\n",
"\n",
"This indicates that the dating evidence and stratigraphic relationships within each context, as well as between contexts, do not suggest any inconsistencies or conflicts in terms of their chronological relationships.\n",
"\n",
"For instance, Context 1 (C008), dated to 1600-1700 AD, overlies Context 3 (C005), which is dated to 1200-1400 AD. This stratigraphic relationship is consistent with the expected chronological order, as the post-medieval demolition layer (C008) would indeed overlie a medieval mortar floor surface (C005).\n",
"\n",
"Similarly, Context 2 (C009), dated to 1200-1400 AD, cuts Context 3 (C005), which is also dated to the same period. This relationship is also consistent, as the construction of a foundation trench (C009) would have occurred during the same medieval phase as the creation of the mortar floor surface (C005).\n",
"\n",
"Therefore, based on the provided archaeological evidence, there are no temporal conflicts between the contexts, and their stratigraphic relationships and dating evidence support a consistent chronological narrative.\n",
"\n",
"SOURCES: C008, C009, C005\n",
"\n",
"RETRIEVED CONTEXTS:\n",
"\n",
"1. Context C008 (Similarity: 0.215)\n",
" Type: Layer\n",
" Description: Post-medieval demolition layer with brick and tile rubble...\n",
" Dating: 1600.0 AD to 1700.0 AD\n",
" Phase: Post-Medieval\n",
" Relationship: overlies C005\n",
"\n",
"2. Context C009 (Similarity: 0.195)\n",
" Type: Cut\n",
" Description: Rectangular foundation trench for stone wall...\n",
" Dating: 1200.0 AD to 1400.0 AD\n",
" Phase: Medieval\n",
" Relationship: cuts C005\n",
"\n",
"3. Context C005 (Similarity: 0.191)\n",
" Type: Layer\n",
" Description: Medieval mortar floor surface with tile fragments...\n",
" Dating: 1200.0 AD to 1400.0 AD\n",
" Phase: Medieval\n",
" Relationship: overlies C004\n",
"\n",
"--------------------------------------------------------------------------------\n",
"\n",
"Running query: 'Compare the dating methods used across different context types'\n",
"================================================================================\n",
"QUERY: Compare the dating methods used across different context types\n",
"================================================================================\n",
"\n",
"ANSWER:\n",
"Based on the archaeological context information provided, we can compare the dating methods used across different context types. \n",
"\n",
"In Context 1 (ID: C006), a feature (stone-lined hearth), the dating method used is Radiocarbon, which provides a date range of 1250.0 AD to 1350.0 AD. This suggests that the hearth was constructed and used during the Medieval phase.\n",
"\n",
"Context 2 (ID: C007), a fill, is dated using the Stratigraphic method, which relies on the layering of deposits and their relationships to other dated contexts. The fill in C007 is dated to the same time period as Context 1 (1250.0 AD to 1350.0 AD), indicating that the fill accumulated during the use of the hearth. The stratigraphic relationship between C006 and C007, where C007 fills C006, supports this dating.\n",
"\n",
"In contrast, Context 3 (ID: C008), a layer, is dated using the Ceramic method, which is based on the analysis of ceramics found within the layer. This context is dated to a later time period (1600.0 AD to 1700.0 AD), corresponding to the Post-Medieval phase. The stratigraphic relationship between C008 and C005 (C008 overlies C005) suggests that C008 is a later deposit, and its dating supports this interpretation.\n",
"\n",
"It is worth noting that Context 1 (C006) and Context 2 (C007) have the same date range but use different dating methods (Radiocarbon and Stratigraphic, respectively). This highlights the importance of using multiple lines of evidence and dating methods to build a robust chronology. The fact that the dates obtained from different methods agree for these two contexts increases our confidence in the dating of the hearth and its associated fill.\n",
"\n",
"The variation in dating methods across context types (Radiocarbon for a feature, Stratigraphic for a fill, and Ceramic for a layer) reflects the different types of deposits and the materials they contain. Radiocarbon dating is often used for organic-rich deposits like hearths, while Stratigraphic dating relies on the relationships between layers. Ceramic dating, on the other hand, is commonly used for layers containing ceramics, as in the case of Context 3.\n",
"\n",
"Overall, the combination of dating methods used across different context types provides a more comprehensive understanding of the site's chronology, highlighting the complexity and nuance of the archaeological record. By considering the stratigraphic relationships and the specific dating methods used for each context, we can build a more detailed picture of the site's occupation and use over time.\n",
"\n",
"SOURCES: C006, C007, C008\n",
"\n",
"RETRIEVED CONTEXTS:\n",
"\n",
"1. Context C006 (Similarity: 0.329)\n",
" Type: Feature\n",
" Description: Stone-lined hearth with evidence of burning and ash deposits...\n",
" Dating: 1250.0 AD to 1350.0 AD\n",
" Phase: Medieval\n",
" Relationship: built into C005\n",
"\n",
"2. Context C007 (Similarity: 0.317)\n",
" Type: Fill\n",
" Description: Ash and charcoal fill of hearth C006, rich in pottery and animal bone...\n",
" Dating: 1250.0 AD to 1350.0 AD\n",
" Phase: Medieval\n",
" Relationship: fills C006\n",
"\n",
"3. Context C008 (Similarity: 0.317)\n",
" Type: Layer\n",
" Description: Post-medieval demolition layer with brick and tile rubble...\n",
" Dating: 1600.0 AD to 1700.0 AD\n",
" Phase: Post-Medieval\n",
" Relationship: overlies C005\n",
"\n",
"--------------------------------------------------------------------------------\n"
]
}
],
"source": [
"# Demonstration queries\n",
"# with slightly convoluted language to avoid simple keyword searches\n",
"def run_demo_queries():\n",
" \"\"\"\n",
" Run some demonstration queries to show the system in action.\n",
" \"\"\"\n",
" demo_queries = [\n",
" \"What contexts were found dating to the occupation by Rome and what do they tell us?\",\n",
" \"Describe any hearths and there associated deposits\",\n",
" \"What is the stratigraphic relationship between pit contexts and surrounding layers?\",\n",
" \"What evidence exists for domestic activities across different phases?\",\n",
" \"Which contexts have temporal conflicts and why?\",\n",
" \"Compare the dating methods used across different context types\"\n",
" ]\n",
"\n",
" print(\"DEMONSTRATION QUERIES\")\n",
" print(\"=\" * 50)\n",
"\n",
" for query in demo_queries:\n",
" print(f\"\\nRunning query: '{query}'\")\n",
" results = rag_system.query(query)\n",
" rag_system.display_results(results)\n",
" print(\"\\n\" + \"-\" * 80)\n",
"\n",
"# Run demonstration\n",
"run_demo_queries()"
]
},
{
"cell_type": "code",
"execution_count": 67,
"id": "a548d5b3-4f86-440b-8d9f-f1efa882f76b",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"id": "a548d5b3-4f86-440b-8d9f-f1efa882f76b",
"outputId": "9bb1cb1c-fd01-4ab0-8585-5cd680d4b559"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"STRATIGRAPHIC RELATIONSHIP ANALYSIS\n",
"==================================================\n",
"Relationship type frequencies:\n",
" overlies: 3\n",
" fills: 3\n",
" cuts: 2\n",
" cut by: 1\n",
" built into: 1\n",
"\n",
"Data quality metrics:\n",
" Temporal conflicts: 0\n",
" Redundant contexts: 0\n",
"\n",
"Dating methods:\n",
" Stratigraphic: 4\n",
" Ceramic: 2\n",
" Architectural: 2\n",
" Geological: 1\n",
" Radiocarbon: 1\n"
]
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<Figure size 1500x1000 with 6 Axes>"
],
"image/png": "\n"
},
"metadata": {}
},
{
"output_type": "stream",
"name": "stdout",
"text": [
"\n",
"Phase analysis:\n",
" Context_Count Dominant_Type Earliest_Year Latest_Year\n",
"Phase_Name \n",
"Medieval 5 Fill 1200.0 1400.0\n",
"Natural 1 Layer NaN NaN\n",
"Post-Medieval 1 Layer 1600.0 1700.0\n",
"Romano-British 3 Layer 50.0 400.0\n",
"CONTEXT NETWORK ANALYSIS\n",
"==================================================\n",
"Found 8 connected components:\n",
" Component 1: C001, C004, C002\n",
" PHASING: P1, P0\n",
"\n",
" Component 2: C003\n",
" PHASING: P1\n",
"\n",
" Component 3: C005\n",
" PHASING: P2\n",
"\n",
" Component 4: C006\n",
" PHASING: P2\n",
"\n",
" Component 5: C007\n",
" PHASING: P2\n",
"\n",
" Component 6: C008\n",
" PHASING: P3\n",
"\n",
" Component 7: C009\n",
" PHASING: P2\n",
"\n",
" Component 8: C010\n",
" PHASING: P2\n",
"\n"
]
}
],
"source": [
"# 7. Further Analysis\n",
"\n",
"def analyze_stratigraphic_relationships(df: pd.DataFrame):\n",
" \"\"\"\n",
" Analyze and visualize stratigraphic relationships in the dataset.\n",
" \"\"\"\n",
" print(\"STRATIGRAPHIC RELATIONSHIP ANALYSIS\")\n",
" print(\"=\" * 50)\n",
"\n",
" # Analyze relationship types\n",
" relationship_counts = df['Relationship_Type'].value_counts()\n",
" print(\"Relationship type frequencies:\")\n",
" for rel_type, count in relationship_counts.items():\n",
" print(f\" {rel_type}: {count}\")\n",
"\n",
" # Analyze temporal conflicts\n",
" temporal_conflicts = df['Temporal_Conflict'].sum() if 'Temporal_Conflict' in df.columns else 0\n",
" redundant_contexts = df['Is_Redundant'].sum() if 'Is_Redundant' in df.columns else 0\n",
"\n",
" print(f\"\\nData quality metrics:\")\n",
" print(f\" Temporal conflicts: {temporal_conflicts}\")\n",
" print(f\" Redundant contexts: {redundant_contexts}\")\n",
"\n",
" # Analyze dating methods\n",
" if 'Date_Type' in df.columns:\n",
" dating_methods = df['Date_Type'].value_counts()\n",
" print(f\"\\nDating methods:\")\n",
" for method, count in dating_methods.items():\n",
" print(f\" {method}: {count}\")\n",
"\n",
" # Visualize the data\n",
" plt.figure(figsize=(15, 10))\n",
"\n",
" # Context types\n",
" plt.subplot(2, 3, 1)\n",
" context_type_counts = df['Context_Type'].value_counts()\n",
" plt.pie(context_type_counts.values, labels=context_type_counts.index, autopct='%1.1f%%')\n",
" plt.title('Distribution of Context Types')\n",
"\n",
" # Relationship types\n",
" plt.subplot(2, 3, 2)\n",
" relationship_counts.plot(kind='bar')\n",
" plt.title('Stratigraphic Relationship Types')\n",
" plt.xticks(rotation=45)\n",
"\n",
" # Phases\n",
" plt.subplot(2, 3, 3)\n",
" if 'Phase_Name' in df.columns:\n",
" phase_counts = df['Phase_Name'].value_counts()\n",
" plt.bar(phase_counts.index, phase_counts.values)\n",
" plt.title('Contexts by Phase')\n",
" plt.xticks(rotation=45)\n",
"\n",
" # Dating methods\n",
" plt.subplot(2, 3, 4)\n",
" if 'Date_Type' in df.columns:\n",
" dating_methods.plot(kind='bar')\n",
" plt.title('Dating Methods Used')\n",
" plt.xticks(rotation=45)\n",
"\n",
" # Temporal range visualization\n",
" plt.subplot(2, 3, 5)\n",
" if 'Earliest_Date_Year' in df.columns and 'Latest_Date_Year' in df.columns:\n",
" # Filter out non-numeric dates\n",
" dated_contexts = df.dropna(subset=['Earliest_Date_Year', 'Latest_Date_Year'])\n",
" if not dated_contexts.empty:\n",
" for _, ctx in dated_contexts.iterrows():\n",
" plt.plot([ctx['Earliest_Date_Year'], ctx['Latest_Date_Year']],\n",
" [ctx['Context_ID'], ctx['Context_ID']], 'b-', alpha=0.6)\n",
" plt.scatter([ctx['Earliest_Date_Year'], ctx['Latest_Date_Year']],\n",
" [ctx['Context_ID'], ctx['Context_ID']], c='red', s=20)\n",
" plt.xlabel('Year')\n",
" plt.ylabel('Context ID')\n",
" plt.title('Temporal Ranges of Contexts')\n",
"\n",
" # Groups\n",
" plt.subplot(2, 3, 6)\n",
" if 'Group_Name' in df.columns:\n",
" group_counts = df['Group_Name'].value_counts()\n",
" plt.bar(range(len(group_counts)), group_counts.values)\n",
" plt.title('Contexts by Group')\n",
" plt.xticks(range(len(group_counts)), group_counts.index, rotation=45)\n",
"\n",
" plt.tight_layout()\n",
" plt.show()\n",
"\n",
" # Print phase analysis\n",
" if 'Phase_Name' in df.columns:\n",
" print(f\"\\nPhase analysis:\")\n",
" phase_analysis = df.groupby('Phase_Name').agg({\n",
" 'Context_ID': 'count',\n",
" 'Context_Type': lambda x: x.value_counts().index[0], # Most common type\n",
" 'Earliest_Date_Year': 'min',\n",
" 'Latest_Date_Year': 'max'\n",
" }).round(0)\n",
" phase_analysis.columns = ['Context_Count', 'Dominant_Type', 'Earliest_Year', 'Latest_Year']\n",
" print(phase_analysis)\n",
"\n",
"def find_context_networks(df: pd.DataFrame):\n",
" \"\"\"\n",
" Identify networks of related contexts based on stratigraphic relationships.\n",
" \"\"\"\n",
" print(\"CONTEXT NETWORK ANALYSIS\")\n",
" print(\"=\" * 50)\n",
"\n",
" # Build a simple network of relationships\n",
" networks = {}\n",
"\n",
" for _, row in df.iterrows():\n",
" context_id = row['Context_ID']\n",
" related = row['Related_Context_ID'].split(',') if pd.notna(row['Related_Context_ID']) else []\n",
" related = [ctx.strip() for ctx in related if ctx.strip()]\n",
"\n",
" networks[context_id] = related\n",
"\n",
" # Find connected components\n",
" visited = set()\n",
" components = []\n",
"\n",
" def dfs(context, component):\n",
" if context in visited or context not in networks:\n",
" return\n",
" visited.add(context)\n",
" component.append(context)\n",
" for related in networks[context]:\n",
" dfs(related, component)\n",
"\n",
" for context in networks:\n",
" if context not in visited:\n",
" component = []\n",
" dfs(context, component)\n",
" if component:\n",
" components.append(component)\n",
"\n",
" print(f\"Found {len(components)} connected components:\")\n",
" for i, component in enumerate(components, 1):\n",
" print(f\" Component {i}: {', '.join(component)}\")\n",
"\n",
" # Show the stratigraphic sequence for this component\n",
" component_df = df[df['Context_ID'].isin(component)]\n",
" print(f\" PHASING: {', '.join(component_df['Phase_ID'].unique())}\")\n",
" print()\n",
"\n",
"# Run advanced analyses\n",
"analyze_stratigraphic_relationships(df)\n",
"find_context_networks(df)"
]
},
{
"cell_type": "code",
"source": [
"df"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 921
},
"id": "LHr6O7INI3QX",
"outputId": "b296fa85-4a65-4b5a-8096-a86e6a25c64f"
},
"id": "LHr6O7INI3QX",
"execution_count": 63,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Context_ID Context_Type Description \\\n",
"0 C001 Layer Dark brown silty clay layer with frequent char... \n",
"1 C002 Cut Circular pit cut with steep sides and flat bot... \n",
"2 C003 Fill Light grey sandy fill of pit C002, containing ... \n",
"3 C004 Layer Compact yellow clay natural subsoil layer \n",
"4 C005 Layer Medieval mortar floor surface with tile fragments \n",
"5 C006 Feature Stone-lined hearth with evidence of burning an... \n",
"6 C007 Fill Ash and charcoal fill of hearth C006, rich in ... \n",
"7 C008 Layer Post-medieval demolition layer with brick and ... \n",
"8 C009 Cut Rectangular foundation trench for stone wall \n",
"9 C010 Fill Stone and mortar fill of foundation trench C009 \n",
"\n",
" Earliest_Date_Year Earliest_Date_Era Latest_Date_Year Latest_Date_Era \\\n",
"0 100.0 AD 400.0 AD \n",
"1 50.0 AD 200.0 AD \n",
"2 50.0 AD 200.0 AD \n",
"3 NaN None NaN None \n",
"4 1200.0 AD 1400.0 AD \n",
"5 1250.0 AD 1350.0 AD \n",
"6 1250.0 AD 1350.0 AD \n",
"7 1600.0 AD 1700.0 AD \n",
"8 1200.0 AD 1400.0 AD \n",
"9 1200.0 AD 1400.0 AD \n",
"\n",
" Date_Type Phase_ID Phase_Name Group_ID Group_Name \\\n",
"0 Ceramic P1 Romano-British G1 Occupation deposits \n",
"1 Stratigraphic P1 Romano-British G2 Pit group \n",
"2 Stratigraphic P1 Romano-British G2 Pit group \n",
"3 Geological P0 Natural G0 Natural \n",
"4 Architectural P2 Medieval G3 Domestic structures \n",
"5 Radiocarbon P2 Medieval G4 Hearth activity \n",
"6 Stratigraphic P2 Medieval G4 Hearth activity \n",
"7 Ceramic P3 Post-Medieval G5 Demolition \n",
"8 Architectural P2 Medieval G6 Wall construction \n",
"9 Stratigraphic P2 Medieval G6 Wall construction \n",
"\n",
" Sub-Group_ID Sub-Group_Name Relationship_Type \\\n",
"0 SG1 General occupation overlies \n",
"1 SG2 Pit cutting and filling cuts \n",
"2 SG2 Pit cutting and filling fills \n",
"3 None None cut by \n",
"4 SG3 Floor surfaces overlies \n",
"5 SG4 Hearth construction and use built into \n",
"6 SG4 Hearth construction and use fills \n",
"7 SG5 Site abandonment overlies \n",
"8 SG6 Foundation construction cuts \n",
"9 SG6 Foundation construction fills \n",
"\n",
" Related_Context_ID Temporal_Conflict Is_Redundant \\\n",
"0 C004 False False \n",
"1 C004 False False \n",
"2 C002 False False \n",
"3 C002 False False \n",
"4 C004 False False \n",
"5 C005 False False \n",
"6 C006 False False \n",
"7 C005 False False \n",
"8 C005 False False \n",
"9 C009 False False \n",
"\n",
" prepared_text \\\n",
"0 Context C001, Type: Layer | Description: Dark ... \n",
"1 Context C002, Type: Cut | Description: Circula... \n",
"2 Context C003, Type: Fill | Description: Light ... \n",
"3 Context C004, Type: Layer | Description: Compa... \n",
"4 Context C005, Type: Layer | Description: Medie... \n",
"5 Context C006, Type: Feature | Description: Sto... \n",
"6 Context C007, Type: Fill | Description: Ash an... \n",
"7 Context C008, Type: Layer | Description: Post-... \n",
"8 Context C009, Type: Cut | Description: Rectang... \n",
"9 Context C010, Type: Fill | Description: Stone ... \n",
"\n",
" embedding \n",
"0 [-0.0022556865587830544, 0.05589134246110916, ... \n",
"1 [-0.03904779255390167, 0.0598103292286396, -0.... \n",
"2 [-0.08637669682502747, 0.04328100010752678, -0... \n",
"3 [-0.0745324045419693, 0.08791496604681015, 0.0... \n",
"4 [-0.10213303565979004, 0.12052890658378601, 0.... \n",
"5 [-0.009370672516524792, 0.14452221989631653, 0... \n",
"6 [-0.00016425596550107002, 0.11362960189580917,... \n",
"7 [-0.073366180062294, 0.11150677502155304, 0.03... \n",
"8 [-0.06563512980937958, 0.08968332409858704, 0.... \n",
"9 [-0.07496478408575058, 0.07007502764463425, 0.... "
],
"text/html": [
"\n",
" <div id=\"df-86930272-c7a4-4720-ace9-702491ab37fc\" class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Context_ID</th>\n",
" <th>Context_Type</th>\n",
" <th>Description</th>\n",
" <th>Earliest_Date_Year</th>\n",
" <th>Earliest_Date_Era</th>\n",
" <th>Latest_Date_Year</th>\n",
" <th>Latest_Date_Era</th>\n",
" <th>Date_Type</th>\n",
" <th>Phase_ID</th>\n",
" <th>Phase_Name</th>\n",
" <th>Group_ID</th>\n",
" <th>Group_Name</th>\n",
" <th>Sub-Group_ID</th>\n",
" <th>Sub-Group_Name</th>\n",
" <th>Relationship_Type</th>\n",
" <th>Related_Context_ID</th>\n",
" <th>Temporal_Conflict</th>\n",
" <th>Is_Redundant</th>\n",
" <th>prepared_text</th>\n",
" <th>embedding</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>C001</td>\n",
" <td>Layer</td>\n",
" <td>Dark brown silty clay layer with frequent char...</td>\n",
" <td>100.0</td>\n",
" <td>AD</td>\n",
" <td>400.0</td>\n",
" <td>AD</td>\n",
" <td>Ceramic</td>\n",
" <td>P1</td>\n",
" <td>Romano-British</td>\n",
" <td>G1</td>\n",
" <td>Occupation deposits</td>\n",
" <td>SG1</td>\n",
" <td>General occupation</td>\n",
" <td>overlies</td>\n",
" <td>C004</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>Context C001, Type: Layer | Description: Dark ...</td>\n",
" <td>[-0.0022556865587830544, 0.05589134246110916, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>C002</td>\n",
" <td>Cut</td>\n",
" <td>Circular pit cut with steep sides and flat bot...</td>\n",
" <td>50.0</td>\n",
" <td>AD</td>\n",
" <td>200.0</td>\n",
" <td>AD</td>\n",
" <td>Stratigraphic</td>\n",
" <td>P1</td>\n",
" <td>Romano-British</td>\n",
" <td>G2</td>\n",
" <td>Pit group</td>\n",
" <td>SG2</td>\n",
" <td>Pit cutting and filling</td>\n",
" <td>cuts</td>\n",
" <td>C004</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>Context C002, Type: Cut | Description: Circula...</td>\n",
" <td>[-0.03904779255390167, 0.0598103292286396, -0....</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>C003</td>\n",
" <td>Fill</td>\n",
" <td>Light grey sandy fill of pit C002, containing ...</td>\n",
" <td>50.0</td>\n",
" <td>AD</td>\n",
" <td>200.0</td>\n",
" <td>AD</td>\n",
" <td>Stratigraphic</td>\n",
" <td>P1</td>\n",
" <td>Romano-British</td>\n",
" <td>G2</td>\n",
" <td>Pit group</td>\n",
" <td>SG2</td>\n",
" <td>Pit cutting and filling</td>\n",
" <td>fills</td>\n",
" <td>C002</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>Context C003, Type: Fill | Description: Light ...</td>\n",
" <td>[-0.08637669682502747, 0.04328100010752678, -0...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>C004</td>\n",
" <td>Layer</td>\n",
" <td>Compact yellow clay natural subsoil layer</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" <td>Geological</td>\n",
" <td>P0</td>\n",
" <td>Natural</td>\n",
" <td>G0</td>\n",
" <td>Natural</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>cut by</td>\n",
" <td>C002</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>Context C004, Type: Layer | Description: Compa...</td>\n",
" <td>[-0.0745324045419693, 0.08791496604681015, 0.0...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>C005</td>\n",
" <td>Layer</td>\n",
" <td>Medieval mortar floor surface with tile fragments</td>\n",
" <td>1200.0</td>\n",
" <td>AD</td>\n",
" <td>1400.0</td>\n",
" <td>AD</td>\n",
" <td>Architectural</td>\n",
" <td>P2</td>\n",
" <td>Medieval</td>\n",
" <td>G3</td>\n",
" <td>Domestic structures</td>\n",
" <td>SG3</td>\n",
" <td>Floor surfaces</td>\n",
" <td>overlies</td>\n",
" <td>C004</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>Context C005, Type: Layer | Description: Medie...</td>\n",
" <td>[-0.10213303565979004, 0.12052890658378601, 0....</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>C006</td>\n",
" <td>Feature</td>\n",
" <td>Stone-lined hearth with evidence of burning an...</td>\n",
" <td>1250.0</td>\n",
" <td>AD</td>\n",
" <td>1350.0</td>\n",
" <td>AD</td>\n",
" <td>Radiocarbon</td>\n",
" <td>P2</td>\n",
" <td>Medieval</td>\n",
" <td>G4</td>\n",
" <td>Hearth activity</td>\n",
" <td>SG4</td>\n",
" <td>Hearth construction and use</td>\n",
" <td>built into</td>\n",
" <td>C005</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>Context C006, Type: Feature | Description: Sto...</td>\n",
" <td>[-0.009370672516524792, 0.14452221989631653, 0...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>C007</td>\n",
" <td>Fill</td>\n",
" <td>Ash and charcoal fill of hearth C006, rich in ...</td>\n",
" <td>1250.0</td>\n",
" <td>AD</td>\n",
" <td>1350.0</td>\n",
" <td>AD</td>\n",
" <td>Stratigraphic</td>\n",
" <td>P2</td>\n",
" <td>Medieval</td>\n",
" <td>G4</td>\n",
" <td>Hearth activity</td>\n",
" <td>SG4</td>\n",
" <td>Hearth construction and use</td>\n",
" <td>fills</td>\n",
" <td>C006</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>Context C007, Type: Fill | Description: Ash an...</td>\n",
" <td>[-0.00016425596550107002, 0.11362960189580917,...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>C008</td>\n",
" <td>Layer</td>\n",
" <td>Post-medieval demolition layer with brick and ...</td>\n",
" <td>1600.0</td>\n",
" <td>AD</td>\n",
" <td>1700.0</td>\n",
" <td>AD</td>\n",
" <td>Ceramic</td>\n",
" <td>P3</td>\n",
" <td>Post-Medieval</td>\n",
" <td>G5</td>\n",
" <td>Demolition</td>\n",
" <td>SG5</td>\n",
" <td>Site abandonment</td>\n",
" <td>overlies</td>\n",
" <td>C005</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>Context C008, Type: Layer | Description: Post-...</td>\n",
" <td>[-0.073366180062294, 0.11150677502155304, 0.03...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>C009</td>\n",
" <td>Cut</td>\n",
" <td>Rectangular foundation trench for stone wall</td>\n",
" <td>1200.0</td>\n",
" <td>AD</td>\n",
" <td>1400.0</td>\n",
" <td>AD</td>\n",
" <td>Architectural</td>\n",
" <td>P2</td>\n",
" <td>Medieval</td>\n",
" <td>G6</td>\n",
" <td>Wall construction</td>\n",
" <td>SG6</td>\n",
" <td>Foundation construction</td>\n",
" <td>cuts</td>\n",
" <td>C005</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>Context C009, Type: Cut | Description: Rectang...</td>\n",
" <td>[-0.06563512980937958, 0.08968332409858704, 0....</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>C010</td>\n",
" <td>Fill</td>\n",
" <td>Stone and mortar fill of foundation trench C009</td>\n",
" <td>1200.0</td>\n",
" <td>AD</td>\n",
" <td>1400.0</td>\n",
" <td>AD</td>\n",
" <td>Stratigraphic</td>\n",
" <td>P2</td>\n",
" <td>Medieval</td>\n",
" <td>G6</td>\n",
" <td>Wall construction</td>\n",
" <td>SG6</td>\n",
" <td>Foundation construction</td>\n",
" <td>fills</td>\n",
" <td>C009</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>Context C010, Type: Fill | Description: Stone ...</td>\n",
" <td>[-0.07496478408575058, 0.07007502764463425, 0....</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <div class=\"colab-df-buttons\">\n",
"\n",
" <div class=\"colab-df-container\">\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-86930272-c7a4-4720-ace9-702491ab37fc')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
"\n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
" <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
" </svg>\n",
" </button>\n",
"\n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" .colab-df-buttons div {\n",
" margin-bottom: 4px;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-86930272-c7a4-4720-ace9-702491ab37fc button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-86930272-c7a4-4720-ace9-702491ab37fc');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
"\n",
"\n",
" <div id=\"df-0fc17b7c-6dc8-4d51-bdf3-2419ec09e1ac\">\n",
" <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-0fc17b7c-6dc8-4d51-bdf3-2419ec09e1ac')\"\n",
" title=\"Suggest charts\"\n",
" style=\"display:none;\">\n",
"\n",
"<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <g>\n",
" <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
" </g>\n",
"</svg>\n",
" </button>\n",
"\n",
"<style>\n",
" .colab-df-quickchart {\n",
" --bg-color: #E8F0FE;\n",
" --fill-color: #1967D2;\n",
" --hover-bg-color: #E2EBFA;\n",
" --hover-fill-color: #174EA6;\n",
" --disabled-fill-color: #AAA;\n",
" --disabled-bg-color: #DDD;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-quickchart {\n",
" --bg-color: #3B4455;\n",
" --fill-color: #D2E3FC;\n",
" --hover-bg-color: #434B5C;\n",
" --hover-fill-color: #FFFFFF;\n",
" --disabled-bg-color: #3B4455;\n",
" --disabled-fill-color: #666;\n",
" }\n",
"\n",
" .colab-df-quickchart {\n",
" background-color: var(--bg-color);\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: var(--fill-color);\n",
" height: 32px;\n",
" padding: 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-quickchart:hover {\n",
" background-color: var(--hover-bg-color);\n",
" box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: var(--hover-fill-color);\n",
" }\n",
"\n",
" .colab-df-quickchart-complete:disabled,\n",
" .colab-df-quickchart-complete:disabled:hover {\n",
" background-color: var(--disabled-bg-color);\n",
" fill: var(--disabled-fill-color);\n",
" box-shadow: none;\n",
" }\n",
"\n",
" .colab-df-spinner {\n",
" border: 2px solid var(--fill-color);\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" animation:\n",
" spin 1s steps(1) infinite;\n",
" }\n",
"\n",
" @keyframes spin {\n",
" 0% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" border-left-color: var(--fill-color);\n",
" }\n",
" 20% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 30% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 40% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 60% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 80% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" 90% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" }\n",
"</style>\n",
"\n",
" <script>\n",
" async function quickchart(key) {\n",
" const quickchartButtonEl =\n",
" document.querySelector('#' + key + ' button');\n",
" quickchartButtonEl.disabled = true; // To prevent multiple clicks.\n",
" quickchartButtonEl.classList.add('colab-df-spinner');\n",
" try {\n",
" const charts = await google.colab.kernel.invokeFunction(\n",
" 'suggestCharts', [key], {});\n",
" } catch (error) {\n",
" console.error('Error during call to suggestCharts:', error);\n",
" }\n",
" quickchartButtonEl.classList.remove('colab-df-spinner');\n",
" quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
" }\n",
" (() => {\n",
" let quickchartButtonEl =\n",
" document.querySelector('#df-0fc17b7c-6dc8-4d51-bdf3-2419ec09e1ac button');\n",
" quickchartButtonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
" })();\n",
" </script>\n",
" </div>\n",
"\n",
" <div id=\"id_8c862a23-4ff9-4008-8521-5232168657fd\">\n",
" <style>\n",
" .colab-df-generate {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-generate:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-generate {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-generate:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
" <button class=\"colab-df-generate\" onclick=\"generateWithVariable('df')\"\n",
" title=\"Generate code using this dataframe.\"\n",
" style=\"display:none;\">\n",
"\n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <path d=\"M7,19H8.4L18.45,9,17,7.55,7,17.6ZM5,21V16.75L18.45,3.32a2,2,0,0,1,2.83,0l1.4,1.43a1.91,1.91,0,0,1,.58,1.4,1.91,1.91,0,0,1-.58,1.4L9.25,21ZM18.45,9,17,7.55Zm-12,3A5.31,5.31,0,0,0,4.9,8.1,5.31,5.31,0,0,0,1,6.5,5.31,5.31,0,0,0,4.9,4.9,5.31,5.31,0,0,0,6.5,1,5.31,5.31,0,0,0,8.1,4.9,5.31,5.31,0,0,0,12,6.5,5.46,5.46,0,0,0,6.5,12Z\"/>\n",
" </svg>\n",
" </button>\n",
" <script>\n",
" (() => {\n",
" const buttonEl =\n",
" document.querySelector('#id_8c862a23-4ff9-4008-8521-5232168657fd button.colab-df-generate');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" buttonEl.onclick = () => {\n",
" google.colab.notebook.generateWithVariable('df');\n",
" }\n",
" })();\n",
" </script>\n",
" </div>\n",
"\n",
" </div>\n",
" </div>\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"variable_name": "df",
"summary": "{\n \"name\": \"df\",\n \"rows\": 10,\n \"fields\": [\n {\n \"column\": \"Context_ID\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 10,\n \"samples\": [\n \"C009\",\n \"C002\",\n \"C006\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Context_Type\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 4,\n \"samples\": [\n \"Cut\",\n \"Feature\",\n \"Layer\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Description\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 10,\n \"samples\": [\n \"Rectangular foundation trench for stone wall\",\n \"Circular pit cut with steep sides and flat bottom, diameter 1.2m\",\n \"Stone-lined hearth with evidence of burning and ash deposits\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Earliest_Date_Year\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 621.0430294628903,\n \"min\": 50.0,\n \"max\": 1600.0,\n \"num_unique_values\": 5,\n \"samples\": [\n 50.0,\n 1600.0,\n 1200.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Earliest_Date_Era\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"AD\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Latest_Date_Year\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 595.5273106900956,\n \"min\": 200.0,\n \"max\": 1700.0,\n \"num_unique_values\": 5,\n \"samples\": [\n 200.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Latest_Date_Era\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"AD\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Date_Type\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": 
[\n \"Stratigraphic\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Phase_ID\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 4,\n \"samples\": [\n \"P0\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Phase_Name\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 4,\n \"samples\": [\n \"Natural\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Group_ID\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 7,\n \"samples\": [\n \"G1\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Group_Name\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 7,\n \"samples\": [\n \"Occupation deposits\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Sub-Group_ID\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 6,\n \"samples\": [\n \"SG1\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Sub-Group_Name\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 6,\n \"samples\": [\n \"General occupation\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Relationship_Type\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"cuts\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Related_Context_ID\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"C002\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Temporal_Conflict\",\n \"properties\": {\n \"dtype\": \"boolean\",\n \"num_unique_values\": 1,\n \"samples\": [\n false\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Is_Redundant\",\n \"properties\": {\n 
\"dtype\": \"boolean\",\n \"num_unique_values\": 1,\n \"samples\": [\n false\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"prepared_text\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 10,\n \"samples\": [\n \"Context C009, Type: Cut | Description: Rectangular foundation trench for stone wall | Dated 1200.0 AD to 1400.0 AD | Dating method: Architectural | Phase: Medieval | Group: Wall construction | Sub-Group: Foundation construction | Stratigraphic relationship: cuts C005\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"embedding\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 63
}
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b8c727ed-ae1c-42a5-838c-c1f0a92ed61b",
"metadata": {
"id": "b8c727ed-ae1c-42a5-838c-c1f0a92ed61b"
},
"outputs": [],
"source": [
"## 8. Export and Persistence\n",
"# Imports repeated here so this cell also runs on its own\n",
"import json\n",
"import sqlite3\n",
"from typing import Dict, List\n",
"\n",
"import pandas as pd\n",
"\n",
"def save_embeddings_to_database(df: pd.DataFrame, db_path: str = 'archaeological_contexts.db'):\n",
"    \"\"\"\n",
"    Save the context data and embeddings to a SQLite database for persistence.\n",
"    \"\"\"\n",
"    conn = sqlite3.connect(db_path)\n",
"\n",
"    # SQLite has no list type, so store each embedding as a JSON string\n",
"    df_to_save = df.copy()\n",
"    df_to_save['embedding_json'] = df_to_save['embedding'].apply(json.dumps)\n",
"    df_to_save = df_to_save.drop('embedding', axis=1)\n",
"\n",
"    # Save to database, overwriting any existing table\n",
"    df_to_save.to_sql('archaeological_contexts', conn, if_exists='replace', index=False)\n",
"\n",
"    conn.close()\n",
"    print(f\"Saved {len(df)} contexts to database: {db_path}\")\n",
"\n",
"def load_embeddings_from_database(db_path: str = 'archaeological_contexts.db') -> pd.DataFrame:\n",
"    \"\"\"\n",
"    Load context data and embeddings from SQLite database.\n",
"    \"\"\"\n",
"    conn = sqlite3.connect(db_path)\n",
"    df = pd.read_sql_query(\"SELECT * FROM archaeological_contexts\", conn)\n",
"    conn.close()\n",
"\n",
"    # Convert embeddings back from JSON strings to lists of floats\n",
"    df['embedding'] = df['embedding_json'].apply(json.loads)\n",
"    df = df.drop('embedding_json', axis=1)\n",
"\n",
"    print(f\"Loaded {len(df)} contexts from database: {db_path}\")\n",
"    return df\n",
"\n",
"def export_results_to_csv(results_list: List[Dict], output_path: str = 'query_results.csv'):\n",
"    \"\"\"\n",
"    Export query results to CSV for further analysis.\n",
"    \"\"\"\n",
"    # Flatten results for CSV export: one row per (query, retrieved context) pair\n",
"    flattened_results = []\n",
"\n",
"    for result in results_list:\n",
"        for ctx in result['retrieved_contexts']:\n",
"            # Truncate long descriptions; add an ellipsis only when text was cut\n",
"            description = ctx['Description']\n",
"            if len(description) > 100:\n",
"                description = description[:100] + '...'\n",
"\n",
"            flattened_results.append({\n",
"                'query': result['query'],\n",
"                'context_id': ctx['Context_ID'],\n",
"                'similarity_score': ctx['similarity_score'],\n",
"                'context_type': ctx['Context_Type'],\n",
"                'phase_name': ctx.get('Phase_Name', ''),\n",
"                'group_name': ctx.get('Group_Name', ''),\n",
"                'relationship_type': ctx.get('Relationship_Type', ''),\n",
"                'related_context_id': ctx.get('Related_Context_ID', ''),\n",
"                'description': description\n",
"            })\n",
"\n",
"    results_df = pd.DataFrame(flattened_results)\n",
"    results_df.to_csv(output_path, index=False)\n",
"    print(f\"Exported {len(results_df)} query results to: {output_path}\")\n",
"\n",
"# Save current work\n",
"save_embeddings_to_database(df)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.5"
},
"colab": {
"provenance": [],
"gpuType": "T4"
},
"accelerator": "GPU"
},
"nbformat": 4,
"nbformat_minor": 5
}
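
The persistence cell in the notebook above stores each embedding as a JSON string in SQLite and decodes it again on load. The following is a minimal, standard-library-only sketch of that round trip, not the notebook's own code: the in-memory database, the `contexts` table, and the `roundtrip_embeddings` helper are all assumptions made for illustration, and the two sample vectors are shortened stand-ins rather than real embeddings.

```python
import json
import sqlite3
from contextlib import closing


def roundtrip_embeddings(rows):
    """Store (context_id, embedding) pairs as JSON text, then read them back.

    Hypothetical helper: the table name `contexts` and the in-memory
    database are illustrative assumptions, not part of the notebook.
    """
    with closing(sqlite3.connect(":memory:")) as conn:
        conn.execute(
            "CREATE TABLE contexts (context_id TEXT PRIMARY KEY, embedding_json TEXT)"
        )
        # SQLite has no list type, so each vector is serialized to a JSON string.
        conn.executemany(
            "INSERT INTO contexts VALUES (?, ?)",
            [(context_id, json.dumps(vector)) for context_id, vector in rows],
        )
        conn.commit()
        stored = conn.execute(
            "SELECT context_id, embedding_json FROM contexts ORDER BY context_id"
        ).fetchall()
    # Decode each JSON string back into a list of floats.
    return {context_id: json.loads(blob) for context_id, blob in stored}


# Shortened stand-in vectors, for illustration only.
sample = [("C006", [-0.0094, 0.1445]), ("C007", [-0.0002, 0.1136])]
restored = roundtrip_embeddings(sample)
print(restored["C006"])
```

One design note: `json.dumps` accepts plain Python lists but not NumPy arrays, so if the embeddings were held as arrays they would need converting (for example with `.tolist()`) before being saved this way.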