Created
November 18, 2025 20:49
| { | |
| "nbformat": 4, | |
| "nbformat_minor": 0, | |
| "metadata": { | |
| "colab": { | |
| "provenance": [], | |
| "collapsed_sections": [ | |
| "7yIWmxj3rwX8" | |
| ], | |
| "gpuType": "T4", | |
| "authorship_tag": "ABX9TyOkx9ZAU9hKrUzwspC+ipl2", | |
| "include_colab_link": true | |
| }, | |
| "kernelspec": { | |
| "name": "python3", | |
| "display_name": "Python 3" | |
| }, | |
| "language_info": { | |
| "name": "python" | |
| }, | |
| "accelerator": "GPU" | |
| }, | |
| "cells": [ | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "So, normally you would mark up each inscription with annotations indicating the starting character and ending character position for each piece of data you want - name of the deceased, age at death, whatever. There are platforms that can help you do this, but it can take a long time and you'd have to pay someone for their time.\n", | |
| "\n", | |
| "Here, I'm trying a different approach where I am counting on the highly formulaic nature of Roman epigraphy. I have generated a few hundred 'inscriptions' and then directed an LLM to do the annotation work. Then I use that data to create an entity extraction model with spaCy. Once such a model is created, it's very deterministic, unlike an LLM which might get creative on us. It's also rather small and can be run on a variety of regular computers or in Colab.\n", | |
| "\n", | |
| "(On the other hand, the same regularity that lets me create synthetic data with a commercial LLM probably means the LLM could jump straight from reading the inscription to creating the structured data we're after.)\n", | |
| "\n", | |
| "Originally I built this with the spaCy English language model, but obviously that's not great for Latin. So I'm now trying Patrick Burns' [LatinCy](https://diyclassics.github.io/latincy-book/) model." | |
| ], | |
| "metadata": { | |
| "id": "6VdOS9H5mgch" | |
| } | |
| }, | |
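To make the character-offset idea concrete, here is a minimal sketch of one annotated record. The `transcription` and `annotations` field names and the `[start, end, label]` layout follow the cells below; the inscription text and the label names (`FORMULA`, `NAME`, `AGE`) are invented for illustration and are not taken from the training file.

```python
# Hypothetical record: the text and labels are invented for illustration;
# only the field names and the [start, end, label] layout follow the notebook.
record = {
    "transcription": "DIS MANIBUS GAIUS IULIUS VIXIT ANNIS XXX",
    "annotations": [
        [0, 11, "FORMULA"],
        [12, 24, "NAME"],
        [37, 40, "AGE"],
    ],
}

def spans_as_text(record):
    """Return (label, substring) pairs, treating offsets as end-exclusive."""
    text = record["transcription"]
    return [(label, text[start:end]) for start, end, label in record["annotations"]]

for label, surface in spans_as_text(record):
    print(f"{label:8} {surface!r}")
```

Printing the substrings like this is a quick way to spot-check that an annotator (human or LLM) counted characters correctly before any training happens.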
| { | |
| "cell_type": "code", | |
| "execution_count": 1, | |
| "metadata": { | |
| "id": "qqyyEFg1hl5P" | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "!mkdir assets # To store your raw data files (jsonl, csv)\n", | |
| "!mkdir configs # To store configuration files\n", | |
| "!mkdir scripts # To store helper scripts (like data conversion)\n", | |
| "!mkdir training # To store the output of the training process\n", | |
| "!mkdir corpus # To store the processed .spacy files" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "!ls" | |
| ], | |
| "metadata": { | |
| "colab": { | |
| "base_uri": "https://localhost:8080/" | |
| }, | |
| "id": "V0eqKNQyitWN", | |
| "outputId": "aabd38d7-468e-4b14-94ce-fde04ab56887" | |
| }, | |
| "execution_count": 2, | |
| "outputs": [ | |
| { | |
| "output_type": "stream", | |
| "name": "stdout", | |
| "text": [ | |
| "assets\tconfigs corpus sample_data scripts training\n" | |
| ] | |
| } | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "#!pip install -U spacy #already in colab\n", | |
| "#!python -m spacy download en_core_web_lg\n", | |
| "#!pip install \"la-core-web-sm @ https://huggingface.co/latincy/la_core_web_sm/resolve/main/la_core_web_sm-any-py3-none-any.whl\"\n", | |
| "!pip install \"la-core-web-lg @ https://huggingface.co/latincy/la_core_web_lg/resolve/main/la_core_web_lg-any-py3-none-any.whl\"\n", | |
| "#\n", | |
| "# this is what we're going to retrain." | |
| ], | |
| "metadata": { | |
| "collapsed": true, | |
| "id": "mL16o-mOiFbS" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "!pip install spacy-transformers" | |
| ], | |
| "metadata": { | |
| "id": "EC-TWE4ozu20" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "# Then run this to restart the Colab runtime so the newly installed packages are picked up. Colab will report that the session crashed; ignore that and continue.\n", | |
| "import os\n", | |
| "os.kill(os.getpid(), 9)" | |
| ], | |
| "metadata": { | |
| "id": "RHpzFqpNCvrm" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "# start with some synthetic training annotations\n", | |
| "\n", | |
| "!wget https://gist.githubusercontent.com/shawngraham/f44663efc80916a75c736a38f024b371/raw/9b1d724d19b30ded168268af0fd959dccaae521e/synthetic-training.jsonl -O assets/synthethic-training.jsonl\n", | |
| "\n", | |
| "## and some synthetic testing data\n", | |
| "#!wget https://gist.githubusercontent.com/shawngraham/3633224a209ab01f650f9dee9183888d/raw/9cc9dfeb566dc8465d744e6745af98af363a227c/testing-epigraphs-synthetic.csv -O assets/test-fake-epigraphs.csv" | |
| ], | |
| "metadata": { | |
| "colab": { | |
| "base_uri": "https://localhost:8080/" | |
| }, | |
| "id": "LB__UbKm-CQ3", | |
| "outputId": "767be4ac-03dc-45d4-e354-f3c975db36f1" | |
| }, | |
| "execution_count": 79, | |
| "outputs": [ | |
| { | |
| "output_type": "stream", | |
| "name": "stdout", | |
| "text": [ | |
| "--2025-11-18 19:53:57-- https://gist.githubusercontent.com/shawngraham/f44663efc80916a75c736a38f024b371/raw/9b1d724d19b30ded168268af0fd959dccaae521e/synthetic-training.jsonl\n", | |
| "Resolving gist.githubusercontent.com (gist.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n", | |
| "Connecting to gist.githubusercontent.com (gist.githubusercontent.com)|185.199.108.133|:443... connected.\n", | |
| "HTTP request sent, awaiting response... 200 OK\n", | |
| "Length: 161705 (158K) [text/plain]\n", | |
| "Saving to: ‘assets/synthethic-training.jsonl’\n", | |
| "\n", | |
| "assets/synthethic-t 100%[===================>] 157.92K --.-KB/s in 0.02s \n", | |
| "\n", | |
| "2025-11-18 19:53:58 (10.2 MB/s) - ‘assets/synthethic-training.jsonl’ saved [161705/161705]\n", | |
| "\n" | |
| ] | |
| } | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "## Prepare the Data" | |
| ], | |
| "metadata": { | |
| "id": "45Wkx0lkciHi" | |
| } | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "# # scripts/convert_csv_to_jsonl.py\n", | |
| "# # only if your original data is in csv format\n", | |
| "\n", | |
| "# import pandas as pd\n", | |
| "# import json\n", | |
| "# import ast # Abstract Syntax Tree module to safely evaluate string-formatted lists\n", | |
| "\n", | |
| "# def convert_csv_to_jsonl(input_path, output_path):\n", | |
| "#     \"\"\"\n", | |
| "#     Converts a CSV with epigraphic annotations to a JSONL file\n", | |
| "#     compatible with the spaCy training pipeline.\n", | |
| "#     \"\"\"\n", | |
| "#     try:\n", | |
| "#         df = pd.read_csv(input_path)\n", | |
| "#     except FileNotFoundError:\n", | |
| "#         print(f\"Error: The file at {input_path} was not found.\")\n", | |
| "#         return\n", | |
| "\n", | |
| "#     with open(output_path, 'w') as f:\n", | |
| "#         for index, row in df.iterrows():\n", | |
| "#             # Safely evaluate the string representation of the list\n", | |
| "#             try:\n", | |
| "#                 # ast.literal_eval is safer than eval() for this purpose\n", | |
| "#                 annotation_list = ast.literal_eval(row['annotations'])\n", | |
| "#             except (ValueError, SyntaxError):\n", | |
| "#                 print(f\"Warning: Could not parse annotations for row {index + 1}. Skipping.\")\n", | |
| "#                 continue\n", | |
| "\n", | |
| "#             # Create the final JSON structure for each line\n", | |
| "#             json_record = {\n", | |
| "#                 \"id\": row['id'],\n", | |
| "#                 \"text\": row['text'],\n", | |
| "#                 \"transcription\": row['transcription'],\n", | |
| "#                 # The final JSONL needs the annotations nested inside a dictionary\n", | |
| "#                 \"annotations\": {\"annotations\": annotation_list}\n", | |
| "#             }\n", | |
| "\n", | |
| "#             # Write the JSON object as a string on a new line\n", | |
| "#             f.write(json.dumps(json_record) + '\\n')\n", | |
| "\n", | |
| "#     print(f\" Successfully converted {input_path} to {output_path}\")\n", | |
| "\n", | |
| "# # --- Main execution ---\n", | |
| "# if __name__ == '__main__':\n", | |
| "#     # Define the input CSV and output JSONL file paths\n", | |
| "#     csv_file = 'assets/training-annotations-synthetic.csv'\n", | |
| "#     jsonl_file = 'assets/train.jsonl'\n", | |
| "\n", | |
| "#     # Run the conversion\n", | |
| "#     convert_csv_to_jsonl(csv_file, jsonl_file)" | |
| ], | |
| "metadata": { | |
| "colab": { | |
| "base_uri": "https://localhost:8080/" | |
| }, | |
| "id": "d0eF77b4ij-N", | |
| "outputId": "7e14a78a-6c7b-44ac-f74f-5cee0acb9503" | |
| }, | |
| "execution_count": 80, | |
| "outputs": [ | |
| { | |
| "output_type": "stream", | |
| "name": "stdout", | |
| "text": [ | |
| " Successfully converted assets/training-annotations-synthetic.csv to assets/train.jsonl\n" | |
| ] | |
| } | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "# scripts/partition_data.py (Separates clean data from data that needs fixing)\n", | |
| "\n", | |
| "import spacy\n", | |
| "import json\n", | |
| "from spacy.tokens import Doc\n", | |
| "\n", | |
| "def get_annotation_spans(record):\n", | |
| "    \"\"\"Returns the annotation list, whether it is flat or nested in dictionaries.\"\"\"\n", | |
| "    raw = record.get('annotations', [])\n", | |
| "    # Unwrap up to two levels of {'annotations': ...} nesting\n", | |
| "    for _ in range(2):\n", | |
| "        if isinstance(raw, dict):\n", | |
| "            raw = raw.get('annotations', [])\n", | |
| "    return raw if isinstance(raw, list) else []\n", | |
| "\n", | |
| "def partition_data(input_path, clean_output_path, needs_fixing_output_path):\n", | |
| "    \"\"\"\n", | |
| "    Reads a JSONL file and splits it into two files: one with clean, alignable\n", | |
| "    records, and one with records that contain at least one unalignable annotation.\n", | |
| "    \"\"\"\n", | |
| "    #nlp = spacy.blank(\"en\")\n", | |
| "    nlp = spacy.load('la_core_web_lg')\n", | |
| "    clean_records = []\n", | |
| "    fix_records = []\n", | |
| "\n", | |
| "    print(f\"--- Starting data partitioning for '{input_path}' ---\")\n", | |
| "\n", | |
| "    with open(input_path, 'r', encoding='utf-8') as f:\n", | |
| "        for line in f:\n", | |
| "            try: record = json.loads(line)\n", | |
| "            except json.JSONDecodeError: continue\n", | |
| "\n", | |
| "            text = record.get('transcription', '')\n", | |
| "            if not text:\n", | |
| "                fix_records.append(record) # A record with no text needs fixing\n", | |
| "                continue\n", | |
| "\n", | |
| "            annotations = get_annotation_spans(record)\n", | |
| "\n", | |
| "            # A record with no annotations is considered clean\n", | |
| "            if not isinstance(annotations, list) or not annotations:\n", | |
| "                clean_records.append(record)\n", | |
| "                continue\n", | |
| "\n", | |
| "            doc = nlp.make_doc(text)\n", | |
| "            is_record_clean = True # Assume the record is clean until proven otherwise\n", | |
| "\n", | |
| "            for entity in annotations:\n", | |
| "                if not isinstance(entity, list) or len(entity) != 3:\n", | |
| "                    is_record_clean = False\n", | |
| "                    break # Malformed entity taints the whole record\n", | |
| "\n", | |
| "                start, end, label = entity\n", | |
| "\n", | |
| "                # Try all alignment strategies\n", | |
| "                span = doc.char_span(start, end, label=label, alignment_mode=\"expand\")\n", | |
| "                if span is None: span = doc.char_span(start + 1, end + 1, label=label, alignment_mode=\"expand\")\n", | |
| "                if span is None: span = doc.char_span(start - 1, end - 1, label=label, alignment_mode=\"expand\")\n", | |
| "\n", | |
| "                # If an entity STILL fails alignment, the whole record is tainted\n", | |
| "                if span is None:\n", | |
| "                    is_record_clean = False\n", | |
| "                    break # No need to check other entities in this record\n", | |
| "\n", | |
| "            # After checking all entities, sort the record into the correct list\n", | |
| "            if is_record_clean:\n", | |
| "                clean_records.append(record)\n", | |
| "            else:\n", | |
| "                fix_records.append(record)\n", | |
| "\n", | |
| "    # Write the clean records file\n", | |
| "    with open(clean_output_path, 'w', encoding='utf-8') as f:\n", | |
| "        for record in clean_records:\n", | |
| "            f.write(json.dumps(record) + '\\n')\n", | |
| "\n", | |
| "    # Write the records that need fixing\n", | |
| "    with open(needs_fixing_output_path, 'w', encoding='utf-8') as f:\n", | |
| "        for record in fix_records:\n", | |
| "            f.write(json.dumps(record) + '\\n')\n", | |
| "\n", | |
| "    print(\"\\nPartitioning complete.\")\n", | |
| "    print(f\" - Total Records Processed: {len(clean_records) + len(fix_records)}\")\n", | |
| "    print(f\" - Clean Records: {len(clean_records)}\")\n", | |
| "    print(f\" - Records Needing Fixes: {len(fix_records)}\")\n", | |
| "    print(f\"Clean data saved to '{clean_output_path}'\")\n", | |
| "    print(f\"Data to be fixed saved to '{needs_fixing_output_path}'\")\n", | |
| "\n", | |
| "\n", | |
| "# --- Run the Partitioning Script ---\n", | |
| "#INPUT_FILE = \"assets/train.jsonl\"\n", | |
| "INPUT_FILE = \"assets/synthethic-training.jsonl\"\n", | |
| "CLEAN_OUTPUT_FILE = \"assets/train_clean.jsonl\"\n", | |
| "FIX_OUTPUT_FILE = \"assets/train_needs_fixing.jsonl\"\n", | |
| "\n", | |
| "partition_data(INPUT_FILE, CLEAN_OUTPUT_FILE, FIX_OUTPUT_FILE)" | |
| ], | |
| "metadata": { | |
| "colab": { | |
| "base_uri": "https://localhost:8080/" | |
| }, | |
| "id": "D5gry-5uzGm4", | |
| "outputId": "8278a795-f68f-4afa-884b-f62ea2725f7d" | |
| }, | |
| "execution_count": 81, | |
| "outputs": [ | |
| { | |
| "output_type": "stream", | |
| "name": "stdout", | |
| "text": [ | |
| "--- Starting data partitioning for 'assets/synthethic-training.jsonl' ---\n", | |
| "\n", | |
| "Partitioning complete.\n", | |
| " - Total Records Processed: 463\n", | |
| " - Clean Records: 463\n", | |
| " - Records Needing Fixes: 0\n", | |
| "Clean data saved to 'assets/train_clean.jsonl'\n", | |
| "Data to be fixed saved to 'assets/train_needs_fixing.jsonl'\n" | |
| ] | |
| } | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "# scripts/split_data.py (Splits a JSONL file into train and dev sets)\n", | |
| "\n", | |
| "import json\n", | |
| "from sklearn.model_selection import train_test_split\n", | |
| "\n", | |
| "def split_data(input_path, train_output_path, dev_output_path, dev_size=0.2):\n", | |
| "    \"\"\"\n", | |
| "    Reads a JSONL file and splits its records into training and development sets.\n", | |
| "    \"\"\"\n", | |
| "    print(f\"--- Splitting data from '{input_path}' ---\")\n", | |
| "\n", | |
| "    # Read all lines from the source file\n", | |
| "    with open(input_path, 'r', encoding='utf-8') as f:\n", | |
| "        lines = f.readlines()\n", | |
| "\n", | |
| "    if len(lines) < 2:\n", | |
| "        print(\"Warning: Not enough data to split. Need at least 2 records.\")\n", | |
| "        return\n", | |
| "\n", | |
| "    # Use train_test_split to randomly shuffle and split the lines\n", | |
| "    train_lines, dev_lines = train_test_split(lines, test_size=dev_size, random_state=42)\n", | |
| "\n", | |
| "    # Write the training set\n", | |
| "    with open(train_output_path, 'w', encoding='utf-8') as f:\n", | |
| "        for line in train_lines:\n", | |
| "            f.write(line)\n", | |
| "\n", | |
| "    # Write the development set\n", | |
| "    with open(dev_output_path, 'w', encoding='utf-8') as f:\n", | |
| "        for line in dev_lines:\n", | |
| "            f.write(line)\n", | |
| "\n", | |
| "    print(\"\\n✅ Data splitting complete.\")\n", | |
| "    print(f\" - Total Records: {len(lines)}\")\n", | |
| "    print(f\" - Training Records: {len(train_lines)} -> {train_output_path}\")\n", | |
| "    print(f\" - Development Records: {len(dev_lines)} -> {dev_output_path}\")\n", | |
| "\n", | |
| "# --- Run the Splitting Script ---\n", | |
| "INPUT_FILE = \"assets/train_clean.jsonl\"\n", | |
| "TRAIN_SPLIT_FILE = \"assets/train_split.jsonl\"\n", | |
| "DEV_SPLIT_FILE = \"assets/dev_split.jsonl\"\n", | |
| "\n", | |
| "split_data(INPUT_FILE, TRAIN_SPLIT_FILE, DEV_SPLIT_FILE)" | |
| ], | |
| "metadata": { | |
| "colab": { | |
| "base_uri": "https://localhost:8080/" | |
| }, | |
| "id": "eWfzDyqVD6-T", | |
| "outputId": "4220793c-df17-4bda-8b8d-bf3085878c58" | |
| }, | |
| "execution_count": 82, | |
| "outputs": [ | |
| { | |
| "output_type": "stream", | |
| "name": "stdout", | |
| "text": [ | |
| "--- Splitting data from 'assets/train_clean.jsonl' ---\n", | |
| "\n", | |
| "✅ Data splitting complete.\n", | |
| " - Total Records: 463\n", | |
| " - Training Records: 370 -> assets/train_split.jsonl\n", | |
| " - Development Records: 93 -> assets/dev_split.jsonl\n" | |
| ] | |
| } | |
| ] | |
| }, | |
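The 370/93 split reported above follows from the arithmetic: 20% of 463 is 92.6, and the dev share is rounded up. The sketch below reproduces the same counts with the standard library only. It is not scikit-learn's implementation, `split_lines` is a hypothetical helper name, and the shuffled order will differ from what `train_test_split` produces.

```python
import math
import random

def split_lines(lines, dev_size=0.2, seed=42):
    """Shuffle a copy of `lines` and split it; the dev share is rounded up."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_dev = math.ceil(len(shuffled) * dev_size)
    # Everything after the first n_dev items is training data
    return shuffled[n_dev:], shuffled[:n_dev]

# 463 stand-in records, matching the count reported above
records = [f"record-{i}\n" for i in range(463)]
train, dev = split_lines(records)
print(len(train), len(dev))
```

Fixing the seed keeps the split reproducible between runs, which matters when comparing scores from different training configurations.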
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "# # scripts/align_annotations.py\n", | |
| "\n", | |
| "\n", | |
| "import spacy\n", | |
| "import json\n", | |
| "from spacy.tokens import Doc\n", | |
| "\n", | |
| "def get_annotation_spans(record):\n", | |
| "    \"\"\"\n", | |
| "    Retrieves annotations whether they are a flat list (your current data)\n", | |
| "    or nested in a dictionary (LabelStudio style).\n", | |
| "    \"\"\"\n", | |
| "    raw = record.get('annotations', [])\n", | |
| "\n", | |
| "    # Case 1: It is already a list (Your current format)\n", | |
| "    if isinstance(raw, list):\n", | |
| "        return raw\n", | |
| "\n", | |
| "    # Case 2: It is a dictionary (Nested format)\n", | |
| "    if isinstance(raw, dict):\n", | |
| "        try:\n", | |
| "            return raw.get('annotations', [])\n", | |
| "        except (AttributeError, TypeError):\n", | |
| "            pass\n", | |
| "\n", | |
| "    return []\n", | |
| "\n", | |
| "def align_annotations(input_path, output_path):\n", | |
| "    # Load the model for tokenization reference\n", | |
| "    nlp = spacy.load('la_core_web_lg')\n", | |
| "    corrected_records = []\n", | |
| "\n", | |
| "    # Stats tracking\n", | |
| "    stats = {\n", | |
| "        \"total_records\": 0,\n", | |
| "        \"malformed_ents\": 0,\n", | |
| "        \"unaligned_ents\": 0,\n", | |
| "        \"auto_fixed_ents\": 0,\n", | |
| "        \"perfect_ents\": 0\n", | |
| "    }\n", | |
| "\n", | |
| "    with open(input_path, 'r', encoding='utf-8') as f:\n", | |
| "        for i, line in enumerate(f):\n", | |
| "            try:\n", | |
| "                record = json.loads(line)\n", | |
| "                stats[\"total_records\"] += 1\n", | |
| "            except json.JSONDecodeError:\n", | |
| "                continue\n", | |
| "\n", | |
| "            # Ensure we are targeting the field that matches the character offsets\n", | |
| "            # Based on your example: 0-11 \"DIS MANIBUS\" matches 'transcription', not 'text'\n", | |
| "            text = record.get('transcription', '')\n", | |
| "            if not text:\n", | |
| "                continue\n", | |
| "\n", | |
| "            annotations = get_annotation_spans(record)\n", | |
| "\n", | |
| "            doc = nlp.make_doc(text)\n", | |
| "            corrected_ents = []\n", | |
| "\n", | |
| "            for entity in annotations:\n", | |
| "                # Safety check\n", | |
| "                if not isinstance(entity, list) or len(entity) != 3:\n", | |
| "                    stats[\"malformed_ents\"] += 1\n", | |
| "                    continue\n", | |
| "\n", | |
| "                start, end, label = entity\n", | |
| "\n", | |
| "                span = None\n", | |
| "\n", | |
| "                # Attempt A: Original indices\n", | |
| "                span = doc.char_span(start, end, label=label, alignment_mode=\"expand\")\n", | |
| "                if span is not None:\n", | |
| "                    stats[\"perfect_ents\"] += 1\n", | |
| "\n", | |
| "                # Attempt B: +1 shift\n", | |
| "                if span is None:\n", | |
| "                    span = doc.char_span(start + 1, end + 1, label=label, alignment_mode=\"expand\")\n", | |
| "                    if span is not None:\n", | |
| "                        stats[\"auto_fixed_ents\"] += 1\n", | |
| "\n", | |
| "                # Attempt C: -1 shift\n", | |
| "                if span is None:\n", | |
| "                    span = doc.char_span(start - 1, end - 1, label=label, alignment_mode=\"expand\")\n", | |
| "                    if span is not None:\n", | |
| "                        stats[\"auto_fixed_ents\"] += 1\n", | |
| "\n", | |
| "                # Final Decision\n", | |
| "                if span is not None:\n", | |
| "                    corrected_ents.append([span.start_char, span.end_char, label])\n", | |
| "                else:\n", | |
| "                    # Optional: Print failures to debug specific lines\n", | |
| "                    # print(f\"Could not align: '{text[start:end]}' ({label}) in text\")\n", | |
| "                    stats[\"unaligned_ents\"] += 1\n", | |
| "\n", | |
| "            # --- UPDATE RECORD ---\n", | |
| "            # We save it back as a flat list to keep it simple and matching input format\n", | |
| "            record['annotations'] = corrected_ents\n", | |
| "            corrected_records.append(record)\n", | |
| "\n", | |
| "    # Write output\n", | |
| "    with open(output_path, 'w', encoding='utf-8') as f:\n", | |
| "        for record in corrected_records:\n", | |
| "            f.write(json.dumps(record) + '\\n')\n", | |
| "\n", | |
| "    print(f\"\\n✅ Alignment complete for {input_path}.\")\n", | |
| "    print(f\" - Processed Records: {stats['total_records']}\")\n", | |
| "    print(f\" - Perfect Annotations: {stats['perfect_ents']}\")\n", | |
| "    print(f\" - Fixed Offsets: {stats['auto_fixed_ents']}\")\n", | |
| "    print(f\" - Malformed/Skipped: {stats['malformed_ents']}\")\n", | |
| "    print(f\" - Unalignable (Dropped): {stats['unaligned_ents']}\")\n", | |
| "\n", | |
| "# Run it\n", | |
| "align_annotations('assets/train_split.jsonl', 'assets/train_aligned.jsonl')\n", | |
| "align_annotations('assets/dev_split.jsonl', 'assets/dev_aligned.jsonl')" | |
| ], | |
| "metadata": { | |
| "colab": { | |
| "base_uri": "https://localhost:8080/" | |
| }, | |
| "id": "n_OXbwRFkF7F", | |
| "outputId": "96bb5814-68ec-4fb5-f3e3-495302d617a7" | |
| }, | |
| "execution_count": 83, | |
| "outputs": [ | |
| { | |
| "output_type": "stream", | |
| "name": "stdout", | |
| "text": [ | |
| "\n", | |
| "✅ Alignment complete for assets/train_split.jsonl.\n", | |
| " - Processed Records: 370\n", | |
| " - Perfect Annotations: 2730\n", | |
| " - Fixed Offsets: 0\n", | |
| " - Malformed/Skipped: 0\n", | |
| " - Unalignable (Dropped): 0\n", | |
| "\n", | |
| "✅ Alignment complete for assets/dev_split.jsonl.\n", | |
| " - Processed Records: 93\n", | |
| " - Perfect Annotations: 671\n", | |
| " - Fixed Offsets: 0\n", | |
| " - Malformed/Skipped: 0\n", | |
| " - Unalignable (Dropped): 0\n" | |
| ] | |
| } | |
| ] | |
| }, | |
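The alignment cell relies on `alignment_mode="expand"`, which spaCy documents as returning the span of all tokens at least partially covered by the character range. The sketch below imitates that snapping with plain whitespace tokens, so it runs without spaCy. That is a simplifying assumption: the LatinCy tokenizer may split text differently (punctuation and enclitics, for example), and `expand_to_tokens` is a hypothetical helper, not a spaCy function.

```python
import re

def expand_to_tokens(text, start, end):
    """Snap [start, end) outward to whitespace-token boundaries.

    Returns (new_start, new_end), or None when the range touches no token.
    """
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    # A token is "touched" when it overlaps the requested character range
    touched = [(s, e) for s, e in tokens if s < end and e > start]
    if not touched:
        return None
    return touched[0][0], touched[-1][1]

text = "DIS MANIBUS C IULIUS"
print(expand_to_tokens(text, 0, 11))  # already on token boundaries
print(expand_to_tokens(text, 1, 9))   # sloppy offsets inside both words
print(expand_to_tokens(text, 3, 4))   # only the space between two words
```

Under this model, expansion succeeds whenever the offsets touch any token, so the +1 and -1 retries would only matter for ranges that touch nothing. That may be why the run above reports zero fixed offsets, though it also means slightly wrong offsets are silently widened rather than flagged.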
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "# # scripts/prepare_data.py\n", | |
| "\n", | |
| "\n", | |
| "import spacy\n", | |
| "import json\n", | |
| "from spacy.tokens import DocBin\n", | |
| "from spacy.util import filter_spans\n", | |
| "\n", | |
| "def get_annotation_spans(record):\n", | |
| "    raw = record.get('annotations')\n", | |
| "    if isinstance(raw, list): return raw\n", | |
| "    try: return raw['annotations']\n", | |
| "    except (KeyError, TypeError): return []\n", | |
| "\n", | |
| "def create_spacy_file(input_path, output_path):\n", | |
| "    nlp = spacy.load('la_core_web_lg')\n", | |
| "    db = DocBin()\n", | |
| "\n", | |
| "    stats = {\n", | |
| "        \"docs\": 0,\n", | |
| "        \"total_ents\": 0,\n", | |
| "        \"dropped_ents\": 0\n", | |
| "    }\n", | |
| "\n", | |
| "    print(f\"--- Processing '{input_path}' ---\")\n", | |
| "\n", | |
| "    with open(input_path, 'r', encoding='utf-8') as f:\n", | |
| "        for line in f:\n", | |
| "            try:\n", | |
| "                record = json.loads(line)\n", | |
| "            except json.JSONDecodeError:\n", | |
| "                continue\n", | |
| "\n", | |
| "            text = record.get('transcription')\n", | |
| "            if not text:\n", | |
| "                continue\n", | |
| "\n", | |
| "            doc = nlp.make_doc(text)\n", | |
| "            ents = []\n", | |
| "            annotations = get_annotation_spans(record)\n", | |
| "\n", | |
| "            if isinstance(annotations, list):\n", | |
| "                for entity in annotations:\n", | |
| "                    if len(entity) != 3: continue\n", | |
| "                    start, end, label = entity\n", | |
| "\n", | |
| "                    span = doc.char_span(start, end, label=label, alignment_mode=\"expand\")\n", | |
| "\n", | |
| "                    if span is not None:\n", | |
| "                        ents.append(span)\n", | |
| "                    else:\n", | |
| "                        stats[\"dropped_ents\"] += 1\n", | |
| "                        # Optional: print what was dropped to debug\n", | |
| "                        print(f\"Dropping invalid span: [{start}:{end}] in '{text[:20]}...'\")\n", | |
| "\n", | |
| "            # Remove duplicates/overlaps\n", | |
| "            original_count = len(ents)\n", | |
| "            filtered_ents = filter_spans(ents)\n", | |
| "\n", | |
| "            if len(filtered_ents) < original_count:\n", | |
| "                stats[\"dropped_ents\"] += (original_count - len(filtered_ents))\n", | |
| "\n", | |
| "            doc.ents = filtered_ents\n", | |
| "            stats[\"total_ents\"] += len(filtered_ents)\n", | |
| "            stats[\"docs\"] += 1\n", | |
| "            db.add(doc)\n", | |
| "\n", | |
| "    db.to_disk(output_path)\n", | |
| "\n", | |
| "    print(f\"✅ Saved {output_path}\")\n", | |
| "    print(f\" - Documents: {stats['docs']}\")\n", | |
| "    print(f\" - Total Entities: {stats['total_ents']} (Avg: {stats['total_ents']/stats['docs']:.1f} per doc)\")\n", | |
| "    print(f\" - Dropped/Failed: {stats['dropped_ents']}\")\n", | |
| "\n", | |
| "# --- Execute ---\n", | |
| "create_spacy_file('assets/train_aligned.jsonl', './corpus/train.spacy')\n", | |
| "create_spacy_file('assets/dev_aligned.jsonl', './corpus/dev.spacy')" | |
| ], | |
| "metadata": { | |
| "colab": { | |
| "base_uri": "https://localhost:8080/" | |
| }, | |
| "id": "hfhsbQTwhqxE", | |
| "outputId": "5613d010-13fc-4d94-d12d-09e49fe46c52" | |
| }, | |
| "execution_count": 84, | |
| "outputs": [ | |
| { | |
| "output_type": "stream", | |
| "name": "stdout", | |
| "text": [ | |
| "--- Processing 'assets/train_aligned.jsonl' ---\n", | |
| "✅ Saved ./corpus/train.spacy\n", | |
| " - Documents: 370\n", | |
| " - Total Entities: 1802 (Avg: 4.9 per doc)\n", | |
| " - Dropped/Failed: 928\n", | |
| "--- Processing 'assets/dev_aligned.jsonl' ---\n", | |
| "✅ Saved ./corpus/dev.spacy\n", | |
| " - Documents: 93\n", | |
| " - Total Entities: 426 (Avg: 4.6 per doc)\n", | |
| " - Dropped/Failed: 245\n" | |
| ] | |
| } | |
| ] | |
| }, | |
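The output above drops 928 of the 2730 aligned training annotations even though no "Dropping invalid span" line was printed, so the losses presumably come from `filter_spans`, which keeps the longest span when spans overlap. Here is a simplified sketch of that rule on plain `(start, end, label)` tuples. `keep_longest` is a hypothetical helper working on character offsets rather than spaCy's own code, and the labels are invented.

```python
def keep_longest(spans):
    """Drop overlapping (start, end, label) spans, preferring longer, then earlier, ones."""
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept, taken = [], set()
    for start, end, label in ordered:
        positions = set(range(start, end))
        # Keep the span only if none of its characters are already claimed
        if positions.isdisjoint(taken):
            kept.append((start, end, label))
            taken |= positions
    return sorted(kept)

spans = [
    (12, 24, "NAME"),       # whole name
    (12, 17, "PRAENOMEN"),  # nested inside NAME, so it is dropped
    (18, 24, "NOMEN"),      # nested inside NAME, so it is dropped
    (37, 40, "AGE"),
]
print(keep_longest(spans))
```

If the synthetic annotations contain nested labels like these, roughly a third of them never reach the model. Keeping nested entities would need a different setup, for example spaCy's span categorizer instead of `doc.ents`.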
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "## Train the Model" | |
| ], | |
| "metadata": { | |
| "id": "D69ZbsLOceOD" | |
| } | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "import spacy\n", | |
| "from pathlib import Path\n", | |
| "\n", | |
| "# --- 1. Generate the base config ---\n", | |
| "!python -m spacy init config configs/config.cfg --lang la --pipeline tok2vec,ner --optimize accuracy --force\n", | |
| "\n", | |
| "print(\"✅ Base 'config.cfg' generated.\")\n", | |
| "\n", | |
| "# --- 2. Load and Modify ---\n", | |
| "config_path = Path(\"configs/config.cfg\")\n", | |
| "config = spacy.util.load_config(config_path)\n", | |
| "\n", | |
| "# Define the model we are using\n", | |
| "LATIN_MODEL = \"la_core_web_lg\"\n", | |
| "\n", | |
| "# --- Part A: Initialize Vectors (CRITICAL FOR LG MODELS) ---\n", | |
| "# This loads the 300-dim vectors into the vocab so the tok2vec layer can find them.\n", | |
| "config[\"initialize\"][\"vectors\"] = LATIN_MODEL\n", | |
| "\n", | |
| "# --- Part B: Source the tok2vec component ---\n", | |
| "config[\"components\"][\"tok2vec\"] = {\n", | |
| "    \"source\": LATIN_MODEL,\n", | |
| "    \"component\": \"tok2vec\"\n", | |
| "}\n", | |
| "\n", | |
| "# --- Part C: Connect NER to the vectors ---\n", | |
| "# ERROR CORRECTION: The tok2vec OUTPUT width is 96, even if the input vectors are 300.\n", | |
| "config[\"components\"][\"ner\"][\"model\"][\"tok2vec\"] = {\n", | |
| "    \"@architectures\": \"spacy.Tok2VecListener.v1\",\n", | |
| "    \"width\": 96, # <--- Reverted to 96. This matches the output of la_core_web_lg.\n", | |
| "    \"upstream\": \"tok2vec\"\n", | |
| "}\n", | |
| "\n", | |
| "config[\"nlp\"][\"batch_size\"] = 200\n", | |
| "\n", | |
| "# --- Part D: Paths and Freezing ---\n", | |
| "config[\"paths\"][\"train\"] = \"./corpus/train.spacy\"\n", | |
| "config[\"paths\"][\"dev\"] = \"./corpus/dev.spacy\"\n", | |
| "\n", | |
| "# Freeze tok2vec so we don't ruin the pretrained Latin intelligence\n", | |
| "#config[\"training\"][\"frozen_components\"] = [\"tok2vec\"]\n", | |
| "# or unfreeze it, see what happens\n", | |
| "config[\"training\"][\"frozen_components\"] = []\n", | |
| "config[\"training\"][\"max_epochs\"]= 100\n", | |
| "# Mark it as annotating so it actually runs\n", | |
| "config[\"training\"][\"annotating_components\"] = [\"tok2vec\"]\n", | |
| "\n", | |
| "# --- 3. Save ---\n", | |
| "config.to_disk(config_path)\n", | |
| "\n", | |
| "print(f\"✅ Config updated for {LATIN_MODEL}. Listener width set to 96 (correct output dim).\")" | |
| ], | |
| "metadata": { | |
| "colab": { | |
| "base_uri": "https://localhost:8080/" | |
| }, | |
| "id": "AzoGViEoiBaZ", | |
| "outputId": "f59453c5-b27c-4800-bd95-fa446837bd67" | |
| }, | |
| "execution_count": 95, | |
| "outputs": [ | |
| { | |
| "output_type": "stream", | |
| "name": "stdout", | |
| "text": [ | |
| "\u001b[38;5;4mℹ Generated config template specific for your use case\u001b[0m\n", | |
| "- Language: la\n", | |
| "- Pipeline: ner\n", | |
| "- Optimize for: accuracy\n", | |
| "- Hardware: CPU\n", | |
| "- Transformer: None\n", | |
| "\u001b[38;5;2m✔ Auto-filled config with all values\u001b[0m\n", | |
| "\u001b[38;5;2m✔ Saved config\u001b[0m\n", | |
| "configs/config.cfg\n", | |
| "You can now add your data and train your pipeline:\n", | |
| "python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy\n", | |
| "✅ Base 'config.cfg' generated.\n", | |
| "✅ Config updated for la_core_web_lg. Listener width set to 96 (correct output dim).\n" | |
| ] | |
| } | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "# Start the training process!\n", | |
| "!python -m spacy train configs/config.cfg --output ./training/ --gpu-id 0" | |
| ], | |
| "metadata": { | |
| "colab": { | |
| "base_uri": "https://localhost:8080/" | |
| }, | |
| "id": "S4TD1wEdjMny", | |
| "outputId": "be070055-b60e-4665-fa95-158c0af971da" | |
| }, | |
| "execution_count": 96, | |
| "outputs": [ | |
| { | |
| "output_type": "stream", | |
| "name": "stdout", | |
| "text": [ | |
| "\u001b[38;5;4mℹ Saving to output directory: training\u001b[0m\n", | |
| "\u001b[38;5;4mℹ Using GPU: 0\u001b[0m\n", | |
| "\u001b[1m\n", | |
| "=========================== Initializing pipeline ===========================\u001b[0m\n", | |
| "\u001b[38;5;2m✔ Initialized pipeline\u001b[0m\n", | |
| "\u001b[1m\n", | |
| "============================= Training pipeline =============================\u001b[0m\n", | |
| "\u001b[38;5;4mℹ Pipeline: ['tok2vec', 'ner']\u001b[0m\n", | |
| "\u001b[38;5;4mℹ Set annotations on update for: ['tok2vec']\u001b[0m\n", | |
| "\u001b[38;5;4mℹ Initial learn rate: 0.001\u001b[0m\n", | |
| "E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE \n", | |
| "---  ------  ------------  --------  ------  ------  ------  ------\n", | |
| "  0       0          0.00     76.23   19.74   25.27   16.20    0.20\n", | |
| "  5     200        161.75   5016.42   86.74   88.32   85.21    0.87\n", | |
| " 12     400        154.83   2110.54   89.13   89.76   88.50    0.89\n", | |
| " 20     600        201.80   1797.31   88.60   87.79   89.44    0.89\n", | |
| " 30     800        110.59   1658.81   89.20   88.28   90.14    0.89\n", | |
| " 43    1000        118.37   1814.81   88.79   88.37   89.20    0.89\n", | |
| " 58    1200        119.99   2013.56   89.02   88.60   89.44    0.89\n", | |
| " 77    1400        144.55   2340.61   89.46   89.25   89.67    0.89\n", | |
| " 99    1600        155.80   2655.94   88.04   87.13   88.97    0.88\n", | |
| "\u001b[38;5;2m✔ Saved pipeline to output directory\u001b[0m\n", | |
| "training/model-last\n" | |
| ] | |
| } | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "## Interpreting those numbers:\n", | |
| "+ E (Epoch): An epoch is one complete pass through your entire training dataset. Each row shows the state of the model after E epochs have been completed.\n", | |
| "\n", | |
| "+ # (Step): This is the number of training steps (batch updates) the model has processed so far.\n", | |
| "\n", | |
| "+ LOSS TOK2VEC (Loss for Token-to-Vector): This component is responsible for learning meaningful numerical representations of your words. Like any \"loss\" value, you want to see it decrease over time. Its fluctuation is normal as it and the NER component influence each other.\n", | |
| "\n", | |
| "+ LOSS NER (Loss for Named Entity Recognition): This is the most important loss metric for the project. It measures how \"wrong\" the model's entity predictions are, so lower is better. In the run above it falls from 5016.42 at step 200 to 1658.81 at step 800 and then drifts back up while ENTS_F stays flat at around 89, which suggests the model stopped improving after the first few hundred steps.\n", | |
| "\n", | |
| "+ ENTS_P (Entities Precision): Of all the entities the model predicted, this is the percentage that were actually correct. A score of 100.00 means the model made no false positive predictions.\n", | |
| "\n", | |
| "+ ENTS_R (Entities Recall): Of all the actual entities in the data, this is the percentage that the model successfully found. A score of 100.00 means the model made no false negative predictions (it didn't miss anything).\n", | |
| "\n", | |
| "+ ENTS_F (Entities F-score): This is the harmonic mean of Precision and Recall, F = 2PR / (P + R), and it's generally treated as the single most useful summary of a model's performance because it balances the two. At step 1400 above, 2 x 89.25 x 89.67 / (89.25 + 89.67) gives 89.46.\n", | |
| "\n", | |
| "+ SCORE: This is the final score spaCy uses to rank checkpoints. For an NER-only pipeline like this one, it is the ENTS_F score rescaled to 0-1 (so 89.46 shows up as 0.89). The checkpoint with the highest score is saved as model-best; the final one is saved as model-last.\n", | |
| "\n", | |
| "+ Perfect Scores and Overfitting: If the model achieved a perfect F-score of 100.00, it would essentially have memorized the data it was scored on. In many machine learning scenarios, this would be a major red flag for a problem called overfitting, where a model learns its training data so perfectly that it fails to generalize to new, unseen data. Roman funerary inscriptions, however, are highly formulaic. The patterns for names, military units, and ages are quite regular, so with a dataset of 197 records a modern model should be able to memorize them. In this run the scores level off around 89 rather than 100." | |
| ], | |
| "metadata": { | |
| "id": "NQz6yNhjoH2v" | |
| } | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "# testing for overfitting\n", | |
| "import spacy\n", | |
| "from spacy.scorer import Scorer\n", | |
| "from spacy.training import Example\n", | |
| "from spacy.tokens import DocBin\n", | |
| "\n", | |
| "def evaluate_detailed(model_path, dev_data_path):\n", | |
| "    # Load the best model\n", | |
| "    nlp = spacy.load(model_path)\n", | |
| "\n", | |
| "    # Load the Dev data\n", | |
| "    db = DocBin().from_disk(dev_data_path)\n", | |
| "    docs = list(db.get_docs(nlp.vocab))\n", | |
| "\n", | |
| "    examples = []\n", | |
| "    for doc in docs:\n", | |
| "        # Create an Example object (predicted vs reference)\n", | |
| "        pred_doc = nlp(doc.text)\n", | |
| "        examples.append(Example(pred_doc, doc))\n", | |
| "\n", | |
| "    # Calculate scores\n", | |
| "    scorer = Scorer()\n", | |
| "    scores = scorer.score(examples)\n", | |
| "\n", | |
| "    # Print Global Score\n", | |
| "    print(f\"--- Global Results ---\")\n", | |
| "    print(f\"Precision: {scores['ents_p']:.2f}\")\n", | |
| "    print(f\"Recall: {scores['ents_r']:.2f}\")\n", | |
| "    print(f\"F-Score: {scores['ents_f']:.2f}\")\n", | |
| "\n", | |
| "    # Print Per-Label Score\n", | |
| "    print(f\"\\n--- Per-Label Breakdown ---\")\n", | |
| "    print(f\"{'LABEL':<30} {'PREC':<10} {'REC':<10} {'F-SCORE':<10}\")\n", | |
| "    print(\"-\" * 60)\n", | |
| "\n", | |
| "    # Pad each value to the same width as its header column\n", | |
| "    for label, metrics in scores['ents_per_type'].items():\n", | |
| "        print(f\"{label:<30} {metrics['p']:<10.2f} {metrics['r']:<10.2f} {metrics['f']:<10.2f}\")\n", | |
| "\n", | |
| "# Run evaluation\n", | |
| "evaluate_detailed(\"training/model-best\", \"corpus/dev.spacy\")" | |
| ], | |
| "metadata": { | |
| "colab": { | |
| "base_uri": "https://localhost:8080/" | |
| }, | |
| "id": "2Ai3y7pTb_IU", | |
| "outputId": "aa0625e5-c9ae-4179-b976-0a340344062c" | |
| }, | |
| "execution_count": 97, | |
| "outputs": [ | |
| { | |
| "output_type": "stream", | |
| "name": "stdout", | |
| "text": [ | |
| "--- Global Results ---\n", | |
| "Precision: 0.89\n", | |
| "Recall: 0.90\n", | |
| "F-Score: 0.89\n", | |
| "\n", | |
| "--- Per-Label Breakdown ---\n", | |
| "LABEL                          PREC       REC        F-SCORE   \n", | |
| "------------------------------------------------------------\n", | |
| "DEDICATION_TO_THE_GODS         0.97       0.97       0.97\n", | |
| "NOMEN                          0.99       0.99       0.99\n", | |
| "COGNOMEN                       0.90       0.89       0.90\n", | |
| "AGE_PREFIX                     0.98       1.00       0.99\n", | |
| "DEDICATOR_NAME                 1.00       1.00       1.00\n", | |
| "BENE_MERENTI                   0.35       0.39       0.37\n", | |
| "FUNERARY_FORMULA               1.00       1.00       1.00\n", | |
| "RELATIONSHIP                   0.14       0.33       0.20\n", | |
| "MILITARY_UNIT                  0.59       0.54       0.57\n", | |
| "TRIBE                          1.00       1.00       1.00\n", | |
| "OCCUPATION                     1.00       0.50       0.67\n", | |
| "VERB                           0.25       0.25       0.25\n", | |
| "AGE_DAYS                       0.00       0.00       0.00\n", | |
| "AGE_YEARS                      0.00       0.00       0.00\n", | |
| "FILIATION                      0.67       0.67       0.67\n" | |
| ] | |
| } | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "import spacy\n", | |
| "from spacy.scorer import Scorer\n", | |
| "from spacy.training import Example\n", | |
| "from spacy.tokens import DocBin\n", | |
| "\n", | |
| "def evaluate_final(model_path, dev_data_path):\n", | |
| "    print(f\"--- Evaluating {model_path} ---\")\n", | |
| "    nlp = spacy.load(model_path)\n", | |
| "    db = DocBin().from_disk(dev_data_path)\n", | |
| "    docs = list(db.get_docs(nlp.vocab))\n", | |
| "\n", | |
| "    # Pair each model prediction with its gold-standard doc\n", | |
| "    examples = []\n", | |
| "    for doc in docs:\n", | |
| "        examples.append(Example(nlp(doc.text), doc))\n", | |
| "\n", | |
| "    scores = Scorer().score(examples)\n", | |
| "\n", | |
| "    print(f\"{'LABEL':<30} {'PREC':<8} {'REC':<8} {'F1':<8}\")\n", | |
| "    print(\"-\" * 60)\n", | |
| "    for label, metrics in scores['ents_per_type'].items():\n", | |
| "        print(f\"{label:<30} {metrics['p']:<8.2f} {metrics['r']:<8.2f} {metrics['f']:<8.2f}\")\n", | |
| "\n", | |
| "evaluate_final(\"training/model-best\", \"corpus/dev.spacy\")" | |
| ], | |
| "metadata": { | |
| "colab": { | |
| "base_uri": "https://localhost:8080/" | |
| }, | |
| "id": "_IyZVwTtnD74", | |
| "outputId": "c3c79fae-561e-4b8a-f8be-ea82db3e4a7b" | |
| }, | |
| "execution_count": 103, | |
| "outputs": [ | |
| { | |
| "output_type": "stream", | |
| "name": "stdout", | |
| "text": [ | |
| "--- Evaluating training/model-best ---\n", | |
| "LABEL                          PREC     REC      F1      \n", | |
| "------------------------------------------------------------\n", | |
| "DEDICATION_TO_THE_GODS         0.97     0.97     0.97\n", | |
| "NOMEN                          0.99     0.99     0.99\n", | |
| "COGNOMEN                       0.90     0.89     0.90\n", | |
| "AGE_PREFIX                     0.98     1.00     0.99\n", | |
| "DEDICATOR_NAME                 1.00     1.00     1.00\n", | |
| "BENE_MERENTI                   0.35     0.39     0.37\n", | |
| "FUNERARY_FORMULA               1.00     1.00     1.00\n", | |
| "RELATIONSHIP                   0.14     0.33     0.20\n", | |
| "MILITARY_UNIT                  0.59     0.54     0.57\n", | |
| "TRIBE                          1.00     1.00     1.00\n", | |
| "OCCUPATION                     1.00     0.50     0.67\n", | |
| "VERB                           0.25     0.25     0.25\n", | |
| "AGE_DAYS                       0.00     0.00     0.00\n", | |
| "AGE_YEARS                      0.00     0.00     0.00\n", | |
| "FILIATION                      0.67     0.67     0.67\n" | |
| ] | |
| } | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "import spacy\n", | |
| "import pandas as pd\n", | |
| "import re\n", | |
| "\n", | |
| "# 1. Load Model\n", | |
| "print(\"⏳ Loading Model...\")\n", | |
| "nlp = spacy.load(\"training/model-best\")\n", | |
| "\n", | |
| "# 2. Stricter Regex (The \"Calculator\")\n", | |
| "# Only match specific keywords like ANNIS/MENS.\n", | |
| "# Single letters (A, M, D) count only when followed immediately by a dot, to avoid matching names like \"Marcus\".\n", | |
| "# The numeral must be separated from the keyword by whitespace (or directly follow an abbreviation dot),\n", | |
| "# so that a word like \"ANNI\" is not misread as ANN + the numeral I.\n", | |
| "REGEX_YEARS = re.compile(r'\\b(VIXIT|VIX|ANNIS|ANN|AN|A\\.)(?:\\s+|(?<=\\.))([IVXLCDM]+)\\b', re.IGNORECASE)\n", | |
| "REGEX_MONTHS = re.compile(r'\\b(MENSIBUS|MENS|MEN|M\\.)(?:\\s+|(?<=\\.))([IVXLCDM]+)\\b', re.IGNORECASE)\n", | |
| "REGEX_DAYS = re.compile(r'\\b(DIEBUS|DIE|D\\.)(?:\\s+|(?<=\\.))([IVXLCDM]+)\\b', re.IGNORECASE)\n", | |
| "\n", | |
| "# 3. Clean & Normalize Text\n", | |
| "def normalize_text(text):\n", | |
| "    if pd.isna(text) or text == \"\": return \"\"\n", | |
| "    text = str(text)\n", | |
| "\n", | |
| "    # Standardize Leiden Brackets\n", | |
| "    text = re.sub(r\"\\[\\s*-+\\??\\s*\\]\", \"\", text) # Remove [---]\n", | |
| "    text = re.sub(r\"-+\\]\", \"\", text) # Remove half-open runs like ---]\n", | |
| "    text = re.sub(r\"\\[-+\", \"\", text) # Remove half-open runs like [---\n", | |
| "    text = text.replace(\"/\", \" \").replace(\"(\", \"\").replace(\")\", \"\")\n", | |
| "    text = text.replace(\"[\", \"\").replace(\"]\", \"\").replace(\"?\", \"\")\n", | |
| "    text = re.sub(r\"\\s+\", \" \", text).strip()\n", | |
| "\n", | |
| "    # CRITICAL FIX: Latin inscriptions are traditionally processed in UPPERCASE.\n", | |
| "    # This presumably matches the casing the model was trained on.\n", | |
| "    return text.upper()\n", | |
| "\n", | |
| "def extract_data(text):\n", | |
| "    data = {}\n", | |
| "    doc = nlp(text)\n", | |
| "\n", | |
| "    # --- A. Neural Network Extraction ---\n", | |
| "    ents_by_label = {}\n", | |
| "\n", | |
| "    for ent in doc.ents:\n", | |
| "        # Skip specific labels we handle with Regex\n", | |
| "        if ent.label_ in [\"AGE_YEARS\", \"AGE_MONTHS\", \"AGE_DAYS\", \"AGE_PREFIX\"]:\n", | |
| "            continue\n", | |
| "\n", | |
| "        # Organize by label\n", | |
| "        if ent.label_ not in ents_by_label:\n", | |
| "            ents_by_label[ent.label_] = []\n", | |
| "        ents_by_label[ent.label_].append(ent.text)\n", | |
| "\n", | |
| "    for label, values in ents_by_label.items():\n", | |
| "        data[label] = \"; \".join(values)\n", | |
| "\n", | |
| "    # --- B. Regex Extraction ---\n", | |
| "    # Years\n", | |
| "    match_yr = REGEX_YEARS.search(text)\n", | |
| "    if match_yr:\n", | |
| "        data[\"AGE_YEARS\"] = match_yr.group(2)\n", | |
| "\n", | |
| "    # Months\n", | |
| "    match_mo = REGEX_MONTHS.search(text)\n", | |
| "    if match_mo:\n", | |
| "        data[\"AGE_MONTHS\"] = match_mo.group(2)\n", | |
| "\n", | |
| "    # Days\n", | |
| "    match_da = REGEX_DAYS.search(text)\n", | |
| "    if match_da:\n", | |
| "        data[\"AGE_DAYS\"] = match_da.group(2)\n", | |
| "\n", | |
| "    return data\n", | |
| "\n", | |
| "# --- Execution ---\n", | |
| "input_csv = \"assets/inscriptions.csv\"\n", | |
| "output_csv = \"assets/final_database_fixed.csv\"\n", | |
| "\n", | |
| "df = pd.read_csv(input_csv)\n", | |
| "results = []\n", | |
| "\n", | |
| "print(f\"🚀 Processing {len(df)} records...\")\n", | |
| "\n", | |
| "for index, row in df.iterrows():\n", | |
| "    # Prefer diplomatic text if available, otherwise transcription\n", | |
| "    raw = row.get('text', '')\n", | |
| "    if not raw or pd.isna(raw):\n", | |
| "        raw = row.get('transcription', '')\n", | |
| "\n", | |
| "    clean_text = normalize_text(raw)\n", | |
| "\n", | |
| "    if len(clean_text) < 3: continue\n", | |
| "\n", | |
| "    info = extract_data(clean_text)\n", | |
| "    info['id'] = row.get('id')\n", | |
| "    info['clean_text'] = clean_text # Inspect this to ensure it is UPPERCASE\n", | |
| "    results.append(info)\n", | |
| "\n", | |
| "# --- Save ---\n", | |
| "results_df = pd.DataFrame(results)\n", | |
| "\n", | |
| "# Column Ordering Logic\n", | |
| "desired_order = [\n", | |
| "    'id', 'clean_text',\n", | |
| "    'DEDICATION_TO_THE_GODS',\n", | |
| "    'NOMEN', 'COGNOMEN', 'PRAENOMEN', 'DECEASED_NAME',\n", | |
| "    'AGE_YEARS', 'AGE_MONTHS', 'AGE_DAYS',\n", | |
| "    'DEDICATOR_NAME', 'RELATIONSHIP',\n", | |
| "    'MILITARY_UNIT', 'OCCUPATION', 'TRIBE'\n", | |
| "]\n", | |
| "\n", | |
| "# Find which columns actually exist in the results\n", | |
| "existing_cols = results_df.columns.tolist()\n", | |
| "final_cols = [c for c in desired_order if c in existing_cols] + [c for c in existing_cols if c not in desired_order]\n", | |
| "\n", | |
| "results_df = results_df[final_cols]\n", | |
| "results_df.to_csv(output_csv, index=False)\n", | |
| "\n", | |
| "print(\"✅ Done. Check 'clean_text' column - it should be ALL CAPS.\")" | |
| ], | |
| "metadata": { | |
| "colab": { | |
| "base_uri": "https://localhost:8080/" | |
| }, | |
| "id": "TP1SEgmfcKJr", | |
| "outputId": "f1f8c67d-1023-4939-b9a0-ef71dddbb0a5" | |
| }, | |
| "execution_count": 118, | |
| "outputs": [ | |
| { | |
| "output_type": "stream", | |
| "name": "stdout", | |
| "text": [ | |
| "⏳ Loading Model...\n", | |
| "🚀 Processing 500 records...\n", | |
| "✅ Done. Check 'clean_text' column - it should be ALL CAPS.\n" | |
| ] | |
| } | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "## Let's get the latinepi tools and download some real data" | |
| ], | |
| "metadata": { | |
| "id": "7yIWmxj3rwX8" | |
| } | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "!git clone https://github.com/shawngraham/latinepi.git\n", | |
| "%cd latinepi\n", | |
| "\n", | |
| "# Install the package (includes pandas and requests dependencies)\n", | |
| "!pip install -e .\n", | |
| "%cd .." | |
| ], | |
| "metadata": { | |
| "id": "PWBJD1ZcrzHM" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "# Path is used below to count the downloaded files\n", | |
| "from pathlib import Path\n", | |
| "\n", | |
| "# Search for 1st century AD inscriptions\n", | |
| "print(\"🔍 Searching for 1st century AD inscriptions...\\n\")\n", | |
| "\n", | |
| "!latinepi \\\n", | |
| "    --search-edh \\\n", | |
| "    --search-year-from 1 \\\n", | |
| "    --search-year-to 100 \\\n", | |
| "    --search-limit 500 \\\n", | |
| "    --download-dir edh_downloads/first_century/\n", | |
| "\n", | |
| "# Check results\n", | |
| "century_files = list(Path('edh_downloads/first_century').glob('*.json'))\n", | |
| "print(f\"\\n✅ Downloaded {len(century_files)} inscriptions from 1st century AD\")\n" | |
| ], | |
| "metadata": { | |
| "id": "ddH0y2lor8yC" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "import json\n", | |
| "import csv\n", | |
| "from pathlib import Path\n", | |
| "\n", | |
| "def ingest_inscriptions_to_csv(input_dir, output_file):\n", | |
| "    \"\"\"\n", | |
| "    Reads all JSON files from a directory and exports specific fields to a CSV.\n", | |
| "\n", | |
| "    Args:\n", | |
| "        input_dir (str): Path to the folder containing JSON files.\n", | |
| "        output_file (str): Path where the CSV should be saved.\n", | |
| "    \"\"\"\n", | |
| "    source_path = Path(input_dir)\n", | |
| "    output_path = Path(output_file)\n", | |
| "\n", | |
| "    # List to hold extracted data\n", | |
| "    extracted_data = []\n", | |
| "\n", | |
| "    # Get all .json files in the directory\n", | |
| "    json_files = list(source_path.glob(\"*.json\"))\n", | |
| "\n", | |
| "    if not json_files:\n", | |
| "        print(f\"⚠️ No JSON files found in directory: {input_dir}\")\n", | |
| "        return\n", | |
| "\n", | |
| "    print(f\"Processing {len(json_files)} files from {input_dir}...\")\n", | |
| "\n", | |
| "    for file_path in json_files:\n", | |
| "        try:\n", | |
| "            with open(file_path, \"r\", encoding=\"utf-8\") as f:\n", | |
| "                data = json.load(f)\n", | |
| "\n", | |
| "            # Extract the required fields.\n", | |
| "            # .get() returns an empty string if the field is missing.\n", | |
| "            row = {\n", | |
| "                \"id\": data.get(\"id\", \"\"),\n", | |
| "                \"text\": data.get(\"diplomatic_text\", \"\"), # Mapping diplomatic_text -> text\n", | |
| "                \"transcription\": data.get(\"transcription\", \"\")\n", | |
| "            }\n", | |
| "\n", | |
| "            extracted_data.append(row)\n", | |
| "\n", | |
| "        except json.JSONDecodeError:\n", | |
| "            print(f\"❌ Error decoding JSON: {file_path.name}\")\n", | |
| "        except Exception as e:\n", | |
| "            print(f\"❌ Error processing {file_path.name}: {e}\")\n", | |
| "\n", | |
| "    # Write to CSV\n", | |
| "    try:\n", | |
| "        with open(output_path, \"w\", newline=\"\", encoding=\"utf-8\") as csvfile:\n", | |
| "            fieldnames = [\"id\", \"text\", \"transcription\"]\n", | |
| "            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)\n", | |
| "\n", | |
| "            writer.writeheader()\n", | |
| "            writer.writerows(extracted_data)\n", | |
| "\n", | |
| "        print(f\"✅ Successfully saved {len(extracted_data)} records to '{output_file}'\")\n", | |
| "\n", | |
| "    except Exception as e:\n", | |
| "        print(f\"❌ Error writing CSV file: {e}\")\n", | |
| "\n", | |
| "# --- Execution ---\n", | |
| "# Adjust the path below if your folder name is slightly different\n", | |
| "input_folder = \"edh_downloads/first_century\"\n", | |
| "output_csv = \"assets/inscriptions.csv\"\n", | |
| "\n", | |
| "# Ensure the output directory exists (the CSV is written to assets/, not corpus/)\n", | |
| "Path(output_csv).parent.mkdir(parents=True, exist_ok=True)\n", | |
| "\n", | |
| "ingest_inscriptions_to_csv(input_folder, output_csv)" | |
| ], | |
| "metadata": { | |
| "colab": { | |
| "base_uri": "https://localhost:8080/" | |
| }, | |
| "id": "S_Pumy2cQfjU", | |
| "outputId": "7a9a0b6d-346f-4413-e4c1-a06a53e5b406" | |
| }, | |
| "execution_count": 18, | |
| "outputs": [ | |
| { | |
| "output_type": "stream", | |
| "name": "stdout", | |
| "text": [ | |
| "Processing 500 files from edh_downloads/first_century...\n", | |
| "✅ Successfully saved 500 records to 'assets/inscriptions.csv'\n" | |
| ] | |
| } | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "import spacy\n", | |
| "import pandas as pd\n", | |
| "import re\n", | |
| "\n", | |
| "# --- 1. The Cleaning Function ---\n", | |
| "def clean_leiden_text(text):\n", | |
| "    \"\"\"\n", | |
| "    Converts Epigraphic Transcription (Leiden) to Natural Latin.\n", | |
| "    Example: \"D(is) M(anibus) / [---]us\" -> \"Dis Manibus us\"\n", | |
| "    \"\"\"\n", | |
| "    if pd.isna(text) or text == \"\":\n", | |
| "        return \"\"\n", | |
| "\n", | |
| "    # Convert to string just in case\n", | |
| "    text = str(text)\n", | |
| "\n", | |
| "    # 1. Remove \"Lost text\" markers like [---] or [---?]\n", | |
| "    # Matches brackets containing hyphens, with optional spaces or question marks\n", | |
| "    text = re.sub(r\"\\[\\s*-+\\??\\s*\\]\", \"\", text)\n", | |
| "    # Matches half-open runs of hyphens at the start/end of lines (e.g. ------] or [------)\n", | |
| "    text = re.sub(r\"-+\\]\", \"\", text)\n", | |
| "    text = re.sub(r\"\\[-+\", \"\", text)\n", | |
| "\n", | |
| "    # 2. Replace line breaks \"/\" with space\n", | |
| "    text = text.replace(\"/\", \" \")\n", | |
| "\n", | |
| "    # 3. Remove parentheses () to keep the expansion\n", | |
| "    # \"D(is)\" -> \"Dis\"\n", | |
| "    text = text.replace(\"(\", \"\").replace(\")\", \"\")\n", | |
| "\n", | |
| "    # 4. Remove brackets [] to keep the restoration\n", | |
| "    # \"Cl]audi[us\" -> \"Claudius\"\n", | |
| "    text = text.replace(\"[\", \"\").replace(\"]\", \"\")\n", | |
| "\n", | |
| "    # 5. Remove all question marks (they mark uncertain readings)\n", | |
| "    # \"dom[um?]\" -> \"domum\"\n", | |
| "    text = text.replace(\"?\", \"\")\n", | |
| "\n", | |
| "    # 6. Collapse multiple spaces into one\n", | |
| "    text = re.sub(r\"\\s+\", \" \", text).strip()\n", | |
| "\n", | |
| "    #text = text.upper()\n", | |
| "\n", | |
| "    return text\n", | |
| "\n", | |
| "# --- 2. Load Data and Model ---\n", | |
| "print(\"⏳ Loading data and model...\")\n", | |
| "\n", | |
| "# Load the CSV generated in the previous step\n", | |
| "df = pd.read_csv(\"assets/inscriptions.csv\")\n", | |
| "\n", | |
| "# --- 3. Process the Inscriptions ---\n", | |
| "print(\"🧹 Cleaning text and extracting entities...\")\n", | |
| "\n", | |
| "results = []\n", | |
| "\n", | |
| "for index, row in df.iterrows():\n", | |
| "    raw_text = row['transcription']\n", | |
| "\n", | |
| "    # Apply cleaning\n", | |
| "    clean_text = clean_leiden_text(raw_text)\n", | |
| "\n", | |
| "    # Skip if text is empty after cleaning\n", | |
| "    if not clean_text or len(clean_text) < 2:\n", | |
| "        continue\n", | |
| "\n", | |
| "    # Store result\n", | |
| "    results.append({\n", | |
| "        \"id\": row['id'],\n", | |
| "        \"clean_text\": clean_text,\n", | |
| "        \"raw_transcription\": raw_text\n", | |
| "    })\n", | |
| "\n", | |
| "# --- 4. Save Results ---\n", | |
| "output_df = pd.DataFrame(results)\n", | |
| "output_filename = \"assets/cleaned_inscriptions.csv\"\n", | |
| "output_df.to_csv(output_filename, index=False)\n", | |
| "\n", | |
| "print(f\"✅ Done! Processed {len(results)} inscriptions.\")\n", | |
| "print(f\"📂 Results saved to: {output_filename}\")\n", | |
| "\n" | |
| ], | |
| "metadata": { | |
| "colab": { | |
| "base_uri": "https://localhost:8080/" | |
| }, | |
| "id": "pN79MrDWTXNv", | |
| "outputId": "6a46726c-7375-4537-9d7a-232fa323fba6" | |
| }, | |
| "execution_count": 66, | |
| "outputs": [ | |
| { | |
| "output_type": "stream", | |
| "name": "stdout", | |
| "text": [ | |
| "⏳ Loading data and model...\n", | |
| "🧹 Cleaning text and extracting entities...\n", | |
| "✅ Done! Processed 500 inscriptions.\n", | |
| "📂 Results saved to: assets/cleaned_inscriptions.csv\n" | |
| ] | |
| } | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "## Test the model or use it on new data" | |
| ], | |
| "metadata": { | |
| "id": "tzZ96_jJ3E3g" | |
| } | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "# another bit of code to load the model in and clean things up.\n", | |
| "import spacy\n", | |
| "import pandas as pd\n", | |
| "import json\n", | |
| "from collections import defaultdict\n", | |
| "\n", | |
| "# --- 1. Load your custom-trained model ---\n", | |
| "print(\"Loading model from 'training/model-best'...\")\n", | |
| "nlp_ner = spacy.load(\"training/model-best\")\n", | |
| "\n", | |
| "\n", | |
| "# --- 2. Define input and output paths ---\n", | |
| "#input_csv_path = \"assets/test-fake-epigraphs.csv\"\n", | |
| "input_csv_path = \"assets/cleaned_inscriptions.csv\"\n", | |
| "output_csv_path = \"assets/results_wide.csv\" # New filename for the wide format\n", | |
| "\n", | |
| "try:\n", | |
| "    df_new = pd.read_csv(input_csv_path)\n", | |
| "except FileNotFoundError:\n", | |
| "    print(f\"Error: Input file not found at {input_csv_path}\")\n", | |
| "    df_new = pd.DataFrame() # Create an empty DataFrame to prevent a crash\n", | |
| "\n", | |
| "# --- 3. Process data and collect results for a wide format ---\n", | |
| "print(f\"\\n--- Applying model to data from '{input_csv_path}' ---\")\n", | |
| "\n", | |
| "# This list will hold a dictionary for each row in our final CSV\n", | |
| "processed_rows = []\n", | |
| "\n", | |
| "for index, row in df_new.iterrows():\n", | |
| "\n", | |
| "    # We use 'clean_text' because that is what we want to feed the model.\n", | |
| "    if 'clean_text' not in row or not isinstance(row['clean_text'], str):\n", | |
| "        print(f\"Skipping row {index} due to missing 'clean_text'.\")\n", | |
| "        continue\n", | |
| "\n", | |
| "    text = row['clean_text']\n", | |
| "    doc = nlp_ner(text)\n", | |
| "\n", | |
| "    # Start building the dictionary for our output row\n", | |
| "    # It includes the original data from the input file\n", | |
| "    output_row = {\n", | |
| "        \"id\": row.get('id', index),\n", | |
| "        \"text\": row.get('text', ''),\n", | |
| "        \"transcription\": text\n", | |
| "    }\n", | |
| "\n", | |
| "    # Use a defaultdict to easily collect multiple entities of the same type\n", | |
| "    entities_by_label = defaultdict(list)\n", | |
| "    if doc.ents:\n", | |
| "        print(f\"✅ Found entities in: '{text[:70]}...'\")\n", | |
| "        for ent in doc.ents:\n", | |
| "            entities_by_label[ent.label_].append(ent.text)\n", | |
| "    else:\n", | |
| "        print(f\"ℹ️ No entities found in: '{text[:70]}...'\")\n", | |
| "\n", | |
| "    # Now, pivot the collected entities into the output_row dictionary.\n", | |
| "    # If multiple entities of the same type were found (e.g., two names),\n", | |
| "    # they will be joined together with a semicolon.\n", | |
| "    for label, texts in entities_by_label.items():\n", | |
| "        output_row[label] = \"; \".join(texts)\n", | |
| "\n", | |
| "    # Add the completed row dictionary to our list\n", | |
| "    processed_rows.append(output_row)\n", | |
| "\n", | |
| "\n", | |
| "# --- 4. Write the collected data to a wide format CSV file ---\n", | |
| "if processed_rows:\n", | |
| "    try:\n", | |
| "        # Convert the list of dictionaries directly into a pandas DataFrame\n", | |
| "        # Pandas will automatically create columns for all unique entity labels found\n", | |
| "        results_df = pd.DataFrame(processed_rows)\n", | |
| "\n", | |
| "        # Reorder columns to have the source data first, for clarity\n", | |
| "        # Get all unique entity labels found during processing\n", | |
| "        entity_columns = sorted([col for col in results_df.columns if col not in ['id', 'text', 'transcription']])\n", | |
| "        column_order = ['id', 'text', 'transcription'] + entity_columns\n", | |
| "\n", | |
| "        # Ensure all columns exist before trying to reorder\n", | |
| "        results_df = results_df.reindex(columns=column_order)\n", | |
| "\n", | |
| "        # Save the DataFrame to a CSV file\n", | |
| "        results_df.to_csv(output_csv_path, index=False)\n", | |
| "        print(f\"\\n✅ Successfully saved wide-format results to {output_csv_path}\")\n", | |
| "    except Exception as e:\n", | |
| "        print(f\"Error saving CSV file: {e}\")\n", | |
| "else:\n", | |
| "    print(\"ℹ️ No data was processed, so no CSV file was created.\")" | |
| ], | |
| "metadata": { | |
| "id": "Ba5s3RRWjrA9" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "results_df" | |
| ], | |
| "metadata": { | |
| "colab": { | |
| "base_uri": "https://localhost:8080/", | |
| "height": 1000 | |
| }, | |
| "id": "CTiapyuG8EVr", | |
| "outputId": "01d1faf3-92ec-4274-fd4f-82bce4379509" | |
| }, | |
| "execution_count": 121, | |
| "outputs": [ | |
| { | |
| "output_type": "execute_result", | |
| "data": { | |
| "text/plain": [ | |
| " id clean_text \\\n", | |
| "0 HD000010 D M L ASINI POLI SECVNDVS ET ORPHAEVS LIB P B M \n", | |
| "1 HD000476 T AV S T L CRY ENES \n", | |
| "2 HD000251 M PILI PRIMIG GRANIANI \n", | |
| "3 HD000324 AVDI ITAE N AM CAELO OVANS DOM IVST VERV M C \n", | |
| "4 HD000280 Ε ΗΘΙΚ A POSTYM A F SENECA V AN IV \n", | |
| ".. ... ... \n", | |
| "494 HD000232 D M S P VELLEIVS DONATVS ET SIBI POMPEIAE AM P... \n", | |
| "495 HD000132 D M L VOCONIO VERO MIL COH VI PR | CASSI MIL A... \n", | |
| "496 HD000252 D M S L IVN O NESIPH ORVS AN LX P I S \n", | |
| "497 HD000119 K D L PETRE IVS VIC TOR ALI ARIVS D K M V S L M \n", | |
| "498 HD000300 RIAE CARISS ET RESCENTIS EORVM ALEMERA TERNOR \n", | |
| "\n", | |
| " DEDICATION_TO_THE_GODS NOMEN COGNOMEN PRAENOMEN AGE_YEARS AGE_MONTHS \\\n", | |
| "0 NaN NaN NaN NaN NaN NaN \n", | |
| "1 NaN NaN NaN NaN NaN NaN \n", | |
| "2 NaN NaN NaN NaN NaN NaN \n", | |
| "3 NaN NaN NaN NaN NaN NaN \n", | |
| "4 NaN NaN NaN NaN IV NaN \n", | |
| ".. ... ... ... ... ... ... \n", | |
| "494 NaN NaN NaN NaN NaN NaN \n", | |
| "495 NaN NaN NaN NaN X NaN \n", | |
| "496 NaN NaN NaN NaN LX NaN \n", | |
| "497 NaN NaN NaN NaN NaN NaN \n", | |
| "498 NaN NaN NaN NaN NaN NaN \n", | |
| "\n", | |
| " AGE_DAYS DEDICATOR_NAME MILITARY_UNIT OCCUPATION TRIBE VERB \\\n", | |
| "0 NaN NaN SECVNDVS ET NaN NaN NaN \n", | |
| "1 NaN NaN NaN NaN NaN NaN \n", | |
| "2 NaN NaN NaN NaN NaN NaN \n", | |
| "3 NaN NaN NaN NaN NaN NaN \n", | |
| "4 NaN NaN NaN NaN NaN NaN \n", | |
| ".. ... ... ... ... ... ... \n", | |
| "494 NaN NaN NaN NaN NaN NaN \n", | |
| "495 NaN NaN NaN NaN NaN NaN \n", | |
| "496 NaN NaN NaN NaN NaN NaN \n", | |
| "497 NaN TOR ALI ARIVS D NaN NaN NaN NaN \n", | |
| "498 NaN NaN NaN NaN NaN NaN \n", | |
| "\n", | |
| " BENE_MERENTI FILIATION FUNERARY_FORMULA \n", | |
| "0 NaN NaN NaN \n", | |
| "1 NaN NaN NaN \n", | |
| "2 NaN NaN NaN \n", | |
| "3 NaN NaN NaN \n", | |
| "4 NaN NaN NaN \n", | |
| ".. ... ... ... \n", | |
| "494 NaN NaN NaN \n", | |
| "495 NaN NaN NaN \n", | |
| "496 NaN O NaN \n", | |
| "497 NaN NaN NaN \n", | |
| "498 NaN NaN NaN \n", | |
| "\n", | |
| "[499 rows x 17 columns]" | |
| ], | |
| "text/html": [ | |
| "\n", | |
| " <div id=\"df-22b8f64d-1edb-4aa4-895d-01eaea477fd2\" class=\"colab-df-container\">\n", | |
| " <div>\n", | |
| "<style scoped>\n", | |
| " .dataframe tbody tr th:only-of-type {\n", | |
| " vertical-align: middle;\n", | |
| " }\n", | |
| "\n", | |
| " .dataframe tbody tr th {\n", | |
| " vertical-align: top;\n", | |
| " }\n", | |
| "\n", | |
| " .dataframe thead th {\n", | |
| " text-align: right;\n", | |
| " }\n", | |
| "</style>\n", | |
| "<table border=\"1\" class=\"dataframe\">\n", | |
| " <thead>\n", | |
| " <tr style=\"text-align: right;\">\n", | |
| " <th></th>\n", | |
| " <th>id</th>\n", | |
| " <th>clean_text</th>\n", | |
| " <th>DEDICATION_TO_THE_GODS</th>\n", | |
| " <th>NOMEN</th>\n", | |
| " <th>COGNOMEN</th>\n", | |
| " <th>PRAENOMEN</th>\n", | |
| " <th>AGE_YEARS</th>\n", | |
| " <th>AGE_MONTHS</th>\n", | |
| " <th>AGE_DAYS</th>\n", | |
| " <th>DEDICATOR_NAME</th>\n", | |
| " <th>MILITARY_UNIT</th>\n", | |
| " <th>OCCUPATION</th>\n", | |
| " <th>TRIBE</th>\n", | |
| " <th>VERB</th>\n", | |
| " <th>BENE_MERENTI</th>\n", | |
| " <th>FILIATION</th>\n", | |
| " <th>FUNERARY_FORMULA</th>\n", | |
| " </tr>\n", | |
| " </thead>\n", | |
| " <tbody>\n", | |
| " <tr>\n", | |
| " <th>0</th>\n", | |
| " <td>HD000010</td>\n", | |
| " <td>D M L ASINI POLI SECVNDVS ET ORPHAEVS LIB P B M</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>SECVNDVS ET</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>1</th>\n", | |
| " <td>HD000476</td>\n", | |
| " <td>T AV S T L CRY ENES</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>2</th>\n", | |
| " <td>HD000251</td>\n", | |
| " <td>M PILI PRIMIG GRANIANI</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>3</th>\n", | |
| " <td>HD000324</td>\n", | |
| " <td>AVDI ITAE N AM CAELO OVANS DOM IVST VERV M C</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>4</th>\n", | |
| " <td>HD000280</td>\n", | |
| " <td>Ε ΗΘΙΚ A POSTYM A F SENECA V AN IV</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>IV</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>...</th>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>494</th>\n", | |
| " <td>HD000232</td>\n", | |
| " <td>D M S P VELLEIVS DONATVS ET SIBI POMPEIAE AM P...</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>495</th>\n", | |
| " <td>HD000132</td>\n", | |
| " <td>D M L VOCONIO VERO MIL COH VI PR | CASSI MIL A...</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>X</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>496</th>\n", | |
| " <td>HD000252</td>\n", | |
| " <td>D M S L IVN O NESIPH ORVS AN LX P I S</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>LX</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>O</td>\n", | |
| " <td>NaN</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>497</th>\n", | |
| " <td>HD000119</td>\n", | |
| " <td>K D L PETRE IVS VIC TOR ALI ARIVS D K M V S L M</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>TOR ALI ARIVS D</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>498</th>\n", | |
| " <td>HD000300</td>\n", | |
| " <td>RIAE CARISS ET RESCENTIS EORVM ALEMERA TERNOR</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " </tr>\n", | |
| " </tbody>\n", | |
| "</table>\n", | |
| "<p>499 rows × 17 columns</p>\n", | |
| "</div>\n", | |
| " </div>\n" | |
| ], | |
| "application/vnd.google.colaboratory.intrinsic+json": { | |
| "type": "dataframe", | |
| "variable_name": "results_df", | |
| "summary": "{\n \"name\": \"results_df\",\n \"rows\": 499,\n \"fields\": [\n {\n \"column\": \"id\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 499,\n \"samples\": [\n \"HD000090\",\n \"HD000069\",\n \"HD000182\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"clean_text\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 499,\n \"samples\": [\n \"MINERVAE DE SVLI DONAVI FVREM QVI CARACALLAM MEAM INVO LAVIT SI SERVVS SI LIBER SI BA RO SI MVLIER HOC DONVM NON REDEMAT NESSI SANGVNE SVO\",\n \"LVCR S CATVRON S F LARI B EIRADI GO EX VOT POS AR SAC\",\n \"D M L LVCILI MAR TIALIS VIX AN XXII D XXI CA QVINTA MATER MISER RIMA FECIT\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"DEDICATION_TO_THE_GODS\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 11,\n \"samples\": [\n \"DIS DONVM\",\n \"O\",\n \"DIS\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"NOMEN\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 16,\n \"samples\": [\n \"SAPEONI CERA\",\n \"ETATE LVNA\",\n \"ZOSIMI\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"COGNOMEN\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 35,\n \"samples\": [\n \"APOLLINI AVG\",\n \"QVE EIVS\",\n \"SEVERA\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"PRAENOMEN\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 4,\n \"samples\": [\n \"SEX\",\n \"DEAE\",\n \"OMINO\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"AGE_YEARS\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 47,\n \"samples\": [\n \"XXXX\",\n \"LXV\",\n \"LV\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"AGE_MONTHS\",\n 
\"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"V\",\n \"III\",\n \"VII\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"AGE_DAYS\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"IIII\",\n \"VIIII\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"DEDICATOR_NAME\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 30,\n \"samples\": [\n \"RT BVNO F\",\n \"COPON LIBERTIS RISQVE\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"MILITARY_UNIT\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 56,\n \"samples\": [\n \"SECVNDVS ET\",\n \"COPORICI MATERNI\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"OCCUPATION\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 7,\n \"samples\": [\n \"MINDI\",\n \"SACTA SERIE\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"TRIBE\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 12,\n \"samples\": [\n \"MER VRI\",\n \"O PROCONS\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"VERB\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 34,\n \"samples\": [\n \"PONPONIO; PONPO\",\n \"CN\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"BENE_MERENTI\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 17,\n \"samples\": [\n \"MERCES\",\n \"PROBA\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"FILIATION\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 4,\n \"samples\": [\n \"AT\",\n \"O\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": 
\"FUNERARY_FORMULA\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 8,\n \"samples\": [\n \"LXVI HIC\",\n \"LXX HIC\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" | |
| } | |
| }, | |
| "metadata": {}, | |
| "execution_count": 121 | |
| } | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "import pandas as pd\n", | |
| "import matplotlib.pyplot as plt\n", | |
| "import seaborn as sns\n", | |
| "import spacy\n", | |
| "from spacy import displacy\n", | |
| "import re\n", | |
| "from collections import Counter\n", | |
| "import io\n", | |
| "\n", | |
| "# --- Helper function to convert Roman numerals AND Latin words to integers ---\n", | |
| "def convert_latin_age_to_int(age_str: str) -> int:\n", | |
| "    \"\"\"\n", | |
| "    Converts an age from a Latin inscription (Roman numeral or word) to an integer.\n", | |
| "    Handles errors gracefully by returning 0 for invalid inputs.\n", | |
| "    \"\"\"\n", | |
| "    if not isinstance(age_str, str):\n", | |
| "        return 0\n", | |
| "\n", | |
| "    # Standardize the input string and handle common endings\n", | |
| "    processed_age_str = age_str.strip().upper().rstrip('.;,')\n", | |
| "\n", | |
| "    # 1. Check for Latin number words first\n", | |
| "    LATIN_WORDS_TO_INT = {\n", | |
| "        'UNUM': 1, 'DUO': 2, 'DUOBUS': 2, 'TRES': 3, 'TRIBUS': 3, 'QUATTUOR': 4,\n", | |
| "        'QUINQUE': 5, 'SEX': 6, 'SEPTEM': 7, 'OCTO': 8, 'NOVEM': 9, 'DECEM': 10,\n", | |
| "        'UNDECIM': 11, 'DUODECIM': 12, 'TREDECIM': 13, 'QUATTUORDECIM': 14,\n", | |
| "        'QUINDECIM': 15, 'SEDECIM': 16, 'SEPTENDECIM': 17, 'DUODEVIGINTI': 18,\n", | |
| "        'UNDEVIGINTI': 19, 'VIGINTI': 20, 'TRIGINTA': 30, 'QUADRAGINTA': 40,\n", | |
| "        'QUINQUAGINTA': 50, 'SEXAGINTA': 60, 'SEPTUAGINTA': 70, 'OCTOGINTA': 80,\n", | |
| "        'NONAGINTA': 90, 'CENTUM': 100\n", | |
| "    }\n", | |
| "\n", | |
| "    # Inscriptions write U as V (e.g. QVINQVE), so compare words in a V-normalized form\n", | |
| "    normalized_age_str = processed_age_str.replace('U', 'V')\n", | |
| "    for word, value in LATIN_WORDS_TO_INT.items():\n", | |
| "        if word.replace('U', 'V') == normalized_age_str:\n", | |
| "            return value\n", | |
| "\n", | |
| "    # 2. If it's not a word, try parsing as a Roman numeral\n", | |
| "    roman_map = {'I': 1, 'V': 5, 'X': 10, 'L': 50, 'C': 100, 'D': 500, 'M': 1000}\n", | |
| "    val = 0\n", | |
| "    try:\n", | |
| "        for i in range(len(processed_age_str)):\n", | |
| "            # A larger numeral after a smaller one is subtractive (e.g. IV, XL)\n", | |
| "            if i > 0 and roman_map[processed_age_str[i]] > roman_map[processed_age_str[i-1]]:\n", | |
| "                val += roman_map[processed_age_str[i]] - 2 * roman_map[processed_age_str[i-1]]\n", | |
| "            else:\n", | |
| "                val += roman_map[processed_age_str[i]]\n", | |
| "        return val\n", | |
| "    except KeyError:\n", | |
| "        return 0\n", | |
| "\n", | |
| "def visualize_ner_results(\n", | |
| "    results_df: pd.DataFrame,\n", | |
| "    num_examples_to_render: int = 5,\n", | |
| "    top_n_occupations: int = 10\n", | |
| "):\n", | |
| "    \"\"\"\n", | |
| "    Reads a wide-format DataFrame of NER results and generates visualizations.\n", | |
| "    This version correctly handles ages as Latin words and semicolon-separated values.\n", | |
| "    \"\"\"\n", | |
| "    if not isinstance(results_df, pd.DataFrame):\n", | |
| "        print(\"❌ Error: Input must be a pandas DataFrame.\")\n", | |
| "        return\n", | |
| "\n", | |
| "    # --- 1. Visualization: Entity Type Frequency ---\n", | |
| "    print(\"--- Visualization 1: Entity Type Frequency ---\")\n", | |
| "    entity_columns = [col for col in results_df.columns if col not in ['id', 'text', 'transcription', 'clean_text']]\n", | |
| "    entity_counts = results_df[entity_columns].count().sort_values(ascending=False)\n", | |
| "\n", | |
| "    plt.style.use('seaborn-v0_8-whitegrid')\n", | |
| "    plt.figure(figsize=(12, 8))\n", | |
| "    # Assign hue explicitly: passing palette without hue is deprecated in seaborn\n", | |
| "    sns.barplot(x=entity_counts.index, y=entity_counts.values, hue=entity_counts.index, palette=\"viridis\", legend=False)\n", | |
| "    plt.title('Frequency of Each Entity Type', fontsize=16, weight='bold')\n", | |
| "    plt.ylabel('Total Count', fontsize=12)\n", | |
| "    plt.xlabel('Entity Type', fontsize=12)\n", | |
| "    plt.xticks(rotation=45, ha='right')\n", | |
| "    plt.tight_layout()\n", | |
| "    plt.show()\n", | |
| "    print(\"\\n\")\n", | |
| "\n", | |
| "    # --- 2. Visualization: Top N Nomen ---\n", | |
| "    print(f\"--- Visualization 2: Top {top_n_occupations} Most Common Nomen ---\")\n", | |
| "    if 'NOMEN' in results_df.columns and not results_df['NOMEN'].isnull().all():\n", | |
| "        # Explode the semicolon-separated strings into separate rows\n", | |
| "        occupations = results_df['NOMEN'].dropna().str.split(';').explode().str.strip()\n", | |
| "        top_occupations = occupations.value_counts().nlargest(top_n_occupations)\n", | |
| "\n", | |
| "        plt.figure(figsize=(10, 8))\n", | |
| "        sns.barplot(x=top_occupations.values, y=top_occupations.index, hue=top_occupations.index, orient='h', palette='magma', legend=False)\n", | |
| "        plt.title(f'Top {top_n_occupations} NOMEN', fontsize=16, weight='bold')\n", | |
| "        plt.xlabel('Count', fontsize=12)\n", | |
| "        plt.ylabel('NOMEN', fontsize=12)\n", | |
| "        plt.tight_layout()\n", | |
| "        plt.show()\n", | |
| "    else:\n", | |
| "        print(\"⚠️ 'NOMEN' column not found or is empty. Skipping this visualization.\")\n", | |
| "    print(\"\\n\")\n", | |
| "\n", | |
| "    # --- 3. Visualization: Age Distribution (with updated conversion) ---\n", | |
| "    print(\"--- Visualization 3: Age Distribution of the Deceased ---\")\n", | |
| "    if 'AGE_YEARS' in results_df.columns and not results_df['AGE_YEARS'].isnull().all():\n", | |
| "        # Explode semicolon-separated values, then apply the conversion function\n", | |
| "        ages_str = results_df['AGE_YEARS'].dropna().str.split(';').explode().str.strip()\n", | |
| "        ages_int = ages_str.apply(convert_latin_age_to_int)\n", | |
| "        # Drop failed conversions (0) and implausible ages\n", | |
| "        valid_ages = ages_int[(ages_int > 0) & (ages_int <= 120)]\n", | |
| "\n", | |
| "        if not valid_ages.empty:\n", | |
| "            plt.figure(figsize=(12, 7))\n", | |
| "            sns.histplot(valid_ages, bins=30, kde=True, color='teal')\n", | |
| "            plt.title('Distribution of Deceased Ages (in Years)', fontsize=16, weight='bold')\n", | |
| "            plt.xlabel('Age (Years)', fontsize=12)\n", | |
| "            plt.ylabel('Number of Inscriptions', fontsize=12)\n", | |
| "            plt.tight_layout()\n", | |
| "            plt.show()\n", | |
| "        else:\n", | |
| "            print(\"⚠️ No valid ages could be converted from the 'AGE_YEARS' column.\")\n", | |
| "    else:\n", | |
| "        print(\"⚠️ 'AGE_YEARS' column not found or is empty. Skipping age distribution plot.\")\n", | |
| "    print(\"\\n\")\n", | |
| "\n", | |
| "    # --- 4. Visualization: Top N Tribe ---\n", | |
| "    print(f\"--- Visualization 4: Top {top_n_occupations} Most Common Tribe ---\")\n", | |
| "    if 'TRIBE' in results_df.columns and not results_df['TRIBE'].isnull().all():\n", | |
| "        # Explode the semicolon-separated strings into separate rows\n", | |
| "        occupations = results_df['TRIBE'].dropna().str.split(';').explode().str.strip()\n", | |
| "        top_occupations = occupations.value_counts().nlargest(top_n_occupations)\n", | |
| "\n", | |
| "        plt.figure(figsize=(10, 8))\n", | |
| "        sns.barplot(x=top_occupations.values, y=top_occupations.index, hue=top_occupations.index, orient='h', palette='magma', legend=False)\n", | |
| "        plt.title(f'Top {top_n_occupations} TRIBE', fontsize=16, weight='bold')\n", | |
| "        plt.xlabel('Count', fontsize=12)\n", | |
| "        plt.ylabel('TRIBE', fontsize=12)\n", | |
| "        plt.tight_layout()\n", | |
| "        plt.show()\n", | |
| "    else:\n", | |
| "        print(\"⚠️ 'TRIBE' column not found or is empty. Skipping this visualization.\")\n", | |
| "    print(\"\\n\")\n", | |
| "\n", | |
| "    # --- 5. Visualization: Top N Military Unit ---\n", | |
| "    print(f\"--- Visualization 5: Top {top_n_occupations} Most Common Military Unit ---\")\n", | |
| "    if 'MILITARY_UNIT' in results_df.columns and not results_df['MILITARY_UNIT'].isnull().all():\n", | |
| "        # Explode the semicolon-separated strings into separate rows\n", | |
| "        occupations = results_df['MILITARY_UNIT'].dropna().str.split(';').explode().str.strip()\n", | |
| "        top_occupations = occupations.value_counts().nlargest(top_n_occupations)\n", | |
| "\n", | |
| "        plt.figure(figsize=(10, 8))\n", | |
| "        sns.barplot(x=top_occupations.values, y=top_occupations.index, hue=top_occupations.index, orient='h', palette='magma', legend=False)\n", | |
| "        plt.title(f'Top {top_n_occupations} MILITARY_UNIT', fontsize=16, weight='bold')\n", | |
| "        plt.xlabel('Count', fontsize=12)\n", | |
| "        plt.ylabel('MILITARY_UNIT', fontsize=12)\n", | |
| "        plt.tight_layout()\n", | |
| "        plt.show()\n", | |
| "    else:\n", | |
| "        print(\"⚠️ 'MILITARY_UNIT' column not found or is empty. Skipping this visualization.\")\n", | |
| "    print(\"\\n\")\n", | |
| "\n", | |
| "visualize_ner_results(results_df)" | |
| ], | |
| "metadata": { | |
| "colab": { | |
| "base_uri": "https://localhost:8080/", | |
| "height": 1000 | |
| }, | |
| "id": "lTAMNIZQ7PsP", | |
| "outputId": "a91683fe-5533-49cc-d533-2f872f9bbe63" | |
| }, | |
| "execution_count": 130, | |
| "outputs": [ | |
| { | |
| "output_type": "stream", | |
| "name": "stdout", | |
| "text": [ | |
| "--- Visualization 1: Entity Type Frequency ---\n" | |
| ] | |
| }, | |
| { | |
| "output_type": "stream", | |
| "name": "stderr", | |
| "text": [ | |
| "/tmp/ipython-input-2836028133.py:69: FutureWarning: \n", | |
| "\n", | |
| "Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.\n", | |
| "\n", | |
| " sns.barplot(x=entity_counts.index, y=entity_counts.values, palette=\"viridis\")\n" | |
| ] | |
| }, | |
| { | |
| "output_type": "display_data", | |
| "data": { | |
| "text/plain": [ | |
| "<Figure size 1200x800 with 1 Axes>" | |
| ], | |
| "image/png": "\n" | |
| }, | |
| "metadata": {} | |
| }, | |
| { | |
| "output_type": "stream", | |
| "name": "stdout", | |
| "text": [ | |
| "\n", | |
| "\n", | |
| "--- Visualization 2: Top 10 Most Common Nomen ---\n" | |
| ] | |
| }, | |
| { | |
| "output_type": "stream", | |
| "name": "stderr", | |
| "text": [ | |
| "/tmp/ipython-input-2836028133.py:86: FutureWarning: \n", | |
| "\n", | |
| "Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.\n", | |
| "\n", | |
| " sns.barplot(x=top_occupations.values, y=top_occupations.index, orient='h', palette='magma')\n" | |
| ] | |
| }, | |
| { | |
| "output_type": "display_data", | |
| "data": { | |
| "text/plain": [ | |
| "<Figure size 1000x800 with 1 Axes>" | |
| ], | |
| "image/png": "\n" | |
| }, | |
| "metadata": {} | |
| }, | |
| { | |
| "output_type": "stream", | |
| "name": "stdout", | |
| "text": [ | |
| "\n", | |
| "\n", | |
| "--- Visualization 3: Age Distribution of the Deceased ---\n" | |
| ] | |
| }, | |
| { | |
| "output_type": "display_data", | |
| "data": { | |
| "text/plain": [ | |
| "<Figure size 1200x700 with 1 Axes>" | |
| ], | |
| "image/png": "\n" | |
| }, | |
| "metadata": {} | |
| }, | |
| { | |
| "output_type": "stream", | |
| "name": "stdout", | |
| "text": [ | |
| "\n", | |
| "\n", | |
| "--- Visualization 2: Top 10 Most Common Tribe ---\n" | |
| ] | |
| }, | |
| { | |
| "output_type": "stream", | |
| "name": "stderr", | |
| "text": [ | |
| "/tmp/ipython-input-2836028133.py:126: FutureWarning: \n", | |
| "\n", | |
| "Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.\n", | |
| "\n", | |
| " sns.barplot(x=top_occupations.values, y=top_occupations.index, orient='h', palette='magma')\n" | |
| ] | |
| }, | |
| { | |
| "output_type": "display_data", | |
| "data": { | |
| "text/plain": [ | |
| "<Figure size 1000x800 with 1 Axes>" | |
| ], | |
| "image/png": "\n" | |
| }, | |
| "metadata": {} | |
| }, | |
| { | |
| "output_type": "stream", | |
| "name": "stdout", | |
| "text": [ | |
| "\n", | |
| "\n", | |
| "--- Visualization 2: Top 10 Most Common Military Unit ---\n" | |
| ] | |
| }, | |
| { | |
| "output_type": "stream", | |
| "name": "stderr", | |
| "text": [ | |
| "/tmp/ipython-input-2836028133.py:144: FutureWarning: \n", | |
| "\n", | |
| "Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.\n", | |
| "\n", | |
| " sns.barplot(x=top_occupations.values, y=top_occupations.index, orient='h', palette='magma')\n" | |
| ] | |
| }, | |
| { | |
| "output_type": "display_data", | |
| "data": { | |
| "text/plain": [ | |
| "<Figure size 1000x800 with 1 Axes>" | |
| ], | |
| "image/png": "\n" | |
| }, | |
| "metadata": {} | |
| }, | |
| { | |
| "output_type": "stream", | |
| "name": "stdout", | |
| "text": [ | |
| "\n", | |
| "\n" | |
| ] | |
| } | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
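| "...alright. SIGH. There are problems: the model tags fragments like `DEAE` and `OMINO` as PRAENOMEN, `O` and `AT` as FILIATION, and `SECVNDVS ET` as MILITARY_UNIT. But it's amazing how far you can get with synthetic data." | |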
| ], | |
| "metadata": { | |
| "id": "9JgQYp4ksd0J" | |
| } | |
| } | |
| ] | |
| } |