Derived from the original PRD but intentionally simplified for fastest viable implementation. Focus: “folder in, article out” with minimal config, vision-first extraction, and straightforward alignment (no embeddings initially).
Goal: Given a folder containing a deck.pptx (and optionally a transcript + config.yaml), produce article.md plus slide images using a simple CLI: talk2article <folder>.
Included in MVP:
- Slide rendering to PNG
- Vision model extraction of slide textual content
- Basic transcript acquisition & segmentation
- Naive sequential alignment of transcript chunks to slides
- First-person narrative (intro + per-slide sections)
- Single Markdown output with front matter
Deferred (Future Enhancements): embeddings-based alignment, confidence scoring, detailed quality report, regeneration of specific slides, hallucination heuristics, SEO/TOC logic, advanced logging, multi-speaker handling.
Six linear functions executed in order (no complex orchestration layer):
- load_config_and_inputs(folder)
- render_and_extract_slides(context)
- fetch_and_normalize_transcript(context)
- segment_and_align(context)
- generate_narrative(context)
- assemble_markdown(context)
Shared context dict accumulates results. Only essential artifacts written.
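A minimal orchestration sketch, assuming a hypothetical `stages` module that exposes the six functions above and that `load_config_and_inputs` returns the shared context dict (illustrative only, not part of the spec):

```python
# Minimal sketch of the linear pipeline; `stages` is a hypothetical module
# exposing the six functions listed above.
import sys
from pathlib import Path

from talk2article import stages  # hypothetical package layout


def run(folder: str) -> Path:
    # load_config_and_inputs returns the shared context dict; every later
    # stage mutates that same dict in place.
    context = stages.load_config_and_inputs(folder)
    for stage in (
        stages.render_and_extract_slides,
        stages.fetch_and_normalize_transcript,
        stages.segment_and_align,
        stages.generate_narrative,
        stages.assemble_markdown,
    ):
        print(f"[{stage.__name__}] running")  # matches the "[stage] message" logging style
        stage(context)
    return Path(folder) / "article.md"


if __name__ == "__main__":
    run(sys.argv[1])
```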
Required: deck.pptx in target folder.
Optional files (if absent, defaults apply):
- config.yaml
- transcript.(srt|vtt|txt)
Output files written alongside inputs:
- article.md
- slides/slide-N.png
- slides.json (basic slide metadata and fused text)
- meta.json (simple run metadata & warnings)
If config.yaml missing, defaults apply. Supported keys:
title: "Optional override" # Else derived from first slide text
author: "Unknown" # Optional
date: "2025-10-08" # Default = today
tags: [talk, article] # Optional list
verbosity: standard # concise|standard|expanded (guides narrative tone)
paragraphs_per_slide: 2 # Fixed integer
front_matter: true # Include YAML front matter
vision_model: gpt-4o-mini # Single model for vision + text generation

No CLI flags besides the folder path in MVP.
- Validate deck.pptx exists
- Load config.yaml if present (no schema library; simple key allowlist)
- Detect transcript file if present (extension sniff)
- Initialize context = {config, slides: [], transcript_raw: None, warnings: []}
- Use python-pptx to iterate slides and export each slide as PNG (via Pillow composition or a conversion method)
- Invoke the vision model per slide (one request per slide for MVP; batching later) with the prompt: "Extract plain text, bullet lists, code blocks (with language if obvious), and tables in JSON."
- Build fused_text = notes + shapes + vision (deduplicated line-wise)
- Append slide record: {index, image_path, fused_text}
- Write slides/slide-N.png progressively
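A rendering-and-extraction sketch, assuming LibreOffice, pdf2image, and the openai SDK are available as the "conversion method" (python-pptx would still be used separately to pull shapes and speaker notes); names and the exact conversion path are illustrative, not prescribed:

```python
# Sketch only: render slides via LibreOffice + pdf2image, then ask the vision
# model for structured text per slide PNG.
import base64
import json
import subprocess
from pathlib import Path

from openai import OpenAI                 # assumes openai>=1.x
from pdf2image import convert_from_path   # assumes poppler is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def render_slides(deck: Path, out_dir: Path) -> list[Path]:
    out_dir.mkdir(exist_ok=True)
    subprocess.run(["libreoffice", "--headless", "--convert-to", "pdf",
                    "--outdir", str(out_dir), str(deck)], check=True)
    pdf = out_dir / (deck.stem + ".pdf")
    paths = []
    for i, page in enumerate(convert_from_path(pdf), start=1):
        path = out_dir / f"slide-{i}.png"
        page.save(path, "PNG")
        paths.append(path)
    return paths


def extract_slide_text(image_path: Path, model: str = "gpt-4o-mini") -> dict:
    b64 = base64.b64encode(image_path.read_bytes()).decode()
    resp = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract plain text, bullet lists, code blocks "
                                         "(with language if obvious), and tables in JSON."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return json.loads(resp.choices[0].message.content)
```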
- If transcript file provided: parse depending on extension (very light regex for timestamps)
- Else: attempt fetch via yt-dlp --write-auto-subs --skip-download (future) OR raise a warning and set the transcript to empty
- Normalize: remove leading timestamps, collapse multiple spaces, keep basic punctuation
Transcript Fetch Strategy (Public Videos – No YouTube API Key Required):
- Preferred: use a lightweight captions retrieval library (e.g., a Python implementation similar to youtube-transcript-api) to request English captions with preferred language codes ["en", "en-US"]. This accesses YouTube's public timed text endpoint directly; no API key is needed for public or unlisted videos with captions enabled.
- If a manual transcript file exists (transcript.srt, .vtt, or .txt), it overrides any remote fetch.
- If the library fetch fails (no captions, captions disabled, or endpoint error): optionally attempt a fallback shell call to yt-dlp (if installed) with auto-subs flags. This step is deferred in the MVP unless explicitly enabled later.
- If still unavailable, proceed with an empty transcript and append the warning: "no transcript available; relying on slides only".
Normalization Rules:
- Strip non-speech cues, i.e., lines enclosed in brackets such as [Music], [Applause].
- Merge consecutive very short (< 40 chars) fragments into the preceding fragment to reduce fragmentation.
- Preserve original capitalization; do not lowercase.
- Remove timestamps if present (HH:MM:SS.xxx -->) and extraneous numbering lines (common in SRT).
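A minimal sketch of these normalization rules; the regexes and function name are illustrative:

```python
# Sketch of the normalization rules above (bracketed cues, SRT numbering,
# timestamps, short-fragment merging with the 40-char threshold from the spec).
import re

CUE = re.compile(r"^\s*\[[^\]]+\]\s*$")                 # [Music], [Applause], ...
TIMESTAMP = re.compile(r"\d{2}:\d{2}:\d{2}[.,]\d{3}\s*-->\s*\S+")
NUMBERING = re.compile(r"^\d+\s*$")                     # bare SRT cue indices


def normalize_transcript(raw: str) -> list[str]:
    fragments: list[str] = []
    for line in raw.splitlines():
        line = re.sub(r"\s{2,}", " ", TIMESTAMP.sub("", line)).strip()
        if not line or CUE.match(line) or NUMBERING.match(line):
            continue
        if fragments and len(line) < 40:                # merge very short fragments
            fragments[-1] = f"{fragments[-1]} {line}"
        else:
            fragments.append(line)
    return fragments
```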
Warnings Generated:
- Missing transcript entirely.
- Transcript present but yielded zero usable speech lines after filtering.
- Fetch failure (network or captions disabled).
Future Enhancements (Not in MVP):
- Language auto-detect & optional translation pipeline.
- Multi-track selection with user-specified priority list.
- Partial transcript gap detection (long silence spans) for improved alignment heuristics.
- Split transcript into sentences (naive period/question/exclamation split)
- Group sentences into chunks ~250–400 chars
- Naive alignment: distribute chunks sequentially across slides proportionally: chunks_per_slide = ceil(total_chunks / total_slides)
- Slice the list accordingly; if there is a remainder, append it to the last slide
- Attach an aligned_chunks list to each slide record
- If a slide gets 0 chunks, add a warning ("slide X has no transcript chunks")
- For each slide:
- Prompt includes: fused_text + concatenated aligned chunks + style directive (verbosity, first-person)
- Ask model: “Produce exactly {paragraphs_per_slide} paragraphs in first person; no new facts.”
- Capture heading: either first line of fused_text truncated (max 60 chars) or model-suggested heading (in first line) if fused_text short
- Intro: separate prompt using first slide fused_text + first 2 transcript chunks overall
- No conclusion in MVP (can add if last slide heading contains ‘Conclusion’ later)
- YAML front matter if enabled
- H1 = Title
- Intro paragraphs
- For each slide:
  - ## Slide {i} + image reference + paragraphs
- Append an HTML comment with warnings if any
- Write article.md
- Write slides.json (array of minimal slide objects)
- Write meta.json with: total_slides, total_chunks, model_used, generated_at, warnings
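An assembly sketch under the data model below; `context["intro_paragraphs"]` is an assumed key, and the config keys match the defaults section above:

```python
# Assembly sketch: front matter, H1, intro, per-slide sections, trailing warnings comment.
def assemble_markdown(context: dict) -> str:
    cfg, slides = context["config"], context["slides"]
    lines: list[str] = []
    if cfg.get("front_matter", True):
        lines += ["---", f"title: {cfg['title']}", f"author: {cfg.get('author', 'Unknown')}",
                  f"date: {cfg['date']}", f"tags: {cfg.get('tags', [])}", "---", ""]
    lines += [f"# {cfg['title']}", ""] + context["intro_paragraphs"] + [""]
    for slide in slides:
        lines += [f"## Slide {slide['index']}", "",
                  f"![Slide {slide['index']}]({slide['image_path']})", ""]
        lines += slide["narrative_paragraphs"] + [""]
    if context["warnings"]:
        lines.append("<!-- warnings: " + "; ".join(context["warnings"]) + " -->")
    return "\n".join(lines)
```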
Slide: { index, image_path, fused_text, aligned_chunks: ["..."], narrative_paragraphs: ["..."], heading }
Context: { config, slides: [Slide], transcript_raw, warnings: [str] }
meta.json: { total_slides, total_chunks, model, generated_at, warnings }
Slide prompt user content structure:
SLIDE TEXT:
<fused_text>
TRANSCRIPT CHUNKS:
<joined_chunks>
Instructions: Write exactly {N} paragraphs in first person, authentic but concise ({verbosity}). Do not introduce facts not present above.
System message: “You are converting a presentation slide plus transcript snippets into a faithful first-person narrative.”
Intro prompt: similar but includes only early transcript portion and first slide fused_text. No paragraphs-per-slide constraint; request 1–2 paragraphs.
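A sketch of the per-slide generation call, assuming the openai>=1.x SDK and the prompt structure above; splitting paragraphs on blank lines is an added assumption:

```python
# Per-slide narrative call using the system/user structure defined above.
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY from the environment

SYSTEM = ("You are converting a presentation slide plus transcript snippets "
          "into a faithful first-person narrative.")


def narrate_slide(fused_text: str, chunks: list[str], paragraphs: int,
                  verbosity: str, model: str = "gpt-4o-mini") -> list[str]:
    joined_chunks = "\n".join(chunks)
    user = (f"SLIDE TEXT:\n{fused_text}\n\n"
            f"TRANSCRIPT CHUNKS:\n{joined_chunks}\n\n"
            f"Instructions: Write exactly {paragraphs} paragraphs in first person, "
            f"authentic but concise ({verbosity}). "
            "Do not introduce facts not present above.")
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": user}],
    )
    return [p.strip() for p in resp.choices[0].message.content.split("\n\n") if p.strip()]
```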
- Missing deck → abort (exit non-zero)
- Vision API failure for a slide → retry once then fallback to shapes/notes only, warning
- Transcript fetch failure → continue with empty transcript; narrative relies on slides
- Model generation failure for a slide → warning; insert placeholder paragraph
Stdout lines: [stage] message. No JSON logs.
- test_single_slide_minimal() – Produces an article with one slide section
- test_alignment_distribution() – Ensures each slide gets >= 0 chunks and the total is preserved
- test_article_structure() – Confirms front matter, H1 title, slide headings, image references
- Vision calls dominate latency → run sequentially first; add batching later
- For a 30-slide deck expect linear scaling; caching keyed by image hash prevents redundant re-runs
- Only use OPENAI_API_KEY from the environment
- Do not log raw transcript or fused text (stdout shows only counts)
| Area | Future Feature |
|---|---|
| Alignment | Add embeddings + semantic similarity + confidence scoring |
| Regeneration | Slide-specific regeneration with anchor markers |
| Quality | Hallucination heuristic, coverage stats, per-slide confidence JSON |
| Narrative | Conclusion generation, resources extraction, TOC, multi-speaker voice blending |
| Performance | Parallel vision batching, rate-limit adaptive retry, token budgeting |
| Observability | Structured JSON logs, metrics export, cost tracking |
| Security | Redaction filters, PII token masking before prompts |
- CLI entry + config loader
- Slide render + vision extraction + write images & slides.json
- Transcript load/fetch + segmentation
- Naive alignment distribution
- Narrative generation (slides then intro)
- Markdown assembly
- Minimal tests & meta.json
| PRD Story | MVP Handling |
|---|---|
| Upload & Initiate | Folder + deck + optional transcript recognized; errors surfaced |
| Transcript Normalization | Basic sentence split & whitespace cleanup |
| Slide Extraction | PNG + fused_text via vision model |
| Alignment | Sequential distribution (baseline) |
| Narrative Generation | Per-slide first-person paragraphs |
| Output Assembly | Single article.md with images |
| Quality & Consistency | Warnings list only (no numeric confidence) |
| Configuration & Reruns | Implicit rerun by re-executing command (no partial regen) |
| Storage & Retrieval | Outputs co-located in folder; minimal JSON metadata |
- Running talk2article folder/ with the sample deck yields article.md with all slide sections
- No unhandled exceptions on the empty or missing transcript case
- At least one warning appears when a slide has no transcript chunks
- All referenced slide images exist and render
This Lean MVP spec supersedes the prior comprehensive design for the initial implementation phase. Advanced capabilities are deferred to iterative milestones.
Responsibilities:
- Parse config & inputs
- Determine which stages to run (full vs targeted slides)
- Manage run ID, logging context, timing, error propagation
- Persist intermediate artifacts directory structure
Inputs: CLI args / config file, paths/URLs.
Outputs: Artifacts directory, final article.md, JSON reports.
- Load precedence: CLI flags > JSON/YAML config > environment defaults.
- Validates schema (verbosity level, paragraphs per slide, front matter enable flag, filler removal flag, etc.).
- Exposes immutable config object to downstream services.
- Validate PPTX/PDF file accessible & parseable
- Validate YouTube URL structure
- Determine transcript source: fetch vs provided file
- Surface specific error categories (NETWORK, FORMAT, MISSING_RESOURCE)
Simplified (vision-first) approach: always run a multimodal vision LLM to extract textual/structured content from every rendered slide image; classic OCR tools are no longer part of the primary path (optional future fallback only).
Subtasks:
- Convert PPTX → PNG slides (e.g., via python-pptx + headless conversion or a libreoffice container call; fallback PDF-to-image pipeline)
- Extract native slide textual content and speaker notes (shapes + notes)
- Vision extraction (each slide PNG → vision model vision_model_name) requesting strict JSON with: plain_text, bullets[], code_blocks[] (fields: language?, content), tables[] (2D arrays or Markdown rows), detected_title?
- Fusion: combine notes + shapes + vision JSON into fused_text (precedence: notes > shapes > vision). Deduplicate lines (case-insensitive) while preserving ordering: title/heading → shapes order → bullets → code blocks → table summaries.
- Confidence: assign default confidences (notes 1.0, shapes 0.95, vision 0.85) stored in extraction_meta.
- Produce outputs:
  - slides/slide-N.png
  - slides.json entries: { slide_index, filename, extracted_text (shapes), notes, vision_text, fused_text, extraction_meta[], width, height, hash }
  - Derived deck title candidate (from slide 1 vision detected_title or shapes text)
Caching:
- Cache vision responses keyed by SHA256(image_bytes + vision_model_name) to avoid recomputation across reruns.
Failure Handling:
- If the vision call fails after retries (3), use shapes+notes only; mark vision_failed: true and add a warning to the quality report.
- If the resulting fused_text length is < 15 characters, mark the slide needs_manual_review.
Security/Privacy:
- Optional redaction regex pre-process for shapes/notes before they are echoed into the prompt.
Downstream Effects:
- Embeddings consume fused_text primarily (falling back to shapes+notes if vision failed).
- The quality report flags slides with vision_failed or very low lexical overlap between shapes+notes and vision_text (potential hallucination).
Subtasks:
- Fetch the YouTube transcript (YouTube API or yt-dlp fallback) if not supplied
- Parse supplied .vtt / .srt / raw text
- Normalize: optional lowercasing and disfluency removal, preserve speaker labels, keep original timestamps
- Output the raw transcript (transcript_raw.txt) & normalized JSON segments before finer segmentation
- Split normalized transcript into semantically coherent chunks (~200–400 chars) using punctuation and pause thresholds.
- Provide each segment: id, start_time, end_time, speaker_label (if any), text
- Output transcript_segments.json
- Compute embeddings (OpenAI embedding model) for:
  - Slide textual bundle: fused_text (or shapes+notes fallback when vision failed)
  - Transcript segments
- Caching keyed by SHA256 of text + model name to avoid duplicate cost across reruns
- Output embeddings.json (maps ids → vector + model + text_hash)
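An embedding sketch with a JSON cache keyed by SHA256(text + model) as described above; the cache file name and helper name are illustrative:

```python
# Embedding call with a simple JSON cache keyed by SHA256(text + model).
import hashlib
import json
from pathlib import Path

from openai import OpenAI

client = OpenAI()
CACHE_PATH = Path("embedding_cache.json")  # first version uses JSON per the spec


def embed(texts: list[str], model: str = "text-embedding-3-large") -> list[list[float]]:
    cache = json.loads(CACHE_PATH.read_text()) if CACHE_PATH.exists() else {}
    keys = [hashlib.sha256((t + model).encode()).hexdigest() for t in texts]
    missing = [(k, t) for k, t in zip(keys, texts) if k not in cache]
    if missing:
        resp = client.embeddings.create(model=model, input=[t for _, t in missing])
        for (key, _), item in zip(missing, resp.data):
            cache[key] = item.embedding
        CACHE_PATH.write_text(json.dumps(cache))
    return [cache[k] for k in keys]
```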
Hybrid algorithm steps:
- Initialize candidate mapping via timestamp heuristics (if video timing or approximate slide durations inferred; assumption: future extension for actual slide change events; initial version uses sequential distribution based on total duration / slide count if real events absent).
- Refine with semantic similarity: compute cosine similarity between slide embedding and window of adjacent transcript segment embeddings; assign top matches exceeding threshold.
- Confidence scoring: weight(timestamp_score, semantic_score, coverage_ratio).
- Flag slides with no segment > semantic threshold.
- Output alignment.json with, per slide: slide_index, aligned_segment_ids (ordered), method_used (timestamp|semantic|greedy), confidence [0–1], warnings.
- For each slide, build prompt template including:
- Global talk context (title, audience profile, verbosity, narrative style guidelines)
- Slide text/notes/ocr summary
- Aligned transcript segment texts with segment IDs in comments for traceability
- Guardrails: Do not invent facts beyond provided content.
- Use OpenAI chat completion model for generation (configurable model name)
- Enforce paragraph count (post-process splitting / merging if necessary)
- Return per-slide: heading (derived or LLM-suggested), body paragraphs, internal citations (HTML comments with segment IDs, optional flag)
- Generate intro (using slide 1 + high-level summary of deck) and conclusion (using final slides + global summary prompt) as separate calls.
- Output narrative.json.
- Compose front matter (YAML) from config + derived metadata
- Conditional Table of Contents if slide count > threshold & flag enabled
- Insert slides sequentially:
  - Section heading ## Slide N: {heading}
  - Image syntax referencing slides/slide-N.png
  - Narrative paragraphs
- Append Resources section (links auto-detected via regex from slides/transcript; uniqueness enforced)
- Output final article.md
Checks:
- Alignment coverage (percent slides aligned)
- Low-confidence thresholds (< configurable min)
- Slides lacking alt text / text content
- Potential hallucinations: simple heuristic → narrative sentences with nouns not found in source text tokens (approximate lexical overlap score) below threshold
- Token cost estimate (sum of prompt/completion tokens if tracked)
- Output report.json
- Accept target slide indices for narrative regeneration
- Reuse existing artifacts (alignment, embeddings) unless --force is provided
- Update only changed sections in article.md (safe in-place update using placeholder markers or a structured regeneration plan)
- Run ID: YYYYMMDD-HHMMSS-random4, or a deterministic hash if inputs are identical & the --reuse flag is set
- Directory tree:
  - runs/<run_id>/article.md
  - runs/<run_id>/slides/slide-N.png
  - runs/<run_id>/slides.json
  - runs/<run_id>/transcript_raw.txt
  - runs/<run_id>/transcript_segments.json
  - runs/<run_id>/embeddings.json
  - runs/<run_id>/alignment.json
  - runs/<run_id>/narrative.json
  - runs/<run_id>/report.json
  - runs/<run_id>/config.json
- Symlink or copy the latest run to a latest/ convenience directory.
Commands:
- generate (full pipeline)
- regen --slides 3,7 (partial narrative)
- report --run <id> (display summary)
- list-runs
- Structured JSON logs per stage with: timestamp, run_id, stage, event, duration_ms, error_code(optional)
- Log levels: INFO default, DEBUG via flag
- Summaries appended to runs/<run_id>/run.log
(No source code, conceptual fields only.)
SlideRecord: { slide_index, image_path, extracted_text, notes, vision_text, fused_text, width, height, text_hash, vision_failed? }
TranscriptSegment: { id, start_time, end_time, speaker?, text, tokens? }
EmbeddingRecord: { id, type(slide|segment), model, vector_dim, text_hash }
AlignmentEntry: { slide_index, segment_ids[], method_used, confidence, warnings[] }
NarrativeSlide: { slide_index, heading, paragraphs[], citation_segment_ids[], confidence }
QualityReport: { run_id, generated_at, stats: { total_slides, aligned_slides, avg_confidence }, warnings[], hallucination_flags[], cost_estimate: { prompt_tokens?, completion_tokens?, total? } }
Config: { model_name, embedding_model_name, verbosity_level, paragraphs_per_slide_range, remove_fillers, toc_threshold, front_matter_enabled, regenerate_slide_indices?, max_concurrency, similarity_thresholds { semantic, alignment }, low_confidence_threshold }
- Tokenize by sentences; accumulate until character window (200–400) or pause gap > threshold
- Merge very short trailing segments with predecessor
- Text normalized (strip whitespace) before hashing
- Cache file: embedding_cache.sqlite (key: model + hash) or JSON if simplicity is preferred; first version uses JSON.
confidence = w_timestamp * timestamp_score + w_semantic * semantic_score + w_coverage * coverage_ratio
- timestamp_score: 1 if segment time window overlaps expected slide window else decays
- semantic_score: max cosine similarity among chosen segments
- coverage_ratio: sum(len(segment text))/max( target_length_baseline, 1 )
- Weights configurable; defaults 0.3 / 0.5 / 0.2.
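A direct transcription of the formula with the default weights; clamping coverage_ratio to 1.0 is an added assumption so the result stays within the documented [0, 1] range:

```python
# Weighted confidence score with defaults 0.3 / 0.5 / 0.2; coverage_ratio is
# clamped to 1.0 (assumption) so confidence remains in [0, 1].
def confidence(timestamp_score: float, semantic_score: float, coverage_ratio: float,
               w_timestamp: float = 0.3, w_semantic: float = 0.5,
               w_coverage: float = 0.2) -> float:
    return (w_timestamp * timestamp_score
            + w_semantic * semantic_score
            + w_coverage * min(coverage_ratio, 1.0))
```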
- Build source lexicon: union of tokens from aligned transcript segments + fused_text of slides
- For each narrative sentence compute overlap = (#source_tokens_intersection) / (#sentence_tokens)
- Flag if overlap < threshold (e.g., 0.35) & the sentence contains proper nouns (capitalization heuristic) → add to report.json.
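A sketch of the lexical-overlap heuristic; the 0.35 threshold matches the example above, and the tokenization and capitalization checks are simplifications:

```python
# Flags narrative sentences with low lexical overlap against the source lexicon.
import re


def flag_hallucinations(sentences: list[str], source_text: str,
                        threshold: float = 0.35) -> list[str]:
    lexicon = set(re.findall(r"[A-Za-z0-9']+", source_text.lower()))
    flagged = []
    for sentence in sentences:
        tokens = re.findall(r"[A-Za-z0-9']+", sentence)
        if not tokens:
            continue
        overlap = sum(t.lower() in lexicon for t in tokens) / len(tokens)
        has_proper_noun = any(t[0].isupper() for t in tokens[1:])  # crude capitalization heuristic
        if overlap < threshold and has_proper_noun:
            flagged.append(sentence)
    return flagged
```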
- For target slides: reuse narrative prompt components; regenerate only those slides
- Rebuild narrative.json entries & patch article.md using anchor markers: <!-- SLIDE_SECTION_START:N --> / <!-- SLIDE_SECTION_END:N -->
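A patching sketch using the anchor markers above; the function name is illustrative:

```python
# Replaces the content between one slide's anchor markers in article.md.
import re
from pathlib import Path


def patch_slide_section(article: Path, slide_index: int, new_body: str) -> None:
    start = f"<!-- SLIDE_SECTION_START:{slide_index} -->"
    end = f"<!-- SLIDE_SECTION_END:{slide_index} -->"
    pattern = re.compile(re.escape(start) + r".*?" + re.escape(end), re.DOTALL)
    replacement = f"{start}\n{new_body}\n{end}"
    # Lambda avoids backslash-escape surprises if new_body contains regex groups.
    article.write_text(pattern.sub(lambda _: replacement, article.read_text()))
```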
- model_name (default: GPT-4o or equivalent Azure deployment)
- embedding_model_name (e.g., text-embedding-3-large)
- verbosity_level (concise|standard|expanded)
- paragraphs_per_slide_range (min,max)
- remove_fillers (bool)
- toc_threshold (int, default 8)
- front_matter_enabled (bool)
- low_confidence_threshold (float, default 0.55)
- semantic_similarity_threshold (float, default 0.55)
- max_alignment_segments_per_slide (default 5)
- max_concurrency (embedding batch parallelism)
- citation_comments_enabled (bool)
- vision_model_name (e.g., a gpt-4o-mini vision variant)
- vision_batch_size (int, default 4)
- vision_cache_enabled (bool)
- cost_tracking_enabled (bool)
- regeneration_slide_indices (list)
Error Categories:
- INPUT_VALIDATION_ERROR (missing file, invalid URL)
- TRANSCRIPT_FETCH_ERROR (network, permission, unavailable)
- DECK_PARSE_ERROR (corrupt PPTX)
- OCR_ERROR
- EMBEDDING_ERROR (API rate limit, auth)
- ALIGNMENT_ERROR
- LLM_GENERATION_ERROR
- ASSEMBLY_ERROR
Strategy:
- Fail fast on critical earlier stages; produce partial artifacts with status=failed in the run log
- For non-critical slides (e.g., a single OCR failure), continue and mark a warning
- Retry policy: Exponential backoff for embedding & LLM calls (max 3 retries)
- Graceful cancellation: on interrupt, flush artifacts & write a run_aborted marker
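A generic exponential-backoff wrapper for embedding/LLM calls matching the max-3-retries policy above; the helper name is illustrative:

```python
# Retries a callable up to max_retries times with exponential backoff.
import time


def with_retries(call, max_retries: int = 3, base_delay: float = 1.0):
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(base_delay * (2 ** attempt))
```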
- No hard-coded API keys (use OPENAI_API_KEY or Azure-specific env vars)
- Option to redact sensitive text patterns before sending to the LLM (regex list configurable)
- Store only hashed embeddings if PII risk flagged (future extension)
- Avoid logging raw transcript segments at DEBUG unless the allow_raw_logging flag is set
- Batch embeddings (adaptive batch size by token length)
- Parallel slide image rendering where safe
- Streaming LLM output optional (future) – currently blocking calls
- Complexity: O(S + T + S*T_emb_similarity?) → reduce by pre-filtering segments using locality (windowed approach around expected time)
- Vision extraction adds constant overhead per slide; mitigated by batching & caching.
Test Layers:
- Unit: segmentation logic, similarity scoring, confidence computation
- Integration: end-to-end small sample deck (3 slides) & synthetic transcript
- Alignment accuracy evaluation harness: curated dataset with known mappings (≥90% coverage assertion)
- Regression: Golden output snapshots for narrative with stable config (allow minor diff tolerance via semantic similarity instead of raw string match)
- Error injection tests: simulate missing transcript, API rate limit, corrupt slide image
- Performance test: 100-slide synthetic deck within time budget (< X minutes; define after prototype)
Artifacts for tests located under tests/ mirroring service modules.
- Metrics JSON appended: { stage, duration_ms, token_usage, error_counts }
- Optional integration with stdout for container logs
- Derived metrics: average confidence, cost per slide, processing time per slide
- Distributable as a CLI tool (entry point in pyproject.toml or console_scripts) (future)
- Container image: base Python + libreoffice (if PPTX→PNG) + tesseract (if OCR enabled)
- Azure Container Apps target (aligns with repo name) – environment variables injected via secrets store
- Access to an embedding & chat-completion capable OpenAI (or Azure OpenAI) endpoint via the openai SDK
- PPTX is primary; the PDF fallback is already flattened; notes may be absent
- No precise slide change timestamps available in MVP; using heuristic distribution
- Single primary speaker voice for narrative in MVP
- Acceptable to use cosine similarity over normalized L2 for embeddings
- Cost tracking relies on API returned usage fields (if provided)
- Slide rendering inconsistencies → Normalize target width & maintain aspect ratio
- Large transcripts (token explosion) → Segment summarization before prompt if > threshold tokens
- Rate limits → Centralized throttle & caching
- Hallucinations → Strict prompt instructions + lexicon overlap heuristic + optional inline citations
- Vision extraction hallucination (invented tables/code) → enforce strict JSON schema, compare overlap with shapes+notes; if overlap < threshold flag for review.
User → CLI → Orchestrator → Validation → Slides → Transcript → Segmentation → Embeddings → Alignment → Narrative (loop per slide) → Assembly → Quality → Storage → Output paths returned.
User → CLI (regen) → Orchestrator (load existing run) → Narrative (slides 3,7) → Assembly patch → Quality delta update → Updated artifacts.
- Scaffold module directories & placeholder service classes (no logic) to enable incremental commits
- Implement Configuration Manager & CLI argument parsing
- Input Validation (PPTX presence, URL regex, transcript detection)
- Deck Processing (render to PNG, extract text & notes, produce slides.json)
- Transcript fetching & normalization (support file & remote)
- Segmentation algorithm + unit tests
- Embedding service with caching layer
- Alignment hybrid algorithm + confidence scoring (tests with synthetic data)
- Narrative prompt template design + per-slide generation (intro, conclusion)
- Markdown assembly with front matter, TOC conditional, resources extraction
- Quality report heuristics (coverage, hallucination approximation, alt text, low confidence)
- Regeneration controller (anchor markers + selective patching)
- Logging & metrics instrumentation
- Error handling refinement & retry logic
- Add test suite coverage for edge cases & performance scenario
- Documentation updates (README, usage examples) & sample run artifacts
- Optional: containerization & GitHub Action integration
- Each PRD story mapped to artifacts: see Section 3 & outputs folder; ensure test cases validate presence & quality thresholds.
- Multi-speaker narrative blending
- Translation layer
- SEO metadata suggestions
- Slide change detection from video frame differencing
- Vector store persistence for cross-talk analytics
- Per-element bounding boxes via vision model for layout reconstruction
- Running generate on sample inputs produces all artifact files with non-empty sections
- Alignment coverage ≥ 90% on the test deck
- Narrative uses first-person pronouns consistently
- report.json enumerates warnings when low-confidence slides are present