YouTube Talk to Article – Lean MVP Technical Spec

Derived from the original PRD but intentionally simplified for the fastest viable implementation. Focus: “folder in, article out” with minimal config, vision-first extraction, and straightforward alignment (no embeddings initially).

1. MVP Goal & Non-Goals

Goal: Given a folder containing a deck.pptx (and optionally a transcript + config.yaml), produce article.md plus slide images using a simple CLI: talk2article <folder>.

Included in MVP:

  • Slide rendering to PNG
  • Vision model extraction of slide textual content
  • Basic transcript acquisition & segmentation
  • Naive sequential alignment of transcript chunks to slides
  • First-person narrative (intro + per-slide sections)
  • Single Markdown output with front matter

Deferred (Future Enhancements): embeddings-based alignment, confidence scoring, detailed quality report, regeneration of specific slides, hallucination heuristics, SEO/TOC logic, advanced logging, multi-speaker handling.

2. Lean Architecture Overview

Six linear functions executed in order (no complex orchestration layer):

  1. load_config_and_inputs(folder)
  2. render_and_extract_slides(context)
  3. fetch_and_normalize_transcript(context)
  4. segment_and_align(context)
  5. generate_narrative(context)
  6. assemble_markdown(context)

Shared context dict accumulates results. Only essential artifacts written.
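
A minimal sketch of this flow (stage bodies elided; see 5.1–5.6 for responsibilities):

```python
# Minimal pipeline sketch: six stages share one mutable context dict.
def run_pipeline(folder: str) -> dict:
    context = load_config_and_inputs(folder)      # 5.1
    for stage in (
        render_and_extract_slides,                # 5.2
        fetch_and_normalize_transcript,           # 5.3
        segment_and_align,                        # 5.4
        generate_narrative,                       # 5.5
        assemble_markdown,                        # 5.6
    ):
        print(f"[{stage.__name__}] running")      # simple stdout logging (Section 9)
        stage(context)
    return context
```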

3. Minimal Inputs & Conventions

Required: deck.pptx in target folder.

Optional files (if absent, defaults apply):

  • config.yaml
  • transcript.(srt|vtt|txt)

Output files written alongside inputs:

  • article.md
  • slides/slide-N.png
  • slides.json (basic slide metadata and fused text)
  • meta.json (simple run metadata & warnings)

4. Configuration (Single YAML, Optional)

If config.yaml missing, defaults apply. Supported keys:

title: "Optional override"            # Else derived from first slide text
author: "Unknown"                    # Optional
date: "2025-10-08"                   # Default = today
tags: [talk, article]                 # Optional list
verbosity: standard                   # concise|standard|expanded (guides narrative tone)
paragraphs_per_slide: 2               # Fixed integer
front_matter: true                    # Include YAML front matter
vision_model: gpt-4o-mini             # Single model for vision + text generation

No CLI flags besides the folder path in MVP.

5. Core Function Responsibilities

5.1 load_config_and_inputs(folder)

  • Validate deck.pptx exists
  • Load config.yaml if present (no schema library; simple key allowlist)
  • Detect transcript file if present (extension sniff)
  • Initialize context = {config, slides: [], transcript_raw: None, warnings: []}
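
A minimal sketch of this loader, assuming PyYAML is available (defaults mirror Section 4):

```python
import datetime
from pathlib import Path

import yaml  # PyYAML, assumed available

ALLOWED_KEYS = {"title", "author", "date", "tags", "verbosity",
                "paragraphs_per_slide", "front_matter", "vision_model"}
DEFAULTS = {"author": "Unknown", "tags": ["talk", "article"],
            "verbosity": "standard", "paragraphs_per_slide": 2,
            "front_matter": True, "vision_model": "gpt-4o-mini"}

def load_config_and_inputs(folder: str) -> dict:
    root = Path(folder)
    deck = root / "deck.pptx"
    if not deck.exists():
        raise SystemExit(f"deck.pptx not found in {root}")  # abort, exit non-zero
    config = dict(DEFAULTS, date=datetime.date.today().isoformat())
    cfg_path = root / "config.yaml"
    if cfg_path.exists():
        loaded = yaml.safe_load(cfg_path.read_text()) or {}
        # simple key allowlist instead of a schema library
        config.update({k: v for k, v in loaded.items() if k in ALLOWED_KEYS})
    transcript = None
    for ext in ("srt", "vtt", "txt"):  # extension sniff
        candidate = root / f"transcript.{ext}"
        if candidate.exists():
            transcript = candidate
            break
    return {"config": config, "folder": root, "deck_path": deck,
            "transcript_path": transcript, "slides": [],
            "transcript_raw": None, "warnings": []}
```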

5.2 render_and_extract_slides(context)

  • Use python-pptx to iterate slides, export each slide as PNG (via Pillow composition or conversion method)
  • Invoke vision model per slide (one request per slide for MVP; batching later) with prompt: “Extract plain text, bullet lists, code blocks (with language if obvious), and tables in JSON.”
  • Build fused_text = notes + shapes + vision (deduplicated line-wise)
  • Append slide record: {index, image_path, fused_text}
  • Write slides/slide-N.png progressively
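
The per-slide vision request could look like the following sketch, assuming the official openai Python SDK (the exact prompt wording and JSON response handling are assumptions):

```python
import base64
import json

from openai import OpenAI  # reads OPENAI_API_KEY from the environment

client = OpenAI()

VISION_PROMPT = ("Extract plain text, bullet lists, code blocks (with language "
                 "if obvious), and tables from this slide. Respond in JSON.")

def extract_slide_text(png_path: str, model: str = "gpt-4o-mini") -> dict:
    b64 = base64.b64encode(open(png_path, "rb").read()).decode()
    response = client.chat.completions.create(
        model=model,  # from config: vision_model
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": [
            {"type": "text", "text": VISION_PROMPT},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]}],
    )
    return json.loads(response.choices[0].message.content)
```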

5.3 fetch_and_normalize_transcript(context)

  • If transcript file provided: parse depending on extension (very light regex for timestamps)
  • Else: attempt fetch via yt-dlp --write-auto-subs --skip-download (future), or emit a warning and set the transcript to empty
  • Normalize: remove leading timestamps, collapse multiple spaces, keep basic punctuation

Transcript Fetch Strategy (Public Videos – No YouTube API Key Required):

  1. Preferred: use a lightweight captions retrieval library (e.g., python implementation similar to youtube-transcript-api) to request English captions with preferred language codes ["en", "en-US"]. This accesses YouTube's public timed text endpoint directly; no API key needed for public or unlisted videos with captions enabled.
  2. If manual transcript file exists (transcript.srt, .vtt, or .txt), it overrides any remote fetch.
  3. If library fetch fails (no captions, disabled, or endpoint error): optionally attempt a fallback shell call to yt-dlp (if installed) with auto subs flags. This step is deferred in the MVP unless explicitly enabled later.
  4. If still unavailable, proceed with empty transcript and append warning: "no transcript available; relying on slides only".
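
A sketch of step 1, assuming the youtube-transcript-api package (its API differs across versions, and where video_id comes from is a future config detail):

```python
from youtube_transcript_api import YouTubeTranscriptApi

def fetch_captions(video_id: str, context: dict) -> str:
    try:
        segments = YouTubeTranscriptApi.get_transcript(
            video_id, languages=["en", "en-US"])
        return " ".join(seg["text"] for seg in segments)
    except Exception:  # no captions, captions disabled, or endpoint error
        context["warnings"].append(
            "no transcript available; relying on slides only")
        return ""
```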

Normalization Rules:

  • Strip non-speech cues such as lines enclosed in brackets: [Music], [Applause].
  • Merge consecutive very short (< 40 chars) fragments into the preceding fragment to reduce fragmentation.
  • Preserve original capitalization; do not lowercase.
  • Remove timestamps if present (HH:MM:SS.xxx -->) and extraneous numbering lines (common in SRT).
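
These rules reduce to a few regexes; a sketch:

```python
import re

CUE_RE = re.compile(r"^\s*\[(?:music|applause|laughter)\]\s*$", re.I)
TIMESTAMP_RE = re.compile(
    r"\d{2}:\d{2}:\d{2}[.,]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[.,]\d{3}")

def normalize_transcript(raw: str) -> list[str]:
    lines: list[str] = []
    for line in raw.splitlines():
        if CUE_RE.match(line) or TIMESTAMP_RE.search(line):
            continue  # strip non-speech cues and SRT/VTT timestamp lines
        if line.strip().isdigit():
            continue  # SRT numbering lines
        text = re.sub(r"\s+", " ", line).strip()  # collapse multiple spaces
        if not text:
            continue
        if lines and len(text) < 40:
            lines[-1] = f"{lines[-1]} {text}"  # merge very short fragments
        else:
            lines.append(text)  # capitalization preserved; never lowercased
    return lines
```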

Warnings Generated:

  • Missing transcript entirely.
  • Transcript present but yielded zero usable speech lines after filtering.
  • Fetch failure (network or captions disabled).

Future Enhancements (Not in MVP):

  • Language auto-detect & optional translation pipeline.
  • Multi-track selection with user-specified priority list.
  • Partial transcript gap detection (long silence spans) for improved alignment heuristics.

5.4 segment_and_align(context)

  • Split transcript into sentences (naive period/question/exclamation split)
  • Group sentences into chunks ~250–400 chars
  • Naive alignment: distribute chunks sequentially across slides proportionally:
    • chunks_per_slide = ceil(total_chunks / total_slides)
    • Slice list accordingly; if remainder, append to last slide
  • Attach aligned_chunks list to each slide record
  • If a slide gets 0 chunks, add warning ("slide X has no transcript chunks")
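
A minimal sketch of the steps above:

```python
import math
import re

def segment_and_align(context: dict) -> None:
    # naive sentence split on ., ?, ! followed by whitespace
    sentences = re.split(r"(?<=[.?!])\s+", context.get("transcript_raw") or "")
    chunks, current = [], ""
    for sentence in filter(None, sentences):
        current = f"{current} {sentence}".strip()
        if len(current) >= 250:   # flush once ~250-400 chars accumulate
            chunks.append(current)
            current = ""
    if current:
        chunks.append(current)

    slides = context["slides"]
    per_slide = math.ceil(len(chunks) / max(len(slides), 1))
    for i, slide in enumerate(slides):
        # ceil guarantees every chunk lands in exactly one slice
        slide["aligned_chunks"] = chunks[i * per_slide:(i + 1) * per_slide]
        if not slide["aligned_chunks"]:
            context["warnings"].append(
                f"slide {slide['index']} has no transcript chunks")
```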

5.5 generate_narrative(context)

  • For each slide:
    • Prompt includes: fused_text + concatenated aligned chunks + style directive (verbosity, first-person)
    • Ask model: “Produce exactly {paragraphs_per_slide} paragraphs in first person; no new facts.”
    • Capture a heading: either the first line of fused_text truncated to 60 chars, or a model-suggested heading (taken from the model's first output line) when fused_text is short
  • Intro: separate prompt using first slide fused_text + first 2 transcript chunks overall
  • No conclusion in MVP (can add if last slide heading contains ‘Conclusion’ later)

5.6 assemble_markdown(context)

  • YAML front matter if enabled
  • H1 = Title
  • Intro paragraphs
  • For each slide: ## Slide {i} + image reference + paragraphs
  • Append HTML comment with warnings if any
  • Write article.md
  • Write slides.json (array of minimal slide objects)
  • Write meta.json with: total_slides, total_chunks, model_used, generated_at, warnings
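
A sketch of this assembly, assuming the context fields accumulated in 5.1–5.5:

```python
import json
from datetime import datetime, timezone

def assemble_markdown(context: dict) -> None:
    cfg, slides, folder = context["config"], context["slides"], context["folder"]
    parts = []
    if cfg.get("front_matter", True):
        parts += ["---", f"title: {cfg.get('title', 'Untitled')}",
                  f"author: {cfg.get('author', 'Unknown')}",
                  f"date: {cfg.get('date', '')}", "---", ""]
    parts += [f"# {cfg.get('title', 'Untitled')}", ""]
    parts += context.get("intro_paragraphs", []) + [""]
    for slide in slides:
        parts += [f"## Slide {slide['index']}",
                  f"![Slide {slide['index']}]({slide['image_path']})", ""]
        parts += slide["narrative_paragraphs"] + [""]
    if context["warnings"]:
        parts.append("<!-- warnings: " + "; ".join(context["warnings"]) + " -->")
    (folder / "article.md").write_text("\n".join(parts))
    (folder / "slides.json").write_text(json.dumps(
        [{"index": s["index"], "image_path": s["image_path"],
          "fused_text": s["fused_text"]} for s in slides], indent=2))
    (folder / "meta.json").write_text(json.dumps({
        "total_slides": len(slides),
        "total_chunks": sum(len(s.get("aligned_chunks", [])) for s in slides),
        "model_used": cfg.get("vision_model"),
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "warnings": context["warnings"],
    }, indent=2))
```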

6. Data Structures (MVP Simplified)

Slide: { index, image_path, fused_text, aligned_chunks: ["..."], narrative_paragraphs: ["..."], heading }

Context: { config, slides: [Slide], transcript_raw, warnings: [str] }

meta.json: { total_slides, total_chunks, model, generated_at, warnings }

7. Prompt Sketches (Conceptual)

Slide prompt user content structure:

SLIDE TEXT:
<fused_text>

TRANSCRIPT CHUNKS:
<joined_chunks>

Instructions: Write exactly {N} paragraphs in first person, authentic but concise ({verbosity}). Do not introduce facts not present above.

System message: “You are converting a presentation slide plus transcript snippets into a faithful first-person narrative.”

Intro prompt: similar but includes only early transcript portion and first slide fused_text. No paragraphs-per-slide constraint; request 1–2 paragraphs.
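
Wired into the openai SDK, a per-slide generation call might look like this sketch (message assembly follows the structure above; everything else is an assumption):

```python
from openai import OpenAI

client = OpenAI()

SYSTEM = ("You are converting a presentation slide plus transcript snippets "
          "into a faithful first-person narrative.")

def narrate_slide(fused_text: str, chunks: list[str], n: int,
                  verbosity: str, model: str = "gpt-4o-mini") -> str:
    user = (f"SLIDE TEXT:\n{fused_text}\n\n"
            f"TRANSCRIPT CHUNKS:\n" + "\n".join(chunks) + "\n\n"
            f"Instructions: Write exactly {n} paragraphs in first person, "
            f"authentic but concise ({verbosity}). "
            "Do not introduce facts not present above.")
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": user}],
    )
    return response.choices[0].message.content
```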

8. Error Handling (MVP)

  • Missing deck → abort (exit non-zero)
  • Vision API failure for a slide → retry once then fallback to shapes/notes only, warning
  • Transcript fetch failure → continue with empty transcript; narrative relies on slides
  • Model generation failure for a slide → warning; insert placeholder paragraph

9. Logging (Simple)

Stdout lines: [stage] message. No JSON logs.

10. Testing (Initial Set)

  1. test_single_slide_minimal() – Produces article with one slide section
  2. test_alignment_distribution() – Ensures every chunk is assigned to exactly one slide (total preserved) and zero-chunk slides produce warnings
  3. test_article_structure() – Confirms front matter, H1 title, slide headings, image references
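
A pytest sketch of test 3, reusing the pipeline sketch from Section 2 (make_fixture_folder is a hypothetical helper that copies a one-slide deck into tmp_path):

```python
def test_article_structure(tmp_path):
    folder = make_fixture_folder(tmp_path)   # hypothetical fixture helper
    run_pipeline(str(folder))                # pipeline sketch from Section 2
    article = (folder / "article.md").read_text()
    lines = article.splitlines()
    assert article.startswith("---")                      # YAML front matter
    assert any(l.startswith("# ") for l in lines)         # H1 title
    assert "## Slide 1" in lines                          # slide heading
    assert any("![Slide 1](" in l for l in lines)         # image reference
```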

11. Performance Considerations

  • Vision calls dominate latency → run sequentially first; add batching later
  • For a 30-slide deck expect linear scaling; caching keyed by image hash prevents redundant re-runs

12. Security / Privacy (MVP Scope)

  • Only use OPENAI_API_KEY from environment
  • Do not log raw transcript or fused text (stdout shows only counts)

13. Future Enhancements (Graduation Path)

| Area | Future Feature |
| --- | --- |
| Alignment | Add embeddings + semantic similarity + confidence scoring |
| Regeneration | Slide-specific regeneration with anchor markers |
| Quality | Hallucination heuristic, coverage stats, per-slide confidence JSON |
| Narrative | Conclusion generation, resources extraction, TOC, multi-speaker voice blending |
| Performance | Parallel vision batching, rate-limit adaptive retry, token budgeting |
| Observability | Structured JSON logs, metrics export, cost tracking |
| Security | Redaction filters, PII token masking before prompts |

14. Implementation Order (Lean)

  1. CLI entry + config loader
  2. Slide render + vision extraction + write images & slides.json
  3. Transcript load/fetch + segmentation
  4. Naive alignment distribution
  5. Narrative generation (slides then intro)
  6. Markdown assembly
  7. Minimal tests & meta.json

15. Acceptance Criteria Mapping (MVP Subset)

| PRD Story | MVP Handling |
| --- | --- |
| Upload & Initiate | Folder + deck + optional transcript recognized; errors surfaced |
| Transcript Normalization | Basic sentence split & whitespace cleanup |
| Slide Extraction | PNG + fused_text via vision model |
| Alignment | Sequential distribution (baseline) |
| Narrative Generation | Per-slide first-person paragraphs |
| Output Assembly | Single article.md with images |
| Quality & Consistency | Warnings list only (no numeric confidence) |
| Configuration & Reruns | Implicit rerun by re-executing command (no partial regen) |
| Storage & Retrieval | Outputs co-located in folder; minimal JSON metadata |

16. Definition of Done (MVP)

  • Running talk2article folder/ with sample deck yields article.md with all slide sections
  • No unhandled exceptions on empty or missing transcript case
  • At least one warning appears when a slide has no transcript chunks
  • All referenced slide images exist and render

This Lean MVP spec supersedes the prior comprehensive design for the initial implementation phase. Advanced capabilities are deferred to iterative milestones. The sections that follow preserve that prior comprehensive design for reference; note that their numbering restarts at 3.

3. Component Breakdown

3.1 Orchestrator (services.orchestrator)

Responsibilities:

  • Parse config & inputs
  • Determine which stages to run (full vs targeted slides)
  • Manage run ID, logging context, timing, error propagation
  • Persist intermediate artifacts directory structure

Inputs: CLI args / config file, paths/URLs. Outputs: Artifacts directory, final article.md, JSON reports.

3.2 Configuration Manager (services.config)

  • Load precedence: CLI flags > JSON/YAML config > environment defaults.
  • Validates schema (verbosity level, paragraphs per slide, front matter enable flag, filler removal flag, etc.).
  • Exposes immutable config object to downstream services.

3.3 Input Validation Service (services.validation)

  • Validate PPTX/PDF file accessible & parseable
  • Validate YouTube URL structure
  • Determine transcript source: fetch vs provided file
  • Surface specific error categories (NETWORK, FORMAT, MISSING_RESOURCE)

3.4 Deck Processing Service (services.slides)

Simplified (vision-first) approach: always run a multimodal vision LLM to extract textual/structured content from every rendered slide image; classic OCR tools are no longer part of the primary path (optional future fallback only).

Subtasks:

  • Convert PPTX → PNG slides (e.g., via python-pptx + headless conversion or libreoffice container call; fallback PDF to image pipeline)
  • Extract native slide textual content and speaker notes (shapes + notes)
  • Vision extraction (each slide PNG → vision model vision_model_name) requesting strict JSON with:
    • plain_text
    • bullets[]
    • code_blocks[] (fields: language?, content)
    • tables[] (2D arrays or Markdown rows)
    • detected_title?
  • Fusion: Combine notes + shapes + vision JSON into fused_text (precedence: notes > shapes > vision). Deduplicate lines (case-insensitive) while preserving ordering: title / heading → shapes order → bullets → code blocks → table summaries.
  • Confidence: assign default confidences (notes 1.0, shapes 0.95, vision 0.85) stored in extraction_meta.
  • Produce outputs:
    • slides/slide-N.png
    • slides.json entries: { slide_index, filename, extracted_text (shapes), notes, vision_text, fused_text, extraction_meta[], width, height, hash }
    • Derived deck title candidate (from slide 1 vision detected_title or shapes text)

Caching:

  • Cache vision responses keyed by SHA256(image_bytes + vision_model_name) to avoid recomputation across reruns.
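
A sketch of that cache, with a JSON file standing in for whatever store is chosen (call_fn performs the actual vision request; the cache location is an assumption):

```python
import hashlib
import json
from pathlib import Path

CACHE_PATH = Path(".vision_cache.json")  # assumed location

def cached_vision_extract(image_bytes: bytes, vision_model_name: str,
                          call_fn) -> dict:
    # key = SHA256(image_bytes + vision_model_name), per the rule above
    key = hashlib.sha256(image_bytes + vision_model_name.encode()).hexdigest()
    cache = json.loads(CACHE_PATH.read_text()) if CACHE_PATH.exists() else {}
    if key not in cache:
        cache[key] = call_fn(image_bytes, vision_model_name)  # real request
        CACHE_PATH.write_text(json.dumps(cache))
    return cache[key]
```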

Failure Handling:

  • If the vision call fails after 3 retries, use shapes+notes only; mark vision_failed: true and add a warning to the quality report.
  • If the resulting fused_text is shorter than 15 characters, mark the slide needs_manual_review.

Security/Privacy:

  • Optional redaction regex pre-process for shapes/notes before they are echoed into the prompt.

Downstream Effects:

  • Embeddings consume fused_text primarily (fallback to shapes+notes if vision failed).
  • Quality report flags slides with vision_failed or very low lexical overlap between shapes+notes and vision_text (potential hallucination).

3.5 Transcript Service (services.transcript)

Subtasks:

  • Fetch YouTube transcript (YouTube API or yt-dlp fallback) if not supplied
  • Parse supplied .vtt / .srt / raw text
  • Normalize: optional disfluency (filler) removal; preserve speaker labels and original timestamps
  • Output raw transcript (transcript_raw.txt) & normalized JSON segments before finer segmentation

3.6 Segmentation Service (services.segmentation)

  • Split normalized transcript into semantically coherent chunks (~200–400 chars) using punctuation and pause thresholds.
  • Provide each segment: id, start_time, end_time, speaker_label (if any), text
  • Output transcript_segments.json

3.7 Embeddings Service (services.embeddings)

  • Compute embeddings (OpenAI embedding model) for:
    • Slide textual bundle: fused_text (or shapes+notes fallback when vision failed)
    • Transcript segments
  • Caching keyed by SHA256 of text + model name to avoid duplicate cost across reruns
  • Output embeddings.json (maps ids → vector + model + text_hash)
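
A sketch using the openai SDK with an in-memory cache dict (persisting the cache to embeddings.json is left out):

```python
import hashlib

from openai import OpenAI

client = OpenAI()

def _key(text: str, model: str) -> str:
    # SHA256 of text + model name, per the caching rule above
    return hashlib.sha256(f"{model}:{text}".encode()).hexdigest()

def embed_texts(texts: list[str], model: str = "text-embedding-3-large",
                cache: dict | None = None) -> list[list[float]]:
    cache = {} if cache is None else cache
    missing = [t for t in texts if _key(t, model) not in cache]
    if missing:
        response = client.embeddings.create(model=model, input=missing)
        for text, item in zip(missing, response.data):
            cache[_key(text, model)] = item.embedding
    return [cache[_key(t, model)] for t in texts]
```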

3.8 Alignment Service (services.alignment)

Hybrid algorithm steps:

  1. Initialize a candidate mapping via timestamp heuristics. If no real slide-change events are available (detecting them from the video is a future extension), fall back to sequential distribution based on total duration / slide count.
  2. Refine with semantic similarity: compute cosine similarity between slide embedding and window of adjacent transcript segment embeddings; assign top matches exceeding threshold.
  3. Confidence scoring: weight(timestamp_score, semantic_score, coverage_ratio).
  4. Flag slides with no segment > semantic threshold.
  5. Output alignment.json with per slide: slide_index, aligned_segment_ids (ordered), method_used (timestamp|semantic|greedy), confidence [0–1], warnings.
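
A sketch of steps 2 and 4–5 above, assuming embeddings are numpy vectors and step 1 yields one guessed segment index per slide (confidence scoring is covered separately in 5.3):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def refine_alignment(slide_vecs: list[np.ndarray],
                     segment_vecs: list[np.ndarray],
                     initial_guess: list[int],   # step 1: segment index per slide
                     window: int = 5,
                     threshold: float = 0.55) -> list[dict]:
    entries = []
    for i, guess in enumerate(initial_guess):
        lo = max(0, guess - window)
        hi = min(len(segment_vecs), guess + window + 1)
        # keep adjacent segments whose similarity exceeds the threshold
        kept = [j for j in range(lo, hi)
                if cosine(slide_vecs[i], segment_vecs[j]) >= threshold]
        entries.append({
            "slide_index": i,
            "aligned_segment_ids": kept,   # ordered by position
            "method_used": "semantic" if kept else "timestamp",
            "warnings": [] if kept else ["no segment above semantic threshold"],
        })
    return entries
```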

3.9 Narrative Generation Service (services.narrative)

  • For each slide, build prompt template including:
    • Global talk context (title, audience profile, verbosity, narrative style guidelines)
    • Slide text/notes/vision summary
    • Aligned transcript segment texts with segment IDs in comments for traceability
    • Guardrails: Do not invent facts beyond provided content.
  • Use OpenAI chat completion model for generation (configurable model name)
  • Enforce paragraph count (post-process splitting / merging if necessary)
  • Return per-slide: heading (derived or LLM-suggested), body paragraphs, internal citations (HTML comments with segment IDs, optional flag)
  • Generate intro (using slide 1 + high-level summary of deck) and conclusion (using final slides + global summary prompt) as separate calls.
  • Output narrative.json.

3.10 Markdown Assembly Service (services.assembly)

  • Compose front matter (YAML) from config + derived metadata
  • Conditional Table of Contents if slide count > threshold & flag enabled
  • Insert slides sequentially:
    • Section heading ## Slide N: {heading}
    • Image syntax referencing slides/slide-N.png
    • Narrative paragraphs
  • Append Resources section (links auto-detected via regex from slides/transcript; uniqueness enforced)
  • Output final article.md

3.11 Quality Report Service (services.quality)

Checks:

  • Alignment coverage (percent slides aligned)
  • Low-confidence thresholds (< configurable min)
  • Slides lacking alt text / text content
  • Potential hallucinations: simple heuristic → narrative sentences with nouns not found in source text tokens (approximate lexical overlap score) below threshold
  • Token cost estimate (sum of prompt/completion tokens if tracked)
  • Output report.json

3.12 Regeneration Controller (services.regen)

  • Accept target slide indices for narrative regeneration
  • Reuse existing artifacts (alignment, embeddings) unless --force provided
  • Update only changed sections in article.md (safe in-place update using placeholder markers or structured regeneration plan)

3.13 Storage & Run Management (services.storage)

  • Run ID: YYYYMMDD-HHMMSS-random4 or deterministic hash if inputs identical & --reuse flag
  • Directory tree:
    • runs/<run_id>/article.md
    • runs/<run_id>/slides/slide-N.png
    • runs/<run_id>/slides.json
    • runs/<run_id>/transcript_raw.txt
    • runs/<run_id>/transcript_segments.json
    • runs/<run_id>/embeddings.json
    • runs/<run_id>/alignment.json
    • runs/<run_id>/narrative.json
    • runs/<run_id>/report.json
    • runs/<run_id>/config.json
  • Symlink or copy latest run to latest/ convenience directory.

3.14 CLI Interface (cli.py planned)

Commands:

  • generate (full pipeline)
  • regen --slides 3,7 (partial narrative)
  • report --run <id> (display summary)
  • list-runs

3.15 Logging & Observability

  • Structured JSON logs per stage with: timestamp, run_id, stage, event, duration_ms, error_code(optional)
  • Log levels: INFO default, DEBUG via flag
  • Summaries appended to runs/<run_id>/run.log

4. Data Models (Conceptual)

(No source code, conceptual fields only.)

SlideRecord: { slide_index, image_path, extracted_text, notes, vision_text, fused_text, width, height, text_hash, vision_failed? }

TranscriptSegment: { id, start_time, end_time, speaker?, text, tokens? }

EmbeddingRecord: { id, type(slide|segment), model, vector_dim, text_hash }

AlignmentEntry: { slide_index, segment_ids[], method_used, confidence, warnings[] }

NarrativeSlide: { slide_index, heading, paragraphs[], citation_segment_ids[], confidence }

QualityReport: { run_id, generated_at, stats: { total_slides, aligned_slides, avg_confidence }, warnings[], hallucination_flags[], cost_estimate: { prompt_tokens?, completion_tokens?, total? } }

Config: { model_name, embedding_model_name, verbosity_level, paragraphs_per_slide_range, remove_fillers, toc_threshold, front_matter_enabled, regenerate_slide_indices?, max_concurrency, similarity_thresholds { semantic, alignment }, low_confidence_threshold }

5. Algorithm Details

5.1 Segmentation

  • Tokenize by sentences; accumulate until character window (200–400) or pause gap > threshold
  • Merge very short trailing segments with predecessor

5.2 Embedding & Caching

  • Text normalized (strip whitespace) before hashing
  • Cache file: embedding_cache.sqlite (key: model + hash) or JSON if simplicity preferred; first version JSON.

5.3 Alignment Confidence

confidence = w_timestamp * timestamp_score + w_semantic * semantic_score + w_coverage * coverage_ratio

  • timestamp_score: 1 if segment time window overlaps expected slide window else decays
  • semantic_score: max cosine similarity among chosen segments
  • coverage_ratio: sum(len(segment_text)) / max(target_length_baseline, 1)
  • Weights configurable; defaults 0.3 / 0.5 / 0.2.
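
Expressed as code (clamping coverage at 1.0 is an assumption, not in the spec):

```python
def coverage_ratio(segment_texts: list[str],
                   target_length_baseline: int) -> float:
    return sum(len(t) for t in segment_texts) / max(target_length_baseline, 1)

def alignment_confidence(timestamp_score: float, semantic_score: float,
                         coverage: float,
                         weights: tuple[float, float, float] = (0.3, 0.5, 0.2)
                         ) -> float:
    w_timestamp, w_semantic, w_coverage = weights  # defaults from above
    return (w_timestamp * timestamp_score
            + w_semantic * semantic_score
            + w_coverage * min(coverage, 1.0))     # clamp coverage (assumption)
```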

5.4 Hallucination Heuristic

  • Build source lexicon: union of tokens from aligned transcript segments + fused_text of slides
  • For each narrative sentence compute overlap = (#source_tokens_intersection) / (#sentence_tokens)
  • Flag if overlap < threshold (e.g., 0.35) and the sentence contains proper nouns (capitalization heuristic); add flagged sentences to report.json.
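
As a sketch (the token regex and excluding the sentence-initial word from the proper-noun check are assumptions):

```python
import re

def hallucination_flags(narrative_sentences: list[str], source_text: str,
                        threshold: float = 0.35) -> list[str]:
    # source lexicon: tokens from aligned segments + fused_text, pre-joined
    source_tokens = set(re.findall(r"[A-Za-z']+", source_text.lower()))
    flagged = []
    for sentence in narrative_sentences:
        tokens = re.findall(r"[A-Za-z']+", sentence)
        if not tokens:
            continue
        overlap = sum(t.lower() in source_tokens for t in tokens) / len(tokens)
        # proper noun heuristic: capitalized token not at sentence start
        has_proper_noun = any(t[0].isupper() for t in tokens[1:])
        if overlap < threshold and has_proper_noun:
            flagged.append(sentence)
    return flagged
```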

5.5 Regeneration Strategy

  • For target slides: reuse narrative prompt components; regenerate only those slides
  • Rebuild narrative.json entries & patch article.md using anchor markers: <!-- SLIDE_SECTION_START:N --> / <!-- SLIDE_SECTION_END:N -->
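
A sketch of the in-place patch, assuming the markers wrap each slide section in article.md:

```python
import re

def patch_slide_section(article: str, n: int, new_body: str) -> str:
    start = f"<!-- SLIDE_SECTION_START:{n} -->"
    end = f"<!-- SLIDE_SECTION_END:{n} -->"
    pattern = re.compile(re.escape(start) + r".*?" + re.escape(end), re.DOTALL)
    # lambda avoids backslash interpretation in the replacement body
    return pattern.sub(lambda m: f"{start}\n{new_body}\n{end}", article)
```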

6. Configuration Parameters (Initial Set)

  • model_name (default: GPT-4o or equivalent Azure deployment)
  • embedding_model_name (e.g., text-embedding-3-large)
  • verbosity_level (concise|standard|expanded)
  • paragraphs_per_slide_range (min,max)
  • remove_fillers (bool)
  • toc_threshold (int, default 8)
  • front_matter_enabled (bool)
  • low_confidence_threshold (float, default 0.55)
  • semantic_similarity_threshold (float, default 0.55)
  • max_alignment_segments_per_slide (default 5)
  • max_concurrency (embedding batch parallelism)
  • citation_comments_enabled (bool)
  • vision_model_name (e.g., gpt-4o-mini vision variant)
  • vision_batch_size (int, default 4)
  • vision_cache_enabled (bool)
  • cost_tracking_enabled (bool)
  • regeneration_slide_indices (list)

7. Error Handling & Recovery

Error Categories:

  • INPUT_VALIDATION_ERROR (missing file, invalid URL)
  • TRANSCRIPT_FETCH_ERROR (network, permission, unavailable)
  • DECK_PARSE_ERROR (corrupt PPTX)
  • OCR_ERROR
  • EMBEDDING_ERROR (API rate limit, auth)
  • ALIGNMENT_ERROR
  • LLM_GENERATION_ERROR
  • ASSEMBLY_ERROR

Strategy:

  • Fail-fast on critical earlier stages; produce partial artifacts with status=failed in run log
  • For non-critical slides (e.g., single OCR failure) continue, mark warning
  • Retry policy: Exponential backoff for embedding & LLM calls (max 3 retries)
  • Graceful cancellation: On interrupt, flush artifacts & write run_aborted marker
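
The retry policy could be a small wrapper like this sketch (delays and the caught exception type are assumptions):

```python
import random
import time

def with_retries(call, max_retries: int = 3):
    """Exponential backoff with jitter: ~1s, 2s, 4s between attempts."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(2 ** attempt + random.random())

# usage: response = with_retries(lambda: client.embeddings.create(...))
```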

8. Security & Compliance

  • No hard-coded API keys (use OPENAI_API_KEY or Azure-specific env vars)
  • Option to redact sensitive text patterns before sending to LLM (regex list configurable)
  • Store only hashed embeddings if PII risk flagged (future extension)
  • Avoid logging raw transcript segments at DEBUG unless allow_raw_logging flag set

9. Performance & Scalability

  • Batch embeddings (adaptive batch size by token length)
  • Parallel slide image rendering where safe
  • Streaming LLM output optional (future) – currently blocking calls
  • Complexity: roughly O(S + T) for extraction plus O(S·T) for pairwise slide-segment similarity → reduce by pre-filtering segments using locality (a windowed approach around the expected slide time)
  • Vision extraction adds constant overhead per slide; mitigated by batching & caching.

10. Testing Strategy

Test Layers:

  1. Unit: segmentation logic, similarity scoring, confidence computation
  2. Integration: end-to-end small sample deck (3 slides) & synthetic transcript
  3. Alignment accuracy evaluation harness: curated dataset with known mappings (≥90% coverage assertion)
  4. Regression: Golden output snapshots for narrative with stable config (allow minor diff tolerance via semantic similarity instead of raw string match)
  5. Error injection tests: simulate missing transcript, API rate limit, corrupt slide image
  6. Performance test: 100-slide synthetic deck within time budget (< X minutes; define after prototype)

Artifacts for tests located under tests/ mirroring service modules.

11. Observability & Metrics

  • Metrics JSON appended: { stage, duration_ms, token_usage, error_counts }
  • Optional integration with stdout for container logs
  • Derived metrics: average confidence, cost per slide, processing time per slide

12. Deployment & Packaging

  • Distributable as a CLI tool (entry point in pyproject.toml or console_scripts) (future)
  • Container image: base Python + libreoffice (if PPTX->PNG) + tesseract (if OCR enabled)
  • Azure Container Apps target (aligns with repo name) – environment variables injected via secrets store

13. Assumptions

  • Access to an embedding & chat completion capable OpenAI (or Azure OpenAI) endpoint via openai SDK
  • PPTX is primary; PDF fallback already flattened; notes may be absent
  • No precise slide change timestamps available in MVP; using heuristic distribution
  • Single primary speaker voice for narrative in MVP
  • Acceptable to use cosine similarity over normalized L2 for embeddings
  • Cost tracking relies on API returned usage fields (if provided)

14. Risks & Mitigations (Technical)

  • Slide rendering inconsistencies → Normalize target width & maintain aspect ratio
  • Large transcripts (token explosion) → Segment summarization before prompt if > threshold tokens
  • Rate limits → Centralized throttle & caching
  • Hallucinations → Strict prompt instructions + lexicon overlap heuristic + optional inline citations
  • Vision extraction hallucination (invented tables/code) → enforce strict JSON schema, compare overlap with shapes+notes; if overlap < threshold flag for review.

15. Sequence (Textual) Diagrams

15.1 Full Generation

User → CLI → Orchestrator → Validation → Slides → Transcript → Segmentation → Embeddings → Alignment → Narrative (loop per slide) → Assembly → Quality → Storage → Output paths returned.

15.2 Regeneration (Slides 3,7)

User → CLI (regen) → Orchestrator (load existing run) → Narrative (slides 3,7) → Assembly patch → Quality delta update → Updated artifacts.

16. Step-by-Step Implementation Plan

  1. Scaffold module directories & placeholder service classes (no logic) to enable incremental commits
  2. Implement Configuration Manager & CLI argument parsing
  3. Input Validation (PPTX presence, URL regex, transcript detection)
  4. Deck Processing (render to PNG, extract text & notes, produce slides.json)
  5. Transcript fetching & normalization (support file & remote)
  6. Segmentation algorithm + unit tests
  7. Embedding service with caching layer
  8. Alignment hybrid algorithm + confidence scoring (tests with synthetic data)
  9. Narrative prompt template design + per-slide generation (intro, conclusion)
  10. Markdown assembly with front matter, TOC conditional, resources extraction
  11. Quality report heuristics (coverage, hallucination approximation, alt text, low confidence)
  12. Regeneration controller (anchor markers + selective patching)
  13. Logging & metrics instrumentation
  14. Error handling refinement & retry logic
  15. Add test suite coverage for edge cases & performance scenario
  16. Documentation updates (README, usage examples) & sample run artifacts
  17. Optional: containerization & GitHub Action integration

17. Acceptance Criteria Traceability Mapping

  • Each PRD story mapped to artifacts: see Section 3 & outputs folder; ensure test cases validate presence & quality thresholds.

18. Future Extensions (Not in MVP)

  • Multi-speaker narrative blending
  • Translation layer
  • SEO metadata suggestions
  • Slide change detection from video frame differencing
  • Vector store persistence for cross-talk analytics
  • Per-element bounding boxes via vision model for layout reconstruction

19. Completion Definition

  • Running generate on sample inputs produces all artifact files with non-empty sections
  • Alignment coverage ≥ 90% on test deck
  • Narrative uses first-person pronouns consistently
  • report.json enumerates warnings when low confidence slides present