Derived from the original PRD but intentionally simplified for fastest viable implementation. Focus: “folder in, article out” with minimal config, vision-first extraction, and straightforward alignment (no embeddings initially).
Goal: Given a folder containing a deck.pptx (and optionally a transcript + config.yaml), produce article.md plus slide images using a simple CLI: talk2article <folder>.
Included in MVP:
- Slide rendering to PNG
- Vision model extraction of slide textual content
- Basic transcript acquisition & segmentation
- Naive sequential alignment of transcript chunks to slides
- First-person narrative (intro + per-slide sections)
- Single Markdown output with front matter
Deferred (Future Enhancements): embeddings-based alignment, confidence scoring, detailed quality report, regeneration of specific slides, hallucination heuristics, SEO/TOC logic, advanced logging, multi-speaker handling.
Six linear functions executed in order (no complex orchestration layer):
- load_config_and_inputs(folder)
- render_and_extract_slides(context)
- fetch_and_normalize_transcript(context)
- segment_and_align(context)
- generate_narrative(context)
- assemble_markdown(context)
Shared context dict accumulates results. Only essential artifacts written.
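A minimal orchestration sketch, assuming a hypothetical `stages` module that exposes the six functions above and that `load_config_and_inputs` returns the shared context dict (illustrative only, not part of the spec):

```python
# Minimal sketch of the linear pipeline; `stages` is a hypothetical module
# exposing the six functions listed above.
import sys
from pathlib import Path

from talk2article import stages  # hypothetical package layout


def run(folder: str) -> Path:
    # load_config_and_inputs returns the shared context dict; every later
    # stage mutates that same dict in place.
    context = stages.load_config_and_inputs(folder)
    for stage in (
        stages.render_and_extract_slides,
        stages.fetch_and_normalize_transcript,
        stages.segment_and_align,
        stages.generate_narrative,
        stages.assemble_markdown,
    ):
        print(f"[{stage.__name__}] running")  # matches the "[stage] message" logging style
        stage(context)
    return Path(folder) / "article.md"


if __name__ == "__main__":
    run(sys.argv[1])
```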
Required: deck.pptx in target folder.
Optional files (if absent, defaults apply):
- config.yaml
- transcript.(srt|vtt|txt)
Output files written alongside inputs:
- article.md
- slides/slide-N.png
- slides.json (basic slide metadata and fused text)
- meta.json (simple run metadata & warnings)
If config.yaml missing, defaults apply. Supported keys:
title: "Optional override" # Else derived from first slide text
author: "Unknown" # Optional
date: "2025-10-08" # Default = today
tags: [talk, article] # Optional list
verbosity: standard # concise|standard|expanded (guides narrative tone)
paragraphs_per_slide: 2 # Fixed integer
front_matter: true # Include YAML front matter
vision_model: gpt-4o-mini # Single model for vision + text generation

No CLI flags besides the folder path in MVP.
- Validate deck.pptx exists
- Load config.yaml if present (no schema library; simple key allowlist)
- Detect transcript file if present (extension sniff)
- Initialize context = {config, slides: [], transcript_raw: None, warnings: []}
- Use python-pptx to iterate slides and export each slide as PNG (via Pillow composition or a conversion method)
- Invoke the vision model per slide (one request per slide for MVP; batching later) with the prompt: "Extract plain text, bullet lists, code blocks (with language if obvious), and tables in JSON."
- Build fused_text = notes + shapes + vision (deduplicated line-wise)
- Append slide record: {index, image_path, fused_text}
- Write slides/slide-N.png progressively
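A rendering-and-extraction sketch, assuming LibreOffice, pdf2image, and the openai SDK are available as the "conversion method" (python-pptx would still be used separately to pull shapes and speaker notes); names and the exact conversion path are illustrative, not prescribed:

```python
# Sketch only: render slides via LibreOffice + pdf2image, then ask the vision
# model for structured text per slide PNG.
import base64
import json
import subprocess
from pathlib import Path

from openai import OpenAI                 # assumes openai>=1.x
from pdf2image import convert_from_path   # assumes poppler is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def render_slides(deck: Path, out_dir: Path) -> list[Path]:
    out_dir.mkdir(exist_ok=True)
    subprocess.run(["libreoffice", "--headless", "--convert-to", "pdf",
                    "--outdir", str(out_dir), str(deck)], check=True)
    pdf = out_dir / (deck.stem + ".pdf")
    paths = []
    for i, page in enumerate(convert_from_path(pdf), start=1):
        path = out_dir / f"slide-{i}.png"
        page.save(path, "PNG")
        paths.append(path)
    return paths


def extract_slide_text(image_path: Path, model: str = "gpt-4o-mini") -> dict:
    b64 = base64.b64encode(image_path.read_bytes()).decode()
    resp = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract plain text, bullet lists, code blocks "
                                         "(with language if obvious), and tables in JSON."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return json.loads(resp.choices[0].message.content)
```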
- If transcript file provided: parse depending on extension (very light regex for timestamps)
- Else: attempt fetch via yt-dlp --write-auto-subs --skip-download (future) OR raise a warning and set the transcript to empty
- Normalize: remove leading timestamps, collapse multiple spaces, keep basic punctuation
Transcript Fetch Strategy (Public Videos – No YouTube API Key Required):
- Preferred: use a lightweight captions retrieval library (e.g., a Python implementation similar to youtube-transcript-api) to request English captions with preferred language codes ["en", "en-US"]. This accesses YouTube's public timed text endpoint directly; no API key is needed for public or unlisted videos with captions enabled.
- If a manual transcript file exists (transcript.srt, .vtt, or .txt), it overrides any remote fetch.
- If the library fetch fails (no captions, captions disabled, or endpoint error): optionally attempt a fallback shell call to yt-dlp (if installed) with auto-subs flags. This step is deferred in the MVP unless explicitly enabled later.
- If still unavailable, proceed with an empty transcript and append the warning: "no transcript available; relying on slides only".
Normalization Rules:
- Strip non-speech cues, i.e., lines enclosed in brackets such as [Music], [Applause].
- Merge consecutive very short (< 40 chars) fragments into the preceding fragment to reduce fragmentation.
- Preserve original capitalization; do not lowercase.
- Remove timestamps if present (HH:MM:SS.xxx -->) and extraneous numbering lines (common in SRT).
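A minimal sketch of these normalization rules; the regexes and function name are illustrative:

```python
# Sketch of the normalization rules above (bracketed cues, SRT numbering,
# timestamps, short-fragment merging with the 40-char threshold from the spec).
import re

CUE = re.compile(r"^\s*\[[^\]]+\]\s*$")                 # [Music], [Applause], ...
TIMESTAMP = re.compile(r"\d{2}:\d{2}:\d{2}[.,]\d{3}\s*-->\s*\S+")
NUMBERING = re.compile(r"^\d+\s*$")                     # bare SRT cue indices


def normalize_transcript(raw: str) -> list[str]:
    fragments: list[str] = []
    for line in raw.splitlines():
        line = re.sub(r"\s{2,}", " ", TIMESTAMP.sub("", line)).strip()
        if not line or CUE.match(line) or NUMBERING.match(line):
            continue
        if fragments and len(line) < 40:                # merge very short fragments
            fragments[-1] = f"{fragments[-1]} {line}"
        else:
            fragments.append(line)
    return fragments
```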
Warnings Generated:
- Missing transcript entirely.
- Transcript present but yielded zero usable speech lines after filtering.
- Fetch failure (network or captions disabled).
Future Enhancements (Not in MVP):
- Language auto-detect & optional translation pipeline.
- Multi-track selection with user-specified priority list.
- Partial transcript gap detection (long silence spans) for improved alignment heuristics.
- Split transcript into sentences (naive period/question/exclamation split)
- Group sentences into chunks ~250–400 chars
- Naive alignment: distribute chunks sequentially across slides proportionally: chunks_per_slide = ceil(total_chunks / total_slides)
- Slice the list accordingly; if there is a remainder, append it to the last slide
- Attach an aligned_chunks list to each slide record
- If a slide gets 0 chunks, add a warning ("slide X has no transcript chunks")
- For each slide:
- Prompt includes: fused_text + concatenated aligned chunks + style directive (verbosity, first-person)
- Ask model: “Produce exactly {paragraphs_per_slide} paragraphs in first person; no new facts.”
- Capture heading: either first line of fused_text truncated (max 60 chars) or model-suggested heading (in first line) if fused_text short
- Intro: separate prompt using first slide fused_text + first 2 transcript chunks overall
- No conclusion in MVP (can add if last slide heading contains ‘Conclusion’ later)
- YAML front matter if enabled
- H1 = Title
- Intro paragraphs
- For each slide:
  - ## Slide {i} + image reference + paragraphs
- Append an HTML comment with warnings if any
- Write article.md
- Write slides.json (array of minimal slide objects)
- Write meta.json with: total_slides, total_chunks, model_used, generated_at, warnings
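An assembly sketch under the data model below; `context["intro_paragraphs"]` is an assumed key, and the config keys match the defaults section above:

```python
# Assembly sketch: front matter, H1, intro, per-slide sections, trailing warnings comment.
def assemble_markdown(context: dict) -> str:
    cfg, slides = context["config"], context["slides"]
    lines: list[str] = []
    if cfg.get("front_matter", True):
        lines += ["---", f"title: {cfg['title']}", f"author: {cfg.get('author', 'Unknown')}",
                  f"date: {cfg['date']}", f"tags: {cfg.get('tags', [])}", "---", ""]
    lines += [f"# {cfg['title']}", ""] + context["intro_paragraphs"] + [""]
    for slide in slides:
        lines += [f"## Slide {slide['index']}", "",
                  f"![Slide {slide['index']}]({slide['image_path']})", ""]
        lines += slide["narrative_paragraphs"] + [""]
    if context["warnings"]:
        lines.append("<!-- warnings: " + "; ".join(context["warnings"]) + " -->")
    return "\n".join(lines)
```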
Slide: { index, image_path, fused_text, aligned_chunks: ["..."], narrative_paragraphs: ["..."], heading }
Context: { config, slides: [Slide], transcript_raw, warnings: [str] }
meta.json: { total_slides, total_chunks, model, generated_at, warnings }
Slide prompt user content structure:
SLIDE TEXT:
<fused_text>
TRANSCRIPT CHUNKS:
<joined_chunks>
Instructions: Write exactly {N} paragraphs in first person, authentic but concise ({verbosity}). Do not introduce facts not present above.
System message: “You are converting a presentation slide plus transcript snippets into a faithful first-person narrative.”
Intro prompt: similar but includes only early transcript portion and first slide fused_text. No paragraphs-per-slide constraint; request 1–2 paragraphs.
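A sketch of the per-slide generation call, assuming the openai>=1.x SDK and the prompt structure above; splitting paragraphs on blank lines is an added assumption:

```python
# Per-slide narrative call using the system/user structure defined above.
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY from the environment

SYSTEM = ("You are converting a presentation slide plus transcript snippets "
          "into a faithful first-person narrative.")


def narrate_slide(fused_text: str, chunks: list[str], paragraphs: int,
                  verbosity: str, model: str = "gpt-4o-mini") -> list[str]:
    joined_chunks = "\n".join(chunks)
    user = (f"SLIDE TEXT:\n{fused_text}\n\n"
            f"TRANSCRIPT CHUNKS:\n{joined_chunks}\n\n"
            f"Instructions: Write exactly {paragraphs} paragraphs in first person, "
            f"authentic but concise ({verbosity}). "
            "Do not introduce facts not present above.")
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": user}],
    )
    return [p.strip() for p in resp.choices[0].message.content.split("\n\n") if p.strip()]
```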
- Missing deck → abort (exit non-zero)
- Vision API failure for a slide → retry once then fallback to shapes/notes only, warning
- Transcript fetch failure → continue with empty transcript; narrative relies on slides
- Model generation failure for a slide → warning; insert placeholder paragraph
Stdout lines: [stage] message. No JSON logs.
- test_single_slide_minimal() – Produces an article with one slide section
- test_alignment_distribution() – Ensures each slide gets >= 0 chunks and the total is preserved
- test_article_structure() – Confirms front matter, H1 title, slide headings, image references
- Vision calls dominate latency → run sequentially first; add batching later
- For a 30-slide deck expect linear scaling; caching keyed by image hash prevents redundant re-runs
- Only use OPENAI_API_KEY from the environment
- Do not log raw transcript or fused text (stdout shows only counts)
| Area | Future Feature |
|---|---|
| Alignment | Add embeddings + semantic similarity + confidence scoring |
| Regeneration | Slide-specific regeneration with anchor markers |
| Quality | Hallucination heuristic, coverage stats, per-slide confidence JSON |
| Narrative | Conclusion generation, resources extraction, TOC, multi-speaker voice blending |
| Performance | Parallel vision batching, rate-limit adaptive retry, token budgeting |
| Observability | Structured JSON logs, metrics export, cost tracking |
| Security | Redaction filters, PII token masking before prompts |
- CLI entry + config loader
- Slide render + vision extraction + write images & slides.json
- Transcript load/fetch + segmentation
- Naive alignment distribution
- Narrative generation (slides then intro)
- Markdown assembly
- Minimal tests & meta.json
| PRD Story | MVP Handling |
|---|---|
| Upload & Initiate | Folder + deck + optional transcript recognized; errors surfaced |
| Transcript Normalization | Basic sentence split & whitespace cleanup |
| Slide Extraction | PNG + fused_text via vision model |
| Alignment | Sequential distribution (baseline) |
| Narrative Generation | Per-slide first-person paragraphs |
| Output Assembly | Single article.md with images |
| Quality & Consistency | Warnings list only (no numeric confidence) |
| Configuration & Reruns | Implicit rerun by re-executing command (no partial regen) |
| Storage & Retrieval | Outputs co-located in folder; minimal JSON metadata |
- Running talk2article folder/ with the sample deck yields article.md with all slide sections
- No unhandled exceptions on the empty or missing transcript case
- At least one warning appears when a slide has no transcript chunks
- All referenced slide images exist and render
This Lean MVP spec supersedes the prior comprehensive design for the initial implementation phase. Advanced capabilities are deferred to iterative milestones.
Responsibilities:
- Parse config & inputs
- Determine which stages to run (full vs targeted slides)
- Manage run ID, logging context, timing, error propagation
- Persist intermediate artifacts directory structure
Inputs: CLI args / config file, paths/URLs.
Outputs: Artifacts directory, final article.md, JSON reports.
- Load precedence: CLI flags > JSON/YAML config > environment defaults.
- Validates schema (verbosity level, paragraphs per slide, front matter enable flag, filler removal flag, etc.).
- Exposes immutable config object to downstream services.
- Validate PPTX/PDF file accessible & parseable
- Validate YouTube URL structure
- Determine transcript source: fetch vs provided file
- Surface specific error categories (NETWORK, FORMAT, MISSING_RESOURCE)
Simplified (vision-first) approach: always run a multimodal vision LLM to extract textual/structured content from every rendered slide image; classic OCR tools are no longer part of the primary path (optional future fallback only).
Subtasks:
- Convert PPTX → PNG slides (e.g., via python-pptx + headless conversion or a libreoffice container call; fallback PDF-to-image pipeline)
- Extract native slide textual content and speaker notes (shapes + notes)
- Vision extraction (each slide PNG → vision model vision_model_name) requesting strict JSON with: plain_text, bullets[], code_blocks[] (fields: language?, content), tables[] (2D arrays or Markdown rows), detected_title?
- Fusion: combine notes + shapes + vision JSON into fused_text (precedence: notes > shapes > vision). Deduplicate lines (case-insensitive) while preserving ordering: title/heading → shapes order → bullets → code blocks → table summaries.
- Confidence: assign default confidences (notes 1.0, shapes 0.95, vision 0.85) stored in extraction_meta.
- Produce outputs:
  - slides/slide-N.png
  - slides.json entries: { slide_index, filename, extracted_text (shapes), notes, vision_text, fused_text, extraction_meta[], width, height, hash }
  - Derived deck title candidate (from slide 1 vision detected_title or shapes text)
Caching:
- Cache vision responses keyed by SHA256(image_bytes + vision_model_name) to avoid recomputation across reruns.
Failure Handling:
- If the vision call fails after retries (3), use shapes+notes only; mark vision_failed: true and add a warning to the quality report.
- If the resulting fused_text length is < 15 characters, mark the slide needs_manual_review.
Security/Privacy:
- Optional redaction regex pre-process for shapes/notes before they are echoed into the prompt.
Downstream Effects:
- Embeddings consume fused_text primarily (falling back to shapes+notes if vision failed).
- The quality report flags slides with vision_failed or very low lexical overlap between shapes+notes and vision_text (potential hallucination).
Subtasks:
- Fetch the YouTube transcript (YouTube API or yt-dlp fallback) if not supplied
- Parse supplied .vtt / .srt / raw text
- Normalize: optional lowercasing and disfluency removal, preserve speaker labels, keep original timestamps
- Output the raw transcript (transcript_raw.txt) & normalized JSON segments before finer segmentation
- Split normalized transcript into semantically coherent chunks (~200–400 chars) using punctuation and pause thresholds.
- Provide each segment: id, start_time, end_time, speaker_label (if any), text
- Output transcript_segments.json
- Compute embeddings (OpenAI embedding model) for:
  - Slide textual bundle: fused_text (or shapes+notes fallback when vision failed)
  - Transcript segments
- Caching keyed by SHA256 of text + model name to avoid duplicate cost across reruns
- Output embeddings.json (maps ids → vector + model + text_hash)
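An embedding sketch with a JSON cache keyed by SHA256(text + model) as described above; the cache file name and helper name are illustrative:

```python
# Embedding call with a simple JSON cache keyed by SHA256(text + model).
import hashlib
import json
from pathlib import Path

from openai import OpenAI

client = OpenAI()
CACHE_PATH = Path("embedding_cache.json")  # first version uses JSON per the spec


def embed(texts: list[str], model: str = "text-embedding-3-large") -> list[list[float]]:
    cache = json.loads(CACHE_PATH.read_text()) if CACHE_PATH.exists() else {}
    keys = [hashlib.sha256((t + model).encode()).hexdigest() for t in texts]
    missing = [(k, t) for k, t in zip(keys, texts) if k not in cache]
    if missing:
        resp = client.embeddings.create(model=model, input=[t for _, t in missing])
        for (key, _), item in zip(missing, resp.data):
            cache[key] = item.embedding
        CACHE_PATH.write_text(json.dumps(cache))
    return [cache[k] for k in keys]
```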
Hybrid algorithm steps:
- Initialize candidate mapping via timestamp heuristics (if video timing or approximate slide durations inferred; assumption: future extension for actual slide change events; initial version uses sequential distribution based on total duration / slide count if real events absent).
- Refine with semantic similarity: compute cosine similarity between slide embedding and window of adjacent transcript segment embeddings; assign top matches exceeding threshold.
- Confidence scoring: weight(timestamp_score, semantic_score, coverage_ratio).
- Flag slides with no segment > semantic threshold.
- Output alignment.json with, per slide: slide_index, aligned_segment_ids (ordered), method_used (timestamp|semantic|greedy), confidence [0–1], warnings.
- For each slide, build prompt template including:
- Global talk context (title, audience profile, verbosity, narrative style guidelines)
- Slide text/notes/ocr summary
- Aligned transcript segment texts with segment IDs in comments for traceability
- Guardrails: Do not invent facts beyond provided content.
- Use OpenAI chat completion model for generation (configurable model name)
- Enforce paragraph count (post-process splitting / merging if necessary)
- Return per-slide: heading (derived or LLM-suggested), body paragraphs, internal citations (HTML comments with segment IDs, optional flag)
- Generate intro (using slide 1 + high-level summary of deck) and conclusion (using final slides + global summary prompt) as separate calls.
- Output narrative.json.
- Compose front matter (YAML) from config + derived metadata
- Conditional Table of Contents if slide count > threshold & flag enabled
- Insert slides sequentially:
  - Section heading ## Slide N: {heading}
  - Image syntax referencing slides/slide-N.png
  - Narrative paragraphs
- Append Resources section (links auto-detected via regex from slides/transcript; uniqueness enforced)
- Output final article.md
Checks:
- Alignment coverage (percent slides aligned)
- Low-confidence thresholds (< configurable min)
- Slides lacking alt text / text content
- Potential hallucinations: simple heuristic → narrative sentences with nouns not found in source text tokens (approximate lexical overlap score) below threshold
- Token cost estimate (sum of prompt/completion tokens if tracked)
- Output report.json
- Accept target slide indices for narrative regeneration
- Reuse existing artifacts (alignment, embeddings) unless --force is provided
- Update only changed sections in article.md (safe in-place update using placeholder markers or a structured regeneration plan)
- Run ID: YYYYMMDD-HHMMSS-random4, or a deterministic hash if inputs are identical & the --reuse flag is set
- Directory tree:
  - runs/<run_id>/article.md
  - runs/<run_id>/slides/slide-N.png
  - runs/<run_id>/slides.json
  - runs/<run_id>/transcript_raw.txt
  - runs/<run_id>/transcript_segments.json
  - runs/<run_id>/embeddings.json
  - runs/<run_id>/alignment.json
  - runs/<run_id>/narrative.json
  - runs/<run_id>/report.json
  - runs/<run_id>/config.json
- Symlink or copy the latest run to a latest/ convenience directory.
Commands:
- generate (full pipeline)
- regen --slides 3,7 (partial narrative)
- report --run <id> (display summary)
- list-runs
- Structured JSON logs per stage with: timestamp, run_id, stage, event, duration_ms, error_code(optional)
- Log levels: INFO default, DEBUG via flag
- Summaries appended to runs/<run_id>/run.log
(No source code, conceptual fields only.)
SlideRecord: { slide_index, image_path, extracted_text, notes, vision_text, fused_text, width, height, text_hash, vision_failed? }
TranscriptSegment: { id, start_time, end_time, speaker?, text, tokens? }
EmbeddingRecord: { id, type(slide|segment), model, vector_dim, text_hash }
AlignmentEntry: { slide_index, segment_ids[], method_used, confidence, warnings[] }
NarrativeSlide: { slide_index, heading, paragraphs[], citation_segment_ids[], confidence }
QualityReport: { run_id, generated_at, stats: { total_slides, aligned_slides, avg_confidence }, warnings[], hallucination_flags[], cost_estimate: { prompt_tokens?, completion_tokens?, total? } }
Config: { model_name, embedding_model_name, verbosity_level, paragraphs_per_slide_range, remove_fillers, toc_threshold, front_matter_enabled, regenerate_slide_indices?, max_concurrency, similarity_thresholds { semantic, alignment }, low_confidence_threshold }
- Tokenize by sentences; accumulate until character window (200–400) or pause gap > threshold
- Merge very short trailing segments with predecessor
- Text normalized (strip whitespace) before hashing
- Cache file: embedding_cache.sqlite (key: model + hash) or JSON if simplicity is preferred; first version uses JSON.
confidence = w_timestamp * timestamp_score + w_semantic * semantic_score + w_coverage * coverage_ratio
- timestamp_score: 1 if segment time window overlaps expected slide window else decays
- semantic_score: max cosine similarity among chosen segments
- coverage_ratio: sum(len(segment text))/max( target_length_baseline, 1 )
- Weights configurable; defaults 0.3 / 0.5 / 0.2.
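A direct transcription of the formula with the default weights; clamping coverage_ratio to 1.0 is an added assumption so the result stays within the documented [0, 1] range:

```python
# Weighted confidence score with defaults 0.3 / 0.5 / 0.2; coverage_ratio is
# clamped to 1.0 (assumption) so confidence remains in [0, 1].
def confidence(timestamp_score: float, semantic_score: float, coverage_ratio: float,
               w_timestamp: float = 0.3, w_semantic: float = 0.5,
               w_coverage: float = 0.2) -> float:
    return (w_timestamp * timestamp_score
            + w_semantic * semantic_score
            + w_coverage * min(coverage_ratio, 1.0))
```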
- Build source lexicon: union of tokens from aligned transcript segments + fused_text of slides
- For each narrative sentence compute overlap = (#source_tokens_intersection) / (#sentence_tokens)
- Flag if overlap < threshold (e.g., 0.35) & the sentence contains proper nouns (capitalization heuristic) → add to report.json.
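A sketch of the lexical-overlap heuristic; the 0.35 threshold matches the example above, and the tokenization and capitalization checks are simplifications:

```python
# Flags narrative sentences with low lexical overlap against the source lexicon.
import re


def flag_hallucinations(sentences: list[str], source_text: str,
                        threshold: float = 0.35) -> list[str]:
    lexicon = set(re.findall(r"[A-Za-z0-9']+", source_text.lower()))
    flagged = []
    for sentence in sentences:
        tokens = re.findall(r"[A-Za-z0-9']+", sentence)
        if not tokens:
            continue
        overlap = sum(t.lower() in lexicon for t in tokens) / len(tokens)
        has_proper_noun = any(t[0].isupper() for t in tokens[1:])  # crude capitalization heuristic
        if overlap < threshold and has_proper_noun:
            flagged.append(sentence)
    return flagged
```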
- For target slides: reuse narrative prompt components; regenerate only those slides
- Rebuild narrative.json entries & patch article.md using anchor markers: <!-- SLIDE_SECTION_START:N --> / <!-- SLIDE_SECTION_END:N -->
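A patching sketch using the anchor markers above; the function name is illustrative:

```python
# Replaces the content between one slide's anchor markers in article.md.
import re
from pathlib import Path


def patch_slide_section(article: Path, slide_index: int, new_body: str) -> None:
    start = f"<!-- SLIDE_SECTION_START:{slide_index} -->"
    end = f"<!-- SLIDE_SECTION_END:{slide_index} -->"
    pattern = re.compile(re.escape(start) + r".*?" + re.escape(end), re.DOTALL)
    replacement = f"{start}\n{new_body}\n{end}"
    # Lambda avoids backslash-escape surprises if new_body contains regex groups.
    article.write_text(pattern.sub(lambda _: replacement, article.read_text()))
```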
- model_name (default: GPT-4o or equivalent Azure deployment)
- embedding_model_name (e.g., text-embedding-3-large)
- verbosity_level (concise|standard|expanded)
- paragraphs_per_slide_range (min,max)
- remove_fillers (bool)
- toc_threshold (int, default 8)
- front_matter_enabled (bool)
- low_confidence_threshold (float, default 0.55)
- semantic_similarity_threshold (float, default 0.55)
- max_alignment_segments_per_slide (default 5)
- max_concurrency (embedding batch parallelism)
- citation_comments_enabled (bool)
- vision_model_name (e.g., a gpt-4o-mini vision variant)
- vision_batch_size (int, default 4)
- vision_cache_enabled (bool)
- cost_tracking_enabled (bool)
- regeneration_slide_indices (list)
Error Categories:
- INPUT_VALIDATION_ERROR (missing file, invalid URL)
- TRANSCRIPT_FETCH_ERROR (network, permission, unavailable)
- DECK_PARSE_ERROR (corrupt PPTX)
- OCR_ERROR
- EMBEDDING_ERROR (API rate limit, auth)
- ALIGNMENT_ERROR
- LLM_GENERATION_ERROR
- ASSEMBLY_ERROR
Strategy:
- Fail fast on critical earlier stages; produce partial artifacts with status=failed in the run log
- For non-critical slides (e.g., a single OCR failure), continue and mark a warning
- Retry policy: Exponential backoff for embedding & LLM calls (max 3 retries)
- Graceful cancellation: on interrupt, flush artifacts & write a run_aborted marker
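A generic exponential-backoff wrapper for embedding/LLM calls matching the max-3-retries policy above; the helper name is illustrative:

```python
# Retries a callable up to max_retries times with exponential backoff.
import time


def with_retries(call, max_retries: int = 3, base_delay: float = 1.0):
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(base_delay * (2 ** attempt))
```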
- No hard-coded API keys (use OPENAI_API_KEY or Azure-specific env vars)
- Option to redact sensitive text patterns before sending to the LLM (regex list configurable)
- Store only hashed embeddings if PII risk flagged (future extension)
- Avoid logging raw transcript segments at DEBUG unless the allow_raw_logging flag is set
- Batch embeddings (adaptive batch size by token length)
- Parallel slide image rendering where safe
- Streaming LLM output optional (future) – currently blocking calls
- Complexity: O(S + T + S*T_emb_similarity?) → reduce by pre-filtering segments using locality (windowed approach around expected time)
- Vision extraction adds constant overhead per slide; mitigated by batching & caching.
Test Layers:
- Unit: segmentation logic, similarity scoring, confidence computation
- Integration: end-to-end small sample deck (3 slides) & synthetic transcript
- Alignment accuracy evaluation harness: curated dataset with known mappings (≥90% coverage assertion)
- Regression: Golden output snapshots for narrative with stable config (allow minor diff tolerance via semantic similarity instead of raw string match)
- Error injection tests: simulate missing transcript, API rate limit, corrupt slide image
- Performance test: 100-slide synthetic deck within time budget (< X minutes; define after prototype)
Artifacts for tests located under tests/ mirroring service modules.
- Metrics JSON appended: { stage, duration_ms, token_usage, error_counts }
- Optional integration with stdout for container logs
- Derived metrics: average confidence, cost per slide, processing time per slide
- Distributable as a CLI tool (entry point in pyproject.toml or console_scripts) (future)
- Container image: base Python + libreoffice (if PPTX→PNG) + tesseract (if OCR enabled)
- Azure Container Apps target (aligns with repo name) – environment variables injected via secrets store
- Access to an embedding & chat-completion capable OpenAI (or Azure OpenAI) endpoint via the openai SDK
- PPTX is primary; the PDF fallback is already flattened; notes may be absent
- No precise slide change timestamps available in MVP; using heuristic distribution
- Single primary speaker voice for narrative in MVP
- Acceptable to use cosine similarity over normalized L2 for embeddings
- Cost tracking relies on API returned usage fields (if provided)
- Slide rendering inconsistencies → Normalize target width & maintain aspect ratio
- Large transcripts (token explosion) → Segment summarization before prompt if > threshold tokens
- Rate limits → Centralized throttle & caching
- Hallucinations → Strict prompt instructions + lexicon overlap heuristic + optional inline citations
- Vision extraction hallucination (invented tables/code) → enforce strict JSON schema, compare overlap with shapes+notes; if overlap < threshold flag for review.
User → CLI → Orchestrator → Validation → Slides → Transcript → Segmentation → Embeddings → Alignment → Narrative (loop per slide) → Assembly → Quality → Storage → Output paths returned.
User → CLI (regen) → Orchestrator (load existing run) → Narrative (slides 3,7) → Assembly patch → Quality delta update → Updated artifacts.
- Scaffold module directories & placeholder service classes (no logic) to enable incremental commits
- Implement Configuration Manager & CLI argument parsing
- Input Validation (PPTX presence, URL regex, transcript detection)
- Deck Processing (render to PNG, extract text & notes, produce slides.json)
- Transcript fetching & normalization (support file & remote)
- Segmentation algorithm + unit tests
- Embedding service with caching layer
- Alignment hybrid algorithm + confidence scoring (tests with synthetic data)
- Narrative prompt template design + per-slide generation (intro, conclusion)
- Markdown assembly with front matter, TOC conditional, resources extraction
- Quality report heuristics (coverage, hallucination approximation, alt text, low confidence)
- Regeneration controller (anchor markers + selective patching)
- Logging & metrics instrumentation
- Error handling refinement & retry logic
- Add test suite coverage for edge cases & performance scenario
- Documentation updates (README, usage examples) & sample run artifacts
- Optional: containerization & GitHub Action integration
- Each PRD story mapped to artifacts: see Section 3 & outputs folder; ensure test cases validate presence & quality thresholds.
- Multi-speaker narrative blending
- Translation layer
- SEO metadata suggestions
- Slide change detection from video frame differencing
- Vector store persistence for cross-talk analytics
- Per-element bounding boxes via vision model for layout reconstruction
- Running generate on sample inputs produces all artifact files with non-empty sections
- Alignment coverage ≥ 90% on the test deck
- Narrative uses first-person pronouns consistently
- report.json enumerates warnings when low-confidence slides are present