AI Project Document Generator & Parser - Architecture Discussion

Date: November 6, 2025
Role: Head of Engineering
Context: RAG, MCP, and LLM expertise


Project Overview

Goal

Build an AI-based project document generator and parser that can:

  1. Generate project documents from scratch
  2. Parse uploaded documents and provide clear explanations
  3. Review documents for completeness and consistency

Target Document Types

  • Technical specifications
  • Proposals
  • Project documents for review

Constraints & Infrastructure

VM Specifications (Ubuntu 20.04.6 LTS):

  • CPU: 6-Core AMD EPYC (Zen architecture)
  • RAM: 16 GB
  • Storage: 400 GB (390 GB available)
  • Network: Internet connected
  • Virtualization: QEMU/KVM
  • Container Platform: Docker (already running multiple containers)

Development Environment:

  • MacBook Pro M1, 16GB RAM

Architecture Design

Core Stack Recommendation

┌─────────────────────────────────────┐
│      FastAPI Server (Port 8000)     │
│    Document API + Review Engine     │
└──────────────┬──────────────────────┘
               │
      ┌────────┴────────┐
      │                 │
┌─────▼──────┐   ┌─────▼─────┐
│   Ollama   │   │  Qdrant   │
│(Port 11434)│   │(Port 6333)│
│            │   │           │
│ - mistral  │   │ Vector DB │
│ - nomic    │   │           │
└────────────┘   └───────────┘

Technology Stack

1. LLM Layer: Ollama (Self-hosted)

  • Model: mistral:7b-instruct or deepseek-coder:6.7b
  • Embeddings: nomic-embed-text
  • Memory Footprint: ~8GB RAM
  • Reasoning: Self-hosted so client documents stay on-premises for privacy and compliance

2. Vector Database: Qdrant

  • Deployment: Docker container
  • Memory: 512MB-1GB for POC, scales to 2GB+
  • Collections:
    • technical_specs - Technical specification embeddings
    • proposals - Proposal documents
    • templates - High-quality reference examples

3. Document Processing

  • Parser: Unstructured.io (handles PDF/DOCX/MD without format assumptions)
  • Alternative: LlamaParse (better for complex layouts)
  • Orchestration: LangChain or FastAPI + direct Ollama calls

4. API Layer

  • Framework: FastAPI
  • Port: 8000
  • Features: Document upload, generation, review endpoints

Review Workflow Requirements

Completeness Checks

  1. Required Sections:

    • Project Overview/Introduction
    • Requirements (Functional/Non-functional)
    • Architecture/Technical Design
    • Timeline/Milestones
    • RACI Matrix (Deferred to future phase)
  2. Technical Components Coverage:

    • All mentioned technologies are addressed
    • Architecture decisions are documented
    • Dependencies are identified

Consistency Validation

  • Cross-reference validation between sections
  • Technical terminology consistency
  • Version/date consistency

Milestone Extraction

  • Identify timeline commitments
  • Extract deliverables
  • Map dependencies

Implementation Strategy

Two-Stage Pipeline

Stage 1: Structure Extraction & Validation

Document → Parse → LLM Structure Extraction → Validate Checklist

Purpose:

  • Identify document sections
  • Check completeness against required sections
  • Extract metadata (page ranges, section hierarchy)

Output:

{
  "sections": [
    {
      "name": "Requirements",
      "present": true,
      "page_range": "5-12",
      "completeness_score": 0.85
    },
    {
      "name": "RACI Matrix",
      "present": false,
      "completeness_score": 0.0
    }
  ]
}
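Given Stage 1 output in the shape above, the checklist validation can be a small pure-Python step. The exact-name matching below is a simplification for the sketch; real section titles would need normalization or fuzzy matching, and the 0.7 threshold is an assumed tuning parameter.

```python
REQUIRED_SECTIONS = [
    "Project Overview/Introduction",
    "Requirements (Functional/Non-functional)",
    "Architecture/Technical Design",
    "Timeline/Milestones",
]


def completeness_report(sections: list[dict], threshold: float = 0.7) -> dict:
    """Compare Stage 1 section results against the required-section checklist."""
    by_name = {s["name"]: s for s in sections}
    missing, weak = [], []
    for required in REQUIRED_SECTIONS:
        found = by_name.get(required)
        if found is None or not found.get("present", False):
            missing.append(required)
        elif found.get("completeness_score", 0.0) < threshold:
            weak.append(required)
    return {
        "missing_sections": missing,
        "weak_sections": weak,
        "passed": not missing and not weak,
    }


stage1 = [
    {"name": "Requirements (Functional/Non-functional)", "present": True, "completeness_score": 0.85},
    {"name": "Timeline/Milestones", "present": True, "completeness_score": 0.4},
]
report = completeness_report(stage1)
print(report)
```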

Stage 2: RAG for Technical Verification

Identified Sections → Chunk → Embed → Semantic Search → Technical Validation

Purpose:

  • Deep technical content analysis
  • Consistency checking across document
  • Compare against similar historical documents (when available)
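The chunking step in this pipeline could start as simple as the sketch below: fixed-size character windows with overlap so sentences straddling a boundary appear in two chunks. The sizes are assumed starting points, not tuned values; a refined version would split on the section boundaries detected in Stage 1 before embedding each chunk with nomic-embed-text.

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Fixed-size character chunking with overlap between consecutive chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step back by `overlap` characters
    return chunks


chunks = chunk_text("a" * 2000)
print(len(chunks), [len(c) for c in chunks])
```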

POC Approach

Phase 1: Single Document Validation

Goal: Validate architecture with one client document

Steps:

  1. Set up Docker infrastructure
  2. Parse single document
  3. Extract structure
  4. Generate review report
  5. Iterate on prompts

Success Criteria:

  • Successfully parse document structure
  • Identify missing sections
  • Generate actionable review comments

Phase 2: Scale to Document Repository (Future)

  • Ingest historical documents
  • Build RAG knowledge base
  • Comparative analysis capabilities

Docker Compose Configuration

version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    container_name: doc_ai_ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    restart: unless-stopped

  qdrant:
    image: qdrant/qdrant:latest
    container_name: doc_ai_qdrant
    ports:
      - "6333:6333"
      - "6334:6334"
    volumes:
      - qdrant_data:/qdrant/storage
    restart: unless-stopped

  api:
    build: ./app
    container_name: doc_ai_api
    ports:
      - "8000:8000"
    environment:
      - OLLAMA_HOST=http://ollama:11434
      - QDRANT_HOST=qdrant
      - QDRANT_PORT=6333
    volumes:
      - ./app:/app
      - uploads:/app/uploads
    depends_on:
      - ollama
      - qdrant
    restart: unless-stopped

volumes:
  ollama_data:
  qdrant_data:
  uploads:

Project Structure

project/
├── docker-compose.yml
├── app/
│   ├── Dockerfile
│   ├── requirements.txt
│   ├── main.py                      # FastAPI application
│   ├── services/
│   │   ├── parser.py                # Document parsing logic
│   │   ├── structure_extractor.py   # Section detection & classification
│   │   ├── embedder.py              # Embedding service (Ollama)
│   │   └── reviewer.py              # Review orchestration
│   ├── prompts/
│   │   ├── structure.txt            # Structure extraction prompt
│   │   └── review.txt               # Review prompt templates
│   └── models/
│       └── schemas.py               # Pydantic models
└── README.md

Key Dependencies

fastapi==0.104.1
uvicorn[standard]==0.24.0
python-multipart==0.0.6
unstructured[pdf]==0.11.0
qdrant-client==1.7.0
langchain==0.1.0
langchain-community==0.0.10
ollama==0.1.6
pydantic==2.5.0
python-docx==1.1.0
PyPDF2==3.0.1

Use Case Flows

Use Case 1: Generate New Document

User Prompt
  ↓
Retrieve Similar Specs (RAG)
  ↓
Extract Common Structure/Patterns
  ↓
LLM Generation with Context
  ↓
Format with Templates
  ↓
Return Generated Document

Use Case 2: Upload & Explain

Upload PDF/DOCX
  ↓
Parse Structure (Unstructured.io)
  ↓
Extract Sections with Metadata
  ↓
Embed Chunks (nomic-embed-text)
  ↓
Store in Qdrant
  ↓
Semantic Search for Similar Docs
  ↓
LLM Explains with Comparative Context
  ↓
Return Explanation Report

Use Case 3: Review Document

Document + Review Criteria
  ↓
Extract Key Sections (Structure Extractor)
  ↓
Validate Completeness Checklist
  ↓
Retrieve Best Practices (RAG)
  ↓
Technical Consistency Check
  ↓
Identify Gaps/Inconsistencies
  ↓
Generate Structured Review Report

Prompt Engineering Strategy

Structure Extraction Prompt Template

Extract and classify sections from this technical document:

Required Sections:
- Project Overview/Introduction
- Requirements (Functional/Non-functional)
- Architecture/Technical Design
- Timeline/Milestones

For each section found, return:
{
  "name": "section name",
  "present": true/false,
  "page_range": "start-end",
  "content_summary": "brief summary",
  "completeness_score": 0.0-1.0
}

Document Content:
{document_text}
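Rendering the template above is a plain string-formatting step; a condensed sketch (the truncation limit is an assumed guard for Mistral 7B's context window; the rendered prompt would then be sent to the model, e.g. via the `ollama` Python client):

```python
# Condensed version of prompts/structure.txt; doubled braces escape the
# literal JSON example so str.format only substitutes {document_text}.
STRUCTURE_PROMPT = """Extract and classify sections from this technical document:

Required Sections:
- Project Overview/Introduction
- Requirements (Functional/Non-functional)
- Architecture/Technical Design
- Timeline/Milestones

Return a JSON list of {{"name", "present", "page_range", "content_summary", "completeness_score"}} objects.

Document Content:
{document_text}"""


def build_structure_prompt(document_text: str, max_chars: int = 12_000) -> str:
    # Truncation is a placeholder guard for the model's context window;
    # a real pipeline would chunk the document instead of cutting it off.
    return STRUCTURE_PROMPT.format(document_text=document_text[:max_chars])
```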

Review Prompt Template

Review this technical specification for:

1. Completeness:
   - All required sections present
   - Technical components adequately covered
   - Timeline/milestones clearly defined

2. Consistency:
   - Technical terminology usage
   - Cross-references validate
   - Version/date consistency

3. Technical Accuracy:
   - Architecture decisions justified
   - Technology choices appropriate
   - Dependencies identified

Document Sections:
{structured_sections}

Provide structured feedback with:
- Missing elements
- Inconsistencies found
- Recommendations

Privacy & Compliance Considerations

Security Requirements

  • Self-hosted LLM: All processing on-premises
  • No external API calls: Client documents never leave infrastructure
  • Data isolation: Each client project in separate Qdrant collection
  • Access control: (To be implemented in production)

Document Handling

  • Encrypted uploads (HTTPS)
  • Temporary storage only during processing
  • Option to purge after review
  • Audit logging for compliance

Resource Allocation

Memory Budget (16GB VM)

Ollama (Mistral 7B):     ~8GB
Qdrant (POC sizing):     ~1GB
FastAPI + Workers:       ~2GB
System + Docker:         ~3GB
Buffer:                  ~2GB
-------------------------------
Total:                   ~16GB

Storage Requirements

  • Ollama Models: ~4-8GB per model
  • Qdrant Collections: Scales with document volume
  • POC Estimate: 10-20GB for single document testing

Next Steps

Immediate Actions

  1. Infrastructure Setup:

    • Deploy docker-compose stack on VM
    • Pull Ollama models (mistral:7b-instruct, nomic-embed-text)
    • Verify Qdrant connectivity
  2. Code Implementation:

    • Build FastAPI skeleton
    • Implement document parser
    • Create structure extractor service
  3. Prompt Engineering:

    • Develop structure extraction prompts
    • Create review prompt templates
    • Test with sample document

Decision Point: Implementation Approach

Choose one of the following to proceed:

Option A: Full Implementation

  • Complete parser → structure extractor → RAG → review pipeline
  • Estimated time: 2-3 days
  • Best for: Complete POC validation

Option B: Critical Components First

  • Structure extraction + review prompts only
  • Estimated time: 1 day
  • Best for: Quick validation of concept

Option C: Infrastructure + Iterative

  • Setup stack, then iterate on logic incrementally
  • Estimated time: Ongoing
  • Best for: Learning and refinement

Open Questions & Considerations

Document Structure

  • Q: Do historical documents follow any patterns?
  • A: No specific format - adaptive parsing required

Scale Considerations

  • Current: Single document POC
  • Future: Multi-document repository with comparative analysis

RACI Integration

  • Status: Deferred to future phase
  • Reason: Complexity of table extraction and inference
  • Future: Will need to decide between:
    • RACI table extraction (if provided)
    • RACI inference (if generated from content)

Architecture Benefits

Why This Stack?

  1. Resource Efficient: Fits comfortably in 16GB RAM
  2. Privacy Compliant: Fully self-hosted, no external dependencies
  3. Offline Capable: All inference on-premises
  4. MCP-Ready: Can expose tools via MCP servers in future iterations
  5. Scalable: Can swap Ollama for API calls if cloud deployment needed
  6. Docker-based: Consistent development across Mac M1 → VM deployment

Development Workflow

Local (Mac M1)              Remote (VM)
    ↓                           ↓
Docker Compose          Docker Compose
    ↓                           ↓
Hot Reload Dev    →     Deploy to Production
    ↓                           ↓
Test Locally            Client Documents

Success Metrics

POC Validation

  • Successfully parse uploaded document
  • Extract document structure with 80%+ accuracy
  • Identify missing sections
  • Generate actionable review comments
  • Process document in <2 minutes

Production Readiness (Future)

  • Process 10+ documents in repository
  • Comparative analysis across documents
  • API response time <30s for review
  • Accuracy validation against manual reviews


Notes & Decisions Log

2025-11-06 - Initial Architecture Discussion

  • Decision: Two-stage pipeline (structure extraction → RAG validation)
  • Rationale: With no standardized document format, structure detection must precede technical analysis
  • Deferred: RACI matrix extraction (complexity vs POC scope)
  • Agreed: Start with single document POC before scaling

Key Architectural Choices

  1. Self-hosted over API: Privacy/compliance requirement
  2. Ollama over OpenAI: On-premises, cost control, data sovereignty
  3. Structure-first approach: Necessary for completeness validation without standard format
  4. Docker Compose: Deployment consistency, already familiar infrastructure

Contact & Collaboration

Project Lead: Rannie (Head of Engineering)
Expertise: RAG, MCP, LLM, Laravel/PHP, DevOps, System Architecture
Current Focus: Malta GPG integration, ETL pipelines, multi-PHP setups


Document Status: Architecture Planning Complete - Awaiting Implementation Decision
