ADR-0018: Authority-First Delta-Indexed Retrieval Architecture

Status

Proposed

Context

Current retrieval architecture decisions cover quality gates, schema/versioning, and API wrapper behavior, but do not yet define an end-to-end contract that simultaneously enforces:

  • authoritative raw-source truth
  • immutable identity across alias churn
  • ACL-first retrieval (including vector search)
  • derived-artifact disposability with provenance witnesses
  • delta-only recomputation keyed by digests and epochs
  • untrusted-source handling for all inbound content
  • minimum ingest transport coverage for SFTP and HTTPS

This gap risks identity drift, silent ACL leakage, and expensive rebuilds whenever sources move, policies change, or models and versions churn.

Decision

Adopt an authority-first, digest-and-epoch keyed architecture with three planes and deterministic-first processing.

  0. Core invariants (non-negotiable):
  • Authority: raw source artifacts are the only authoritative truth.
  • Identity: node identity is (source_system, immutable_source_id); titles/paths/URLs are aliases only.
  • Access: the ACL filter is applied before retrieval on every retrieval path, including vector retrieval.
  • Derived-only: normalized text, embeddings, and edges are disposable projections with witnesses.
  • Delta-only: recomputation is driven by content digests and policy/schema/model epochs; no unbounded rebuilds by default.
  • Trust boundary: all source data is untrusted at ingest and must pass validation/sanitization before downstream use.
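
The identity invariant can be illustrated with a minimal sketch. `NodeId` and `AliasRegistry` are hypothetical names, not part of this ADR; the only thing taken from the text is that identity is the pair (source_system, immutable_source_id) and that titles, paths, and URLs are aliases pointing at it.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class NodeId:
    """Immutable identity: never derived from titles, paths, or URLs."""
    source_system: str
    immutable_source_id: str


@dataclass
class AliasRegistry:
    """Hypothetical alias store: (alias_type, alias_value) -> NodeId."""
    _aliases: Dict[Tuple[str, str], NodeId] = field(default_factory=dict)

    def bind(self, alias_type: str, alias_value: str, node_id: NodeId) -> None:
        key = (alias_type, alias_value)
        existing = self._aliases.get(key)
        if existing is not None and existing != node_id:
            # A collision is never resolved silently; the caller must
            # record an explicit resolution event first.
            raise ValueError(f"alias collision on {key}")
        self._aliases[key] = node_id

    def resolve(self, alias_type: str, alias_value: str) -> Optional[NodeId]:
        return self._aliases.get((alias_type, alias_value))


page = NodeId("sharepoint", "item-7f3a")
registry = AliasRegistry()
registry.bind("title", "Onboarding Guide", page)
registry.bind("title", "New Hire Guide", page)  # rename: a new alias, same node
```

Both the old and the new title resolve to the same `NodeId`, so a rename adds an alias without creating or overwriting a node.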

0.1) Interface constraints (required):

  • Ingest interfaces must include both SFTP and HTTPS transports.
  • Connector implementations must preserve the same identity and digest invariants regardless of transport.
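
As a sketch of the transport constraint above: the transport can be validated at ingest while staying out of both identity and digest. The `ingest` function, the `sha256:` prefix, and the choice of SHA-256 are assumptions for illustration; `fetched_at` is omitted because it varies per fetch.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class RawBlob:
    raw_digest: str
    payload: bytes
    source_system: str
    source_id: str
    source_rev: str


def ingest(payload: bytes, source_system: str, source_id: str,
           source_rev: str, transport: str) -> RawBlob:
    """Build a RawBlob; `transport` is checked but never enters identity or digest."""
    if transport not in ("sftp", "https"):
        raise ValueError(f"unsupported transport: {transport}")
    digest = "sha256:" + hashlib.sha256(payload).hexdigest()
    return RawBlob(digest, payload, source_system, source_id, source_rev)


doc = b"# Runbook\nRestart the service.\n"
via_sftp = ingest(doc, "ado_wiki", "page-42", "rev-3", transport="sftp")
via_https = ingest(doc, "ado_wiki", "page-42", "rev-3", transport="https")
```

The two blobs compare equal, which is the property a connector test suite would assert for every fixture.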
  1. Data model:
  • Authoritative (Blob Plane):
    • RawBlob(raw_digest, bytes, source_system, source_id, source_rev, fetched_at)
    • AttachmentBlob(attachment_id, raw_digest, mime, refs...)
    • HeadMap(node_id -> raw_digest) with append-only supersession history
  • Alias (Resolution Plane):
    • AliasMap(node_id, alias_type, alias_value, epoch)
    • Redirect edges on rename/move; identity never overwritten
  • Derived (Index Planes):
    • Every derived artifact carries:
      • witness = H(raw_digest, policy_epoch, schema_version, resolver_version, model_id, prompt_hash)
    • NormalizedArtifact with canonical text + span map to raw
    • EdgeIndex keyed by (node_id, raw_digest)
    • LexicalIndex per chunk
    • VectorIndex per chunk
    • ACLIndex per chunk
    • optional EntityIndex
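
The witness formula above could be computed as follows. This is a sketch under stated assumptions: the ADR names only a hash function H, so SHA-256 and the JSON-array encoding of the inputs are illustrative choices, and `is_stale` is a hypothetical helper.

```python
import hashlib
import json


def witness(raw_digest: str, policy_epoch: int, schema_version: str,
            resolver_version: str, model_id: str, prompt_hash: str) -> str:
    """H(...) over the six inputs, encoded as a JSON array so field
    boundaries stay unambiguous (assumed encoding)."""
    fields = [raw_digest, policy_epoch, schema_version,
              resolver_version, model_id, prompt_hash]
    canonical = json.dumps(fields, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_stale(stored_witness: str, **current_inputs) -> bool:
    """A derived artifact is stale when any witness input has changed."""
    return stored_witness != witness(**current_inputs)


inputs = dict(raw_digest="sha256:ab12", policy_epoch=7, schema_version="2",
              resolver_version="1.4", model_id="normalizer-a", prompt_hash="p-9c")
stored = witness(**inputs)
```

Bumping any single input, for example `policy_epoch`, yields a different witness, which is what lets a derived artifact be discarded and rebuilt without touching the raw blob.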
  2. Ingest + normalization pipeline:
  • Stage A ingest.fetch: fetch ADO Wiki / SharePoint / Jira docs and other approved sources via required interfaces (SFTP, HTTPS) with native IDs/revisions; treat payloads as untrusted; store raw blobs; update head map.
  • Stage B extract.*: deterministic parse of DOM/MD AST for structure, raw links, ACL payloads.
  • Stage C normalize.*: LLM normalization only where deterministic conversion is insufficient; strict schema output with span provenance; quarantine on invariant violation.
  • Stage D resolve.links: deterministic link resolution via alias map and source-aware rules; unresolved links retained with explicit reasons; no implicit node merges.
  • Stage E index.*: stable chunk IDs from heading hash + span range; delta-write lexical/vector/edge/ACL indexes for changed chunks only.
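
Stage E can be sketched as two small functions: one that derives a stable chunk ID from a heading hash plus a span range, and one that plans the delta write. The ID layout, the separator, the truncation to 16 hex characters, and the names `chunk_id` and `plan_delta` are assumptions made for illustration.

```python
import hashlib
from typing import Dict, List, Sequence, Tuple


def chunk_id(node_key: str, heading_path: Sequence[str], span: Tuple[int, int]) -> str:
    """Stable ID from the heading hash and the span range into the raw source."""
    joined = "\x1f".join(heading_path).encode("utf-8")
    heading_hash = hashlib.sha256(joined).hexdigest()[:16]
    return f"{node_key}:{heading_hash}:{span[0]}-{span[1]}"


def plan_delta(indexed: Dict[str, str], current: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (chunk IDs to write, chunk IDs to delete).

    Both maps go from chunk ID to that chunk's content digest.
    """
    to_write = sorted(cid for cid, digest in current.items() if indexed.get(cid) != digest)
    to_delete = sorted(cid for cid in indexed if cid not in current)
    return to_write, to_delete


intro = chunk_id("ado_wiki/page-42", ["Runbook", "Intro"], (0, 120))
steps = chunk_id("ado_wiki/page-42", ["Runbook", "Steps"], (120, 480))
legacy = chunk_id("ado_wiki/page-42", ["Runbook", "Legacy"], (480, 600))

indexed = {intro: "d1", steps: "d2", legacy: "d3"}
current = {intro: "d1", steps: "d2-edited"}
to_write, to_delete = plan_delta(indexed, current)
```

Here only the edited "Steps" chunk is rewritten and the removed "Legacy" chunk is deleted; the untouched "Intro" chunk is skipped.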
  3. Query path (permission-first hybrid retrieval):
  • Step 1 query.authz_context: resolve principal->groups->ACL bitsets.
  • Step 2 query.retrieve.*: run lexical and vector retrieval with ACL prefilter before candidate return.
    • Prohibit vector post-filtering unless the index design is leakage-safe by construction.
  • Step 3 query.rerank + query.synthesize: rerank by query class, assemble minimal cited excerpts, synthesize answers only from retrieved excerpts.
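
A minimal sketch of the prefilter in Step 2, assuming ACLs are integer bitsets and scoring is a plain dot product over an in-memory list. A real vector index would apply the filter inside the search itself; the names `Chunk` and `retrieve` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    acl_bits: int            # groups allowed to read this chunk
    vector: Sequence[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_vector: Sequence[float], chunks: Sequence[Chunk],
             principal_bits: Optional[int], k: int) -> List[str]:
    if principal_bits is None:
        # Missing ACL token: fail closed rather than return anything.
        raise PermissionError("missing ACL token")
    # Filter first, so forbidden chunks are never scored or ranked.
    permitted = [c for c in chunks if c.acl_bits & principal_bits]
    ranked = sorted(permitted, key=lambda c: dot(query_vector, c.vector), reverse=True)
    return [c.chunk_id for c in ranked[:k]]


HR, ENG = 0b01, 0b10
corpus = [
    Chunk("salary-bands", HR, (0.9, 0.1)),
    Chunk("deploy-guide", ENG, (0.8, 0.2)),
    Chunk("oncall-faq", ENG | HR, (0.1, 0.9)),
]
engineer_results = retrieve((1.0, 0.0), corpus, principal_bits=ENG, k=2)
```

The best-scoring chunk, "salary-bands", never becomes a candidate for the engineering principal. Filtering after a top-k search instead would both expose it to ranking and leave fewer than k permitted results.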
  4. Governance and safety gates:
  • ACL correctness gate:
    • retrieval requires ACL token; missing token fails closed
    • continuous leak tests with canary principals and adversarial prompts
  • Identity gate:
    • only (source_system, immutable_source_id) may create nodes
    • alias collisions require explicit resolution events
  • Drift gate:
    • enforce structural invariants (headings/code/tables/redactions and span coverage)
    • quarantine and route to human review when drift thresholds are exceeded
  • Rebuild control gate:
    • requalification planner computes affected artifacts from epoch deltas
    • canary first, then roll forward; a global rebuild is an opt-in exception
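
The rebuild control gate could be sketched as a planner that compares the epochs each artifact was built with against the current ones. The dependency map below (which artifact kinds depend on which epoch fields) is an assumption for illustration, as are the names `ArtifactRecord`, `plan_requalification`, and `split_canary`.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

# Assumed mapping from artifact kind to the epoch fields it depends on.
DEPENDS_ON = {
    "normalized": ("schema_version", "model_id"),
    "vector": ("model_id",),
    "acl": ("policy_epoch",),
}


@dataclass
class ArtifactRecord:
    artifact_id: str
    kind: str
    built_with: Dict[str, object]  # epoch values recorded at build time


def plan_requalification(records: Sequence[ArtifactRecord],
                         current: Dict[str, object]) -> List[str]:
    """Return IDs of artifacts whose relevant epochs differ from `current`."""
    affected = [
        r.artifact_id for r in records
        if any(r.built_with.get(f) != current[f] for f in DEPENDS_ON[r.kind])
    ]
    return sorted(affected)


def split_canary(affected: List[str], canary_size: int) -> Tuple[List[str], List[str]]:
    """Requalify a small canary batch first; the rest waits for it to pass."""
    return affected[:canary_size], affected[canary_size:]


current = {"policy_epoch": 8, "schema_version": "2", "model_id": "embed-a"}
records = [
    ArtifactRecord("n1", "normalized", {"schema_version": "2", "model_id": "embed-a"}),
    ArtifactRecord("v1", "vector", {"model_id": "embed-a"}),
    ArtifactRecord("a1", "acl", {"policy_epoch": 7}),
    ArtifactRecord("a2", "acl", {"policy_epoch": 7}),
    ArtifactRecord("a3", "acl", {"policy_epoch": 8}),
]
affected = plan_requalification(records, current)
canary, remainder = split_canary(affected, canary_size=1)
```

A policy epoch bump touches only the two stale ACL artifacts; normalized text and vectors are left alone, which is the bounded behavior the gate is meant to guarantee.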
  5. OTEL and metrics contract:
  • Hash high-cardinality identifiers (source_id_hash, raw_digest_prefix) in attributes.
  • Emit metrics for every request; capture full traces for a sampled subset.
  • Minimum tracked metrics:
    • leak-test pass rate
    • quarantine rate
    • unresolved-link rate
    • tokens per doc
    • cost-per-doc proxy
    • per-stage latency breakdown
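
The attribute-hashing rule can be sketched without any telemetry library. The helper name `telemetry_attributes`, the 16- and 12-character truncations, and the use of SHA-256 are assumptions; the resulting dictionary is the kind of value that could then be attached to a span or metric as attributes.

```python
import hashlib
from typing import Dict


def telemetry_attributes(stage: str, source_id: str, raw_digest: str) -> Dict[str, str]:
    """Build low-cardinality-safe attributes; the raw source ID never leaves this function."""
    source_id_hash = hashlib.sha256(source_id.encode("utf-8")).hexdigest()[:16]
    # Drop an optional "algo:" prefix, then keep a short digest prefix.
    digest_hex = raw_digest.split(":", 1)[-1]
    return {
        "stage": stage,
        "source_id_hash": source_id_hash,
        "raw_digest_prefix": digest_hex[:12],
    }


attrs = telemetry_attributes(
    stage="ingest.fetch",
    source_id="sharepoint/item-7f3a",
    raw_digest="sha256:ab12cd34ef567890",
)
```

An unkeyed hash of a guessable identifier can be reversed by enumeration, so a keyed hash (for example HMAC with a deployment secret) may be preferable where source IDs are sensitive.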

Consequences

  • Positive:
    • Prevents identity corruption during title/path churn.
    • Reduces ACL leakage risk by making permission checks a hard precondition.
    • Cuts indexing cost via digest/epoch delta recomputation.
    • Enables auditable provenance and safe replacement of derived artifacts.
  • Negative:
    • Increases schema and indexing complexity (witness management, epoch propagation, quarantine queues).
    • Requires careful ACL index design to avoid vector leakage paths.
    • Adds operational overhead for leak-testing, drift scoring, and requalification planning.
  • Neutral:
    • LLM use remains optional and bounded to explicitly approved normalization/disambiguation scenarios.
    • Entity extraction remains optional and independent from correctness-critical retrieval paths.

Alternatives Considered

  • Option A: title/path identity with periodic full rebuilds.
    • Rejected due to high rename/move breakage risk and poor incremental efficiency.
  • Option B: ACL filtering after retrieval merge.
    • Rejected due to leakage risk and policy non-compliance for protected corpora.
  • Option C: deterministic-only normalization with no LLM fallback.
    • Rejected due to poor handling of complex HTML/macro content and reduced recall.

Validation Plan

  • Local system testing target: macOS (developer machine baseline).
  • Build deterministic fixture corpus containing:
    • rename/move chains with redirects
    • alias collision scenarios
    • mixed ACL visibility documents
    • structured content (headings/code/tables/redactions)
  • Execute validation suites:
    • transport coverage: ingest fixtures successfully over both SFTP and HTTPS
    • untrusted input safety: malformed/hostile payload fixtures are rejected or quarantined per policy
    • identity stability: node IDs remain stable across alias changes
    • ACL fail-closed: retrieval denied without ACL token
    • ACL leak tests: canary principal cannot retrieve protected chunks through lexical or vector path
    • delta indexing: only changed chunks/artifacts recomputed on digest or epoch change
    • witness integrity: derived artifacts are rejected on witness mismatch
    • drift gate: quarantine triggered for span/structure threshold violations
    • unresolved links: unresolved entries retained with explicit reason codes
  • Verify OTEL outputs:
    • required stage spans present
    • hashed attributes used for high-cardinality IDs
    • metric set emitted with stable names and units
  • Promotion rule:
    • canary pass required before broader index roll-forward
    • no global rebuild unless explicitly approved by operator override
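
The unresolved-links check in the validation suites could be exercised against a resolver shaped like the sketch below. The reason codes (`EMPTY_LINK`, `ALIAS_NOT_FOUND`, `AMBIGUOUS_ALIAS`) and the names `LinkResolution` and `resolve_link` are hypothetical; the ADR requires only that unresolved links are retained with explicit reasons and that no nodes are merged implicitly.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

NodeKey = Tuple[str, str]  # (source_system, immutable_source_id)


@dataclass(frozen=True)
class LinkResolution:
    raw_link: str
    node_id: Optional[NodeKey]  # None when unresolved
    reason: Optional[str]       # reason code when unresolved


def resolve_link(raw_link: str, alias_map: Dict[str, List[NodeKey]]) -> LinkResolution:
    if not raw_link.strip():
        return LinkResolution(raw_link, None, "EMPTY_LINK")
    matches = alias_map.get(raw_link, [])
    if len(matches) == 1:
        return LinkResolution(raw_link, matches[0], None)
    if not matches:
        return LinkResolution(raw_link, None, "ALIAS_NOT_FOUND")
    # Several candidates: keep the link unresolved instead of picking one.
    return LinkResolution(raw_link, None, "AMBIGUOUS_ALIAS")


alias_map = {
    "/wiki/Runbook": [("ado_wiki", "page-42")],
    "/wiki/Home": [("ado_wiki", "page-1"), ("sharepoint", "item-9")],
}
resolved = resolve_link("/wiki/Runbook", alias_map)
missing = resolve_link("/wiki/Gone", alias_map)
ambiguous = resolve_link("/wiki/Home", alias_map)
```

A fixture test would then assert that every unresolved entry carries a non-empty reason code and that ambiguous aliases never resolve to a node.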