Proposed
Current retrieval architecture decisions cover quality gates, schema/versioning, and API wrapper behavior, but do not yet define an end-to-end contract that simultaneously enforces:
- authoritative raw-source truth
- immutable identity across alias churn
- ACL-first retrieval (including vector search)
- derived-artifact disposability with provenance witnesses
- delta-only recomputation keyed by digests and epochs
- untrusted-source handling for all inbound content
- minimum ingest transport coverage for SFTP and HTTPS
This gap creates risk of identity drift, silent ACL leakage, and expensive rebuild behavior during source movement, policy updates, and model/version churn.
Adopt an authority-first, digest-and-epoch keyed architecture with three planes and deterministic-first processing.
- Core invariants (non-negotiable):
- Authority: raw source artifacts are the only authoritative truth.
- Identity: node identity is `(source_system, immutable_source_id)`; titles/paths/URLs are aliases only.
- Access: no retrieval before ACL filter for any retrieval path, including vector retrieval.
- Derived-only: normalized text, embeddings, and edges are disposable projections with witnesses.
- Delta-only: recomputation is driven by content digests and policy/schema/model epochs; no unbounded rebuilds by default.
- Trust boundary: all source data is untrusted at ingest and must pass validation/sanitization before downstream use.
0.1) Interface constraints (required):
- Ingest interfaces must include both `SFTP` and `HTTPS` transports.
- Connector implementations must preserve the same identity and digest invariants regardless of transport.
- Data model:
- Authoritative (Blob Plane):
- `RawBlob(raw_digest, bytes, source_system, source_id, source_rev, fetched_at)`
- `AttachmentBlob(attachment_id, raw_digest, mime, refs...)`
- `HeadMap(node_id -> raw_digest)` with append-only supersession history
- Alias (Resolution Plane):
- `AliasMap(node_id, alias_type, alias_value, epoch)`
- Redirect edges on rename/move; identity never overwritten
- Derived (Index Planes):
- Every derived artifact carries `witness = H(raw_digest, policy_epoch, schema_version, resolver_version, model_id, prompt_hash)`.
- `NormalizedArtifact` with canonical text + span map to raw
- `EdgeIndex` keyed by `(node_id, raw_digest)`
- `LexicalIndex` per chunk
- `VectorIndex` per chunk
- `ACLIndex` per chunk
- optional `EntityIndex`
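The identity key and witness from the data model above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the class names `NodeId` and `WitnessInputs`, the choice of SHA-256 for `H`, and the length-prefixed field encoding are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeId:
    """Immutable identity; titles, paths, and URLs never appear here."""
    source_system: str
    immutable_source_id: str


@dataclass(frozen=True)
class WitnessInputs:
    """The six witness inputs listed in the data model."""
    raw_digest: str
    policy_epoch: int
    schema_version: str
    resolver_version: str
    model_id: str
    prompt_hash: str


def compute_witness(w: WitnessInputs) -> str:
    """H(...) over the six inputs. SHA-256 and the encoding are assumptions."""
    fields = [w.raw_digest, str(w.policy_epoch), w.schema_version,
              w.resolver_version, w.model_id, w.prompt_hash]
    h = hashlib.sha256()
    for field in fields:
        data = field.encode("utf-8")
        # Length-prefix each field so ("ab", "c") and ("a", "bc") hash differently.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


def is_current(stored_witness: str, current: WitnessInputs) -> bool:
    """A derived artifact is usable only if its witness matches current inputs."""
    return stored_witness == compute_witness(current)
```

Because the witness covers every input that can invalidate a projection, a mismatch is enough to mark an artifact as disposable without inspecting its contents.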
- Ingest + normalization pipeline:
- Stage A `ingest.fetch`: fetch ADO Wiki / SharePoint / Jira docs and other approved sources via required interfaces (`SFTP`, `HTTPS`) with native IDs/revisions; treat payloads as untrusted; store raw blobs; update head map.
- Stage B `extract.*`: deterministic parse of DOM/MD AST for structure, raw links, ACL payloads.
- Stage C `normalize.*`: LLM normalization only where deterministic conversion is insufficient; strict schema output with span provenance; quarantine on invariant violation.
- Stage D `resolve.links`: deterministic link resolution via alias map and source-aware rules; unresolved links retained with explicit reasons; no implicit node merges.
- Stage E `index.*`: stable chunk IDs from heading hash + span range; delta-write lexical/vector/edge/ACL indexes for changed chunks only.
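Stage E's stable chunk IDs and delta writes could look roughly like the sketch below. The function names `chunk_id` and `plan_delta` are hypothetical, and scoping the ID by `node_id` and truncating it to 32 hex characters are assumptions added for the example; the ADR itself only specifies heading hash plus span range.

```python
import hashlib


def chunk_id(node_id: str, heading_path: list[str], span: tuple[int, int]) -> str:
    """Derive a chunk ID from the heading hash and span range (node scope assumed)."""
    heading_hash = hashlib.sha256("\x1f".join(heading_path).encode("utf-8")).hexdigest()
    key = f"{node_id}|{heading_hash}|{span[0]}-{span[1]}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:32]


def plan_delta(indexed: dict[str, str], fresh: dict[str, str]) -> tuple[set[str], set[str]]:
    """Compare chunk_id -> content digest maps and return (to_write, to_delete)."""
    # Write chunks that are new or whose content digest changed.
    to_write = {cid for cid, digest in fresh.items() if indexed.get(cid) != digest}
    # Delete chunks that no longer exist in the fresh extraction.
    to_delete = set(indexed) - set(fresh)
    return to_write, to_delete
```

Unchanged chunks appear in neither set, so lexical, vector, edge, and ACL indexes are only touched for chunks whose digest actually moved.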
- Query path (permission-first hybrid retrieval):
- Step 1 `query.authz_context`: resolve principal -> groups -> ACL bitsets.
- Step 2 `query.retrieve.*`: run lexical and vector retrieval with ACL prefilter before candidate return.
  - Prohibit unsafe vector post-filtering unless index design is leakage-safe by construction.
- Step 3 `query.rerank` + `query.synthesize`: rerank by query class, assemble minimal cited excerpts, synthesize answers only from retrieved excerpts.
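To make the ordering in the query path concrete, here is a small sketch of ACL-prefiltered vector retrieval. It uses a brute-force cosine scan in place of a real vector index, and the names `Chunk`, `AclTokenMissing`, and `retrieve` as well as the integer-bitmask ACL encoding are assumptions for illustration only.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    acl_bits: int                 # bit i set => group i may read this chunk
    vector: tuple[float, ...]


class AclTokenMissing(Exception):
    """Raised when retrieval is attempted without an authorization context."""


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(chunks, query_vec, principal_bits, k=5):
    """Return up to k chunk IDs, scoring only chunks the principal may read."""
    # Fail closed: no ACL context means no retrieval at all.
    if principal_bits is None:
        raise AclTokenMissing("retrieval requires an ACL token")
    # Prefilter: unauthorized chunks are never scored, so they cannot leak
    # through ranking, truncation, or similarity side effects.
    allowed = [c for c in chunks if c.acl_bits & principal_bits]
    ranked = sorted(allowed, key=lambda c: cosine(c.vector, query_vec), reverse=True)
    return [c.chunk_id for c in ranked[:k]]
```

The point of the sketch is the order of operations: the permission mask is applied before any similarity score is computed, which is what distinguishes a prefilter from the post-filtering the ADR prohibits.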
- Governance and safety gates:
- ACL correctness gate:
- retrieval requires ACL token; missing token fails closed
- continuous leak tests with canary principals and adversarial prompts
- Identity gate:
- only `(source_system, immutable_source_id)` may create nodes
- alias collisions require explicit resolution events
- Drift gate:
- enforce structural invariants (headings/code/tables/redactions and span coverage)
- quarantine and human review when drift thresholds exceeded
- Rebuild control gate:
- requalification planner computes affected artifacts from epoch deltas
- canary then roll; global rebuild is opt-in exception
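One way the rebuild control gate might compute affected artifacts is sketched below. The `Artifact` record, the per-artifact `epochs` mapping, and the 10% canary fraction are hypothetical choices for this example rather than part of the decision.

```python
from dataclasses import dataclass


@dataclass
class Artifact:
    artifact_id: str
    node_id: str
    raw_digest: str
    epochs: dict[str, str]   # epoch dimensions this artifact was built under


def plan_requalification(artifacts, head_map, current_epochs):
    """Return {artifact_id: reasons} for artifacts that need recomputation."""
    plan = {}
    for a in artifacts:
        reasons = []
        # Content change: the head map now points at a different raw digest.
        if head_map.get(a.node_id) != a.raw_digest:
            reasons.append("raw_digest")
        # Only dimensions the artifact depends on can invalidate it.
        for dim, built_with in a.epochs.items():
            if current_epochs.get(dim) != built_with:
                reasons.append(dim)
        if reasons:
            plan[a.artifact_id] = reasons
    return plan


def canary_batch(plan, fraction=0.1):
    """Pick a deterministic canary subset; the rest rolls only after it passes."""
    ids = sorted(plan)
    n = max(1, int(len(ids) * fraction)) if ids else 0
    return ids[:n]
```

Under this scheme a model epoch bump touches vector artifacts but leaves lexical artifacts alone, because the lexical artifacts never recorded a dependency on `model_id`.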
- OTEL and metrics contract:
- Hash high-cardinality identifiers (`source_id_hash`, `raw_digest_prefix`) in attributes.
- Use full metric coverage + sampled full traces.
- Minimum tracked metrics:
- leak-test pass rate
- quarantine rate
- unresolved-link rate
- tokens per doc
- cost-per-doc proxy
- per-stage latency breakdown
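The hashed attributes in the telemetry contract could be produced along these lines. This sketch builds a plain dictionary and does not call any telemetry SDK; the keyed HMAC-SHA-256, the 16- and 12-character truncations, and the `span_attributes` helper are assumptions for illustration.

```python
import hashlib
import hmac


def source_id_hash(source_system: str, source_id: str, salt: bytes) -> str:
    """Keyed hash so raw source IDs never appear in telemetry attributes."""
    msg = f"{source_system}\x1f{source_id}".encode("utf-8")
    return hmac.new(salt, msg, hashlib.sha256).hexdigest()[:16]


def raw_digest_prefix(raw_digest: str, length: int = 12) -> str:
    """Short prefix: enough to correlate events, without the full digest."""
    return raw_digest[:length]


def span_attributes(source_system, source_id, raw_digest, stage, salt):
    """Build the attribute dict for one pipeline stage span."""
    return {
        "stage": stage,
        "source_system": source_system,
        "source_id_hash": source_id_hash(source_system, source_id, salt),
        "raw_digest_prefix": raw_digest_prefix(raw_digest),
    }
```

The resulting dictionary can be attached to spans by whichever telemetry SDK is in use; a keyed hash is chosen here so that identifiers cannot be recovered by hashing guessed IDs without the salt.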
- Positive:
- Prevents identity corruption during title/path churn.
- Reduces ACL leakage risk by making permission checks a hard precondition.
- Cuts indexing cost via digest/epoch delta recomputation.
- Enables auditable provenance and safe replacement of derived artifacts.
- Negative:
- Increases schema and indexing complexity (witness management, epoch propagation, quarantine queues).
- Requires careful ACL index design to avoid vector leakage paths.
- Adds operational overhead for leak-testing, drift scoring, and requalification planning.
- Neutral:
- LLM use remains optional and bounded to explicitly approved normalization/disambiguation scenarios.
- Entity extraction remains optional and independent from correctness-critical retrieval paths.
- Option A: title/path identity with periodic full rebuilds.
- Rejected due to high rename/move breakage risk and poor incremental efficiency.
- Option B: ACL filtering after retrieval merge.
- Rejected due to leakage risk and policy non-compliance for protected corpora.
- Option C: deterministic-only normalization with no LLM fallback.
- Rejected due to poor handling of complex HTML/macro content and reduced recall.
- Local system testing target: macOS (developer machine baseline).
- Build deterministic fixture corpus containing:
- rename/move chains with redirects
- alias collision scenarios
- mixed ACL visibility documents
- structured content (headings/code/tables/redactions)
- Execute validation suites:
- transport coverage: ingest fixtures successfully over both `SFTP` and `HTTPS`
- untrusted input safety: malformed/hostile payload fixtures are rejected or quarantined per policy
- identity stability: node IDs remain stable across alias changes
- ACL fail-closed: retrieval denied without ACL token
- ACL leak tests: canary principal cannot retrieve protected chunks through lexical or vector path
- delta indexing: only changed chunks/artifacts recomputed on digest or epoch change
- witness integrity: derived artifacts rejected if witness mismatch
- drift gate: quarantine triggered for span/structure threshold violations
- unresolved links: unresolved entries retained with explicit reason codes
- Verify OTEL outputs:
- required stage spans present
- hashed attributes used for high-cardinality IDs
- metric set emitted with stable names and units
- Promotion rule:
- canary pass required before broader index roll-forward
- no global rebuild unless explicitly approved by operator override