@jasonrhodes
Last active March 12, 2026 15:55
Significant Events: Feature Detection & Computed Features Deep Dive (Kibana Streams)

Significant Events: Feature Detection Deep Dive

Source: Kibana Streams plugin, kbn-streams-ai and kbn-ai-tools packages. All code paths relative to x-pack/platform/ in the Kibana repo.

Overview

Feature extraction is Phase 1 of the SigEvents pipeline. It runs two things in parallel:

  1. LLM-Based Extraction — Sends 20 sample documents to an LLM to extract structured features (entities, infrastructure, technologies, dependencies, schema info).
  2. Computed Features — Four programmatic (no LLM) feature generators that provide raw data context.

Results are merged, deduplicated, and stored in .kibana_streams_features-*. The LLM that generates detection queries in Phase 2 never sees raw log data — it operates entirely on these features.


When Does Feature Extraction Run?

Feature extraction is a one-shot task, not a recurring job. It runs only when explicitly triggered:

  1. At onboarding time — The onboarding pipeline (POST /internal/streams/{streamName}/onboarding/_task) schedules feature identification, then query generation sequentially.
  2. On manual request — Via POST /internal/streams/{name}/features/_task (API or UI "regenerate features" action).

There is no cron, no interval, no automatic re-extraction. The Task Manager schedule call means "enqueue to run once," not "run on a recurring basis." Features expire after 7 days if not refreshed, but refreshing requires another explicit trigger.

Time Range

The from and to parameters are passed by the caller. The frontend UI (useOnboardingApi hook) hardcodes:

// streams_app/public/util/time_range.ts
const ONE_DAY_MS = 24 * 60 * 60 * 1000;

export function getLast24HoursTimeRange(now = Date.now()): AbsoluteTimeRange {
  return {
    from: new Date(now - ONE_DAY_MS).toISOString(),
    to: new Date(now).toISOString(),
    mode: 'absolute',
  };
}

So in practice, all feature extraction — including log_patterns, log_samples, error_logs, and the 20 sample docs sent to the LLM — operates on the last 24 hours of data. The API technically accepts any time range, but the UI always sends the last 24h window.


Computed Feature Generators

All four generators run in parallel via generateAllComputedFeatures() in kbn-streams-ai/src/features/computed/index.ts.

1. dataset_analysis

  • Implementation: plugins/shared/streams/server/lib/significant_events/helpers/analyze_dataset.ts
  • What it does: Runs describeDataset() from @kbn/ai-tools to analyze the index mapping. Returns a short dataset summary and discovers which text fields exist.
  • Side effect: Determines the categorizationField used by log_patterns — prefers message, falls back to body.text, or undefined if neither exists as a text field.
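The fallback can be sketched as a small helper. This is illustrative only — the function name is invented here; the real logic lives inside analyze_dataset.ts:

```typescript
// Illustrative sketch of the categorization-field fallback described above.
// Input: the set of field names that exist as text fields in the mapping.
function pickCategorizationField(textFields: Set<string>): string | undefined {
  if (textFields.has('message')) return 'message'; // preferred
  if (textFields.has('body.text')) return 'body.text'; // fallback
  return undefined; // neither exists: log_patterns will be skipped
}
```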

2. log_samples

  • Implementation: kbn-streams-ai/src/features/computed/log_samples.ts
  • What it does: Fetches 5 random raw documents from the stream.
  • Sampling: Uses getSampleDocuments() with random_score function scoring — ES assigns each document a random score and sorts descending, so the top 5 hits are effectively a random pick.
  • Filtering: None. Any document in the stream within the time range.
  • Output: Full flattened field maps (all fields via *), arrays truncated to 3 items.
  • Purpose: Give the LLM a sense of raw data structure — field names, values, format.
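As a sketch, a random_score sampling request could look like the following. The builder function and exact request shape are assumptions — getSampleDocuments() may construct its query differently:

```typescript
// Sketch of a random-sample search body: every matching doc gets a random
// score, results sort by score descending, so the top N are a random pick.
function buildRandomSampleQuery(size: number, from: string, to: string) {
  return {
    size,
    query: {
      function_score: {
        query: {
          bool: {
            filter: [{ range: { '@timestamp': { gte: from, lte: to } } }],
          },
        },
        functions: [{ random_score: {} }],
        boost_mode: 'replace', // use the random score alone, ignore relevance
      },
    },
  };
}
```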

3. error_logs

  • Implementation: kbn-streams-ai/src/features/computed/error_logs.ts
  • What it does: Fetches 5 random error documents from the stream.
  • Sampling: Same random_score approach as log_samples, but with an error filter.
  • Filtering: Documents must match at least one of:
    • log.level: "error" (exact term match)
    • message contains phrase "error" or "exception"
    • body.text contains phrase "error" or "exception"
  • Output: Same full flattened field maps as log_samples.
  • Purpose: Ensure the LLM sees actual error content even if errors are rare relative to total volume.
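A hedged sketch of such an error filter in the ES query DSL — the clause grouping here is an assumption based on the description above, and the real query in error_logs.ts may structure the phrase matches differently:

```typescript
// Sketch of the error filter: a document qualifies if at least one
// of these clauses matches (bool should with minimum_should_match: 1).
const errorFilter = {
  bool: {
    should: [
      { term: { 'log.level': 'error' } },
      { match_phrase: { message: 'error' } },
      { match_phrase: { message: 'exception' } },
      { match_phrase: { 'body.text': 'error' } },
      { match_phrase: { 'body.text': 'exception' } },
    ],
    minimum_should_match: 1,
  },
};
```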

4. log_patterns

  • Implementation: kbn-streams-ai/src/features/computed/log_patterns.ts → calls getLogPatterns() from @kbn/ai-tools
  • What it does: Identifies common log message pattern templates via Elasticsearch's categorize_text aggregation.
  • Fields analyzed: Only message and body.text (validated via fieldCaps — skipped if neither exists as text/match_only_text).
  • Output: Top 5 patterns (truncated from full results via MAX_PATTERNS = 5), each containing {field, pattern, regex, count, sample}.
  • Purpose: Show the LLM the recurring shapes of log messages — aggregate templates, not individual documents.

log_patterns Sampling & Filtering in Detail

The core implementation lives in kbn-ai-tools/src/tools/log_patterns/get_log_patterns.ts.

Step 1: Field Validation

const fieldCapsResponse = await esClient.fieldCaps({
  fields,            // ['message', 'body.text']
  index_filter: { bool: { filter: [...dateRangeQuery(start, end)] } },
  index,
  types: ['text', 'match_only_text'],
});

Only proceeds if at least one of the requested fields exists as a text type in the index.
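As an illustration, the "keep only fields that exist as a text type" check could be expressed like this. The FieldCaps type and helper name are simplified assumptions, not the actual fieldCaps response shape:

```typescript
// Simplified stand-in for the fieldCaps response: field name -> type name
// -> capabilities. The real response nests this under a `fields` key.
type FieldCaps = Record<string, Record<string, { type: string }>>;

// Keep only the requested fields that exist as text or match_only_text.
function textFieldsFrom(fields: FieldCaps, requested: string[]): string[] {
  return requested.filter((name) =>
    Object.values(fields[name] ?? {}).some(
      (cap) => cap.type === 'text' || cap.type === 'match_only_text'
    )
  );
}
```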

Step 2: Sampling Probability Calculation

let samplingProbability = 100_000 / totalHits;
if (samplingProbability >= 0.5) {
  samplingProbability = 1;
}

Targets a 100k document sample. If total docs ≤ 200k, reads everything (probability = 1). Above that, uses ES's random_sampler aggregation to probabilistically downsample (e.g., ~1% for a 10M doc stream).
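The rule from the snippet above, as a worked example:

```typescript
// Sampling-probability rule from Step 2: target ~100k docs; if that works
// out to >= 0.5 (i.e. total <= 200k docs), just read everything.
function samplingProbability(totalHits: number): number {
  let p = 100_000 / totalHits;
  if (p >= 0.5) p = 1;
  return p;
}

samplingProbability(10_000_000); // → 0.01 (sample ~1% of a 10M doc stream)
samplingProbability(150_000); //    → 1    (small stream: full scan)
```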

Step 3: Two-Pass Categorization

Pass 1 — Common patterns (sampled):

  • Runs categorize_text aggregation on the sampled data
  • size: 100 (up to 100 pattern categories)
  • Uses the aiops categorization analyzer (not ML standard tokenizer)
  • Wrapped in a random_sampler aggregation at the calculated probability
  • Finds the most frequent log message templates

Pass 2 — Rare patterns (unsampled, full scan):

  • Takes all patterns from Pass 1 that had ≥ 50 occurrences
  • Constructs must_not match queries for those common patterns
  • Runs a second categorize_text with:
    • samplingProbability: 1 (no sampling — full data scan)
    • size: 1000 (up to 1000 rare pattern categories)
    • ML standard tokenizer (different from Pass 1)
  • Specifically hunts for rare/unusual patterns that common ones would drown out
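The exclusion step in Pass 2 can be sketched as follows. The builder name, Pattern shape, and match-clause details are assumptions; the real construction in get_log_patterns.ts may differ:

```typescript
// Sketch of Pass 2's exclusion filter: Pass 1 patterns with >= 50
// occurrences become must_not clauses, so the second, unsampled
// categorize_text pass only sees messages the common patterns missed.
interface Pattern {
  field: string;
  pattern: string;
  count: number;
}

function buildRarePassFilter(commonPatterns: Pattern[], minCount = 50) {
  return {
    bool: {
      must_not: commonPatterns
        .filter((p) => p.count >= minCount)
        .map((p) => ({
          match: { [p.field]: { query: p.pattern, operator: 'and' as const } },
        })),
    },
  };
}
```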

Step 4: Merge & Dedup

// uniqBy and orderBy are imported from lodash
return uniqBy(
  orderBy(allPatterns.flat(), (pattern) => pattern.count, 'desc'),
  (pattern) => pattern.sample
);

The two passes are merged, deduped by sample text, and sorted by count descending.
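The same merge semantics, written without lodash as a self-contained example:

```typescript
// Equivalent of the lodash orderBy + uniqBy above: sort all patterns by
// count descending, then keep only the first occurrence of each sample
// text. Because the sort runs first, the higher-count duplicate wins.
interface LogPattern {
  sample: string;
  count: number;
}

function mergePatterns(passes: LogPattern[][]): LogPattern[] {
  const sorted = passes.flat().sort((a, b) => b.count - a.count);
  const seen = new Set<string>();
  return sorted.filter((p) => {
    if (seen.has(p.sample)) return false;
    seen.add(p.sample);
    return true;
  });
}
```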

Step 5: Truncation

Back in the generator, only the top 5 patterns are kept:

return {
  patterns: patterns.slice(0, MAX_PATTERNS).map(({ field, pattern, regex, count, sample }) => ({
    pattern, regex, count, sample, field,
  })),
};

Pipeline Summary

| Stage | What Happens |
|---|---|
| Field selection | Only message and body.text, validated via fieldCaps |
| Sampling | random_sampler targeting 100k docs; full scan if ≤ 200k total |
| Pass 1 | categorize_text on sampled data, top 100 patterns, aiops analyzer |
| Pass 2 | Full scan excluding Pass 1 patterns with ≥ 50 hits, ML tokenizer, up to 1000 rare patterns |
| Merge | Combined, deduped by sample, ordered by count descending |
| Truncation | Top 5 patterns kept for the feature |

Comparison: All Three Document-Based Computed Features

| Aspect | log_samples | error_logs | log_patterns |
|---|---|---|---|
| Returns | 5 full raw documents | 5 full raw error documents | 5 pattern templates with counts + regex |
| Filtering | None | log.level:error OR message/body.text contains "error"/"exception" | None |
| Sampling mechanism | random_score query — pick 5 random hits | random_score query — pick 5 random error hits | random_sampler agg targeting 100k docs, then full-scan second pass for rare patterns |
| ES mechanism | Simple search query | Simple search query with filter | Two-pass categorize_text aggregation |
| Fields analyzed | All fields (*) | All fields (*) | Only message and body.text |
| Output shape | Full doc field maps | Full doc field maps | {pattern, regex, count, sample, field} |
| Granularity | Individual documents | Individual documents | Aggregate categories across the dataset |
| What it tells the LLM | "Here's what a document looks like" | "Here's what errors look like" | "These are the recurring message shapes in this stream" |

Key Distinction

log_samples and error_logs are individual document samples selected pseudo-randomly. They show the LLM concrete examples of what the data contains.

log_patterns is an aggregate analysis — it categorizes all (or a sampled subset of) log messages into pattern buckets, counts them, and returns the top templates. It tells the LLM about the statistical distribution of message types.

The sampling approaches are fundamentally different: samples/errors just fetch 5 docs with random scoring (trivial). Log patterns needs to analyze a statistically meaningful portion of the data, hence the random_sampler aggregation targeting 100k docs plus the full-scan second pass to surface rare patterns that would otherwise be invisible under dominant ones.


How Computed Features Feed Into Query Generation

All four computed features are stored alongside LLM-inferred features in .kibana_streams_features-*. When the query generation LLM runs (Phase 2), it retrieves all features for the stream via the get_stream_features tool. Each computed feature includes llmInstructions that tell the LLM how to interpret it:

  • dataset_analysis → Understand the data schema and available fields
  • log_samples → See actual data format and field values
  • error_logs → Understand error patterns and recurring issues
  • log_patterns → Know the common message templates and their frequency

This gives the query generation LLM both high-level understanding (from LLM-inferred features like entities, dependencies, technologies) and concrete data grounding (from computed features) without ever seeing the raw log data directly.


Source File Reference

| File | Package | Purpose |
|---|---|---|
| src/features/computed/index.ts | kbn-streams-ai | Registry, generateAllComputedFeatures() |
| src/features/computed/types.ts | kbn-streams-ai | ComputedFeatureGenerator interface |
| src/features/computed/log_samples.ts | kbn-streams-ai | Log samples generator |
| src/features/computed/error_logs.ts | kbn-streams-ai | Error logs generator |
| src/features/computed/log_patterns.ts | kbn-streams-ai | Log patterns generator |
| src/features/utils/format_raw_document.ts | kbn-streams-ai | Document field flattening/truncation |
| src/tools/log_patterns/get_log_patterns.ts | kbn-ai-tools | Core log pattern analysis (two-pass categorization) |
| src/tools/describe_dataset/get_sample_documents.ts | kbn-ai-tools | getSampleDocuments() with random_score |
| server/lib/significant_events/helpers/analyze_dataset.ts | streams plugin | Dataset analysis + categorization field discovery |
| server/lib/significant_events/helpers/get_log_patterns.ts | streams plugin | Thin wrapper calling @kbn/ai-tools log patterns |
| server/lib/tasks/task_definitions/features_identification.ts | streams plugin | Background task orchestrating all feature extraction |