Source: Kibana Streams plugin, `kbn-streams-ai` and `kbn-ai-tools` packages. All code paths are relative to `x-pack/platform/` in the Kibana repo.
Feature extraction is Phase 1 of the SigEvents pipeline. It runs two things in parallel:
- LLM-Based Extraction — Sends 20 sample documents to an LLM to extract structured features (entities, infrastructure, technologies, dependencies, schema info).
- Computed Features — Four programmatic (no LLM) feature generators that provide raw data context.
Results are merged, deduplicated, and stored in `.kibana_streams_features-*`. The LLM that generates detection queries in Phase 2 never sees raw log data — it operates entirely on these features.
Feature extraction is a one-shot task, not a recurring job. It runs only when explicitly triggered:
- At onboarding time — The onboarding pipeline (`POST /internal/streams/{streamName}/onboarding/_task`) schedules feature identification, then query generation, sequentially.
- On manual request — Via `POST /internal/streams/{name}/features/_task` (API or UI "regenerate features" action).
There is no cron, no interval, no automatic re-extraction. The Task Manager schedule call means "enqueue to run once," not "run on a recurring basis." Features expire after 7 days if not refreshed, but refreshing requires another explicit trigger.
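As a sketch of the manual trigger, the request can be assembled as below. Only the endpoint path comes from the docs above; the helper name and the `{ from, to }` body shape are assumptions for illustration.

```typescript
// Hypothetical helper: builds the manual "regenerate features" request.
// The endpoint path is documented above; the body shape is an assumption.
interface TimeRange {
  from: string;
  to: string;
}

function buildFeaturesTaskRequest(streamName: string, range: TimeRange) {
  return {
    method: 'POST' as const,
    path: `/internal/streams/${encodeURIComponent(streamName)}/features/_task`,
    body: { from: range.from, to: range.to },
  };
}

const req = buildFeaturesTaskRequest('logs-nginx.access-default', {
  from: '2024-01-01T00:00:00.000Z',
  to: '2024-01-02T00:00:00.000Z',
});
// req.path → '/internal/streams/logs-nginx.access-default/features/_task'
```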
The `from` and `to` parameters are passed by the caller. The frontend UI (`useOnboardingApi` hook) hardcodes:

```ts
// streams_app/public/util/time_range.ts
const ONE_DAY_MS = 24 * 60 * 60 * 1000;

export function getLast24HoursTimeRange(now = Date.now()): AbsoluteTimeRange {
  return {
    from: new Date(now - ONE_DAY_MS).toISOString(),
    to: new Date(now).toISOString(),
    mode: 'absolute',
  };
}
```

So in practice, all feature extraction — including `log_patterns`, `log_samples`, `error_logs`, and the 20 sample docs sent to the LLM — operates on the last 24 hours of data. The API technically accepts any time range, but the UI always sends the last-24h window.
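A quick sanity check of that helper, with the `AbsoluteTimeRange` type inlined so the snippet stands alone:

```typescript
// Copied from the helper above, with AbsoluteTimeRange inlined so the
// snippet is self-contained.
interface AbsoluteTimeRange {
  from: string;
  to: string;
  mode: 'absolute';
}

const ONE_DAY_MS = 24 * 60 * 60 * 1000;

function getLast24HoursTimeRange(now = Date.now()): AbsoluteTimeRange {
  return {
    from: new Date(now - ONE_DAY_MS).toISOString(),
    to: new Date(now).toISOString(),
    mode: 'absolute',
  };
}

// With a fixed "now", the window is exactly 24 hours wide.
const range = getLast24HoursTimeRange(Date.parse('2024-06-15T12:00:00.000Z'));
// range.from → '2024-06-14T12:00:00.000Z'
// range.to   → '2024-06-15T12:00:00.000Z'
```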
All four generators run in parallel via `generateAllComputedFeatures()` in `kbn-streams-ai/src/features/computed/index.ts`.
**dataset_analysis**

- Implementation: `plugins/shared/streams/server/lib/significant_events/helpers/analyze_dataset.ts`
- What it does: Runs `describeDataset()` from `@kbn/ai-tools` to analyze the index mapping. Returns a short dataset summary and discovers which text fields exist.
- Side effect: Determines the `categorizationField` used by `log_patterns` — prefers `message`, falls back to `body.text`, or `undefined` if neither exists as a text field.
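That fallback order can be expressed as a small pure function. The helper name and input shape here are illustrative, not the actual implementation:

```typescript
// Illustrative sketch of the categorization-field fallback described above:
// prefer 'message', fall back to 'body.text', otherwise undefined.
// Input: the set of fields that exist as text fields in the index.
function pickCategorizationField(textFields: Set<string>): string | undefined {
  if (textFields.has('message')) return 'message';
  if (textFields.has('body.text')) return 'body.text';
  return undefined;
}

pickCategorizationField(new Set(['message', 'body.text'])); // → 'message'
pickCategorizationField(new Set(['body.text']));            // → 'body.text'
pickCategorizationField(new Set(['event.original']));       // → undefined
```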
**log_samples**

- Implementation: `kbn-streams-ai/src/features/computed/log_samples.ts`
- What it does: Fetches 5 random raw documents from the stream.
- Sampling: Uses `getSampleDocuments()` with `random_score` function scoring — ES assigns a random score to each doc, sorted descending, so the top 5 are effectively random picks.
- Filtering: None. Any document in the stream within the time range.
- Output: Full flattened field maps (all fields via `*`), arrays truncated to 3 items.
- Purpose: Give the LLM a sense of raw data structure — field names, values, format.
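The random-pick mechanism can be sketched as a `function_score` query with `random_score`. This shows the general ES technique; the exact request `getSampleDocuments()` builds may differ (e.g. added time-range filters):

```typescript
// Minimal sketch of random document sampling via function_score/random_score.
// ES assigns each doc a random score; sorting by _score (the default) makes
// the top `size` hits an effectively random pick. The exact query shape used
// by getSampleDocuments() is not reproduced here.
function buildRandomSampleQuery(size: number, seed?: number) {
  return {
    size, // e.g. 5 for log_samples
    query: {
      function_score: {
        query: { match_all: {} },
        // A seeded random_score needs a field to derive per-doc randomness.
        random_score: seed !== undefined ? { seed, field: '_seq_no' } : {},
      },
    },
  };
}

const body = buildRandomSampleQuery(5);
// body.size → 5
```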
**error_logs**

- Implementation: `kbn-streams-ai/src/features/computed/error_logs.ts`
- What it does: Fetches 5 random error documents from the stream.
- Sampling: Same `random_score` approach as `log_samples`, but with an error filter.
- Filtering: Documents must match at least one of:
  - `log.level: "error"` (exact term match)
  - `message` contains phrase `"error"` or `"exception"`
  - `body.text` contains phrase `"error"` or `"exception"`
- Output: Same full flattened field maps as `log_samples`.
- Purpose: Ensure the LLM sees actual error content even if errors are rare relative to total volume.
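The "match at least one of" filter maps naturally onto a `bool` query with `should` clauses and `minimum_should_match: 1`. A plausible shape under that assumption, not the verbatim implementation:

```typescript
// Sketch of the error filter described above. The clause list mirrors the
// documented conditions; the actual construction in error_logs.ts may differ.
const errorFilter = {
  bool: {
    should: [
      { term: { 'log.level': 'error' } },             // exact term match
      { match_phrase: { message: 'error' } },          // phrase in message
      { match_phrase: { message: 'exception' } },
      { match_phrase: { 'body.text': 'error' } },      // phrase in body.text
      { match_phrase: { 'body.text': 'exception' } },
    ],
    minimum_should_match: 1, // any single clause qualifies the document
  },
};

// errorFilter.bool.should.length → 5
```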
**log_patterns**

- Implementation: `kbn-streams-ai/src/features/computed/log_patterns.ts` → calls `getLogPatterns()` from `@kbn/ai-tools`
- What it does: Identifies common log message pattern templates via Elasticsearch's `categorize_text` aggregation.
- Fields analyzed: Only `message` and `body.text` (validated via `fieldCaps` — skipped if neither exists as `text`/`match_only_text`).
- Output: Top 5 patterns (truncated from full results via `MAX_PATTERNS = 5`), each containing `{field, pattern, regex, count, sample}`.
- Purpose: Show the LLM the recurring shapes of log messages — aggregate templates, not individual documents.
The core implementation lives in `kbn-ai-tools/src/tools/log_patterns/get_log_patterns.ts`.
```ts
const fieldCapsResponse = await esClient.fieldCaps({
  fields, // ['message', 'body.text']
  index_filter: { bool: { filter: [...dateRangeQuery(start, end)] } },
  index,
  types: ['text', 'match_only_text'],
});
```

It only proceeds if at least one of the requested fields exists as a text type in the index.
```ts
let samplingProbability = 100_000 / totalHits;
if (samplingProbability >= 0.5) {
  samplingProbability = 1;
}
```

This targets a 100k-document sample. If total docs ≤ 200k, it reads everything (probability = 1). Above that, it uses ES's `random_sampler` aggregation to probabilistically downsample (e.g., ~1% for a 10M-doc stream).
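The thresholds fall directly out of the formula. A standalone version with a few worked values (the function name is illustrative; the body is the rule quoted above):

```typescript
// Standalone version of the sampling-probability rule quoted above.
function computeSamplingProbability(totalHits: number): number {
  let samplingProbability = 100_000 / totalHits;
  if (samplingProbability >= 0.5) {
    samplingProbability = 1; // at or below 200k docs: read everything
  }
  return samplingProbability;
}

computeSamplingProbability(150_000);    // → 1 (0.66… ≥ 0.5, so full scan)
computeSamplingProbability(200_000);    // → 1 (exactly 0.5 still full scan)
computeSamplingProbability(400_000);    // → 0.25
computeSamplingProbability(10_000_000); // → 0.01 (~1% for a 10M-doc stream)
```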
Pass 1 — Common patterns (sampled):

- Runs a `categorize_text` aggregation on the sampled data
- `size: 100` (up to 100 pattern categories)
- Uses the `aiops` categorization analyzer (not the ML standard tokenizer)
- Wrapped in a `random_sampler` aggregation at the calculated probability
- Finds the most frequent log message templates
Pass 2 — Rare patterns (unsampled, full scan):

- Takes all patterns from Pass 1 that had ≥ 50 occurrences
- Constructs `must_not` match queries for those common patterns
- Runs a second `categorize_text` with:
  - `samplingProbability: 1` (no sampling — full data scan)
  - `size: 1000` (up to 1000 rare pattern categories)
  - The ML standard tokenizer (different from Pass 1)
- Specifically hunts for rare/unusual patterns that common ones would drown out
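The exclusion step can be sketched as filtering Pass 1 results by count and turning each survivor into a match clause. The ≥ 50 threshold and the pattern shape come from the text above; the helper name and exact clause form are illustrative:

```typescript
interface LogPattern {
  field: string;
  pattern: string;
  regex: string;
  count: number;
  sample: string;
}

// Sketch: turn Pass 1 patterns with >= 50 occurrences into must_not
// match clauses so Pass 2 only surfaces the rare remainder.
function buildRarePatternExclusions(pass1Patterns: LogPattern[]) {
  return pass1Patterns
    .filter((p) => p.count >= 50)
    .map((p) => ({ match: { [p.field]: { query: p.pattern } } }));
}

const exclusions = buildRarePatternExclusions([
  { field: 'message', pattern: 'connection accepted from', regex: '', count: 9000, sample: 'x' },
  { field: 'message', pattern: 'slow query', regex: '', count: 12, sample: 'y' },
]);
// exclusions.length → 1 (only the >= 50 pattern is excluded)
```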
```ts
return uniqBy(
  orderBy(allPatterns.flat(), (pattern) => pattern.count, 'desc'),
  (pattern) => pattern.sample
);
```

The two passes are merged, deduped by sample text, and sorted by count descending.
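For readers without lodash at hand, an equivalent of that merge in plain TypeScript, with the same semantics: sort by count descending, then keep the first (highest-count) occurrence per sample text:

```typescript
interface LogPattern {
  count: number;
  sample: string;
}

// Plain-TS equivalent of the lodash orderBy + uniqBy combination above.
function mergePatterns<T extends LogPattern>(passes: T[][]): T[] {
  const sorted = passes.flat().sort((a, b) => b.count - a.count);
  const seen = new Set<string>();
  return sorted.filter((p) => {
    if (seen.has(p.sample)) return false; // dedupe by sample text
    seen.add(p.sample);
    return true;
  });
}

const merged = mergePatterns([
  [{ count: 500, sample: 'GET /health 200' }],
  [
    { count: 3, sample: 'OOM killed process' },
    { count: 500, sample: 'GET /health 200' }, // duplicate sample, dropped
  ],
]);
// merged.length → 2, merged[0].count → 500
```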
Back in the generator, only the top 5 patterns are kept:
```ts
return {
  patterns: patterns.slice(0, MAX_PATTERNS).map(({ field, pattern, regex, count, sample }) => ({
    pattern, regex, count, sample, field,
  })),
};
```

| Stage | What Happens |
|---|---|
| Field selection | Only `message` and `body.text`, validated via `fieldCaps` |
| Sampling | `random_sampler` targeting 100k docs; full scan if ≤ 200k total |
| Pass 1 | `categorize_text` on sampled data, top 100 patterns, `aiops` analyzer |
| Pass 2 | Full scan excluding Pass 1 patterns with ≥ 50 hits, ML tokenizer, up to 1000 rare patterns |
| Merge | Combined, deduped by sample, ordered by count desc |
| Truncation | Top 5 patterns kept for the feature |
| Aspect | log_samples | error_logs | log_patterns |
|---|---|---|---|
| Returns | 5 full raw documents | 5 full raw error documents | 5 pattern templates with counts + regex |
| Filtering | None | `log.level: error` OR `message`/`body.text` contains "error"/"exception" | None |
| Sampling mechanism | `random_score` query — pick 5 random hits | `random_score` query — pick 5 random error hits | `random_sampler` agg targeting 100k docs, then full-scan second pass for rare patterns |
| ES mechanism | Simple search query | Simple search query with filter | Two-pass `categorize_text` aggregation |
| Fields analyzed | All fields (`*`) | All fields (`*`) | Only `message` and `body.text` |
| Output shape | Full doc field maps | Full doc field maps | `{pattern, regex, count, sample, field}` |
| Granularity | Individual documents | Individual documents | Aggregate categories across the dataset |
| What it tells the LLM | "Here's what a document looks like" | "Here's what errors look like" | "These are the recurring message shapes in this stream" |
`log_samples` and `error_logs` are individual document samples selected pseudo-randomly. They show the LLM concrete examples of what the data contains.

`log_patterns` is an aggregate analysis — it categorizes all (or a sampled subset of) log messages into pattern buckets, counts them, and returns the top templates. It tells the LLM about the statistical distribution of message types.

The sampling approaches are fundamentally different: samples/errors just fetch 5 docs with random scoring (trivial). Log patterns needs to analyze a statistically meaningful portion of the data, hence the `random_sampler` aggregation targeting 100k docs plus the full-scan second pass to surface rare patterns that would otherwise be invisible under dominant ones.
All four computed features are stored alongside LLM-inferred features in `.kibana_streams_features-*`. When the query generation LLM runs (Phase 2), it retrieves all features for the stream via the `get_stream_features` tool. Each computed feature includes `llmInstructions` that tell the LLM how to interpret it:
- `dataset_analysis` → Understand the data schema and available fields
- `log_samples` → See actual data format and field values
- `error_logs` → Understand error patterns and recurring issues
- `log_patterns` → Know the common message templates and their frequency
This gives the query generation LLM both high-level understanding (from LLM-inferred features like entities, dependencies, technologies) and concrete data grounding (from computed features) without ever seeing the raw log data directly.
| File | Package | Purpose |
|---|---|---|
| `src/features/computed/index.ts` | kbn-streams-ai | Registry, `generateAllComputedFeatures()` |
| `src/features/computed/types.ts` | kbn-streams-ai | `ComputedFeatureGenerator` interface |
| `src/features/computed/log_samples.ts` | kbn-streams-ai | Log samples generator |
| `src/features/computed/error_logs.ts` | kbn-streams-ai | Error logs generator |
| `src/features/computed/log_patterns.ts` | kbn-streams-ai | Log patterns generator |
| `src/features/utils/format_raw_document.ts` | kbn-streams-ai | Document field flattening/truncation |
| `src/tools/log_patterns/get_log_patterns.ts` | kbn-ai-tools | Core log pattern analysis (two-pass categorization) |
| `src/tools/describe_dataset/get_sample_documents.ts` | kbn-ai-tools | `getSampleDocuments()` with `random_score` |
| `server/lib/significant_events/helpers/analyze_dataset.ts` | streams plugin | Dataset analysis + categorization field discovery |
| `server/lib/significant_events/helpers/get_log_patterns.ts` | streams plugin | Thin wrapper calling `@kbn/ai-tools` log patterns |
| `server/lib/tasks/task_definitions/features_identification.ts` | streams plugin | Background task orchestrating all feature extraction |