Source: Kibana Streams plugin, `kbn-streams-ai` and `kbn-ai-tools` packages. All code paths are relative to `x-pack/platform/` in the Kibana repo.
Feature extraction is Phase 1 of the SigEvents pipeline. It runs two things in parallel:
- LLM-Based Extraction — Sends 20 sample documents to an LLM to extract structured features (entities, infrastructure, technologies, dependencies, schema info).
- Computed Features — Four programmatic (no LLM) feature generators that provide raw data context.
Results are merged, deduplicated, and stored in `.kibana_streams_features-*`. The LLM that generates detection queries in Phase 2 never sees raw log data — it operates entirely on these features.
Feature extraction is a one-shot task, not a recurring job. It runs only when explicitly triggered:
- At onboarding time — The onboarding pipeline (`POST /internal/streams/{streamName}/onboarding/_task`) schedules feature identification, then query generation, sequentially.
- On manual request — Via `POST /internal/streams/{name}/features/_task` (API or UI "regenerate features" action).
There is no cron, no interval, no automatic re-extraction. The Task Manager schedule call means "enqueue to run once," not "run on a recurring basis." Features expire after 7 days if not refreshed, but refreshing requires another explicit trigger.
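As a sketch of the manual trigger, the request can be assembled as below. Only the endpoint path comes from the docs above; the helper name and the `{ from, to }` body shape are assumptions for illustration.

```typescript
// Hypothetical helper: builds the manual "regenerate features" request.
// The endpoint path is documented above; the body shape is an assumption.
interface TimeRange {
  from: string;
  to: string;
}

function buildFeaturesTaskRequest(streamName: string, range: TimeRange) {
  return {
    method: 'POST' as const,
    path: `/internal/streams/${encodeURIComponent(streamName)}/features/_task`,
    body: { from: range.from, to: range.to },
  };
}

const req = buildFeaturesTaskRequest('logs-nginx.access-default', {
  from: '2024-01-01T00:00:00.000Z',
  to: '2024-01-02T00:00:00.000Z',
});
// req.path → '/internal/streams/logs-nginx.access-default/features/_task'
```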
The `from` and `to` parameters are passed by the caller. The frontend UI (`useOnboardingApi` hook) hardcodes:

```ts
// streams_app/public/util/time_range.ts
const ONE_DAY_MS = 24 * 60 * 60 * 1000;

export function getLast24HoursTimeRange(now = Date.now()): AbsoluteTimeRange {
  return {
    from: new Date(now - ONE_DAY_MS).toISOString(),
    to: new Date(now).toISOString(),
    mode: 'absolute',
  };
}
```

So in practice, all feature extraction — including `log_patterns`, `log_samples`, `error_logs`, and the 20 sample docs sent to the LLM — operates on the last 24 hours of data. The API technically accepts any time range, but the UI always sends the last-24h window.
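A quick sanity check of that helper, with the `AbsoluteTimeRange` type inlined so the snippet stands alone:

```typescript
// Copied from the helper above, with AbsoluteTimeRange inlined so the
// snippet is self-contained.
interface AbsoluteTimeRange {
  from: string;
  to: string;
  mode: 'absolute';
}

const ONE_DAY_MS = 24 * 60 * 60 * 1000;

function getLast24HoursTimeRange(now = Date.now()): AbsoluteTimeRange {
  return {
    from: new Date(now - ONE_DAY_MS).toISOString(),
    to: new Date(now).toISOString(),
    mode: 'absolute',
  };
}

// With a fixed "now", the window is exactly 24 hours wide.
const range = getLast24HoursTimeRange(Date.parse('2024-06-15T12:00:00.000Z'));
// range.from → '2024-06-14T12:00:00.000Z'
// range.to   → '2024-06-15T12:00:00.000Z'
```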
All four generators run in parallel via `generateAllComputedFeatures()` in `kbn-streams-ai/src/features/computed/index.ts`.
**dataset_analysis**

- Implementation: `plugins/shared/streams/server/lib/significant_events/helpers/analyze_dataset.ts`
- What it does: Runs `describeDataset()` from `@kbn/ai-tools` to analyze the index mapping. Returns a short dataset summary and discovers which text fields exist.
- Side effect: Determines the `categorizationField` used by `log_patterns` — prefers `message`, falls back to `body.text`, or `undefined` if neither exists as a text field.
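That fallback order can be expressed as a small pure function. The helper name and input shape here are illustrative, not the actual implementation:

```typescript
// Illustrative sketch of the categorization-field fallback described above:
// prefer 'message', fall back to 'body.text', otherwise undefined.
// Input: the set of fields that exist as text fields in the index.
function pickCategorizationField(textFields: Set<string>): string | undefined {
  if (textFields.has('message')) return 'message';
  if (textFields.has('body.text')) return 'body.text';
  return undefined;
}

pickCategorizationField(new Set(['message', 'body.text'])); // → 'message'
pickCategorizationField(new Set(['body.text']));            // → 'body.text'
pickCategorizationField(new Set(['event.original']));       // → undefined
```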
**log_samples**

- Implementation: `kbn-streams-ai/src/features/computed/log_samples.ts`
- What it does: Fetches 5 random raw documents from the stream.
- Sampling: Uses `getSampleDocuments()` with `random_score` function scoring — ES assigns a random score to each doc, sorted descending, so the top 5 are effectively random picks.
- Filtering: None. Any document in the stream within the time range.
- Output: Full flattened field maps (all fields via `*`), arrays truncated to 3 items.
- Purpose: Give the LLM a sense of raw data structure — field names, values, format.
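The random-pick mechanism can be sketched as a `function_score` query with `random_score`. This shows the general ES technique; the exact request `getSampleDocuments()` builds may differ (e.g. added time-range filters):

```typescript
// Minimal sketch of random document sampling via function_score/random_score.
// ES assigns each doc a random score; sorting by _score (the default) makes
// the top `size` hits an effectively random pick. The exact query shape used
// by getSampleDocuments() is not reproduced here.
function buildRandomSampleQuery(size: number, seed?: number) {
  return {
    size, // e.g. 5 for log_samples
    query: {
      function_score: {
        query: { match_all: {} },
        // A seeded random_score needs a field to derive per-doc randomness.
        random_score: seed !== undefined ? { seed, field: '_seq_no' } : {},
      },
    },
  };
}

const body = buildRandomSampleQuery(5);
// body.size → 5
```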
**error_logs**

- Implementation: `kbn-streams-ai/src/features/computed/error_logs.ts`
- What it does: Fetches 5 random error documents from the stream.
- Sampling: Same `random_score` approach as `log_samples`, but with an error filter.
- Filtering: Documents must match at least one of:
  - `log.level: "error"` (exact term match)
  - `message` contains phrase `"error"` or `"exception"`
  - `body.text` contains phrase `"error"` or `"exception"`
- Output: Same full flattened field maps as `log_samples`.
- Purpose: Ensure the LLM sees actual error content even if errors are rare relative to total volume.
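The "match at least one of" filter maps naturally onto a `bool` query with `should` clauses and `minimum_should_match: 1`. A plausible shape under that assumption, not the verbatim implementation:

```typescript
// Sketch of the error filter described above. The clause list mirrors the
// documented conditions; the actual construction in error_logs.ts may differ.
const errorFilter = {
  bool: {
    should: [
      { term: { 'log.level': 'error' } },             // exact term match
      { match_phrase: { message: 'error' } },          // phrase in message
      { match_phrase: { message: 'exception' } },
      { match_phrase: { 'body.text': 'error' } },      // phrase in body.text
      { match_phrase: { 'body.text': 'exception' } },
    ],
    minimum_should_match: 1, // any single clause qualifies the document
  },
};

// errorFilter.bool.should.length → 5
```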
**log_patterns**

- Implementation: `kbn-streams-ai/src/features/computed/log_patterns.ts` → calls `getLogPatterns()` from `@kbn/ai-tools`
- What it does: Identifies common log message pattern templates via Elasticsearch's `categorize_text` aggregation.
- Fields analyzed: Only `message` and `body.text` (validated via `fieldCaps` — skipped if neither exists as `text`/`match_only_text`).
- Output: Top 5 patterns (truncated from full results via `MAX_PATTERNS = 5`), each containing `{field, pattern, regex, count, sample}`.
- Purpose: Show the LLM the recurring shapes of log messages — aggregate templates, not individual documents.
The core implementation lives in `kbn-ai-tools/src/tools/log_patterns/get_log_patterns.ts`.
```ts
const fieldCapsResponse = await esClient.fieldCaps({
  fields, // ['message', 'body.text']
  index_filter: { bool: { filter: [...dateRangeQuery(start, end)] } },
  index,
  types: ['text', 'match_only_text'],
});
```

It only proceeds if at least one of the requested fields exists as a text type in the index.
```ts
let samplingProbability = 100_000 / totalHits;
if (samplingProbability >= 0.5) {
  samplingProbability = 1;
}
```

This targets a 100k-document sample. If total docs ≤ 200k, it reads everything (probability = 1). Above that, it uses ES's `random_sampler` aggregation to probabilistically downsample (e.g., ~1% for a 10M-doc stream).
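The thresholds fall directly out of the formula. A standalone version with a few worked values (the function name is illustrative; the body is the rule quoted above):

```typescript
// Standalone version of the sampling-probability rule quoted above.
function computeSamplingProbability(totalHits: number): number {
  let samplingProbability = 100_000 / totalHits;
  if (samplingProbability >= 0.5) {
    samplingProbability = 1; // at or below 200k docs: read everything
  }
  return samplingProbability;
}

computeSamplingProbability(150_000);    // → 1 (0.66… ≥ 0.5, so full scan)
computeSamplingProbability(200_000);    // → 1 (exactly 0.5 still full scan)
computeSamplingProbability(400_000);    // → 0.25
computeSamplingProbability(10_000_000); // → 0.01 (~1% for a 10M-doc stream)
```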
Pass 1 — Common patterns (sampled):

- Runs a `categorize_text` aggregation on the sampled data
- `size: 100` (up to 100 pattern categories)
- Uses the `aiops` categorization analyzer (not the ML standard tokenizer)
- Wrapped in a `random_sampler` aggregation at the calculated probability
- Finds the most frequent log message templates
Pass 2 — Rare patterns (unsampled, full scan):

- Takes all patterns from Pass 1 that had ≥ 50 occurrences
- Constructs `must_not` match queries for those common patterns
- Runs a second `categorize_text` with:
  - `samplingProbability: 1` (no sampling — full data scan)
  - `size: 1000` (up to 1000 rare pattern categories)
  - The ML standard tokenizer (different from Pass 1)
- Specifically hunts for rare/unusual patterns that common ones would drown out
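The exclusion step can be sketched as filtering Pass 1 results by count and turning each survivor into a match clause. The ≥ 50 threshold and the pattern shape come from the text above; the helper name and exact clause form are illustrative:

```typescript
interface LogPattern {
  field: string;
  pattern: string;
  regex: string;
  count: number;
  sample: string;
}

// Sketch: turn Pass 1 patterns with >= 50 occurrences into must_not
// match clauses so Pass 2 only surfaces the rare remainder.
function buildRarePatternExclusions(pass1Patterns: LogPattern[]) {
  return pass1Patterns
    .filter((p) => p.count >= 50)
    .map((p) => ({ match: { [p.field]: { query: p.pattern } } }));
}

const exclusions = buildRarePatternExclusions([
  { field: 'message', pattern: 'connection accepted from', regex: '', count: 9000, sample: 'x' },
  { field: 'message', pattern: 'slow query', regex: '', count: 12, sample: 'y' },
]);
// exclusions.length → 1 (only the >= 50 pattern is excluded)
```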
```ts
return uniqBy(
  orderBy(allPatterns.flat(), (pattern) => pattern.count, 'desc'),
  (pattern) => pattern.sample
);
```

The two passes are merged, deduped by sample text, and sorted by count descending.
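For readers without lodash at hand, an equivalent of that merge in plain TypeScript, with the same semantics: sort by count descending, then keep the first (highest-count) occurrence per sample text:

```typescript
interface LogPattern {
  count: number;
  sample: string;
}

// Plain-TS equivalent of the lodash orderBy + uniqBy combination above.
function mergePatterns<T extends LogPattern>(passes: T[][]): T[] {
  const sorted = passes.flat().sort((a, b) => b.count - a.count);
  const seen = new Set<string>();
  return sorted.filter((p) => {
    if (seen.has(p.sample)) return false; // dedupe by sample text
    seen.add(p.sample);
    return true;
  });
}

const merged = mergePatterns([
  [{ count: 500, sample: 'GET /health 200' }],
  [
    { count: 3, sample: 'OOM killed process' },
    { count: 500, sample: 'GET /health 200' }, // duplicate sample, dropped
  ],
]);
// merged.length → 2, merged[0].count → 500
```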
Back in the generator, only the top 5 patterns are kept:
```ts
return {
  patterns: patterns.slice(0, MAX_PATTERNS).map(({ field, pattern, regex, count, sample }) => ({
    pattern, regex, count, sample, field,
  })),
};
```

| Stage | What Happens |
|---|---|
| Field selection | Only `message` and `body.text`, validated via `fieldCaps` |
| Sampling | `random_sampler` targeting 100k docs; full scan if ≤ 200k total |
| Pass 1 | `categorize_text` on sampled data, top 100 patterns, `aiops` analyzer |
| Pass 2 | Full scan excluding Pass 1 patterns with ≥ 50 hits, ML tokenizer, up to 1000 rare patterns |
| Merge | Combined, deduped by sample, ordered by count desc |
| Truncation | Top 5 patterns kept for the feature |
| Aspect | log_samples | error_logs | log_patterns |
|---|---|---|---|
| Returns | 5 full raw documents | 5 full raw error documents | 5 pattern templates with counts + regex |
| Filtering | None | `log.level: error` OR `message`/`body.text` contains "error"/"exception" | None |
| Sampling mechanism | `random_score` query — pick 5 random hits | `random_score` query — pick 5 random error hits | `random_sampler` agg targeting 100k docs, then full-scan second pass for rare patterns |
| ES mechanism | Simple search query | Simple search query with filter | Two-pass `categorize_text` aggregation |
| Fields analyzed | All fields (`*`) | All fields (`*`) | Only `message` and `body.text` |
| Output shape | Full doc field maps | Full doc field maps | `{pattern, regex, count, sample, field}` |
| Granularity | Individual documents | Individual documents | Aggregate categories across the dataset |
| What it tells the LLM | "Here's what a document looks like" | "Here's what errors look like" | "These are the recurring message shapes in this stream" |
`log_samples` and `error_logs` are individual document samples selected pseudo-randomly. They show the LLM concrete examples of what the data contains.

`log_patterns` is an aggregate analysis — it categorizes all (or a sampled subset of) log messages into pattern buckets, counts them, and returns the top templates. It tells the LLM about the statistical distribution of message types.

The sampling approaches are fundamentally different: samples/errors just fetch 5 docs with random scoring (trivial). Log patterns needs to analyze a statistically meaningful portion of the data, hence the `random_sampler` aggregation targeting 100k docs plus the full-scan second pass to surface rare patterns that would otherwise be invisible under dominant ones.
All four computed features are stored alongside LLM-inferred features in `.kibana_streams_features-*`. When the query generation LLM runs (Phase 2), it retrieves all features for the stream via the `get_stream_features` tool. Each computed feature includes `llmInstructions` that tell the LLM how to interpret it:
- `dataset_analysis` → Understand the data schema and available fields
- `log_samples` → See actual data format and field values
- `error_logs` → Understand error patterns and recurring issues
- `log_patterns` → Know the common message templates and their frequency
This gives the query generation LLM both high-level understanding (from LLM-inferred features like entities, dependencies, technologies) and concrete data grounding (from computed features) without ever seeing the raw log data directly.
| File | Package | Purpose |
|---|---|---|
| `src/features/computed/index.ts` | kbn-streams-ai | Registry, `generateAllComputedFeatures()` |
| `src/features/computed/types.ts` | kbn-streams-ai | `ComputedFeatureGenerator` interface |
| `src/features/computed/log_samples.ts` | kbn-streams-ai | Log samples generator |
| `src/features/computed/error_logs.ts` | kbn-streams-ai | Error logs generator |
| `src/features/computed/log_patterns.ts` | kbn-streams-ai | Log patterns generator |
| `src/features/utils/format_raw_document.ts` | kbn-streams-ai | Document field flattening/truncation |
| `src/tools/log_patterns/get_log_patterns.ts` | kbn-ai-tools | Core log pattern analysis (two-pass categorization) |
| `src/tools/describe_dataset/get_sample_documents.ts` | kbn-ai-tools | `getSampleDocuments()` with `random_score` |
| `server/lib/significant_events/helpers/analyze_dataset.ts` | streams plugin | Dataset analysis + categorization field discovery |
| `server/lib/significant_events/helpers/get_log_patterns.ts` | streams plugin | Thin wrapper calling `@kbn/ai-tools` log patterns |
| `server/lib/tasks/task_definitions/features_identification.ts` | streams plugin | Background task orchestrating all feature extraction |