Catch Elasticsearch error spikes before users do
Every 15 minutes, scan your Elasticsearch logs for error spikes, post a triage summary in Slack, and file a Linear ticket when a spike crosses the critical threshold.
Build me an agent workflow that watches our Elasticsearch logs for error spikes and pages on-call when something real is going wrong, with no noise when things are calm.
Trigger: a cron that runs every 15 minutes.
Inputs I want exposed at setup:
- The Elasticsearch index pattern that holds my logs (e.g. logs-app-*).
- The field name that identifies the service (e.g. service.name or kubernetes.container.name).
- The field that holds the error signature or message (e.g. error.type, log.message).
- The timestamp field (usually @timestamp).
- The cluster's Kibana base URL, used to build the links in alerts.
- The Slack channel for triage alerts.
- The Linear team for critical issues.
- The spike multiplier (default 3x) that defines what counts as a spike vs the baseline.
- The critical multiplier (default 10x), or an absolute count threshold, that escalates a spike to a Linear ticket.
- An optional list of error patterns to ignore as known-noisy.
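As a rough illustration, the setup inputs could be collected in one validated object. This is only a sketch: the class name `SpikeWatchConfig`, its attribute names, and the placeholder channel, team, and URL values are hypothetical, not part of any product API. Only the 3x and 10x defaults and the example field names come from the text above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SpikeWatchConfig:
    """Hypothetical container for the setup inputs (names are illustrative)."""

    index_pattern: str = "logs-app-*"
    service_field: str = "service.name"
    signature_field: str = "error.type"
    timestamp_field: str = "@timestamp"
    kibana_base_url: str = "https://kibana.example.com"  # placeholder value
    slack_channel: str = "#triage"                       # placeholder value
    linear_team: str = "PLATFORM"                        # placeholder value
    spike_multiplier: float = 3.0       # default from the prompt
    critical_multiplier: float = 10.0   # default from the prompt
    critical_count: Optional[int] = None  # optional absolute threshold
    ignore_patterns: Tuple[str, ...] = ()  # known-noisy patterns

    def __post_init__(self) -> None:
        # Reject settings that would make the thresholds meaningless.
        if self.spike_multiplier <= 0:
            raise ValueError("spike_multiplier must be positive")
        if self.critical_multiplier < self.spike_multiplier:
            raise ValueError("critical_multiplier must be >= spike_multiplier")
        if self.critical_count is not None and self.critical_count < 0:
            raise ValueError("critical_count must not be negative")
```

Validating at setup time means a mistyped multiplier fails once, up front, instead of silently suppressing alerts on every run.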
What the agent should do on each run, in order:
1) Use Elasticsearch Search Documents (Query Index) on the configured index pattern, scoped to the last 15 minutes and filtered to error-level events (log.level: error or severity >= error, configurable). Run a terms aggregation grouped first by the service field, then nested by a normalized error signature (use the configured signature field; if that field holds a long message, bucket by a stable prefix or hash so similar messages collapse together). Include a top_hits sub-aggregation per bucket to grab 2-3 sample log lines so the alert has real text in it. Skip any pattern that matches my known-noisy ignore list.
2) Run a second Elasticsearch Search Documents call against the same index pattern, but for the 6 hours immediately before the current window, with the same aggregation. This gives you a per-(service, signature) baseline count. From those two responses, compute the per-minute rate for the recent window and the per-minute rate for the baseline window. For each (service, signature) bucket: if it did not exist in the baseline at all, treat it as a brand new error. Otherwise compute the multiplier (recent rate / baseline rate). Anything at or above the spike multiplier counts as a spike; anything at or above the critical multiplier (or above an absolute count threshold) counts as critical.
3) If nothing crosses threshold and nothing is brand new, do nothing. No Slack message, no Linear ticket. Quiet is the default.
4) For each spike or new error, draft a short triage summary the agent writes itself (do not paste raw JSON). It should name the affected service, describe the error pattern in plain language, include 2-3 sample log lines verbatim, state the spike magnitude ("5x baseline" or "new pattern, 42 events in 15 minutes"), and include a Kibana link that opens a Discover search for that service and signature over the last hour (build the link from the cluster's Kibana base URL, which I'll provide at setup, with the index pattern and a KQL filter).
5) Post one Slack message via Send a Message (Slack Bot) to the configured channel. If multiple signatures are spiking, group them into one message with a short header ("3 services spiking") and one section per spike. Use Slack mrkdwn formatting, not standard markdown.
6) For any spike that crosses the critical threshold, also open a Linear issue via Create Issue in the configured team. Title = the error signature plus the service name, kept short. Body = the same triage detail as the Slack message plus the Kibana link, in markdown. Set priority to Urgent (1) for critical spikes. Do not open a new issue if an open Linear issue in that team, created within the last 24 hours, already has a title matching the same signature; comment on it or skip instead, so we don't duplicate tickets across runs.
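The rate comparison in steps 2 and 3 can be sketched as a small pure function. This is a minimal illustration, assuming the two aggregation responses have already been flattened into count dictionaries keyed by (service, signature). The function name `classify` and the `min_new_events` floor are assumptions added here; the floor is one possible way to avoid flagging a single stray event as a brand new error.

```python
from typing import Dict, Optional, Tuple

Key = Tuple[str, str]  # (service, signature)

RECENT_MINUTES = 15        # the current window
BASELINE_MINUTES = 6 * 60  # the 6 hours immediately before it


def classify(
    recent: Dict[Key, int],
    baseline: Dict[Key, int],
    spike_multiplier: float = 3.0,
    critical_multiplier: float = 10.0,
    critical_count: Optional[int] = None,
    min_new_events: int = 5,  # assumption: floor for "brand new" alerts
) -> Dict[Key, str]:
    """Label each recent bucket as 'new', 'spike', or 'critical'.

    Buckets that cross no threshold are omitted, so an empty result
    means the run should stay silent.
    """
    findings: Dict[Key, str] = {}
    for key, count in recent.items():
        over_count = critical_count is not None and count > critical_count
        base = baseline.get(key, 0)
        if base == 0:
            # Not seen in the baseline at all: a brand new error.
            if over_count:
                findings[key] = "critical"
            elif count >= min_new_events:
                findings[key] = "new"
            continue
        # Compare per-minute rates so the unequal window sizes cancel out.
        recent_rate = count / RECENT_MINUTES
        baseline_rate = base / BASELINE_MINUTES
        ratio = recent_rate / baseline_rate
        if over_count or ratio >= critical_multiplier:
            findings[key] = "critical"
        elif ratio >= spike_multiplier:
            findings[key] = "spike"
    return findings
```

For example, 30 events in the recent window against 144 in the baseline is 2.0 per minute versus 0.4 per minute, which is 5x baseline: a spike, but not critical at the default thresholds.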
Other behavior I care about: be conservative about flagging brand new errors during very low-traffic periods (a single event at 3am is not a spike); when in doubt, prefer silence; and keep the Slack message under ~25 lines so the channel stays readable.
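The grouped Slack message and its line cap can be sketched as a plain string builder. This is an illustration under assumptions: the function name `build_slack_message` and the dictionary keys (`service`, `summary`, `magnitude`, `samples`, `kibana_url`) are hypothetical, and sending the text through the Slack integration is left out. The formatting uses Slack mrkdwn (`*bold*`, `_italic_`, `<url|label>` links).

```python
from typing import Dict, List

MAX_LINES = 25  # keep the channel readable


def build_slack_message(spikes: List[Dict], max_lines: int = MAX_LINES) -> str:
    """Group all spikes into one mrkdwn message capped at max_lines lines."""
    if not spikes:
        return ""  # nothing to report: stay silent
    services = {s["service"] for s in spikes}
    plural = "s" if len(services) != 1 else ""
    lines = [f"*{len(services)} service{plural} spiking*"]
    for i, spike in enumerate(spikes):
        # Collapse whitespace so each sample occupies exactly one line.
        samples = [" ".join(x.split()) for x in spike["samples"][:3]]
        section = [
            f"*{spike['service']}*: {spike['summary']} ({spike['magnitude']})",
            *[f"> `{sample}`" for sample in samples],
            f"<{spike['kibana_url']}|Open in Kibana>",
        ]
        remaining = len(spikes) - i  # sections not yet written, this one included
        # Reserve one line for the overflow note while more sections follow.
        reserve = 1 if remaining > 1 else 0
        if len(lines) + len(section) + reserve > max_lines:
            lines.append(f"_...and {remaining} more not shown_")
            break
        lines.extend(section)
    return "\n".join(lines)
```

Truncating whole sections, rather than cutting the text at line 25, keeps every spike that is shown complete, with its samples and link intact.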
Additional information
What does this prompt do?
- Sweeps your Elasticsearch logs every 15 minutes for error events grouped by service and error pattern.
- Compares the latest window to a rolling six-hour baseline so brand new errors and real spikes get caught, not background noise.
- Posts a clean Slack triage note with the affected service, the error pattern, sample log lines, and how big the spike is.
- Opens a Linear ticket when a spike crosses a critical threshold, with the same triage detail plus a Kibana link so on-call has something to track.
- Stays silent when nothing crosses threshold, so your alert channel stays signal only.
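For readers curious what the sweep looks like underneath, here is a sketch of the kind of request body such a search might use: a filtered query with nested terms aggregations and a top_hits sub-aggregation. The function name `build_spike_query` and the aggregation names are hypothetical, the level filter assumes a `log.level` field, and terms aggregations generally need keyword-mapped fields, so treat this as a starting point and not the query the agent actually issues.

```python
from typing import Any, Dict


def build_spike_query(
    service_field: str,
    signature_field: str,
    timestamp_field: str,
    gte: str,
    lt: str,
    level_field: str = "log.level",  # assumption: adjust to your schema
    level_value: str = "error",
    samples: int = 3,
    max_buckets: int = 50,
) -> Dict[str, Any]:
    """Build a search body that counts errors per (service, signature)."""
    return {
        "size": 0,  # only the aggregations are needed, not the hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {level_field: level_value}},
                    {"range": {timestamp_field: {"gte": gte, "lt": lt}}},
                ]
            }
        },
        "aggs": {
            "by_service": {
                "terms": {"field": service_field, "size": max_buckets},
                "aggs": {
                    "by_signature": {
                        "terms": {"field": signature_field, "size": max_buckets},
                        "aggs": {
                            # A few recent documents per bucket for the alert text.
                            "samples": {
                                "top_hits": {
                                    "size": samples,
                                    "sort": [{timestamp_field: {"order": "desc"}}],
                                }
                            }
                        },
                    }
                },
            }
        },
    }


# Current window, then the six hours immediately before it (15 + 360 minutes).
recent_body = build_spike_query(
    "service.name", "error.type", "@timestamp", "now-15m", "now"
)
baseline_body = build_spike_query(
    "service.name", "error.type", "@timestamp", "now-375m", "now-15m"
)
```

Using the same builder for both windows keeps the bucket keys identical, which is what makes the recent and baseline counts directly comparable.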
What do I need to use this?
- An Elasticsearch cluster (Elastic Cloud or self-hosted) and a read-only API key for the index that holds your logs.
- A Slack workspace and the channel you want alerts posted to.
- A Linear workspace and the team where critical issues should land.
- A rough idea of which field in your logs names the service and which one carries the error message or signature.
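If the field you pick holds long free-text messages instead of a short error type, similar messages need to collapse into one signature before they are counted. One possible normalization is sketched below: mask the variable parts, keep a stable prefix, and hash it. The function name `normalize_signature`, the masking rules, and the 80-character prefix are all assumptions chosen for illustration.

```python
import hashlib
import re
from typing import Tuple

_UUID = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"
)
_HEX = re.compile(r"0x[0-9a-f]+")
_DIGITS = re.compile(r"\d+")
_SPACES = re.compile(r"\s+")


def normalize_signature(message: str, prefix_len: int = 80) -> Tuple[str, str]:
    """Return (readable prefix, short hash) for a raw log message."""
    text = message.strip().lower()
    # Mask UUIDs and hex literals first, since they contain digits too.
    text = _UUID.sub("<uuid>", text)
    text = _HEX.sub("<hex>", text)
    text = _DIGITS.sub("<n>", text)
    text = _SPACES.sub(" ", text)
    prefix = text[:prefix_len]
    digest = hashlib.sha1(prefix.encode("utf-8")).hexdigest()[:12]
    return prefix, digest
```

With this sketch, "Timeout after 3000 ms for user 42" and "Timeout after 1500 ms for user 7" both normalize to the prefix "timeout after <n> ms for user <n>" and therefore share one hash.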
How can I customize it?
- Change how often it runs (every 5 minutes for noisy systems, every 30 for quieter ones) and the size of the comparison window.
- Tune the spike multiplier and the critical threshold that decides when a Linear ticket gets filed instead of just a Slack note.
- Pick which Slack channel gets the routine alerts and which Linear team owns the critical tickets.
- Add a noisy-pattern ignore list so known harmless errors stop paging the channel.
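Filtering out known-noisy patterns can be as simple as dropping matching buckets before any thresholds are checked. The sketch below assumes the patterns are regular expressions matched case-insensitively against the error signature; the function names are hypothetical, and plain substring matching would work just as well if that is how you prefer to write the patterns.

```python
import re
from typing import Dict, Iterable, List, Pattern, Tuple

Key = Tuple[str, str]  # (service, signature)


def compile_ignore_patterns(patterns: Iterable[str]) -> List[Pattern[str]]:
    """Compile the configured known-noisy patterns once per run."""
    return [re.compile(p, re.IGNORECASE) for p in patterns]


def drop_noisy(counts: Dict[Key, int], ignore: List[Pattern[str]]) -> Dict[Key, int]:
    """Remove buckets whose signature matches any ignore pattern."""
    return {
        key: count
        for key, count in counts.items()
        if not any(rx.search(key[1]) for rx in ignore)
    }


ignore = compile_ignore_patterns([r"healthcheck", r"^client disconnected"])
counts = {
    ("api", "Healthcheck timeout"): 120,
    ("api", "Client disconnected early"): 40,
    ("api", "Database connection refused"): 12,
}
filtered = drop_noisy(counts, ignore)  # only the database error remains
```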
Frequently asked questions
Will this alert on every single error in my logs?
Do I need Elastic Watcher or Kibana Alerting set up first?
What happens during a quiet 15 minutes when nothing is spiking?
How does it decide when to open a Linear ticket instead of just a Slack message?
Can I point it at multiple log indices or just one?
What if my logs use different field names for service and error message?
Stop finding out about error spikes from a customer ticket.
Connect Elasticsearch, Slack, and Linear once, and Geni watches your logs every 15 minutes so on-call hears about real problems first.