Triage Railway resource alerts before paging anyone
When Railway fires a CPU, memory, or volume alert, an agent investigates the recent deploy and logs, classifies the cause, and posts a clear recommendation to Slack.
Build me an agent workflow that triages Railway resource alerts before anyone gets paged.
Trigger: an incoming webhook from Railway. Railway sends a webhook when one of its resource alerts fires: CPU monitor alerts, RAM monitor alerts, and volume usage alerts (see https://docs.railway.com/guides/webhooks). The webhook payload identifies the project, environment, service, and the metric that crossed its threshold, plus the alert timestamp.
When an alert arrives, the agent should investigate before deciding what to do. Concretely:
1. Parse the webhook payload and pull out the project ID, environment ID, service ID, service name, metric type (CPU, RAM, or volume), threshold, current value, and alert timestamp.
2. Call Railway's List Deployments for that service and environment to see whether the spike correlates with a deploy inside the configured "recent" window before the alert (default one hour). Capture the deploy ID, status, and timestamp of the latest few deployments.
3. For the latest deployment, call Railway's Get Deployment Logs and Get HTTP Logs to gather runtime errors and recent request volume and status codes. Also call Get Environment Logs filtered to recent errors, so the agent has a wider view of what is happening across the environment around the alert time.
4. Classify the spike into exactly one of these buckets, grounded in concrete signals (deploy time vs alert time, error count delta, request volume delta, gradual rise over days), not vibes: deploy-induced regression, traffic spike, slow leak, or noisy threshold.
5. Recommend a concrete next step that matches the classification: roll back to a specific deploy ID, raise the threshold, scale replicas, or ignore. Pick one; do not list options.
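One way to keep step 4 tied to signals is a small rule-based check the agent can run, or compare its own answer against. The sketch below is illustrative only: the `Signals` fields, the 2x request-volume cutoff, and the order of the rules are assumptions made for this example, not part of Railway's API or a required implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Signals:
    """Hypothetical summary of what steps 1-3 gathered."""
    alert_time: datetime
    last_deploy_time: Optional[datetime]  # None if no deployment was found
    error_count_delta: int                # errors after the deploy minus errors before
    request_volume_ratio: float           # current request rate / baseline rate
    rising_for_days: bool                 # metric climbed steadily over the leak window


def classify(s: Signals, recent: timedelta = timedelta(hours=1)) -> str:
    """Return exactly one of the four buckets from step 4."""
    deployed_recently = (
        s.last_deploy_time is not None
        and timedelta(0) <= s.alert_time - s.last_deploy_time <= recent
    )
    # A deploy inside the window plus new errors points at the release.
    if deployed_recently and s.error_count_delta > 0:
        return "deploy-induced regression"
    # Assumed cutoff: request volume at least double the baseline.
    if s.request_volume_ratio >= 2.0:
        return "traffic spike"
    if s.rising_for_days:
        return "slow leak"
    # No supporting signal at all: the threshold itself is the likely problem.
    return "noisy threshold"
```

Because the rules are checked in a fixed order, every alert lands in exactly one bucket, which matches the "exactly one" requirement in step 4.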
Output 1: Post a single message to Slack using the Slack Bot Send a Message action. The message should include the service and environment, a severity tag (P1, P2, or P3, based on production vs non-production and metric type), the classification, the recommended next step, and a short evidence block citing the deploy ID, error count, and request volume delta. Keep it scannable in one screen.
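The message layout could look something like the sketch below. The severity mapping here (non-production is always P3, production RAM and volume alerts are P1, production CPU alerts are P2) is an assumption chosen for illustration, since the prompt only says severity depends on environment and metric type. The function only builds the text; sending it is left to the Slack action.

```python
def severity(is_production: bool, metric: str) -> str:
    """Assumed mapping from environment and metric type to a severity tag."""
    if not is_production:
        return "P3"
    return "P1" if metric in ("RAM", "volume") else "P2"


def format_message(service: str, environment: str, is_production: bool,
                   metric: str, classification: str, next_step: str,
                   deploy_id: str, error_count: int,
                   request_delta_pct: float) -> str:
    """Build a four-line message that fits on one screen."""
    lines = [
        f"[{severity(is_production, metric)}] {service} ({environment}): {metric} alert",
        f"Classification: {classification}",
        f"Next step: {next_step}",
        # Evidence block: deploy ID, error count, signed request volume change.
        f"Evidence: deploy {deploy_id}, {error_count} errors, "
        f"request volume {request_delta_pct:+.0f}%",
    ]
    return "\n".join(lines)
```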
Output 2: If, and only if, the classification is deploy-induced regression or slow leak, also open a Linear issue using Linear Create Issue. The issue title should name the service and the cause, the description should include the log excerpts and the recommended action, and the assignee should come from a configurable service-to-owner mapping that the user sets up once.
Configurable inputs the workflow needs from the user: the Slack channel per environment (production vs non-production), the Linear team to file into, the service-to-owner mapping (service ID or name to Linear user email), and the time window the agent considers "recent" for deploys (default one hour) and for slow leaks (default seven days).
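As a rough picture of those inputs, the sketch below groups them into one settings object. The class name, field names, and the "production" / "non-production" keys are hypothetical; only the two default windows (one hour and seven days) come from the description above.

```python
from dataclasses import dataclass, field
from datetime import timedelta
from typing import Dict, Optional


@dataclass
class TriageConfig:
    slack_channels: Dict[str, str]  # "production" / "non-production" -> channel
    linear_team: str                # Linear team to file issues into
    # Service ID or name -> Linear user email.
    service_owners: Dict[str, str] = field(default_factory=dict)
    recent_deploy_window: timedelta = timedelta(hours=1)  # default: one hour
    slow_leak_window: timedelta = timedelta(days=7)       # default: seven days

    def channel_for(self, is_production: bool) -> str:
        key = "production" if is_production else "non-production"
        return self.slack_channels[key]

    def owner_for(self, service: str) -> Optional[str]:
        # Returns None when the service has no mapped owner.
        return self.service_owners.get(service)
```

Returning `None` for an unmapped service leaves it to the workflow to decide whether to file the issue unassigned or fall back to a default owner.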
Important behavior: post only one Slack message per alert, even if the webhook is retried, and skip filing a Linear issue if an open Linear issue for the same service with the same classification was created in the last 24 hours. Keep the AI classification grounded in real signals from the logs, not guesses.
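Both guards can be expressed as small checks. This is a minimal sketch under stated assumptions: `alert_key` is a hypothetical identifier (for example service ID, metric, and alert timestamp joined together), the `seen` set stands in for whatever persistent store the workflow actually uses, and `open_issues` is assumed to be a list of (service, classification, created_at) tuples already fetched from Linear.

```python
from datetime import datetime, timedelta
from typing import Iterable, Set, Tuple

# Only these classifications open a Linear issue (see Output 2).
FILEABLE = {"deploy-induced regression", "slow leak"}


def should_post(alert_key: str, seen: Set[str]) -> bool:
    """Return True the first time an alert key is seen, False on retries."""
    if alert_key in seen:
        return False
    seen.add(alert_key)
    return True


def should_file_issue(service: str, classification: str, now: datetime,
                      open_issues: Iterable[Tuple[str, str, datetime]],
                      window: timedelta = timedelta(hours=24)) -> bool:
    """Skip filing when a matching open issue was created inside the window."""
    if classification not in FILEABLE:
        return False
    for issue_service, issue_class, created_at in open_issues:
        if (issue_service == service
                and issue_class == classification
                and now - created_at <= window):
            return False
    return True
```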
Additional information
What does this prompt do?
- Catches Railway CPU, memory, and volume usage alerts the moment they fire, so noisy pings never sit in a channel unread.
- Pulls the affected service's recent deploy history and logs, and checks whether the spike lines up with a recent release, a traffic surge, or a slow leak building over days.
- Classifies the alert as a deploy regression, traffic spike, slow leak, or noisy threshold, with concrete evidence pulled from the logs.
- Posts one Slack message tagged with severity and a recommended next step (rollback, scale, raise the threshold, or ignore).
- Opens a Linear issue assigned to the service owner when the cause looks like a real regression or leak, with log excerpts already attached.
What do I need to use this?
- A Railway account with permission to create an API token and configure a resource alert webhook on the services you want monitored.
- A Slack workspace where the bot can post into your incidents or on-call channel.
- A Linear workspace with a team to file issues into, and a simple mapping of services to owners.
- Optional: your preferred severity thresholds and which channel each environment (production, staging) should post to.
How can I customize it?
- Change the service-to-owner mapping so the right engineer gets the Linear issue for each service.
- Adjust which Slack channel receives alerts per environment, or split production and staging into different channels.
- Tune the classifier's bias, for example always page on production volume alerts even if the agent thinks the threshold is noisy.
- Decide whether the agent should auto-trigger a rollback recommendation or stop short of suggesting one.
Frequently asked questions
Will this actually roll back my service automatically?
Which Railway alerts does this cover?
What if the spike is just a traffic surge and not a bug?
Can I send different services to different on-call engineers?
Does it post to Slack for every single alert?
Stop letting Railway alerts sit unread in a channel.
Connect Railway, Slack, and Linear once, and Geni investigates every resource alert with real context before anyone has to look.