RunPod serverless health watchdog with Slack alerts

Check every RunPod serverless endpoint every ten minutes, decide if anything is degraded or down, and ping your on-call Slack channel only when it is.

Agentic Task
RunPod, Slack Bot, Engineering, Operations, Notifications & Alerts, Research & Monitoring

Build me a RunPod serverless endpoint health watchdog that pages my on-call channel in Slack the moment an endpoint degrades. I want this running as a scheduled agent.

Trigger: cron, every 10 minutes.

Integrations: RunPod and Slack Bot.

Each run should do the following:

1. Call RunPod's List Serverless Endpoints to get every serverless endpoint on the account.

2. For each endpoint, call RunPod's Get Serverless Endpoint Health to read worker counts (ready vs initializing vs throttled) and queue stats (in-queue, in-progress, completed, failed, retried).

3. Decide a status for each endpoint: healthy, degraded, or down. Use rules along these lines, and treat them as defaults I can tweak:

   - down = zero ready workers AND zero initializing workers AND queue depth > 0 for two consecutive checks
   - degraded = zero ready workers while queue depth > 5, OR failed jobs in the last interval > 10% of completed jobs in that same interval, OR throttled workers > 50% of total workers for two consecutive checks
   - healthy = everything else

   Make the thresholds easy to edit at the top of the agent instructions.

4. For every endpoint that is degraded or down, write a concise incident summary that names the endpoint, the symptom (e.g. "queue of 47 jobs with 0 ready workers"), and a probable cause (cold start storm, worker crash loop, queue backlog, throttled by max workers, sustained traffic spike). Suggest a recommended action: increase max workers, raise execution timeout, purge the queue, check the worker image for crash loops, etc.

5. Send one Slack message per incident to my on-call channel using Slack Bot's Send a Message. Expose the channel as a parameter that I set. Format the message clearly: a status emoji and endpoint name on the first line, then the symptom, probable cause, and recommended action as short bullets. Keep it scannable.

6. Stay silent when every endpoint is healthy. Do not send a "nothing to report" message. The goal is zero alert fatigue.

Nice-to-have refinements:

- Track previous-run state per endpoint so a degraded endpoint that stays degraded does not re-alert every 10 minutes. Send a follow-up only when the status changes (e.g. degraded becomes down, or recovers to healthy).

- Skip disabled or paused endpoints entirely.

- Allow an optional name pattern or tag filter so I can monitor only my production endpoints and ignore experimental ones.

- When the watchdog itself fails to reach RunPod (auth error, 5xx, timeout), post a single "watchdog could not reach RunPod" message to the same Slack channel instead of failing silently.

Audience for the Slack message is the on-call engineer. Keep tone factual and short. No emojis beyond a single status indicator. No marketing copy.

Additional information

What does this prompt do?
  • Sweeps every RunPod serverless endpoint on a ten-minute cycle and pulls the latest worker and queue stats.
  • Decides whether each endpoint is healthy, degraded, or down using simple rules you can tune.
  • Posts a clean incident summary to your on-call Slack channel, naming the endpoint, the symptom, and a probable cause.
  • Stays quiet when everything is fine, so the channel only lights up when something actually needs attention.
What do I need to use this?
  • A RunPod account with serverless endpoints you want to monitor, plus a RunPod API key.
  • A Slack workspace and a dedicated on-call channel where the watchdog can post alerts.
How can I customize it?
  • Change how often it runs, for example every five minutes for critical traffic or every thirty minutes for batch endpoints.
  • Tune the rules that count as degraded or down, like queue depth, failure rate, and how long workers can stay throttled.
  • Pick which Slack channel the alerts go to, who gets tagged, and how the recommended actions are phrased.
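
To make the tunable rules concrete, here is a minimal Python sketch of the default thresholds from step 3 of the prompt. It is an illustration, not how the agent is implemented: the `Snapshot` fields, the helper names, and the choice to count total workers as ready + initializing + throttled are all assumptions, and the per-interval job counts are assumed to have been computed already by whatever collects the stats.

```python
from dataclasses import dataclass

# Thresholds (defaults taken from the prompt; edit here).
QUEUE_DEPTH_DEGRADED = 5          # zero ready workers with more queued jobs than this
FAILURE_RATIO_DEGRADED = 0.10     # failed / completed within the last interval
THROTTLED_RATIO_DEGRADED = 0.50   # throttled / total workers
CONSECUTIVE_CHECKS = 2            # how many checks a sustained condition must hold


@dataclass
class Snapshot:
    """One health reading for one endpoint (hypothetical field names)."""
    ready: int
    initializing: int
    throttled: int
    in_queue: int
    completed_delta: int  # jobs completed since the previous check
    failed_delta: int     # jobs failed since the previous check


def is_stalled(s):
    # No workers ready or starting up, yet jobs are waiting.
    return s.ready == 0 and s.initializing == 0 and s.in_queue > 0


def is_throttled(s):
    # Assumption: total workers = ready + initializing + throttled.
    total = s.ready + s.initializing + s.throttled
    return total > 0 and s.throttled / total > THROTTLED_RATIO_DEGRADED


def classify(history):
    """Return 'healthy', 'degraded', or 'down' from snapshots ordered oldest first."""
    current = history[-1]
    recent = history[-CONSECUTIVE_CHECKS:]
    sustained = len(recent) == CONSECUTIVE_CHECKS
    if sustained and all(is_stalled(s) for s in recent):
        return "down"
    if current.ready == 0 and current.in_queue > QUEUE_DEPTH_DEGRADED:
        return "degraded"
    if current.failed_delta > FAILURE_RATIO_DEGRADED * current.completed_delta:
        return "degraded"
    if sustained and all(is_throttled(s) for s in recent):
        return "degraded"
    return "healthy"
```

Keeping the constants at the top mirrors the request to make thresholds easy to edit. Raising `CONSECUTIVE_CHECKS` trades slower detection for fewer false alarms.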

Frequently asked questions

Will I get pinged when everything is healthy?
No. The watchdog stays silent when every endpoint is healthy, so your channel only lights up when something actually needs attention.
Can I monitor just a subset of my endpoints?
Yes. Add a filter so the watchdog only checks endpoints whose name matches a pattern, runs a specific model, or carries a particular tag, and skips everything else.
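
As a rough sketch of what such a filter could look like in Python: the endpoint dictionaries with `name` and `tags` keys are a hypothetical shape chosen for illustration, not a documented RunPod response, and `select_endpoints` is an invented helper name.

```python
from fnmatch import fnmatchcase


def select_endpoints(endpoints, name_pattern=None, required_tag=None):
    """Keep only endpoints that match the optional name pattern and tag.

    endpoints: list of dicts with a 'name' key and an optional 'tags' list
    (hypothetical shape). Patterns use shell-style wildcards, e.g. 'prod-*'.
    """
    selected = []
    for ep in endpoints:
        if name_pattern and not fnmatchcase(ep["name"], name_pattern):
            continue
        if required_tag and required_tag not in ep.get("tags", []):
            continue
        selected.append(ep)
    return selected


endpoints = [
    {"name": "prod-llm", "tags": ["production"]},
    {"name": "prod-embeddings", "tags": []},
    {"name": "exp-diffusion", "tags": ["experimental"]},
]
production_only = select_endpoints(endpoints, name_pattern="prod-*")
```

With no pattern and no tag, every endpoint passes through, which matches the default behavior of monitoring the whole account.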
What does each alert tell me?
A short summary with the endpoint name, what is wrong (zero ready workers, queue backlog, failed jobs spiking), a probable cause like a cold start storm or worker crash loop, and a recommended next step such as raising max workers or purging the queue.
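
For illustration, the message layout described in the prompt (status emoji and endpoint name first, then three short bullets) could be assembled like this in Python. The emoji choices and the `format_incident` helper are assumptions; the agent's actual wording may differ.

```python
# Assumed mapping from status to a single Slack emoji shortcode.
STATUS_EMOJI = {
    "degraded": ":warning:",
    "down": ":red_circle:",
    "healthy": ":white_check_mark:",  # used for recovery follow-ups
}


def format_incident(endpoint_name, status, symptom, probable_cause, action):
    """Build the plain-text body for one Slack alert."""
    header = f"{STATUS_EMOJI[status]} {endpoint_name} is {status}"
    bullets = [
        f"• Symptom: {symptom}",
        f"• Probable cause: {probable_cause}",
        f"• Recommended action: {action}",
    ]
    return "\n".join([header, *bullets])


message = format_incident(
    "llm-prod",
    "down",
    "queue of 47 jobs with 0 ready workers",
    "worker crash loop",
    "check the worker image for crash loops",
)
```

The resulting string is what would be passed as the message text to Slack Bot's Send a Message, with the channel supplied separately as a parameter.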
Will it spam me if an endpoint stays down for a while?
You can tune it to suppress repeat alerts for the same endpoint until either the state changes or a quiet period passes, so you get one ping when the incident opens, not one every ten minutes.
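
One way to picture the suppression logic is the small Python sketch below. It assumes the watchdog stores, per endpoint, the last known status and the time of the last alert; the `should_alert` name, the record shape, and the one-hour default quiet period are illustrative choices rather than anything the source specifies.

```python
def should_alert(previous, status, now, quiet_period_s=3600):
    """Decide whether this check should post to Slack.

    previous: None on the first check, otherwise a dict with 'status' and
              'last_alert_at' (epoch seconds) from the prior run (assumed shape).
    status:   'healthy', 'degraded', or 'down' for the current check.
    now:      current time in epoch seconds.
    """
    if previous is None:
        # First sighting: only alert if something is already wrong.
        return status != "healthy"
    if status != previous["status"]:
        # Any transition alerts, including recovery back to healthy.
        return True
    if status == "healthy":
        return False
    # Same unhealthy status as before: re-alert only after the quiet period.
    return now - previous["last_alert_at"] >= quiet_period_s
```

For example, an endpoint that was reported degraded 10 minutes ago and is still degraded produces no new message, while the same endpoint moving to down, or recovering to healthy, does.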
Does this cover RunPod Pods too, or only serverless endpoints?
This watchdog targets RunPod serverless endpoints only, which is where production inference traffic typically runs. Pod-level monitoring would be a separate workflow.

Catch RunPod outages before your users do.

Connect RunPod and Slack once, and Geni checks every endpoint every ten minutes, only pinging on-call when something is actually wrong.