RunPod Serverless Health Watchdog with Slack Alerts

Prompt

Build me a RunPod serverless endpoint health watchdog that pages my on-call channel in Slack the moment an endpoint degrades. I want this running as a scheduled agent.

Trigger: cron, every 10 minutes.

Integrations: RunPod and Slack Bot.

Each run should do the following:

1. Call RunPod's List Serverless Endpoints to get every serverless endpoint on the account.

2. For each endpoint, call RunPod's Get Serverless Endpoint Health to read worker counts (ready vs initializing vs throttled) and queue stats (in-queue, in-progress, completed, failed, retried).

3. Decide a status for each endpoint: healthy, degraded, or down. Use the rules below as defaults I can tweak, and put the thresholds at the top of the agent instructions so they are easy to edit.

   - down: zero ready workers AND zero initializing workers AND queue depth > 0, for two consecutive checks.
   - degraded: zero ready workers while queue depth > 5, OR failed jobs in the last interval > 10% of the jobs completed in that interval, OR throttled workers > 50% of total workers for two consecutive checks.
   - healthy: everything else.

4. For every endpoint that is degraded or down, write a concise incident summary that names the endpoint, the symptom (e.g. "queue of 47 jobs with 0 ready workers"), and a probable cause (cold start storm, worker crash loop, queue backlog, throttled by max workers, sustained traffic spike). Add a recommended action, such as increasing max workers, raising the execution timeout, purging the queue, or checking the worker image for crash loops.

5. Send one Slack message per incident to my on-call channel using Slack Bot's Send a Message. Expose the channel as a parameter that I set. Format the message clearly: a status emoji and the endpoint name on the first line, then the symptom, probable cause, and recommended action as short bullets. Keep it scannable.

6. Stay silent when every endpoint is healthy. Do not send a "nothing to report" message. The goal is zero alert fatigue.

Nice-to-have refinements:

- Track previous-run state per endpoint so a degraded endpoint that stays degraded does not re-alert every 10 minutes. Send a follow-up only when the status changes (e.g. degraded becomes down, or recovers to healthy).

- Skip disabled or paused endpoints entirely.

- Allow an optional name pattern or tag filter so I can monitor only my production endpoints and ignore experimental ones.

- When the watchdog itself fails to reach RunPod (auth error, 5xx, timeout), post a single "watchdog could not reach RunPod" message to the same Slack channel instead of failing silently.

Audience for the Slack message is the on-call engineer. Keep tone factual and short. No emojis beyond a single status indicator. No marketing copy.

Additional information

What does this prompt do?

- Sweeps every RunPod serverless endpoint on a ten-minute cycle and pulls the latest worker and queue stats.
- Decides whether each endpoint is healthy, degraded, or down using simple rules you can tune.
- Posts a clean incident summary to your on-call Slack channel, naming the endpoint, the symptom, and a probable cause.
- Stays quiet when everything is fine, so the channel only lights up when something actually needs attention.
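
One way to picture a single run is the loop below. It is a minimal sketch, not the agent's actual implementation: `run_once` and its four callables are hypothetical stand-ins for the RunPod and Slack integrations, injected so the control flow can be read and tested without network access.

```python
from typing import Callable, Iterable


def run_once(
    list_endpoints: Callable[[], Iterable[str]],
    get_health: Callable[[str], dict],
    classify: Callable[[dict], str],
    send: Callable[[str], None],
) -> int:
    """Run one watchdog cycle and return the number of Slack messages sent."""
    try:
        # Collect every reading first, so a RunPod failure produces one message.
        readings = [(name, get_health(name)) for name in list_endpoints()]
    except Exception as exc:  # broad on purpose: auth error, 5xx, timeout
        send(f"Watchdog could not reach RunPod: {exc}")
        return 1

    sent = 0
    for name, health in readings:
        status = classify(health)
        if status == "healthy":
            continue  # stay silent: no "nothing to report" message
        send(f"{name} is {status}")
        sent += 1
    return sent
```

Passing the integrations in as plain functions is a design choice for the sketch only; it keeps the "silent when healthy" and "report when RunPod is unreachable" behaviors visible in a few lines.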

What do I need to use this?

- A RunPod account with serverless endpoints you want to monitor, plus a RunPod API key.
- A Slack workspace and a dedicated on-call channel where the watchdog can post alerts.

How can I customize it?

- Change how often it runs, for example every five minutes for critical traffic or every thirty minutes for batch endpoints.
- Tune the rules that count as degraded or down, like queue depth, failure rate, and how long workers can stay throttled.
- Pick which Slack channel the alerts go to, who gets tagged, and how the recommended actions are phrased.
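
To make the tunable rules concrete, here is a minimal sketch of the default thresholds from the prompt expressed as code. The `Thresholds` and `Snapshot` types and their field names are hypothetical, not RunPod's response schema. The sketch assumes that `completed` and `failed` are per-interval counts (differences between two checks), and it reads "two consecutive checks" as applying to the down rule and the throttled rule only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Thresholds:
    degraded_queue_depth: int = 5      # queue depth that counts as a backlog
    max_failure_ratio: float = 0.10    # failed / completed within one interval
    max_throttled_ratio: float = 0.50  # throttled / total workers


@dataclass(frozen=True)
class Snapshot:
    ready: int
    initializing: int
    throttled: int
    total_workers: int
    in_queue: int
    completed: int  # jobs completed since the previous check
    failed: int     # jobs failed since the previous check


def _stalled(s: Snapshot) -> bool:
    """No worker can pick up work, yet jobs are waiting."""
    return s.ready == 0 and s.initializing == 0 and s.in_queue > 0


def _throttled(s: Snapshot, t: Thresholds) -> bool:
    return s.throttled > t.max_throttled_ratio * s.total_workers


def classify(
    current: Snapshot,
    previous: Optional[Snapshot] = None,
    t: Thresholds = Thresholds(),
) -> str:
    """Return "down", "degraded", or "healthy" for one endpoint."""
    if previous is not None and _stalled(current) and _stalled(previous):
        return "down"
    if current.ready == 0 and current.in_queue > t.degraded_queue_depth:
        return "degraded"
    if current.failed > t.max_failure_ratio * current.completed:
        return "degraded"
    if previous is not None and _throttled(current, t) and _throttled(previous, t):
        return "degraded"
    return "healthy"
```

Tuning then means passing a different `Thresholds` value, for example `classify(snapshot, t=Thresholds(degraded_queue_depth=20))` for a batch endpoint that tolerates a longer queue.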

Frequently asked questions

Will I get pinged when everything is healthy?

No. The watchdog stays silent when every endpoint is healthy, so your channel only lights up when something actually needs attention.

Can I monitor just a subset of my endpoints?

Yes. Add a filter so the watchdog only checks endpoints whose name matches a pattern, runs a specific model, or carries a particular tag, and skips everything else.
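
A filter of this kind could look like the sketch below. It assumes each endpoint is a dictionary with a `name` key and an optional `tags` list; both keys and the `select_endpoints` helper are hypothetical, chosen for illustration rather than taken from RunPod's API.

```python
from fnmatch import fnmatchcase
from typing import Optional


def select_endpoints(
    endpoints: list[dict],
    name_pattern: Optional[str] = None,
    required_tag: Optional[str] = None,
) -> list[dict]:
    """Keep endpoints that match the shell-style name pattern and carry the tag."""
    selected = []
    for endpoint in endpoints:
        if name_pattern is not None and not fnmatchcase(endpoint["name"], name_pattern):
            continue  # name does not match, e.g. "exp-foo" against "prod-*"
        if required_tag is not None and required_tag not in endpoint.get("tags", []):
            continue  # tag missing
        selected.append(endpoint)
    return selected
```

With no pattern and no tag, every endpoint passes through, so the filter stays optional.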

What does each alert tell me?

A short summary with the endpoint name, what is wrong (zero ready workers, queue backlog, failed jobs spiking), a probable cause like a cold start storm or worker crash loop, and a recommended next step such as raising max workers or purging the queue.
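
As an illustration of that layout, the sketch below builds the message text. The `format_alert` helper and the emoji chosen for each status are assumptions; the prompt only asks for a single status indicator followed by short bullets.

```python
# Hypothetical mapping from status to a single Slack emoji shortcode.
STATUS_EMOJI = {
    "down": ":red_circle:",
    "degraded": ":warning:",
    "healthy": ":white_check_mark:",
}


def format_alert(status: str, endpoint: str, symptom: str, cause: str, action: str) -> str:
    """Return a four-line alert: headline, then symptom, cause, and action bullets."""
    lines = [
        f"{STATUS_EMOJI[status]} {endpoint} is {status}",
        f"• Symptom: {symptom}",
        f"• Probable cause: {cause}",
        f"• Recommended action: {action}",
    ]
    return "\n".join(lines)


print(format_alert(
    "degraded",
    "prod-llm",
    "queue of 47 jobs with 0 ready workers",
    "cold start storm",
    "increase max workers",
))
```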

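Will it spam me if an endpoint stays down for a while?

Not if you keep the state-tracking refinement from the prompt. The watchdog then remembers each endpoint's previous status and sends a follow-up only when the status changes, for example when degraded becomes down or the endpoint recovers to healthy.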
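
Change-only alerting, the state-tracking refinement listed in the prompt, can be sketched as a comparison between the statuses stored by the previous run and the statuses from the current run. The `status_changes` helper is hypothetical, and treating a never-seen endpoint as previously healthy is an assumption of this sketch.

```python
def status_changes(previous: dict[str, str], current: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Map endpoint id to (old status, new status) for endpoints whose status changed."""
    changes = {}
    for endpoint_id, status in current.items():
        before = previous.get(endpoint_id, "healthy")  # unseen endpoints start as healthy
        if status != before:
            changes[endpoint_id] = (before, status)
    return changes
```

An endpoint that stays degraded across runs produces no entry, so only transitions such as degraded to down, or down back to healthy, reach Slack.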
