Lessons Learned: The Time I Nuked Dev
Let’s set the scene: You’re working on something exciting. You’ve got your skaffold dev session running. You’re deep in the zone. Everything’s humming. And then—just as you’re ready to stop—your muscle memory kicks in:
Ctrl + C
And then… The logs keep going. The pods are spinning down. The infra is disappearing.
Wait. No. NO. You weren’t in local mode. You were pointing at one of the shared Dev environments.
Congratulations. You just accidentally undeployed half the stack in Dev.
Yep. That was me. And from the hushed Slack messages I got afterward, I learned I wasn’t the first to do it. Apparently, this was tradition.
The Context: Three Dev Environments, One Human Brain
In our setup, we maintain three distinct Dev environments. Each is used by different teams for testing features, running demos, or verifying bug fixes. They share infrastructure, services, and, unfortunately… developers with default kubeconfigs.
Kubernetes’ kubectl uses the current context defined in your ~/.kube/config file. If you’ve ever run kubectl config use-context <some-env> and forgotten to switch back—well, you’re already sweating.
Here’s the thing: unless you’ve hardened your setup, there’s nothing stopping a dev from running skaffold dev or some other destructive command while pointed at a shared remote cluster.
The result? An unintentional but very real infrastructure teardown.
Band-Aids and Best Practices (That Didn’t Stick)
Of course, over time we developed the usual defenses:
- Tribal knowledge: “Hey, don’t ever run Skaffold while pointed at Dev.”
- Wiki pages: “Use aliases in your kubeconfig!”
- Verbal warnings: “PLEASE double-check your context before deploying.”
But let’s be real: people are human. Mistakes happen.
New hires don’t always know the sacred scrolls.
And kubectl config current-context is just not the first command you think to run when you’re chasing down a bug.
We needed something better. Something visible. Something interactive. So I built a Slackbot.
Enter: The Slackbot Guardian of the Clusters
At first, I built it for myself—just a tool to query environments quickly. But as it grew, it became something more powerful:
🔧 Features
- /env-status: See which services are up, what pods are running, and where things are failing.
- /logs <service>: Pull logs from any of the Dev environments instantly.
- /restart <pod>: Give a problematic pod a gentle nudge (with appropriate RBAC and guardrails).
- /kube-contexts: A reminder of who’s using what—without having to SSH into anything.
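To give a sense of how small this kind of bot can be, here’s a minimal sketch of an /env-status handler. It assumes, purely for illustration, a Python bot built on Slack’s Bolt framework and the official Kubernetes client; the namespace names, token variables, and health check are placeholders, not the real bot’s code.

```python
import os

from kubernetes import client, config
from slack_bolt import App

# Hypothetical token/secret env vars; the real bot's wiring will differ.
app = App(
    token=os.environ["SLACK_BOT_TOKEN"],
    signing_secret=os.environ["SLACK_SIGNING_SECRET"],
)

config.load_incluster_config()  # the bot runs inside our own cluster
v1 = client.CoreV1Api()

DEV_NAMESPACES = ["dev-1", "dev-2", "dev-3"]  # placeholder names


@app.command("/env-status")
def env_status(ack, respond):
    ack()  # Slack expects an acknowledgement within a few seconds
    lines = []
    for ns in DEV_NAMESPACES:
        pods = v1.list_namespaced_pod(ns).items
        not_running = [p.metadata.name for p in pods if p.status.phase != "Running"]
        summary = f"*{ns}*: {len(pods)} pods, {len(not_running)} not running"
        if not_running:
            summary += f" ({', '.join(not_running)})"
        lines.append(summary)
    respond("\n".join(lines))


if __name__ == "__main__":
    app.start(port=3000)
```

The same shape covers /logs and /kube-contexts: one slash command, one ack, one narrowly scoped, read-only Kubernetes API call.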
🛡️ Guardrails
Of course, I wasn’t about to hand over cluster access to everyone with a Slack account. So:
- Certain commands are read-only unless you're whitelisted.
- Anything destructive (like restarts) has strict role limits.
- No deletes. Ever. If you want to kubectl delete, do it the old-fashioned way—with full shame and fear.
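Continuing that hypothetical sketch, the whitelist is just a gate in front of the destructive handlers. ALLOWED_RESTARTERS and the restart-by-recreate approach below are assumptions about one reasonable way to do it, not the bot’s actual code.

```python
# Continuing the sketch above: destructive commands are gated on a Slack
# user-ID allowlist before anything happens. ALLOWED_RESTARTERS is a
# stand-in for however roles are really managed (env var, ConfigMap, IdP).
ALLOWED_RESTARTERS = set(os.environ.get("ALLOWED_RESTARTERS", "").split(","))


def find_namespace(pod_name):
    """Return the Dev namespace containing the pod, or None."""
    for ns in DEV_NAMESPACES:
        names = [p.metadata.name for p in v1.list_namespaced_pod(ns).items]
        if pod_name in names:
            return ns
    return None


@app.command("/restart")
def restart_pod(ack, respond, command):
    ack()
    if command["user_id"] not in ALLOWED_RESTARTERS:
        respond("Sorry, /restart is allowlist-only. Read-only commands like /env-status are open to everyone.")
        return
    pod = command["text"].strip()
    ns = find_namespace(pod)
    if ns is None:
        respond(f"Couldn't find a pod named {pod} in any Dev environment.")
        return
    # In Kubernetes, "restarting" a managed pod means deleting it so its
    # Deployment/ReplicaSet recreates it. That's the one narrowly scoped
    # exception to the no-deletes rule; the bot never touches other resources.
    v1.delete_namespaced_pod(name=pod, namespace=ns)
    respond(f"Gave {pod} in {ns} a gentle nudge. It should be back shortly.")
```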
🌍 Deployment
I deployed the bot to our internal Kubernetes cluster and hooked it into our DevOps Slack workspace. No fancy third-party services. No external calls. Just clean, fast, controlled interaction with our environments—on our terms.
The Impact: Fewer Mistakes, Faster Insight, Better Dev Experience
Here’s what happened once the bot went live:
- No more accidental nukes. We haven’t had a surprise skaffold dev apocalypse since launch.
- Faster debugging. Devs can check logs or pod health from Slack without switching tools or contexts.
- Onboarding made easier. New team members don’t have to memorize all three kubeconfigs. They can check things safely before ever running a CLI command.
- Increased awareness. When something breaks, people see it in Slack. It’s not a mystery. It’s not a surprise. It’s just another message they can act on.
Final Thoughts: Don't Wait for the Oops
There’s a pattern here. Whether it’s CI/CD pipelines or shared Dev environments, the same truth applies:
If humans can do something dumb, eventually they will.
The trick isn’t to blame them for it—it’s to build systems that expect it and provide better options.
For us, that meant a Slackbot.
For you, maybe it’s a CLI wrapper, a local dev proxy, or a hard fail in your skaffold.yaml.
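As one concrete example of the CLI-wrapper idea, here’s a rough sketch that refuses to start Skaffold unless the current context looks local. It sticks with Python only for consistency with the snippets above (a small shell function works just as well), and the LOCAL_CONTEXTS names are assumptions—adjust them to whatever your local clusters are actually called.

```python
#!/usr/bin/env python3
# Hypothetical "safe skaffold" wrapper: refuse to start unless the current
# kube context looks local. LOCAL_CONTEXTS is an assumption, not a standard.
import os
import sys

from kubernetes import config

LOCAL_CONTEXTS = {"minikube", "docker-desktop", "kind-kind"}

# list_kube_config_contexts() returns (all_contexts, active_context).
_, active = config.list_kube_config_contexts()
current = active["name"]

if current not in LOCAL_CONTEXTS:
    sys.exit(
        f"Refusing to run skaffold against '{current}'. "
        "Switch to a local context first: kubectl config use-context <local>."
    )

# Context looks safe: hand off to the real skaffold binary with the same args.
os.execvp("skaffold", ["skaffold", *sys.argv[1:]])
```

Save it as something like safe-skaffold on your PATH and build the habit of reaching for it instead of the raw binary; the worst case becomes an error message instead of a teardown.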
But whatever it is, don’t wait for the postmortem.
If you’re curious, I can share the bot code, Helm chart, or even help write a README so you can deploy a similar bot for your own teams. And if nothing else, let this be a reminder:
Always check your kube context. And maybe don’t run skaffold dev until you’re really sure it’s local.