Emadideen Ghannam
CI/CD that survives after I leave
The pipeline I built isn't the legacy. The team's relationship with the pipeline is. Five practices that survived three handovers.
I have moved between teams enough times to notice which CI/CD practices survive my departure and which ones don’t. The pipeline itself isn’t the legacy. The team’s relationship with the pipeline is.
Five practices I’ve watched outlast three handovers across three different teams.
1. The pipeline runs on every commit, including the broken ones
The first instinct of a new team is to make CI strict. Don’t let red builds reach main. Block PRs on test failures. Block merges on lint warnings.
That instinct is right for the deploy pipeline. It is wrong for the dev pipeline.
The pipeline that survives is the one engineers can run on their own machines, in two minutes, on a feature branch with broken tests. They commit broken work to a branch, the pipeline runs, the failure tells them what to fix, they fix it, they push again.
If the pipeline is too strict, engineers stop pushing until they are sure. Sure means slow. Slow means they don’t push for half a day. Half a day means context loss. Context loss means worse code.
The deploy pipeline is strict. The dev pipeline is forgiving. They are two pipelines, not one.
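As a concrete sketch of the forgiving half, assuming GitHub Actions and a Node project, something like this runs on every push to every feature branch and gates nothing. The file name, branch filter, and commands are placeholders, not a prescription.

```yaml
# .github/workflows/dev.yml — a minimal sketch of the dev pipeline (names and commands are illustrative)
name: dev
on:
  push:
    branches-ignore: [main]   # every feature branch, every commit, broken or not
jobs:
  test:
    runs-on: ubuntu-latest
    timeout-minutes: 5        # if it can't finish fast, engineers stop running it
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm test         # a red run here gates nothing; it just tells you what to fix
```

Because no required status check is tied to this workflow, a failure is information, not a gate. The strictness lives in the deploy pipeline instead.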
2. The pipeline lives in the repo, not in a SaaS UI
Every team I joined had a pipeline configured in the SaaS UI of their CI vendor. Click click click. Nobody knows who set it up. Nobody can tell what changed last week. Nobody can replicate it locally.
I migrate every team to declarative pipelines in the repo: .github/workflows/, .gitlab-ci.yml, azure-pipelines.yml, whatever the vendor calls it. The point is that a pipeline change is a PR, the PR has a reviewer, and the reviewer has to think about what changed.
The practice that survives is the one where “the pipeline broke” is something a junior can investigate by reading the YAML, not something a senior has to dig out of the SaaS UI.
This is the single highest-leverage thing you can do for a team’s CI/CD relationship. It is also the thing most teams resist longest, because the SaaS UI is “easier.”
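On GitHub, for example, one way to make “the pipeline change is a PR with a reviewer” hold without anyone policing it is a CODEOWNERS entry for the workflow directory. The reviewer group below is hypothetical.

```
# .github/CODEOWNERS — illustrative entry; the reviewer group is a placeholder
/.github/workflows/  @your-org/platform-reviewers
```

Pair it with the branch protection option that requires review from code owners, and pipeline changes can only land the way any other code lands: through a reviewed PR.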
3. The deploy is one button or zero buttons
Two states are good. The third is bad.
Zero buttons means main is auto-deployed. Engineers push, CI passes, production updates. This is the default for greenfield products with feature flags.
One button means a human has to click “deploy” but everything before that is automatic. Useful for regulated environments, or when production is high-stakes enough that someone has to be paying attention at the moment of deploy.
Three buttons (or more) means: build, deploy to staging, run smoke tests, deploy to production, each step manual. This is the failure mode. The engineer who clicks button two doesn’t always click button three. Things ship to staging and never reach production. The two environments diverge. Tests pass in staging that fail in production because the deploys are weeks apart.
I have shipped one-button and zero-button. I have inherited three-button, twice, and reduced it both times.
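For concreteness, both good states fit in one small workflow. This is a sketch assuming GitHub Actions; keep whichever trigger matches the state you want, and treat the environment name and deploy script as placeholders.

```yaml
# .github/workflows/deploy.yml — a sketch of the two good states (triggers shown together for contrast)
name: deploy
on:
  push:
    branches: [main]     # zero buttons: every green commit on main deploys
  workflow_dispatch:     # one button: keep only this trigger and a human clicks "Run workflow"
jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: production        # hypothetical environment; required reviewers here are another way to get the single button
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/deploy.sh   # placeholder for whatever actually ships the artefact
```

If staging and smoke tests exist, they become jobs in this same file, not extra buttons for a human to remember.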
4. The post-deploy verification is the test, not the deploy
Most teams I have joined treated “deploy succeeded” as the success metric. The pipeline goes green, the deploy returns 200, the team celebrates, then a customer reports a bug an hour later.
The deploy succeeding means the artefact reached the server. Nothing more. The actual question is: is the system behaving the way it should?
The pipeline that survives includes a post-deploy verification step. A health check. A smoke test against production URLs. A query to the database to confirm the migration ran. Whatever the equivalent is for your stack. If that fails, the deploy is rolled back automatically.
This is the difference between a pipeline that ships and a pipeline that’s responsible for what shipped. The team I leave behind, six months later, has fewer “we deployed but didn’t notice it broke” incidents. That is the practice surviving.
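Continuing the deploy sketch from the previous section, the verification can be a follow-on job. The URL, the retry loop, and the rollback script are placeholders for whatever the equivalent is in your stack.

```yaml
# Added under jobs: in the deploy workflow sketched earlier — illustrative, not your stack's exact check
  verify:
    needs: deploy                    # runs only after the deploy job succeeds
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Smoke test production
        run: |
          # retry a few times while the new version warms up
          for i in 1 2 3 4 5; do
            curl --fail --silent https://example.com/healthz && exit 0
            sleep 10
          done
          exit 1
      - name: Roll back automatically
        if: failure()                # only reached when the smoke test gave up
        run: ./scripts/rollback.sh   # placeholder for whatever undoes the deploy
```

The shape is the same in any vendor’s YAML: one job that asks production a question, and one step that runs only when the answer is wrong.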
5. The pipeline failures have an owner, not a queue
A red build on main is the most expensive state a team can be in. Every minute main stays red, every other engineer’s flow is at risk.
The practice that survives: when main goes red, the person who broke it is on the hook to fix it within 30 minutes or revert. No queue. No ticket. No “I’ll get to it after lunch.” 30 minutes or revert.
The reason this survives is that it’s a social contract, not a tool. You can’t enforce it with a webhook. You enforce it by reverting the red build a junior has left unattended for an hour. The team learns. The next time it happens to them, they fix it or revert themselves. The time after that, they don’t push without running CI locally first.
What doesn’t survive
- Custom dashboards that visualise pipeline metrics. They get built once, looked at twice, and abandoned.
- Slack notifications for every CI event. They get muted in week two.
- Ten-step pipelines with bespoke logic. They survive me by exactly the time it takes the next senior to ask “why is this so complicated?”
- Anything that depends on me personally to keep working.