Yesterday’s deploy problem looked small at first. The site built locally. The content changes were fine. The workflow itself looked reasonable. And still the deploy kept stopping at the same ugly message:
`Permission denied (publickey)`
That kind of failure is annoying because it points in too many directions at once. It could be the server. It could be the CI secret. It could be the wrong user, the wrong key, the wrong path, or a stale machine state that makes local deploys feel healthy while automation quietly fails.
What made this one especially slippery was that the build was never the problem. The application was fine. The breakage lived in the handoff between CI and the server.
The false lead
My first assumption was the obvious one: the server was rejecting the key.
That turned out to be only half true.
The public key was present where it needed to be. The server could see it. The host itself was reachable. From a distance, everything important looked correctly wired. That is exactly why this took longer than it should have.
The real problem was not “missing SSH setup.” It was a mismatch between a local workflow and an automated one.
What was actually wrong
Locally, an encrypted deploy key can feel convenient enough. Unlock it once, let the agent remember it for the session, and deploys become a one-command habit. CI does not get that luxury. A runner has no patient human beside it, no remembered agent state, and no good reason to pause for a key passphrase.
So the pipeline was trying to use a key that made sense for a person but not for unattended automation.
That difference matters more than it looks. A deploy setup can be technically “correct” and still be wrong for CI.
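One way to catch that mismatch before CI does is to check whether a key is usable without any prompt at all. A small sketch, assuming `ssh-keygen` is on the PATH; the helper name is mine, not part of the original setup:

```shell
# Returns success (0) if the private key at $1 requires a passphrase.
# ssh-keygen -y derives the public key from a private key; with -P ""
# it refuses to prompt, so it fails exactly when a passphrase is needed.
key_needs_passphrase() {
  ! ssh-keygen -y -P "" -f "$1" > /dev/null 2>&1
}
```

Running a check like this in the CI setup step turns "mysterious auth failure later" into "clear failure now."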
The fix
I replaced the old approach with a dedicated CI-only deploy key:
- separate from my personal local workflow
- created specifically for automation
- not protected by an interactive passphrase
- scoped to one narrow job: publishing the built site
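Creating such a key is a one-liner. A sketch; the file name, comment, and scratch directory here are illustrative, not the real ones:

```shell
# Scratch directory just for this sketch; in practice pick a real location.
tmp=$(mktemp -d)

# Generate an ed25519 key with an empty passphrase (-N ""), so an
# unattended runner can use it. The comment (-C) just labels the key.
ssh-keygen -q -t ed25519 -N "" -C "site-deploy-ci" -f "$tmp/deploy_ci"

# $tmp/deploy_ci      -> private half: goes into a CI secret, never into git
# $tmp/deploy_ci.pub  -> public half: appended to authorized_keys on the server
```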
That was the actual turning point. Once the key matched the environment it was meant for, the rest of the pipeline became boring in the best way.
I also tightened the workflow a little so it fails earlier and more clearly:
- validate that the deploy secret is present before trying to use it
- force SSH into non-interactive mode
- keep the SSH handshake explicit instead of assuming a friendly shell session
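As a sketch, the fail-early part can look like this; `DEPLOY_KEY` is an assumed secret name and the host is a placeholder, not the real target:

```shell
# Refuse to continue if a deploy secret is missing or empty, so the
# failure happens here with a clear message instead of deep inside SSH.
require_secret() {
  if [ -z "$(printenv "$1")" ]; then
    echo "error: required secret $1 is not set" >&2
    return 1
  fi
}

# In the deploy job (names are placeholders):
#   require_secret DEPLOY_KEY || exit 1
#   # BatchMode=yes makes SSH fail immediately instead of prompting for
#   # a passphrase or password -- exactly what an unattended runner needs.
#   ssh -o BatchMode=yes deploy@example.com 'publish-site'
```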
None of that is flashy, but it changes debugging from guesswork into a short checklist.
The small extra lesson
While cleaning this up, another warning appeared: some GitHub Actions in the workflow were still pinned to older major versions that run on a Node runtime GitHub is deprecating.
That did not break the deploy, but it was useful timing. Infrastructure problems rarely arrive one at a time. Since I was already in the workflow, it made sense to update the actions and remove the next future annoyance before it turned into a real interruption.
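That kind of update is usually a one-line version bump per action in the workflow file. The action names below are common examples, not necessarily the ones in this repository:

```yaml
steps:
  # Bumping the major version moves the action onto a supported Node runtime.
  - uses: actions/checkout@v4        # was: actions/checkout@v3
  - uses: actions/setup-node@v4      # was: actions/setup-node@v3
```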
The more interesting part
The nicest outcome was not “the deploy works again.”
It was that the fix pushed the setup in a safer direction.
This was a good reminder that public writeups about infrastructure should be edited with the same care as infrastructure itself. It is tempting to tell the whole story exactly as it happened, with every hostname, path, secret name, and account detail preserved for narrative texture. That makes a post feel concrete. It can also make it too concrete.
So this version leaves out the pieces that are useful to attackers and keeps the parts that are useful to future-me:
- what failed
- why the first explanation was wrong
- what kind of key belongs in CI
- what kind of key belongs in a human workflow
That feels like a better trade.
Where I landed
The deploy path is simpler now:
- build the site
- authenticate with a dedicated automation key
- publish the generated files
No romance. No mystery. No hidden manual state.
That is probably the best sign that the system is healthier than it was yesterday.