We received an EC2 instance retirement notice from AWS for our Kubernetes master node. I thought to myself: we can simply terminate the instance and launch a new one. I've done it many times. It's no big deal.

However, this time, when our infra engineer did that, we were greeted with this error when trying to access our cluster:

Unable to connect to the server: EOF

All the apps were still fine, thanks to Kubernetes's design, so we had all the time we needed to fix this.

So kubectl was unable to connect to the Kubernetes API. The API endpoint is a CNAME to the API load balancer in Route53, so that's where we looked first.

Route53 records are wrong

OK, so there are many problems that can cause this error. One of the first things we noticed was that the Route53 DNS record for etcd was wrong: it still pointed to the old master's IP address. Could the init script somehow have failed to update it?

Our first attempt at a fix was to manually update the DNS record for etcd to the new instance's IP address. Nope, the error stayed the same.
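For anyone checking the same thing, this is roughly how you can compare what the record resolves to against the new master's actual private IP. The record and tag names below are placeholders; kops-managed clusters typically use `etcd-<zone>.internal.<cluster-name>`, so substitute your own cluster domain.

```shell
# Ask DNS what the etcd record currently points to.
# (Placeholder name; substitute your cluster's etcd record.)
dig +short etcd-a.internal.cluster.example.com

# Compare it against the new master's private IP.
# (Placeholder Name tag; substitute your master's instance tag.)
aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=master-us-east-1a.masters.cluster.example.com" \
  --query "Reservations[].Instances[].PrivateIpAddress" \
  --output text
```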

ELB marks master node as OutOfService

We looked a little more into the ELB for the API server. The instance was marked OutOfService. I thought this was it; it made sense. But what could have caused the API server to be down this time? We had done this process many times before.
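You can see the same health state from the CLI. The load balancer name below is a placeholder; for a classic ELB it would be something like:

```shell
# Show per-instance health for the API load balancer (classic ELB).
# "api-cluster-example-com" is a placeholder; list your actual names
# with `aws elb describe-load-balancers` first.
aws elb describe-instance-health \
  --load-balancer-name api-cluster-example-com
# An unhealthy backend shows "State": "OutOfService" with a reason code.
```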

We SSHed into our master instance and issued docker ps -a. There was nothing. Zero containers whatsoever.

We checked systemctl and there it was: cloud-final.service had failed. We checked the logs with journalctl -u cloud-final.service.
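For reference, the checks above amount to something like this on any systemd host:

```shell
# List failed units; cloud-final.service showed up here.
systemctl --failed

# Detailed status of the failing unit.
systemctl status cloud-final.service

# Full logs for the unit, without the pager.
journalctl -u cloud-final.service --no-pager
```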

We noticed from the logs that many required packages (ebtables, etc.) were missing when the nodeup script ran.

Manual apt update

So if we could fix that, everything should be OK, right? We ran apt update manually and saw this:

E: Release file for http://cloudfront.debian.net/debian/dists/jessie-backports/InRelease is expired (invalid since ...). Updates for this repository will not be applied.

OK, this still made sense. Our cluster is old and the release file had expired. If we update it manually, it should work again, right? We ran apt update with the Check-Valid-Until option set to false:

apt-get -o Acquire::Check-Valid-Until=false update
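If you would rather not pass the flag on every invocation, the same option can be persisted in an apt configuration snippet. The file path and name below are our own choice, not a requirement:

```shell
# Persist the option so a plain `apt-get update` works against the
# archived repo. This disables a freshness check, so treat it as a
# stopgap for end-of-life releases like jessie, not a permanent setting.
echo 'Acquire::Check-Valid-Until "false";' \
  > /etc/apt/apt.conf.d/99no-check-valid-until
apt-get update
```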

Restart cloud-final service

Then restart cloud-final.service, or manually run the nodeup script again with:

/var/cache/kubernetes-install/nodeup --conf=/var/cache/kubernetes-install/kube_env.yaml --v=8
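Put together, the whole fix came down to something like this (assuming the kops-provisioned paths from above):

```shell
# 1. Refresh the package index despite the expired Release file.
apt-get -o Acquire::Check-Valid-Until=false update

# 2. Re-run the failed cloud-init stage...
systemctl restart cloud-final.service

# ...or, alternatively, invoke nodeup directly with verbose logging.
/var/cache/kubernetes-install/nodeup \
  --conf=/var/cache/kubernetes-install/kube_env.yaml --v=8
```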

docker ps -a at this point should show all the containers running again. Wait a while (about 30 seconds) and kubectl should be able to communicate with the API server again.
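One simple way to watch for recovery is a small polling loop (a sketch; adjust the interval to taste):

```shell
# Confirm the control-plane containers came back.
docker ps -a

# Poll until the API server answers again.
until kubectl get nodes >/dev/null 2>&1; do
  echo "API server not ready yet, retrying..."
  sleep 5
done
kubectl get nodes
```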

Final thoughts

While your problem may not be exactly the same as this, I thought I'd share my debugging experience in case it helps someone out there.

In our case, the problem was fixed with just 2 commands, but the actual debugging process took more than an hour.