Troubleshooting Tectonic

Troubleshooting a Tectonic cluster is separated into a few major topics:

Triaging a malfunctioning cluster

Use this guide to troubleshoot an indeterminate failure type, and determine which of the more in-depth guides will help to solve the issue. Knowledge of the DNS records, certificate authority and the topology of the master and worker nodes is typically required to perform a proper diagnosis.

If you are having problems with a single Deployment or Pod, jump to Verify the control plane. For all other issues, work through this guide.

Verify the Tectonic Console

Open Tectonic Console in a browser, and use the following list to check its status and to determine if there are any networking issues between it and the cluster. Load the address where your Console is running and match your observed behavior below:

| Observed Behavior | Action |
|-------------------|--------|
| Console loads and you see the title “Cluster Status” | Console is working correctly. Continue to Verify the control plane. |
| Browser returns an error related to certificates, similar to “Your connection is not secure” | Your cluster installation is using a certificate that is not trusted by your computer. This is common when using a corporate certificate authority or an authority generated by the Tectonic Installer. Trust the certificate to continue; the openssl sketch below shows how to inspect it. |
| Console presents a login screen | Console is working correctly. Continue to Verify Tectonic Identity. |
| Console will not load and the browser returns an HTTP 500 error | Continue to Verify communication to the cluster. |
| Console returns “503 Service Unavailable” or “Ingress Error” | Console is not running. The backend pods are not running, possibly related to a control plane issue. Continue to Verify the control plane. |
| Console does not load and the browser returns an error | Continue to Verify communication to the cluster. |
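
To inspect the certificate that Console is presenting, openssl can show its issuer, subject, and validity dates. This is a minimal sketch; substitute the address where your Console is running for the example hostname used throughout this guide:

$ openssl s_client -connect east-coast.example.com:443 -showcerts </dev/null 2>/dev/null | openssl x509 -noout -issuer -subject -dates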

Verify communication to the cluster

If Tectonic Console did not load due to the connectivity issues diagnosed above, this does not mean that all cluster services are down. Use kubectl get nodes to evaluate the connection to the Kubernetes API:

$ kubectl get nodes
| Observed Behavior | Action |
|-------------------|--------|
| Node list is shown | The Kubernetes API is reachable and working. Continue to Verify the control plane. |
| The connection to the server localhost:8080 was refused | Your kubeconfig is misconfigured. It should point to the remote Tectonic cluster, not localhost. View the kubectl guide; a quick check is shown below. |
| kubectl hangs until an i/o timeout | The load balancer, security group, or firewall may not be allowing traffic into the master nodes. Audit your rules to ensure traffic is allowed to the master nodes. |
| Unable to connect to the server: EOF | The load balancer does not have any healthy backends. All of the API server pods may be down. Continue to Verify the control plane. |
| Error from server: etcdserver: request timed out | Connectivity to the cluster is established, but there might be a problem with etcd, which stores state for the cluster. Continue to Verify the control plane. |
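
If kubectl points at localhost, inspect which kubeconfig and context are in use. A quick check, assuming kubectl is set up as described in the kubectl guide; the server field should list your cluster’s API address, not localhost:

$ kubectl config current-context
$ kubectl config view --minify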

Verify Tectonic Identity

The Tectonic Console appears to be working. Attempt to log in with a username and password that are configured for this cluster.

| Observed Behavior | Action |
|-------------------|--------|
| The login form appears | Attempt to log in. If successful, continue to Verify the control plane. |
| Console loads and you see the title “Cluster Status” | Identity is working correctly. Continue to Verify the control plane. |
| Invalid username and password | The account does not have access to this cluster. |
| After clicking “login”, the browser displays an HTTP 500 error | Tectonic Identity is misconfigured or not running. See Troubleshooting Identity. |
| Identity cannot connect to database | The backend database, such as LDAP, is misconfigured or not running. See Troubleshooting Identity. |

Review the Identity failure domains guide for information on how Identity is architected to keep the cluster usable during a brief Identity outage or misconfiguration.

Verify the control plane

The Console and Identity functions of the cluster appear to be configured correctly. Test the control plane, the brain of the cluster, to see if it is misconfigured or down.

First, test that new pods are being deployed successfully by checking the Deployments within the kube-system namespace. This namespace holds most of the control plane for the cluster.

$ kubectl --namespace=kube-system get deployments
| Observed Behavior | Action |
|-------------------|--------|
| The DESIRED, CURRENT, UP-TO-DATE and AVAILABLE columns all match each other | New pods appear to be launching successfully. Continue below. |
| The CURRENT count of kube-controller-manager is 0 | The Kubernetes Controller Manager, which deploys and manages containers on the cluster, is not running. Continue to Recover the controller manager. |
| The CURRENT count of kube-scheduler is 0 | The Kubernetes Scheduler, which matches new workloads with machines available to run them, is not running. Continue to Recover the scheduler. |
| The CURRENT count is less than DESIRED but is not 0, for multiple Deployments | The control plane appears to be unhealthy, but the important components are still running at reduced capacity. Continue to Verify etcd cluster. |
| etcdserver: request timed out | The etcd cluster is down and must be recovered. Continue to Troubleshooting etcd. |

If the Controller Manager and Scheduler are running and new pods are being started successfully, there may be a misconfiguration that is affecting the cluster but not causing anything to crash.
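
As a sketch of such a check, a disposable Deployment can confirm that scheduling and container start-up work end to end. On Kubernetes versions of this era, kubectl run creates a Deployment labeled run=<name>; the name smoke-test and the nginx image are arbitrary choices:

$ kubectl --namespace=default run smoke-test --image=nginx
$ kubectl --namespace=default get pods -l run=smoke-test
$ kubectl --namespace=default delete deployment smoke-test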

Verify etcd cluster

The cluster appears to be functioning, but it is showing signs that the etcd cluster is not healthy. Be aware that troubleshooting and recovery differ slightly based on how the etcd cluster was launched with Tectonic Installer. Make a note of which option was selected during installation:

  • Bring an external etcd cluster
  • Provision an etcd cluster
  • Create a self-hosted etcd cluster

First, determine the state of etcd by looking at the logs of the API server, which is the main consumer of the etcd cluster. If more than one API server is running, pick one to inspect.

$ kubectl --namespace=kube-system get pods | grep api
$ kubectl --namespace=kube-system logs <podname>
| Observed Behavior | Action |
|-------------------|--------|
| Logs don’t have etcd-related errors in them | The API server appears to be talking to etcd. Continue below. |
| Logs contain the server cannot complete the requested operation at this time, try again later | The API server cannot currently serve requests, which commonly indicates trouble reaching a healthy etcd cluster. Continue to Troubleshooting etcd. |
| Logs contain http: TLS handshake error from 10.0.x.x:xxxx: EOF | While alarming, this is a normal TLS error related to node health checking, and can be ignored. |
| Logs contain etcdserver: request timed out | The etcd cluster is down and must be recovered. Continue to Troubleshooting etcd. |
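
API server logs can be long; grepping for etcd-related lines makes the errors above easier to spot:

$ kubectl --namespace=kube-system logs <podname> | grep -i etcd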

Troubleshooting connectivity to the cluster

Connections to your cluster depend on a chain of network technologies that vary depending on the compute platform running Tectonic. Connections through Tectonic Console and through the Kubernetes API are similar in function, but may be configured differently, and therefore may behave differently in an outage.

There are two main DNS records for your cluster, each a combination of the cluster name (e.g. east-coast) and the domain (e.g. example.com) you provided during installation: one for the Console (east-coast.example.com) and one for the Kubernetes API (east-coast-api.example.com).

Correctly functioning DNS is the first part of the chain. Test your DNS records with dig:

$ dig east-coast.example.com
$ dig east-coast-api.example.com
| Observed Behavior | Action |
|-------------------|--------|
| ANSWER SECTION: contains one or more IP addresses | DNS appears to be configured to point either to your master nodes, or to a load balancer. Continue below. |
| Response does not contain an ANSWER SECTION:, but instead contains an AUTHORITY SECTION: | DNS records do not point to any master nodes or to a load balancer. Access to the cluster cannot function without these records. |
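
For reference, a healthy response includes an ANSWER SECTION: similar to the following; the addresses shown here are placeholders:

;; ANSWER SECTION:
east-coast.example.com.  300  IN  A  203.0.113.10
east-coast.example.com.  300  IN  A  203.0.113.11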

Next, test connectivity to the Console, and any other applications using Tectonic Ingress:

$ curl -I https://east-coast.example.com/
| Observed Behavior | Action |
|-------------------|--------|
| Response contains HTTP/1.1 200 OK | Console can be reached from your computer. |
| Response contains curl: (35) Server aborted the SSL handshake | Console can’t be reached. The load balancer does not have any healthy backends. |
| Response contains curl: (52) Empty reply from server | The Ingress backend is unhealthy or not running. Continue to Troubleshooting Tectonic Ingress. |

Next, test connectivity to the Kubernetes API:

$ curl -I https://east-coast-api.example.com:443
| Observed Behavior | Action |
|-------------------|--------|
| Response contains HTTP/1.1 401 Unauthorized | The Kubernetes API can be reached from your computer. The request appears unauthorized because no authentication headers were submitted; a token-authenticated request is sketched below. |
| Response contains curl: (35) Server aborted the SSL handshake | The Kubernetes API can’t be reached. The load balancer does not have any healthy backends. |
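
To confirm that the 401 is purely an authentication response and not a connectivity problem, a request with a bearer token should succeed. This is a hedged sketch: $TOKEN stands in for a valid bearer token for your cluster, and -k skips certificate verification if your CA is not trusted locally:

$ curl -k -H "Authorization: Bearer $TOKEN" https://east-coast-api.example.com/api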

Troubleshooting Tectonic Ingress

Tectonic Ingress routes traffic to your containers from outside the cluster. It also routes traffic to Tectonic components hosted on the cluster. If DNS passed validation in Troubleshooting connectivity to the cluster above, the Ingress address is available and traffic is being delivered to the cluster.

When Ingress is not working, you will not be able to use the Console, so we will rely on other tools. First, check the response from the Ingress address in a browser or with curl:

$ curl -I https://east-coast.example.com/
| Observed Behavior | Action |
|-------------------|--------|
| Browser times out | All of the Ingress routing pods are unavailable. Use kubectl logs to troubleshoot, as shown below. |
| Response contains curl: (52) Empty reply from server | All of the Ingress routing pods are unavailable. Use kubectl logs to troubleshoot, as shown below. |
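
A sketch of locating and reading the Ingress pod logs; the tectonic-system namespace matches Tectonic’s defaults, but the pod names on your cluster may differ:

$ kubectl --namespace=tectonic-system get pods | grep ingress
$ kubectl --namespace=tectonic-system logs <pod-name>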

Troubleshooting etcd

etcd is a distributed database that holds the state of your Tectonic cluster. A cluster typically consists of 3 or more members that are constantly syncing and agreeing on the state of the world. A majority of members, called a "quorum", is required to maintain proper function of the cluster; for example, a 3-member cluster tolerates 1 failed member, and a 5-member cluster tolerates 2.

etcd clusters will automatically go into read-only mode when quorum is lost, in order to protect the integrity of the data. This mode allows for some degraded functionality of the cluster. To reestablish quorum, add new healthy members to your cluster, and remove any failed members.

If you have a snapshot or backup of etcd, run a temporary Kubernetes API server locally to inspect the cluster state. The cluster control plane can be recovered from etcd-related failures using the backup and the recovery tool.
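
If the backup is an etcd v3 snapshot, etcdctl can restore it into a fresh data directory as one step of that recovery. A minimal sketch, assuming a snapshot file named backup.db; the full control plane recovery procedure is documented with the recovery tool:

$ ETCDCTL_API=3 etcdctl snapshot restore backup.db --data-dir=/var/lib/etcd-restored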

It can also be helpful to run etcd commands directly against the etcd cluster. This can be done via SSH on a master node.
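
For example, after connecting to a master node, etcdctl can list members and check endpoint health. The endpoint and certificate paths below are assumptions; adjust them to match your installation:

$ ssh core@<master-node-ip>
$ ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/ssl/etcd/ca.crt --cert=/etc/ssl/etcd/client.crt --key=/etc/ssl/etcd/client.key \
    member list
$ ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/ssl/etcd/ca.crt --cert=/etc/ssl/etcd/client.crt --key=/etc/ssl/etcd/client.key \
    endpoint health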

Troubleshooting Identity

Tectonic Identity is the source of authentication for your cluster and is in the critical path for all new sessions using the Console, Kubernetes API, or kubectl. The failure domains document explains in detail how it is architected to reduce downtime, as it is a critical part of the cluster.

Identity will not start if there is an error in its configuration, which is the most common cause of failure. View its logs to look for errors:

$ kubectl --namespace=tectonic-system get pods | grep identity
$ kubectl --namespace=tectonic-system logs <pod-name>
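
The configuration itself lives in a ConfigMap, and viewing it can surface the error. The name tectonic-identity is an assumption based on Tectonic’s defaults; list the ConfigMaps first to confirm it:

$ kubectl --namespace=tectonic-system get configmaps
$ kubectl --namespace=tectonic-system get configmap tectonic-identity -o yaml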

If Identity presents a "Database Error", this typically indicates a failure of the Kubernetes control plane, which is where Identity stores its access tokens and state. This affects automatic access token refreshing, signing key rotation, and related functions.