Monitor Consul server health and performance with metrics and logs
Consul server metrics and logs give you detailed statistical and performance information about your Consul cluster. Metrics provide a general overview of system health and performance, while logs provide context and details used to diagnose issues and identify the root cause of problems. Once you enable these Consul observability features, Consul emits runtime metrics and operational logs of its subsystems.
In this tutorial, you will enable Consul server metrics and server logging for your Consul cluster. You will use Grafana to explore dashboards that provide information regarding health, performance, and operations for your Consul cluster. In the process, you will learn how using these features can provide you with deep insights into the operational health and performance of your Consul cluster.
Scenario overview
To begin this tutorial, you will use Terraform to deploy a self-managed Consul cluster and an observability suite on Elastic Kubernetes Service (EKS).
Each Consul server can emit server metrics and server logs that contain timings, protocols, and additional information for analyzing the health and performance of your Consul cluster. Through the Consul Helm chart, you can configure your Consul servers to emit this observability information so Prometheus and Promtail can scrape and store the data. You can then visualize the metrics and logs with Grafana.
In this tutorial, you will:
- Deploy the following resources with Terraform:
- Elastic Kubernetes Service (EKS) cluster
- A self-managed Consul datacenter on EKS
- Grafana, Prometheus, and Loki on EKS
- Perform the following Consul control plane procedures:
- Review and enable server metrics and server logging features
- Explore dashboards with Grafana
Prerequisites
The tutorial assumes that you are familiar with Consul and its core functionality. If you are new to Consul, refer to the Consul Getting Started tutorials collection.
For this tutorial, you will need:
- An AWS account configured for use with Terraform
- (Optional) An HCP account
- aws-cli >= 2.0
- terraform >= 1.0
- consul >= 1.17.0
- consul-k8s >= 1.2.0
- helm >= 3.0
- git >= 2.0
- kubectl > 1.24
Clone GitHub repository
Clone the GitHub repository containing the configuration files and resources.
$ git clone https://github.com/hashicorp-education/learn-consul-cluster-telemetry
Change into the directory that contains the complete configuration files for this tutorial.
$ cd learn-consul-cluster-telemetry/self-managed/eks
Review repository contents
This repository contains Terraform configuration to spin up the initial infrastructure and all files to deploy Consul, the demo application, and the observability suite resources.
The eks directory contains the following Terraform configuration files:
- aws-vpc.tf defines the AWS VPC resources
- eks-cluster.tf defines Amazon EKS cluster deployment resources
- eks-consul.tf defines the self-managed Consul deployment
- eks-observability.tf defines the Prometheus, Promtail, Loki, and Grafana resources
- outputs.tf defines outputs you will use to authenticate and connect to your Kubernetes cluster
- providers.tf defines AWS and Kubernetes provider definitions for Terraform
- variables.tf defines variables you can use to customize the tutorial
The directory also contains the following subdirectories:
- ../../dashboards contains the JSON configuration files for the example Grafana dashboards
- config contains the custom Consul ACL configuration file and the Consul synthetic load generator configuration file
- helm contains the Helm charts for Consul, Prometheus, Promtail, Loki, and Grafana
Deploy infrastructure and demo application
With these Terraform configuration files, you are ready to deploy your infrastructure. Initialize your Terraform configuration to download the necessary providers and modules.
$ terraform init
Initializing the backend...
Initializing provider plugins...
## ...
Terraform has been successfully initialized!
## …
Then, deploy the resources. Confirm the run by entering yes.
$ terraform apply
## ...
Do you want to perform these actions?
Terraform will perform the actions described above.
Only 'yes' will be accepted to approve.
Enter a value: yes
## ...
Apply complete! Resources: 61 added, 0 changed, 0 destroyed.
The Terraform deployment could take up to 15 minutes to complete.
Connect to your infrastructure
Now that you have deployed the Kubernetes cluster, configure kubectl to interact with it.
$ aws eks --region $(terraform output -raw region) update-kubeconfig --name $(terraform output -raw kubernetes_cluster_id)
Ensure all services are up and running successfully
Check the pods across all namespaces to confirm they are running successfully.
$ kubectl get pods --all-namespaces --field-selector metadata.namespace!=kube-system
NAMESPACE NAME READY STATUS RESTARTS AGE
consul consul-connect-injector-9b944b6c4-hq99p 1/1 Running 0 7m17s
consul consul-server-0 1/1 Running 0 7m17s
consul consul-server-1 1/1 Running 0 7m17s
consul consul-server-2 1/1 Running 0 7m17s
consul consul-webhook-cert-manager-9d7cc8cc5-wpx76 1/1 Running 0 7m17s
observability grafana-5dccdcd7c8-qhhbr 1/1 Running 0 6m3s
observability loki-0 1/1 Running 0 6m27s
observability loki-canary-dqpdt 1/1 Running 0 6m27s
observability loki-canary-fgndz 1/1 Running 0 6m27s
observability loki-canary-j7k7q 1/1 Running 0 6m27s
observability loki-gateway-5c59784b98-k4wgk 1/1 Running 0 6m27s
observability loki-grafana-agent-operator-d7c684bf9-jkgkb 1/1 Running 0 6m27s
observability loki-logs-4lfxm 2/2 Running 0 6m22s
observability loki-logs-96wcb 2/2 Running 0 6m22s
observability loki-logs-zvspl 2/2 Running 0 6m22s
observability prometheus-kube-state-metrics-8646c88b45-q5rbz 1/1 Running 0 6m34s
observability prometheus-prometheus-node-exporter-57rqj 1/1 Running 0 6m34s
observability prometheus-prometheus-node-exporter-d6c6f 1/1 Running 0 6m34s
observability prometheus-prometheus-node-exporter-tjlfs 1/1 Running 0 6m34s
observability prometheus-prometheus-pushgateway-79ff799669-4gm44 1/1 Running 0 6m34s
observability prometheus-server-6c87bf4dd9-s7m7x 2/2 Running 0 6m34s
observability promtail-ccm5q 1/1 Running 0 5m8s
observability promtail-djcpn 1/1 Running 0 5m8s
observability promtail-tzp6g 1/1 Running 0 5m8s
Configure your CLI to interact with Consul datacenter
In this section, you will set environment variables in your terminal so your Consul CLI can interact with your Consul datacenter. The Consul CLI reads these environment variables for behavior defaults and will reference these values when you run consul commands.
Set the Consul server destination address.
$ export CONSUL_HTTP_ADDR=https://$(kubectl get services/consul-ui --namespace consul -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
Retrieve the ACL bootstrap token from the respective Kubernetes secret and set it as an environment variable.
$ export CONSUL_HTTP_TOKEN=$(kubectl get --namespace consul secrets/consul-bootstrap-acl-token --template={{.data.token}} | base64 -d)
Remove SSL verification checks to simplify communication to your Consul datacenter.
$ export CONSUL_HTTP_SSL_VERIFY=false
In a production environment, we recommend keeping SSL verification set to true. Only remove this verification for development and demonstration purposes, such as when your Consul datacenter does not have TLS fully configured.
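For reference, in a production setup you could keep verification enabled and point the Consul CLI at the cluster's CA certificate instead. The following sketch assumes the consul-ca-cert secret and tls.crt key that the Consul Helm chart creates by default; do not run it now, since the rest of this tutorial assumes verification is disabled.
$ kubectl get secret consul-ca-cert --namespace consul -o jsonpath='{.data.tls\.crt}' | base64 -d > ca.pem
$ export CONSUL_CACERT=ca.pem
$ export CONSUL_HTTP_SSL_VERIFY=true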
Verify that you can communicate with your Consul cluster by printing all known nodes and the metadata about them.
$ consul catalog nodes
Node ID Address DC
consul-server-0 6965c864 10.0.6.174 dc1
consul-server-1 d461434e 10.0.4.52 dc1
consul-server-2 b73bfdf9 10.0.5.101 dc1
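You can also verify the health of the Raft protocol before moving on. List the Raft peers; one server reports the leader state and the others report follower.
$ consul operator raft list-peers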
Enable Consul server metrics and logging
Consul server metrics and logs provide you with detailed health and performance information for your Consul clusters. In this section, you will review the parameters that enable these features and update your Consul installation to apply the new configuration.
Review the Consul values file
Consul lets you expose metrics and logs for your server pods so a Prometheus service outside of your service mesh can scrape them. Review these snippets from the helm/consul-v2-telemetry.yaml configuration file to see the parameters that enable these features.
Consul metrics are only exposed on port 8500. Setting httpsOnly: false in the TLS block allows Prometheus to scrape this port for metrics.
global:
## ...
tls:
httpsOnly: false
## ...
The following block enables metrics for all agents in your Consul datacenter.
global:
## ...
metrics:
enabled: true
enableAgentMetrics: true
## ...
This block raises the server log level to TRACE so your Consul servers emit detailed operational logs.
## ...
server:
## …
extraConfig: |
{
"log_level": "TRACE"
}
## …
Refer to the Consul metrics for Kubernetes documentation and official Helm chart values to learn more about metrics configuration options and details.
Update the Consul deployment
Update Consul in your Kubernetes cluster with the Consul K8S CLI to let Prometheus collect metrics from your Consul servers. Confirm the run by entering y.
$ consul-k8s upgrade -config-file=helm/consul-v2-telemetry.yaml
Refer to the Consul K8S CLI documentation to learn more about additional settings.
The Consul update could take up to 5 minutes to complete.
Review the official Helm chart values to learn more about these settings.
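To verify that the servers now expose Prometheus-formatted metrics, you can optionally port-forward to a server pod and query Consul's documented /v1/agent/metrics API. Run the curl command in a second terminal while the port-forward is active.
$ kubectl port-forward consul-server-0 --namespace consul 8500:8500
$ curl -s -H "X-Consul-Token: $CONSUL_HTTP_TOKEN" 'http://localhost:8500/v1/agent/metrics?format=prometheus' | head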
Configure the anonymous ACL policy
In addition to configuring Consul, you need to modify the anonymous ACL policy to allow agent:read permissions so Prometheus can scrape metrics from the secure Consul servers. Other permissions in the included file will allow the Consul load generator service to communicate with the respective Consul features.
$ consul acl policy update -name "anonymous-token-policy" \
-datacenter "dc1" \
-rules @config/acl-policy.hcl
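Optionally, read the policy back to confirm that it now includes the agent read rule.
$ consul acl policy read -name "anonymous-token-policy" -datacenter "dc1"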
Review the Consul ACL Policies documentation to learn more.
Note
In a production environment, we recommend using the Prometheus Consul Exporter for the most secure, restrictive access to Consul metrics on port 8501.
Deploy the Consul load generator
Deploy the Consul load generator to create synthetic loads for KV, service registration, and the ACL engine. This will create more realistic visualizations in your Grafana dashboards.
$ kubectl apply -f config/consul-load-generator.yaml
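To confirm the load generator started, list its pod. The filter below is an assumption based on the manifest name, so adjust it if your pod is named differently.
$ kubectl get pods --all-namespaces | grep -i load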
Explore Consul health and performance dashboards
Consul control plane metrics and logs provide you with detailed health and performance information for your Consul servers. In this section, you will use Grafana to examine how this information provides insights into your Consul control plane.
Explore Consul telemetry dashboard
Navigate to the control plane monitoring dashboard.
$ export GRAFANA_CP_DASHBOARD=http://$(kubectl get svc/grafana --namespace observability -o json | jq -r '.status.loadBalancer.ingress[0].hostname')/d/control-plane-performance-monitoring && echo $GRAFANA_CP_DASHBOARD
http://a20fb6f2d1d3e4be296d05452a378ad2-428040929.us-west-2.elb.amazonaws.com/d/control-plane-performance-monitoring
The example dashboards take a few minutes to populate with data after the telemetry metrics feature is enabled.
This dashboard contains several sections that present a variety of health and performance information for your Consul control plane. These graphs can help you analyze the health of your Consul server pods and identify anomalies in behavior.
Notice that the System Stats tab includes CPU usage and memory usage metrics. High values in these areas can cause long loading times, slow performance, and unexpected crashes.
Now, click on the Consul Server Behavior tab. This tab gives insight into the health of Consul's Raft protocol, with higher than average numbers indicating slowdowns in reaching consensus between Consul servers.
Click on the Feature: Catalog tab. This tab provides health information about the registration and deregistration of nodes, services, and checks in Consul. This can provide useful insight into the load pressure on each of your Consul servers.
Tip
Consul telemetry metrics contain a large set of statistics that you can use to create custom dashboards for monitoring your Consul clusters according to your production environment's unique requirements. Refer to the Consul telemetry overview for a complete list and description of available metrics.
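As a starting point for a custom panel, you can query Prometheus directly for any Consul statistic. The sketch below port-forwards to the Prometheus service (the service name and port are assumptions based on common Helm chart defaults) and queries consul_raft_commitTime, the Prometheus-notation form of Consul's consul.raft.commitTime metric. Run the curl command in a second terminal.
$ kubectl port-forward svc/prometheus-server --namespace observability 9090:80
$ curl -s 'http://localhost:9090/api/v1/query?query=consul_raft_commitTime' | jq '.data.result[0]'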
Explore Consul server logs dashboard
Navigate to the control plane logs dashboard.
$ export GRAFANA_CP_LOGS_DASHBOARD=http://$(kubectl get svc/grafana --namespace observability -o json | jq -r '.status.loadBalancer.ingress[0].hostname')/d/control-plane-logs/ && echo $GRAFANA_CP_LOGS_DASHBOARD
http://a20fb6f2d1d3e4be296d05452a378ad2-428040929.us-west-2.elb.amazonaws.com/d/control-plane-logs/
The Grafana dashboard may take a few moments to fully load in your browser.
Notice that the example dashboard panes provide detailed event and error insights for your Consul control plane.
For example, the RPC Server Call Request Type Distribution pie chart gives you the read/write ratio of RPC server calls in your Consul cluster during a specific time window.
Type request_type=write in the search field to look deeper into the server logs.
Notice how this action filters the respective visualizations and raw logs to entries containing that value, so you can zoom into error logs for further analysis and troubleshooting. Click on one of the raw logs to view the entire log entry.
Notice that you can explore the other fields associated with your search terms to learn more information about a particular error or event.
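Under the hood, these dashboard filters are LogQL queries against Loki. For reference, you can run the same query against Loki's query_range API. The gateway service name, port, and label selector below are assumptions that depend on the Helm chart defaults and Promtail's labeling, so adjust them to match your environment; run the curl command in a second terminal.
$ kubectl port-forward svc/loki-gateway --namespace observability 3100:80
$ curl -s -G 'http://localhost:3100/loki/api/v1/query_range' --data-urlencode 'query={namespace="consul"} |= "request_type=write"' | jq '.status'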
Clean up resources
Destroy the Terraform resources to clean up your environment. Confirm the destroy operation by entering yes.
$ terraform destroy
## ...
Do you really want to destroy all resources?
Terraform will destroy all your managed infrastructure, as shown above.
There is no undo. Only 'yes' will be accepted to confirm.
Enter a value: yes
## ...
Destroy complete! Resources: 0 added, 0 changed, 61 destroyed.
Note
Due to race conditions with the cloud resources in this tutorial, you may need to run the destroy operation twice to remove all the resources.
Next steps
In this tutorial, you enabled Consul server metrics and logs to enhance the health and performance monitoring of your Consul cluster. This integration gives you deeper insight into your control plane, reduces operational overhead, and speeds up incident resolution.
For more information about the topics covered in this tutorial, refer to the following resources: