Admin: Tutorial about CrateDB monitoring with Prometheus and Grafana #302
amotl wants to merge 2 commits into main from monitoring-prometheus-grafana
(monitoring-prometheus-grafana)=
# Monitoring a self-managed CrateDB cluster with Prometheus and Grafana

## Introduction
In production, monitor CrateDB proactively to catch issues early and
collect statistics for capacity planning.

Pair two OSS tools: use [Prometheus] to collect and store metrics,
and [Grafana] to build dashboards.

For a CrateDB environment, we are interested in:

* CrateDB-specific metrics, such as the number of shards or the number of failed queries
* OS metrics, such as available disk space, memory usage, or CPU usage

For CrateDB-specific metrics, we recommend making them available to Prometheus with the [Crate JMX HTTP Exporter](https://cratedb.com/docs/crate/reference/en/5.1/admin/monitoring.html#exposing-jmx-via-http) and the [Prometheus SQL Exporter](https://github.com/justwatchcom/sql_exporter). For OS metrics on Linux, we recommend the [Prometheus Node Exporter](https://prometheus.io/docs/guides/node-exporter/).

Containerized and [CrateDB Cloud] setups differ. This tutorial targets
standalone and on-premises installations.
## First we need a CrateDB cluster

First things first, we need a CrateDB cluster. You may already have one, and that is great; if not, we can get one up quickly.

You can review the installation documentation at {ref}`install` and {ref}`multi-node-setup`.

On Ubuntu, start on the first node and run:

```shell
nano /etc/default/crate
```

This configuration file sets the JVM heap size. Configure it to satisfy the bootstrap checks:

```
CRATE_HEAP_SIZE=4G
```
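As a quick sanity check, the setting is a plain `KEY=VALUE` line whose format can be verified with `grep`. This sketch works on a scratch file, since the real path is `/etc/default/crate`:

```shell
# Write the heap setting to a scratch file and verify its format.
# (The scratch file is a hypothetical stand-in for /etc/default/crate.)
tmp=$(mktemp)
echo 'CRATE_HEAP_SIZE=4G' > "$tmp"
heap_line=$(grep -E '^CRATE_HEAP_SIZE=[0-9]+[GgMm]$' "$tmp")
echo "$heap_line"
rm -f "$tmp"
```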
We also need to create another configuration file:

```shell
mkdir /etc/crate
nano /etc/crate/crate.yml
```

In my case, I used the following values:

```yaml
network.host: _local_,_site_
```

This tells CrateDB to respond to requests both from localhost and the local network.
```yaml
discovery.seed_hosts:
  - ubuntuvm1:4300
  - ubuntuvm2:4300
```

This lists all the machines that make up our cluster. Here I only have two, but for production use we recommend at least three nodes, so that a quorum can be established in case of a network partition and split-brain scenarios are avoided.
```yaml
cluster.initial_master_nodes:
  - ubuntuvm1
  - ubuntuvm2
```

This lists the nodes that are eligible to act as master nodes during cluster bootstrap.
```yaml
auth.host_based.enabled: true
auth:
  host_based:
    config:
      0:
        user: crate
        address: _local_
        method: trust
      99:
        method: password
```

This indicates that the `crate` superuser will work for local connections, but connections from other machines will require a username and password.
```yaml
gateway.recover_after_data_nodes: 2
gateway.expected_data_nodes: 2
```

In this case, both nodes must be available for the cluster to operate. With more nodes, we could set `recover_after_data_nodes` to a value smaller than the total number of nodes.
Now let's install CrateDB:

```bash
apt update
apt install --yes gpg lsb-release wget
wget -O- https://cdn.crate.io/downloads/deb/DEB-GPG-KEY-crate | gpg --dearmor | tee /usr/share/keyrings/crate.gpg >/dev/null
echo "deb [signed-by=/usr/share/keyrings/crate.gpg] https://cdn.crate.io/downloads/deb/stable/ $(lsb_release -cs) main" | tee /etc/apt/sources.list.d/crate.list
apt update
apt install crate -o Dpkg::Options::="--force-confold"
```

(`force-confold` keeps the configuration files we created earlier.)

Repeat the above steps on the other node.

## Setup of the Crate JMX HTTP Exporter

This is very simple; on each node, run the following:

```shell
cd /usr/share/crate/lib
wget https://repo1.maven.org/maven2/io/crate/crate-jmx-exporter/1.2.0/crate-jmx-exporter-1.2.0.jar
nano /etc/default/crate
```

Then uncomment the `CRATE_JAVA_OPTS` line and change its value to:

```shell
# Append to existing options (preserve other flags).
CRATE_JAVA_OPTS="${CRATE_JAVA_OPTS:-} -javaagent:/usr/share/crate/lib/crate-jmx-exporter-1.2.0.jar=8080"
```
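The `${CRATE_JAVA_OPTS:-}` expansion appends to any existing value rather than overwriting it. A quick illustration, with `-Xss512k` as a hypothetical pre-existing flag:

```shell
# Start from a hypothetical pre-existing value...
CRATE_JAVA_OPTS="-Xss512k"
# ...then apply the append pattern from above: the old flag survives.
CRATE_JAVA_OPTS="${CRATE_JAVA_OPTS:-} -javaagent:/usr/share/crate/lib/crate-jmx-exporter-1.2.0.jar=8080"
echo "$CRATE_JAVA_OPTS"
```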

Then restart the crate daemon:

```bash
systemctl restart crate
```
## Prometheus Node Exporter

This can be set up with a one-liner:

```shell
apt install prometheus-node-exporter
```

## Prometheus SQL Exporter

The SQL Exporter allows running arbitrary SQL statements against a CrateDB cluster to retrieve additional information. As the cluster contains information from each node, we do not need to install the SQL Exporter on every node. Instead, we install it centrally, on the same machine that also hosts Prometheus.

Please note that setting up a Grafana data source pointing to CrateDB, to display query output in real time, is not the same as using Prometheus to collect these values over time.

Installing the package is straightforward:

```shell
apt install prometheus-sql-exporter
```

For the SQL Exporter to connect to the cluster, we need to create a new user `sql_exporter` and grant it read access to the `sys` schema. Run the commands below on any CrateDB node:

```shell
curl -H 'Content-Type: application/json' -X POST 'http://localhost:4200/_sql' -d '{"stmt":"CREATE USER sql_exporter WITH (password = '\''insert_password'\'');"}'
curl -H 'Content-Type: application/json' -X POST 'http://localhost:4200/_sql' -d '{"stmt":"GRANT DQL ON SCHEMA sys TO sql_exporter;"}'
```
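The `'\''` escaping inside the `-d` payloads above is easy to get wrong. A sketch of an alternative: build the same JSON (same placeholder password) in a quoted here-doc first, then pass it to `curl` with `-d "$payload"`:

```shell
# Build the SQL payload in a quoted here-doc; no quote escaping needed.
payload=$(cat <<'EOF'
{"stmt": "CREATE USER sql_exporter WITH (password = 'insert_password');"}
EOF
)
echo "$payload"
```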

We then create a configuration file in `/etc/prometheus-sql-exporter.yml` with sample queries, starting with one that retrieves the number of shards per node:

```yaml
jobs:
- name: "global"
  interval: '5m'
  connections: ['postgres://sql_exporter:insert_password@ubuntuvm1:5433?sslmode=disable']
  queries:
  - name: "shard_distribution"
    help: "Number of shards per node"
    labels: ["node_name"]
    values: ["shards"]
    query: |
      SELECT node['name'] AS node_name, COUNT(*) AS shards
      FROM sys.shards
      GROUP BY 1;
    allow_zero_rows: true

  - name: "heap_usage"
    help: "Used heap space per node"
    labels: ["node_name"]
    values: ["heap_used"]
    query: |
      SELECT name AS node_name, heap['used'] / heap['max']::DOUBLE AS heap_used
      FROM sys.nodes;

  - name: "global_translog"
    help: "Global translog statistics"
    values: ["translog_uncommitted_size"]
    query: |
      SELECT COALESCE(SUM(translog_stats['uncommitted_size']), 0) AS translog_uncommitted_size
      FROM sys.shards;

  - name: "checkpoints"
    help: "Maximum global/local checkpoint delta"
    values: ["max_checkpoint_delta"]
    query: |
      SELECT COALESCE(MAX(seq_no_stats['local_checkpoint'] - seq_no_stats['global_checkpoint']), 0) AS max_checkpoint_delta
      FROM sys.shards;

  - name: "shard_allocation_issues"
    help: "Shard allocation issues"
    labels: ["shard_type"]
    values: ["shards"]
    query: |
      SELECT IF(s.primary = TRUE, 'primary', 'replica') AS shard_type, COALESCE(shards, 0) AS shards
      FROM UNNEST([true, false]) s(primary)
      LEFT JOIN (
        SELECT primary, COUNT(*) AS shards
        FROM sys.allocations
        WHERE current_state <> 'STARTED'
        GROUP BY 1
      ) a ON s.primary = a.primary;
```

*Please note: There are two implementations of the SQL Exporter, [burningalchemist/sql_exporter](https://github.com/burningalchemist/sql_exporter) and [justwatchcom/sql_exporter](https://github.com/justwatchcom/sql_exporter), and they do not share the same configuration options. Our example is based on the implementation shipped with the Ubuntu package, which is justwatchcom/sql_exporter.*

To apply the new configuration, we restart the service:

```shell
systemctl restart prometheus-sql-exporter
```

The SQL Exporter can also be used to monitor business metrics, but be careful with regularly running expensive queries. Below is another, more advanced monitoring query for CrateDB that may be useful:

```sql
/* Time since the last successful snapshot (backup) */
SELECT (NOW() - MAX(started)) / 60000 AS MinutesSinceLastSuccessfulSnapshot
FROM sys.snapshots
WHERE "state" = 'SUCCESS';
```
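The division by 60000 converts a millisecond difference into minutes. The same arithmetic in shell, with hypothetical epoch-millisecond timestamps standing in for `NOW()` and `MAX(started)`:

```shell
# Hypothetical epoch-millisecond values; the difference is 600000 ms.
now_ms=1700000600000
last_snapshot_ms=1700000000000
minutes=$(( (now_ms - last_snapshot_ms) / 60000 ))
echo "$minutes"
```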

## Prometheus setup

Prometheus runs on a machine that is not part of the CrateDB cluster. Install it with:

```shell
apt install prometheus --no-install-recommends
```

By default, Prometheus binds to :9090 without authentication. Prevent
auto-start during install (e.g., with `policy-rcd-declarative`), then
configure web authentication using a YAML file.

Create `/etc/prometheus/web.yml`:

```yaml
basic_auth_users:
  admin: <bcrypt hash>
```

Point Prometheus at it (e.g., in `/etc/default/prometheus`):

```shell
ARGS="--web.config.file=/etc/prometheus/web.yml --web.enable-lifecycle"
```

Restart Prometheus after setting ownership and 0640 permissions on `web.yml`.
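A sketch of that permissions step, run against a scratch file here; the real target is `/etc/prometheus/web.yml`, and on Debian the owner would be the `prometheus` user and group:

```shell
# Apply 0640 on a scratch file and read the mode back (GNU stat on Linux).
# For the real file, additionally: chown prometheus:prometheus <file>
f=$(mktemp)
chmod 0640 "$f"
mode=$(stat -c '%a' "$f")
echo "$mode"
rm -f "$f"
```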

For a large deployment where you also use Prometheus to monitor other systems,
you may want to use a CrateDB cluster as the storage for all Prometheus
metrics; you can read more about this at
[CrateDB Prometheus Adapter](https://github.com/crate/cratedb-prometheus-adapter).

Now we will configure Prometheus to scrape the Node Exporter metrics from
the CrateDB machines, as well as the metrics from our Crate JMX HTTP Exporter:

```shell
nano /etc/prometheus/prometheus.yml
```

Where it says:

```yaml
- job_name: 'node'
  static_configs:
  - targets: ['localhost:9100']
```

Replace it with the following jobs, covering port 9100 (Node Exporter),
port 8080 (Crate JMX Exporter), and port 9237 (SQL Exporter):

```yaml
- job_name: 'node'
  static_configs:
  - targets: ['ubuntuvm1:9100', 'ubuntuvm2:9100']
- job_name: 'cratedb_jmx'
  static_configs:
  - targets: ['ubuntuvm1:8080', 'ubuntuvm2:8080']
- job_name: 'sql_exporter'
  static_configs:
  - targets: ['localhost:9237']
```

Restart the `prometheus` daemon if it was already started (`systemctl restart prometheus`).

## Grafana setup

Grafana can be installed on the same machine where you installed Prometheus.
To install Grafana on a Debian machine, please refer to its [documentation][grafana-debian].
Then, start Grafana:

```shell
systemctl start grafana-server
```

Open `http://<grafana-host>:3000` to access the Grafana login screen.
The default credentials are `admin`/`admin`; change the password immediately.

Click on "Add your first data source", then click "Prometheus" and set the
URL to `http://<prometheus-host>:9090`.

If you configured basic authentication for Prometheus, this is where you need to enter the credentials.

Click "Save & test".

An example dashboard based on the discussed setup is available for easy importing on [grafana.com](https://grafana.com/grafana/dashboards/17174-cratedb-monitoring/). In your Grafana installation, on the left-hand side, hover over the "Dashboards" icon and select "Import". Specify the ID 17174 and load the dashboard. On the next screen, finalize the setup by selecting your previously created Prometheus data source.

![CrateDB monitoring dashboard in Grafana](/_assets/img/monitoring-grafana-dashboard.png)

## Alternative implementations

If you decide to build your own dashboard or to use an entirely different monitoring approach, we recommend still covering metrics similar to those discussed in this article. The list below is a good starting point for troubleshooting most operational issues:

* CrateDB metrics (with example Prometheus queries based on the Crate JMX HTTP Exporter)
  * Thread pool rejected operations: `sum(rate(crate_threadpools{property="rejected"}[5m])) by (name)`
  * Thread pool queue size: `sum(crate_threadpools{property="queueSize"}) by (name)`
  * Thread pools active: `sum(crate_threadpools{property="active"}) by (name)`
  * Queries per second: `sum(rate(crate_query_total_count[5m])) by (query)`
  * Query error rate: `sum(rate(crate_query_failed_count[5m])) by (query)`
  * Average query duration over the last 5 minutes: `sum(rate(crate_query_sum_of_durations_millis[5m])) by (query) / sum(rate(crate_query_total_count[5m])) by (query)`
  * Circuit breaker memory in use: `sum(crate_circuitbreakers{property="used"}) by (name)`
  * Number of shards: `crate_node{name="shard_stats",property="total"}`
  * Garbage collector rates: `sum(rate(jvm_gc_collection_seconds_count[5m])) by (gc)`
* Operating system metrics
  * CPU utilization
  * Memory usage
  * Open file descriptors
  * Disk usage
  * Disk read/write operations and throughput
  * Received and transmitted network traffic
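As a worked example of the average-query-duration expression above: `rate()` turns both counters into per-second increases over the window, so the division reduces to milliseconds accumulated per query executed. The same arithmetic with hypothetical 5-minute deltas:

```shell
# Hypothetical counter increases over one 5-minute window.
dur_delta_ms=1500   # growth of crate_query_sum_of_durations_millis
count_delta=30      # growth of crate_query_total_count
avg_ms=$(awk -v d="$dur_delta_ms" -v c="$count_delta" 'BEGIN { print d / c }')
echo "$avg_ms"
```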

## Wrapping up

We now have a Grafana dashboard that lets us check live and historical data around performance and capacity metrics in our CrateDB cluster. This illustrates one possible setup; you could use different tools depending on your environment and preferences. Still, we recommend using the interface of the Crate JMX HTTP Exporter to collect CrateDB-specific metrics, and always also monitoring the health of the environment at the OS level, as we have done here with the Prometheus Node Exporter.

[CrateDB Cloud]: https://cratedb.com/products/cratedb-cloud
[Grafana]: https://grafana.com/
[grafana-debian]: https://grafana.com/docs/grafana/latest/setup-grafana/installation/debian/
[Prometheus]: https://prometheus.io/
---
Review comment: Isn't this topic already covered elsewhere? Could we link to the existing "install a cluster" content instead? This would avoid repetition, and also avoids adjusting lots of places if we need to change anything in the setup guide.
Reply: That's true. I think the unique thing here is that the fundamental installation is followed up by educating users about the installation of the Crate JMX HTTP Exporter, which requires editing CrateDB's `/etc/default/crate` configuration file.
Reply: Of course, we can use the opportunity to break out and refactor those ingredients to a dedicated place and then refer to them, as you've suggested. Let's use the chance right away? If you agree, just signal 👍.