There are multiple scenarios when monitoring cluster health is crucial. A common example is using Infinispan together with monitoring and log management platform (Kibana, Splink, Zabbix etc). Another interesting example is the Cloud. Health check is often required to plug an instance into a load balancer (in Kubernetes this is called Readiness probe. Finally, the health check API is used for a Kubernetes Rolling update procedure which is a necessary for implementing Infinispan Rolling Upgrades (see ISPN-6673)
The health check API should resolve the following use cases:
-
Monitoring tools which could use JMX/CLI/REST interface
-
The Rolling upgrade procedure with Kubernetes (Readiness and liveness probes)
The health check API should expose:
-
number of machines in the cluster
-
cluster name
-
overall cluster status (
UNHEALTHY
,HEALTHY
andREBALANCING
(operating but rebalance is in progress, don’t change the nodes)) -
per cache status
-
log tail (last X lines)
The Health API is implemented as Java classes and exposed using different interfaces (including MBean, WildFly resources etc). This approach makes the API and exposing parts loosely coupled.
The Health API can be easily accessed in embedded mode by using embeddedCacheManager.getHealth()
. The implementation is heavily inspired by Cache Manager statistics since the functionality is very similar.
Exposing the Health API in embedded mode might be a bit problematic (especially when considering REST). One may consider using sophisticated tools for this such as Jolokia or Spring Boot Metrics.
The Health API is plugged into WildFly integration bits (/server/integration/infinispan
) as Metrics (runtime only resources which are read-only). Again, the main inspiration were previously implemented Metrics and Statistics.
WildFly exposes all management resources via HTTP Interface. Unfortunately since the Health API is exposed in runtime it needs to be invoked using HTTP POST method (GET allows to retrieve only standard attributes and resources).
curl --digest -L -D - "http://localhost:9990/management/subsystem/datagrid-infinispan/cache-container/clustered/health/HEALTH?operation=resource&include-runtime=true&json.pretty=1" --header "Content-Type: application/json" -u ispnadmin:ispnadmin HTTP/1.1 401 Unauthorized Connection: keep-alive WWW-Authenticate: Digest realm="ManagementRealm",domain="/management",nonce="AuZzFxz7uC4NMTQ3MDgyNTU1NTQ3OCfIJBHXVpPHPBdzGUy7Qts=",opaque="00000000000000000000000000000000",algorithm=MD5,qop="auth" Content-Length: 77 Content-Type: text/html Date: Wed, 10 Aug 2016 10:39:15 GMT HTTP/1.1 200 OK Connection: keep-alive Authentication-Info: nextnonce="AuZzFxz7uC4NMTQ3MDgyNTU1NTQ3OCfIJBHXVpPHPBdzGUy7Qts=",qop="auth",rspauth="b518c3170e627bd732055c382ce5d970",cnonce="NGViOWM0NDY5OGJmNjY0MjcyOWE4NDkyZDU3YzNhYjY=",nc=00000001 Content-Type: application/json; charset=utf-8 Content-Length: 1927 Date: Wed, 10 Aug 2016 10:39:15 GMT { "cache-health" : "HEALTHY", "cluster-health" : ["test", "HEALTHY"], "cluster-name" : "clustered", "free-memory" : 96778, "log-tail" : [ "2016-08-10 11:54:14,706 INFO [org.infinispan.server.endpoint] (MSC service thread 1-5) DGENDPT10001: HotRodServer listening on 127.0.0.1:11222", "2016-08-10 11:54:14,706 INFO [org.infinispan.server.endpoint] (MSC service thread 1-1) DGENDPT10001: MemcachedServer listening on 127.0.0.1:11211", "2016-08-10 11:54:14,785 INFO [org.jboss.as.clustering.infinispan] (MSC service thread 1-6) DGISPN0001: Started ___protobuf_metadata cache from clustered container", "2016-08-10 11:54:14,800 INFO [org.jboss.as.clustering.infinispan] (MSC service thread 1-6) DGISPN0001: Started ___script_cache cache from clustered container", "2016-08-10 11:54:15,159 INFO [org.jboss.as.clustering.infinispan] (MSC service thread 1-5) DGISPN0001: Started ___hotRodTopologyCache cache from clustered container", "2016-08-10 11:54:15,210 INFO [org.infinispan.rest.NettyRestServer] (MSC service thread 1-6) ISPN012003: REST server starting, listening on 127.0.0.1:8080", "2016-08-10 11:54:15,210 INFO [org.infinispan.server.endpoint] (MSC service thread 1-6) DGENDPT10002: REST mapped to /rest", "2016-08-10 11:54:15,306 INFO [org.jboss.as] (Controller Boot Thread) WFLYSRV0060: Http management interface listening on http://127.0.0.1:9990/management", "2016-08-10 11:54:15,307 INFO [org.jboss.as] (Controller Boot Thread) WFLYSRV0051: Admin console listening on http://127.0.0.1:9990", "2016-08-10 11:54:15,307 INFO [org.jboss.as] (Controller Boot Thread) WFLYSRV0025: Infinispan Server 9.0.0-SNAPSHOT (WildFly Core 2.2.0.CR9) started in 8681ms - Started 196 of 237 services (121 services are lazy, passive or on-demand)" ], "number-of-cpus" : 8, "number-of-nodes" : 1, "total-memory" : 235520 }%
Response codes: * 401 Unauthorized if user credentials are incorrect * 200 If everything is fine
The WildFly HTTP Management API allows querying whole subtree of resources as well as single attributes.
For more information, see https://docs.jboss.org/author/display/WFLY10/The+HTTP+management+API
$curl -XGET 'http://localhost:9200/_cluster/health?pretty=true' { "cluster_name" : "testcluster", "status" : "green", "timed_out" : false, "number_of_nodes" : 2, "number_of_data_nodes" : 2, "active_primary_shards" : 5, "active_shards" : 10, "relocating_shards" : 0, "initializing_shards" : 0, "unassigned_shards" : 0, "delayed_unassigned_shards": 0, "number_of_pending_tasks" : 0, "number_of_in_flight_fetch": 0, "task_max_waiting_in_queue_millis": 0, "active_shards_percent_as_number": 100 }
$curl -XGET 'http://localhost:8080/metrics' { "counter.status.200.root": 20, "counter.status.200.metrics": 3, "counter.status.200.star-star": 5, "counter.status.401.root": 4, "gauge.response.star-star": 6, "gauge.response.root": 2, "gauge.response.metrics": 3, "classes": 5808, "classes.loaded": 5808, "classes.unloaded": 0, "heap": 3728384, "heap.committed": 986624, "heap.init": 262144, "heap.used": 52765, "nonheap": 0, "nonheap.committed": 77568, "nonheap.init": 2496, "nonheap.used": 75826, "mem": 986624, "mem.free": 933858, "processors": 8, "threads": 15, "threads.daemon": 11, "threads.peak": 15, "threads.totalStarted": 42, "uptime": 494836, "instance.uptime": 489782, "datasource.primary.active": 5, "datasource.primary.usage": 0.25 }
$ echo mntr | nc localhost 2185 zk_version 3.4.0 zk_avg_latency 0 zk_max_latency 0 zk_min_latency 0 zk_packets_received 70 zk_packets_sent 69 zk_outstanding_requests 0 zk_server_state leader zk_znode_count 4 zk_watch_count 0 zk_ephemerals_count 0 zk_approximate_data_size 27 zk_followers 4 - only exposed by the Leader zk_synced_followers 4 - only exposed by the Leader zk_pending_syncs 0 - only exposed by the Leader zk_open_file_descriptor_count 23 - only available on Unix platforms zk_max_file_descriptor_count 1024 - only available on Unix platforms