CLDSRV-573: Catch prom cluster timeout
Fix crashes of the primary caused by the prom-client 5s timeout.
These mostly happen at startup, when workers are not ready.
This should also fix the write EPIPE error in workers by preventing
the primary from crashing.

(cherry picked from commit bdb4f23)
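
The failure mode described above can be reproduced in isolation. The sketch below is only an illustration under stated assumptions: `fakeClusterMetrics` is a hypothetical stand-in for the prom-client cluster aggregation call, the 10 ms delay replaces the 5s timeout, and the error text is illustrative. It shows how wrapping the awaited call in try/catch turns a rejection into an ordinary callback error instead of leaving it unhandled.

```javascript
// Hypothetical stand-in for aggregatorRegistry.clusterMetrics(): it rejects
// after a delay when workers are not ready, like a timed-out IPC round trip.
function fakeClusterMetrics({ timeoutMs, workersReady }) {
    return new Promise((resolve, reject) => {
        if (workersReady) {
            resolve('# HELP demo_metric A demo metric\n');
            return;
        }
        setTimeout(() => reject(new Error('Operation timed out.')), timeoutMs);
    });
}

// Same shape as the patched handler: the rejection is caught and forwarded
// to the callback rather than escaping the async function.
async function routeHandlerSketch(opts, cb) {
    let metrics;
    try {
        metrics = await fakeClusterMetrics(opts);
    } catch (err) {
        return cb(err, { message: err.toString() });
    }
    return cb(null, { body: metrics });
}

// Run both cases and record what each callback received.
const results = {};
const done = Promise.all([
    routeHandlerSketch({ timeoutMs: 10, workersReady: false },
        (err, res) => { results.notReady = { err, res }; }),
    routeHandlerSketch({ timeoutMs: 10, workersReady: true },
        (err, res) => { results.ready = { err, res }; }),
]);
```

Without the try/catch, the first call would reject the promise returned by the async handler; if nothing handles that rejection, recent Node.js versions terminate the process by default, which is consistent with the crash this commit addresses.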
BourgoisMickael committed Nov 6, 2024
1 parent e31519c commit a796a05
Showing 1 changed file with 14 additions and 2 deletions.
16 changes: 14 additions & 2 deletions lib/utilities/monitoringHandler.js
@@ -24,7 +24,15 @@ async function routeHandler(req, res, cb) {
     if (req.method !== 'GET') {
         return cb(errors.BadRequest, []);
     }
-    const metrics = await aggregatorRegistry.clusterMetrics();
+    let metrics;
+    try {
+        // Catch timeout on IPC between worker and primary
+        // prom-client has a 5s hardcoded timeout
+        metrics = await aggregatorRegistry.clusterMetrics();
+    } catch (err) {
+        return cb(err, { message: err.toString() });
+    }
+
     const contentLen = Buffer.byteLength(metrics, 'utf8');
     res.writeHead(200, {
         'Content-Length': contentLen,
@@ -48,7 +56,11 @@ function monitoringHandler(clientIP, req, res, log) {
     function monitoringEndHandler(err, results) {
         writeResponse(res, err, results, error => {
             if (error) {
-                return log.end().warn('monitoring error', { err: error });
+                return log.end().warn('monitoring error', { err: {
+                    ...error,
+                    // For ArsenalError message is in description
+                    message: error.description || error.message,
+                } });
             }
             return log.end();
         });
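
The explicit `message` field in the logging change matters because object spread copies only own enumerable properties, and `message` on a standard `Error` is not enumerable. The following is a minimal sketch of that behaviour; `FakeArsenalError` and `toLoggable` are hypothetical names introduced for illustration and are not the real Arsenal class or a helper from this repository.

```javascript
// On a plain Error, `message` is an own but non-enumerable property,
// so spreading the error drops it.
const plain = new Error('Operation timed out.');
const spreadOnly = { ...plain };

// Hypothetical error type that keeps its human-readable text in
// `description`, as the diff comment says ArsenalError does.
class FakeArsenalError extends Error {
    constructor(type, description) {
        super(type);
        this.code = 400;
        this.description = description;
    }
}

// Same expression as in the patched log call: prefer `description`,
// fall back to `message`.
function toLoggable(error) {
    return {
        ...error,
        message: error.description || error.message,
    };
}

const loggedPlain = toLoggable(plain);
const loggedArsenal = toLoggable(
    new FakeArsenalError('BadRequest', 'request method not allowed'));
```

Here `spreadOnly` ends up with no `message` key at all, while both `toLoggable` results carry one, so the log entry always contains readable text for either kind of error.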