WMArchive Grafana Monitoring Down / No Data #11960
Comments
Here is my reply through the email thread to Todor and the CMS Monitoring team: Regarding the WMArchive issue, if you look closely at the reported error you'll see that it comes from nginx, i.e.
So, it is an issue on the nginx k8s frontend, which is supposed to pass the request to the backend server (in this case WMArchive). Therefore, the actual issue is on the CMSWEB side and not in WMCore or WMArchive. In other words, our k8s frontend rejects the request based on its size. I suggest you check with Imran/Aroosha what the current nginx limit is on k8s; most likely it needs to be increased. Please note that from the reported error we can't see the actual size of the request, since it is payload data sent over an HTTP POST. A long-term solution, which should be put in place, is compression imposed on the WMCore side: first compress the payload (e.g. with gzip) and then send it over to WMArchive (see the sketch below). But this will require additional effort on several fronts:
Finally, another approach is to simply review what is sent from WMCore and limit it based on the current nginx settings. |
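For illustration, a minimal sketch of the compression idea mentioned above, assuming a `requests`-based client and a placeholder endpoint; this is not the WMCore implementation, which would also need matching decompression on the server side:

```python
import gzip
import json

import requests  # assumed to be available in the WM environment

# Hypothetical endpoint URL; the real service sits behind the CMSWEB frontend.
WMARCHIVE_URL = "https://cmsweb.cern.ch/wmarchive/data/"


def post_compressed(docs):
    """Gzip-compress a list of FWJR documents and POST them in one request."""
    payload = json.dumps({"data": docs}).encode("utf-8")
    compressed = gzip.compress(payload)
    headers = {
        "Content-Type": "application/json",
        # The receiving side would need to decompress based on this header.
        "Content-Encoding": "gzip",
    }
    return requests.post(WMARCHIVE_URL, data=compressed, headers=headers)
```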
I would suggest dumping at least 1 or 2 documents and inspecting their construction and where the heavy data is. We might want to refactor it in WMAgent, in case the information isn't relevant (e.g. a free-text error message might be a good candidate to be dropped in WMArchive). |
Some of the logs can also be seen on the nginx controller on Kubernetes:
Although @arooshap has just raised the nginx limit from 1MB to 8MB, these sizes are around 20MB, so docs like these are still failing. |
Raising the threshold may be a temporary solution, since we never really know the size of the payload. I would rather see how it can be constrained in the WM system, which by definition knows the size when it constructs the HTTP request. Now we know the nginx threshold and the WM system should acknowledge it. |
I posted a first draft of the fix in #11967, which basically checks the newly created WMArchive doc within the WM component before sending it over to the WMArchive service (see the sketch after this comment). As such we can accomplish a few things:
Once we investigate further the cause of such large sizes, we can inspect those documents from the local CouchDB (the document id is printed to the log) and identify the source of the large document size. After that a more concrete remedy can be put in place to avoid the creation of such large documents, and we can restore the data flow to WMArchive. |
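A minimal sketch of the kind of pre-send size check described above; the threshold, function name, and document fields are illustrative, not the actual code from #11967:

```python
import json
import logging

# Illustrative threshold, roughly matching the nginx limit discussed in this thread.
MAX_DOC_SIZE = 8 * 1024 * 1024  # 8 MB


def filter_oversized(docs):
    """Keep only documents whose serialized size fits under the frontend limit,
    logging the ids of skipped documents so they can be inspected in CouchDB."""
    kept = []
    for doc in docs:
        size = len(json.dumps(doc).encode("utf-8"))
        if size > MAX_DOC_SIZE:
            logging.warning("Skipping WMArchive doc %s: %d bytes exceeds %d bytes",
                            doc.get("_id", "unknown"), size, MAX_DOC_SIZE)
            continue
        kept.append(doc)
    return kept
```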
@vkuznet we changed the nginx limit to accommodate data of any size. This setting was also applied in the older cluster, but it was not mentioned in the upgrade procedure, so it was skipped. I am letting you know in case you did not see the discussion on Mattermost! |
@arooshap, does it mean that there is no limit now on nginx? If this is the case we'll get a problem on the MONIT side, since it also has an internal constraint. I hope @leggerf or @brij01 or @nikodemas can remind us what the current MONIT limit is for the JSON documents we send; if I recall from the past it was around 20-30MB. In other words, increasing or removing limits on nginx does not solve the entire problem of creating big WMArchive docs, since their content should not be that big for monitoring purposes, and we still must identify what those documents are and how to fix them before sending to MONIT. I just fear that by removing the nginx limit we delegate the problem from CMSWEB to MONIT, and sooner or later the MONIT team will complain about such large documents. |
@vkuznet yes, that is exactly it. Apparently, it was decided to use these parameters after careful consideration a few years ago. I proposed that we have a discussion among the teams to decide on concrete values for these limits (as you and @belforte mentioned, it is not good to have no limits, as we are working under the assumption that it might break things in the future). Maybe I can open a ticket to discuss this, what do you think? |
As Alan already suggested, we can start by looking at past statistics and start with a limit which envelopes what we do, while protecting against new problems. Then we can look at the cost of keeping it vs. the cost of reducing and adapting applications. |
Aroosha, we may now face a few problems:
I suggest that for item 1 you may use the hey tool and perform performance studies of how it may impact the CMSWEB nginx/FE if you pass around 20-50MB of JSON payload to it. This can also help to identify the impact on WMArchive, item 3. But for item 2 we need input from the CMSMonitoring team to tell us the exact limit from the MONIT infrastructure, and I would suggest bounding nginx within this limit at least, because we don't have access to the MONIT infrastructure, which is bigger than CMSWEB, and the MONIT/IT team will complain since CMS docs may impact the performance of Kafka, etc. Bottom line, simply removing the limit may hit us back. I would rather keep some limit, at least driven by the threshold from MONIT. |
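To drive such a test, one needs a large JSON payload on disk that a load-testing tool like hey can POST; here is a sketch of generating a synthetic ~20MB document (purely illustrative, not a real FWJR):

```python
import json
import os

# Build a synthetic document of roughly the target size by padding a free-text
# field, mimicking the large error messages suspected in real FWJRs.
target_size = 20 * 1024 * 1024  # ~20 MB
doc = {
    "task": "/synthetic/load-test",
    "steps": [{"name": "cmsRun1", "errors": [{"details": "x" * target_size}]}],
}

with open("large_payload.json", "w") as fobj:
    json.dump({"data": [doc]}, fobj)

print("payload size:", os.path.getsize("large_payload.json"), "bytes")
```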
IIUC there was no limit in nginx until two weeks ago. A limit is good, but it is not a disaster to start with the same configuration as we had in the last years. FWIW, if one puts their hands in WMArchive I suggest critically reviewing the need for it. It dates from the times when we did not have HDFS etc. Do we really need all that info? IIUC the only user now is P&R operations (Jen), who finds some useful plots in Grafana of OK/fail vs. campaign/workflow/agent, which could also be filled with information already collected by the HTCondor spider. |
MONIT accepts documents up to 30MB and they recommend staying under this limit. |
I have monitoring! I can see what is going on for the last 24 hrs. Thank you! |
Just an update on my previous answer about the 30MB limit - there are some problems when going near or above the limit. If the message is larger than the limit, it will simply be rejected, so basically the data will be lost. Otherwise, if some message is only slightly under the limit and for some reason the compression that is done internally on MONIT's side doesn't work too well on it, it can get stuck in Flume and disturb the whole data ingestion pipeline for some time (I think this happened a few months ago). Therefore, may I ask if there are any plans to change anything regarding this issue? |
And another update - one of the possible suggestions from @vkuznet was to send already compressed data to MONIT, however we were just told that currently the MONIT infrastructure only accepts |
Do we really need all that data in MONIT? In a single document? |
A reasonable/standard document should not be greater than 1 or 2MB. |
@amaltaro ok, thanks for the information! |
This operational issue has likely been fixed from different angles:
Given that WMArchive monitoring has been functional for a couple of months and the immediate issue has been resolved, I think we should close this issue out and work on planned developments. Here is a new ticket to be considered in the coming quarters: #12043. Closing this one out, thank you to everyone who contributed to this resolution! |
Impact of the bug
WMArchive is an essential dashboard utilized by P&R for day-to-day operations, to investigate failing workflows and issues with sites. This monitoring being down severely affects P&R operations.
Describe the bug
The WMArchive Grafana dashboard has been missing data since 27th March, as reported by Jen on Mattermost.
As I understand, the Failed Workflow Job reports are ingested to the `/wmarchive/data/` endpoint by the `ArchiveDataPoller`. There are Failed Workflow Job reports (FWJR) that are too large to be digested by the system, and the CMSWeb cluster returns an HTTP error code `413 Request Entity Too Large`. The following `ArchiveDataPoller` log was pasted by @todor-ivanov in the Mattermost thread.
How to reproduce it
Steps to reproduce the behavior:
Submit a "large" FWJR to the
/wmarchive/data/
endpoint.Not sure whats the actual size of the payload that will qualify as being "large" and make the endpoint return the
413 Request Entity Too Large
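A rough sketch of such a request with a synthetic oversized payload; the certificate paths and payload construction are placeholders, and a real reproduction would use a dumped FWJR document:

```python
import requests

# Synthetic oversized payload: ~20 MB of free text inside a single document.
large_doc = {"task": "/test/repro", "error": "x" * (20 * 1024 * 1024)}

resp = requests.post(
    "https://cmsweb.cern.ch/wmarchive/data/",
    json={"data": [large_doc]},
    cert=("usercert.pem", "userkey.pem"),  # placeholder client certificate/key
    timeout=300,
)
# Expect 413 Request Entity Too Large when the frontend limit is exceeded.
print(resp.status_code, resp.reason)
```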
.Expected behavior
The `/wmarchive/data/` endpoint should be able to handle "large" FWJRs and the WMArchive dashboard should be working as usual.