I am sorry for the instability in our scheduler and the impact it had on you. The GorillaStack team takes great pride in our usual reliability, so yesterday was a sad day for us. Things don't always go to plan, but I hope you can take some comfort from our determination to remediate the initial issue, our continued monitoring, and our search for a root cause.
Thank you for your understanding and continued support. We will strive to make events like this as infrequent as possible.
A day on, and things remain stable. We are keeping a close eye on scheduler health and will continue to do so. A root cause has been identified, and plans have been drawn up to address it in the immediate, short, and longer term.
The good news is that we have identified the root cause.
It appears to have been caused by an underlying infrastructure issue: CPU on the Docker host maxed out, causing our scheduling management containers to misbehave.
We dug deeper and found that another customer on the platform was running a large number of misconfigured rules. These caused some of our action workers (the processes that carry out the actions shown on our actions page) to crash fatally, and the churn of containers restarting increased the strain on all of our Docker hosts.
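For readers curious about the failure mode, here is a minimal, hypothetical sketch (the function names and rule fields are illustrative, not our actual code) of how a worker loop can isolate per-rule failures, so that a single misconfigured rule is logged and skipped rather than killing the whole worker process and its container:

```python
# Hypothetical sketch only: isolating per-rule failures in a worker loop
# so one bad rule cannot crash the process and trigger restart churn.
import logging

logger = logging.getLogger("action-worker")

def execute_rule(rule):
    """Placeholder for the real action-execution logic (illustrative)."""
    if not rule.get("action"):
        raise ValueError(f"rule {rule.get('id')} is missing an action")
    # ... perform the action here ...

def run_worker(rules):
    for rule in rules:
        try:
            execute_rule(rule)
        except Exception:
            # Log the failure and move on, instead of letting the
            # exception terminate the worker (and its container).
            logger.exception("skipping misconfigured rule %s", rule.get("id"))

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    run_worker([
        {"id": "r1", "action": "stop_instances"},   # well-formed rule
        {"id": "r2"},                               # misconfigured: no action
        {"id": "r3", "action": "start_instances"},  # still runs despite r2
    ])
```

Defensive handling along these lines is one common way to prevent the restart storms that drove up CPU load across the hosts.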
I hope this extra information helps. Please reach out if you have any questions.
I appreciate your understanding and patience.
Regards,
Elliott