Schedule Evaluation
Incident Report for GorillaStack
Postmortem

I am sorry

I am sorry for the instability in our scheduler and the impact it had on you. The GorillaStack team takes a great deal of pride in our usual reliability, so yesterday was a sad day for us. Things don't always go to plan, but I hope some comfort can be drawn from our determination to remediate the initial issue and from our continued monitoring and search for the root cause.

Thank you for your understanding and continued support. We will strive to make events like this as infrequent as possible.

Introduction

A day on, things remain stable. We are keeping an eye on scheduler health and will continue to do so. The root cause has been identified, and plans have been drawn up to address it in the immediate, short and long term.

Root Cause

The good news is that we have identified the root cause.

The incident appears to have been caused by an underlying infrastructure issue: CPU on the docker host maxed out, causing our scheduling management containers to misbehave.

Digging deeper, we found that another customer on the platform was running a large number of misconfigured rules. These caused some of our action workers (the processes that perform the actions on our action page) to die fatally, increasing the strain on all docker hosts as the containers came back online.
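
To make the failure mode concrete, here is a deliberately simplified sketch of an action worker loop. The Rule shape and names are hypothetical, not our actual code; the point is that an uncaught error during evaluation takes down the whole process.

    type Rule = { id: string; schedule?: string; action?: string };

    // Evaluating a misconfigured rule throws.
    function evaluateRule(rule: Rule): void {
      if (!rule.schedule || !rule.action) {
        throw new Error(`rule ${rule.id} is misconfigured`);
      }
      // ... evaluate the schedule and perform the action ...
    }

    // Worker loop with no error handling: one bad rule throws, Node.js terminates
    // the process, the container exits, and the orchestrator restarts it. Repeated
    // across many bad rules, those restart cycles add load to every docker host.
    function runWorker(rules: Rule[]): void {
      for (const rule of rules) {
        evaluateRule(rule); // an uncaught throw here ends the whole worker process
      }
    }

    runWorker([{ id: "r-1", action: "stop-instances" }]); // exits with an error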

Immediate Items

  1. Add further validation to prevent misconfigured rules
  2. Catch errors at this point in rule evaluation, so that a fatal error in a single rule can no longer bring down our worker containers (both items are sketched below)
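
As a rough sketch of both items (illustrative only, using the same hypothetical Rule shape as above, not our production code):

    type Rule = { id: string; schedule?: string; action?: string };

    // Item 1: reject a badly configured rule up front with a clear message,
    // before it reaches the evaluation path.
    function validateRule(rule: Rule): string[] {
      const problems: string[] = [];
      if (!rule.schedule) problems.push("missing schedule");
      if (!rule.action) problems.push("missing action");
      return problems;
    }

    // Item 2: wrap evaluation so one bad rule is logged and skipped rather than
    // crashing the whole worker container.
    function runWorkerSafely(rules: Rule[], evaluate: (rule: Rule) => void): void {
      for (const rule of rules) {
        const problems = validateRule(rule);
        if (problems.length > 0) {
          console.error(`skipping rule ${rule.id}: ${problems.join(", ")}`);
          continue;
        }
        try {
          evaluate(rule);
        } catch (err) {
          console.error(`rule ${rule.id} failed to evaluate`, err);
          // the worker keeps running; the failure is surfaced for follow-up instead
        }
      }
    }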

Short Term

  1. Improve log retention, consistency, structure and quality, which was a challenge during the investigation (see the logging sketch below)
  2. Execute our planned migration from Rancher to AWS Batch for a large swathe of our docker services (this was already under active development)
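
For the logging item, a minimal sketch of one way to make worker logs consistent and queryable: structured, one-JSON-object-per-line output. The field and service names here are hypothetical.

    type LogLevel = "debug" | "info" | "warn" | "error";

    // Emit one JSON object per line so logs are consistent across services,
    // cheap to retain centrally, and easy to query during an investigation.
    function log(level: LogLevel, message: string, context: Record<string, unknown> = {}): void {
      console.log(
        JSON.stringify({
          timestamp: new Date().toISOString(),
          level,
          service: "action-worker", // hypothetical service name
          message,
          ...context,
        })
      );
    }

    log("error", "rule failed to evaluate", { ruleId: "r-1", host: "docker-host-3" });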

Long Term

  1. Decommission Rancher, moving the remaining services that are not in AWS Batch to Kubernetes
  2. Explore how we can implement better eventing (perhaps CNCF CloudEvents) across our services to expedite future incident investigations; a sketch of such an event follows
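
To illustrate the eventing item, here is a hypothetical sketch of what a CloudEvents-style envelope for one of our internal events might look like. The attribute names follow the CNCF CloudEvents spec; the event type, source and payload are invented for illustration, and the exact attribute set will depend on the spec version we adopt.

    // Hypothetical example only; not an event our services emit today.
    const ruleEvaluationFailed = {
      specversion: "1.0",
      id: "a1b2c3d4-0000-4000-8000-000000000001",
      source: "/scheduler/action-worker",
      type: "com.gorillastack.rule.evaluation.failed",
      time: "2018-10-18T05:10:00Z",
      datacontenttype: "application/json",
      data: {
        ruleId: "r-1",
        reason: "misconfigured schedule",
      },
    };

    console.log(JSON.stringify(ruleEvaluationFailed, null, 2));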

I hope this extra information helps. Please reach out if you have any questions.

I appreciate your understanding and patience.

Regards,

Elliott

Posted Oct 19, 2018 - 14:44 AEDT

Resolved
We're continuing to monitor and to evaluate options for more proactive scheduler health checks, but the scheduler is fully operational, so continuing to mark it as having "degraded performance" would be inaccurate.

We apologise for any impact!
Posted Oct 18, 2018 - 18:51 AEDT
Monitoring
The scheduler has been reloaded, resolving the current instance of this issue. We are now performing deeper root cause analysis to understand why the scheduler evaluated a handful of rules incorrectly.
Posted Oct 18, 2018 - 16:10 AEDT
This incident affected: Scheduler.