Announcing OpenStack’s Self-healing SIG

November 24, 2017 4:15 pm

One of the biggest promises of the cloud vision was the idea that all infrastructure could be managed in a policy-driven fashion, reacting to failures and other events by automatically healing and optimising services.

In OpenStack, most of the components required to implement such an architecture already exist, and are nicely scoped, for the most part without too much overlap:

However, there is not yet a clear strategy within the community for how these should all tie together. (The OPNFV community is arguably further ahead in this respect, but hopefully some of their work could be applied outside NFV-specific environments.)

Designing a new SIG

To address this, I organised an unofficial kick-off meeting at the PTG in Denver, at which it became clear that there was sufficient interest from many of the above projects to create a new “Self-healing” SIG. However, there were still open questions:

  1. What exactly should be the scope of the SIG? Should it be for developers and operators, or also end users?
  2. What should the name be? Is “self-healing” good enough, or should it also include, say, non-failure scenarios like optimization?

In an attempt to answer these, I formally proposed the creation of the SIG, asking the community to fill in a short survey to vote on its creation, and to provide their feedback regarding the name and scope. Unfortunately, whilst respondents unanimously supported its creation, opinions were split more or less 50/50 on the name and the scope! So on advice from Thierry, I listed the SIG as “forming”, created the corresponding wiki page, and proposed a session for the Sydney Forum, which was subsequently accepted.

A SIG is born!

We had around 30 people attend the Sydney Forum session, which was extremely encouraging! You can read more details in the etherpad, but here is the quick summary …

Most importantly, we resolved the naming and scoping issues, concluding that to avoid biting off too much in one go, it was better to be pragmatic and start small:

  • Initially focus on cloud infrastructure, and not worry too much about the user-facing impact of failures yet; we can add that concern whenever it makes sense (which is particularly relevant for telcos / NFV).
  • Not worry too much about optimization initially; Watcher is possibly the only project focusing on this right now, and again we can expand to include optimization any time we want.

So now that the naming and scoping issues are resolved, I am excited to announce that the Self-healing SIG is officially formed!

Discussion went beyond mere administrivia, however:

  • We collected a few initial use cases.
  • We informally decided the governance of the SIG. I asked if anyone else would like to assume leadership, but no one seemed keen, dashing my hopes of avoiding extra work 😉 But Eric Kao, PTL of Congress, generously offered to act as co-chair.
  • We discussed health check APIs, which were mentioned in at least 2 or 3 other Forum sessions this time round.
  • We agreed that we wanted an IRC channel, and that it could host bi-weekly meetings. However, as usual, there was no clean solution to choosing a time which would suit everyone ;-/ I’ll try to figure out what to do about this!

Get involved

You are warmly invited to join, if this topic interests you:

Next steps

I have sent out a similar announcement to the mailing list, and next will set up the IRC channel, and see if we can make progress on agreeing times for regular IRC meetings.

Other than this administrivia, it is of course up to the community to decide in which direction the SIG should go, but my suggestions are:

  • Continue to collect use cases. It makes sense to have a very lightweight process for this (at least, initially), so Eric has created a Google Doc and populated it with a suggested template and a first example. Feel free to add your own based on this template.
  • Collect links to any existing documentation or other resources which describe how existing services can be combined. This awesome talk on Advanced Fault Management with Vitrage and Mistral is a perfect example, and here is another, but we need to make it easier for operators to understand which combinations like this are possible, and easier for them to be set up.
  • Finish the architecture diagram drafted in Denver.
  • At a higher level, we could document reference stacks which address multiple self-healing cases.
  • Talk more with the OPNFV community to find out what capabilities they have which could be reused within non-NFV OpenStack clouds.
  • Perform gap analysis on the use cases, and liaise with specific projects to drive development in directions which can address those gaps.

The origin of the idea for the SIG

In case you’re interested in the history …

I first became aware of the need for this SIG while working upstream within the community on OpenStack HA – specifically on compute plane HA, where failures of compute nodes or hypervisors are automatically handled by resurrecting affected VMs on other compute nodes. I saw many groups independently trying to solve the same problem, so I created the #openstack-ha IRC channel, organised weekly meetings, and tried to bring all stakeholders together to converge on a single upstream solution. Progress was gradually made, which we presented in Austin, Boston, and most recently in Tel Aviv at OpenStack Day Israel 2017.

After the talk I had a great conversation with Ifat Afek, who is the PTL of OpenStack Vitrage, which is an awesome project providing RCA (Root Cause Analysis) of faults within OpenStack. Since Vitrage can do things like receive an alert about a fault on a compute node (e.g. from Aodh) and then automatically determine all affected VMs and call out to another service like Mistral to enact appropriate remediation, there was obvious synergy between our work.

However, Vitrage goes much further than just compute HA: since it can receive various alerts from multiple types of data source, model relationships between many types of resource, and trigger external services to take action, this kind of combination has tremendous potential for building automatically self-healing cloud infrastructure. And as shown above, there are several other OpenStack projects operating in the same space, which could take this approach further; for example, Congress can be used to specify policies regarding how failures should be handled.
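To make the alert → affected resources → remediation flow described above a little more concrete, here is a minimal Python sketch. It is purely illustrative: it does not use the real Vitrage, Aodh or Mistral APIs, and every name in it (Alert, ResourceGraph, SelfHealer, evacuate, the host and VM names) is hypothetical. It only shows the general shape of the idea, assuming a simple "host contains VM" relationship model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass(frozen=True)
class Alert:
    """A fault notification (hypothetical shape, not a real Aodh alarm)."""
    source: str    # which monitoring service raised it, e.g. "aodh"
    resource: str  # the resource at fault, e.g. "compute-1"
    kind: str      # the type of fault, e.g. "host_down"


class ResourceGraph:
    """Toy model of relationships between resources (parent hosts child)."""

    def __init__(self) -> None:
        self._children: Dict[str, Set[str]] = {}

    def add_relationship(self, parent: str, child: str) -> None:
        self._children.setdefault(parent, set()).add(child)

    def affected_by(self, resource: str) -> List[str]:
        """Return every resource that transitively depends on `resource`."""
        seen: Set[str] = set()
        stack = [resource]
        while stack:
            current = stack.pop()
            for child in self._children.get(current, ()):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        return sorted(seen)


class SelfHealer:
    """Maps alert kinds to remediation actions for affected resources."""

    def __init__(self, graph: ResourceGraph) -> None:
        self.graph = graph
        self._actions: Dict[str, Callable[[str], str]] = {}

    def on(self, kind: str, action: Callable[[str], str]) -> None:
        self._actions[kind] = action

    def handle(self, alert: Alert) -> List[str]:
        action = self._actions.get(alert.kind)
        if action is None:
            return []  # no remediation registered for this kind of fault
        return [action(r) for r in self.graph.affected_by(alert.resource)]


def evacuate(vm: str) -> str:
    # Stand-in for calling out to an external workflow service.
    return f"evacuate {vm}"


graph = ResourceGraph()
graph.add_relationship("compute-1", "vm-a")
graph.add_relationship("compute-1", "vm-b")
graph.add_relationship("compute-2", "vm-c")

healer = SelfHealer(graph)
healer.on("host_down", evacuate)

actions = healer.handle(Alert("aodh", "compute-1", "host_down"))
print(actions)  # ['evacuate vm-a', 'evacuate vm-b']
```

In a real deployment each of these roles would be played by a separate service, and a policy layer could decide which action is registered for which kind of fault; the sketch collapses all of that into one process purely for readability.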

Talking with Ifat resulted in the idea of creating a new SIG with the goals of identifying self-healing use cases, establishing and documenting what can already be achieved by combining existing OpenStack services, and enhancing collaboration between the projects and with operators to fill in any remaining gaps. And now you know the rest of the story 🙂

