What Are Self-Healing Systems & How Can You Develop One?
When people get injured, their bodies self-heal. What if technology could do the same?
Companies are racing to develop self-healing systems, which could improve quality, cut costs and boost customer trust. For example, IBM is experimenting with ‘self-managing’ products that configure, protect and heal themselves.
What Is a Self-Healing System?
A self-healing system can discover errors in its functioning and make changes to itself without human intervention, thereby restoring itself to a better-functioning state. There are three levels of self-healing systems (application, system and hardware), each of which has its own size and resource requirements.
In typical applications, problems are documented in an ‘exceptions log’ for further examination. Most problems are minor and can be ignored. Serious problems may require the application to stop (for example, an inability to connect to a database that has been taken offline).
By contrast, self-healing applications incorporate design elements that resolve problems. For example, applications that use Akka arrange elements in a hierarchy and assign an actor’s problems to its supervisor. Many such libraries and frameworks facilitate applications that self-heal by design.
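The supervision idea can be sketched in a few lines. This is not Akka's API: `Supervisor`, `make_flaky_child` and the restart budget are hypothetical names chosen for illustration, and the sketch only shows the general pattern of a parent handling a child's failure by restarting it and escalating when restarts run out.

```python
from typing import Callable


class Supervisor:
    """Hypothetical supervisor: runs a child task and restarts it on failure."""

    def __init__(self, max_restarts: int = 3) -> None:
        self.max_restarts = max_restarts
        self.restarts = 0

    def run(self, child: Callable[[], str]) -> str:
        # Keep calling the child until it succeeds or the budget is spent.
        while True:
            try:
                return child()
            except Exception as error:
                if self.restarts >= self.max_restarts:
                    # Escalate: this supervisor gives up and reports upward.
                    raise RuntimeError("child failed too many times") from error
                self.restarts += 1


def make_flaky_child(failures_before_success: int) -> Callable[[], str]:
    """Build a child that fails a fixed number of times, then succeeds."""
    remaining = {"failures": failures_before_success}

    def child() -> str:
        if remaining["failures"] > 0:
            remaining["failures"] -= 1
            raise ConnectionError("simulated transient failure")
        return "ok"

    return child


supervisor = Supervisor(max_restarts=3)
result = supervisor.run(make_flaky_child(failures_before_success=2))
print(result, supervisor.restarts)  # prints "ok 2"
```

The caller never sees the two transient failures; it only sees an error if the child keeps failing past the restart budget.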
Unlike application-level self-healing, system-level self-healing does not depend on a programming language or specific components. Rather, it can be generalized and applied to all services and applications, independent of their internal components.
The most common system-level errors include process failures (often resolved by redeploying or restarting) and response time issues (often resolved by scaling and descaling). Self-healing systems conduct health checks on different components and automatically attempt fixes (such as redeploying) to return those components to their desired states.
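A minimal sketch of such a check-and-fix pass follows. The `Service` record, `check_health` and `restart` are hypothetical stand-ins; a real system would probe and restart actual processes or containers instead of flipping a flag.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Service:
    name: str
    healthy: bool = True
    restarts: int = 0


def check_health(service: Service) -> bool:
    # Stand-in for a real probe (HTTP ping, process check, and so on).
    return service.healthy


def restart(service: Service) -> None:
    # Stand-in for redeploying or restarting the real process.
    service.healthy = True
    service.restarts += 1


def heal(services: List[Service]) -> List[str]:
    """Run one pass: probe every service and restart the unhealthy ones."""
    healed = []
    for service in services:
        if not check_health(service):
            restart(service)
            healed.append(service.name)
    return healed


fleet = [Service("api"), Service("worker", healthy=False), Service("db")]
first_pass = heal(fleet)
second_pass = heal(fleet)
print(first_pass)   # prints ['worker']
print(second_pass)  # prints []
```

Because the pass knows nothing about what each service does internally, the same loop applies to any service that exposes a health check.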
Hardware-level self-healing redeploys services from an unhealthy node to a healthy one. It also conducts health checks on different components. Since true hardware-level self-healing (for example, a machine that can heal failed memory or repair a broken hard disk) does not exist, current hardware-level solutions are essentially system-level solutions.
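Moving services off an unhealthy node might look like the sketch below. The `reschedule` function, the node names and the "fewest services first" placement rule are all assumptions made for illustration; real schedulers weigh far more than service counts.

```python
from typing import Dict, List, Set


def reschedule(placement: Dict[str, List[str]],
               unhealthy: Set[str]) -> Dict[str, List[str]]:
    """Return a new placement with services moved off unhealthy nodes."""
    healthy_nodes = [node for node in placement if node not in unhealthy]
    if not healthy_nodes:
        raise RuntimeError("no healthy node available")

    # Start from what the healthy nodes already run.
    new_placement = {node: list(placement[node]) for node in healthy_nodes}

    for node, services in placement.items():
        if node not in unhealthy:
            continue
        for service in services:
            # Assumed rule: pick the healthy node running the fewest services.
            target = min(healthy_nodes, key=lambda n: len(new_placement[n]))
            new_placement[target].append(service)
    return new_placement


placement = {"node-a": ["api", "cache"], "node-b": ["worker"], "node-c": []}
healed = reschedule(placement, unhealthy={"node-a"})
print(healed)
```

The failed node itself is not repaired; its workload simply lands elsewhere, which is why this counts as a system-level technique.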
Reactive Versus Preventive Healing
Reactive healing is healing in response to an error and is already in widespread use. For example, redeploying an application to a new physical node in response to an error, thereby preventing downtime, is reactive healing.
The desirable level of reactive healing depends on how much risk a system can tolerate. For example, if a system relies on a single data center, the chance that the entire data center loses power and takes every node offline may be so slim that designing for it is unnecessarily expensive. For a critical system, however, it may make sense to design for automatic recovery after such an event.
Preventive healing proactively prevents errors. Take the example of preventing processing-time errors by using real-time data. You send an HTTP request to check the health of a service and measure how long it takes to respond. If the response takes more than 500 milliseconds, the system scales the service; if it takes less than 100 milliseconds, the system descales it, which frees resources.
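The threshold rule can be written down directly. In this sketch the 500 and 100 millisecond limits come from the example above, while `measure_ms`, `desired_instances` and the instance bounds are hypothetical names and values added for illustration.

```python
import time
from typing import Callable

SCALE_UP_MS = 500.0    # from the text: slower than this, scale
SCALE_DOWN_MS = 100.0  # from the text: faster than this, descale


def measure_ms(probe: Callable[[], object]) -> float:
    """Time one health-check call (for example, an HTTP GET) in milliseconds."""
    start = time.perf_counter()
    probe()
    return (time.perf_counter() - start) * 1000.0


def desired_instances(response_ms: float, current: int,
                      minimum: int = 1, maximum: int = 10) -> int:
    """Apply the two thresholds and return the new instance count."""
    if response_ms > SCALE_UP_MS:
        return min(current + 1, maximum)
    if response_ms < SCALE_DOWN_MS:
        return max(current - 1, minimum)
    return current  # between the thresholds: leave the service alone


print(desired_instances(650.0, current=2))  # prints 3
print(desired_instances(250.0, current=2))  # prints 2
print(desired_instances(40.0, current=2))   # prints 1
```

In practice `measure_ms` would wrap the actual HTTP health check, and its result would feed `desired_instances` on every polling interval.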
However, relying on real-time data alone can be troublesome if response times fluctuate widely, because the system will scale and descale constantly. This can consume a lot of resources in a rigid architecture, and a smaller amount in a microservices architecture.
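A short simulation makes the problem visible. The response times below are invented for illustration, and `decide` is a hypothetical helper that applies the same 500 and 100 millisecond thresholds from the example.

```python
from typing import List


def decide(response_ms: float, current: int) -> int:
    """Scale above 500 ms, descale below 100 ms, never drop below one instance."""
    if response_ms > 500.0:
        return current + 1
    if response_ms < 100.0:
        return max(current - 1, 1)
    return current


def count_scaling_actions(samples: List[float], start: int) -> int:
    """Count how many samples trigger a change in the instance count."""
    instances = start
    actions = 0
    for response_ms in samples:
        updated = decide(response_ms, instances)
        if updated != instances:
            actions += 1
        instances = updated
    return actions


noisy = [600.0, 80.0, 550.0, 90.0, 620.0, 70.0]    # invented, swinging widely
steady = [300.0, 280.0, 310.0, 290.0, 305.0, 295.0]  # invented, stable

print(count_scaling_actions(noisy, start=3))   # prints 6
print(count_scaling_actions(steady, start=3))  # prints 0
```

With the noisy series every single sample triggers a scaling action, yet the service ends at the same instance count it started with.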
Combining real-time and historical data is a better (and also more complex) preventive healing approach. Using our response time example, you design a system that stores response time, memory and CPU information and uses an appropriate algorithm to process it alongside real-time data to predict future needs. So, if memory usage has been increasing steadily for the past hour and reaches a critical point of 90 percent, your system determines that scaling is appropriate, thereby preventing errors.
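One possible reading of that rule is sketched below. The least-squares slope is an assumed stand-in for the "appropriate algorithm", and `trend_slope`, `should_scale` and the sample values are hypothetical; only the 90 percent critical point comes from the example above.

```python
from typing import Sequence


def trend_slope(values: Sequence[float]) -> float:
    """Least-squares slope of the values against their sample index."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def should_scale(history: Sequence[float], current: float,
                 critical: float = 90.0) -> bool:
    """Scale only if usage is trending upward AND has reached the critical point."""
    samples = list(history) + [current]
    if len(samples) < 2:
        return False  # no history yet, so no trend to confirm
    return trend_slope(samples) > 0 and current >= critical


rising = [60.0, 65.0, 70.0, 75.0, 80.0, 85.0]    # stored memory usage, percent
falling = [97.0, 96.0, 95.0, 94.0, 93.0, 92.0]

print(should_scale(rising, current=91.0))   # prints True
print(should_scale(falling, current=91.0))  # prints False
print(should_scale(rising, current=88.0))   # prints False
```

The second case shows what the stored history adds: usage sits above 90 percent but is already falling, so the system holds off instead of scaling.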
Designing Self-Healing Systems: Three Principles & a Five-Point Roadmap
The three principles:
- Know your system: If you have a deep understanding of your system, you will be better able to anticipate where a problem might occur and how you might respond. What scenarios are the most common? How serious are the errors that might occur?
- Design for prevention: Automation and distributed storage, computing and analytics make preventive approaches easy and affordable. A proactive, preventive approach can resolve errors before they occur.
- Make it easy for the humans in the loop: Self-healing systems reduce the maintenance burden on your team. Even when errors or potential errors require human intervention, design the process so that resolution is easy and intuitive for the humans involved. Your team will thank you!
The five-point roadmap:
- Use immutable infrastructure as code
- Automate testing to keep the codebase efficient
- Deploy holistic monitoring systems
- Employ leading-edge smart alerts, triggers and prescriptive analytics
- Think deeply about how the system can improve self-learning
Designing systems and applications that are self-healing (or even better, that automatically determine when errors might occur and prevent them) can improve quality, cut costs and boost customer trust. Even the best systems still require human intervention, but they can be designed so that the intervention is light-touch and easy for the people involved. Unlike self-healing software and services, self-healing hardware remains in the realm of science fiction, and its absence is prompting a newfound appreciation for biology and fresh interest in biological computing.