Operational Readiness
Writing code can be fun, but running code in a secure, scalable and reliable manner is an entirely different skillset.
With Agile and DevOps came an age of fast and dynamic software development, as discussed in Building an Agile Culture.
Fast prototyping can now be done cheaply and easily, but Operational Readiness, the step of making that prototype “production-ready”, is often missed. It is far too easy to take the running prototype that stakeholders like and call it production: the project is closed, and developers move to the next project. Wait, who’s going to be supporting that new solution?
The first challenge can hopefully be answered by the concept of DevOps service teams and product ownership. If an application or a service is running for your organization, it must be important to someone and it must have a cognizant owner. Yes, running many applications does take human resources, so staff it properly. Service teams can absolutely support more than one service, but keep in mind that gradually their workload will increase to more than they can properly support.
Operational readiness is a structured process of ensuring that the operations team acquires the tools, skills and documentation needed to operate and maintain a newly completed solution and supporting infrastructure. Therefore, it’s crucial to begin the operational readiness process early in the project planning and execution phase.
Keep in mind that patching will always be required. Even with serverless solutions, old interpreter versions will always be deprecated and require changes and testing. Let’s be realistic: this isn’t magic. It is not really serverless, it is just running on someone else’s server. The bar on the shared responsibility model can move, but ultimately the responsibility for the cybersecurity compliance of the solution is yours.
DevOps Practices
A few years ago, I was charged with maintaining the Tomcat infrastructure for a company, and I started noticing that the .war files that one developer gave me to deploy were consistently about twice the size of what other developers were sending me. It turned out that the compiler settings on his workstation were not set to compress, unlike the other developers. What else was different in his settings?
The bottom line is: don’t compile code locally. If you can’t use a build pipeline, at the very least have your developers compile code on a shared system.
Rule #1: Don’t compile production code on your local workstation
Along the same lines, I don’t care if you didn’t change a single line of code: if you recompile a new binary, it is now a brand new release that needs to be tested all over again. To that end, compiling code for a specific environment (such as a Dev build that differs from the Prod build) is definitely an anti-pattern that must be avoided. Compile a generic package and pass environment-specific parameters during deployment. The same package must be deployed in all environments.
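As a minimal sketch of what “pass environment-specific parameters during deployment” can look like, the snippet below reads its settings from environment variables at startup, so the same package runs unchanged in Dev and Prod. The variable names (`APP_DATABASE_URL`, `APP_LOG_LEVEL`) and the `Settings` class are hypothetical, chosen only for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    database_url: str
    log_level: str


def load_settings(env=None):
    """Build settings from deployment parameters (variable names are hypothetical)."""
    env = os.environ if env is None else env
    try:
        database_url = env["APP_DATABASE_URL"]
    except KeyError:
        # Fail fast at startup instead of baking a per-environment default into the build.
        raise RuntimeError("APP_DATABASE_URL must be set at deployment time")
    return Settings(
        database_url=database_url,
        log_level=env.get("APP_LOG_LEVEL", "INFO"),
    )
```

Nothing in the package knows whether it is running in Dev or Prod; only the values injected by the deployment differ.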
Rule #2: Don’t compile code for a specific environment
Even if you deploy the same package, server drift is another common issue. Over time, no matter how great and dedicated your OPS team is, the Dev servers will not be exactly like the Prod servers. They are patched at different times, maybe by different people, the infrastructure configuration is different, etc. These slight differences can affect the stability and behavior of the solution in sometimes inexplicable ways.
The best tool to combat server drift is to “Dockerize” your deployment. Building Docker images (or any equivalent solution) is one of the best ways to include libraries and dependencies in your deployment. It no longer matters if the Dev server has a different Java version than the Prod server: the Java version you need is packaged inside your container image, along with your code. This truly makes for a portable deployment, and to that end, the servers matter a lot less. You should never rely on a named server (if the server name starts with ‘prd_’ then …). Virtual servers are dynamic and can be rebuilt or scaled as needed as part of an Auto Scaling Group (ASG). Microservices are dynamic; deployment parameters are the glue that holds the environment together.
Rule #3: Servers are cattle, not pets, you don’t get to name them
I once worked with a developer who put an “apt-get update” near the beginning of his Docker Compose file. While I applaud his effort from a cybersecurity perspective, this act terrified me from a DevOps perspective, because it goes against the core principle of “build once, deploy everywhere”. When packages are refreshed at deployment time, I can guarantee you that the exact same Docker image, built once and stored in a Docker registry, will not have the same libraries when it is deployed in Dev today and in Production next month. While it may seem far-fetched, predictability and repeatability are two of the cornerstones of Operational Readiness. It doesn’t matter who compiled the code that day and who deployed it: the outcome must be the same. If you can, automate your builds, tests, and deployments as part of a Continuous Integration and Continuous Deployment (CI/CD) pipeline.
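One way to make “build once, deploy everywhere” checkable is to record a digest of the artifact when it is built and refuse to deploy anything that does not match. The sketch below assumes the artifact is available as bytes and that the build step stored its SHA-256 digest somewhere the deployment can read; the function names are hypothetical.

```python
import hashlib


def sha256_digest(data: bytes) -> str:
    """Return the hex SHA-256 digest of an artifact's bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_artifact(data: bytes, expected_digest: str) -> None:
    """Raise if the artifact differs from the one recorded at build time."""
    actual = sha256_digest(data)
    if actual != expected_digest:
        raise ValueError(
            f"artifact digest mismatch: expected {expected_digest}, got {actual}"
        )
```

If the check passes in Dev and in Prod, both environments received byte-for-byte the same package, no matter who deployed it or when.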
Rule #4: Strive for predictability and repeatability in your build and deployment process
Don’t blame Murphy
“Murphy’s law” states that “Anything that can go wrong will go wrong and at the worst possible time.” This adage, which can be applied to many different things, definitely applies to running IT solutions in a production environment.
“Hope is not a strategy”
Unofficial Google SRE motto
Expect that things will break and that mistakes will be made. Before calling your solution “production”, the stakeholders, as well as the Dev and Ops (or DevOps) teams, should hold a formal “Operational Readiness Review” of the solution. Several checklists are available on the internet, and AWS also made its checklist available to customers. In a nutshell, teams should make decisions about the high availability of the proposed solution, as well as disaster recovery options. There are no “one size fits all” answers to these questions: how important that solution is to your organization and what downtime costs are somewhat subjective and specific to your culture. Servers go down, entire data centers can get flooded, entire regions can be wiped out by a nuclear strike, and entire continents can go offline after a meteor strike… You need to find the balance between your risk tolerance and the cost and complexity of the workaround, knowing that added complexity comes at the cost of added risk. The answer is not easy to obtain, and this may need to be a recurring discussion, but the conversation is well worth having.
High availability is the art of designing solutions that will instantly fail over or scale in the face of adversity; disaster recovery is the art of implementing backups and being able to restore the solution in some way within an acceptable amount of time. The two are not mutually exclusive. A highly available solution will not protect you from a malicious or accidental database delete: the change would simply replicate to all the nodes, in which case a restore from backup or a lag site would be needed. How much data can you afford to lose between backups? How much downtime would affect your brand and send customers to the competition?
Having a stable and well-defined deployment pipeline, as well as a well-documented solution (see Create Diagrams from Text) and a well-staffed support structure are all extremely important factors in this equation.
Functional and regression testing should always be part of your deployment process, and load testing and endurance testing should also be present. I have seen virtual machines deployed from the exact same template behave wildly differently from the rest. This is unfortunately not an exact science, so implement proper health checks before allowing any new system to accept real workload. Test early, test often, but generate meaningful alerts. Beware of the cascading effects of downstream dependencies: should your team be alerted if your service faults because of an outage caused by another team?
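A health-check gate can be as simple as the sketch below: a new instance only receives traffic after its check has passed several times in a row. The `check` callable, the thresholds, and the function name are assumptions for illustration; in practice the check would typically call whatever health endpoint your service exposes.

```python
import time


def wait_until_healthy(check, attempts=5, delay_seconds=2.0,
                       required_successes=3, sleep=time.sleep):
    """Return True once `check` passes `required_successes` times in a row."""
    streak = 0
    for attempt in range(attempts):
        try:
            ok = bool(check())
        except Exception:
            # A check that raises counts as a failed check.
            ok = False
        streak = streak + 1 if ok else 0
        if streak >= required_successes:
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

Requiring consecutive successes, rather than a single one, helps filter out an instance that answers once and then falls over.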
I highly recommend that you think about implementing two different levels of monitoring and alerting, so that you don’t overwhelm your on-call people. Monitoring each instance of each component is useful, but especially with distributed systems where multiple instances of a component are running, losing one instance may not be an event worth waking someone in the middle of the night for; a nightly report that can be addressed in the morning may suffice. However, if the end-to-end test is failing, the test that closely mimics the actions a random user would routinely take, then alerting is appropriate.
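The two levels described above can be sketched as a small routing function. The `Action` names and the inputs are hypothetical; the point is only that the end-to-end result decides whether someone is paged, while instance-level losses go to a report.

```python
from enum import Enum


class Action(Enum):
    PAGE = "page on-call"
    REPORT = "nightly report"
    NONE = "no action"


def route_alert(end_to_end_ok: bool, healthy_instances: int,
                total_instances: int) -> Action:
    """Decide how loudly to alert, based on user impact first."""
    if not end_to_end_ok:
        # Users are affected: wake someone up.
        return Action.PAGE
    if healthy_instances < total_instances:
        # Degraded but still serving: handle it in the morning.
        return Action.REPORT
    return Action.NONE
```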
Rule #5: Expect the Unexpected
“Simplicity is the ultimate sophistication.”
— Leonardo da Vinci.
You have decided how highly available you want your solution to be, scaling across multiple availability zones and data centers, with regional load balancing and read-only database replicas with caching. Things will still go wrong.
You also need to implement metrics and monitoring to get visibility into your solution and its behavior and performance. Monitor key components, and also the end-to-end user experience. Get daily reports of how many 200, 400, and 500 return codes your service is returning, and investigate any anomaly.
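A daily report of return codes can start very small. The sketch below assumes you can extract a list of integer HTTP status codes from your logs; the baseline rate and tolerance are hypothetical values you would tune from your own history.

```python
from collections import Counter


def summarize_status_codes(codes):
    """Count responses per class, e.g. '2xx', '4xx', '5xx'."""
    return Counter(f"{code // 100}xx" for code in codes)


def server_error_rate(summary):
    """Share of responses in the 5xx class (0.0 when there is no traffic)."""
    total = sum(summary.values())
    return summary["5xx"] / total if total else 0.0


def is_anomalous(today_rate, baseline_rate, tolerance=0.01):
    """Flag a day whose error rate exceeds the baseline by more than `tolerance`."""
    return today_rate > baseline_rate + tolerance
```

For example, `summarize_status_codes([200, 200, 404, 500])` counts two `2xx`, one `4xx`, and one `5xx` response, which gives a server error rate of 0.25.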
I once deployed a new version of a processing service that instantly started throwing errors. We backed out the new version and investigated the error logs. Two of the processing containers were issuing duplicate unique transaction IDs, which was confusing another downstream dependent service. How was this possible? The developer had used an algorithm to issue unique IDs that relied on the container’s deployment time…and two containers somehow got deployed in the exact same millisecond. The solution was easy: use the container ID as well as the deployment timestamp to issue unique IDs. But what are the odds of this happening?
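The fix can be sketched as follows. This is not the original service’s code: the class name, the ID format, and the per-container counter are assumptions made for illustration. What matters is that the container ID is part of every ID, so two containers deployed in the same millisecond can no longer collide.

```python
import itertools


class TransactionIdGenerator:
    """Issue IDs that are unique across containers (illustrative format)."""

    def __init__(self, container_id: str, deployed_at_ms: int):
        # The container ID disambiguates containers that share a deployment timestamp.
        self._prefix = f"{container_id}-{deployed_at_ms}"
        self._counter = itertools.count(1)

    def next_id(self) -> str:
        return f"{self._prefix}-{next(self._counter)}"
```

Where the ID format is not constrained, a random identifier such as `uuid.uuid4()` from the Python standard library is another common option.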
When designing microservices, you need to design for failure. Always expect that the services you depend on can fail. Even worse than failure, services can return an invalid or confusing response only part of the time; maybe only one instance out of 100 is giving bad answers. Expect that things will fail and implement retries, and please implement exponential backoff: I have seen too many issues with fast retries overwhelming an already struggling service. Retry for a long time, because outages on dependent services can potentially last hours if not days. You also need to make decisions: under stress, is it better to serve all requests poorly, or to serve a limited number of requests properly and serve other users an error message?
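A minimal sketch of retries with exponential backoff is shown below. The function name, default delays, and the added random jitter are assumptions rather than anything prescribed above; jitter is included because it keeps many clients from retrying in lockstep against a struggling service.

```python
import random
import time


def call_with_backoff(operation, max_attempts=6, base_delay=0.5,
                      max_delay=60.0, sleep=time.sleep, rng=random.random):
    """Call `operation`, retrying on failure with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                # Out of attempts: surface the last error to the caller.
                raise
            # The delay ceiling doubles on every attempt, up to max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Sleep a random fraction of the ceiling so clients spread out.
            sleep(rng() * ceiling)
```

In a real service you would likely catch only the exceptions that are worth retrying (timeouts, connection errors) and raise `max_attempts` and `max_delay` enough to ride out a long outage.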
Rule #6: Build for failure
Things will break, removing a bottleneck in your infrastructure will only reveal the next one, and your team will learn and improve. But this can only happen if your culture allows for that failure to happen and become a learning opportunity. You need to build a blameless culture, where people can hit a wall and overcome that obstacle, learn and grow. If every mistake, preventable or not, is seen as an individual failure, then a culture of “this never happened” will take place, and suspicion will overtake all. Be open about your failures and mistakes, be vulnerable and admit your shortcomings, and you will all grow together as an organization. Transparency is key.
Of course, each and every outage must be investigated, and a proper Root Cause Analysis (RCA) must be conducted, but unless an obvious and deliberate error was made by a human, the focus should never be on “who did this?” but instead on “how did the process let a human make this mistake?”. The real goal of the RCA is to make sure that the same or a similar mistake cannot be made in the future; it is part of the continuous improvement process. And of course, if the outage was caused by a software or infrastructure bug, feedback to the Development team is a must.
Rule #7: Build a blameless culture
If you don’t make mistakes, you’re not working on hard enough problems. And that’s a big mistake.
Frank Wilczek
Knowing that things will break, you should also embrace the unexpected. Netflix famously pioneered chaos engineering, a method of testing distributed software that deliberately introduces failure and faulty scenarios to verify its resilience in the face of random disruptions. Not everyone can “run the apes” in production the way Netflix does. The impact for them may be serving a few frames of a movie at a lower resolution, whereas the impact on a real-time financial system could be devastating, so please adapt this to your own environment.
While it doesn’t specifically mention IT systems, the book Antifragile: Things That Gain from Disorder, by Nassim Nicholas Taleb, is worth reading and can most definitely apply to building resilient microservices.
Rule #8: Embrace the Unexpected
“Antifragility is beyond resilience or robustness. The resilient resists shocks and stays the same; the antifragile gets better.”
Nassim Nicholas Taleb
Mitigating fragility is not an option but a requirement. It may sound obvious but the point seems to be missed. For fragility is very punishing, like a terminal disease. A package doesn’t break under adverse conditions, then manages to fix itself when proper conditions are restored. Fragility has a ratchet-like property, the irreversibility of damage.
Nassim Nicholas Taleb