The DevOps Dojo

https://www.dojo.fm/feed.xml

3.3K Downloads

The DevOps Dojo is an educational podcast focused on DevOps and making the world of building software a little better. Each episode covers a principle, practice or common DevOps fable. Join the Dojo to expand your software development horizons!

Episodes

Sep 25, 2020

Digital Transformation with the Three Ways of DevOps

7 min

The Three Ways of DevOps come from The Phoenix Project, a famous book in DevOps circles.

This episode covers how to use the three ways to progress in your digital transformation initiatives.

Sources:

https://www.businessinsider.com/how-changing-one-habit-quintupled-alcoas-income-2014-4?r=US&IR=T

https://www.amazon.com/Phoenix-Project-DevOps-Helping-Business/dp/0988262592

https://www.amazon.com/DevOps-Handbook-World-Class-Reliability-Organizations-ebook/dp/B01M9ASFQ3/ref=sr_1_1?crid=316RJMM06NH59&dchild=1&keywords=the+devops+handbook&qid=1600774333&s=books&sprefix=The+devops+h%2Cstripbooks-intl-ship%2C235&sr=1-1

Transcript:

My first introduction to the principles behind DevOps came from reading The Phoenix Project by Gene Kim, Kevin Behr and George Spafford. In this seminal book, which blew my mind, we follow Bill as he transforms Parts Unlimited by salvaging the Phoenix Project, an IT project that went so wrong it could almost have been a project in the public sector. Through Bill's journey to DevOps, we discover and experience the Three Ways of DevOps. In this episode, I cover the three ways of DevOps and how they can be applied in a transformation. This is the DevOps Dojo #6, I am Johan Abildskov, join me in the dojo to learn.

In the DevOps world, few books have had the impact of The Phoenix Project. If you have not read it yet, it has my whole-hearted recommendation. It is tragically comic in its recognizability and frustratingly true. In it, we experience the three ways of DevOps. The three ways of DevOps are the principles of flow, the principles of feedback and the principles of continuous learning. While these areas support each other and have some overlap, we can also use them as a rough roadmap towards DevOps capabilities.

The First Way of Flow addresses our ability to execute. The Second Way of Feedback concerns our ability to build quality in and notice defects early. The Third Way of Continuous Learning focuses on pushing our organizations to ever higher peaks through experimentation.

The first way of DevOps is called the principles of flow.

The foundational realization of the first way is that we need to consider the full flow from ideation until we provide value to the customer. This also clashes with the chronic conflict that DevOps addresses: siloed Dev and Ops teams. It doesn't matter whether you feel like you did your part or not, as long as we, the collective, are not providing value to end-users. If you feel you are waiting a lot, try to pick up the adjacent skills so you can help where needed.

We also focus on not passing defects on and automating the delivery mechanisms such that we have a quick delivery pipeline.

Using Kanban boards or similar to visualize how work flows through our organization can help make the intangible work we do visible.

A small action with high leverage is WIP limits. Simply limiting the number of concurrent tasks that can move through the system at any point in time can have a massive impact.
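
As a minimal sketch of what a WIP limit means in practice, the Python below models a single board column that refuses new work once its limit is reached. The class name Column, the method names and the limit of 2 are illustrative assumptions, not taken from any particular Kanban tool.

```python
from dataclasses import dataclass, field
from typing import List


class WipLimitReached(Exception):
    """Raised when a column already holds as many tasks as its limit allows."""


@dataclass
class Column:
    # Hypothetical model of one column on a Kanban board.
    name: str
    wip_limit: int
    tasks: List[str] = field(default_factory=list)

    def pull(self, task: str) -> None:
        # Refuse new work instead of silently piling it up.
        if len(self.tasks) >= self.wip_limit:
            raise WipLimitReached(
                f"{self.name} is full ({self.wip_limit}); finish something first"
            )
        self.tasks.append(task)

    def finish(self, task: str) -> None:
        # Finishing a task frees capacity for the next one.
        self.tasks.remove(task)


in_progress = Column("In progress", wip_limit=2)
in_progress.pull("Automate deployment")
in_progress.pull("Fix flaky test")
try:
    in_progress.pull("Start new feature")
except WipLimitReached as error:
    print(error)
```

The point of the sketch is that the third task is rejected until one of the first two is finished, which is what pushes a team to complete work before starting more.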

Another valuable exercise to do is a Value Stream Map where you look at the flow from aha-moment to ka-ching moment. This can be a learning situation for all involved members as well as the organization around them.

Having looked at the full end-to-end flow and optimized it, we can move on to the second way of DevOps.

The second way of DevOps is the principles of feedback.

The first way of DevOps enables us to act on information, so the second way focuses on generating that information through feedback loops, and shortening those feedback loops to be able to act on learning while it is cheapest and has the highest impact.

Activities in the Second Way can be shifting left on security by adding vulnerability scans in our pipelines. It can be decomposing our test suites such that we get the most valuable feedback as soon as possible.

We can also invite QA, InfoSec and other specialist competences into our cycles early to help architect for requirements, making manual approvals and reviews less likely to reject a change.

Design systems are a powerful way to shift left as we can provide development teams with template projects, pipelines and deployments that adhere to desired guidelines. This enables autonomous teams to be compliant by default.

The second way is also about embedding knowledge where it is needed. This is a special case of shortening feedback loops. It can be subject matter expertise embedded in full-stack teams, but it can also be transparency into downstream processes, so that teams can better predict the outcomes of review and compliance activities.

A fantastic way of shifting left on code reviews and improving knowledge sharing in the team is Mob Programming: solving problems together as a team on a single computer. We can even invite people from outside the team to our sessions to help knowledge sharing and to draw on architects or other key knowledge banks.

Now that we have focused on our ability to create flow and feedback we can move on to the third and final way of DevOps. The principles of continuous learning.

The first and second way of DevOps provide most of the technical capabilities for continuous learning and experimentation - so the hard work in the third way of DevOps is primarily cultural, which makes it that much more difficult to do.

A small step could be to start talking about hypotheses that we want to test rather than tasks we want to do. We have a tendency to state things as fact and put them into our backlogs. This creates an unfortunate mental model and a Taylorist command-and-control culture. Language shapes our thoughts, so let's start phrasing our backlog items as hypotheses. Rather than saying "Make Button A Blue", say "We believe making Button A Blue will increase clickthrough rate by 10%."

While the previous step can be useful, the big theme in the third way is psychological safety. Making it safe to learn and experiment must be a priority if we want to have a healthy culture. We must also make diversity a focus area, especially because the tech business has a notoriously toxic culture.
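
To make the difference concrete, here is a small sketch of a backlog item phrased as a hypothesis that can later be checked against a measurement. The field names are assumptions and the clickthrough numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    # Hypothetical structure for a backlog item; all names are illustrative.
    change: str
    metric: str
    expected_relative_increase: float  # 0.10 means "by 10%"

    def statement(self) -> str:
        percent = round(self.expected_relative_increase * 100)
        return (
            f"We believe {self.change} will increase "
            f"{self.metric} by {percent}%."
        )

    def is_supported(self, baseline: float, observed: float) -> bool:
        # Supported when the observed value reached the predicted level.
        return observed >= baseline * (1 + self.expected_relative_increase)


item = Hypothesis("making Button A blue", "clickthrough rate", 0.10)
print(item.statement())
# Made-up measurements: 4.0% before the change, 4.5% after.
print(item.is_supported(baseline=0.040, observed=0.045))
```

Unlike a plain task, an item written this way carries its own success criterion, so the team can learn something whether the prediction holds or not.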

We can measure Westrum Culture as described in a previous episode, and seek to address any shortcoming.

Learning, Diversity and Psychological safety must come from a leadership level exemplifying the virtues that the members of the organization must live. Otherwise, there will be no resilience and any benefits will be temporary. The impressive transformation of Alcoa embodies this perfectly.

Another simple but difficult practice is to drive down the size of the work items you are working on. This will make it easier to create small self-contained experiments. This will of course put stress on your software and organizational architecture.

If you want to finish with a concrete technical practice, look into Chaos Engineering as described in a previous episode. Chaos Engineering will help build resilience into your organization and is a structured approach to creating more learning. As such, it can provide a safer sandbox in which to practice learning and experimentation. This can be beneficial if the organization is quite far from psychologically safe.

The three ways of DevOps: Flow, Feedback and Learning are a meaningful definition of DevOps, and they even hint at a roadmap for DevOps transformations. Use the three ways and the activities I have described here as inspiration to kickstart or accelerate your DevOps transformation!

This has been the DevOpsDojo. You can follow me on twitter @randomsort. If you have any questions, feedback or just want to reach out and suggest a topic, do not hesitate. You can find show notes with transcripts, links and more at dojo.fm. Support the show by leaving a review, sharing this episode with a friend or colleague or subscribing to the DevOpsDojo on your favourite podcast platform. Thank you for listening, keep learning.

Jun 14, 2020

Site Reliability Engineering

7 min

Site Reliability Engineering, or SRE, is the next big thing after DevOps. In this episode I cover what SRE is, some of its principles and some of the challenges along the way.

Sources

https://www.oreilly.com/content/site-reliability-engineering-sre-a-simple-overview/

https://landing.google.com/sre/books/

Transcript

Site Reliability Engineering - Is it just traditional siloed Ops in disguise, or is it a functioning DevOps organizational structure? SRE comes from the trenches of Google production environments, so many organizations look to SRE after having established DevOps teams. Site Reliability Engineering has some powerful concepts and tools, but it all comes at a price. If an organization has the scale and the willingness to invest and change its ways of working, SRE can help it go planet-scale. But there are many obstacles on the road to Site Reliability Engineering paradise. This is the DevOpsDojo #5, I am Johan Abildskov, join me in the dojo to learn.

Site Reliability Engineering is the fabled world of planet-scale operations with autonomous DevOps teams. This seems like an oxymoron, but it is a functional DevOps organizational structure.

Ben Treynor Sloss, who coined the term Site Reliability Engineering, states - "Site Reliability Engineering is what happens when you ask a software engineer to design an operations team". SRE is how Google is able to run their production environments at large scale, both in terms of how many applications and services they run and how many engineers are required to develop and maintain those products, and in terms of the global scale of their applications.

Site Reliability Engineering is about scaling services to world-class availability, but it is also about scaling the engineering organization such that teams can continue to be productive as the business grows.

For me the core tenets of Site Reliability Engineering are:

Minimizing Toil, Shared Ownership and the ability to say no.

Engineering is a key part of SRE, and toil is the opposite. Toil is the work that does not, in the long run, add value, or that does not require engineering. This means that if we spend too much time on toil, we will not be able to scale superlinearly with the number of engineers. Examples of toil are manual deployment procedures and following complex processes. Google claims to be able to have their teams work with thirty to fifty per cent toil, with fifty per cent being a hard limit. A key point here is that if a team exceeds this limit, it will either shed responsibilities or get resources added. This requires firm buy-in from management, and it will likely be impossible in a project-funded organization. So this way of thinking, eliminating toil and giving teams the space to do real engineering work, is the foundation of Site Reliability Engineering.

Shared Ownership is a weird construct when the point is to run the applications that someone else is building. But it is a very important pillar of SRE. This shared ownership comes from SRE teams providing valuable metrics from production environments to product teams. It also comes from a shared agreement that if the application itself is too unstable, developers will join the SREs' on-call rotation until the service has been restored to a quality the SREs can operate.

SREs also help product teams with production readiness reviews and checklists. All in all, SRE enables the product teams to run applications at planet-scale without having to maintain all the competencies that this necessarily requires inside the team. This again requires commitment from management in enforcing the requirements from the SRE teams, so we do not revert to throwing applications over the wall of confusion.

A powerful realisation is that we should consider downtime a resource, and spend it deliberately and respectfully. With this realisation comes the underlying acceptance that absolute 100% availability is not the right target. As we approach 100% uptime, it becomes more and more expensive to increase uptime, and less and less likely that the problem will be on our end, rather than at the client-side or at an intermediate stage. As such, 100% availability as the goal is not a good business decision.

What the right goal is, is a big topic in itself, and we will not cover that here.

So let us say that we have a business goal of 95%. We communicate this to our customers as our Service Level Agreement or SLA. SLAs typically also come with pre-agreed repercussions in case of violations.

SRE teams then have Service Level Objectives or SLOs. These are the internal metrics that the teams use to guide their decisions. These SLOs are somewhere between the SLA and the theoretical max. The reasoning is that if we have a more aggressive requirement internally, it is unlikely that we break the externally facing SLA. If we look over a period of three months, we then have some amount of downtime that we can spend doing different stuff. This is the difference between the SLO and the currently measured availability. If we, for instance, have an SLO of 99%, each month we can have 7.3 hours of downtime while still complying with the SLO. We call this the error budget. This means that each month we can spend 7.3 hours of downtime, and the question is then how do we get the most benefit from that?
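
The arithmetic behind that number can be sketched in a few lines of Python. The function names are assumptions, and the sketch assumes an average month of 730 hours (8,760 hours in a year divided by 12).

```python
HOURS_PER_MONTH = 8760 / 12  # average month: 730 hours


def error_budget_hours(slo: float, period_hours: float = HOURS_PER_MONTH) -> float:
    """Return the downtime a period allows under an availability SLO.

    `slo` is a fraction, so 0.99 means 99% availability.
    """
    return (1 - slo) * period_hours


def remaining_budget_hours(slo: float, downtime_hours: float,
                           period_hours: float = HOURS_PER_MONTH) -> float:
    # Budget that is still unspent after the downtime observed so far.
    return error_budget_hours(slo, period_hours) - downtime_hours


print(round(error_budget_hours(0.99), 1))           # 7.3
print(round(remaining_budget_hours(0.99, 2.0), 1))  # 5.3
```

With a 99% SLO and two hours of downtime already used this month, roughly 5.3 hours remain to spend on riskier releases, upgrades or recovery drills.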

Unfortunately, in most organizations this is spent accidentally, simply by having a flaky application, or by not having invested in redundancy and resiliency. This makes it hard to spend the error budget in healthy ways.

One way of spending the error budget is that we can allow a higher, and thus more risky rate of innovation. We can also do disaster recovery training, or upgrade our infrastructure. Google even proclaims that if they fail to spend their error budget they will take down their services, such that no one becomes dependent on them having a higher uptime than is promised.

Error budgets provide a framework of reasoning around availability that avoids the usual absolute statement that downtime is bad.

The last concrete SRE technique that I will mention in brief is the Blameless postmortem.

It is an incident analysis technique that focuses on exposing systemic errors, without trying to assign blame. Blameless postmortems strive to turn individual or team learning into organizational learning.

Postmortems can be presented company-wide and celebrated, turning the messengers of bad news into the heroes of our organizations.

Site Reliability Engineering allows ambitious companies to scale their engineering organizations and their service portfolio. But it requires buy-in and active support to create the space that allows the SRE organization to do its work as intended, which is primarily engineering. Unfortunately, without this company-wide belief and commitment to SRE, it can quickly become a modernization theatre without any real benefit.

This has been the DevOpsDojo on Site Reliability Engineering. You can follow me on twitter @randomsort. If you have any questions, feedback or just want to reach out and suggest a topic, do not hesitate. You can find show notes with transcripts, links and more at dojo.fm. Support the show by leaving a review, sharing this episode with a friend or colleague or subscribing to the DevOpsDojo on your favourite podcast platform. Thank you for listening, keep learning.

Jun 7, 2020

Containers

7 min

Containers are all the rage, and they contribute to all sorts of positive outcomes. In this episode, I cover the basics of containerization.

Sources

Containers will not fix your broken Culture

docker.io

Transcript

Containers - If one single technology could represent the union of Dev and Ops it would be containers. In 1995, Sun Microsystems told us that using Java we could write once and run anywhere. Containers are the modern, and arguably in this respect more successful, way to go about this portability. Brought to the mainstream by Docker, containers promise us the blessed land of immutability, portability and ease of use. Containers can serve as a breaker of silos or the handoff mechanism between traditional Dev and Ops. This is the DevOps Dojo Episode #4, I’m Johan Abildskov, join me in the dojo to learn.

As with anything, containers came to solve problems in software development. The problems containers solve are around the deployment and operability of applications or services in traditional siloed Dev and Ops organizations.

On the Development side of things, deployment was and is most commonly postponed to the final stages of a project. Software is perhaps only run on the developer’s own computer. This can lead to all sorts of problems. The architecture might not be compatible with the environments that we deploy the software into. We might not have covered security and operability issues, because we are still working in a sandbox environment. We have not gotten feedback from those who operate applications on how we can enable monitoring and lifecycle management of our applications. And thus, we might have created a lot of value, but we are completely unable to deliver it.

On the Operations side of things, we struggle with things such as implicit dependencies. The application runs perfectly fine on staging servers, or on the developer PC, but when we receive it, it is broken. This could be because the versions of the operating system do not match, there are different versions of tooling, or even something as simple as an environment variable or file being present.

Different applications can also have different dependencies on operating systems and libraries. This makes it difficult to utilize hardware in a cost-efficient way.

Operations commonly serve many teams, and there might be many different frameworks, languages, and delivery mechanisms. Some teams might come with a jar file and no instructions, while others bring thousands of lines of bash.

In both camps, there can be problems with testing happening on something other than the thing we end up deploying.

Containers can remedy most of these pains. As with physical containers, it does not matter what we stick into them, we will still be able to stack them high and ship them across the oceans.

In the case of Docker, we create a so-called Dockerfile that describes what goes into our container. This typically starts at the operating system level or from some framework like Node.js. Then we can add additional configurations and dependencies, install our application and define how it is run and what it exposes. This means that we can update our infrastructure and applications independently. It also means that we can update our applications independently from each other. If we want to move to a new PHP version, it doesn’t have to be everyone at the same time, but rather product by product, fitting it into their respective timelines. This can of course lead to a diverse landscape of diverging versions, which is not a good thing. With great power comes great responsibility.
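
To illustrate the parts a Dockerfile typically contains, the sketch below assembles one as text in Python. The base image tag, file names, port and start command are assumptions chosen for a small Node.js service, not taken from a real project.

```python
import json
from typing import List


def render_dockerfile(base_image: str, port: int, start_command: List[str]) -> str:
    """Assemble the text of a simple Dockerfile from its typical parts."""
    lines = [
        f"FROM {base_image}",                # start from an OS or framework image
        "WORKDIR /app",                      # working directory inside the image
        "COPY package.json ./",              # add the dependency definitions
        "RUN npm install",                   # install the dependencies
        "COPY . .",                          # add the application itself
        f"EXPOSE {port}",                    # what the container exposes
        f"CMD {json.dumps(start_command)}",  # how the application is run
    ]
    return "\n".join(lines) + "\n"


# Hypothetical values for illustration only.
print(render_dockerfile("node:20", 3000, ["node", "server.js"]))
```

Each line of the generated text maps onto one of the steps described above: a starting point, dependencies, the application, what it exposes and how it runs.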

The Dockerfile can be treated like source code and versioned together with our application source.

The Dockerfile is then compiled into a container image that can be run locally or distributed for deployment. This image can be shared through private or public registries.

Because many people and organizations create and publish these container images, it has become easy to try out tooling. We can run a single command, and then we have a configured Jenkins, Jira or whatever instance running, that we can throw away when we are done with it. This leads to faster and safer experimentation.

The beautiful thing is that this container image then becomes our build artifact, and we can test this image extensively and deploy it to different environments to poke and prod it. And it is the same artifact that we test and deploy. The container that we run can be traced to an image, which can be traced to a Dockerfile from a specific Git SHA. That is a lot of traceability.

Because we now have pushed some of the deployment responsibility to the developers, we have an easier time architecting for deployment. Our local environments look more like production environments, which should remove some surprises from the process of releasing our software, leading to better outcomes and happier employees.

Some of you might think: isn’t this just virtual machines? And it kind of is, intuitively at least. But containers are implemented to borrow more directly from the host operating system, which leads to lower startup times and smaller images.

We can create and share so-called base images. These are images that can be seen as a template or runtime for specific types of applications. This can help reduce the lead time from project start to something that can be deployed in production to almost zero, as the packaging and deployment have been taken care of.

But as Bridget Kromhout said, “Containers will not fix your broken culture”. Containers are not the silver bullet that they are sometimes touted as.

When we move into a container universe, perhaps even moving towards container orchestration with Kubernetes, we tend to neglect or forget about the Ops capabilities and problems we still need to solve. Backups and failovers. Patching of OS and libraries. Performance monitoring and alerting. There are many things that might become implicit, and that can lead to risky business decisions. While Docker may lead us as developers to be able to somewhat better maintain and run our application in production, I want to make it very clear: Docker is not a replacement for Ops.

Using containers is an enabler for many things, and will also create tension against a bureaucratic organization, because of its ease of use. It will be mind-blowing for some, and will require mindset shifts in both Dev and Ops. It also paves the way for more lifecycle management later on, with for instance Kubernetes. To reap the full benefits of containers, we have to architect our applications for them using principles such as the twelve-factor app.

This will again introduce tension and help us build better applications.

So while containers will not fix your broken culture, if you are not already thinking about containerization, you probably should be.

This has been the DevOpsDojo on Containers. You can follow me on twitter @randomsort. You can find show notes and more at dojo.fm. Support the show by leaving a review, sharing this episode with a friend or colleague or subscribing to the DevOpsDojo on your favourite podcast platform. Thank you for listening, keep learning.

Jun 1, 2020

Measuring your Culture with the Westrum Typology of Organizational Culture

8 min

The Westrum model has been shown to predict software delivery performance. It also helps us get a quantifiable handle on the intangible concept of culture.

With concrete focus points, it is a fantastic way to start improving your culture. This episode covers the Westrum Model.

Sources

https://cloud.google.com/solutions/devops/devops-culture-westrum-organizational-culture

https://inthecloud.withgoogle.com/state-of-devops-18/dl-cd.html

https://qualitysafety.bmj.com/content/13/suppl_2/ii22

https://www.amazon.com/Accelerate-Software-Performing-Technology-Organizations/dp/1942788339

Transcript

Culture - We know it is the foundation upon which we build high performing teams. Yet, it is a difficult topic to address. We struggle to quantify culture, and what does good culture mean?

How can we approach improving our culture without resorting to jamborees and dancing around the bonfires? Team building can feel very disconnected from our everyday lives.

The Westrum Typology of Organizational cultures is a model that helps us quantify our culture with focus on information flow in the organization. It has even been shown to drive software delivery performance. The Westrum model gives us an actionable approach to good culture. I’m Johan Abildskov, join me in the dojo to learn.

In any conversation about transformations, whether digital, agile or DevOps, you can be certain that before much time has passed, someone clever will state “Culture eats strategy for breakfast”. This quote is attributed to Peter Drucker and implies that no matter how much effort we put into getting the strategy perfect, execution will fail if we do not also improve the culture. Culture is to organizations what personality is to people. We can make all the New Year’s resolutions, fancy diets and exercise plans, but if we do not change our habits, our patterns, or personality, even the best-laid plans will fail. While a human lapse in strategy might involve eating an extra cupcake and result in weight gain we had not planned for, an organizational lapse in culture might be accidentally scolding someone for bringing bad news to light, which results in an organization where problems and challenges are hidden.

So it is important that we focus on establishing a healthy culture on top of which we can execute our clever strategies. Our culture is the behaviour that defines how we react as an organization.

Ron Westrum is an American sociologist who has researched the influence of culture on, for instance, patient outcomes in the health sector. He has built a model of organizational culture based on information flow through the organization. Being at the good end of this scale has been shown by the DevOps Research and Assessment team to be predictive of software delivery performance and organizational performance.

There are three categories of organizations in the Westrum model: pathological or power-oriented, bureaucratic or rule-oriented, and generative or performance-oriented.

In pathological organizations, information is wielded as a weapon, used to fortify one’s position, or withheld as leverage to be injected at the right moment to sabotage others or cover one’s own mistakes. Cooperation is discouraged as that can bring instability into the power balance, and the only accountability that is present is scapegoating and the blame game. Obviously this is a toxic environment, and the lowest-performing organization type.

In bureaucratic organizations, the overarching theme is that it doesn’t matter if we did something wrong or in a bad way, as long as we do it by the book. Responsibilities are accepted, but the priority is not sensemaking; the priority is that no one can claim we did something wrong. Bad news is typically ignored, by the logic that the process is right, and the process is working.

Generative organizations focus on outcome or performance. It doesn’t matter who gets credit as long as the organization wins. Failures are treated as learning opportunities and might even be sought after in order to increase organizational learning. This increases transparency and allows local solutions to be exploited globally.

So from bad to good the three organizational types are Pathological, Bureaucratic and Generative.

So far I haven’t said anything about the specific characteristics or properties of these types, nor given you any hints as to how you can measure or improve your culture. The Westrum model is measured on six different properties. This is done by asking the members of the organization to rate, on a scale from 1 to 7, how much they agree with each of the Westrum properties. We can then average the answers and use the result to place ourselves in the Westrum model. I have built a free tool you can use for this. You can find a link in the show notes at dojo.fm.
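
The averaging step could look roughly like the sketch below. The property labels follow the six categories named in this episode; the response data is invented, and the function names are assumptions rather than part of the tool mentioned above.

```python
from statistics import mean
from typing import Dict, List

PROPERTIES = [
    "cooperation",
    "training the messengers",
    "sharing the risks",
    "encouraging bridging",
    "learning from failures",
    "implementing novelty",
]


def average_scores(responses: List[Dict[str, int]]) -> Dict[str, float]:
    """Average each property over all respondents (answers range from 1 to 7)."""
    for response in responses:
        for prop in PROPERTIES:
            if not 1 <= response[prop] <= 7:
                raise ValueError(f"{prop}: answer must be between 1 and 7")
    return {prop: mean(r[prop] for r in responses) for prop in PROPERTIES}


def weakest_property(averages: Dict[str, float]) -> str:
    # The lowest average is a natural candidate for where to improve first.
    return min(averages, key=averages.get)


# Invented answers from three team members.
responses = [
    dict(zip(PROPERTIES, [6, 3, 5, 4, 5, 6])),
    dict(zip(PROPERTIES, [5, 2, 5, 4, 6, 5])),
    dict(zip(PROPERTIES, [7, 4, 4, 5, 5, 6])),
]
averages = average_scores(responses)
print(weakest_property(averages))
```

Repeating the same survey at intervals and comparing the averages gives a simple way to see whether a property is improving or relapsing.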

First, Westrum talks about cooperation on the team and across teams. We create cross-functional teams from each of the areas that take part in our software delivery, making sure that goals and incentives are aligned.

Second, we train the messengers that bring bad news. We should celebrate bad news, as it represents a huge learning opportunity and takes a lot of vulnerability on behalf of the messenger. We can use techniques such as blameless post-mortems to create organizational learning from incidents.

Third, we share the risk across stakeholders. We hold developers accountable for availability, and Ops for the speed with which features can be delivered. We set up the technological scaffolding that can enable this, such as providing development teams with metrics and insights from production environments.

Fourth, we encourage bridging. Or to use the original DevOps phrase, we break down silos. Common silos are business, Information Security, Quality Assurance, Development and Operations. Reach out beyond the boundaries and find shared goals and motivations. Figure out how you can make each other successful. The good news is that the most effective solution is to talk together. The bad news is that we as an industry tend to be very bad at sitting down and talking to each other. Management has a huge influence on this characteristic, as misaligned goals and incentives will destroy initiatives in this area.

Fifth is our ability to learn from failures. Modern distributed systems are typically so complex that it is unreasonable not to think of your services as always being degraded in some way or another. This is in stark contrast to the very risk-averse enterprise organization with zero tolerance for failures. When we learn from our failures, we might even inject them in our environments to maximize our learning. It doesn’t matter how technologically savvy we are, if our competition is able to out-learn us.

Sixth and last is how we approach novelties when they are presented to us. If we are able to build a culture where employees are able to experiment, in a safe way, we will be able to innovate and improve continuously. This means that we have to challenge the misconception that developer efficiency comes from high utilization, and that we need to have a high level of control. If we can build autonomous teams that can experiment with their process, we will end up in a good place.

I have gained real respect for the Westrum model, as it takes something as intangible as culture and makes it concrete and measurable. If we measure this once in a while and address the categories where we are not doing so well, or where we are relapsing to unhealthy behaviour, we end up with something as rare as an actionable, concrete strategy for improving our culture. That is powerful.

So in summary, there are six categories of the Westrum model. High Cooperation, Training the messengers, sharing the risks, encouraging bridging, learning from failures and implementing novelty.

It can be easily measured, has been shown to drive software delivery performance, and has concrete focus areas that can guide you towards a better culture.

So what is holding you back?

This has been the DevOpsDojo on how to measure your culture with Westrum Typology of Organizational Culture. You can follow me on twitter @randomsort. If you have any questions, feedback or just want to reach out and suggest a topic, do not hesitate. You can find show notes with transcripts, links and more at dojo.fm. Support the show by leaving a review, sharing this episode with a friend or colleague or subscribing to the DevOpsDojo on your favourite podcast platform. Thank you for listening, keep learning.

May 25, 2020

Chaos Engineering

6 min

Chaos Engineering is a set of techniques to build a more proactive and resilient production environment and organization.

In this Episode of the DevOps Dojo, I introduce you to the basics of Chaos Engineering.

Sources

https://principlesofchaos.org/

https://github.com/Netflix/chaosmonkey

https://chaostoolkit.org/

Transcript

Chaos Engineering. The art and practice of randomly introducing failures into our production environment. It seems so counterintuitive that we intentionally break our infrastructure, but Chaos Engineering is a structured and responsible way of approaching exactly this.

Modern distributed systems are so complex that it is difficult to say anything about their behaviour in normal circumstances. This means it is nigh impossible to predict how they will behave when pushed to their limits in a hostile environment. Chaos Engineering allows us to probe our running systems, in order to build confidence in their performance and resilience. I am Johan Abildskov. Join me in the dojo to learn.

The first time I heard about Chaos Engineering was when I learned about the tool Chaos Monkey. Chaos Monkey is a tool created by Netflix that they run in production, randomly restarting systems and servers. My first reaction was awe. How was this a sane thing to do? It just sounds so wrong. However, I have come to see that this is the only way that you can keep improving your systems in a structured way. Rather than having the unforeseen happen because of user behaviour, you inject tension in the system to discover weaknesses before they cause user-facing problems.

This is the natural continuation of the lean practice of artificially introducing tension in your production lines to continuously optimize productivity.

There are two types of companies. There are those that are only able to react, whether it is to the market or to the competition. And there are those companies that are able to disrupt themselves and be proactive in the market.

My money is on the success of the proactive organizations.

But back to Chaos Engineering. To me, the simplest way of explaining Chaos Engineering is that it is applying the scientific method to our production systems. We formulate hypotheses and conduct Chaos Experiments to investigate them.

A chaos experiment has four parts.

First, we define a steady state as some measurable condition of a system that means it is working as normal. This could be the distribution of failure rates or the response time on some percentile of the requests. It is key that we have something measurable.

Second, we hypothesize that our system will maintain this steady state under some hostile conditions. This can be server restarts, disk failures or degraded performance in network or in services that we depend on.

Third, we have the execution of the experiment, done with tools like Chaos Monkey or Chaos Toolkit. Here we run the experiment trying to disprove our hypothesis.

Fourth, we collect data and analyse the results. The harder it is to disprove our hypothesis, the more confidence we have in the behaviour of our complex distributed system.

So the four components of a Chaos Experiment are the steady-state, our hypothesis, the execution and the analysis.
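
To make the four parts a little more tangible, here is a minimal Python sketch of an experiment. It does not use Chaos Monkey or Chaos Toolkit. The `ChaosExperiment` class, the `ToyService` stand-in and the 1% error-rate limit are all hypothetical and only there to show the shape of steady state, hypothesis, execution and analysis.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ChaosExperiment:
    hypothesis: str                      # what we expect to hold under stress
    steady_state: Callable[[], bool]     # measurable "working as normal" check
    inject_failure: Callable[[], None]   # the hostile condition we introduce
    rollback: Callable[[], None]         # restore the system afterwards

    def run(self) -> str:
        # Never start an experiment on a system that is already unhealthy.
        if not self.steady_state():
            return "skipped: not in steady state"
        try:
            # Execution: introduce the failure and measure again.
            self.inject_failure()
            held = self.steady_state()
        finally:
            self.rollback()
        # Analysis: did we manage to disprove the hypothesis?
        return "hypothesis held" if held else "hypothesis disproved"


class ToyService:
    """Hypothetical stand-in: serves traffic while at least one replica is up."""

    def __init__(self, replicas: int):
        self.total = replicas
        self.up = replicas

    def error_rate(self) -> float:
        return 0.0 if self.up > 0 else 1.0

    def kill_replica(self) -> None:
        self.up = max(0, self.up - 1)

    def restore(self) -> None:
        self.up = self.total


def replica_loss_experiment(service: ToyService) -> ChaosExperiment:
    return ChaosExperiment(
        hypothesis="Error rate stays below 1% when one replica is lost",
        steady_state=lambda: service.error_rate() < 0.01,
        inject_failure=service.kill_replica,
        rollback=service.restore,
    )


print(replica_loss_experiment(ToyService(replicas=3)).run())  # hypothesis held
print(replica_loss_experiment(ToyService(replicas=1)).run())  # hypothesis disproved
```

The second run is the interesting one: a disproved hypothesis is a discovered weakness, in this case a service with no redundancy.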

But wait, you’ll say. You will tell me I am crazy, that you can’t just do such horrible things to your production environment. And you’ll likely be right. Just as we need to build confidence in our production systems, we need to build confidence in our ability to conduct chaos experiments.

If you are completely new to Chaos Engineering, and perhaps are not even practising disaster recovery, disaster recovery training would be a really good place to start. It helps build our muscle and gets us used to manipulating environments. These trainings are commonly called game days. In some circles, they practice the “wheel of misfortune”, where you randomly select an incident and walk through that.

My best suggestion for a starting disaster recovery training is to test whether you are able to restore your backups to a testing system. It is both a healthy exercise and something that you should be practising regularly.

But let’s say that you are ready to start your chaos engineering journey. I still do not recommend that you just set Chaos Monkey loose in your production environment, have it crash servers, and see what happens. Remember, Chaos Engineering is a cool name for a boring practice.

Your first chaos experiments can be run in a completely segregated environment, ideally provisioned for this experiment only. It is likely that you will uncover many findings in even such a small environment. Keep repeating your experiment and raising the resilience of your applications. You should see that your service is already becoming more resilient in your production environments as a consequence of these experiments and remediations.

When you become more confident in your experimentation, you escalate the chaos experiment to be running in your staging environment. This should lead to more learning and more confidence in your capabilities. At this point, it becomes obvious that you must be able to abort a running chaos experiment, should it exceed some predefined boundary conditions.
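
A minimal Python sketch of such an abort mechanism could look like the following. The function name, the error-percentage metric and the 5% boundary are assumptions for illustration. A real setup would read the metric from your monitoring system.

```python
def run_with_abort(steps, read_error_percent, abort_above, rollback):
    """Apply failure-injection steps one by one and stop at the boundary.

    Returns how many steps were applied and whether the run was aborted.
    The rollback always runs, whether we abort or finish normally.
    """
    applied = 0
    aborted = False
    try:
        for step in steps:
            step()
            applied += 1
            # Check the predefined boundary condition after every step.
            if read_error_percent() > abort_above:
                aborted = True
                break
    finally:
        rollback()
    return applied, aborted


# Hypothetical system state: every injected fault adds 2% errors.
state = {"error_percent": 0}


def add_latency():
    state["error_percent"] += 2


def reset():
    state["error_percent"] = 0


result = run_with_abort(
    steps=[add_latency] * 5,
    read_error_percent=lambda: state["error_percent"],
    abort_above=5,
    rollback=reset,
)
print(result)  # (3, True)
```

Only three of the five planned steps run, because the third one pushes the error percentage past the boundary and the experiment is rolled back.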

When you are ready, you can run your supervised chaos experiments in production. This is often scary, but it is where the learning is maximized and the uncertainty is greatest. This will again lead to a new set of discovered weaknesses and remedies applied.

You truly have graduated in your Chaos Engineering when you are able to take your supervised Chaos Experiments, and run them unsupervised automatically in your production environments.

When the first Chaos Experiment has been successfully completed it is time to build your hypothesis backlog and continue your experiments in order to introduce increasing tension in your systems, and out-learn your competition.

Chaos Engineering helps us build a culture where failure is considered normal. A culture where we accept that any sufficiently complex system at all times will be running in a degraded but functional state. A culture where we don’t ask what happens if this fails, but rather what happens when this fails.

This has been the DevOpsDojo on Chaos Engineering. You can follow me on twitter @randomsort. You can find show notes and more at dojo.fm. Support the show by leaving a review, sharing this episode with a friend or colleague or subscribing to the DevOpsDojo on your favourite podcast platform. Until next time, keep learning. Thank you for listening.

May 18, 2020

The Five Cloud Characteristics

May 18, 2020

5 min

The five cloud characteristics come from the American standards institute NIST and have been shown by the Accelerate State of DevOps report to drive DevOps performance.
In this episode I cover the characteristics and why they matter.

Sources:

State of devops 2018: https://services.google.com/fh/files/misc/state-of-devops-2018.pdf

Cloud Characteristics: https://csrc.nist.gov/publications/detail/sp/800-145/final

Transcript

The Cloud. It is the promised land of IT infrastructure. The magical realm where seemingly infinite resources are available at the click of a button. Where computers appear out of thin air to do our bidding. It is a billion-dollar business, but even those who have invested heavily in using public clouds struggle to reap the benefits. They are still stuck pushing tickets and waiting days, weeks or months for virtual machines. In this episode, I cover the five cloud characteristics and why they matter for our DevOps performance. I’m Johan Abildskov, join me in the dojo to learn.

In this episode, I am going to talk about software infrastructure. In short, computers and connectivity between them. With a few more words, it is about getting the compute, memory, storage and network resources that we need to run our applications. There are three categories of infrastructure at this level of abstraction. On-premises, or on-prem as it is called, where everything is hosted inside the organizational perimeter. Public cloud, where everything is hosted externally, at a provider such as AWS, Azure, Alibaba or Google Cloud Platform. And finally, the hybrid cloud, where some workloads are hosted in a public cloud, while others are hosted on-prem. Each deployment pattern is valid and has its uses. In the DevOps community, we have a common narrative stating that cloud is superior to on-prem, and sometimes we fall into the trap of forgetting the tradeoffs we are making. My opinion is that while it is difficult to become as high performing on-premises as in the cloud, it is trivial to screw the cloud up just as badly as on-prem. So let’s look at the cloud characteristics that drive DevOps performance.

I learned about the five cloud characteristics from the Accelerate State of DevOps 2018. They found that those organizations that agreed with all five characteristics were 23 times more likely to be high performers. In 2019 that number had increased to 24 times as likely.
The characteristics come from the American standards institute NIST. Disregarding the cloud deployment model, they cover the characteristics our infrastructure should have, in order for it to be called cloud. This is very valuable in terms of aligning our vocabulary. Without further ado, and there has been much, let’s move to the characteristics themselves.

The first is “On-demand self-service”. That is, consumers can provision the resources they need, as they need them, without going through an approval process or ticketing system.
This is the first trap of cloud migrations. If we simply lift-and-shift our infrastructure, but leave the processes in place we are not going to maximize our gain. Cloud is a powerful tool to shorten feedback loops, build autonomy and allow the engineers to make the economic tradeoffs that impact their products. But that is only the case if on-demand self-service is present.

The second characteristic is broad network access. This is, to me, the least interesting characteristic, but that might be because I have not been in organizations where this has been a big pain. It means that the capabilities of our cloud are generally available through various platforms, such that they are not hidden from the engineers.

The third characteristic is resource pooling. This means that there is a pool of resources, and we as consumers do not control exactly where our workloads go. We can declare properties that we desire for our workloads, such as SSD disks, GPUs or a specific geographic region, but not particular hosts. For on-prem solutions, this can be addressed with platforms such as Kubernetes. One common way that we break this characteristic is manually configured servers, that are not maintained through version-controlled scripts. This leads to configuration drift, and that is a big pain to work with. Remember: Servers are cattle, not pets. Resource pooling also allows us to have higher utilization of our resources in a responsible way.
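
The idea of declaring desired properties rather than picking a host can be sketched in a few lines of Python. This is a toy placement function, not how Kubernetes or any cloud provider schedules workloads. The host names and properties are made up for the example.

```python
def place(workload_needs, pool):
    """Return the name of the first host offering every requested property.

    The consumer only declares what it needs. Which host it lands on
    is decided by the pool, not by the consumer.
    """
    for host in pool:
        # True when every requested key/value pair is present on the host.
        if workload_needs.items() <= host["properties"].items():
            return host["name"]
    return None


# Hypothetical pool of interchangeable hosts.
pool = [
    {"name": "host-a", "properties": {"disk": "hdd", "region": "eu-west", "gpu": False}},
    {"name": "host-b", "properties": {"disk": "ssd", "region": "eu-west", "gpu": False}},
    {"name": "host-c", "properties": {"disk": "ssd", "region": "us-east", "gpu": True}},
]

print(place({"disk": "ssd", "region": "eu-west"}, pool))  # host-b
print(place({"gpu": True, "region": "eu-west"}, pool))    # None
```

The second request returns nothing because no host in the pool offers a GPU in that region, which is a signal to grow the pool rather than to hand-configure a special server.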

The fourth characteristic is “rapid elasticity”. This means that we can scale our infrastructure on demand. This often becomes “Oh, we can scale up as much as we want, at a moment’s notice”. This is, however, only one side of the equation. It also means that we can scale down unused resources. This allows us to get the biggest return on investment on our infrastructure spending: we spend extra capital only when a surge hits our applications, whether that is Black Friday, the first of the month, or something we could not anticipate in advance.
This can be obtained in on-premises solutions, but requires some upfront investment in order to be ahead of the utilization curve.
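
Both sides of the equation, scaling up and scaling down, can be shown with a small Python sketch. The function name, the capacity per instance and the minimum and maximum limits are assumptions for this example. Real autoscalers use richer signals than requests per second.

```python
import math


def desired_instances(requests_per_second, capacity_per_instance,
                      minimum=1, maximum=20):
    """Return how many instances we want for the current load.

    The result follows the load in both directions, but never goes
    below `minimum` or above `maximum`.
    """
    needed = math.ceil(requests_per_second / capacity_per_instance)
    return max(minimum, min(maximum, needed))


# Hypothetical load over a day, with each instance handling 100 requests/s.
print(desired_instances(950, 100))   # 10 -> scale up for the surge
print(desired_instances(120, 100))   # 2  -> scale back down afterwards
print(desired_instances(0, 100))     # 1  -> never below the minimum
print(desired_instances(5000, 100))  # 20 -> capped at the maximum
```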

The fifth characteristic is metered service. We only pay for what we use. This allows us to get much more transparency in our cost. The Accelerate State of DevOps report 2019 found that those who matched the characteristics were 1.6 times more likely to go under budget, and 2.6 times more likely to be able to accurately estimate their costs. This characteristic truly represents the commoditization of cloud computing.
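
A small worked example in Python shows why paying only for what we use matters. The hourly rate and the usage pattern are made up for illustration, and the comparison with provisioning for peak load all day is an assumption of this sketch, not a figure from the report.

```python
HOURLY_RATE_CENTS = 10  # hypothetical price per instance-hour


def metered_cost_cents(instances_per_hour):
    """Pay only for the instance-hours that were actually used."""
    return sum(instances_per_hour) * HOURLY_RATE_CENTS


def peak_provisioned_cost_cents(instances_per_hour):
    """Pay for peak capacity during every hour, used or not."""
    return max(instances_per_hour) * len(instances_per_hour) * HOURLY_RATE_CENTS


# Hypothetical day: 2 instances for 20 hours, 10 instances during a 4-hour surge.
day = [2] * 20 + [10] * 4

print(metered_cost_cents(day))           # 800
print(peak_provisioned_cost_cents(day))  # 2400
```

With these made-up numbers, the metered bill is a third of what it would cost to keep peak capacity running all day, and every cent on it can be traced back to actual usage.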

These were the five characteristics of cloud computing. It does not matter where we host our workloads as long as we get on-demand self-service, broad network access, resource pooling, rapid elasticity and metered service. I hope that when you discuss software infrastructure you consider these characteristics and how they enable business agility. Think about where your organization is missing the target, and what you can do to adapt.

This has been the DevOpsDojo on the five cloud characteristics. You can follow me on twitter @randomsort. You can find show notes on the website at dojo.fm. Support the show by leaving a review, sharing this episode with a friend or colleague. Subscribe to the DevOpsDojo on your favourite podcast platform to keep up with the show. In the next episode I will cover Chaos Engineering. Until next time, keep learning. Thank you for listening.

May 10, 2020

Trailer - Introducing the DevOps Dojo

May 10, 2020

39 sec

Welcome to the DevOps Dojo

This is where we learn. This episode introduces the DevOps Dojo and its host Johan Abildskov.