Chaos Engineering for SQL Server - Andrew Pruski - PSConfEU 2023

Published: 7 July 2023
Channel: PowerShell Conference EU


Hello everyone, and welcome to this session on Chaos Engineering for SQL Server. I'm Andrew Pruski, a Field Solutions Architect for Pure Storage, and I'll be guiding you through this session. Today, we'll be discussing the principles and implementation of chaos engineering for SQL Server.

Chaos engineering is all about testing how production systems react to failure. It's a way to build confidence in our systems and to uncover misconfigurations before they cause an outage. It's important to note that chaos engineering is not about breaking things in production. Instead, experiments should run in staging or development environments that closely resemble production, so that the results carry over.

To give you some context, let's take a quick look at the history of chaos engineering. Companies like Netflix, Google, and Slack have all experienced failures in their systems and have adopted chaos engineering as a way to prevent future failures. For example, Netflix developed Chaos Monkey, a tool that intentionally switches off servers to test system resilience.

In this session, we will focus on chaos engineering for SQL Server. Before diving into the principles and implementations, it's important to analyze past failures in your production environment. This helps identify what went wrong, the technologies and strategies in place, and the lessons learned from those failures. One strategy to build confidence in your systems is to test the high availability (HA) strategy, such as clustering, availability groups, mirroring, or replication.

To design chaos engineering experiments, it's helpful to create a likelihood-impact map. This involves rating each failure scenario in your environment on two axes: how likely it is to happen and how much damage it would do. Involving a group of engineers in this exercise generates a wider variety of scenarios. The scenarios that score high on both likelihood and impact are prime candidates for your first chaos engineering experiments.
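As a rough illustration of the ranking step, here is a minimal Python sketch. The 1 to 5 rating scale, the likelihood-times-impact score, the `Scenario` class, and the example ratings are all assumptions made for illustration; the talk summary does not prescribe a scoring method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (frequent)
    impact: int      # assumed scale: 1 (minor) to 5 (severe)

    @property
    def score(self) -> int:
        # Simple product; other weightings are equally plausible.
        return self.likelihood * self.impact


def prioritise(scenarios, top_n=3):
    """Return the top_n scenarios, highest score first (ties broken by impact)."""
    ranked = sorted(scenarios, key=lambda s: (s.score, s.impact), reverse=True)
    return ranked[:top_n]


# Hypothetical ratings, as a group of engineers might agree on them.
scenarios = [
    Scenario("Power failure on primary node", likelihood=2, impact=5),
    Scenario("Point-in-time restore fails", likelihood=3, impact=5),
    Scenario("Monitoring server is down", likelihood=3, impact=3),
    Scenario("UPDATE without WHERE in production", likelihood=4, impact=4),
    Scenario("Key engineer unavailable", likelihood=4, impact=2),
]

for s in prioritise(scenarios):
    print(f"{s.score:>2}  {s.name}")
```

With these made-up ratings, the three highest-scoring scenarios become the first experiments to design. The exact numbers matter less than the discussion needed to agree on them.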

Potential scenarios to test include power failures, backups and restores, and other failures specific to your environment. It's important to choose experiments that simulate failures that are both likely to happen and would have a real impact. Restore testing, for example, can be automated with tools that restore each backup and report any issues.

I also want to share a personal story about a previous job where we relied on a third-party backup tool with an instant restore feature. However, when we needed to perform a point-in-time restore, the feature failed and took longer than expected. This highlights the importance of regularly testing restores to ensure confidence in the backup strategy.

Another area to focus on is monitoring systems. They can be a point of failure, so it's crucial to have highly available and regularly tested monitoring systems. Testing scenarios like a runaway transaction causing the transaction log to grow rapidly can help identify if alerts are being sent in a timely manner.
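One way to judge such an experiment afterwards is to compare when the log crossed the alert threshold with when the alert actually arrived. The Python sketch below assumes you have already collected log-size samples and the alert timestamp by some other means. The function names, the threshold, and the sample values are hypothetical.

```python
def first_breach(samples, threshold_mb):
    """Return the time (in seconds) of the first sample at or above the threshold, or None."""
    for t, used_mb in samples:
        if used_mb >= threshold_mb:
            return t
    return None


def alert_was_timely(samples, threshold_mb, alert_time, max_delay_s):
    """Check whether an alert arrived within max_delay_s of the first breach.

    Returns None if the threshold was never crossed (the experiment did not
    exercise the alert), False if no alert arrived or it arrived too late.
    An alert that fired before the breach counts as timely.
    """
    breach = first_breach(samples, threshold_mb)
    if breach is None:
        return None
    if alert_time is None:
        return False
    return alert_time - breach <= max_delay_s


# Hypothetical samples: (seconds since experiment start, log space used in MB).
samples = [(0, 500), (60, 900), (120, 2100), (180, 4200)]

print(alert_was_timely(samples, threshold_mb=2000, alert_time=150, max_delay_s=60))
print(alert_was_timely(samples, threshold_mb=2000, alert_time=None, max_delay_s=60))
```

Returning `None` when the threshold is never reached keeps "the alert failed" separate from "the experiment did not actually test the alert".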

How UPDATE statements reach production systems is also important to test. Running an UPDATE statement without a proper WHERE clause modifies every row in the table, which can have disastrous consequences. A deployment pipeline with thorough testing and safeguards such as code analysis can help prevent such issues.
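To make the code analysis idea concrete, here is a deliberately naive Python sketch of a pipeline check that flags UPDATE and DELETE statements with no WHERE clause. It assumes statements are terminated by semicolons and uses regular expressions instead of a real T-SQL parser, so it will miss cases such as statements that begin with a CTE. The function name `unsafe_statements` and the sample script are hypothetical.

```python
import re

# Match string literals and comments in one left-to-right pass so that
# neither is misread as part of the other.
_TOKEN = re.compile(r"'(?:[^']|'')*'|--[^\n]*|/\*.*?\*/", re.DOTALL)
_STARTS_DML = re.compile(r"(update|delete)\b", re.IGNORECASE)
_HAS_WHERE = re.compile(r"\bwhere\b", re.IGNORECASE)


def _blank(match):
    # Replace string literals with an empty literal and comments with a space.
    return "''" if match.group(0).startswith("'") else " "


def unsafe_statements(script):
    """Return UPDATE/DELETE statements in script that have no WHERE clause."""
    cleaned = _TOKEN.sub(_blank, script)
    flagged = []
    for stmt in cleaned.split(";"):
        stmt = " ".join(stmt.split())  # collapse whitespace
        if _STARTS_DML.match(stmt) and not _HAS_WHERE.search(stmt):
            flagged.append(stmt)
    return flagged


script = """
-- fix prices where needed
UPDATE dbo.Products SET Price = Price * 1.1;
UPDATE dbo.Products SET Price = 10 WHERE ProductId = 42;
DELETE FROM dbo.Orders;
SELECT 'update without where; looks scary' AS Note;
"""

for stmt in unsafe_statements(script):
    print("Missing WHERE:", stmt)
```

A pipeline step could fail the deployment whenever the returned list is not empty. A check like this is only one safeguard and does not replace review and testing.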

Disaster recovery is another aspect that needs regular testing. Many companies overlook this, but having a valid DR plan and testing it regularly is essential. Chaos engineering experiments can help uncover vulnerabilities and shortcomings in the DR plan.

Lastly, it's important to consider the human element of working with production systems. Having contingency plans in case key individuals are unavailable or unable to fulfill their role is crucial. Simulating scenarios where key individuals are absent can help identify gaps in knowledge and documentation and ensure the system remains operational.

In this part of the lecture, we discuss the importance of testing failures in order to improve system resilie…
