Windows resiliency: Best practices and the path forward



 Windows IT Pro Blog:

The broad, open nature and scale of the Windows computing ecosystem is part of what makes it a powerful and unmatched choice across the globe. The recent CrowdStrike incident underscores the need for mission-critical resiliency within every organization, and our unique ability to support the change required.

When a major incident arises, we focus on remediation, learning, and change, all while communicating transparently to our ecosystem. On Saturday, David Weston described our "first responder" approach. Since the start, we engaged over 5,000 support engineers working 24x7 to help bring critical services back online. We are providing ongoing updates via the Windows release health dashboard, where we detail remediation steps, including a signed Microsoft Recovery Tool.

Our goal is to be your trusted partner as you leverage technology and the end-to-end Microsoft stack to deliver amazing value for your workforce, your customers, and your partners. That means, when an issue arises, we immediately engage with partners and customers to dig into the details, help, learn, and evolve.

This incident shows clearly that Windows must prioritize change and innovation in the area of end-to-end resilience. These improvements must go hand in hand with ongoing improvements in security and be in close cooperation with our many partners, who also care deeply about the security of the Windows ecosystem.

Examples of innovation include the recently announced VBS enclaves, which provide an isolated compute environment that does not require kernel mode drivers to be tamper resistant, and the Microsoft Azure Attestation service, which can help determine boot path security posture. These examples use modern Zero Trust approaches and show what can be done to encourage development practices that do not rely on kernel access. We will continue to develop these capabilities, harden our platform, and do even more to improve the resiliency of the Windows ecosystem, working openly and collaboratively with the broad security community.

There is always the chance that an outage will impact an organization. Over the last few days, we've been on thousands of calls with organizations around the world. We've observed that those who were able to remediate and recover the most quickly followed a similar set of practices. We want to share those best practices with you.

Best practices to support resiliency in your organization​

  1. Have business continuity planning (BCP) and a major incident response plan (MIRP) in place. Include response and recovery best practices that outline the steps needed to get your environment back up and operating, including who to call and how to get support.
  2. Back up data securely and often. We recommend your organization utilize cloud storage and backup solutions, as these are great options for securely accessing, sharing, and collaborating on files from anywhere. Organizations utilizing cloud storage solutions have had better experiences getting back online, as this removed barriers to simply resetting the device.
  3. Ensure that you can restore your Windows devices quickly. A key component of resiliency in the event of an issue is to regularly create system restore points and use Windows built-in recovery options to restore devices. If you use Azure virtual machines, you can take a snapshot of your VMs. Organizations with recent restore points were able to recover more quickly from the recent CrowdStrike issue and we observed that virtualized/cloud environments were among the quickest to recover.
  4. Utilize deployment rings. Extend safe deployment practices into your environment by creating deployment rings to manage the rollout of updates and new features. Utilize your existing device management tools to manage deployment risk using the same approach Microsoft does. Alternatively, take advantage of automated deployment with Windows Autopatch. If you are using non-Microsoft products in your environment, including antivirus solutions, ensure that they offer ring-based deployment so you can control the pace and scale for your environment. As an example, Microsoft Defender allows for custom configuration of both engine and intelligence update staging.
  5. Use the latest Windows security defaults and enable Windows security baselines. Enable the security features that are available in Windows by default. Take advantage of Windows security baselines, which provide Microsoft-recommended, well-tested configurations based on feedback from Microsoft security engineering teams, product groups, partners, and organizations. Windows offers several built-in security features to leverage, from firewalls to encryption to biometrics, and more at the enterprise level with endpoint detection and response (EDR), data protection, vulnerability management, compliance monitoring and more.
  6. Adopting a cloud-native approach to managing Windows devices can make it easier to deploy updates and support recovery efforts in outage scenarios. Look at ways to move away from on-premises solutions to cloud management solutions, cloud identity solutions, and ring-based deployment and update management solutions like Windows Autopatch.

Our commitment to transparency​

Our focus continues to be on helping our customers recover from this incident. We will practice transparency in sharing learnings, best practices, and, eventually, more detailed discussions that include changes designed to strengthen the broader ecosystem moving forward.


 Source:

 
Utilize deployment rings. Extend safe deployment practices into your environment by creating deployment rings to manage the rollout of updates and new features. Utilize your existing device management tools to manage deployment risk using the same approach Microsoft does. Alternatively, take advantage of automated deployment with Windows Autopatch. If you are using non-Microsoft products in your environment, including antivirus solutions, ensure that they offer ring-based deployment so you can control the pace and scale for your environment. As an example, Microsoft Defender allows for custom configuration of both engine and intelligence update staging.

blah blah...
Choose a technology partner with an OFF switch.

I get a bad feeling MS will figure out how to profit off the CrowdStrike outage by adding a push-button rollback feature, based on VSS. Vendor pushes a crippling update, automatically revert to the last good snapshot.
 

My Computer

System One

  • OS
    Windows 7
So basically, what they're saying is that the processes who live on floor VTL0 can only talk about things like what's for supper. Only the higher-ups, who live on floor VTL1, can talk about anything they like. That's so innovative I mean. Impressive!
 

My Computers

System One System Two

  • OS
    11 Home
    Computer type
    Laptop
    Manufacturer/Model
    Asus TUF Gaming F16 (2024)
    CPU
    i7 13650HX
    Memory
    16GB DDR5
    Graphics Card(s)
    GeForce RTX 4060 Mobile
    Sound Card
    Eastern Electric MiniMax DAC Supreme; Emotiva UMC-200; Astell & Kern AK240
    Monitor(s) Displays
    Sony Bravia XR-55X90J
    Screen Resolution
    3840×2160
    Hard Drives
    512GB SSD internal
    37TB external
    PSU
    Li-ion
    Cooling
    2× Arc Flow Fans, 4× exhaust vents, 5× heatpipes
    Keyboard
    Logitech K800
    Mouse
    Logitech G402
    Internet Speed
    20Mbit/s up, 250Mbit/s down
    Browser
    FF
  • Operating System
    11 Home
    Computer type
    Laptop
    Manufacturer/Model
    Medion S15450
    CPU
    i5 1135G7
    Memory
    16GB DDR4
    Graphics card(s)
    Intel Iris Xe
    Sound Card
    Eastern Electric MiniMax DAC Supreme; Emotiva UMC-200; Astell & Kern AK240
    Monitor(s) Displays
    Sony Bravia XR-55X90J
    Screen Resolution
    3840×2160
    Hard Drives
    2TB SSD internal
    37TB external
    PSU
    Li-ion
    Mouse
    Logitech G402
    Keyboard
    Logitech K800
    Internet Speed
    20Mbit/s up, 250Mbit/s down
    Browser
    FF

Latest Support Threads

Latest Tutorials

Back
Top Bottom