SRE in the Age of AI

Site reliability engineering (SRE) is a concept introduced by Google in 2004, and it has since been adopted by many leading software organizations. In its purest form, SRE is what you get when you treat operations as a software problem.

Industry reports point to the strategic value that SRE offers software-centric organizations. Key takeaways from the sixth edition of the SRE Report 2024 reflect SRE’s foundational role, specifically in operationalizing cloud-native distributed software systems at scale. Another significant development was the introduction of a fifth DORA metric, reliability, in 2021, which underscored the importance of reliability and of SRE practices.

As software is woven into every sphere of life, reliability practices must be integrated tightly and pragmatically. The operational performance of our dynamic, complex software ecosystem is vital, and this is where site reliability engineering practices help.

As software converges with more disruptive fields, such as material science, pharmaceuticals, health science, security forces and space technologies, we cannot overlook operational performance indicators and software assurance. Priorities for leaders include evolving SRE practices in step with this wave, creating innovative assets for SREs and finding new ways to operate a software-centric ecosystem reliably.

Furthermore, software development itself is undergoing significant transformations. The first is the exponential rise of open source, which requires SREs to rethink how they work and to be open to new ways of collaborating to meet reliability expectations. Recovering from operational incidents, or even applying updates and corrections, requires new knowledge, new partners and new ways to operationalize reliability.

The next immediate challenge is the hybrid deployment of software applications across cloud and on-premises environments. These deployment models bring a fresh set of reliability challenges and introduce new forms of risk at runtime. Areas that can be affected include failover strategy, performance tuning, and disaster recovery and rollback. SRE practices such as observability play a crucial role in this evolution.

Another major disruption comes from generative AI, coding assistants and the adoption of AI at scale in general. Lastly, new data regulations, the need for a secure supply chain and rising demand for sustainable, green software all contribute to the evolution of SRE at its core.

We have discussed the key triggers that will push SRE practices beyond the status quo. Let’s now explore how SREs can prepare to respond to next-generation user experience and customer demands with a suite of new capabilities. The next section looks at SRE-driven software operations at scale and links key trends to the evolution of these practices.

Leading SRE Practice in the Age of AI

SRE practices are key to creating a resilient software ecosystem. Organizations will start seeing tangible results once SREs focus on incorporating new technologies and ways of working, and on collaborating with new partners in the ecosystem. The rest of this article explores how injecting AI into the evolution of SRE can realize that value.

Tackling Open Source — SRE in Action with AI-Powered Assets

Let’s dive into the key areas described above, starting with open source. One of the main challenges for SREs is maintaining open source at runtime. It is estimated that 96% of codebases contain some open source, which can result in operational overhead if organizations are not prepared. One example is Log4j, where many organizations were affected through transitive dependencies, and it was reported that hundreds of hours were lost in dealing with it.

With the rise of open-source integration into mainstream applications, SREs are tasked, knowingly or unknowingly, with monitoring and managing open-source vulnerabilities in real time. As the complexity of the ecosystem increases, this becomes difficult without a pragmatic approach. Such an approach starts with ensuring that a software bill of materials (SBOM) is in place, integrating the SBOM into the runtime environment so that teams can quickly understand and react to incidents, and using tools that detect such vulnerabilities quickly and support a response without creating additional toil.
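To make the SBOM idea concrete, here is a minimal sketch in Python of checking an SBOM against a vulnerability advisory list. It assumes a simplified CycloneDX-style JSON document with a "components" array of name and version entries. The ADVISORIES table, the helper names and the sample SBOM are hypothetical and purely illustrative; a real setup would pull advisories from a vulnerability database and use a proper version-comparison library.

```python
import json

# Hypothetical advisory feed: component name -> advisory id and first fixed version.
# Illustrative only; a real feed would come from a vulnerability database.
ADVISORIES = {
    "log4j-core": {"id": "CVE-2021-44228", "fixed_in": "2.15.0"},
}


def parse_version(text):
    """Turn '2.14.1' into (2, 14, 1); non-numeric parts are ignored."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def find_vulnerable(sbom, advisories):
    """Return (name, version, advisory id) for each affected SBOM component.

    Assumes every component entry carries both a name and a version.
    """
    findings = []
    for component in sbom.get("components", []):
        advisory = advisories.get(component["name"])
        if advisory is None:
            continue
        if parse_version(component["version"]) < parse_version(advisory["fixed_in"]):
            findings.append((component["name"], component["version"], advisory["id"]))
    return findings


# A minimal CycloneDX-style document with two components (sample data).
SBOM_JSON = """
{
  "bomFormat": "CycloneDX",
  "components": [
    {"name": "log4j-core", "version": "2.14.1"},
    {"name": "jackson-databind", "version": "2.15.2"}
  ]
}
"""

findings = find_vulnerable(json.loads(SBOM_JSON), ADVISORIES)
for name, version, advisory_id in findings:
    print(f"{name} {version} is affected by {advisory_id}")
```

Because the SBOM lists transitive dependencies alongside direct ones, a lookup like this can answer "are we affected?" during an incident without someone tracing dependency trees by hand.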

Another possible approach is for SREs to proactively collaborate with the open-source program office (OSPO) to create SRE assets such as playbooks, checklists and tools, and to integrate them with operations such as SBOM and software inventory management, including open-source libraries, for better operational performance. The OSPO, along with SRE, can trigger AI-supported audits from time to time to track operational risks and mitigation strategies.

SRE Unwraps AI’s Potential for Next-Generation Reliability

Soon SREs will have to deal with machine-written, AI-generated code, which poses security and compliance challenges at a different scale. SREs must be prepared to tackle new types of vulnerabilities introduced by coding assistants. Another consideration is that bad actors can enhance their capabilities and inject malicious code by exploiting coding assistants. SREs must continue to evolve their knowledge and assets with new tools that can defend against the threats introduced through AI assistants.

Taking a deeper dive into generative AI technology, most companies use open-source LLMs due to cost constraints. Using an open-source LLM requires customization to adapt it to workflows, and integration with proprietary data to enhance its value. SREs can work with the data science community to provide early feedback. This feedback would help reduce hallucination, which is when AI models generate plausible-sounding but fabricated output. SREs can also measure a hallucination quotient through metrics that monitor the performance of the models.
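The article does not define the hallucination quotient, so the following Python sketch makes an assumption: it treats the metric as the fraction of recently reviewed model responses that a reviewer (human or automated) flagged as unsupported. The class name, window size and alert threshold are all hypothetical choices for illustration.

```python
from collections import deque


class HallucinationRateMonitor:
    """Tracks the share of recently reviewed responses flagged as hallucinations."""

    def __init__(self, window_size=100, alert_threshold=0.05):
        # Only the most recent `window_size` labels are kept; True = flagged.
        self.labels = deque(maxlen=window_size)
        self.alert_threshold = alert_threshold

    def record(self, hallucinated):
        """Store the review outcome for one model response."""
        self.labels.append(bool(hallucinated))

    def rate(self):
        """Fraction of responses in the window that were flagged."""
        if not self.labels:
            return 0.0
        return sum(self.labels) / len(self.labels)

    def should_alert(self):
        """True when the windowed rate exceeds the configured threshold."""
        return self.rate() > self.alert_threshold


# Sample review outcomes (assumed data): 3 of 10 responses were flagged.
monitor = HallucinationRateMonitor(window_size=10, alert_threshold=0.2)
for flagged in [False, False, True, False, True, False, False, True, False, False]:
    monitor.record(flagged)

print(f"hallucination rate: {monitor.rate():.0%}")
print("alert" if monitor.should_alert() else "ok")
```

A sliding window is used so that the metric reflects recent model behavior, for example after a fine-tuning run or a change to the proprietary data the model is integrated with, rather than its entire history.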

Another capability to look forward to is the use of AI agents. There is substantial excitement about pairing AI agents with SRE. SREs will play a pivotal role in improving the output of AI agents through human preference feedback, making it more useful over time.

There are various other possibilities for incorporating generative AI into SRE workflows. Generative AI is new, but its adoption is growing at a massive scale. By implementing a proactive approach, SREs can remain ahead of the curve and minimize the associated risks.

The Strategic Importance of AI-Enhanced SRE for Cloud-Agnostic Posture

The move of leading organizations toward a cloud-agnostic posture is an opportunity for SRE to step up. Some of the core operational challenges of onboarding to the cloud, including cost sprawl, increased clutter, and security and compliance risks, are highlighted in my previous article.

SREs can play a vital role in tuning the performance of cloud deployments at runtime and in enhancing the rollback and recovery posture in the cloud. AI-powered backup can help SREs dynamically adjust regular backup schedules and recommend backup strategies for applications based on usage and criticality. Rollback planners can intelligently model rollbacks by defining checkpoints, analyzing logs and so on. Lastly, SREs can leverage AI capabilities for resource optimization and predictive maintenance based on historical data.

SRE practices such as observability, when powered by AI capabilities, can support SREs in this transformative journey. From monitoring operational KPIs and dynamically adjusting thresholds to more ambitious ideas such as self-healing capabilities, next-generation observability tools are evolving in multiple dimensions.
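As one simple illustration of dynamically adjusting thresholds, the Python sketch below derives an alert threshold from recent history (rolling mean plus a multiple of the standard deviation) instead of a fixed number. This is a basic statistical baseline rather than what any particular observability product does; the class name, window size, multiplier and sample latencies are assumptions.

```python
from collections import deque
from statistics import mean, stdev


class DynamicThreshold:
    """Alert threshold that follows a rolling baseline of recent values."""

    def __init__(self, window_size=30, sigmas=3.0, min_samples=5):
        self.history = deque(maxlen=window_size)  # recent non-anomalous values
        self.sigmas = sigmas
        self.min_samples = min_samples  # must be at least 2 for stdev()

    def threshold(self):
        """Current limit, or None while there is too little history."""
        if len(self.history) < self.min_samples:
            return None
        return mean(self.history) + self.sigmas * stdev(self.history)

    def observe(self, value):
        """Return True if value breaches the current threshold.

        Breaching values are not added to the history, so a spike
        does not inflate the baseline used for later checks.
        """
        limit = self.threshold()
        breached = limit is not None and value > limit
        if not breached:
            self.history.append(value)
        return breached


# Sample request latencies in milliseconds (assumed data) with one spike.
latencies_ms = [100, 102, 98, 101, 99, 103, 250, 100]
detector = DynamicThreshold(window_size=30, sigmas=3.0, min_samples=5)
flags = [detector.observe(v) for v in latencies_ms]
print(flags)
```

Only the 250 ms sample is flagged here. As the baseline drifts, for example with daily traffic patterns, the threshold drifts with it, which is the property that a static threshold lacks.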

The traditional observability ecosystem is fragmented. If we can standardize its interfaces, and so democratize the development of troubleshooting, predictive maintenance and application insights on top of observability tools, then SRE has the potential to create more assets for the community that simplify operations for complex cloud-native applications. AI-powered automation applications can produce key business insights, developer feedback dashboards, troubleshooting apps and much more. The possibilities with AI are wide open if we standardize the ecosystem in some way.

Another area is the mitigation of security incidents. With generative AI tools, SREs can intelligently scan applications for runtime vulnerabilities and receive recommended corrective actions, and AI agents can work alongside SREs to help navigate security challenges. When a security incident occurs, AI-powered security assessments and audits, together with proactive defense through pattern recognition and prediction of system behavior, give SREs practical insights for better resilience.

Leveraging New Skills

The sixth edition of the SRE Report 2024 indicates that 53% of professionals believe AI will be valued. Technology can improve the efficiency and effectiveness of SREs and eventually enable them to prepare for, and respond quickly to, challenging technology-driven reliability incidents. In developing organization-level talent strategies for operations, the “Ops” side of DevOps is essential. Reskilling and upskilling “Ops” staff is often not on the priority list of large organizations, which creates a strategic gap for reliability. The value of SREs leveraging these new skills will likely be seen over time, mostly in the last mile of introducing technologies into the development flow. The key challenge for SREs is to figure out where to start harnessing the power of new technologies. SRE communities will be uniquely valuable in embracing the evolving roadmap of SRE.

Conclusion

Today, SREs are likely to keep moving forward and regularly enhancing their capabilities. To instill trust in the digital ecosystem, it is important for organizations to focus on the “Ops” side of DevOps. In the long term, SREs can build a resilient future by ensuring the smooth operationalization of software combined with new technology. The best-case scenario is to strike the right balance between investment in developer productivity and investment in resiliency and operational efficiency, with new assets, tools and skills for SRE.