Thanks for helping us make the first IRConf a huge success! Stay tuned for videos, updates, and details about how to stay in touch with the IR community.
For the first time, IRConf brought together front-line DevOps and SRE practitioners for a half-day virtual conference for industry experts and new voices in incident response. Incident responders came together to swap horror stories about the biggest outages, discuss best practices, and gain a better understanding of how the best are dealing with incidents.
Watch the whole conference as a playlist!
Author of DevOps for Dummies & Head of Community Engagement for AWS
Tech has spent ages treating incidents as something engineers can respond to, measure, even prevent. But what would happen if we accepted incidents as part of the software development lifecycle?
In this talk I’ll turn incident management on its head and suggest an entirely different approach — treating incidents as business as usual. I’ll discuss the importance of celebrating failure and how to utilize incidents to galvanize and empower your teams.
Senior Applied Resilience Engineer @ Netflix
Much of the discussion on improving incident response focuses on improving tools and processes. While certainly important, we also need to look at how people interact during incidents.
In this talk, we'll examine some core principles of Resilience Engineering and look at the various ways teams of _people_ cope during incidents and how they can improve their work with other _people_ to drive better outcomes in incidents.
SVP Eng @ Pendo.io, ex-Director of Customer Reliability Engineering @ Google
Cofounder & CEO @ Honeycomb
Cofounder @ Kintaba, ex-Facebook Tools & Enterprise Products
CTO @ 1Password
Cofounder & CEO @ Komodor
Moderated by Jason English, Principal Analyst @ Intellyx
Sr. Reliability Advocate @ Gremlin
Over the years, a lot of research has been conducted and many books have been written on how to improve the resilience of our software. This talk will dive deep into the three key practices identified by the authors of Accelerate to improve reliability: Chaos Engineering, GameDays, and Disaster Recovery. We will discuss the key measures of tempo and stability, and how practicing Chaos Engineering will increase both.
Product Manager @ Datadog
On-call culture has changed little since the start of software development, unlike the major cultural shifts witnessed in development practices and technologies. It’s time to rethink what on-call is and rebuild it with a human-first approach for a more effective and sustainable on-call practice.
Developer Advocate @ Cyral
Have your security needs taken a back seat to “run fast and break things”? Join us for this deep dive into adding container scanning to a DevOps pipeline and production monitoring. You can achieve a robust security posture and still release continuously.
SRE @ Honeycomb
In this presentation, we’ll adopt the perspective that the causes of incidents are explanations we come up with, not actual hard culprits. We will then cover the changes in attitude and the benefits that can come from reframing causes and errors, and the positive impact they can have on your system.
CTO @ Komodor
Kubernetes has given us the power to move extremely fast. But without any safeguards in place, we more often than not find ourselves staring at it blankly, trying to figure out what the hell went wrong (sound familiar?).
In this session, we are going to do a live demonstration of common Kubernetes failure scenarios, both app and infra related, and may the Gods of the demo be kind to us.
We will laugh a little and cry a little, as we cover Kubernetes monitoring, observability & troubleshooting best practices and talk about metrics, distributed tracing, logging, network visualization and more. But cheer up! We’ll wrap up by introducing some helpful tools, in order to find and fix issues as fast as possible.
Senior DevOps Engineer @ Wix.com
Incident management can be challenging, BUT! There are things you can do to make it easier. In this talk I’ll cover proactive steps you can take and incorporate into your day-to-day routine to prepare for a smoother and more efficient incident management process.
Software Engineer @ Riskified