DevOps And SRE – A Match Made In (Technology) Heaven

Russ Felker

CTO at GlobalTranz

Learning Objectives

While operations, deployment, and quality are key focus areas of DevOps, continuous improvement of application performance and reliability once deployment is done can fall into a lower priority. That’s where a Site Reliability Engineering team can provide a critical set of eyes and hands to focus on improvement after deployment. The key is ensuring the right strategy for how DevOps and SRE work together and build on each other’s core strengths and focus areas.


Key Takeaways:



  • How to leverage a combination of DevOps and SRE to improve deployments consistently

  • The overlaps and differences in DevOps and SRE

  • Ways to structure an organization to staff the teams and make continuous improvement ingrained culturally


"As you look at these two groups in conjunction, you can see how additive they are, how much they contribute and collaborate together to make an overall improved process."

Russ Felker

CTO at GlobalTranz

Transcript

Hi, and welcome to DevOps and SRE: A Match Made in (Technology) Heaven. I’m Russ Felker, CTO at GlobalTranz, and I’ll be taking you through this. A little background about me: again, Chief Technology Officer. I’ve come up through a variety of different industries, as well as different roles, both on the infrastructure side and on the product development side. Software operations is not just something that I’ve done and worked on with teams in the past, but is one of my passions. I’ve been doing this for 35 years, so it’s been a little while, and I’ve seen a lot of different approaches to solving some of the common problems that exist in product development organizations and in technology organizations in general. I do live in Chicago, so I am on Central time, right in the middle of everything. And I’m really looking forward to walking through this with you guys today.

Now, I will say before we get into the content that there’s a hidden agenda within this presentation. As we go through, I encourage you to look for and keep track of the different references that are sprinkled throughout. I’ll leave it at that, but let’s have some fun. At the end, we’ll come back and talk about all those different little pieces that are in there.

So, Site Reliability Engineering is one of the big things we’re going to talk about. Lots of people have heard of DevOps, and have probably implemented DevOps and have it working within their groups. They might have heard of Site Reliability Engineering; it’s actually been around for quite a while. I’ve got a couple of definitions up here. Google and Microsoft have slightly different but similar definitions for it. And really, it is the counterpart to DevOps: it’s the part that happens once something is in production.
It’s the team that’s looking at that production system, at the production environment, and at those production users, and translating the needs and the experience back up the chain. As DevOps is moving things down, SRE is moving things back up the chain. We’ll go into that a little bit more as we get into some of the differences, but also some of the ways that these two groups can really work together to enable a more robust and resilient product environment.

Alright, so first off, we’ve got the cage match of SRE and DevOps, and it seems like they should be contentious. You’ve got one group, DevOps, that’s pushing things down that software development pipeline. You’ve got another group, SRE, that’s looking at things from the operational perspective and looking at ways to move things back up that chain. The key is that they have some similarities: they’re both looking at metrics. But with SRE, you’re assigning prescriptive actions to those metrics, whereas DevOps is more about implementing the measures and the collection of the data that supports those metrics. With SRE, you’re really looking at deployment and operation. With DevOps, while deployment and operation is the ultimate goal of any code that’s being created, really of anything that’s being created within the technology environment, the focus is more on streamlining the movement of that functionality. How do I get that functionality from the initial point of creation all the way through to an operational system? And how do I do that effectively, in an automated fashion? Therein lies a key difference in the perspectives of the two groups. One of the keys with SRE and DevOps is that they both break down silos, but with SRE you have more of this process and technology serving operations.
So SRE treats operations as a technology problem, versus DevOps, which is looking at execution for deployment and code delivery out into the operational systems and environments. As you look at DevOps and SRE, you’ve got the push being done all the way from the initial feature identification through to actually operating a product, and DevOps is there to help streamline that. If you look over on the left there, you’ll see that you’ve got product management. This is one that doesn’t get included a lot from a DevOps perspective, but it is critical to the DevOps process.


Product management’s feature identification, story creation, and, most importantly, creation of the user acceptance criteria, the UACs that go along with that, feed all of the components that come after, from development to QA and testing, to even the push process of how you get that code out. UACs are a critical component. And of course, within operations, that’s what your users should be expecting, so that’s exactly what you need to have identified and clearly documented at the beginning stages.

SRE looks at it a little bit differently. It’s not necessarily looking first at that UAC and then at how to move this thing through all of these different points. It’s looking more at: I’ve got this thing in production, and I have users that are using it. How can I make this better? How can I make this more resilient? How can I make this more acceptable to my users? That backward-looking perspective is really the key difference. You’re using many of the same metrics and many of the same tools. But while the DevOps group uses these automation tool sets, the SRE team is using metrics and reporting tool sets to recognize a problem, which then allows them to go back in and use many of the same tools that the developers use, from IDEs to the QA environments and things like that.

One of the things I want to differentiate here is the shift left concept. When [unclear] came out, there was this enormous push for everything to shift left. And shift left is great: the sooner you can identify a problem and the earlier in the process you can mitigate a risk, the lower the cost of getting past it. But you can’t just push everything left and then say, "Okay, great, we pushed these things left, and now we’re good." You have to look at the end result. You have to look at how that end result is impacting users, how it’s impacting the company, and how it’s impacting all of the different components that go into making a software or technology component operational, in that it has business value, because that’s pretty much the definition of operational. So you have to look at not just the left-to-right arrow that you see with DevOps, but the right-to-left arrow.

A great example of this is a bug. Any bug could have, and this is a bit simplistic, three potential causes. It could have a testing problem, where maybe what it needed to do wasn’t tested appropriately. It could have a development problem, in that it was not coded correctly. Actually, I should say there are four, because it could have a production problem, where it was pushed to production without aligning to the operational systems.
But then it could also have a product management problem, in that the UAC and the story were not defined to mirror the user’s expectations. Helping to define where that bug lives, where the root cause of that bug lives, is one of the keys that SRE brings to the table. It’s not something that DevOps is really going to be able to help with, but it is something that SRE can help with. And so as you look at these two groups in conjunction, you can see how additive they are, how much they contribute and collaborate together to make an overall improved process.

So as I mentioned, there are lots of metrics that we gather these days. Especially with the move to cloud, the move to infrastructure as code, the move to really everything as code, there are so many metrics and so much data we can get. I’ve got a whole bunch of quotes up here that are across the board, from business to things that I’ve had said to me and things that I’ve said to teams. You know: "Aren’t we measuring that? Wait a minute, you said there’s a problem here. Aren’t we measuring that?" You have to look at all of these different metrics as feeders into how you approach things and how you look at resolving problems. Each of these is a key component in coming to the root cause of any one thing that might happen.

There are a couple of things I want to point out here. One of those is on the top right, which is "Who is actioning that metric?", because actioning a metric is just as important as capturing a metric. The saying is that what’s measured is improved, but I would say that you can capture, but you have to action. If you capture without action, you’re never going to improve.
The other one I really want to point to is right at the bottom, and this is attributed to Einstein. I wasn’t there when he said it; maybe he said it just this way, maybe he said it differently. But: not everything that counts can be counted, and not everything that can be counted counts. That second half is the one I want to focus on, because not everything that can be counted counts when it comes to metrics. Metrics are great, and we want metrics and we want data. But unless you have a reason to capture a metric, unless you understand its impact and how it translates into resiliency and the responsiveness of systems, there’s no real reason to capture it, because, in essence, it doesn’t count.

So capturing all these metrics is great, but it feeds into some of the fears, and I’ll call them desires, of these teams. This comes from 100 different surveys that were compiled together and normalized. The two keys that really pop out are unknowns and false positives. Job definition is in there too, and it’s probably, in many ways, a thing we all contend with. But unknowns and false positives relate directly back to those metrics. Because, again, if you are capturing something but you don’t know why you’re capturing it, or you don’t have clear color associated with that metric, you’re not going to be able to action that metric effectively and resolve it.

One of the key pieces with SRE is that the unknowns can be killers. What’s out there that we don’t know? We don’t know what we don’t know. That gets said a lot, but there’s a corollary to it, which is: we don’t know if we really know what we think we know. That seems convoluted, but it goes to the observability and the telemetry that we have now, and the deluge of data that we get on a regular basis. We have to take that data and make sure that it applies to the resiliency, recovery, and responsiveness questions that we need to be asking about our software. The SRE group is the one that asks those questions and feeds the answers back. And the DevOps group is the one that helps to take those answers and put them into effect.


So SRE can ask the question, come up with something, and provide that "how" back up the chain. But then you have the automation, the monitoring, and the processes that DevOps puts into place to ensure that it flows through, enabling that shift right that we talked about earlier. Eliminating that fear of the unknown is a critical component of what SRE can do. And they don’t just do it for the internal teams. I’ve talked a lot about the internal teams, but SRE does it for the users too. The worst is, "We have a problem, and we don’t know why." Users love that answer. That’s their favorite answer, right? Everybody knows that. No, it’s not; they hate that answer. "I want to know why." Everybody wants to know why these days. There is a consistent move toward the idea that almost everything has an answer of some sort for why it occurred, and everybody wants to know that answer. Those unknowns can be addressed, especially with this team-up of DevOps and SRE.

So how can you do this? How can you take this forward, where you have this telemetry, you have these metrics, and you have associated action? This is something that I’ve used in the past; I’ve used it within my current company, and I’ve used it in other companies. It’s a matrix. We all have The Matrix, right? The Matrix is great. Total hint on the other part of this talk right there. But the key is, you have to have responsibility and accountability for a metric. If you’re just capturing a metric, but you don’t have clear accountability and responsibility for that metric, it will never get actioned, because the action is a combination. It’s not just what the metric is and how we are going to fix it; it’s who is responsible for actually coming up with that. And that is where you come in.
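As a minimal sketch of the responsibility matrix idea described above, the Python below records one responsible team per metric plus any participating teams, and flags metrics that are being captured with no responsible team. The metric names echo ones mentioned in this talk, but the team assignments, the `MetricOwnership` type, and the helper functions are hypothetical illustrations, not the matrix from the slide.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class MetricOwnership:
    """One row of a (hypothetical) metrics responsibility matrix."""
    metric: str
    responsible: str                    # the one team that creates and implements the fix
    participants: Tuple[str, ...] = ()  # teams involved in identification or resolution


# Assumed example rows; a real organization would define its own.
MATRIX = [
    MetricOwnership("mean time to resolution", "SRE", ("DevOps", "Development")),
    MetricOwnership("recovery point objective", "DevOps", ("SRE",)),
    MetricOwnership("user experience", "SRE", ("Product Management", "Development")),
]


def responsible_for(metric: str, matrix: Iterable[MetricOwnership]) -> Optional[str]:
    """Return the single responsible team for a metric, or None if nobody owns it."""
    for row in matrix:
        if row.metric == metric and row.responsible:
            return row.responsible
    return None


def captured_without_owner(captured: Iterable[str],
                           matrix: Iterable[MetricOwnership]) -> List[str]:
    """List metrics that are captured but have no responsible team to action them."""
    rows = list(matrix)
    return sorted(m for m in captured if responsible_for(m, rows) is None)
```

Calling `captured_without_owner` with the full list of metrics a team collects gives a quick view of what is being captured without anyone accountable for acting on it.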
And the matrix on this slide is a sample. This doesn’t mean it applies to every organization; it might not apply directly to yours. There might be other metrics that need to be in here, and there might be different groups that would be involved. Product management isn’t on here, even though they are participating, because I ran out of room on the slide and it was getting too small. So this can go out to the right, this can go down farther, and there are a number of things that could be added. But it gives you an idea. It’s an example set that you can look at and say: there are groups that are going to have to participate and be involved in resolution, and even identification, but you still have to have that one core team that is responsible for creating and implementing the fix as the problem is recognized. That’s the critical point of the matrix: to show that responsibility set and make sure everybody’s clear on it.

There’s one here at the bottom: user experience. I want to talk to that a little bit, because it’s something that we don’t think of as a quantitative measure in many cases. But in the end, it is. It may not be as quantitative as, for example, mean time to resolution or recovery point objective, but it feeds into a critical component of what SRE brings to the table apart from these more quantitative metrics. The key is that any system, to have business value, has to have users. Those users could be another system, but in the end there’s a flow up to where there’s a person who is the final recipient of the system’s output, whatever that may be. SRE is the group that can really help to bring in that user experience. And this is where you can now start to roll up some of these metrics. We saw the metrics matrix there, but you can start to roll those up and assign values to them.
Those values are service level indicators and service level objectives. I think we’ve all talked till we’re blue in the face about KPIs, key performance indicators, and those are great. But SLIs and SLOs are things that relate directly to the users themselves, and to how the system functions in relation to the business activity it is designed to support. KPIs tend to be quantitative measures in many fields: they might be financial, they might be system based. There are a lot of different types of KPIs out there, and they can be used across multiple departments, not just here. They’re important because you want to understand whether you’re hitting your KPIs, and whether something is off the rails or looks like it’s trending off the rails. But service level indicators and service level objectives let you see if you are servicing your customers, your users, effectively. And that is a key component of any system.

There’s a great quote from Pearl Zhu up at the top. The story point is the key: not story points as in "how long is this going to take me to build", but the point that there are stories behind these numbers. They have external impacts and internal impacts, there are impacts to these metrics and measures, and we have to pay attention to them. We have to track them to understand what our users’ sentiment is. Because make no mistake: you can have the most performant system, one of the best systems, beautifully designed, wonderfully engineered, and if the users just don’t like it, it fails in its ability to have impact and business value within the organization. Users will have a natural tendency to avoid utilization, a natural tendency to pick on small problems, and a natural tendency toward avoidance of that system, whatever it is and whatever the reason. Sentiment is real. Emotion is real. And it’s part of a system’s story. That’s where SRE can really bring a new perspective, one that can be fed back up the chain and influence our engineering decisions and our approaches to gaining user acceptance beyond just the criteria that are there.
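To make the SLI and SLO distinction concrete, here is a minimal sketch, assuming an availability-style indicator measured as the fraction of user requests that succeeded. The function names, the request counts, and the 99% target are assumptions for illustration; the talk does not prescribe a specific formula.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Service level indicator: the measured fraction of requests that succeeded."""
    if total_requests < 0 or good_requests < 0 or good_requests > total_requests:
        raise ValueError("request counts must satisfy 0 <= good <= total")
    if total_requests == 0:
        # Assumption: with no traffic in the window, treat the service as fully available.
        return 1.0
    return good_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """Service level objective: the target the indicator is expected to meet or exceed."""
    return sli >= slo_target


# Hypothetical measurement window: 995 of 1,000 requests succeeded, target is 99%.
sli = availability_sli(good_requests=995, total_requests=1000)
print(f"SLI = {sli:.3f}, meets 99% SLO: {meets_slo(sli, 0.99)}")
```

The indicator is what was measured from the user's side of the system; the objective is the level that was agreed to be good enough. A KPI could sit alongside these without saying anything about whether users were actually served well.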
What matters is not just the user acceptance criteria, but actual user acceptance of the system. And I would say even user excitement. Because think about what happens if you don’t measure user excitement and only manage user acceptance. How often have we said, "Oh, well, I’ll accept that"? You say it in such a way that you’re not excited. When you tell somebody, "Oh, I’ll accept that," you’re not conveying a sense of joy or wonder. Whereas when you say, "I’m excited about using that," that is a different story. That is when the users get in and give you the best feedback. They’re working with you collaboratively, and they’re helping to move the system forward in measurable and valuable ways for the organization. That’s what we have to do, and that’s how we get to trustworthy systems. I’ll talk about that here in the next slide.

So we’ve got all this done. We’ve got our beautiful metrics matrix, we have SRE and DevOps, and they’re working together. They’re sitting around campfires singing songs together, somebody picked up the guitar, and it’s a beautiful, beautiful sight. And there are still actions that each performs. DevOps is going to help create the standards for the movement, and they’re going to implement the measures. In many cases, they’re going to eliminate engineering silos. Not that SRE doesn’t do that as well; they bring in some other groups and do it a little bit differently. And DevOps is going to enable continuous deployment.

SRE is going to action. They are going to roll these metrics up and create these SLIs and SLOs. Their job is not to be the dumping ground for problems. They should be recognizing problems, they should be looking for problems, and they should be trying to find those problems before anybody else does. But it’s about firebreaks, not putting out fires. That’s the key. They’re looking to build things up and do it proactively, as opposed to reactively: bug fix, bug fix, bug fix. And they work to uncover those opportunities. They’re working with the users to ask: what are the things that we could do to make this more resilient? What needs to be more responsive? What needs to be more recoverable when there is a problem?
Because make no mistake, problems are going to happen. Anybody who tells you different is selling something.

What do they do together? Again, they break down engineering, operational, and user silos together. DevOps breaks down those engineering silos, SRE breaks down operational and user silos, and together they bring all of those groups together. And then the key is the perspective on trustworthy systems: something a user can trust. Now, trust is a very feely term, right? It’s not necessarily measurable. You can’t say, "What’s your exact measure of trust in a system?" But it’s something that’s real, something that’s tangible. You can feel it when there’s distrust. And this is what SRE and DevOps can do together.

Alright, so that is the end of this. I hope everybody’s been keeping track of our side game. I’m going to go to the next slide, and you get to score yourselves on whether you kept up with all of the references to pop culture pieces that were within it. If you did, here you go. We had five movies and one musical reference, and that was the last slide. I hope everybody got that one; I tried to make the title so everybody would get it. We had some TV commercials, and we had written word. I will take feedback on the fact that I used a lot more movie references than written word, and that is a reflection of my consumption of media. But regardless of that, I hope everyone had a good time and hopefully had some fun learnings from the conversation. And I look forward to talking with y’all at another conference. Thanks so much.

