Project management à la SRE: How to juggle the needs of your project and production

You’ve probably felt the frustration that arises when a project fails to meet established deadlines. And perhaps you’ve also encountered scenarios where project staff or computing have been reallocated to higher priority projects. It can be super challenging to get projects done on time with this kind of uncertainty. 

That’s especially true for Site Reliability Engineering (SRE) teams. Project management principles can help, but in IT, many project management frameworks are directed at teams that have a single focus, such as a software-development team. 

That’s not true for SRE teams at Google. They are charged with delivering infrastructure projects as well as their primary role: supporting production. Broadly speaking, SRE time is divided in half between supporting production environments and focusing on product. 

A common problem

In a recent endeavor, our SRE team took on a project to regionalize our infrastructure to enhance the reliability, security, and compliance of our cloud services. The project was allocated a well-defined timeline, driven by our commitments to our customers and adherence to local regulations. As the technical program manager (TPM), I decomposed the overarching goal into smaller milestones and communicated to the leadership team to ensure they remained abreast of the progress.

However, throughout the execution phase of the project, we encountered a multitude of unrelated production incidents — the Spanner queue was growing long, and the accumulation of messages led to increased compilation times for our developer builds; this in turn led to bad builds rolling out. On top of this, asynchronous tasks were not completing as expected. When the bad build was rolled back, all of the backlogged async tasks fired at once. Due to these unforeseen challenges, some engineers were temporarily reassigned from the regionalization project to handle operational constraints associated with production infrastructure. No surprise, the change in staff allocation towards production incidents resulted in the project work being delayed. 

Better planning with SRE

Teams that manage production services, like SRE, have many ways to solve tough problems. The secret is to choose the solution that gets the job done the fastest and with the least amount of red tape for engineers to deal with.

In our organization, we’ve started taking a proactive approach to problem-solving by incorporating enhanced planning at the project’s inception. As a TPM, my biggest trick to ensuring projects are finished on time is keeping some engineering hours in reserve and planning carefully when the project should start.

How many resources should you hold back, exactly? We did a deep dive into our past production issues and how we’ve been using our resources. Based on this, when planning SRE projects, we set aside 25% of our time for production work. Of course, this 25% buffer number will differ across organizations, but this new approach, which takes into account our critical business needs, has been a game-changer for us in making sure our projects are delivered on time, while ensuring that SREs can still focus on production incidents — our top priority for the business.

Key takeaways

In a nutshell, planning for SRE projects is different from planning for projects in development organizations, because development organizations spend the lion’s share of their time working on projects. Luckily, SRE Program Management is really good at handling complicated situations, especially big programs. 

Beyond holding back resources, here are few other best practices and structures that TPMs employ when planning SRE projects:

Ensuring that critical programs are staffed for success

Providing opportunities for TPMs to work across services, cross pollinating with standardized solutions and avoiding duplication of work

Providing more education to Site Reliability Managers and SREs on the value of early TPM engagement and encourage services to surface problem statements earlier

Leveraging the skills of TPMs to manage external dependencies and interface with other partner organizations such as Engineering, Infrastructure Change Management, and Technical Infrastructure

Providing coverage at times of need for services with otherwise low program management demands

Enabling consistent performance evaluation and provide opportunities for career development for the TPM community

The TPM role within SRE is at the heart of fulfilling SRE’s mission: making workflows faster, more reliable, and preparing for the continued growth of Google’s infrastructure. As a TPM, you need to ensure that systems and services are carefully planned and deployed, taking into account multiple variables such as price, availability, and scheduling, while always keeping the bigger picture in mind. To learn more about project management for TPMs and related roles, consider enrolling in this course, and check out the following resources:

Program Management Practices

The Evolving SRE Engagement Model

Part III. Practices

Read More