Blog: Spotlight on SIG Architecture: Production Readiness

Author: Frederico Muñoz (SAS Institute)

This is the second interview of a SIG Architecture Spotlight series that will cover the different
subprojects. In this blog, we will cover the SIG Architecture: Production Readiness
subproject
.

In this SIG Architecture spotlight, we talked with Wojciech Tyczynski
(Google), lead of the Production Readiness subproject.

About SIG Architecture and the Production Readiness subproject

Frederico (FSM): Hello Wojciech, could you tell us a bit about yourself, your role and how you
got involved in Kubernetes?

Wojciech Tyczynski (WT): I started contributing to Kubernetes in January 2015. At that time,
Google (where I was and still am working) decided to start a Kubernetes team in the Warsaw office
(in addition to already existing teams in California and Seattle). I was lucky enough to be one of
the seeding engineers for that team.

After two months of onboarding and helping with different tasks across the project towards 1.0
launch, I took ownership of the scalability area and I was leading Kubernetes to support clusters
with 5000 nodes. I’m still involved in SIG Scalability
as its Technical Lead. That was the start of a journey since scalability is such a cross-cutting topic,
and I started contributing to many other areas including, over time, to SIG Architecture.

FSM: In SIG Architecture, why specifically the Production Readiness subproject? Was it something
you had in mind from the start, or was it an unexpected consequence of your initial involvement in
scalability?

WT: After reaching that milestone of Kubernetes supporting 5000-node clusters,
one of the goals was to ensure that Kubernetes would not degrade its scalability properties over time. While
non-scalable implementation is always fixable, designing non-scalable APIs or contracts is
problematic. I was looking for a way to ensure that people are thinking about
scalability when they create new features and capabilities without introducing too much overhead.

This is when I joined forces with John Belamaric and
David Eads and created a Production Readiness subproject within SIG
Architecture. While setting the bar for scalability was only one of a few motivations for it, it
ended up fitting quite well. At the same time, I was already involved in the overall reliability of
the system internally, so other goals of Production Readiness were also close to my heart.

FSM: To anyone new to how SIG Architecture works, how would you describe the main goals and
areas of intervention of the Production Readiness subproject?

WT: The goal of the Production Readiness subproject is to ensure that any feature that is added
to Kubernetes can be reliably used in production clusters. This primarily means that those features
are observable, scalable, supportable, can always be safely enabled and in case of production issues
also disabled.

Production readiness and the Kubernetes project

FSM: Architectural consistency being one of the goals of the SIG, is this made more challenging
by the distributed and open nature of Kubernetes?
Do you feel this impacts the approach that Production Readiness has to take?

WT: The distributed nature of Kubernetes certainly impacts Production Readiness, because it
makes thinking about aspects like enablement/disablement or scalability more challenging. To be more
precise, when enabling or disabling features that span multiple components you need to think about
version skew between them and design for it. For scalability, changes in one component may actually
result in problems for a completely different one, so it requires a good understanding of the whole
system, not just individual components. But it’s also what makes this project so interesting.

FSM: Those running Kubernetes in production will have their own perspective on things, how do
you capture this feedback?

WT: Fortunately, we aren’t talking about “them” here, we’re talking about “us”: all of us are
working for companies that are managing large fleets of Kubernetes clusters and we’re involved in
that too, so we suffer from those problems ourselves.

So while we’re trying to get feedback (our annual PRR survey is very important for us), it rarely
reveals completely new problems – it rather shows the scale of them. And we try to react to it –
changes like “Beta APIs off by default” happen in reaction to the data that we observe.

FSM: On the topic of reaction, that made me think of how the Kubernetes Enhancement Proposal (KEP)
template has a Production Readiness Review (PRR) section, which is tied to the graduation
process. Was this something born out of identified insufficiencies? How would you describe the
results?

WT: As mentioned above, the overall goal of the Production Readiness subproject is to ensure
that every newly added feature can be reliably used in production. It’s not possible to enforce that
by a central team – we need to make it everyone’s problem.

To achieve it, we wanted to ensure that everyone designing their new feature is thinking about safe
enablement, scalability, observability, supportability, etc. from the very beginning. Which means
not when the implementation starts, but rather during the design. Given that KEPs are effectively
Kubernetes design docs, making it part of the KEP template was the way to achieve the goal.

FSM: So, in a way making sure that feature owners have thought about the implications of their
proposal.

WT: Exactly. We already observed that just by forcing feature owners to think through the PRR
aspects (via forcing them to fill in the PRR questionnaire) many of the original issues are going
away. Sure – as PRR approvers we’re still catching gaps, but even the initial versions of KEPs are
better now than they used to be a couple of years ago in what concerns thinking about
productionisation aspects, which is exactly what we wanted to achieve – spreading the culture of
thinking about reliability in its widest possible meaning.

FSM: We’ve been talking about the PRR process, could you describe it for our readers?

WT: The PRR process
is fairly simple – we just want to ensure that you think through the productionisation aspects of
your feature early enough. If you do your job, it’s just a matter of answering some questions in the
KEP template and getting approval from a PRR approver (in addition to regular SIG approval). If you
didn’t think about those aspects earlier, it may require spending more time and potentially revising
some decisions, but that’s exactly what we need to make the Kubernetes project reliable.

Helping with Production Readiness

FSM: Production Readiness seems to be one area where a good deal of prior exposure is required
in order to be an effective contributor. Are there also ways for someone newer to the project to
contribute?

WT: PRR approvers have to have a deep understanding of the whole Kubernetes project to catch
potential issues. Kubernetes is such a large project now with so many nuances that people who are
new to the project can simply miss the context, no matter how senior they are.

That said, there are many ways that you may implicitly help. Increasing the reliability of
particular areas of the project by improving its observability and debuggability, increasing test
coverage, and building new kinds of tests (upgrade, downgrade, chaos, etc.) will help us a lot. Note
that the PRR subproject is focused on keeping the bar at the design level, but we should also care
equally about the implementation. For that, we’re relying on individual SIGs and code approvers, so
having people there who are aware of productionisation aspects, and who deeply care about it, will
help the project a lot.

FSM: Thank you! Any final comments you would like to share with our readers?

WT: I would like to highlight and thank all contributors for their cooperation. While the PRR
adds some additional work for them, we see that people care about it, and what’s even more
encouraging is that with every release the quality of the answers improves, and questions “do I
really need a metric reflecting if my feature works” or “is downgrade really that important” don’t
really appear anymore.

Originally posted on Kubernetes – Production-Grade Container Orchestration
Author:

Deja una respuesta

Tu dirección de correo electrónico no será publicada. Los campos obligatorios están marcados con *