Fast vs Easy: Benchmarking Ansible Operators for Kubernetes

With Kubernetes, you get a lot of powerful functionality that makes it relatively easy to manage and scale simple applications and API services right out of the box. These simple apps are generally stateless, so Kubernetes can deploy, scale, and recover them from failures without any application-specific knowledge. But what if Kubernetes’ native capabilities are not enough?

Operators in Kubernetes and Red Hat OpenShift clusters are a common means for controlling the complete application lifecycle (deployment, updates, and integrations) for complex container-native deployments.

Initially, building and maintaining an Operator required deep knowledge of Kubernetes’ internals. Operators were usually written in Go, the same language as Kubernetes itself.

The Operator SDK, which is a Cloud Native Computing Foundation (CNCF) incubator project, makes managing Operators much easier by providing the tools to build, test, and package Operators. The SDK currently incorporates three options for building an Operator:

  • Go
  • Ansible
  • Helm

Go-based Operators are the most customizable, since you’re working close to the underlying Kubernetes APIs with a full programming language. But they are also the most complex, because the plumbing is directly exposed. You have to know both Go and Kubernetes internals to maintain these Operators.

Ansible-based Operators are also very customizable, but they use Ansible’s declarative YAML syntax and abstraction layers, which makes them easier to build and maintain.

Helm-based Operators are a little less powerful for complete application lifecycle management, but like Ansible, they are easier for people who aren’t already familiar with Go to maintain, because they use Helm chart syntax.

In the chart below, we see how Operators that are built using Helm are only capable of managing installation and upgrades, while Ansible and Go are capable of managing a system’s entire lifecycle.

[Chart: lifecycle phases managed by Helm, Ansible, and Go Operators]

All three Operator types are fully supported and are being used successfully in production, but the question we want to answer is this:

Each Operator type makes a different tradeoff between ease of use and performance, but can we quantify the differences?

We already know that Ansible and Helm-based Operators, as interpreted systems, will have more overhead than a compiled Go Operator, but how much of a difference is there? And are there ways to minimize the impact, so you can keep the ease of maintenance of a higher-level Operator while paying a smaller performance penalty?

 

Benchmarking Operators

Benchmarking is always difficult, especially when comparing different architectures. But in this case, we are helped by the fact that there are official example Operators for each of the Operator SDK’s three Operator types. The examples all manage Memcached instances in a cluster.

Even though the three test Operators do the same thing, they each have slight differences (e.g. using a different Memcached image or version), so there could still be a little variance based on factors outside of our control.

That said, the code used to benchmark the three Operator types is available as open source on GitHub, in the Operator SDK Performance Testing repository, and is built to run on a local computer, in GitHub Actions CI, or on an Amazon EC2 instance.

The benchmark does the following:

  1. Builds a fresh Kubernetes cluster using Molecule and Kind.
  2. Installs two required dependencies: cert-manager and prometheus-operator.
  3. Then, for each Operator type, it:
    1. Builds the Operator.
    2. Deploys 15 instances of Memcached into the cluster.
    3. Times how long it takes for all 15 instances to be running.
    4. Tears down the 15 instances and the Operator.
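
The timing step can be approximated with a small polling loop. The following is only a sketch of the idea, not the code from the benchmark repository: the function name time_until_count, the TIMEOUT_SECONDS variable, and the app=memcached label in the commented usage are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not from the benchmark repository): run a command
# once per second until it prints the expected count, then print the
# elapsed whole seconds. Gives up after TIMEOUT_SECONDS (default 600).
time_until_count() {
  expected="$1"; shift
  timeout="${TIMEOUT_SECONDS:-600}"
  start=$(date +%s)
  while [ "$("$@")" != "$expected" ]; do
    now=$(date +%s)
    if [ $((now - start)) -ge "$timeout" ]; then
      echo "timed out after ${timeout}s" >&2
      return 1
    fi
    sleep 1
  done
  echo $(( $(date +%s) - start ))
}

# Example use against a cluster. This assumes kubectl is configured and that
# the Memcached pods carry an app=memcached label (an assumed label):
#
# count_running() {
#   kubectl get pods -l app=memcached \
#     --field-selector=status.phase=Running --no-headers 2>/dev/null \
#     | wc -l | tr -d ' '
# }
# time_until_count 15 count_running
```

Polling once per second limits the resolution to whole seconds, which is coarse but usually enough to compare Operator types whose deployment times differ by many seconds.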

The first time I ran the benchmark, I was surprised to see Ansible lagging far behind Helm and Go when it deployed or updated the 15 Memcached instances.

[Chart: initial benchmark results for Go, Helm, and Ansible Operators]

I had expected Go to be fastest (and it was), but I also expected Helm and Ansible to be closer in performance. After spending some time debugging Operator SDK’s Ansible integration, I found that Ansible’s `gather_facts` option was running for Operators using roles, and that this was causing at least a 30% performance regression.

Later versions of Operator SDK fixed that problem, and with the fix applied, Ansible’s performance was much closer to Helm’s:

[Chart: benchmark results with the gather_facts fix applied]

Improving Operator Performance

There are other performance improvements we are working towards, including making calls to Ansible’s k8s modules much faster by caching the Kubernetes API connection, and speeding up YAML file handling. In the meantime, here are a few tricks our team learned when benchmarking:

  1. Wherever possible, fold multiple operations (e.g. in a with_items loop) into one task. If you want to load five different Kubernetes resources, chain them together into one file instead of looping over five separate files.
  2. If your Operator doesn’t need to worry about “dependent resources” (e.g. changes to Pods that are created as a result of a Deployment managed by the Operator), you can set watchDependentResources to false in the Operator’s watches.yaml file. This saves Ansible from having to run a reconciliation loop whenever dependent resources change.
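
As a rough illustration of the first tip, several single-resource manifests can be joined into one multi-document YAML file ahead of time, so that a single task loads them all. This sketch assumes each input file holds one YAML document, ends with a newline, and does not begin with its own --- separator; the function name combine_manifests and the file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the given manifest files as one
# multi-document YAML stream, separated by '---' lines.
combine_manifests() {
  first=1
  for f in "$@"; do
    # Emit a document separator before every file except the first.
    if [ "$first" -eq 0 ]; then
      echo '---'
    fi
    cat "$f"
    first=0
  done
}

# Example (file names are placeholders):
# combine_manifests configmap.yaml service.yaml deployment.yaml > all-resources.yaml
```

The combined file can then be handed to a single k8s task in place of a loop over the individual files.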

Another helpful thing you can do inside your own Ansible-based Operator is to add Ansible’s profile_tasks callback, which outputs task execution times in the playbook output. To do this, you’ll need to add callback_whitelist = profile_tasks to the operator’s ansible.cfg file. Since Operators are essentially containers, the best way to handle this is by modifying the Dockerfile used to build the Operator to append this directive to the default configuration. For example:

echo "callback_whitelist = profile_tasks" >> ${HOME}/.ansible.cfg

(This assumes you are already writing an .ansible.cfg file to the user home directory in your Operator.)
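
When the base image’s configuration might already contain the setting, a guarded append avoids writing a duplicate line. This is a minimal sketch, assuming a hypothetical helper named ensure_line. Note also that newer Ansible releases rename this option to callbacks_enabled, so check the documentation for the Ansible version in your Operator image.

```shell
#!/bin/sh
# Hypothetical helper: append a line to a file only if that exact line
# is not already present. Creates the file if it does not exist.
ensure_line() {
  line="$1"
  file="$2"
  # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
  grep -qxF -- "$line" "$file" 2>/dev/null || echo "$line" >> "$file"
}

# Example, mirroring the directive above:
# ensure_line "callback_whitelist = profile_tasks" "${HOME}/.ansible.cfg"
```

As with the plain echo, the line is appended at the end of the file, so the file’s last section needs to be the [defaults] section for the setting to take effect.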

For one example of a fairly radical speedup that was achieved by testing different changes and optimizing an Ansible operator’s performance, check out this issue in the Kiali project’s issue queue: Analyze performance of the Operator and see if we can speed it up. In Kiali’s case, they were able to make their Operator four times faster with a few small changes!

We’re still working on making the benchmarks more robust and on finding new ways to make your Ansible Playbooks for Kubernetes faster (whether or not they are used in the context of an Operator), but we are happy with the progress we are able to share so far.

 

Conclusion

Ansible Operators make maintaining applications in Kubernetes easier, with a manageable performance penalty over other solutions that are less flexible or require advanced programming knowledge and expertise.

If you’re interested in learning more about Ansible Operators, check out the Ansible Operator Katacoda course and the following resources:

Ansible Blog | Ansible.com | Kubernetes Operators with Ansible

Originally posted on Ansible Blog
Author: Jeff Geerling
