Performance Improvements in Automation Controller 4.1

Red Hat Ansible Automation Platform 2 is the next generation automation platform from Red Hat’s trusted enterprise technology experts. With the release of Ansible Automation Platform 2.1, users now have access to the latest control plane – automation controller 4.1.

Automation controller helps standardize how automation is deployed, initiated, delegated, and audited, allowing enterprises to automate with confidence while reducing sprawl and variance. Users can manage inventory, launch and schedule workflows, track changes, and integrate into reporting, all from a centralized user interface and RESTful API.

Automation controller 4.1 provides significant performance improvements over its predecessor, Ansible Tower 3.8. To put this into context, we ran jobs on Ansible Tower 3.8, captured various metrics while the jobs were running and after they finished, and compared the results with automation controller 4.1. This post highlights the most significant improvements.

Benchmark framework

To take a deep dive into the performance enhancements in the latest automation controller, we on the performance engineering team at Red Hat created a benchmarking framework consisting of the following workflow:

  • Installation of RHEL 8.3 virtual machines with 4 CPUs and 16 GB of RAM, deployed in IBM Cloud
  • Installation of Ansible Automation Platform 2.1
  • Installation of various exporters to monitor the systems
  • Running the performance benchmarks
  • Capturing the results
  • Tearing down the cluster once complete

This entire workflow is automated with Jenkins jobs and various Ansible Playbooks. Prometheus records real-time metrics from the cluster, Elasticsearch serves as the analytics engine, and Grafana is used to monitor system metrics.

The performance benchmark consists of:

  • Creating an inventory with 100 hosts
  • Creating a job template using the chatty_tasks.yml

The job prints a specified number of debug messages for each host in the created inventory. The chatty_tasks playbook uses the lightweight debug module to print the messages.

This ensures that the playbook itself adds little overhead on the resources, so the numbers reflect the actual performance of automation controller. The performance benchmarking job template is run with a concurrency of 10, i.e. 10 invocations of the job template run at the same time. Each run generates approximately 10,000 job events, and we run five consecutive batches. While the jobs are running, we capture CPU, RAM, and network utilization, among other system metrics, on both the automation controller and the database nodes. We also capture the Redis queue length and the insertion rate of job events into the database. After the jobs have finished, we calculate the average job run time, the job events processing time, and the job events lag.
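
The post-run calculation can be sketched roughly as follows. This is a minimal illustration, not the framework's actual code: the `JobEvent` and `JobRun` types, their field names, and the exact definitions of "processing time" and "lag" are assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class JobEvent:
    created: float  # when the playbook run emitted the event (seconds)
    saved: float    # when the event was inserted into the database (seconds)


@dataclass
class JobRun:
    started: float
    finished: float
    events: list  # list of JobEvent


def job_duration(job):
    return job.finished - job.started


def events_processing_time(job):
    # Assumed definition: from job start until the last event is in the database.
    return max(e.saved for e in job.events) - job.started


def events_lag(job):
    # Assumed definition: mean delay between emitting and saving an event.
    return mean(e.saved - e.created for e in job.events)


def summarize(jobs):
    """Average the three per-job metrics over a batch of job runs."""
    return {
        "avg_job_duration": mean(job_duration(j) for j in jobs),
        "avg_events_processing_time": mean(events_processing_time(j) for j in jobs),
        "avg_events_lag": mean(events_lag(j) for j in jobs),
    }
```

With definitions like these, a processing time well above the job duration, or a growing lag, would indicate that events are being produced faster than they are written to the database.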

Automation controller performance gains

Automation controller 4.1 shows significant performance improvements in several areas. Some of the major improvements include:

  • Average job duration decreased by ~22% 
  • Job events processing time decreased by ~23% 
  • Cleanup job runtime decreased by ~98%
  • Gather analytics runtime decreased by ~60%
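
To relate these percentages to "how many times faster", the arithmetic can be sketched as below. The helper names are hypothetical, and the only inputs are the approximate percentages listed above; no raw timings from the benchmark are assumed.

```python
def percent_decrease(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - improved) / baseline


def speedup_from_decrease(pct):
    """Speedup factor implied by a runtime decrease of `pct` percent."""
    if not 0 <= pct < 100:
        raise ValueError("pct must be in [0, 100)")
    return 100.0 / (100.0 - pct)


if __name__ == "__main__":
    # Approximate percentages reported in this post; factors are derived from them.
    reported = [
        ("average job duration", 22),
        ("job events processing time", 23),
        ("cleanup job runtime", 98),
        ("gather analytics runtime", 60),
    ]
    for name, pct in reported:
        factor = speedup_from_decrease(pct)
        print(f"{name}: ~{pct}% lower, roughly {factor:.2f}x faster")
```

For example, a runtime that is about 98% lower takes about one fiftieth of the original time, while a 60% decrease corresponds to a factor of 2.5.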

Job runtime

When comparing a single-node Ansible Tower 3.8 with an external database against a single hybrid node automation controller 4.1 with an external database, the average job duration decreased by about 22% and the job events processing time decreased by about 23%.

When an automation controller 4.1 cluster with one control node and one execution node was compared with an automation controller 4.1 cluster with a single hybrid node, the performance gain was even greater. Although it is not an exact comparison, the result for the cluster with a control node and an execution node highlights a major reason to use the separate control and execution planes introduced in automation controller 4.1. The average job duration decreased by about 38% and the job events processing time decreased by about 48%, as compared to Ansible Tower 3.8.

Cleanup job

The cleanup job assists in deleting old data from the controller, including system tracking information, tokens, job histories, and activity streams. You can use it if you have specific retention policies or need to decrease the storage used by your controller database. With automation controller 4.0, the job events table was horizontally partitioned. Prior to automation controller 4.0, it was a single table that kept growing as you ran more jobs. As the table grew, it caused various performance issues in Ansible Tower, one of which was long-running cleanup jobs. The introduction of horizontal partitioning significantly improved the cleanup job runtime. To measure this performance gain, we created an Ansible Tower 3.8 instance and an automation controller 4.1 instance. We then created a database with 1 billion job events, ran the cleanup jobs, and observed their runtime. The result was a 98% decrease in the cleanup job runtime.
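
Why partitioning helps cleanup can be illustrated with a toy in-memory model. This is only a sketch: it is not the controller's schema or cleanup code, the class names are hypothetical, and partitioning by creation hour is an assumption made for the example. The point is that a single table has to examine rows, while a partitioned table can discard whole partitions.

```python
from collections import defaultdict


class SingleTable:
    """All job events in one table: cleanup examines every row."""

    def __init__(self):
        self.rows = []  # (created_hour, payload)

    def insert(self, created_hour, payload):
        self.rows.append((created_hour, payload))

    def cleanup(self, cutoff_hour):
        examined = len(self.rows)
        self.rows = [r for r in self.rows if r[0] >= cutoff_hour]
        return examined  # units of work: rows examined

    def count(self):
        return len(self.rows)


class PartitionedTable:
    """Job events split by creation hour: cleanup drops whole partitions."""

    def __init__(self):
        self.partitions = defaultdict(list)  # created_hour -> payloads

    def insert(self, created_hour, payload):
        self.partitions[created_hour].append(payload)

    def cleanup(self, cutoff_hour):
        expired = [h for h in self.partitions if h < cutoff_hour]
        for h in expired:
            del self.partitions[h]
        return len(expired)  # units of work: partitions dropped

    def count(self):
        return sum(len(p) for p in self.partitions.values())
```

In this model, the work for the single table scales with the number of events, while the work for the partitioned table scales with the number of expired partitions. A real database has other costs (indexes, locking, vacuuming), so the sketch shows the shape of the improvement rather than its size.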

Red Hat Insights for Red Hat Ansible Automation Platform

The job events table partitioning also had a major performance impact on Automation Analytics. Automation controller 4.1 is able to gather these analytics much faster. We compared the gather analytics runtime in Ansible Tower 3.8 and automation controller 4.1 with different numbers of job events in the table. The gather analytics runtime improved by about 60%.

Disk IOPS

One important aspect of all these performance gains is how much added pressure they put on resources in terms of CPU, RAM, and disk IOPS. In automation controller 4.1, there is a significant increase in average disk IOPS as compared to Ansible Tower 3.8. In the results above, automation controller 4.1 runs jobs much faster and hence writes the job events to disk much faster as well. With the introduction of receptor, an overlay network intended to ease the distribution of work across a collection of execution nodes, those job events are written temporarily to disk on the control node, which is why the average disk IOPS increases.

Because of this, it is important to have high-performing disks to achieve the expected performance gains of automation controller. Per the storage volume guidance in the automation controller requirements, the minimum baseline is 1,500 IOPS.

Takeaways & where to go next

With the performance improvements available in automation controller 4.1, automation controller can run jobs and process the results much faster. This allows users to run more concurrent jobs and view the job output in the automation controller dashboard closer to real time, which leads to a smoother and overall better user experience. With the enhanced use of horizontal partitioning, the performance improvements in cleanup jobs help keep the automation controller database size under control without affecting its operation. With the better performance of gather analytics, automation controller is now able to send the analytics data faster, allowing users to see the results on Insights for Ansible Automation Platform sooner.

If you’re interested in detailed information on automation controller, then the automation controller documentation is a must-read. To download and install the latest version, please visit the automation controller installation guide. To view the release notes of recent automation controller releases, please visit release notes 4.1.0. If you are interested in more details about Ansible Automation Platform, be sure to check out our e-books.

Originally posted on Ansible Blog
Author: Nikhil Jain