Enhancing your scaling capability with automation controller 4.3

Red Hat Ansible Automation Platform 2 is the next-generation automation platform from Red Hat’s trusted enterprise technology experts. We are excited to announce that the Ansible Automation Platform 2.3 release includes automation controller 4.3.

In the previous blog, we saw that automation controller 4.1 provided significant performance improvements compared to Red Hat Ansible Tower 3.8. Automation controller 4.3 takes that one step further. In this post, we elaborate on an important change to the callback receiver workers in automation controller 4.3 and how it can impact performance.


Callback Receiver

The callback receiver is the process in charge of transforming the standard output of Ansible into serialized objects in the automation controller database. This enables reviewing and querying results from across all your infrastructure and automation. This process is both I/O- and CPU-intensive and warrants careful performance consideration.

Every control node in automation controller has a callback receiver process. It receives job events that result from Ansible jobs. Job events are JSON structures, created when Ansible calls the runner callback plugin hooks. This enables Ansible to capture the result of a playbook run. The job event data structures contain data from the parameters of the callback plugin hooks plus unique IDs that reference other job events. The following is an example job event:

 "event": "playbook_on_play_start",
            "counter": 2,
            "event_display": "Play Started (all)",
            "event_data": {
                "playbook": "chatty_tasks.yml",
                "playbook_uuid": "aca1b0da-f29c-4fcf-be35-1aa59a30a4e0",
                "play": "all",
                "play_uuid": "faacc0d4-457c-ac33-a7f4-00000000006a",
                "play_pattern": "all",
                "name": "all",
                "pattern": "all",
                "uuid": "faacc0d4-457c-ac33-a7f4-00000000006a",
                "guid": "a70eb73c9c2241e0995963a6dcd4b89b"
            },

These job events are pushed to a Redis queue and processed by the callback receiver. Each callback receiver has workers that process these job events and save them to the database. Prior to automation controller 4.3, each callback receiver had four workers by default to process job events, regardless of the size of the control node. For customers who vertically scale their control nodes, this could cause performance issues because the number of callback receiver workers did not scale with the capacity of the control node(s).
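To make the flow concrete, here is a minimal Python sketch of the pattern, not the automation controller implementation itself: a producer pushes JSON job events onto a Redis list, and a fixed pool of worker processes pops them and persists them. The queue name, the table schema, and the use of SQLite are assumptions made purely for illustration.

# Minimal sketch of the callback receiver pattern (not the automation
# controller implementation): worker processes pop JSON job events off a
# Redis list and persist them. Queue name and schema are invented here.
import json
import multiprocessing
import sqlite3

import redis  # third-party client: pip install redis

QUEUE = "job_events_demo"   # placeholder queue name
NUM_WORKERS = 4             # the pre-4.3 default worker count

def worker(worker_id: int) -> None:
    r = redis.Redis()
    db = sqlite3.connect(f"events_{worker_id}.db")
    db.execute(
        "CREATE TABLE IF NOT EXISTS job_events (counter INT, event TEXT, data TEXT)"
    )
    while True:
        item = r.blpop(QUEUE, timeout=5)  # block until an event arrives
        if item is None:
            break                         # queue drained; stop the demo worker
        event = json.loads(item[1])
        db.execute(
            "INSERT INTO job_events VALUES (?, ?, ?)",
            (event["counter"], event["event"], json.dumps(event["event_data"])),
        )
        db.commit()

if __name__ == "__main__":
    # Push a sample event so the demo has something to process.
    sample = {"counter": 2, "event": "playbook_on_play_start", "event_data": {"play": "all"}}
    redis.Redis().rpush(QUEUE, json.dumps(sample))

    procs = [multiprocessing.Process(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

The key point of the pattern is that the number of worker processes is a hard ceiling on how fast events can be drained, regardless of how quickly they are produced.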


Performance Issues

Large Ansible Automation Platform clusters generate a huge volume of job events when running at maximum capacity (the maximum allowed forks), i.e. running many jobs at once. Running job templates at higher verbosity generates even more job events. During our performance analysis, we noticed that when the volume of job events exceeded what the default four callback receiver workers could handle, events queued up in Redis waiting to be processed. Because Redis is an in-memory database, the growing backlog eventually exhausted the memory of the underlying control node, and the Redis processes were killed (OOM).
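If you suspect this backlog is building up, queue depth and Redis memory use can be inspected directly. Below is a rough, illustrative check using the redis-py client; the queue key name used by the callback receiver is an internal detail, so the key shown here is only a placeholder.

# Rough, illustrative health check for the callback receiver queue.
# The key name below is a placeholder; the real key used by the callback
# receiver is an internal implementation detail.
import redis

QUEUE = "callback_tasks"  # placeholder, not a documented key

r = redis.Redis()
backlog = r.llen(QUEUE)                             # events waiting to be processed
used_memory = r.info("memory")["used_memory_human"]

print(f"queued job events: {backlog}")
print(f"redis memory in use: {used_memory}")
# A backlog that keeps growing while memory climbs is the pattern that
# leads to the OOM condition described above.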


Solution

While versions of automation controller prior to 4.3 had the option of modifying the JOB_EVENT_WORKERS setting to increase the number of callback receiver workers beyond the default of four, it was not a well-known administrative setting. In automation controller 4.3, vertically scaling a control node not only increases its capacity to run jobs (which generate events), it also proportionally scales the number of callback receiver workers, so the output from those jobs is handled better and the host resources available to automation controller are fully utilized.

This is accomplished through enhancements to the traditional installer and the Red Hat OpenShift operator. For virtual machine and bare metal installations, the 4.3 installer sets the number of callback receiver workers equal to the number of CPUs. For example, if a VM control node has eight CPUs, the installer sets the number of callback receiver workers to eight.
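The rule itself is simple enough to express in a line of Python, shown here only as an illustration of the default, not as installer code:

# Illustration of the 4.3 default for VM/bare metal installs:
# callback receiver workers == number of CPUs on the control node.
import os

job_event_workers = os.cpu_count()  # e.g. 8 on an eight-CPU control node
print(f"JOB_EVENT_WORKERS would default to {job_event_workers}")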

For Red Hat OpenShift operator-based installs, the number of callback receiver workers is set to the task container's CPU limit when that limit is greater than four. Administrators can also set the number of callback receiver workers manually by defining the JOB_EVENT_WORKERS property in a custom settings file, as sketched below. For more information on making this modification manually, see the performance tuning guide.
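As a sketch, a manual override might look like the following drop-in settings file; the file name and value are examples only, and the exact location depends on your installation (custom settings commonly live under /etc/tower/conf.d/). After changing a custom setting, the controller services need to be restarted before it takes effect.

# Example drop-in settings file (path and file name are illustrative):
#   /etc/tower/conf.d/callback_receiver_workers.py
# Raise the number of callback receiver workers; choose a value that
# matches the CPUs available on the control node.
JOB_EVENT_WORKERS = 8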


Takeaways & where to go next

With this change to how callback receiver workers are scaled, the risk of running into OOM issues is reduced and the overall performance of automation controller improves. In the next blog, we compare results of this change on two different automation controller clusters.

If you’re interested in detailed information on automation controller, the automation controller documentation is a must-read. To download and install the latest version, please visit the automation controller installation guide. To view the release notes for recent automation controller releases, please see the 4.3 release notes. If you are interested in more details about Ansible Automation Platform, be sure to check out our e-books.

Originally posted on Ansible Blog
Author: Nikhil Jain
