Of debugging Ansible Tower and underlying cloud images

Recently I was experimenting with Tower’s isolated nodes feature – but somehow it did not work in my environment. Debugging told me a lot about Ansible Tower – and also why you should not trust arbitrary cloud images.

Background – Isolated Nodes

Ansible Tower has a nice feature called “isolated nodes”. Those are dedicated Tower instances which can manage nodes in separated environments – basically an Ansible Tower Proxy.

An Isolated Node is an Ansible Tower node that contains a small piece of software for running playbooks locally to manage a set of infrastructure. It can be deployed behind a firewall/VPC or in a remote datacenter, with only SSH access available. When a job is run that targets things managed by the isolated node, the job and its environment will be pushed to the isolated node over SSH, where it will run as normal.

Ansible Tower Feature Spotlight: Instance Groups and Isolated Nodes

Isolated nodes are especially handy when you set up your automation in security-sensitive environments. Think of DMZs here, of network separation and so on.

I was fooling around with a clustered Tower installation on RHEL 7 VMs in a cloud environment when I ran into trouble, though.

My problem – Isolated node unavailable

Isolated nodes – like instance groups – have a status inside Tower: if things are problematic, they are marked as unavailable. And this is what happened with my instance isonode.remote.example.com running in my lab environment:

Ansible Tower showing an instance node as unavailable

I tried to turn it “off” and “on” again with the button in the control interface. That made the node available, and it was even able to execute jobs – but it became unavailable again soon after.

Analysis

So what happened? The Tower logs showed a Python error:

# tail -f /var/log/tower/tower.log
fatal: [isonode.remote.example.com]: FAILED! => {"changed": false,
"module_stderr": "Shared connection to isonode.remote.example.com
closed.\r\n", "module_stdout": "Traceback (most recent call last):\r\n
File \"/var/lib/awx/.ansible/tmp/ansible-tmp-1552400585.04
-60203645751230/AnsiballZ_awx_capacity.py\", line 113, in <module>\r\n
_ansiballz_main()\r\n  File \"/var/lib/awx/.ansible/tmp/ansible-tmp
-1552400585.04-60203645751230/AnsiballZ_awx_capacity.py\", line 105, in
_ansiballz_main\r\n    invoke_module(zipped_mod, temp_path,
ANSIBALLZ_PARAMS)\r\n  File \"/var/lib/awx/.ansible/tmp/ansible-tmp
-1552400585.04-60203645751230/AnsiballZ_awx_capacity.py\", line 48, in
invoke_module\r\n    imp.load_module('__main__', mod, module, MOD_DESC)\r\n
File \"/tmp/ansible_awx_capacity_payload_6p5kHp/__main__.py\", line 74, in
<module>\r\n  File \"/tmp/ansible_awx_capacity_payload_6p5kHp/__main__.py\",
line 60, in main\r\n  File
\"/tmp/ansible_awx_capacity_payload_6p5kHp/__main__.py\", line 27, in
get_cpu_capacity\r\nAttributeError: 'module' object has no attribute
'cpu_count'\r\n", "msg": "MODULE FAILURE\nSee stdout/stderr for the exact
error", "rc": 1}

PLAY RECAP *********************************************************************
isonode.remote.example.com : ok=0    changed=0    unreachable=0    failed=1  

Apparently a Python function was missing. A look at the code shows that in line 27 of the file awx_capacity.py the function psutil.cpu_count() is indeed called:

def get_cpu_capacity():
    env_forkcpu = os.getenv('SYSTEM_TASK_FORKS_CPU', None)
    cpu = psutil.cpu_count()

Support for this function was added in version 2.0 of psutil:

2014-03-10
Enhancements
424: [Windows] installer for Python 3.X 64 bit.
427: number of logical and physical CPUs (psutil.cpu_count()).

psutil history

Note the date here: 2014-03-10 – pretty old! I checked the version of the installed package, and indeed it was pre-2.0:

$ rpm -q --queryformat '%{VERSION}\n' python-psutil
1.2.1

To be really sure and also to ensure that there was no weird function backporting, I checked the function call directly on the Tower machine:

# python
Python 2.7.5 (default, Sep 12 2018, 05:31:16) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import inspect
>>> import psutil as module
>>> functions = inspect.getmembers(module, inspect.isfunction)
>>> functions
[('_assert_pid_not_reused', <function _assert_pid_not_reused at
0x7f9eb10a8d70>), ('_deprecated', <function deprecated at 0x7f9eb38ec320>),
('_wraps', <function wraps at 0x7f9eb414f848>), ('avail_phymem', <function
avail_phymem at 0x7f9eb0c32ed8>), ('avail_virtmem', <function avail_virtmem at
0x7f9eb0c36398>), ('cached_phymem', <function cached_phymem at
0x7f9eb10a86e0>), ('cpu_percent', <function cpu_percent at 0x7f9eb0c32320>),
('cpu_times', <function cpu_times at 0x7f9eb0c322a8>), ('cpu_times_percent',
<function cpu_times_percent at 0x7f9eb0c326e0>), ('disk_io_counters',
<function disk_io_counters at 0x7f9eb0c32938>), ('disk_partitions', <function
disk_partitions at 0x7f9eb0c328c0>), ('disk_usage', <function disk_usage at
0x7f9eb0c32848>), ('get_boot_time', <function get_boot_time at
0x7f9eb0c32a28>), ('get_pid_list', <function get_pid_list at 0x7f9eb0c4b410>),
('get_process_list', <function get_process_list at 0x7f9eb0c32c08>),
('get_users', <function get_users at 0x7f9eb0c32aa0>), ('namedtuple',
<function namedtuple at 0x7f9ebc84df50>), ('net_io_counters', <function
net_io_counters at 0x7f9eb0c329b0>), ('network_io_counters', <function
network_io_counters at 0x7f9eb0c36500>), ('phymem_buffers', <function
phymem_buffers at 0x7f9eb10a8848>), ('phymem_usage', <function phymem_usage at
0x7f9eb0c32cf8>), ('pid_exists', <function pid_exists at 0x7f9eb0c32140>),
('process_iter', <function process_iter at 0x7f9eb0c321b8>), ('swap_memory',
<function swap_memory at 0x7f9eb0c327d0>), ('test', <function test at
0x7f9eb0c32b18>), ('total_virtmem', <function total_virtmem at
0x7f9eb0c361b8>), ('used_phymem', <function used_phymem at 0x7f9eb0c36050>),
('used_virtmem', <function used_virtmem at 0x7f9eb0c362a8>), ('virtmem_usage',
<function virtmem_usage at 0x7f9eb0c32de8>), ('virtual_memory', <function
virtual_memory at 0x7f9eb0c32758>), ('wait_procs', <function wait_procs at
0x7f9eb0c32230>)]
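
No trace of cpu_count. By the way, a much shorter way to check this is to query the attribute directly:

# python -c 'import psutil; print(hasattr(psutil, "cpu_count"))'
False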

Searching for a package origin

So how to solve this issue? My first idea was to get this working by porting the affected code to the multiprocessing library from Python’s standard library:

# python
Python 2.7.5 (default, Sep 12 2018, 05:31:16) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import multiprocessing
>>> cpu = multiprocessing.cpu_count()
>>> cpu
4
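
Out of curiosity I also sketched what a version-agnostic fix could look like – note that this is just my own illustration, not the actual awx code:

try:
    # psutil >= 2.0 provides cpu_count()
    import psutil
    cpu = psutil.cpu_count()
except (ImportError, AttributeError):
    # older psutil, or no psutil at all: fall back to the standard library
    import multiprocessing
    cpu = multiprocessing.cpu_count()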

But while I was filing a bug report I wondered why RHEL shipped such an ancient library. After all, RHEL 7 was released in June 2014, and psutil had offered cpu_count() since early 2014! And indeed, a quick search via the Red Hat package search showed a weird result: python-psutil was never part of base RHEL 7! It was only shipped as part of some very, very old OpenStack channels:

access.redhat.com package search, results for python-psutil

Newer OpenStack channels in fact come along with newer versions of python-psutil.

So how did this outdated package end up on this RHEL 7 image? Why was it never updated?

The cloud image is to blame! The package was installed on it – most likely during the creation of the image: python-psutil is needed for OpenStack Heat, so I assume that these RHEL 7 images were once created via OpenStack and then used as the default image in this demo environment.

And after the initial creation of the image the Heat packages were forgotten. In the meantime the image was updated to newer RHEL versions, snapshots were created as new defaults and so on. But since the package in question was never part of the main RHEL repos, it was never changed or removed. It just stayed there. Waiting, apparently, for me 😉

Conclusion

This issue showed me how tricky cloud images can be. Think about your own cloud images: have you really checked all of them and verified that no package, no startup script, no configuration was changed from the Linux distribution vendor’s base setup?

With RPMs this is still manageable: you can track whether packages are installed which are not present in the existing channels. But did someone install something with pip? Or in any other way?
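
On RPM-based systems a first check could look like this – yum list extras shows installed packages which are not available in any enabled repository, and pip list shows everything installed via pip:

# yum list extras
# pip list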

Take my case: an outdated version of a library was called instead of a much, much more recent one. Had there been a serious security issue with that library in the meantime, I would have been exposed – even though my update management did not report any library as needing an update.

I learned my lesson: I will be more critical with cloud images and check them in more detail in the future, to avoid nasty surprises in production. And I can only recommend that you do the same.

Useful command line options for ansible-playbook

Ansible provides quite a few useful command line options. Most of them are especially interesting during debugging.

Background

There are three major ways to work with Ansible:

  • launching single tasks with the ansible command
  • executing playbooks via ansible-playbook
  • using Tower to manage and run playbooks

While Tower might be the better option to run Ansible in day-to-day business, and the ansible CLI itself is most likely only used for one-off runs, executing playbooks on the command line often happens during playbook development, when no Tower is available – or during debugging. In such cases there are quite a few useful command line options which might not even be known to the seasoned Ansible user.

Do I say this right? – Syntax checking

Playbooks are written in YAML, and in YAML syntax is crucial – especially indentation:

Data structure hierarchy is maintained by outline indentation.

To check if a playbook is correctly formatted, the option --syntax-check looks at all involved playbooks and verifies that the syntax is correct. During a syntax check, no playbooks are actually executed.

$ ansible-playbook --syntax-check oraclejdk-destroy.yml
ERROR! Syntax Error while loading YAML.

The error appears to have been in '/home/liquidat/Gits/github/ansible-demo-oraclejdk/oracle-windows-destroy.yml': line 10, column 11,
but may be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:

  win_template: src=data/remove-program.j2 dest=C:\\temp\\remove-program.ps1
    - name: remove application
          ^ here
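
In contrast, if the playbook is syntactically correct, the command simply prints the playbook name and exits:

$ ansible-playbook --syntax-check oraclejdk-setup.yml

playbook: oraclejdk-setup.yml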

The syntax check helps if a playbook fails for no apparent reason – or if a playbook was edited a lot and it is simply unclear whether everything is still in the right place.

Whom am I talking to? – Listing affected hosts

With complex playbooks and dynamic inventories it is sometimes hard to tell which hosts a playbook will actually be executed against. In such cases, the option --list-hosts outputs a list of the affected hosts, including the name of each play and the pattern with which the hosts were chosen:

$ ansible-playbook --list-hosts oraclejdk-destroy.yml

playbook: oraclejdk-destroy.yml

  play #1 (windows): remove OracleJDK on Windows	TAGS: []
    pattern: [u'windows']
    hosts (1):
      radon

  play #2 (rhel): remove OracleJDK on RHEL	TAGS: []
    pattern: [u'rhel']
    hosts (2):
      neon
      helium
...

This also works together with the -l option and might help when debugging your inventory.
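
For example, combining it with -l limits the listing to the matching hosts – here the rhel group from the inventory above:

$ ansible-playbook --list-hosts -l rhel oraclejdk-destroy.yml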

Again, no tasks are actually executed when the list of hosts is queried.

What’s going on here? – List tasks

Another thing which can get pretty complicated is the list of tasks actually executed: think of complex playbooks including other complex playbooks, which quickly becomes difficult to follow. Here the option --list-tasks comes in handy: it lists what will be done, showing the names of the tasks without executing any of them on the target nodes:

$ ansible-playbook --list-tasks oraclejdk-destroy.yml

playbook: oraclejdk-destroy.yml

  play #1 (windows): remove OracleJDK on Windows	TAGS: []
    tasks:
      copy Java remove script to temp	TAGS: []
      remove application	TAGS: []
      remove temp dir in Windows	TAGS: []

  play #2 (rhel): remove OracleJDK on RHEL	TAGS: []
    tasks:
      remove java dir	TAGS: []
...

What’s that thing? – List all tags

Besides all tasks, the tags used in a playbook can be listed as well:

$ ansible-playbook --list-tags setup-control.yml

playbook: setup-control.yml

  play #1 (tuzak): 	TAGS: []
      TASK TAGS: [base_setup, db, imap, ldap, mail, oc, smtp]
...

Again, this option helps provide an overview of what a playbook has to offer and how to use it. And again, it does not execute any task on the target node.
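
The listed tags can then be used to run only parts of the playbook, for example via the --tags option:

$ ansible-playbook --tags mail setup-control.yml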

Are you sure? – Running in test mode

Ansible provides a so-called check mode, also known as dry run mode (in Tower, for example). Invoked via --check, the check mode does not alter the target nodes, but tries to output what would change and what would not. Note, however, that this needs to be supported by the modules in use – and not all modules support it.

For example, the following listing shows several tasks that do not support the dry run, indicated by the “skipping” lines:

$ ansible-playbook --check oraclejdk-setup.yml

PLAY [set up OracleJDK on Windows] *********************************************

TASK [setup] *******************************************************************
ok: [radon]

TASK [set up temp dir in Windows] **********************************************
skipping: [radon]

TASK [copy JDK to Windows client] **********************************************
skipping: [radon]

TASK [run exe installer] *******************************************************
skipping: [radon]
...
PLAY [set up OracleJDK on RHEL] ************************************************

TASK [setup] *******************************************************************
ok: [helium]
ok: [neon]

TASK [copy JDK to RHEL client] *************************************************
skipping: [helium]
skipping: [neon]
....

This is quite useful to get an idea of the impact a playbook run might have on the target nodes. The lack of support in several modules dampens the positive effect a bit, though.

But since the --diff option (see below) supports the check mode, the combination can be quite handy in certain situations.

Let me have a look at that… – Going through tasks step by step

Imagine that a playbook runs without errors, but somehow the result is not exactly what was expected. In such cases, one way to debug is to go through the playbook one task at a time, checking the state of all involved components after each task. This can be done with the option --step.

$ ansible-playbook --step oraclejdk-setup.yml

PLAY [set up OracleJDK on Windows] *********************************************
Perform task: TASK: setup (y/n/c): y

Perform task: TASK: setup (y/n/c): *********************************************

TASK [setup] *******************************************************************
ok: [radon]
Perform task: TASK: set up temp dir in Windows (y/n/c): y

Perform task: TASK: set up temp dir in Windows (y/n/c): ************************

TASK [set up temp dir in Windows] **********************************************
changed: [radon]
Perform task: TASK: copy JDK to Windows client (y/n/c): 
...

This is incredibly helpful on complex setups involving multiple nodes. At each prompt, y executes the task, n skips it, and c continues with all remaining tasks without asking again.

And yes, this time the tasks are actually executed on the target node!

Get me right there! – Starting playbooks in the middle

During debugging and development it might make sense to start playbooks not at the beginning, but somewhere in between – for example because a playbook failed at task 14, and you don’t want to go through the first 13 tasks again. Starting at a given task requires the exact name of that task – and the option --start-at-task:

$ ansible-playbook --start-at-task="run exe installer" oraclejdk-setup.yml

PLAY [set up OracleJDK on Windows] *********************************************

TASK [setup] *******************************************************************
ok: [radon]

TASK [run exe installer] *******************************************************
ok: [radon]
...

In this example, the two tasks “set up temp dir in Windows” and “copy JDK to Windows client” are skipped, and the playbook starts directly at “run exe installer”. Note that skipped tasks are not shown or listed at all, and that the setup is run nevertheless.

As shown above, the proper name of each task is listed with the --list-tasks option.

Get down to business! – Showing diffs

Ansible is often used to deploy files, especially using templates. Usually, when a file is changed, Ansible just highlights that a change occurred – but not what was actually changed. In such cases, the option --diff comes in handy: it shows the diff in typical patch form:

$ ansible-playbook --diff examples/template.yml

PLAY [template example] ********************************************************

TASK [setup] *******************************************************************
ok: [helium]

TASK [copy template] ***********************************************************
changed: [helium]
--- before: /tmp/template.conf
+++ after: dynamically generated
@@ -1,2 +1,3 @@
 hostname: ansible-demo-helium
-bumble: bee
+foo: bar
+MX: 10 mx2.redhat.com.,5 mx1.redhat.com.

PLAY RECAP *********************************************************************
helium                     : ok=2    changed=1    unreachable=0    failed=0

This can even be combined with the option --check: in that case the diff is printed, but the change is not performed on the target node. That’s pretty handy indeed.
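
The corresponding call is simply both options together:

$ ansible-playbook --check --diff examples/template.yml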

That was interesting! – Summary

To summarize, ansible-playbook has quite a few options to help debug playbooks. The fact that many of them do not alter the target nodes makes it possible to use them on production systems as well (with care, as always). They also help a lot when it comes to understanding unknown playbooks, for example from other departments or coworkers.