Debugging VM Resources

The previous module added new VM resources to the experiment but it is equally important to understand how to debug these VM resources if there are issues. This module will discuss where VM resources are located in experiments, how to search for errors, and methods for quickly debugging malfunctioning VM resources.

Locating Data Output

As described in Extracting VMR data, when a VMR prints to stdout, that output will be captured by FIREWHEEL and logged to the folder specified by the logging.vmr_log_dir configuration parameter on the compute node which has launched the VM.

Therefore, now that we have started our experiment, we can view that output. By default, the logging.vmr_log_dir is /tmp/firewheel/vm_resource_logs. However, it is also common to use /scratch/vm_resource_logs as well. For this tutorial, we will assume it’s located at /scratch/vm_resource_logs.

If you look in this directory after your experiment has started, you will see 4 files:

$ ls /scratch/vm_resource_logs
Client.json  Client.log  Server.json  Server.log

There will be a log file for each VM and a JSON file, containing some of the same messages, formatted in JSON (for easy data analysis). Inside the Client.log file, you will see all the interactions between FIREWHEEL and the VM. This includes the VM starting up (represented by messages indicating that QEMU Guest Agent not alive yet), writing various VM resources to the the VM (Creating directory: /var/launch/-250/set_hostname.sh), setting the IP addresses, and our own cURL output (at the bottom of the log).

Similarly, opening the Client.json will have more terse output, but formatted in JSON. Towards the end of this file, we should see our own output:

This log data will be used in the next module for data analysis.

Breaking VMRs

Assuming that the entire experiment to this point was coded with perfection, let’s introduce an intentional bug. Prior to doing this, we recommend making a copy of your existing model_component_objects.py file into your home directory so that it is easy to revert our changes.

$ cp model_component_objects.py ~/simple_server_model_component_objects.py

Now, let’s try adding a new, purposely broken VM resource to the SimpleServer object. In this module, we will compress our file to transfer and purposely fail to decompress it correctly. We will use two methods for doing this:

A new python VMR to compress the file (after it is dropped).
A new bash program to decompress the file prior to the experiment starting.

New VMRs

Python has the build-in tarfile which is useful for compressing and decompressing files. We will leverage this library to compress our file. First, create a new file called compress.py in the vm_resources directory and add the following:

Listing 6 vm_resources/compress.py

#!/usr/bin/env python3
import tarfile

filename = "/opt/file.txt"

# Place the new tarfile in /tmp/out.tgz
with tarfile.open("/tmp/out.tgz", "w:gz") as tar:
    tar.add(filename, arcname=os.path.basename(filename))

Note

The shebang line (i.e., #!/usr/bin/env python3) is necessary because we provide the VMR as the executable program to use with run_executable. Without this line, the Python script would be interpreted as a shell script.

Importantly, there is a bug in this program. At line 7, we use the os library without importing it, which will cause the program to fail. For now, we will ignore this error and continue to try and decompress the file with a shell script.

The shell script to decompress this file is rather simple:

$ # Place the output in /tmp
$ tar -C /tmp -xvzf /tmp/out.tgz

Now that we know what our VMRs are, we can modify model_component_objects.py to execute them. We know that the file gets created at time=-5 so these should each occur after that. We will run the Python VMR at time=-4 and the shell script at time=-3. Finally, we will verify our file exists by printing out the first few lines of the file at time=-2. Recall that this will fail!

Add the following to model_component_objects.py:

@require_class(Ubuntu1604Server)
class SimpleServer:
    """SimpleServer Class documentation."""

    def __init__(self):
        self.configure_files_to_serve()

        # Start the web server at time=1
        # The server needs to run in the ``/opt`` directory because that is where the
        # file will be located.
        self.run_executable(
            1, "bash", arguments="-c 'pushd /opt; python3 -m http.server; popd'"
        )

        # Adding our compress/decompress VMRs
        self.run_executable(-4, "compress.py", vm_resource=True)
        self.run_executable(-3, "tar", arguments="-C /tmp -xvzf /tmp/out.tgz")

        # Print the first few lines of the file
        self.run_executable(-2, "head", arguments="/tmp/file.txt")

Viewing the Results

Let’s restart the experiment and see what happens.

Note

The logging verbosity for the VM Resource Handler on the DEBUG level is significant and we recommend changing it at least to INFO. This can be accomplished by running:

$ firewheel config set -s logging.level INFO

$ firewheel experiment -r tutorials.simple_server minimega.launch

Once the experiment has configured, let’s investigate the VM resource logs for the Server. Recall that this module will assume they are located at /scratch/vm_resource_logs. A truncated section of output is shown below:

[2024-06-06 00:43:06 GMT INFO] CALL ARGS: /var/launch/-4/compress.py/call_arguments.sh
[2024-06-06 00:43:06 GMT INFO] PID returned: 1281
[2024-06-06 00:43:08 GMT WARNING] compress.py (1281) exited after 2.074975 seconds with code: 1
[2024-06-06 00:43:08 GMT INFO] b'Traceback (most recent call last):\n  File "/var/launch/-4/compress.py/compress.py", line 9, in <module>\n    tar.add(filename, arcname=os.path.basename(filename))\nNameError: name \'os\' is not defined\n'
[2024-06-06 00:43:12 GMT INFO] CALL ARGS: /var/launch/-3/tar/call_arguments.sh
[2024-06-06 00:43:12 GMT INFO] PID returned: 1285
[2024-06-06 00:43:16 GMT INFO] CALL ARGS: /var/launch/-2/head/call_arguments.sh
[2024-06-06 00:43:16 GMT INFO] PID returned: 1290
[2024-06-06 00:43:18 GMT WARNING] head (1290) exited after 2.074531 seconds with code: 1
[2024-06-06 00:43:18 GMT INFO] b"head: cannot open '/tmp/file.txt' for reading: No such file or directory\n"

While is is not overtly obvious via vm mix that there was an error, the logs indicate that a non-zero exit code was generated from our new VM resources as well as specific error messages.

Note

The design decision to not indicate the failure of VM resources via non-zero exit codes within the VM resource status is intentional. While zero exit codes are the typical convention, they don’t always indicate errors. Furthermore, some non-zero exit codes could be expected for certain experiments. For these reasons, it is left to the user to understand the expected behavior of the VMRs used within their experiment.

In-Experiment Debugging

In our contrived example, it is easy to identify the error. However, that may not always be the case. This section will demonstrate how to debug and test VMRs within an experiment without restarting the experiment each time. Note that, in some cases, experiment restart may be the only way to test VMR bug fixes, but the added time delay makes this a secondary option.

For each VM, FIREWHEEL creates a custom directory for all the VM resources. On Linux-based systems, the path is /var/launch.

Final Steps

Now that we have our VMR working properly, we can re-run the experiment and verify it is working as expected.

$ firewheel experiment -r tutorials.simple_server minimega.launch

After the experiment has been configured, we can revisit the Server VM resource logs and see output similar to:

[2024-06-06 12:05:14 GMT INFO] CALL ARGS: /var/launch/-4/compress.py/call_arguments.sh
[2024-06-06 12:05:14 GMT INFO] PID returned: 1261
[2024-06-06 12:05:20 GMT INFO] CALL ARGS: /var/launch/-3/tar/call_arguments.sh
[2024-06-06 12:05:20 GMT INFO] PID returned: 1265
[2024-06-06 12:05:22 GMT INFO] b'file.txt\n'
[2024-06-06 12:05:25 GMT INFO] CALL ARGS: /var/launch/-2/head/call_arguments.sh
[2024-06-06 12:05:25 GMT INFO] PID returned: 1270
[2024-06-06 12:05:27 GMT ERROR] 'utf-8' codec can't decode byte 0xef in position 0: invalid continuation byte
Traceback (most recent call last):
File "/opt/firewheel/src/firewheel/vm_resource_manager/vm_resource_handler.py", line 441, in run_vm_resource
    self._run_vm_resource(schedule_entry, queue)
File "/opt/firewheel/src/firewheel/vm_resource_manager/vm_resource_handler.py", line 532, in _run_vm_resource
    exitcode = self.driver.get_exitcode(pid)
File "/opt/firewheel/src/firewheel/vm_resource_manager/abstract_driver.py", line 324, in get_exitcode
    cache = self.exec_status(pid)
File "/opt/firewheel/src/firewheel/vm_resource_manager/drivers/qemu_guest_agent_driver.py", line 532, in exec_status
    self.store_captured_output(pid, result)
File "/opt/firewheel/src/firewheel/vm_resource_manager/drivers/qemu_guest_agent_driver.py", line 565, in store_captured_output
    cache["stdout"] += str(
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xef in position 0: invalid continuation byte

In our output, we can see that the errors from compress.py and tar have been eliminated and that the tar command has returned the correct output (i.e., file.txt). However, in this case we found a new error with printing our data to the VM Resource Handler. As we can see, FIREWHEEL can only log valid UTF-8 encoded data but our file was created with random bytes via os.urandom(), which will not print correctly. In this case, we can ignore the error as it does not impact the experiment and we have validated that our VMR’s are working correctly.

Now we can revert our changes and proceed to the next module to analyze our data.

$ cp ~/simple_server_model_component_objects.py model_component_objects.py

Note

We do not have to remove the compress.py file as it will no longer be executed within the experiment.