Data Analysis with Jupyter
As described in the previous module, we can view the automated experiment output in the VM resource log files. In this module, we will use the JSON-formatted files to begin analyzing the data (i.e., the latency of fetching the web page). As there is only one client, the initial data will be uninteresting, but the following modules will add complexity and build on this experiment to enable more interesting data analysis.
In this module, you will use standard Python-based data science tools, including Jupyter, pandas, seaborn, and Matplotlib, and learn how to assess experiment data with them. This module is simply one example of how experiment data can be analyzed; numerous other tools (e.g., Elasticsearch/Kibana) can serve the same purpose. While we strive to provide comments explaining what each section of code accomplishes, an in-depth discussion of these tools is outside the scope of this module.
First we will install Jupyter, pandas, seaborn, and Matplotlib into our Python virtual environment.
$ source /opt/firewheel/fwpy/bin/activate
$ python -m pip install jupyter pandas seaborn matplotlib
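Optionally, you can confirm the installation succeeded before continuing (a quick sanity check, not required by the tutorial):
$ jupyter --version
$ python -c "import pandas, seaborn, matplotlib"
If the second command exits silently, all three libraries imported correctly.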
Opening Jupyter
Once installed, you can start Jupyter by running `jupyter-notebook`, which will result in output similar to:
$ jupyter-notebook
[I 2024-06-03 09:57:15.026 ServerApp] jupyter_lsp | extension was successfully linked.
[I 2024-06-03 09:57:15.031 ServerApp] jupyter_server_terminals | extension was successfully linked.
[I 2024-06-03 09:57:15.037 ServerApp] jupyterlab | extension was successfully linked.
[I 2024-06-03 09:57:15.042 ServerApp] notebook | extension was successfully linked.
...
[I 2024-06-03 09:57:15.346 ServerApp] Jupyter Server 2.14.1 is running at:
[I 2024-06-03 09:57:15.346 ServerApp] http://localhost:8888/tree?token=d68f967dfce80a2f0e4204452ff13046275d75ea420ad0b2
[I 2024-06-03 09:57:15.346 ServerApp] http://127.0.0.1:8888/tree?token=d68f967dfce80a2f0e4204452ff13046275d75ea420ad0b2
[I 2024-06-03 09:57:15.346 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
Important: Take note of the lines with the URL/token, as they will be used to access the notebook.
Depending on how your cluster is configured, Jupyter will likely be running on http://localhost:8888. If you had to port-forward your miniweb dashboard, you will need to do the same for Jupyter (e.g., ssh -L localhost:8888:localhost:8888 <node>).
Now we can access the notebook via the link provided in the `jupyter-notebook` output (e.g., http://localhost:8888/tree?token=d68f...).
Create a new notebook and add the following import statements.
import glob # noqa: F401
from pathlib import Path
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from firewheel.config import Config
Getting the Right Path
First, we need to identify where the VM resource logs are located. This was described in the previous module, but the location can easily be identified programmatically by reading the FIREWHEEL configuration and joining the `logging.root_dir` parameter with the `logging.vmr_log_dir` parameter. As a reminder, an example of the `logging` subsection of the configuration is provided below:
logging:
  cli_log: cli.log
  discovery_log: discovery.log
  firewheel_log: firewheel.log
  level: DEBUG
  minimega_log: minimega.log
  root_dir: /scratch/
  vmr_log_dir: vm_resource_logs
# First we should identify the VM resource logging directory
fw_config = Config().config
# Get the full path to the VM resource logs
vm_resource_log_path = Path(fw_config["logging"]["root_dir"]) / Path(
fw_config["logging"]["vmr_log_dir"]
)
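# (Optional sanity check) Print the resolved path to confirm it matches
# the VM resource log directory described in the previous module
print(vm_resource_log_path)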
# Now we should get all the client data
# USE THIS LINE FOR THE SINGLE CLIENT EXPERIMENT
paths = vm_resource_log_path.glob("Client.json") # For the single client case
# USE THIS LINE FOR THE MULTI-CLIENT EXPERIMENT
# paths = vm_resource_log_path.glob(
# "client-*.json"
# ) # When we add complexity in later modules.
# Use an empty list to collect one dataframe per VM resource log
dfs = []
for file in paths:
# Read the line-delimited JSON data
data = pd.read_json(file, lines=True)
# Add Client Name
data["client"] = file.stem
# Append the data frame to the list
dfs.append(data)
# Concatenate all the data frames into a single data frame
df = pd.concat(dfs, ignore_index=True)
# Now we have a single dataframe that should look similar to the following:
df
|    | msg | timestamp | time | client |
|----|-----|-----------|------|--------|
| 0  | Disabling apt-daily.service and apt-daily.timer | 2024-06-03 08:46:07 | NaN | Client |
| 1  | Checking apt-daily status | 2024-06-03 08:46:07 | NaN | Client |
| 2  | ● apt-daily.service | 2024-06-03 08:46:07 | NaN | Client |
| 3  | Loaded: masked (/dev/null; bad) | 2024-06-03 08:46:07 | NaN | Client |
| 4  | Active: inactive (dead) since Mon 2024-06-03 0... | 2024-06-03 08:46:07 | NaN | Client |
| 5  | Main PID: 831 (code=exited, status=0/SUCCESS) | 2024-06-03 08:46:07 | NaN | Client |
| 6  | | 2024-06-03 08:46:07 | NaN | Client |
| 7  | Jun 03 09:45:11 host systemd[1]: Starting Dail... | 2024-06-03 08:46:07 | NaN | Client |
| 8  | Jun 03 09:45:14 host systemd[1]: Started Daily... | 2024-06-03 08:46:07 | NaN | Client |
| 9  | Jun 03 09:46:05 host systemd[1]: Stopped Daily... | 2024-06-03 08:46:07 | NaN | Client |
| 10 | ● apt-daily.timer - Daily apt download activities | 2024-06-03 08:46:07 | NaN | Client |
| 11 | Loaded: loaded (/lib/systemd/system/apt-daily.... | 2024-06-03 08:46:07 | NaN | Client |
| 12 | Active: inactive (dead) since Mon 2024-06-03 0... | 2024-06-03 08:46:07 | NaN | Client |
| 13 | | 2024-06-03 08:46:07 | NaN | Client |
| 14 | Jun 03 09:45:11 host systemd[1]: Started Daily... | 2024-06-03 08:46:07 | NaN | Client |
| 15 | Jun 03 09:46:05 host systemd[1]: Stopped Daily... | 2024-06-03 08:46:07 | NaN | Client |
| 16 | Killing running apt processes | 2024-06-03 08:46:07 | NaN | Client |
| 17 | Warning: Stopping apt-daily.service, but it ca... | 2024-06-03 08:46:07 | NaN | Client |
| 18 | apt-daily.timer | 2024-06-03 08:46:07 | NaN | Client |
| 19 | Removed symlink /etc/systemd/system/timers.tar... | 2024-06-03 08:46:07 | NaN | Client |
| 20 | Created symlink from /etc/systemd/system/apt-d... | 2024-06-03 08:46:07 | NaN | Client |
| 21 | ens1 -> 00:00:00:00:00:02 | 2024-06-03 08:46:41 | NaN | Client |
| 22 | SETTING ens1 to IP: 1.0.0.2/24 | 2024-06-03 08:46:41 | NaN | Client |
| 23 | inet 1.0.0.2/24 scope global ens1 | 2024-06-03 08:46:41 | NaN | Client |
| 24 | connect: Network is unreachable | 2024-06-03 08:46:52 | NaN | Client |
| 25 | NaN | 2024-06-03 08:48:13 | 0.157 | Client |
| 26 | % Total % Received % Xferd Average Speed ... | 2024-06-03 08:48:13 | NaN | Client |
| 27 | Dload Upload Total Spent Left Speed | 2024-06-03 08:48:13 | NaN | Client |
| 28 | | 2024-06-03 08:48:13 | NaN | Client |
| 29 | 0 0 0 0 0 0 0 0 --... | 2024-06-03 08:48:13 | NaN | Client |
| 30 | 100 50.0M 100 50.0M 0 0 318M 0 ... | 2024-06-03 08:48:13 | NaN | Client |
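Before cleaning the data, it can be helpful to confirm the dataframe's structure with a couple of standard pandas calls (an optional inspection step, not part of the original workflow):
# Show column names, dtypes, and non-null counts for each column
df.info()
# Count how many rows contain an actual latency measurement
df["time"].notna().sum()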
Removing Unnecessary Data
In this particular experiment, we are only interested in the well-formatted JSON provided by cURL. FIREWHEEL attempts to format non-JSON messages into a parsable format and will always provide the `msg` and `timestamp` fields. However, in our case we have also added the `time` field (the measure of how long cURL took to fetch the page). Therefore, we can ignore all rows without the `time` field.
# Drop all rows (axis=0) where the `time` column is `NaN`
# Then drop all columns where no data exists (in this case, the `msg` column)
dropped = df.dropna(axis=0, subset=["time"]).dropna(axis=1)
dropped
|    | timestamp | time | client |
|----|-----------|------|--------|
| 25 | 2024-06-03 08:48:13 | 0.157 | Client |
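With a single client, there is only one measurement, so the table above is trivial. Once later modules add more clients, a quick statistical summary becomes useful; a minimal sketch using standard pandas methods:
# Summary statistics (count, mean, std, quartiles) for the latency values
dropped["time"].describe()
# Mean latency per client (more interesting once multiple clients exist)
dropped.groupby("client")["time"].mean()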
Plotting the Data
Now we can use seaborn to plot the data in a simple bar chart.
# Ensure the image is large enough
plt.figure(figsize=(10, 6))
# Plot a simple bar chart with each client along the x-axis
# and the time it took along the y-axis
ax = sns.barplot(data=dropped, x="client", y="time")
# Enhance the output image with a title and better axis labels
plt.xlabel("Client Name")
plt.ylabel("Time to Fetch Web Page (seconds)")
plt.title("Amount of Time to Fetch the Experiment Web Page")
# Show the image
plt.show()
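If you want to keep the chart as an image file (e.g., for a report), Matplotlib can also write it to disk. Note that plt.savefig() must be called before plt.show(), since many backends clear the figure once it is displayed; the filename below is only an illustration:
# Save the figure to disk (call this before plt.show());
# bbox_inches="tight" trims excess whitespace around the plot
plt.savefig("fetch_latency.png", dpi=150, bbox_inches="tight")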