{ "cells": [ { "cell_type": "markdown", "id": "f0ba404d-342f-418d-8cda-bb90605601a8", "metadata": {}, "source": [ "# Data Analysis with Jupyter\n", "\n", "As described in the previous module, we can view the automated experiment output from the VM resource log files.\n", "In this module, we will use the JSON-formatted files to begin analyzing the data (i.e., the latency of getting the web page).\n", "As there is only one client, the initial data will be uninteresting, but the following modules will add complexity\n", "and build on this experiment to enable more interesting data analysis.\n", "\n", "In this module, you will use standard Python-based data science tools including [Jupyter](https://jupyter.org/), [pandas](https://pandas.pydata.org/), [seaborn](https://seaborn.pydata.org/), and [Matplotlib](https://matplotlib.org/) and learn how to assess experiment data with them.\n", "This module is simply an example of how experiment data can be analyzed, and there are numerous other tools to do so (e.g., Elasticsearch/Kibana).\n", "While we strive to provide comments explaining what each section of code is accomplishing, in-depth discussions/tutorials about these tools are outside the scope of this module.\n", "\n", "First, we will install [Jupyter](https://jupyter.org/), [pandas](https://pandas.pydata.org/), [seaborn](https://seaborn.pydata.org/), and [Matplotlib](https://matplotlib.org/) into our Python virtual environment.\n", "\n", "```bash\n", "$ source /opt/firewheel/fwpy/bin/activate\n", "$ python -m pip install jupyter pandas seaborn matplotlib\n", "```\n", "\n", "\n", "## Opening Jupyter\n", "Once installed, you can start Jupyter by running `jupyter-notebook`, which will produce output similar to the following:\n", "\n", "```bash\n", "$ jupyter-notebook\n", "[I 2024-06-03 09:57:15.026 ServerApp] jupyter_lsp | extension was successfully linked.\n", "[I 2024-06-03 09:57:15.031 ServerApp] jupyter_server_terminals | extension was successfully linked.\n", "[I 2024-06-03 
09:57:15.037 ServerApp] jupyterlab | extension was successfully linked.\n", "[I 2024-06-03 09:57:15.042 ServerApp] notebook | extension was successfully linked.\n", "...\n", "[I 2024-06-03 09:57:15.346 ServerApp] Jupyter Server 2.14.1 is running at:\n", "[I 2024-06-03 09:57:15.346 ServerApp] http://localhost:8888/tree?token=d68f967dfce80a2f0e4204452ff13046275d75ea420ad0b2\n", "[I 2024-06-03 09:57:15.346 ServerApp] http://127.0.0.1:8888/tree?token=d68f967dfce80a2f0e4204452ff13046275d75ea420ad0b2\n", "[I 2024-06-03 09:57:15.346 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).\n", "```\n", "\n", "**Important:** Take note of the lines containing the URL and token, as they will be used to access the notebook.\n", "\n", "Depending on how your cluster is configured, Jupyter will likely be running on ``http://localhost:8888``.\n", "If you had to port-forward your miniweb dashboard, you will need to do the same for Jupyter (e.g., ``ssh -Llocalhost:8888:localhost:8888 <node>``).\n", "\n", "Now we can access the notebook via the link provided in the `jupyter-notebook` output (e.g., ``http://localhost:8888/tree?token=d68f...``).\n", "\n", "Create a new notebook and add the following import statements." ] }, { "cell_type": "code", "execution_count": 1, "id": "573ef911-3b94-419f-8109-f22a74f4673b", "metadata": {}, "outputs": [], "source": [ "import glob # noqa: F401\n", "from pathlib import Path\n", "\n", "import pandas as pd\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "\n", "from firewheel.config import Config" ] }, { "cell_type": "markdown", "id": "b30511b7-67a7-4337-8270-730d0d6e8183", "metadata": {}, "source": [ "## Getting the Right Path\n", "\n", "First, we need to identify where the VM resource logs are located. 
This was described in the previous module, but it can easily be identified programmatically by accessing the FIREWHEEL configuration and concatenating the `logging.root_dir` parameter with the `logging.vmr_log_dir` parameter (e.g., `/scratch/vm_resource_logs` for the configuration shown here). As a reminder, an example of the `logging` subsection of the configuration is provided below:\n", "\n", "```yaml\n", "logging:\n", " cli_log: cli.log\n", " discovery_log: discovery.log\n", " firewheel_log: firewheel.log\n", " level: DEBUG\n", " minimega_log: minimega.log\n", " root_dir: /scratch/\n", " vmr_log_dir: vm_resource_logs\n", "```" ] }, { "cell_type": "code", "execution_count": 2, "id": "89163212-8738-4712-80a9-4ee0c2cb0acc", "metadata": {}, "outputs": [], "source": [ "# First we should identify the VM resource logging directory\n", "fw_config = Config().config\n", "\n", "# Get the full path to the VM resource logs\n", "vm_resource_log_path = Path(fw_config[\"logging\"][\"root_dir\"]) / Path(\n", "    fw_config[\"logging\"][\"vmr_log_dir\"]\n", ")" ] }, { "cell_type": "code", "execution_count": 3, "id": "b9236c7a-630a-486b-9932-6b779b5c370e", "metadata": {}, "outputs": [], "source": [ "# Now we should get all the client data\n", "\n", "# USE THIS LINE FOR THE SINGLE CLIENT EXPERIMENT\n", "paths = vm_resource_log_path.glob(\"Client.json\") # For the single client case\n", "\n", "# USE THIS LINE FOR THE MULTI-CLIENT EXPERIMENT\n", "# paths = vm_resource_log_path.glob(\n", "# \"client-*.json\"\n", "# ) # When we add complexity in later modules.\n", "\n", "\n", "# We can use an empty list to store a single dataframe per VM resource log\n", "dfs = []\n", "for file in paths:\n", "    # Read the line-delimited JSON data\n", "    data = pd.read_json(file, lines=True)\n", "\n", "    # Add the client name (the log file name without its extension)\n", "    data[\"client\"] = file.stem\n", "\n", "    # Append the data frame to the list\n", "    dfs.append(data)\n", "\n", "# Concatenate all the data frames into a single data frame\n", "df = pd.concat(dfs, ignore_index=True)" ] }, { "cell_type": "code", 
"execution_count": 4, "id": "42c86210-0264-4085-bf52-1421a7e65bea", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>msg</th>\n", " <th>timestamp</th>\n", " <th>time</th>\n", " <th>client</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>Disabling apt-daily.service and apt-daily.timer</td>\n", " <td>2024-06-03 08:46:07</td>\n", " <td>NaN</td>\n", " <td>Client</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>Checking apt-daily status</td>\n", " <td>2024-06-03 08:46:07</td>\n", " <td>NaN</td>\n", " <td>Client</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>● apt-daily.service</td>\n", " <td>2024-06-03 08:46:07</td>\n", " <td>NaN</td>\n", " <td>Client</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>Loaded: masked (/dev/null; bad)</td>\n", " <td>2024-06-03 08:46:07</td>\n", " <td>NaN</td>\n", " <td>Client</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>Active: inactive (dead) since Mon 2024-06-03 0...</td>\n", " <td>2024-06-03 08:46:07</td>\n", " <td>NaN</td>\n", " <td>Client</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>Main PID: 831 (code=exited, status=0/SUCCESS)</td>\n", " <td>2024-06-03 08:46:07</td>\n", " <td>NaN</td>\n", " <td>Client</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td></td>\n", " <td>2024-06-03 08:46:07</td>\n", " <td>NaN</td>\n", " <td>Client</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>Jun 03 09:45:11 host systemd[1]: Starting Dail...</td>\n", " <td>2024-06-03 08:46:07</td>\n", " <td>NaN</td>\n", " <td>Client</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " <td>Jun 03 
09:45:14 host systemd[1]: Started Daily...</td>\n", " <td>2024-06-03 08:46:07</td>\n", " <td>NaN</td>\n", " <td>Client</td>\n", " </tr>\n", " <tr>\n", " <th>9</th>\n", " <td>Jun 03 09:46:05 host systemd[1]: Stopped Daily...</td>\n", " <td>2024-06-03 08:46:07</td>\n", " <td>NaN</td>\n", " <td>Client</td>\n", " </tr>\n", " <tr>\n", " <th>10</th>\n", " <td>● apt-daily.timer - Daily apt download activities</td>\n", " <td>2024-06-03 08:46:07</td>\n", " <td>NaN</td>\n", " <td>Client</td>\n", " </tr>\n", " <tr>\n", " <th>11</th>\n", " <td>Loaded: loaded (/lib/systemd/system/apt-daily....</td>\n", " <td>2024-06-03 08:46:07</td>\n", " <td>NaN</td>\n", " <td>Client</td>\n", " </tr>\n", " <tr>\n", " <th>12</th>\n", " <td>Active: inactive (dead) since Mon 2024-06-03 0...</td>\n", " <td>2024-06-03 08:46:07</td>\n", " <td>NaN</td>\n", " <td>Client</td>\n", " </tr>\n", " <tr>\n", " <th>13</th>\n", " <td></td>\n", " <td>2024-06-03 08:46:07</td>\n", " <td>NaN</td>\n", " <td>Client</td>\n", " </tr>\n", " <tr>\n", " <th>14</th>\n", " <td>Jun 03 09:45:11 host systemd[1]: Started Daily...</td>\n", " <td>2024-06-03 08:46:07</td>\n", " <td>NaN</td>\n", " <td>Client</td>\n", " </tr>\n", " <tr>\n", " <th>15</th>\n", " <td>Jun 03 09:46:05 host systemd[1]: Stopped Daily...</td>\n", " <td>2024-06-03 08:46:07</td>\n", " <td>NaN</td>\n", " <td>Client</td>\n", " </tr>\n", " <tr>\n", " <th>16</th>\n", " <td>Killing running apt processes</td>\n", " <td>2024-06-03 08:46:07</td>\n", " <td>NaN</td>\n", " <td>Client</td>\n", " </tr>\n", " <tr>\n", " <th>17</th>\n", " <td>Warning: Stopping apt-daily.service, but it ca...</td>\n", " <td>2024-06-03 08:46:07</td>\n", " <td>NaN</td>\n", " <td>Client</td>\n", " </tr>\n", " <tr>\n", " <th>18</th>\n", " <td>apt-daily.timer</td>\n", " <td>2024-06-03 08:46:07</td>\n", " <td>NaN</td>\n", " <td>Client</td>\n", " </tr>\n", " <tr>\n", " <th>19</th>\n", " <td>Removed symlink /etc/systemd/system/timers.tar...</td>\n", " <td>2024-06-03 08:46:07</td>\n", " 
<td>NaN</td>\n", " <td>Client</td>\n", " </tr>\n", " <tr>\n", " <th>20</th>\n", " <td>Created symlink from /etc/systemd/system/apt-d...</td>\n", " <td>2024-06-03 08:46:07</td>\n", " <td>NaN</td>\n", " <td>Client</td>\n", " </tr>\n", " <tr>\n", " <th>21</th>\n", " <td>ens1 -> 00:00:00:00:00:02</td>\n", " <td>2024-06-03 08:46:41</td>\n", " <td>NaN</td>\n", " <td>Client</td>\n", " </tr>\n", " <tr>\n", " <th>22</th>\n", " <td>SETTING ens1 to IP: 1.0.0.2/24</td>\n", " <td>2024-06-03 08:46:41</td>\n", " <td>NaN</td>\n", " <td>Client</td>\n", " </tr>\n", " <tr>\n", " <th>23</th>\n", " <td>inet 1.0.0.2/24 scope global ens1</td>\n", " <td>2024-06-03 08:46:41</td>\n", " <td>NaN</td>\n", " <td>Client</td>\n", " </tr>\n", " <tr>\n", " <th>24</th>\n", " <td>connect: Network is unreachable</td>\n", " <td>2024-06-03 08:46:52</td>\n", " <td>NaN</td>\n", " <td>Client</td>\n", " </tr>\n", " <tr>\n", " <th>25</th>\n", " <td>NaN</td>\n", " <td>2024-06-03 08:48:13</td>\n", " <td>0.157</td>\n", " <td>Client</td>\n", " </tr>\n", " <tr>\n", " <th>26</th>\n", " <td>% Total % Received % Xferd Average Speed ...</td>\n", " <td>2024-06-03 08:48:13</td>\n", " <td>NaN</td>\n", " <td>Client</td>\n", " </tr>\n", " <tr>\n", " <th>27</th>\n", " <td>Dload Upload Total Spent Left Speed</td>\n", " <td>2024-06-03 08:48:13</td>\n", " <td>NaN</td>\n", " <td>Client</td>\n", " </tr>\n", " <tr>\n", " <th>28</th>\n", " <td></td>\n", " <td>2024-06-03 08:48:13</td>\n", " <td>NaN</td>\n", " <td>Client</td>\n", " </tr>\n", " <tr>\n", " <th>29</th>\n", " <td>0 0 0 0 0 0 0 0 --...</td>\n", " <td>2024-06-03 08:48:13</td>\n", " <td>NaN</td>\n", " <td>Client</td>\n", " </tr>\n", " <tr>\n", " <th>30</th>\n", " <td>100 50.0M 100 50.0M 0 0 318M 0 ...</td>\n", " <td>2024-06-03 08:48:13</td>\n", " <td>NaN</td>\n", " <td>Client</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " msg timestamp \\\n", "0 Disabling apt-daily.service and apt-daily.timer 2024-06-03 08:46:07 \n", "1 Checking apt-daily 
status 2024-06-03 08:46:07 \n", "2 ● apt-daily.service 2024-06-03 08:46:07 \n", "3 Loaded: masked (/dev/null; bad) 2024-06-03 08:46:07 \n", "4 Active: inactive (dead) since Mon 2024-06-03 0... 2024-06-03 08:46:07 \n", "5 Main PID: 831 (code=exited, status=0/SUCCESS) 2024-06-03 08:46:07 \n", "6 2024-06-03 08:46:07 \n", "7 Jun 03 09:45:11 host systemd[1]: Starting Dail... 2024-06-03 08:46:07 \n", "8 Jun 03 09:45:14 host systemd[1]: Started Daily... 2024-06-03 08:46:07 \n", "9 Jun 03 09:46:05 host systemd[1]: Stopped Daily... 2024-06-03 08:46:07 \n", "10 ● apt-daily.timer - Daily apt download activities 2024-06-03 08:46:07 \n", "11 Loaded: loaded (/lib/systemd/system/apt-daily.... 2024-06-03 08:46:07 \n", "12 Active: inactive (dead) since Mon 2024-06-03 0... 2024-06-03 08:46:07 \n", "13 2024-06-03 08:46:07 \n", "14 Jun 03 09:45:11 host systemd[1]: Started Daily... 2024-06-03 08:46:07 \n", "15 Jun 03 09:46:05 host systemd[1]: Stopped Daily... 2024-06-03 08:46:07 \n", "16 Killing running apt processes 2024-06-03 08:46:07 \n", "17 Warning: Stopping apt-daily.service, but it ca... 2024-06-03 08:46:07 \n", "18 apt-daily.timer 2024-06-03 08:46:07 \n", "19 Removed symlink /etc/systemd/system/timers.tar... 2024-06-03 08:46:07 \n", "20 Created symlink from /etc/systemd/system/apt-d... 2024-06-03 08:46:07 \n", "21 ens1 -> 00:00:00:00:00:02 2024-06-03 08:46:41 \n", "22 SETTING ens1 to IP: 1.0.0.2/24 2024-06-03 08:46:41 \n", "23 inet 1.0.0.2/24 scope global ens1 2024-06-03 08:46:41 \n", "24 connect: Network is unreachable 2024-06-03 08:46:52 \n", "25 NaN 2024-06-03 08:48:13 \n", "26 % Total % Received % Xferd Average Speed ... 2024-06-03 08:48:13 \n", "27 Dload Upload Total Spent Left Speed 2024-06-03 08:48:13 \n", "28 2024-06-03 08:48:13 \n", "29 0 0 0 0 0 0 0 0 --... 2024-06-03 08:48:13 \n", "30 100 50.0M 100 50.0M 0 0 318M 0 ... 
2024-06-03 08:48:13 \n", "\n", " time client \n", "0 NaN Client \n", "1 NaN Client \n", "2 NaN Client \n", "3 NaN Client \n", "4 NaN Client \n", "5 NaN Client \n", "6 NaN Client \n", "7 NaN Client \n", "8 NaN Client \n", "9 NaN Client \n", "10 NaN Client \n", "11 NaN Client \n", "12 NaN Client \n", "13 NaN Client \n", "14 NaN Client \n", "15 NaN Client \n", "16 NaN Client \n", "17 NaN Client \n", "18 NaN Client \n", "19 NaN Client \n", "20 NaN Client \n", "21 NaN Client \n", "22 NaN Client \n", "23 NaN Client \n", "24 NaN Client \n", "25 0.157 Client \n", "26 NaN Client \n", "27 NaN Client \n", "28 NaN Client \n", "29 NaN Client \n", "30 NaN Client " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Now we have a single dataframe that should look akin to:\n", "df" ] }, { "cell_type": "markdown", "id": "b3e0fbeb-2476-4cab-ba09-968122df4b23", "metadata": {}, "source": [ "## Removing Unnecessary Data\n", "In this particular experiment, we are only interested in the well-formatted JSON provided by cURL.\n", "FIREWHEEL attempts to format non-JSON messages into a parsable format and will always provide a `msg` and `timestamp` field.\n", "However, in our case we have also added the `time` field (as the measure of how long cURL took). Therefore, we can ignore all rows without the `time` field." 
] }, { "cell_type": "code", "execution_count": 5, "id": "ccd9ddd0-0540-4d80-a7a8-fa06a2c33a4f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>timestamp</th>\n", " <th>time</th>\n", " <th>client</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>25</th>\n", " <td>2024-06-03 08:48:13</td>\n", " <td>0.157</td>\n", " <td>Client</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " timestamp time client\n", "25 2024-06-03 08:48:13 0.157 Client" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Drop all rows (axis=0) where the `time` column is `NaN`\n", "# Then drop all columns that contain missing values (in this case, the `msg` column)\n", "dropped = df.dropna(axis=0, subset=[\"time\"]).dropna(axis=1)\n", "dropped" ] }, { "cell_type": "markdown", "id": "87255cfa-9f6c-455a-882e-e76ea2075a5e", "metadata": {}, "source": [ "## Plotting the Data\n", "Now we can use [seaborn](https://seaborn.pydata.org/) to plot the data in a simple bar chart." 
] }, { "cell_type": "code", "execution_count": 6, "id": "01c383f7-beeb-41aa-9389-b586526deecd", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "<Figure size 1000x600 with 1 Axes>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Ensure the image is large enough\n", "plt.figure(figsize=(10, 6))\n", "\n", "# Plot a simple bar chart with each client along the x-axis\n", "# and the time it took along the y-axis\n", "# (`sns.barplot` returns the Matplotlib Axes it drew on)\n", "ax = sns.barplot(data=dropped, x=\"client\", y=\"time\")\n", "\n", "# Enhance the output image with a title and better axis labels\n", "plt.xlabel(\"Client Name\")\n", "plt.ylabel(\"Time to get web page\")\n", "plt.title(\"Amount of Time to Fetch the Experiment Web Page\")\n", "\n", "# Show the image\n", "plt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.0" } }, "nbformat": 4, "nbformat_minor": 5 }