StatsProcessor#

Overview#

The StatsProcessor module contains the MBArea object, which is the foundation of the MAPIT API. The first step in using MAPIT is to define a material balance area (i.e., MBArea). This object takes a number of parameters that are used to define a material balance area. The initial properties can be later modified by accessing the specific object properties.

After the MBArea is successfully defined, different statistical tests can be applied to the MBArea by calling object methods. The results are returned after calling the method, but results are also stored as object attributes that can be easily accessed.

Tip

The MBArea object is designed to streamline the analysis experience while providing flexibility. For example, an MBArea could be initialized, copied, then have a few properties modified to compare “what-if” scenarios.

import copy

# initialize with some variables
MBA0 = MBArea(...)

# clone MBArea
MBA1 = copy.copy(MBA0)

# modify input term errors
MBA1.inputErrorMatrix = otherErrorMatrix

# calculate sigma MUF
MBA0.calcSEMUF()
MBA1.calcSEMUF()

# do comparison between baseline and modified input error cases
# ...

Important

If modifying the error matrix after having calculated errors or a statistical quantity, the errors must be recalculated using the calcErrors method.

Parallel Processing#

MAPIT provides parallel processing capabilities through the [Ray](https://www.ray.io/) library. By default, Ray provides a local dashboard at 127.0.0.1:8265 that can be used to monitor progress and view job-related statistics. Two key parameters control parallel processing: ncpu and nbatch. ncpu sets the number of CPUs provided to Ray, whereas nbatch is the number of iterations processed in each task. Ray starts ncpu workers, and each worker pulls tasks from a shared queue until the queue is empty. Each task returns nbatch of the iterations requested by the user, so the total number of tasks is iterations divided by nbatch. The table below shows how the user-specified variables iterations, ncpu, and nbatch determine the total number of tasks and the number of tasks completed by each worker.

`iterations`	`ncpu`	`nbatch`	total number of tasks	tasks completed per worker
100	5	1	100	20
100	5	5	20	4
100	5	20	5	1

nbatch is provided as a parameter as there is overhead incurred when copying data to/from workers. If nbatch is too small, then parallel processing might be slower than sequential processing if the calculation time is small compared to the memory copying time. We do not provide guidance on setting these parameters as performance will be system specific.
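The relationship in the table above can be reproduced with a few lines of arithmetic. This is a minimal sketch and not part of MAPIT: the function name task_layout is hypothetical, and rounding up for settings that do not divide evenly is an assumption, since the table only covers evenly divisible cases.

```python
import math


def task_layout(iterations: int, ncpu: int, nbatch: int) -> tuple[int, int]:
    """Return (total number of tasks, tasks per worker).

    Hypothetical helper for illustration only; rounding up for
    uneven splits is an assumption, not documented MAPIT behavior.
    """
    total_tasks = math.ceil(iterations / nbatch)      # each task covers nbatch iterations
    tasks_per_worker = math.ceil(total_tasks / ncpu)  # tasks are shared among ncpu workers
    return total_tasks, tasks_per_worker


# Reproduce the three rows of the table (iterations=100, ncpu=5)
for nbatch in (1, 5, 20):
    print(nbatch, task_layout(100, 5, nbatch))
```

Larger nbatch values mean fewer, longer tasks, which reduces how often data is copied to and from workers.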

Classes#

class MAPIT.core.StatsProcessor.MBArea(rawInput, rawInventory, rawOutput, rawInputTimes, rawInventoryTimes, rawOutputTimes, inputErrorMatrix, inventoryErrorMatrix, outputErrorMatrix, mbaTime, inputTypes, outputTypes, inputCalibrationPeriod=None, inventoryCalibrationPeriod=None, outputCalibrationPeriod=None, iterations=1, dopar=False, ncpu=1, nbatch=1, GUIObject=None, dataOffset=0, rebaseToZero=True, doTQDM=True)#

Object representing a material balance area.

Parameters:

rawInput (list of ndarrays) – Raw input data for the material balance area, list of 2D ndarrays. Each entry in the list should correspond to a different location and the shape of each ndarray in the list should be [MxN] where M is the sample dimension (number of samples) and N is the isotopic dimension, if applicable. If only considering one isotope, each ndarray in the rawInput list should be [Mx1]. The values are expected to have rate units (e.g., kg/hr) as this quantity will be integrated.
rawInventory (list of ndarrays) – Raw inventory data for the material balance area, list of 2D ndarrays. Shape structure is the same as rawInput. The values are expected to have mass units (e.g., kg) as this quantity will not be integrated.
rawOutput (list of ndarrays) – Raw output data for the material balance area, list of 2D ndarrays. Shape structure is the same as rawInput. The values are expected to have rate units (e.g., kg/hr) as this quantity will be integrated.
rawInputTimes (list of ndarrays) – A list of ndarrays that has length equal to the total number of input locations. Each array should be $[m, 1]$ in shape where $m$ is the number of samples. len(rawInputTimes) and the shape of each list entry (ndarray) should be the same as for rawInput. Each entry in each ndarray should correspond to a timestamp indicating when the value was taken.
rawInventoryTimes (list of ndarrays) – A list of ndarrays that has length equal to the total number of inventory locations. Shape structure is the same as rawInputTimes.
rawOutputTimes (list of ndarrays) – A list of ndarrays that has length equal to the total number of output locations. Shape structure is the same as rawInputTimes.
inputErrorMatrix (ndarray) – 2D ndarray of shape [Mx2] describing the relative standard deviations to apply to rawInput. M is the sample dimension in each input array and should be identical to the M described in rawInput. The second dimension (i.e., 2) holds the random and systematic errors, respectively, such that ErrorMatrix[0,0] is the random relative standard deviation of the first location and ErrorMatrix[0,1] is the systematic relative standard deviation of the first location.
inventoryErrorMatrix (ndarray) – 2D ndarray with the same shape structure as inputErrorMatrix describing errors to apply to rawInventory.
outputErrorMatrix (ndarray) – 2D ndarray with the same shape structure as inputErrorMatrix describing errors to apply to rawOutput.
mbaTime (int) – The material balance period.
inputTypes (list of strings) – Defines the type of input. This should be a list of strings that is the same length as the number of input locations. The strings should be one of the following: ‘discrete’ or ‘continuous’.
outputTypes (list of strings) – Defines the type of output. This should be a list of strings that is the same length as the number of output locations. The strings should be one of the following: ‘discrete’ or ‘continuous’.
inputCalibrationPeriod (list of float, default=None) – List of floats of length M describing the calibration period for each location in rawInput. If not supplied, no recalibration is performed and it is assumed a single calibration period is applied to the length of the data.
inventoryCalibrationPeriod (list of float, default=None) – List of floats of length M describing the calibration period for each location in rawInventory. If not supplied, no recalibration is performed and it is assumed a single calibration period is applied to the length of the data.
outputCalibrationPeriod (list of float, default=None) – List of floats of length M describing the calibration period for each location in rawOutput. If not supplied, no recalibration is performed and it is assumed a single calibration period is applied to the length of the data.
iterations (int, default=1) – Number of statistical realizations.
dopar (bool, default=False) – Controls the use of parallel processing provided by Ray. If used, progress can be monitored on a local dashboard that is accessible at http://127.0.0.1:8265.
ncpu (int, default=1) – The number of CPUs to use if parallel processing is enabled.
nbatch (int, default=1) – The number of iterations to process in each task when parallel processing is enabled.
GUIObject (object, default=None) – An object containing MAPIT GUI parameters. Only used internally by the GUI.
dataOffset (int, default=0) – Offset to apply to the data. If specified, data before this value in time will be removed. For example, if dataOffset=273, then any data with a corresponding time before 273 will be excluded from calculations.
rebaseToZero (bool, default=True) – Used in conjunction with dataOffset. If True, times after dataOffset are rebased to start at zero (e.g., if dataOffset=273, then t=274 is rebased to t=1).
doTQDM (bool, default=True) – Boolean used to control progress bar of calculations.
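To make the expected shapes concrete, the following sketch builds synthetic input data for two locations and checks the consistency rules described above. It is an illustration under stated assumptions: the values, sample count, and number of locations are made up, and the arrays are not passed to MBArea here.

```python
import numpy as np

# Illustrative sizes (assumptions, not MAPIT defaults)
n_samples = 500          # M: number of samples per location
n_isotopes = 1           # N: one isotope, so each array is [M x 1]
n_input_locations = 2

rng = np.random.default_rng(0)

# One 2D array per input location; values are flow rates (e.g., kg/hr)
rawInput = [
    rng.uniform(0.9, 1.1, size=(n_samples, n_isotopes))
    for _ in range(n_input_locations)
]

# One [M x 1] array of timestamps per input location
rawInputTimes = [
    np.arange(n_samples, dtype=float).reshape(-1, 1)
    for _ in range(n_input_locations)
]

# One type string per input location
inputTypes = ["continuous"] * n_input_locations

# Consistency checks mirroring the parameter descriptions
assert len(rawInput) == len(rawInputTimes) == len(inputTypes)
for data, times in zip(rawInput, rawInputTimes):
    assert data.ndim == 2
    assert times.shape == (data.shape[0], 1)
```

The inventory and output lists follow the same pattern, with inventories holding masses rather than rates.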

Returns:

None
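The effect of dataOffset and rebaseToZero on a time series can be pictured with a small NumPy sketch. This is not MAPIT's implementation; the helper apply_offset is hypothetical and only mirrors the behavior described in the parameter list (times before the offset are dropped, and the remaining times are optionally shifted so the offset becomes zero).

```python
import numpy as np


def apply_offset(times, values, data_offset=0, rebase_to_zero=True):
    """Drop samples taken before data_offset and optionally rebase times.

    Hypothetical helper for illustration; not a MAPIT function.
    """
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    keep = times.ravel() >= data_offset        # exclude data before the offset
    kept_times = times[keep]
    if rebase_to_zero:
        kept_times = kept_times - data_offset  # e.g., t=274 becomes t=1
    return kept_times, values[keep]


times = np.arange(270, 277).reshape(-1, 1)     # timestamps 270..276
values = np.ones_like(times, dtype=float)

rebased_times, kept_values = apply_offset(times, values, data_offset=273)
original_times, _ = apply_offset(times, values, data_offset=273, rebase_to_zero=False)
```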

calcCUMUF()#

Calculates cumulative MUF using StatsTests.CUMUF. The result is returned and stored as an attribute after the calculation is complete. Automatically calculates MUF if not present as an attribute.

Returns: CUMUF sequence with identical shape to the input MUF.
Return type: ndarray
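As a rough illustration of what a cumulative MUF looks like, the sketch below takes a running sum of a small MUF array along the balance dimension. It assumes the usual definition of CUMUF as the running sum of MUF and an array of shape [n, j]; it does not call StatsTests.CUMUF, whose internals may differ.

```python
import numpy as np

# Made-up MUF values with shape [n=3 balances, j=2 iterations]
muf = np.array([
    [0.2, -0.1],
    [-0.3, 0.4],
    [0.1, 0.1],
])

# Running sum over balance periods, computed separately for each iteration
cumuf = np.cumsum(muf, axis=0)
```

The output has the same shape as the input, consistent with the return description above.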

calcErrors()#

Function that applies the specified error matrices to the supplied raw data and stores the results as object attributes. Uses the Preprocessing.SimErrors implementation.

Returns: None
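The sketch below shows one common way that random and systematic relative standard deviations are applied to data: a multiplicative model in which the random error is redrawn for every sample and the systematic error is drawn once and shared. This model is an assumption for illustration; the function simulate_measurement is hypothetical and the details of Preprocessing.SimErrors (including calibration periods) are not reproduced here.

```python
import numpy as np


def simulate_measurement(true_values, rel_random, rel_systematic, rng):
    """Apply a multiplicative error model: measured = true * (1 + R + S).

    Hypothetical helper; assumes a single calibration period.
    """
    true_values = np.asarray(true_values, dtype=float)
    # Random error: an independent draw for every sample
    random_err = rng.normal(0.0, rel_random, size=true_values.shape)
    # Systematic error: one draw shared by all samples
    systematic_err = rng.normal(0.0, rel_systematic)
    return true_values * (1.0 + random_err + systematic_err)


rng = np.random.default_rng(42)
truth = np.full((5, 1), 10.0)

measured = simulate_measurement(truth, rel_random=0.01, rel_systematic=0.005, rng=rng)
exact = simulate_measurement(truth, rel_random=0.0, rel_systematic=0.0, rng=rng)
```

With both standard deviations set to zero the simulated measurement equals the true value, which is a quick sanity check on the model.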

calcGEMUF_V1()#

Calculates the GEMUFV1 transform using the singular, current MUF value to estimate the unknown loss vector (e.g., $ZG_{i} = M_{i}^{T} \Sigma_{N}^{-1} \mathrm{muf}_{i}$). Automatically calculates relevant quantities needed for the calculation if not already present (e.g., covariance matrix, simulated measurement error, etc.).

Returns: GEMUFV1 sequence with shape $[n, j]$, where $n$ is a length equal to the maximum time, which is based on the number of material balances that can be constructed given the user-provided mbaTime and the number of samples in the input data, and $j$ is the number of iterations given as input. The term $n$ is calculated by taking the minimum over the final times of the provided inputs.
Return type: ndarray

calcGEMUF_V5B3()#

Calculates the GEMUFV5B3 transform using a weighted series of MUF values to estimate the unknown loss vector (e.g., $ZG_{i} = M_{i}^{T} \Sigma_{N}^{-1} \mathrm{muf}_{i}$). Automatically calculates relevant quantities needed for the calculation if not already present (e.g., covariance matrix, simulated measurement error, etc.).

Returns: GEMUFV5B3 sequence with shape $[n, j]$, where $n$ is a length equal to the maximum time, which is based on the number of material balances that can be constructed given the user-provided mbaTime and the number of samples in the input data, and $j$ is the number of iterations given as input. The term $n$ is calculated by taking the minimum over the final times of the provided inputs.
Return type: ndarray

calcMUF()#

Calculates MUF using StatsTests.MUF. The result is returned and stored as an attribute after the calculation is complete.

Returns:

MUF sequence with shape $[n, j]$, where $n$ is a length equal to the maximum time, which is based on the number of material balances that can be constructed given the user-provided mbaTime and the number of samples in the input data, and $j$ is the number of iterations given as input. The term $n$ is calculated by taking the minimum over the final times of the provided inputs.

For example:

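import numpy as np

# illustrative time series; only the final timestamps matter here
time1 = np.arange(0, 401)  # time1[-1] == 400
time2 = np.arange(0, 301)  # time2[-1] == 300
time3 = np.arange(0, 801)  # time3[-1] == 800

# n is limited by the series that ends earliest
n = np.floor(
        np.min(
        (time1[-1], time2[-1], time3[-1])))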

Tip

MAPIT doesn’t assume that time series provided have zero value if unspecified. For example, if a time series starts at t=800, it is assumed that values before t=800 are undefined so MUF cannot be calculated before t=800. The user can modify input data such that values before t=800 are present, but zero, if that assumption is valid.

Return type:

ndarray
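For orientation, a single material balance can be sketched as inputs minus outputs minus the change in inventory, with the flow rates integrated over the balance period first. This is the textbook form of MUF, written with made-up numbers; the helper material_balance is hypothetical and StatsTests.MUF handles details (multiple locations, discrete versus continuous flows, iterations) that are not shown.

```python
import numpy as np


def material_balance(total_in, total_out, inventory_begin, inventory_end):
    """MUF = inputs - outputs - (ending inventory - beginning inventory).

    Hypothetical helper for illustration; not a MAPIT function.
    """
    return total_in - total_out - (inventory_end - inventory_begin)


dt = 1.0                           # hr between samples (assumed)
input_rate = np.full(100, 2.0)     # kg/hr over one balance period
output_rate = np.full(100, 1.9)    # kg/hr over the same period

# Integrate the rates over the balance period (simple rectangle rule)
total_in = np.sum(input_rate * dt)     # about 200 kg
total_out = np.sum(output_rate * dt)   # about 190 kg

muf = material_balance(total_in, total_out, inventory_begin=50.0, inventory_end=59.5)
```

Here 10 kg more entered than left and the inventory rose by 9.5 kg, leaving about 0.5 kg unaccounted for.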

calcPageTT()#

Calculates Page’s trend test on SITMUF using StatsTests.PageTrendTest. The result is returned and stored as an attribute after the calculation is complete. Automatically calculates SITMUF if not present as an attribute.

Returns: The results of the trend test which has shape $[m, n]$ .
Return type: ndarray
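Page's trend test is commonly written as a one-sided cumulative sum that resets at zero. The sketch below shows that recursion for a single sequence. It is an assumption-laden illustration: the function name page_trend_test and the reference value k=0.5 are chosen here for demonstration and are not taken from StatsTests.PageTrendTest.

```python
import numpy as np


def page_trend_test(sitmuf, k=0.5):
    """One-sided Page statistic: S_i = max(0, S_{i-1} + x_i - k).

    Hypothetical helper; k is an assumed reference value.
    """
    sitmuf = np.asarray(sitmuf, dtype=float)
    stat = np.zeros_like(sitmuf)
    running = 0.0
    for i, x in enumerate(sitmuf):
        # Accumulate excursions above k and reset when the sum goes negative
        running = max(0.0, running + x - k)
        stat[i] = running
    return stat


page = page_trend_test([0.0, 1.0, 2.0, -3.0, 1.5])
```

A sustained positive shift in the input drives the statistic upward, while isolated negative values pull it back to zero.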

calcSEMUF()#

Calculates $σ$ MUF using StatsTests.SEMUF. The result is returned and stored as an attribute after the calculation is complete. Automatically calculates MUF if not present as an attribute.

Returns:

SEID (ndarray): sequence with shape $[n, j, 1]$ where $n$ is the number of material balances and $j$ is the number of iterations given as input. The term $n$ is calculated by finding the minimum of each of the provided input times.
SEMUFContribR (ndarray): the random contribution to the overall SEMUF with shape $[j, l, n]$ where $j$ is the number of iterations given as input, $l$ is the total number of locations stacked in the order [inputs, inventories, outputs] and $n$ is the number of material balances.
SEMUFContribS (ndarray): the systematic contribution to the overall SEMUF with shape $[j, l, n]$ where $j$ is the number of iterations given as input, $l$ is the total number of locations stacked in the order [inputs, inventories, outputs] and $n$ is the number of material balances.
ObservedValues (ndarray): the observed values used to calculate SEMUF with shape $[j, l, n]$ where $j$ is the number of iterations given as input, $l$ is the total number of locations stacked in the order [inputs, inventories, outputs] and $n$ is the number of material balances.

Return type:

tuple (ndarray, ndarray, ndarray, ndarray)

calcSITMUF()#

Calculates the SITMUF transform using Picard’s formulation with the Cholesky decomposition ($C_{i}^{-1} \mathrm{muf}_{i}$). Automatically calculates relevant quantities needed for the calculation if not already present (e.g., covariance matrix, simulated measurement error, etc.).

Returns: SITMUF sequence with shape $[n, j]$, where $n$ is a length equal to the maximum time, which is based on the number of material balances that can be constructed given the user-provided mbaTime and the number of samples in the input data, and $j$ is the number of iterations given as input. The term $n$ is calculated by taking the minimum over the final times of the provided inputs.
Return type: ndarray
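The Cholesky-based transform can be illustrated on a small, made-up covariance matrix. The sketch assumes the covariance of the MUF sequence factors as $\Sigma = C C^{T}$ with $C$ lower triangular, so that $C^{-1} \mathrm{muf}$ has identity covariance. The numbers are invented and this is not MAPIT's code path.

```python
import numpy as np

# Made-up covariance of a three-balance MUF sequence (symmetric, positive definite)
cov = np.array([
    [1.0, 0.3, 0.1],
    [0.3, 1.2, 0.4],
    [0.1, 0.4, 0.9],
])
muf = np.array([0.5, -0.2, 0.8])

# Lower-triangular factor such that cov = C @ C.T
C = np.linalg.cholesky(cov)

# Standardized, decorrelated sequence: solve C x = muf instead of inverting C
sitmuf = np.linalg.solve(C, muf)

# Check: applying the same transform to the covariance yields the identity
C_inv = np.linalg.inv(C)
whitened_cov = C_inv @ cov @ C_inv.T
```

Because $C$ is lower triangular, the i-th transformed value depends only on the first i MUF values, so the sequence can be extended as new balances arrive.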
