StatsProcessor#
Overview#
The StatsProcessor module contains the MBArea object, which is the foundation of the MAPIT API. The first step in using MAPIT is to define a material balance area (i.e., MBArea). This object takes a number of parameters that are used to define a material balance area. The initial properties can later be modified by accessing the specific object properties.
After the MBArea is successfully defined, different statistical tests can be applied to the MBArea by calling object methods. The results are returned after calling the method, but results are also stored as object attributes that can be easily accessed.
Tip
The MBArea object is designed to streamline the analysis experience while providing flexibility. For example, an MBArea could be initialized, copied, and then have a few properties modified to compare “what-if” scenarios.
import copy
# initialize with some variables
MBA0 = MBArea(...)
# clone MBArea
MBA1 = copy.copy(MBA0)
# modify input term errors
MBA1.inputErrorMatrix = otherErrorMatrix
# calculate sigma MUF
MBA0.calcSEMUF()
MBA1.calcSEMUF()
# do comparison between baseline and modified input error cases
# ...
Important
If the error matrix is modified after errors or a statistical quantity have been calculated, the errors must be recalculated using the calcErrors method.
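One caveat for the copy-then-modify workflow above: copy.copy makes a shallow copy, so the clone initially shares its attribute objects with the original. The sketch below uses a hypothetical stand-in class (FakeArea, not part of MAPIT) with nested lists in place of ndarrays to show which edits stay isolated. How MBArea stores its data internally is not described here, so this is a general Python illustration rather than a statement about MAPIT.

```python
import copy


class FakeArea:
    """Hypothetical stand-in for MBArea; it only holds an error matrix."""

    def __init__(self, inputErrorMatrix):
        self.inputErrorMatrix = inputErrorMatrix


baseline = FakeArea([[0.01, 0.01], [0.02, 0.02]])
deep = copy.deepcopy(baseline)

# Rebinding the attribute on a shallow clone leaves the baseline untouched.
rebound = copy.copy(baseline)
rebound.inputErrorMatrix = [[0.05, 0.05], [0.05, 0.05]]
rebind_is_isolated = baseline.inputErrorMatrix[0][0] == 0.01

# Editing the shared matrix in place through a shallow clone also changes
# the baseline, because both objects reference the same nested list.
shared = copy.copy(baseline)
shared.inputErrorMatrix[0][0] = 0.5
inplace_edit_leaks = baseline.inputErrorMatrix[0][0] == 0.5

# A deep copy taken before the in-place edit is unaffected by it.
deep_is_isolated = deep.inputErrorMatrix[0][0] == 0.01
```

Assigning a whole new matrix to the clone, as in the tip above, is safe in this sketch; copy.deepcopy is the conservative choice if individual entries will be edited in place.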
Parallel Processing#
MAPIT provides parallel processing capabilities through the Ray library (https://www.ray.io/). By default, Ray provides a local dashboard at 127.0.0.1:8265 that can be used to monitor progress and view job-related statistics. Two key parameters control parallel processing: ncpu and nbatch. ncpu controls the number of CPUs provided to Ray, whereas nbatch is the number of iterations to process in each task. Once these are provided, each Ray worker (the total number of workers is equal to ncpu) works through a queue of tasks. Each task returns nbatch of the iterations requested by the user, and workers process tasks until the queue is empty. The table below shows how the user-specified variables iterations, ncpu, and nbatch determine the total number of tasks and the number of tasks completed by each worker.
| iterations | ncpu | nbatch | total number of tasks | tasks completed per worker |
|---|---|---|---|---|
| 100 | 5 | 1 | 100 | 20 |
| 100 | 5 | 5 | 20 | 4 |
| 100 | 5 | 20 | 5 | 1 |
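The arithmetic behind the table can be sketched in a few lines. The helper name task_layout is hypothetical and not part of MAPIT. Rounding up when iterations is not a multiple of nbatch, and assuming an even split of tasks across workers, are both assumptions made for illustration.

```python
import math


def task_layout(iterations, ncpu, nbatch):
    """Return (total tasks, average tasks per worker) for a given setup."""
    # Each task covers nbatch iterations; round up for a partial final task.
    total_tasks = math.ceil(iterations / nbatch)
    # Idealized even split of the task queue across ncpu workers.
    tasks_per_worker = total_tasks / ncpu
    return total_tasks, tasks_per_worker


# The three parameter combinations shown in the table.
for nbatch in (1, 5, 20):
    print(nbatch, task_layout(100, 5, nbatch))
```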
nbatch is provided as a parameter because there is overhead incurred when copying data to and from workers. If nbatch is too small, parallel processing might be slower than sequential processing when the calculation time is small compared to the memory copying time. We do not provide guidance on setting these parameters because performance is system specific.
Classes#
- class MAPIT.core.StatsProcessor.MBArea(rawInput, rawInventory, rawOutput, rawInputTimes, rawInventoryTimes, rawOutputTimes, inputErrorMatrix, inventoryErrorMatrix, outputErrorMatrix, mbaTime, iterations=1, dopar=False, ncpu=1, nbatch=1, GUIObject=None, dataOffset=0, rebaseToZero=True, doTQDM=True)#
Object representing a material balance area.
- Parameters:
rawInput (list of ndarrays) – Raw input data for the material balance area, list of 2D ndarrays. Each entry in the list should correspond to a different location, and each ndarray in the list should have shape [MxN], where M is the sample dimension (number of samples) and N is the isotopic dimension, if applicable. If only considering one isotope, each ndarray in the rawInput list should be [Mx1]. The values are expected to have rate units (e.g., kg/hr) as this quantity will be integrated.
rawInventory (list of ndarrays) – Raw inventory data for the material balance area, list of 2D ndarrays. Shape structure is the same as rawInput. The values are expected to have mass units (e.g., kg) as this quantity will not be integrated.
rawOutput (list of ndarrays) – Raw output data for the material balance area, list of 2D ndarrays. Shape structure is the same as rawInput. The values are expected to have rate units (e.g., kg/hr) as this quantity will be integrated.
rawInputTimes (list of ndarrays) – A list of ndarrays with length equal to the total number of input locations. Each array should have shape \([m,1]\), where \(m\) is the number of samples. len(rawInputTimes) should equal len(rawInput), and the sample dimension of each list entry (ndarray) should match the corresponding entry in rawInput. Each entry in each ndarray should correspond to a timestamp indicating when the value was taken.
rawInventoryTimes (list of ndarrays) – A list of ndarrays with length equal to the total number of inventory locations. Shape structure is the same as rawInputTimes.
rawOutputTimes (list of ndarrays) – A list of ndarrays with length equal to the total number of output locations. Shape structure is the same as rawInputTimes.
inputErrorMatrix (ndarray) – 2D ndarray of shape [Mx2] describing the relative standard deviation to apply to rawInput. M is the sample dimension in each input array and should be identical to the M described in rawInput. The second dimension (of size 2) refers to the random and systematic error, respectively, such that ErrorMatrix[0,0] refers to the random relative standard deviation of the first location and ErrorMatrix[0,1] refers to the systematic relative standard deviation of the first location.
inventoryErrorMatrix (ndarray) – 2D ndarray with the same shape structure as inputErrorMatrix describing errors to apply to rawInventory.
outputErrorMatrix (ndarray) – 2D ndarray with the same shape structure as inputErrorMatrix describing errors to apply to rawOutput.
mbaTime (int) – The material balance period.
iterations (int, default=1) – Number of statistical realizations.
dopar (bool, default=False) – Controls the use of parallel processing provided by Ray. If used, progress can be monitored on a local dashboard that is accessible at http://127.0.0.1:8265.
ncpu (int, default=1) – The number of CPUs to use if parallel processing is enabled.
nbatch (int, default=1) – The number of iterations to process in each task.
GUIObject (object, default=None) – An object containing MAPIT GUI parameters. Only used internally by the GUI.
dataOffset (int, default=0) – Offset to apply to the data. If specified, data before this value in time will be removed. For example, if dataOffset=273, then any data with a corresponding time before 273 will be excluded from calculations.
rebaseToZero (bool, default=True) – Used in conjunction with dataOffset. If true, then times after dataOffset will be rebased to start at zero (i.e., if dataOffset=273, then t=274 will be rebased to be t=1).
doTQDM (bool, default=True) – Boolean used to control the progress bar shown during calculations.
- Returns:
None
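The interaction between dataOffset and rebaseToZero described above can be sketched with plain lists. The function apply_offset is a hypothetical illustration, not MAPIT's implementation. In particular, keeping a sample that falls exactly at the offset is an assumption read from "data ... before 273 will be excluded".

```python
def apply_offset(times, values, data_offset=0, rebase_to_zero=True):
    """Drop samples before data_offset and optionally shift times to start at 0."""
    # Assumption: a sample exactly at data_offset is kept.
    kept = [(t, v) for t, v in zip(times, values) if t >= data_offset]
    if rebase_to_zero:
        kept = [(t - data_offset, v) for t, v in kept]
    new_times = [t for t, _ in kept]
    new_values = [v for _, v in kept]
    return new_times, new_values


times = [271, 272, 273, 274, 275]
values = [1.0, 1.1, 1.2, 1.3, 1.4]

# With rebasing, t=274 becomes t=1, matching the parameter description.
rebased_times, rebased_values = apply_offset(times, values, data_offset=273)

# Without rebasing, the surviving samples keep their original timestamps.
unshifted_times, _ = apply_offset(
    times, values, data_offset=273, rebase_to_zero=False
)
```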
- calcCUMUF()#
Calculates cumulative MUF using StatsTests.CUMUF. The result is returned and stored as an attribute after the calculation is complete. Automatically calculates MUF if not present as an attribute.
- Returns:
CUMUF sequence with identical shape to the input MUF.
- Return type:
ndarray
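As a rough illustration of what a cumulative MUF sequence looks like, the sketch below takes a running sum over one realization of MUF values. This assumes CUMUF is the running sum of MUF over successive balance periods. The internals of StatsTests.CUMUF are not shown here, and the helper name cumulative_muf is hypothetical.

```python
from itertools import accumulate


def cumulative_muf(muf):
    """Running sum of a MUF sequence (one statistical realization)."""
    return list(accumulate(muf))


# Made-up MUF values for four consecutive balance periods.
muf = [2, -1, 3, 0]
cumuf = cumulative_muf(muf)
```

The output has the same length as the input, consistent with the statement that the CUMUF sequence has the same shape as the input MUF.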
- calcErrors()#
Function that applies the specified error matrices to the supplied raw data and stores the results as object attributes. Uses the Preprocessing.SimErrors implementation.
- Returns:
None
- calcMUF()#
Calculates MUF using StatsTests.MUF. The result is returned and stored as an attribute after the calculation is complete.
- Returns:
MUF sequence with shape \([n,j]\), where \(n\) is the length equal to the maximum time based on the number of material balances that can be constructed given the user-provided mbaTime and the number of samples in the input data, and \(j\) is the number of iterations given as input. The term \(n\) is calculated by finding the minimum of the final timestamps of the provided input times.
For example:
import numpy as np
# final timestamp of each provided time series (illustrative values)
time1_end = 400
time2_end = 300
time3_end = 800
n = np.floor(np.min((time1_end, time2_end, time3_end)))  # n = 300.0
Tip
MAPIT doesn’t assume that time series provided have zero value if unspecified. For example, if a time series starts at t=800, it is assumed that values before t=800 are undefined so MUF cannot be calculated before t=800. The user can modify input data such that values before t=800 are present, but zero, if that assumption is valid.
- Return type:
ndarray
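For orientation, the textbook material balance for one period is MUF = inputs - outputs - (ending inventory - beginning inventory). The sketch below applies that definition to already-integrated transfers. It is not MAPIT's StatsTests.MUF implementation, and the function name and data are hypothetical.

```python
def material_balance(inputs, outputs, inventories):
    """Textbook MUF per balance period.

    inputs, outputs: integrated transfers for each balance period.
    inventories: inventory at each period boundary (one more entry
    than there are periods).
    """
    muf = []
    for i in range(len(inputs)):
        delta_inventory = inventories[i + 1] - inventories[i]
        muf.append(inputs[i] - outputs[i] - delta_inventory)
    return muf


# Made-up values in kg for three balance periods.
inputs = [10, 12, 11]
outputs = [9, 11, 10]
inventories = [50, 51, 52, 52]
muf = material_balance(inputs, outputs, inventories)
```

This mirrors the unit expectations in the parameter descriptions: input and output rates are integrated over each period, while inventories are taken as point values at the period boundaries.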
- calcPageTT()#
Calculates Page’s trend test on SITMUF using StatsTests.PageTrendTest. The result is returned and stored as an attribute after the calculation is complete. Automatically calculates SITMUF if not present as an attribute.
- Returns:
The results of the trend test, which have shape \([m,n]\).
- Return type:
ndarray
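Page's test is commonly written as a one-sided cumulative sum (CUSUM) recursion, S_i = max(0, S_{i-1} + x_i - k). The sketch below implements that general form for a single sequence. The reference value k=0.5 and the function name page_cusum are assumptions for illustration, and StatsTests.PageTrendTest may differ in its details.

```python
def page_cusum(sequence, k=0.5):
    """One-sided CUSUM: accumulate excess over k, resetting at zero."""
    stat = 0.0
    out = []
    for x in sequence:
        stat = max(0.0, stat + x - k)
        out.append(stat)
    return out


# Made-up standardized residuals standing in for one SITMUF realization.
sitmuf = [0.5, 1.5, -2.0, 2.5]
trend = page_cusum(sitmuf, k=0.5)
```

The statistic grows while values stay above k and resets to zero after a sufficiently negative value, which is what makes it sensitive to sustained trends rather than isolated spikes.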
- calcSEMUF()#
Calculates \(\sigma\) MUF using StatsTests.SEMUF. The result is returned and stored as an attribute after the calculation is complete. Automatically calculates MUF if not present as an attribute.
- Returns:
SEID (ndarray): sequence with shape \([n,j,1]\) where \(n\) is the number of material balances and \(j\) is the number of iterations given as input. The term \(n\) is calculated by finding the minimum of each of the provided input times.
SEMUFContribR (ndarray): the random contribution to the overall SEMUF with shape \([j,l,n]\) where \(j\) is the number of iterations given as input, \(l\) is the total number of locations stacked in the order [inputs, inventories, outputs] and \(n\) is the number of material balances.
SEMUFContribS (ndarray): the systematic contribution to the overall SEMUF with shape \([j,l,n]\) where \(j\) is the number of iterations given as input, \(l\) is the total number of locations stacked in the order [inputs, inventories, outputs] and \(n\) is the number of material balances.
ObservedValues (ndarray): the observed values used to calculate SEMUF with shape \([j,l,n]\) where \(j\) is the number of iterations given as input, \(l\) is the total number of locations stacked in the order [inputs, inventories, outputs] and \(n\) is the number of material balances.
- Return type:
tuple (ndarray, ndarray, ndarray, ndarray)
- calcSITMUF()#
Calculates SITMUF using StatsTests.SITMUF. The result is returned and stored as an attribute after the calculation is complete. Automatically calculates MUF if not present as an attribute.
- Returns:
SITMUF sequence with shape \([n,j]\), where \(n\) is the length equal to the maximum time based on the number of material balances that can be constructed given the user-provided mbaTime and the number of samples in the input data, and \(j\) is the number of iterations given as input. As is the case with MUF, the term \(n\) is calculated by finding the minimum of the final timestamps of the provided input times.
- Return type:
ndarray