GEMUF#
Historical Context#
There have been various schemes for analyzing the material balance sequence (i.e., MUF). The SITMUF approach attempts to develop a sequence of residuals wherein the MUF sequence is converted to a standardized sequence and monitored for trend changes.
In contrast, GEMUF (geschatzter, or estimated, MUF) attempts instead to develop a distance-based metric to detect anomalies. However, GEMUF is very sensitive to misspecified covariance matrices.
Theory#
Recall that the muf sequence is defined as follows:
With
The covariance matrix contains the covariance between different material balances in the sequence. For example, consider the entry
The simplest statistical test to detect a loss would be to simply test two hypotheses:
For all loss patterns,
With the test formulated as:
There’s two challenges with this test. First, the test doesn’t provide sequential decisions (not necessarily a problem considering the test can still be calculated sequentially, which we will do). This can be remedied by simply calculating the test statistic for each period and making decisions as such:
with decision process:
and for the final period:
Here all interim test statistic calculations are required to be below a threshold for
Second, and more problematic, is the requirement that the loss pattern,
The test statistic then becomes
The details behind the covariance calculation can be found in sitmuf theory
Siefert noted that using a single MUF value at each step (i.e., M_i$) can lead to significant variance. It was proposed to use a weighted value such that:
This approach has lower variance, but is no longer unbiased. In MAPIT we implement both approaches, the use of the single MUF value is V1 and the use of the weighted values is V5B3, following the notation in the original paper.
GEMUF implementation#
Since both GEMUF and SITMUF both require calculation of the covariance matrix, this is only done once and then used for any requested SITMUF or GEMUF calculations. Details of the covariance matrix calculation can be found in the SITMUF section and won’t be repeated here.
The GEMUF functions are responsible for calculating the GEMUF sequence (either V1 or V5B3), effectively performing the calculation below as it’s assumed the covariance matrix has already been calculated:
Unlike the SITMUF sequence, GEMUF is calculated sequentially for each balance period. That is, for a given MUF sequence, only a single test statistic is produced. Therefore, we have to loop over the entire MUF sequence:
StatsTests.py
769 for ZR in range(1, int(nMBP)):
770 IDs[ZR] = MUF[k,ZR*MBP]
771 tempID = IDs[:ZR]
772 tempcovmatrix = covmatrix[k,:ZR,:ZR]
773 ZG = np.matmul(np.matmul(np.transpose(tempID), np.linalg.inv(tempcovmatrix)), tempID)
774 GEMUFCalcsV1[k,int((ZR - 1) * MBP):int(ZR * MBP)] = np.ones((MBP,)) * ZG
We generate per-balance test statistics for GEMUF by considering subsections of the original sequence with increasing covariance matrix size:
StatsTests.py
769 for ZR in range(1, int(nMBP)):
770 IDs[ZR] = MUF[k,ZR*MBP]
771 tempID = IDs[:ZR]
772 tempcovmatrix = covmatrix[k,:ZR,:ZR]
773 ZG = np.matmul(np.matmul(np.transpose(tempID), np.linalg.inv(tempcovmatrix)), tempID)
774 GEMUFCalcsV1[k,int((ZR - 1) * MBP):int(ZR * MBP)] = np.ones((MBP,)) * ZG
For GEMUF-V1, the test statistic is straightforward to calculate:
StatsTests.py
769 for ZR in range(1, int(nMBP)):
770 IDs[ZR] = MUF[k,ZR*MBP]
771 tempID = IDs[:ZR]
772 tempcovmatrix = covmatrix[k,:ZR,:ZR]
773 ZG = np.matmul(np.matmul(np.transpose(tempID), np.linalg.inv(tempcovmatrix)), tempID)
774 GEMUFCalcsV1[k,int((ZR - 1) * MBP):int(ZR * MBP)] = np.ones((MBP,)) * ZG
Following the calculation of GEMUF-V1, the discrete sequence of GEMUF-V1 values are converted to a continuous time series.
The GEMUF-V5B3 calculation differs slightly from the GEMUF-V1 calculation. The GEMUF-V5B3 still iterates over the material balance periods, but a MUF ‘‘window’’ is created that corresponds to values needed in the weighting. Note the creation of the MUF window, generated from the continuous MUF values. The creation of this window is only valid when 5 consecutive MUF values are available:
StatsTests.py
848 if ZR>=2 and ZR+2<int(nMBP):
849 msc = 0
850 for ZR2 in range(ZR-2,ZR+3):
851 IDWindow[msc] = MUF[k, ZR2*MBP]
852 msc += 1
If there’s a valid window, then the GEMUF-V5B3 is calculated. It’s important to note that we store the calculated weighted MUF values as the GEMUF-V5B3 calculation requires a sequence of values, not just the instantaneous value (in the code, MSSeq
holds the weighted MUF values).
StatsTests.py
856 if ZR>=2 and ZR+2<int(nMBP):
857 MS = (1/7)*(IDWindow[0] + IDWindow[1] + 3*IDWindow[2] + IDWindow[3] + IDWindow[4])
858 MSSeq[k,ZR] = MS
859 ZG = np.matmul(np.matmul(np.transpose(MSSeq[k,:ZR].reshape((-1,))), np.linalg.inv(tempcovmatrix)), tempID)
After calculating the the discrete GEMUF-V5B3 sequence, it is converted to a continuous time series before being returned.
Important
The GEMUF-V5B3 test isn’t valid for the first and last two material balance periods because it requires a weighted average that includes the two previous and future balances. Seifert’s original paper didn’t describe what to do with the test on the tails of the balance sequence, so we choose to represent those values as np.nan
. It’s important to note this as sometimes plotting libraries will drop those values when plotting. For example, matplotlib will often drop np.nan
values resulting in a plotted sequence that appears to be missing the first two and last two sequence intervals.
Further reading#
-
Note this is 675 pages long as the original GEMUF test paper is part of the IAEA symposium collection from 1986.