GEMUF#

Historical Context#

There have been various schemes for analyzing the material balance sequence (i.e., MUF). The SITMUF approach attempts to develop a sequence of residuals wherein the MUF sequence is converted to a standardized sequence and monitored for trend changes.

In contrast, GEMUF (geschatzter, or estimated, MUF) attempts instead to develop a distance-based metric to detect anomalies. However, GEMUF is very sensitive to misspecified covariance matrices.

Theory#

Recall that the muf sequence is defined as follows:

muf={muf0,muf1,...mufn}

With

mufi=ll0t=MBPi1MBPiIt,lll1t=MBPi1MBPiOt,lll2(Ci,lCi1,l)

The covariance matrix contains the covariance between different material balances in the sequence. For example, consider the entry σ2n2 of the covariance matrix below. This term is the variance between material balance n and 2.

(1)#Σ=(σ112σ122σ1n2σ212σ222σ2n2σn12σn22σnn2)=(Σi1σi1σi1Tσi,i)

The simplest statistical test to detect a loss would be to simply test two hypotheses:

H0:E(mufi)=0 for i{1,2,...,N}H1:E(mufi)=Mi for i{1,2,...,N}whereMi=M>0

For all loss patterns, MNT={M1,M2,...MN}, where Mi is the loss in period i, the optimal test to compare H0 and H1 is a Neyman-Pearson test. Siefert showed the test statistic can be defined as:

Z=MNTΣN1mufN

With the test formulated as:

Z{>kα: reject H0kα: reject H1

There’s two challenges with this test. First, the test doesn’t provide sequential decisions (not necessarily a problem considering the test can still be calculated sequentially, which we will do). This can be remedied by simply calculating the test statistic for each period and making decisions as such:

ZGi=MiTΣN1mufi

with decision process:

ZGi{>s(i): reject H0s(i): no decision 

and for the final period:

ZGN{>s(N): reject H0s(N): no reject H1

Here all interim test statistic calculations are required to be below a threshold for H0 to be rejected.

Second, and more problematic, is the requirement that the loss pattern, MN is known. It is reasonable to approximate MN as MNmufN such that MN^=mufN by considering that E(mufi)=Mi.

The test statistic then becomes

Z=(MN^)TΣN1mufN=mufNTΣN1mufN

The details behind the covariance calculation can be found in sitmuf theory

Siefert noted that using a single MUF value at each step (i.e., M_i$) can lead to significant variance. It was proposed to use a weighted value such that:

Mi=17(mufi2+mufi1+3mufi+mufi+1+mufi+2)

This approach has lower variance, but is no longer unbiased. In MAPIT we implement both approaches, the use of the single MUF value is V1 and the use of the weighted values is V5B3, following the notation in the original paper.

GEMUF implementation#

Since both GEMUF and SITMUF both require calculation of the covariance matrix, this is only done once and then used for any requested SITMUF or GEMUF calculations. Details of the covariance matrix calculation can be found in the SITMUF section and won’t be repeated here.

The GEMUF functions are responsible for calculating the GEMUF sequence (either V1 or V5B3), effectively performing the calculation below as it’s assumed the covariance matrix has already been calculated:

Z=(MN^)TΣN1mufN=mufNTΣN1mufN

Unlike the SITMUF sequence, GEMUF is calculated sequentially for each balance period. That is, for a given MUF sequence, only a single test statistic is produced. Therefore, we have to loop over the entire MUF sequence:

StatsTests.py

769    for ZR in range(1, int(nMBP)):
770        IDs[ZR] = MUF[k,ZR*MBP]
771        tempID = IDs[:ZR]
772        tempcovmatrix = covmatrix[k,:ZR,:ZR]
773        ZG = np.matmul(np.matmul(np.transpose(tempID), np.linalg.inv(tempcovmatrix)), tempID)
774        GEMUFCalcsV1[k,int((ZR - 1) * MBP):int(ZR * MBP)] = np.ones((MBP,)) * ZG

We generate per-balance test statistics for GEMUF by considering subsections of the original sequence with increasing covariance matrix size:

StatsTests.py

769    for ZR in range(1, int(nMBP)):
770        IDs[ZR] = MUF[k,ZR*MBP]
771        tempID = IDs[:ZR]
772        tempcovmatrix = covmatrix[k,:ZR,:ZR]
773        ZG = np.matmul(np.matmul(np.transpose(tempID), np.linalg.inv(tempcovmatrix)), tempID)
774        GEMUFCalcsV1[k,int((ZR - 1) * MBP):int(ZR * MBP)] = np.ones((MBP,)) * ZG

For GEMUF-V1, the test statistic is straightforward to calculate:

StatsTests.py

769    for ZR in range(1, int(nMBP)):
770        IDs[ZR] = MUF[k,ZR*MBP]
771        tempID = IDs[:ZR]
772        tempcovmatrix = covmatrix[k,:ZR,:ZR]
773        ZG = np.matmul(np.matmul(np.transpose(tempID), np.linalg.inv(tempcovmatrix)), tempID)
774        GEMUFCalcsV1[k,int((ZR - 1) * MBP):int(ZR * MBP)] = np.ones((MBP,)) * ZG

Following the calculation of GEMUF-V1, the discrete sequence of GEMUF-V1 values are converted to a continuous time series.

The GEMUF-V5B3 calculation differs slightly from the GEMUF-V1 calculation. The GEMUF-V5B3 still iterates over the material balance periods, but a MUF ‘‘window’’ is created that corresponds to values needed in the weighting. Note the creation of the MUF window, generated from the continuous MUF values. The creation of this window is only valid when 5 consecutive MUF values are available:

StatsTests.py

848        if ZR>=2 and ZR+2<int(nMBP):
849          msc = 0
850          for ZR2 in range(ZR-2,ZR+3):
851            IDWindow[msc] = MUF[k, ZR2*MBP]
852            msc += 1

If there’s a valid window, then the GEMUF-V5B3 is calculated. It’s important to note that we store the calculated weighted MUF values as the GEMUF-V5B3 calculation requires a sequence of values, not just the instantaneous value (in the code, MSSeq holds the weighted MUF values).

StatsTests.py

856        if ZR>=2 and ZR+2<int(nMBP):
857          MS = (1/7)*(IDWindow[0] + IDWindow[1] + 3*IDWindow[2] + IDWindow[3] + IDWindow[4])
858          MSSeq[k,ZR] = MS
859          ZG = np.matmul(np.matmul(np.transpose(MSSeq[k,:ZR].reshape((-1,))), np.linalg.inv(tempcovmatrix)), tempID)

After calculating the the discrete GEMUF-V5B3 sequence, it is converted to a continuous time series before being returned.

Important

The GEMUF-V5B3 test isn’t valid for the first and last two material balance periods because it requires a weighted average that includes the two previous and future balances. Seifert’s original paper didn’t describe what to do with the test on the tails of the balance sequence, so we choose to represent those values as np.nan. It’s important to note this as sometimes plotting libraries will drop those values when plotting. For example, matplotlib will often drop np.nan values resulting in a plotted sequence that appears to be missing the first two and last two sequence intervals.

Further reading#

  • The GEMUF Test (Seifert)

    • Note this is 675 pages long as the original GEMUF test paper is part of the IAEA symposium collection from 1986.