GEMUF#

Historical Context#

There have been various schemes for analyzing the material balance sequence (i.e., MUF). The SITMUF approach attempts to develop a sequence of residuals wherein the MUF sequence is converted to a standardized sequence and monitored for trend changes.

In contrast, GEMUF (geschätzter, or estimated, MUF) develops a distance-based metric to detect anomalies. However, GEMUF is very sensitive to misspecified covariance matrices.

Theory#

Recall that the muf sequence is defined as follows:

muf = \{{muf}_{0}, {muf}_{1}, \dots, {muf}_{n}\}

With

{muf}_{i} = \sum_{l \in l_{0}} \int_{t = {MBP}_{i - 1}}^{{MBP}_{i}} I_{t, l} \, dt - \sum_{l \in l_{1}} \int_{t = {MBP}_{i - 1}}^{{MBP}_{i}} O_{t, l} \, dt - \sum_{l \in l_{2}} (C_{i, l} - C_{i - 1, l})
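As a rough illustration of the balance above, the following sketch approximates each integral by a sum over discretely sampled flows. The function name `muf_sequence`, the array layout, and the fixed time step `dt` are assumptions made for this example; they are not taken from MAPIT.

```python
import numpy as np

def muf_sequence(inputs, outputs, inventories, mbp, dt=1.0):
    """Hypothetical helper: MUF for each material balance period.

    inputs, outputs: (n_locations, n_periods * mbp) sampled flow rates.
    inventories: (n_locations, n_periods + 1) inventory snapshots taken at
        each balance boundary; column 0 is the initial inventory.
    mbp: number of time steps in one material balance period.
    """
    n_periods = inventories.shape[1] - 1
    muf = np.empty(n_periods)
    for i in range(n_periods):
        window = slice(i * mbp, (i + 1) * mbp)
        # Rectangle-rule approximation of the input and output integrals
        total_in = inputs[:, window].sum() * dt
        total_out = outputs[:, window].sum() * dt
        # Inventory change between consecutive balance boundaries
        delta_inventory = (inventories[:, i + 1] - inventories[:, i]).sum()
        muf[i] = total_in - total_out - delta_inventory
    return muf

# Illustrative numbers only: one location, two balance periods of two steps
inputs = np.full((1, 4), 2.0)
outputs = np.full((1, 4), 1.5)
inventories = np.array([[10.0, 11.0, 11.5]])
muf = muf_sequence(inputs, outputs, inventories, mbp=2)
```

In this toy case the first period balances exactly, while the second leaves 0.5 units unaccounted for.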

The covariance matrix contains the covariances between the different material balances in the sequence. For example, the entry $σ_{2 n}^{2}$ of the covariance matrix below is the covariance between material balances $2$ and $n$.

(1)#

\begin{aligned}
Σ & = \begin{pmatrix} σ_{11}^{2} & σ_{12}^{2} & \dots & σ_{1 n}^{2} \\ σ_{21}^{2} & σ_{22}^{2} & \dots & σ_{2 n}^{2} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ σ_{n 1}^{2} & σ_{n 2}^{2} & \dots & σ_{n n}^{2} \end{pmatrix} = \begin{pmatrix} Σ_{i - 1} & σ_{i - 1} \\ σ_{i - 1}^{T} & σ_{i, i} \end{pmatrix}
\end{aligned}
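The block form on the right-hand side can be checked numerically. The sketch below splits a small, made-up symmetric covariance matrix into the leading block, the cross-covariance column, and the final variance, then reassembles it; the values and variable names are illustrative assumptions only.

```python
import numpy as np

# Made-up 3 x 3 covariance matrix (symmetric, illustrative values)
cov = np.array([
    [4.0, 1.0, 0.5],
    [1.0, 3.0, 0.8],
    [0.5, 0.8, 2.0],
])

i = cov.shape[0]
sigma_prev = cov[: i - 1, : i - 1]   # leading (i-1) x (i-1) block
cross = cov[: i - 1, i - 1 : i]      # covariances between balance i and earlier balances
var_i = cov[i - 1 : i, i - 1 : i]    # variance of balance i, kept 2-D for np.block

# Reassemble the matrix from its blocks
rebuilt = np.block([
    [sigma_prev, cross],
    [cross.T, var_i],
])
```

Slicing with `[:i, :i]` therefore yields the covariance matrix of the first `i` balances, which is the operation the sequential test relies on.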

The simplest statistical test to detect a loss would be to simply test two hypotheses:

\begin{aligned}
H_{0} &: E({muf}_{i}) = 0 \quad \text{for } i \in \{1, 2, \dots, N\} \\
H_{1} &: E({muf}_{i}) = M_{i} \quad \text{for } i \in \{1, 2, \dots, N\} \\
&\text{where } \sum_{i = 1}^{N} M_{i} = M > 0
\end{aligned}

For any loss pattern $M_{N}^{T} = (M_{1}, M_{2}, \dots, M_{N})$, where $M_{i}$ is the loss in period $i$, the optimal test to compare $H_{0}$ and $H_{1}$ is a Neyman-Pearson test. Seifert showed the test statistic can be defined as:

Z = M_{N}^{T} Σ_{N}^{- 1} {muf}_{N}

With the test formulated as:

Z \begin{cases} > k_{α} : \text{reject } H_{0} \\ \leq k_{α} : \text{reject } H_{1} \end{cases}
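A minimal numerical sketch of this test follows. It assumes the loss pattern is known and that the MUF sequence is Gaussian, in which case $Z$ has zero mean and variance $M_{N}^{T} Σ_{N}^{-1} M_{N}$ under $H_{0}$ and the threshold can be taken from a normal quantile. The function name and all numbers are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def neyman_pearson_statistic(loss_pattern, cov, muf):
    """Hypothetical helper: Z = M^T cov^-1 muf."""
    # Solve cov @ x = muf instead of forming the inverse explicitly
    return float(loss_pattern @ np.linalg.solve(cov, muf))

# Illustrative values only
loss_pattern = np.array([1.0, 2.0])
cov = np.array([[2.0, 0.0], [0.0, 4.0]])
muf = np.array([2.0, 4.0])

z = neyman_pearson_statistic(loss_pattern, cov, muf)

# Assumption: Gaussian muf, so under H0 Z ~ N(0, M^T cov^-1 M)
alpha = 0.05
z_std = np.sqrt(loss_pattern @ np.linalg.solve(cov, loss_pattern))
k_alpha = NormalDist().inv_cdf(1 - alpha) * z_std

reject_h0 = z > k_alpha
```

Using `np.linalg.solve` rather than `np.linalg.inv` is a design choice for this sketch; it avoids forming the inverse and is usually better conditioned.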

There are two challenges with this test. First, the test doesn’t provide sequential decisions. This can be remedied by calculating the test statistic for each period, using the first $i$ balances and the corresponding leading $i \times i$ block of the covariance matrix, and making decisions as follows:

{ZG}_{i} = M_{i}^{T} Σ_{i}^{- 1} {muf}_{i}

with decision process:

{ZG}_{i} \begin{cases} > s (i) : \text{reject } H_{0} \\ \leq s (i) : \text{no decision} \end{cases}

and for the final period:

{ZG}_{N} \begin{cases} > s (N) : \text{reject } H_{0} \\ \leq s (N) : \text{reject } H_{1} \end{cases}

Here every interim test statistic must remain at or below its threshold, $s (i)$, for $H_{1}$ to be rejected at the final period; $H_{0}$ is rejected as soon as any statistic exceeds its threshold.
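The decision process can be sketched as a scan for the first period whose statistic exceeds its threshold. The function name `first_alarm` and the constant thresholds are assumptions for illustration; choosing $s (i)$ to meet a target false alarm rate is outside the scope of this sketch.

```python
def first_alarm(zg, thresholds):
    """Hypothetical helper: 1-based period at which H0 is first rejected.

    Returns None when no statistic exceeds its threshold through the
    final period.
    """
    for period, (statistic, threshold) in enumerate(zip(zg, thresholds), start=1):
        if statistic > threshold:
            return period
    return None

# Illustrative values only
zg = [0.5, 1.2, 3.4, 0.1]
thresholds = [2.0, 2.0, 2.0, 2.0]

alarm_period = first_alarm(zg, thresholds)
quiet_period = first_alarm([0.5, 1.2], [2.0, 2.0])
```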

Second, and more problematic, is the requirement that the loss pattern, $M_{N}$, is known. Because $E ({muf}_{i}) = M_{i}$, it is reasonable to estimate the loss pattern with the observed sequence itself, such that $\hat{M}_{N} = {muf}_{N}$.

The test statistic then becomes

\begin{aligned}
Z & = \hat{M}_{N}^{T} Σ_{N}^{- 1} {muf}_{N} \\
& = {muf}_{N}^{T} Σ_{N}^{- 1} {muf}_{N}
\end{aligned}
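This quadratic form is the squared Mahalanobis distance of the MUF vector from zero, which is why GEMUF can be described as a distance-based metric. The sketch below evaluates it for made-up values and then repeats the calculation with a covariance matrix that ignores the correlation between balances, to illustrate how a misspecified matrix shifts the statistic. The function name and numbers are assumptions.

```python
import numpy as np

def gemuf_statistic(muf, cov):
    """Hypothetical helper: Z = muf^T cov^-1 muf."""
    return float(muf @ np.linalg.solve(cov, muf))

# Illustrative values only
muf = np.array([1.0, 1.0])
cov = np.array([[2.0, 1.0], [1.0, 2.0]])

z = gemuf_statistic(muf, cov)

# Same variances, but the off-diagonal covariance is (wrongly) dropped
cov_misspecified = np.diag(np.diag(cov))
z_misspecified = gemuf_statistic(muf, cov_misspecified)
```

With these numbers the statistic moves from 2/3 to 1 when the correlation is ignored, a 50% change from a single omitted term.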

The details behind the covariance calculation can be found in the SITMUF theory section.

Seifert noted that using a single MUF value at each step (i.e., $M_{i} = {muf}_{i}$) can lead to significant variance. It was proposed to use a weighted value instead, such that:

M_{i} = \frac{1}{7} ({muf}_{i - 2} + {muf}_{i - 1} + 3 {muf}_{i} + {muf}_{i + 1} + {muf}_{i + 2})

This approach has lower variance but is no longer unbiased. MAPIT implements both approaches: the use of the single MUF value is V1 and the use of the weighted values is V5B3, following the notation in the original paper.
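The weighting can be sketched as a 5-point convolution with weights (1, 1, 3, 1, 1)/7. The function name `weighted_muf` is hypothetical, and marking the first and last two entries as `np.nan` is an assumption of this example, made because the full window is unavailable there.

```python
import numpy as np

def weighted_muf(muf):
    """Hypothetical helper: 5-point weighted MUF with NaN tails."""
    muf = np.asarray(muf, dtype=float)
    weights = np.array([1.0, 1.0, 3.0, 1.0, 1.0]) / 7.0
    smoothed = np.full(muf.shape, np.nan)
    if muf.size >= 5:
        # The kernel is symmetric, so convolution equals the weighted sum
        smoothed[2:-2] = np.convolve(muf, weights, mode="valid")
    return smoothed

# Illustrative values only
muf = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
smoothed = weighted_muf(muf)
```

Because the weights sum to one, a linear trend such as the one above passes through the interior points unchanged.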

GEMUF implementation#

Since GEMUF and SITMUF both require the covariance matrix, it is calculated only once and then reused for any requested SITMUF or GEMUF calculations. Details of the covariance matrix calculation can be found in the SITMUF section and won’t be repeated here.


The GEMUF functions calculate the GEMUF sequence (either V1 or V5B3). They assume the covariance matrix has already been calculated and effectively perform the calculation below:

\begin{aligned}
Z & = \hat{M}_{N}^{T} Σ_{N}^{- 1} {muf}_{N} \\
& = {muf}_{N}^{T} Σ_{N}^{- 1} {muf}_{N}
\end{aligned}

Unlike the SITMUF sequence, a single GEMUF calculation on a given MUF sequence produces only one test statistic. To obtain a statistic for each balance period, we therefore have to loop over the entire MUF sequence:

StatsTests.py

    # Loop over the material balance periods
    for ZR in range(1, int(nMBP)):
        IDs[ZR] = MUF[k,ZR*MBP]
        tempID = IDs[:ZR]
        tempcovmatrix = covmatrix[k,:ZR,:ZR]
        ZG = np.matmul(np.matmul(np.transpose(tempID), np.linalg.inv(tempcovmatrix)), tempID)
        GEMUFCalcsV1[k,int((ZR - 1) * MBP):int(ZR * MBP)] = np.ones((MBP,)) * ZG

We generate per-balance test statistics for GEMUF by considering subsections of the original sequence with increasing covariance matrix size:

StatsTests.py

    for ZR in range(1, int(nMBP)):
        IDs[ZR] = MUF[k,ZR*MBP]
        # Subsection of the MUF values and the matching ZR x ZR covariance block
        tempID = IDs[:ZR]
        tempcovmatrix = covmatrix[k,:ZR,:ZR]
        ZG = np.matmul(np.matmul(np.transpose(tempID), np.linalg.inv(tempcovmatrix)), tempID)
        GEMUFCalcsV1[k,int((ZR - 1) * MBP):int(ZR * MBP)] = np.ones((MBP,)) * ZG

For GEMUF-V1, the test statistic is straightforward to calculate:

StatsTests.py

    for ZR in range(1, int(nMBP)):
        IDs[ZR] = MUF[k,ZR*MBP]
        tempID = IDs[:ZR]
        tempcovmatrix = covmatrix[k,:ZR,:ZR]
        # Quadratic form: tempID^T * inv(tempcovmatrix) * tempID
        ZG = np.matmul(np.matmul(np.transpose(tempID), np.linalg.inv(tempcovmatrix)), tempID)
        GEMUFCalcsV1[k,int((ZR - 1) * MBP):int(ZR * MBP)] = np.ones((MBP,)) * ZG

Following the calculation of GEMUF-V1, the discrete sequence of GEMUF-V1 values is converted to a continuous time series.
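For readers who want to experiment outside MAPIT, the following self-contained sketch mirrors the overall shape of the V1 calculation: a statistic per balance period on growing subsequences, then held constant over each period to form a time series. It is not the MAPIT implementation; the function name `gemuf_v1`, the zero-based indexing, the use of `np.linalg.solve`, and the interpretation of "continuous" as a piecewise-constant series are all assumptions of this example.

```python
import numpy as np

def gemuf_v1(muf, cov, mbp):
    """Hypothetical helper: per-period GEMUF-V1 values as a step series.

    muf: (n,) MUF value for each balance period.
    cov: (n, n) covariance matrix of the MUF sequence.
    mbp: number of time steps in one material balance period.
    """
    n = muf.size
    zg = np.empty(n)
    for i in range(1, n + 1):
        sub_muf = muf[:i]        # first i balances
        sub_cov = cov[:i, :i]    # matching leading covariance block
        zg[i - 1] = sub_muf @ np.linalg.solve(sub_cov, sub_muf)
    # Hold each statistic constant over the time steps of its period
    return np.repeat(zg, mbp)

# Illustrative values only
muf = np.array([1.0, 2.0, 3.0])
cov = np.diag([1.0, 4.0, 9.0])
series = gemuf_v1(muf, cov, mbp=2)
```

With a diagonal covariance matrix each balance contributes its squared standardized value, so the statistic here grows by one per period.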

The GEMUF-V5B3 calculation differs slightly from the GEMUF-V1 calculation. It still iterates over the material balance periods, but a MUF "window" is created from the continuous MUF values to hold the values needed in the weighting. This window is only valid when 5 consecutive MUF values are available:

StatsTests.py

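        # A full 5-point window needs two balances on either side of ZR
        if ZR>=2 and ZR+2<int(nMBP):
          msc = 0
          # Collect the MUF values for periods ZR-2 through ZR+2
          for ZR2 in range(ZR-2,ZR+3):
            IDWindow[msc] = MUF[k, ZR2*MBP]
            msc += 1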

If there’s a valid window, then the GEMUF-V5B3 is calculated. It’s important to note that we store the calculated weighted MUF values as the GEMUF-V5B3 calculation requires a sequence of values, not just the instantaneous value (in the code, MSSeq holds the weighted MUF values).

StatsTests.py

        if ZR>=2 and ZR+2<int(nMBP):
          MS = (1/7)*(IDWindow[0] + IDWindow[1] + 3*IDWindow[2] + IDWindow[3] + IDWindow[4])
          MSSeq[k,ZR] = MS
          ZG = np.matmul(np.matmul(np.transpose(MSSeq[k,:ZR].reshape((-1,))), np.linalg.inv(tempcovmatrix)), tempID)

After calculating the discrete GEMUF-V5B3 sequence, it is converted to a continuous time series before being returned.

Important

The GEMUF-V5B3 test isn’t valid for the first and last two material balance periods because it requires a weighted average that includes the two previous and future balances. Seifert’s original paper didn’t describe what to do with the test on the tails of the balance sequence, so we choose to represent those values as np.nan. It’s important to note this as sometimes plotting libraries will drop those values when plotting. For example, matplotlib will often drop np.nan values resulting in a plotted sequence that appears to be missing the first two and last two sequence intervals.
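When post-processing a sequence with `np.nan` tails, it can help to locate the valid range explicitly and to use NaN-aware reductions. The sketch below uses made-up values; the array contents and variable names are assumptions for illustration.

```python
import numpy as np

# Made-up per-period GEMUF-V5B3 values with undefined tails
gemuf = np.array([np.nan, np.nan, 1.2, 3.4, 0.7, np.nan, np.nan])
periods = np.arange(gemuf.size)

# Boolean mask of the periods where the test is defined
valid = np.isfinite(gemuf)
first_valid = int(periods[valid][0])
last_valid = int(periods[valid][-1])

# NaN-aware reduction; np.max would return nan here
peak = float(np.nanmax(gemuf))
```

Knowing `first_valid` and `last_valid` makes it possible to set axis limits or annotate the undefined intervals so that a plot does not appear to be missing data.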
