Anomaly Detection using Adversarial Generative Networks in Multivariate Time Series

Chihiro Maru, Ochanomizu University

2023.03.23

Overview

Doctoral Dissertation
Anomaly Detection
using Adversarial Generative Networks
in Multivariate Time Series
(Japanese title: 多次元時系列データにおける敵対的生成ネットワークを用いた異常検知)

Student ID: 1770608
Chihiro Maru
Supervisor: Ichiro Kobayashi
Graduate School of Humanities and Sciences, Division of Science
Ochanomizu University
This dissertation is submitted for the degree of
Doctor of Philosophy
March 2023

Acknowledgements

I am grateful to everyone for their support during my years in the Kobayashi lab.
First, I thank my supervisor, Ichiro Kobayashi. He spent much time discussing my
research with me, on weekdays and weekends, to help it advance. I am grateful for
his support in showing me where I should go, enabling me to progress in my research
even though I had to reset my research theme at the start of my doctoral studies.
I would also like to thank Masato Oguchi, my supervisor during my undergraduate
and master's studies. He gave me many opportunities to make presentations at
international and domestic conferences and taught me the joy of research. I
appreciate his encouragement to pursue my doctoral studies.
I would also like to express my gratitude to the members of my dissertation
committee: Prof. Hiroaki Yoshida, Assoc. Prof. Kazue Kudo, and Lecturer Nathanael
Aubert-Kato. Thanks to their comments and helpful advice, I could make this thesis
more solid.
I would also like to express my appreciation to Boris Brandherm. This research
began during my study abroad in Germany; he welcomed me willingly and gave
me a lot of support during my stay.
I would like to thank my colleagues at work for helping me complete my
degree smoothly in parallel with my work.
I am glad I chose to study at the Department of Information Sciences, Faculty
of Sciences, Ochanomizu University. Teachers in the Department of Information
Sciences taught me through their classes that computer science is an exciting field of study.
Finally, I would like to thank my friends and family for their constant support.

Abstract

Anomaly detection in multivariate time series data is a technique for detecting
unusual observation values or behaviors from a series of data points consisting
of multiple variables obtained by continuously observing temporal changes in a
specific phenomenon. Capturing the signs of changes and anomalies in multivariate
time series data is essential in all fields, and studies on its automation have been
conducted. Anomalies are generally classified into three types: point anomalies,
in which observed values deviate significantly from the majority of data points;
contextual anomalies, in which observed values are anomalous only in the specific
context behind them; and collective anomalies, in which individual data points are
regular but the behavior is peculiar when multiple consecutive data points are
considered together. Point and contextual anomalies are relatively easy to detect by
handling individual data points, and many methods have therefore been proposed. On
the other hand, with the recent development of measurement technology and the
availability of large amounts of time series data, there is a growing motivation to
detect collective anomalies peculiar to time series data. Developing a model that
captures the temporal dependence between observed data points is necessary to
detect collective anomalies.
With the development of information technology, methods using deep learning
have been proposed for anomaly detection in multivariate time series data. In
addition to anomaly detection for point and contextual anomalies, deep learning
can handle complex multivariate and time series data and is expected to be applied
to collective anomaly detection. However, these methods still face challenges in
detection accuracy for practical use, and there is room for improvement in detecting
collective anomalies.
In this thesis, I propose anomaly detection models for multivariate time series data
based on Generative Adversarial Networks (GANs), one of the deep learning models,
combined with a sequence-to-sequence model or a Transformer. GANs consist of a
generator, which generates multidimensional data points from low-dimensional latent
variables, and a discriminator, which judges whether a given data point is real. These
networks are learned through minimax optimization called adversarial training. In
the proposed models of this thesis, an encoder that compresses multidimensional data
points into low-dimensional latent variables is introduced; multidimensional data
points are thus converted into low-dimensional feature representations that retain
important information, making it possible to detect anomalies in multidimensional
data without lowering accuracy. On the other hand, GANs alone cannot capture the
temporal dependencies between data points in time series data. The proposed models
therefore combine the encoder and generator of GANs with an RNN or a Transformer
and learn the encoder, generator, and discriminator by adversarial training. In the
model combining GANs and the Transformer, sparse attention is utilized as the
attention mechanism to learn long-term temporal dependencies between data points,
increasing the influence of strongly related data points in the time series and
improving the accuracy of detecting anomalies that occur over a long period.

Table of contents

List of figures
List of tables

1 Introduction
   1.1 Contribution
   1.2 Thesis Structure

2 Anomaly Detection
   2.1 Diverse Types of Anomaly
      2.1.1 Point anomalies
      2.1.2 Contextual anomalies
      2.1.3 Collective anomalies
   2.2 Performance Evaluation of Anomaly Detection
      2.2.1 Normal sample accuracy
      2.2.2 Anomalous sample accuracy
      2.2.3 F1-score
      2.2.4 Area under the ROC curve
   2.3 Anomaly Detection over Time Series Data
      2.3.1 Univariate time series data vs. Multivariate time series data
      2.3.2 Learning based on the availability of labels

3 Related Work
   3.1 Deep Anomaly Detection
      3.1.1 RNN
      3.1.2 LSTM
      3.1.3 GRU
   3.2 Deep Learning for Feature Extraction
   3.3 Learning Feature Representations of Normality
      3.3.1 Generic normality feature learning
      3.3.2 Anomaly measure-dependent feature learning
   3.4 End-to-End Anomaly Score Learning
      3.4.1 End-to-end one-class classification models

4 Anomaly Detection Model Combining GANs and RNN
   4.1 Introduction
   4.2 Problem Formulation
   4.3 Data Preprocessing
   4.4 Multivariate Time Series Anomaly Detection Model
      4.4.1 Overall architecture
      4.4.2 Encoder-decoder model
   4.5 Experiments and Results
      4.5.1 Public datasets
      4.5.2 Generating collective anomalies
      4.5.3 Comparative methods
      4.5.4 Experimental settings
      4.5.5 Experimental results
      4.5.6 Analysis on latent space
   4.6 Conclusions

5 Anomaly Detection Model Combining GANs and Transformer
   5.1 Introduction
   5.2 Multivariate Time Series Anomaly Detection Model
      5.2.1 Overall architecture
      5.2.2 Differences between Transformer and encoder-decoder model with RNN
      5.2.3 Sparse attention
      5.2.4 Training of TDAD and STDAD
   5.3 Experiments and Results
      5.3.1 Public datasets
      5.3.2 Evaluation metrics
      5.3.3 Effects of parameters
      5.3.4 Experimental setup
      5.3.5 Experimental results
      5.3.6 Evaluation of sparse attention
      5.3.7 Roles of sparse attention
   5.4 Conclusions

6 Conclusions and Future Work
   6.1 Conclusions
   6.2 Future Work

Bibliography

Appendix A
   A.1 F-score
   A.2 GANs Optimization

List of figures

2.1 Contextual anomaly t2 in a temperature time series [11]. Although t1 and t2 are the same temperature, t2 is a contextual anomaly when the time is considered.
2.2 Collective anomaly in a human electrocardiogram output [11]. Low values are continuous for a specific time, and the normal periodicity is lost.
2.3 Accuracy at the break-even point (break-even accuracy) where normal and anomalous sample accuracies coincide.
2.4 ROC curve with the false positive rate on the horizontal axis and anomalous sample accuracy on the vertical axis when the threshold is varied.
2.5 Anomaly detection based on the nearest neighbor distance. Whether x′ is anomalous is determined by the number of data points within a sphere of radius r centered at x′.
3.1 Categorization of deep anomaly detection methods. We classified the deep anomaly detection methods into three main categories and six subcategories regarding their characteristics.
3.2 Three main deep anomaly detection approaches: deep learning for feature extraction, learning feature representations of normality, and end-to-end anomaly score learning [59].
3.3 Neural networks with an input layer, one or more middle layers, and an output layer, each consisting of multiple units.
3.4 A unit of a neural network. Each unit receives inputs xi (i = 1, 2, · · · , M) and computes one output y using the weights wi (i = 1, 2, · · · , M) corresponding to the inputs and the bias b.
3.5 RNN with three layers: an input layer, a middle layer, and an output layer. The middle layer receives not only the output xt from the input layer but also the hidden state ht−1 of the middle layer at the previous time (t − 1) through a loop structure and computes yt.
3.6 Autoencoder architecture. An autoencoder consists of an encoder and a decoder. The encoder and decoder are learned to extract the feature z that preserves important information for reconstruction.
3.7 VAE architecture. An encoder outputs a mean µ and a standard deviation σ of a predefined distribution in the latent space. A decoder reconstructs data points corresponding to x using z sampled from the distribution with µ and σ.
3.8 Left: Two-layer fully connected autoencoder. Right: Two-layer sparsely-connected autoencoder. Unlike the left figure, in the right figure, the units in each layer are not all connected.
3.9 USAD architecture. In the first stage, the objective is to learn AE1 and AE2 to reconstruct the input window W. In the second stage, the objective is to learn AE2 to distinguish the real window W from the time window AE1(W) coming from AE1, and to learn AE1 to fool AE2.
3.10 GANs architecture. A standard GANs consists of a generator and a discriminator. The neural networks are learned through a two-player minimax game.
3.11 AnoGAN architecture in the test phase. The anomaly score consists of the reconstruction loss and the discrimination loss computed from the learned generator and discriminator.
3.12 MAD-GAN architecture in the training phase. GANs-based anomaly detection is performed on time series data by introducing LSTM into the generator and the discriminator of GANs.
3.13 MAD-GAN architecture in the test phase. When computing the reconstruction loss, a latent variable z corresponding to a test window W is found through inverse mapping.
3.14 BiGAN architecture. It introduces an encoder that maps data points to the latent space.
3.15 Efficient GAN-based anomaly detection architecture. In anomaly detection, G receives the latent variable E(x) of x, transformed by the learned E. D receives the pair of x and its latent variable E(x).
3.16 Transformer architecture [86].
3.17 Computation of attention scores. An attention score is obtained by computing the inner product of the target query, for which the feature is to be obtained, and all the keys.
3.18 Acquisition of query features. Computing the weighted sum of the values, weighted by the attention scores transformed by the softmax function, produces a feature of the target query.
3.19 OC-NN architecture. OC-NN learns a neural network transformation ϕ(·; Θ) from the input space X to the latent space Z so that the distance from the origin to the hyperplane is maximized.
3.20 Deep SVDD architecture. Deep SVDD learns a neural network transformation ϕ(·; Θ) from the input space X to the latent space Z so that the features are enclosed in a hypersphere with center c and radius R of minimum volume. Normal data points are mapped near the center of the hypersphere, while anomalies are mapped outside of it [69].
3.21 GMVAE architecture in the training phase. A VAE-based GMM is utilized for anomaly detection. Learning the VAE and GMM enables the extraction of features useful for anomaly detection.
3.22 GMVAE architecture in the test phase. Reconstruction probabilities of test data points are obtained using the learned VAE and compared with a given threshold to determine anomalies.
3.23 DAGMM architecture. DAGMM consists of a compression network and an estimation network. It utilizes an autoencoder to extract features of input data points and a GMM to perform anomaly detection.
3.24 Adversarially learned one-class classification in the training phase. It consists of two modules, the network R and the network D, and the two networks are learned using adversarial training. The network R is optimized to reconstruct data points belonging to the normal class, while the network D tries to classify input data points into normal and non-normal classes. The network D outputs the likelihood of the given input data point belonging to the normal class.
3.25 Adversarially learned one-class classification in the test phase. An anomaly score is computed using the learned networks R and D. Given a test data point x, R(x) is reconstructed using the network R, and R(x) is given as input to the network D. The network D outputs the probability D(R(x)) that x belongs to the normal class, and if this value is less than a predefined threshold, an anomaly is declared.
4.1 Time windows of length L consisting of L data points transformed from time series data. This figure is shown for L = 3.
4.2 Proposed architecture in anomaly detection. It consists of an encoder, a generator, and a discriminator. These three neural networks are learned by adversarial training.
4.3 Encoder-decoder model with RNN architecture. An encoder receives an input time window ABC, and a decoder produces WXY as the output time window.
4.4 Encoder architecture. When E receives a time window W, it outputs a latent variable E(W) with compressed features of W using an RNN with three hidden layers.
4.5 Generator architecture. G receives z = (s1^(1), s1^(2), s1^(3)) and sets them as the first hidden states of an RNN with three hidden layers at t = 1. G generates a data point x′t at each time t from the output data point and hidden state of the previous time.
4.6 Discriminator architecture. D consists of an RNN and an FNN. The RNN generates a latent variable with compressed features of the given time window, and the FNN computes the probability that the given time window is real.
4.7 Process of creating collective anomalies. Normal data points at different times are swapped.
4.8 t-SNE visualization of the WADI test dataset in the latent space. The blue and green dots represent normal and anomalous time windows, respectively.
4.9 t-SNE visualization of the latent space with false positives and false negatives. Red and yellow dots represent false positives and false negatives, respectively. It can be seen that the parts with a mixture of normal and anomalous time windows have more errors than the other parts.
4.10 t-SNE visualization of the latent space with anomaly scores. The anomaly scores of false positives and false negatives in Figure 4.9 are around the threshold, the borderline between normal and anomalous time windows.
5.1 Proposed architecture in anomaly detection. It consists of an encoder, a generator with an attention mechanism, and a discriminator. These three neural networks are learned by adversarial training.
5.2 Encoder training flow. When a tuple (WE, E(WE)) of the input time window WE and its latent variable E(WE) is given to the discriminator, the encoder is learned to minimize the loss function so that WE is determined to be false by the discriminator.
5.3 Generator training flow. When a tuple (G(z), z) consisting of the time window G(z) generated by the generator and the latent variable z is given to the discriminator, the generator is learned to minimize the loss function so that G(z) is determined to be real.
5.4 Number of layers N in the TransformerBlock module. The larger N, the better the F1-score; when N = 10, the F1-score is 0.0109 better than when N = 1.
5.5 Number of heads h in multi-head attention. The F1-score increases from h = 1 to 10 but becomes worse when h = 20.
5.6 Dimension dff of the latent space in the Feed Forward module. The F1-score is the highest when dff = 512, but as dff increases further, the F1-score decreases.
5.7 Dropout Ddrop in the encoder and the generator. The F1-score is the highest when Ddrop = 0.4, but as Ddrop increases further, the F1-score decreases.
5.8 Dimension dk of the latent space in the encoder and the generator. The larger dk, the higher the F1-score.
5.9 Anomaly detection results with various window sizes L for the SWaT dataset using TDAD and STDAD. TDAD has higher F1-scores up to L = 30, but at L = 40 STDAD reverses the trend. For longer window sizes such as L = 100 and 200, STDAD has better results.
5.10 Attention pattern for the cell with Layer 0, Head 0 in STDAD. The lines represent the attention from one data point (left side) to another (right side) in the time window.
5.11 Attention pattern between data points in the anomalous time window per attention head in STDAD. Compared to TDAD, STDAD has higher attention scores for strongly relevant data points, specifically anomalous data points.
5.12 Attention pattern between data points in the anomalous time window per attention head in TDAD. Compared to STDAD, TDAD spreads its attention scores across all data points.

List of tables

1.1 A confusion matrix of the determination results of anomaly detection. The results are classified into four types: true negative, false positive, false negative, and true positive.
1.2 Notation.
1.3 Important abbreviations.
4.1 Characteristics of datasets. We utilized two multivariate time series datasets to evaluate the proposed method.
4.2 Hyperparameters of MARU-GAN.
4.3 Experimental settings.
4.4 Experimental results (SWaT dataset). MARU-GAN received the highest F1-score.
4.5 Experimental results (WADI dataset). MARU-GAN received the highest F1-score.
5.1 Characteristics of datasets. We utilized four multivariate time series datasets and a univariate time series dataset to evaluate the proposed method.
5.2 F1-score, false positives, and false negatives with various computations of the anomaly score. A stronger effect of the reconstruction loss results in fewer FPs, while a stronger effect of the discrimination loss results in fewer FNs.
5.3 The average performance over all datasets. P, R, AUC, and F1 represent precision, recall, area under the ROC curve, and F1-score, respectively. The highest F1-score and AUC are in bold fonts. In F1-score, TDAD was the best, and STDAD performed second best.
5.4 Performance comparison. TDAD performed better on all datasets except the SMD dataset, on which it has the second-best F1-score.

Chapter 1
Introduction
An anomaly is a behavior that differs from the usual pattern. Anomalies are
also called outliers, abnormalities, or deviants [1]. Detecting anomalies is called
anomaly detection. Anomaly detection has been used in various applications,
including security, network operation, medical care, and marketing.
An anomaly detection model is constructed in three steps: estimating a
model, defining a method for computing an anomaly score, and setting a threshold.
The model, anomaly score, and threshold are the three key elements of anomaly
detection. In the training phase, these elements are determined from the training
data using machine learning, deep learning, or other methods.
Step 1: Estimation of the model. Create a model that reflects the properties of
the training data.
Step 2: Definition of a method for computing the anomaly score. Define a method for
computing the anomaly score, which is the degree of deviation from normality
(i.e., the degree of anomaly).
Step 3: Setting the threshold. Find a threshold above which an anomaly score is
determined to indicate an anomaly.
Consider the Body Mass Index (BMI) in a physical examination as a simple
example. BMI is defined as

   BMI(w, h) = w / h²,   (1.1)

where w is weight (in kilograms) and h is height (in meters). According to the
World Health Organization criteria, a BMI of 25 or higher is considered obese.
In this example, BMI(·) in Eq. (1.1) is the model, the BMI computed using
Eq. (1.1) is the anomaly score, and 25 is the predefined threshold for determining
anomalies. The model, the method for computing BMI, and the threshold have been
defined using clinical data from many people worldwide.
In the anomaly detection phase, after the above three steps, an anomaly score is
computed for unseen data to determine whether it is normal or anomalous.
Step 4: Computation of the anomaly score. Compute the anomaly score defined in
Step 2 for the unseen data.
Step 5: Anomaly detection. An anomaly is declared when the anomaly score from
Step 4 exceeds the threshold predefined in Step 3.
In the example above, a person's BMI (the anomaly score) is computed using
Eq. (1.1). If this value is greater than 25 (the predefined threshold), the person
is determined to be obese (i.e., an anomaly).
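Steps 1 through 5 can be sketched directly in code. The model, anomaly score, and threshold follow the BMI example in the text (the WHO threshold of 25); the sample weight and height values below are made up for illustration.

```python
# Model (Step 1): BMI as a function of weight (kg) and height (m), Eq. (1.1).
def bmi(weight, height):
    return weight / height ** 2

# Anomaly score (Step 2) is the BMI value itself; threshold (Step 3) is 25.
THRESHOLD = 25.0

def is_obese(weight, height):
    # Steps 4-5: compute the score for unseen data and compare it
    # with the predefined threshold.
    return bmi(weight, height) > THRESHOLD

print(bmi(80.0, 1.75))       # ≈ 26.12, the anomaly score
print(is_obese(80.0, 1.75))  # True: score exceeds the threshold
print(is_obese(60.0, 1.75))  # False: score below the threshold
```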
Anomaly detection in multivariate time series data is a technique for detecting
unusual observation values or behaviors from a series of data points consisting
of multiple variables obtained by continuously observing temporal changes in
a particular phenomenon. Anomalies in multivariate time series data can be
classified into point anomalies, contextual anomalies, and collective anomalies [11].
Point anomalies are individual data points whose values deviate significantly
from most others. Contextual anomalies are individual data points that are
anomalous only in a particular context. Collective anomalies are sets of data
points in which each individual data point is normal but the behavior is anomalous
when the points are considered collectively.
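As an illustration (not taken from the thesis), the difference between a point anomaly and a collective anomaly can be reproduced on a toy series: a per-point detector that scores each observation in isolation finds the former but misses the latter. The series, the z-score detector, and the threshold of 3 are all assumptions made here for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy univariate series: a sine wave with small noise.
t = np.arange(200)
x = np.sin(2 * np.pi * t / 50) + 0.05 * rng.standard_normal(200)

# Point anomaly: one value far outside the usual range.
x[60] = 5.0

# Collective anomaly: every value is in the normal range, but a flat
# run of 20 points breaks the usual periodic behavior.
x[120:140] = 0.0

# A per-point z-score detector catches the point anomaly ...
z = np.abs((x - x.mean()) / x.std())
print(np.argmax(z))  # 60: the point anomaly stands out

# ... but every point in the flat segment stays below the threshold,
# so the collective anomaly goes undetected point by point.
print(np.max(z[120:140]) < 3.0)  # True
```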
In this thesis, the F1-score was used as the criterion for anomaly detection; the
F1-score is the harmonic mean of the normal and anomalous sample accuracies. The
determination results of anomaly detection fall into four patterns: correctly
determined as normal, wrongly determined as anomalous, wrongly determined as
normal, and correctly determined as anomalous. TN, FP, FN, and TP in Table 1.1
represent the number of samples corresponding to these results, respectively.
TN and TP are correct, while the remaining two patterns, FN and FP, are incorrect.
Table 1.1: A confusion matrix of the determination results of anomaly detection.
The results are classified into four types: true negative, false positive, false
negative, and true positive.

                            Determination results
                            Normal                 Anomalous
Real labels   Normal        TN (True Negative)     FP (False Positive)
              Anomalous     FN (False Negative)    TP (True Positive)
From these determination results, the normal and anomalous sample accuracies
can be computed. Normal sample accuracy is defined as

   (Number of samples correctly determined to be normal) / (Total number of normal samples)
   = TN / (TN + FP),   (1.2)

and anomalous sample accuracy is defined as

   (Number of samples correctly determined to be anomalous) / (Total number of anomalous samples)
   = TP / (FN + TP).   (1.3)

Here, consider testing for an infectious disease. Low normal sample accuracy
increases the risk that a normal sample will be determined to be anomalous. In
this case, a person is determined to be positive even though the person does not
have the disease and is subjected to needless behavioral restrictions. On the
other hand, low anomalous sample accuracy increases the risk that an anomalous
sample will not be correctly determined to be anomalous. In this case, the
possibility of spreading the infection increases because behavior is not restricted
despite infection. Therefore, it is necessary to improve both the normal and
anomalous sample accuracies in anomaly detection. In order to treat these two
indices as a single index, the F1-score, the harmonic mean of the normal and
anomalous sample accuracies, is widely used. I used the F1-score as the index for
evaluating the performance of anomaly detection.
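Equations (1.2) and (1.3) and the F1-score can be computed directly from the four counts in Table 1.1. Note that, following the text, the F1-score here is the harmonic mean of the two sample accuracies (rather than of precision and recall); the counts below are hypothetical values for illustration.

```python
def normal_sample_accuracy(tn, fp):
    # Eq. (1.2): fraction of normal samples determined to be normal.
    return tn / (tn + fp)

def anomalous_sample_accuracy(fn, tp):
    # Eq. (1.3): fraction of anomalous samples determined to be anomalous.
    return tp / (fn + tp)

def f1_score(tn, fp, fn, tp):
    # Harmonic mean of the two sample accuracies above.
    nsa = normal_sample_accuracy(tn, fp)
    asa = anomalous_sample_accuracy(fn, tp)
    return 2 * nsa * asa / (nsa + asa)

# Hypothetical confusion-matrix counts.
print(f1_score(tn=90, fp=10, fn=5, tp=45))  # 0.9
```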
Among the anomalies in multivariate time series data, many methods for detecting
point and contextual anomalies have been proposed, since these can be detected
relatively easily by treating each data point individually. Recent advances in
measurement technology and the ease of obtaining time series data have increased
the motivation to detect collective anomalies unique to time series data. To
detect collective anomalies, multiple data points must be handled simultaneously.
In particular, deep learning methods have been proposed for anomaly detection
in time series data. Deep learning makes it possible to handle multiple data
points simultaneously and is therefore expected to detect anomalies in time
series data with high accuracy. LSTM-NDT is an LSTM-based neural network
model [30]. LSTM learns the relationships between past data and current data.
Given a set of data points as input, the data point at the next time step is
predicted, and the difference between the predicted and actual values is used to
determine anomalies. MAD-GAN is a GANs-based neural network model [42].
Introducing LSTM into the two neural networks, the generator and the discriminator,
enables anomaly detection in time series data using GANs. OmniAnomaly is a
probabilistic recurrent neural network model that combines a gated recurrent
unit with a variational autoencoder [78]. Its core idea is to capture robust
representations of normal multivariate time series data, reconstruct input data
from those representations, and use the reconstruction probabilities to determine
anomalies. USAD is an autoencoder-based neural network model [5]. It detects
anomalies by reconstructing the input time series data using two autoencoders
with different decoders. TranAD is a Transformer-based neural network model [84].
It combines two Transformers, learns by adversarial training, and uses the
reconstruction loss to determine anomalies, as in USAD. Its anomaly detection
consists of two stages: the reconstruction loss of the input time series data
from the first stage is given as input to the second stage, and another decoder
reconstructs the input time series data using this loss.
Anomaly detection methods for multivariate time series data, including the
methods above, are still less accurate than is required for practical use, and
there is room for improvement. We propose an anomaly detection method that
achieves a higher F1-score for time series data containing point and contextual
anomalies as well as collective anomalies.

1.1 Contribution

All the contributions of this thesis are the following:
1. Adversarial training using the GANs discriminator can determine anomalies
with high accuracy in anomaly detection.
2. RNN and the Transformer's attention mechanism allow us to better capture
the characteristics of temporal dependencies in time series data.
3. Sparse attention increases the influence of strongly relevant data points and
improves the interpretability of the patterns of time series data.

1.2 Thesis Structure

This thesis is structured as follows.
• Chapter 2 explains preliminary knowledge of anomaly detection and is
divided into three main parts. The first one presents diverse types of
anomalies. The second one is dedicated to the performance evaluation
of anomaly detection. Finally, the third one presents anomaly detection
learning approaches based on the existence of labels.
• Chapter 3 introduces unsupervised anomaly detection methods based on
deep learning. We present the methods classified into three categories: deep
learning for feature extraction, learning feature representations of normality,
and end-to-end anomaly score learning.
• Chapter 4 introduces the GANs-based anomaly detection method, which
incorporates RNN to detect multivariate collective anomalies and shows the
performance of the method, which focuses on collective anomaly detection
for two real-world open datasets.


• Chapter 5 presents a method combining GANs and Transformer to detect
anomalies in multivariate time series data. Furthermore, we describe a
method that introduces sparse attention in Transformer’s attention mechanism to detect anomalies that occur over a long time.
• Finally, Chapter 6 summarizes the main contributions presented in this
work and presents the possible continuation.
The notation used in the thesis is summarized in Table 1.2. Moreover, the
abbreviations used in the thesis are summarized in Table 1.3.
Table 1.2: Notation.

Notation    Description
x           Data point with a single variable
x           Data point with M (M ≥ 2) dimensional variables
y           Label
xt / xt     Data point at time t
Wt          Time window of length L at time t
Θ           Weight matrix
L           Loss function
A           Anomaly score function
E(·)        Expected value
sigm        Sigmoid function

Table 1.3: Important abbreviations.

Abbreviation    Full name
RNN             Recurrent neural networks
LSTM            Long short-term memory
GRU             Gated recurrent unit
GANs            Generative adversarial networks

Chapter 2
Anomaly Detection
This chapter presents preliminary knowledge of anomaly detection. First, diverse
types of anomalies are introduced. Next, the performance evaluation of anomaly
detection is discussed. Finally, anomaly detection learning approaches based on
the existence of labels are presented.

2.1 Diverse Types of Anomaly

Anomalies in anomaly detection are classified into point, contextual, and collective
anomalies [11]. Point and contextual anomalies utilize similar approaches because
both anomalies are detected by dealing with an individual data point. On the
other hand, detecting collective anomalies utilizes approaches that deal with
multiple data points. Collective anomalies are specific to time series data. We
aim to detect these three anomalies in time series data with high performance.

2.1.1 Point anomalies

The first category is point anomalies for detecting individual data points that
deviate significantly from most data points. It is the easiest type of anomaly to
determine and is the focus of most research on anomaly detection. For instance, a
data point of 100 degrees in a year of temperature observations is determined to be
a point anomaly. Generally, the temperature never reaches 100 degrees, regardless
of time or space.


Point anomalies can be detected by comparing each data point with a predefined threshold.
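This thresholding can be sketched as follows; the temperature values and the threshold below are hypothetical, chosen to match the 100-degree example above.

```python
import numpy as np

# Hypothetical daily temperatures (degrees); 100 is physically implausible.
temps = np.array([12.3, 14.1, 100.0, 13.8, 15.2])

threshold = 50.0  # predefined upper bound for plausible temperatures
point_anomalies = np.flatnonzero(temps > threshold)

print(point_anomalies)  # [2]: only the 100-degree data point exceeds the threshold
```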

2.1.2 Contextual anomalies

The second category is contextual anomalies (conditional anomalies) for detecting
anomalous data points in a specific context. This context varies widely in real-world applications, such as time and space. Figure 2.1 shows an example of
contextual anomalies in a particular area’s temperature time series data. Although
t1 and t2 are the same temperature, t2 is a contextual anomaly when considering
time as a context. This temperature is a typical winter temperature, and t1 is
determined to be normal because it was observed in winter, but t2 is determined
to be a contextual anomaly because it was observed in summer.
Contextual anomalies can be detected by setting a threshold for each context.
In the above example, the threshold is set for each month. However, the larger
the number of variables to be considered simultaneously, the more difficult it
becomes to detect anomalies.
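The per-context thresholding described above can be sketched as follows; the monthly temperature ranges are hypothetical illustrations of the winter/summer example.

```python
# Hypothetical plausible temperature ranges per context (context = month).
ranges = {"January": (-10.0, 10.0), "July": (18.0, 38.0)}

def is_contextual_anomaly(temp, month):
    """A data point is anomalous only relative to its context's range."""
    low, high = ranges[month]
    return not (low <= temp <= high)

# The same value of 2 degrees is normal in winter but anomalous in summer.
print(is_contextual_anomaly(2.0, "January"))  # False
print(is_contextual_anomaly(2.0, "July"))     # True
```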

Figure 2.1: Contextual anomaly t2 in a temperature time series [11]. Although
t1 and t2 are the same temperature, t2 is a contextual anomaly when the time is
considered.


2.1.3 Collective anomalies

The third category is collective anomalies (group anomalies) for detecting a subset
of data points whose behavior significantly differs from the other data points. The
individual data points in a collective anomaly may not be regarded as anomalies
by themselves. Collective anomalies are specific to time series data because
they are detected by dealing with consecutive data points. Figure 2.2 shows
an example of collective anomalies in a human electrocardiogram output. This
time series data has periodicity. However, the highlighted region is a collective
anomaly because the low values are continuous for a specific time, and the
behavior differs from the others.

Figure 2.2: Collective anomaly in a human electrocardiogram output [11]. Low
values are continuous for a specific time, and the normal periodicity is lost.

2.2 Performance Evaluation of Anomaly Detection

When evaluating the performance of anomaly detection, it is important to divide
the dataset into two types: training dataset Dtrain and test dataset Dtest (validation
dataset). Using the training data Dtrain , the properties of data points in the dataset
are learned to create a model for anomaly detection. At the same time, a threshold


Ath for anomaly score is defined for determining anomalies. The test dataset Dtest
is utilized to evaluate the performance of the learned model.
Another important point is that the threshold Ath is directly related to anomaly
detection performance. It is necessary to evaluate performance from two perspectives: normal sample accuracy and anomalous sample accuracy. In anomaly
detection of unseen data point x′ ∈ Dtest , if the anomaly score A(x′ ) is greater than
Ath , x′ is determined to be anomalous. If A(x′ ) is less than Ath , x′ is determined
to be normal. Although the test dataset includes both normal and anomalous
samples, real-world datasets contain far fewer anomalous samples than
normal samples. In this case, if Ath is made infinitely large, it is possible to create a
model that never determines any sample to be anomalous. It is a poor-performing
model in that it misses all anomalous samples. However, because the number of
anomalous samples is very small compared to normal samples, its accuracy is high.
Conversely, if Ath is made infinitely small, it is possible to create a model that always
determines samples to be anomalous. The model is reliable in that it can detect all
anomalous samples. However, the model is useless because it makes incorrect
determinations for most normal samples. Therefore, there are two conflicting
perspectives in evaluating anomaly detection performance, and it is necessary to
clarify whether normal or anomalous samples should be utilized for evaluation.
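The determination rule and the two accuracies discussed above can be sketched as follows; the scores, labels, and the helper name `sample_accuracies` are hypothetical.

```python
import numpy as np

def sample_accuracies(scores, labels, a_th):
    """Normal and anomalous sample accuracies for a threshold Ath.

    scores: anomaly scores A(x'); labels: 1 for anomalous, 0 for normal."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    pred = scores > a_th                       # True -> determined anomalous
    r_n = float(np.mean(~pred[labels == 0]))   # correctly determined normal
    r_a = float(np.mean(pred[labels == 1]))    # correctly determined anomalous
    return r_n, r_a

scores = [0.1, 0.2, 0.9, 0.15, 0.8]
labels = [0, 0, 1, 0, 1]
print(sample_accuracies(scores, labels, a_th=0.5))  # (1.0, 1.0)
print(sample_accuracies(scores, labels, a_th=1e9))  # (1.0, 0.0): misses every anomaly
```

The second call illustrates the "infinitely large Ath" case: normal sample accuracy stays perfect while every anomalous sample is missed.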

2.2.1 Normal sample accuracy

Normal sample accuracy focuses on normal samples and is defined as

rN = (Number of samples that are correctly determined to be normal) / (Total number of normal samples).    (2.1)

2.2.2 Anomalous sample accuracy

Unlike normal sample accuracy, anomalous sample accuracy focuses on anomalous samples and is defined as

rA = (Number of samples that are correctly determined to be anomalous) / (Total number of anomalous samples).    (2.2)

Anomalous sample accuracy refers to recall in binary classification.


2.2.3 F1-score

One metric for evaluating the performance of anomaly detection is break-even
accuracy. Normal sample accuracy rN and anomalous sample accuracy rA vary
greatly depending on the predefined threshold Ath for anomaly detection. A
graph with Ath on the horizontal axis and rN or rA on the vertical axis is shown
in Figure 2.3. The behavior of rN and rA in the two limits of Ath is summarized below.
• When Ath is infinitely small, all samples are determined to be anomalous.
Normal samples are incorrectly determined to be anomalous (rN = 0), while
anomalous samples are correctly determined (rA = 1).
• When Ath is infinitely large, all samples are determined to be normal. Normal
samples are correctly determined (rN = 1), while anomalous samples are
incorrectly determined to be normal (rA = 0).

Figure 2.3: Accuracy at the break-even point (break-even accuracy) where normal
and anomalous sample accuracies coincide.
Using the accuracy at the break-even point (break-even accuracy) where the
normal and the anomalous sample accuracies coincide, the performance for
anomaly detection can be expressed by a single value. F1-score is utilized for
practical use rather than strictly finding the break-even point. F1-score is the
harmonic mean f of rN and rA , defined as

f = 2 rN rA / (rN + rA ).    (2.3)

In this thesis, we used the F1-score in Eq. (2.3) as an index of performance for
anomaly detection.
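The computation of Eq. (2.3) can be sketched as follows; the function name and the zero-denominator guard are added for illustration.

```python
def f1_score(r_n, r_a):
    """Harmonic mean of the two sample accuracies, as in Eq. (2.3)."""
    if r_n + r_a == 0:
        return 0.0  # guard for the degenerate case (not part of Eq. (2.3))
    return 2 * r_n * r_a / (r_n + r_a)

# The harmonic mean penalizes imbalance between the two accuracies.
print(f1_score(0.9, 0.6))  # ≈ 0.72, below the arithmetic mean 0.75
```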

2.2.4 Area under the ROC curve

The area under the curve (AUC) is an index of the model’s goodness for anomaly
detection, computed using the receiver operating characteristic (ROC) curve.
The ROC curve plots the anomalous sample accuracy rA against the false positive rate (1 − rN ). That is, the ROC curve is a set of coordinates
(1 − rN (Ath ), rA (Ath )) for a certain threshold Ath , as shown in Figure 2.4. The area
of the region between the ROC curve and the horizontal axis is the AUC.

Figure 2.4: ROC curve with the false positive rate on the horizontal axis and
anomalous sample accuracy on the vertical axis when the threshold is varied.
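The AUC computation can be sketched as follows; this is a minimal numpy implementation in which the threshold is swept over all observed scores, with hypothetical scores and labels.

```python
import numpy as np

def roc_auc(scores, labels):
    """AUC: sweep the threshold Ath over all scores and integrate the
    anomalous sample accuracy rA over the false positive rate (1 - rN)."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    thresholds = np.r_[np.inf, np.sort(scores)[::-1], -np.inf]
    fpr = np.array([np.mean(scores[labels == 0] > t) for t in thresholds])
    tpr = np.array([np.mean(scores[labels == 1] > t) for t in thresholds])
    # Trapezoidal rule over the ROC curve from (0, 0) to (1, 1).
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

print(roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```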

2.3 Anomaly Detection over Time Series Data

2.3.1 Univariate time series data vs. Multivariate time series data

Time series data is the sequence of data points obtained by observing changes in
a phenomenon over time. Many data in the world, such as temperature, blood
pressure, and stock prices, change over time. Time series data can be classified
into univariate and multivariate time series data, depending on the number of
different types of values observed simultaneously. Each type of observed value
is called a variable. For example, if air pressure, temperature, wind speed, and
rainfall are observed simultaneously at a particular area, the multivariate time
series data has four variables.
A time series data is represented as a series of data points indexed in time
order. There is an interrelationship between the data points at consecutive times,
and this relationship cannot be ignored in anomaly detection. For example, if
the order of the data points in time is shuffled, anomaly detection will no longer
be possible. The temporal relationships with the previous data points are
important for anomaly detection.
Definition 2.3.1 (Univariate time series data). A time series data consisting of N
data points is defined as
x1 , x2 , . . . , xN ,
where xt is a data point with a single variable at a specific time t.
Definition 2.3.2 (Multivariate time series data). A time series data consisting of N
data points is represented as a vertical vector of values for each variable, defined
as
 1
 xt 
 
 x2 
 
x1 , x2 , . . . , xN , xt =  .t  ∈ RM ,
 .. 
 
 M
xt
where xt is a data point with M(M ≥ 2) variables at a specific time t.
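Concretely, a multivariate time series can be stored as an N × M array, with the time windows Wt of Table 1.2 extracted by slicing; the sizes and the helper name `windows` below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)
N, M = 100, 4                      # N data points, each with M variables
X = rng.normal(size=(N, M))        # the data point x_t is the row X[t]

def windows(X, L):
    """Time windows W_t of length L, one per valid starting time t."""
    return np.stack([X[t:t + L] for t in range(len(X) - L + 1)])

W = windows(X, L=10)
print(W.shape)  # (91, 10, 4): 100 - 10 + 1 windows of shape (L, M)
```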


2.3.2 Learning based on the availability of labels

The anomaly detection approaches are classified into three categories: supervised,
semi-supervised, and unsupervised anomaly detection. The classification is based
on whether labels are available in the dataset used to create the model, in addition
to the data points x with M variables. The label associated with a data point denotes
whether that data point is normal or anomalous. Typically, it is not easy to collect
data points labeled as anomalous. Therefore, the model is created using a dataset
that either contains no anomalous data points or contains only an overwhelming
minority of anomalous data points.

Supervised anomaly detection
In supervised anomaly detection, a model is created using data points labeled as
normal or anomalous. A multivariate time series data
D = {(x1 , y1 ), (x2 , y2 ), . . . , (xN , yN )}
consisting of N pairs of an M-dimensional data point xt and a label yt representing
normal or anomalous is utilized in this case.
Usually, models reflect the properties of normal and anomalous data points,
respectively. However, the difficulty of collecting data points labeled as anomalous
makes supervised anomaly detection impractical compared to unsupervised and
semi-supervised anomaly detection. Data points labeled as normal and those
labeled as anomalous are unbalanced, making the model’s performance suboptimal. In addition, the types of anomalies that can be reflected in the model are
limited, and the model cannot reflect the properties of newly occurring anomalies.

Semi-supervised anomaly detection
In semi-supervised anomaly detection, a model is created using only data points
labeled as normal. This technique is widely utilized in anomaly detection because
normal data points are easier to collect than anomalies. In this case, a multivariate

time series data
D = {x1 , x2 , . . . , xN }
consisting only of N data points with M variables is utilized. Semi-supervised
anomaly detection creates models that reflect only properties of normal data
points. When determining whether unseen data points are anomalous, data
points that deviate from this model are determined to be anomalous. This thesis proposes
a method based on a semi-supervised anomaly detection approach.
Unsupervised anomaly detection
Unsupervised anomaly detection does not require any data points labeled normal
or anomalous. It determines anomalies based solely on the intrinsic properties
of the data points. This approach assumes that there are far more normal than
anomalous data points. A typical example is anomaly detection based on the
nearest neighbor distance. Suppose there is a time series data D = {x1 , x2 , . . . , xN }
consisting of N data points with M variables, where normal or anomalous is
not known. We consider determining whether an unseen given data point x′ is
anomalous. As shown in Figure 2.5, consider an M dimensional sphere centered
at x′ . As a criterion for determining anomalies, the radius r of the sphere is
determined in advance, and x′ is determined to be anomalous when the number
of data points within the sphere is less than a particular value. This technique is
often utilized to detect point and contextual anomalies.
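The procedure of Figure 2.5 can be sketched as follows; the two-dimensional data, the radius r, and the neighbor count are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(0.0, 1.0, size=(200, 2))   # unlabeled data points, M = 2

def is_anomalous(x_new, data, r=0.5, min_neighbors=3):
    """x' is anomalous if fewer than min_neighbors data points fall
    inside the M-dimensional sphere of radius r centered at x'."""
    dists = np.linalg.norm(data - x_new, axis=1)
    return bool(np.sum(dists <= r) < min_neighbors)

print(is_anomalous(np.array([0.0, 0.0]), D))  # False: dense region
print(is_anomalous(np.array([8.0, 8.0]), D))  # True: far from all data points
```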

Figure 2.5: Anomaly detection based on the nearest neighbor distance. Determine
whether x′ is anomalous by the number of data points that are within a sphere of
radius r centered at x′ .


Chapter 3
Related Work
We classified the semi-supervised anomaly detection methods using deep learning
(called deep anomaly detection methods) into three main categories and six subcategories in terms of characteristics [59]. The overall results of the classification
are shown in Figure 3.1. In particular, the three main categories consist of deep
learning for feature extraction, learning feature representations of normality, and
end-to-end anomaly score learning.

Figure 3.1: Categorization of deep anomaly detection methods. We classified the
deep anomaly detection methods into three main categories and six subcategories
regarding characteristics.
In deep learning for feature extraction, neural networks are used only for
feature extraction, and deep learning and anomaly detection are completely
independent, as presented in Figure 3.2. As shown in Figure 3.2, learning feature
representations of normality aims to learn highly expressive normal representations using neural networks, where deep learning and anomaly detection are
somehow dependent on each other. This approach is further divided into two
subcategories based on whether or not the anomaly measure used for anomaly
detection is introduced during learning neural networks. End-to-end anomaly
score learning integrates deep learning and anomaly detection, as represented
in Figure 3.2, and learns to compute the anomaly score directly using neural
networks.

Figure 3.2: Three main deep anomaly detection approaches: deep learning for
feature extraction, learning feature representations of normality, and end-to-end
anomaly score learning [59].

3.1 Deep Anomaly Detection

Neural networks are models that imitate the information processing by neurons
(nerve cells) in the human cranial nervous system. A single neuron is also called a
unit. Neural networks consist of several units and their combinations. As shown
in Figure 3.3, neural networks generally consist of multiple layers: an input layer,
one or more middle layers (hidden layers), and an output layer, each consisting
of many units with no mutual relations.
Neural networks with two or more middle layers are called deep neural
networks. Units are only connected between adjacent layers; units belonging to
the same layer are not connected. The units connected between adjacent layers


Figure 3.3: Neural networks with an input layer, one or more middle layers, and
an output layer, each consisting of multiple units.
are assigned weights that indicate the degree of the connection, and these weights
are learned using the training dataset.
Each unit receives multiple inputs x1 , x2 , · · · , xM and computes one output y as
shown in Figure 3.4. The output is defined as
y = f( Σ_{i=1}^{M} wi xi − b ),    (3.1)

where wi (i = 1, 2, . . . , M) is the weight of the connection to xi (i = 1, 2, . . . , M) and b is
the bias. The function f (·) is called the activation function.

Figure 3.4: A unit of neural networks. Each unit receives an input xi (i = 1, 2, · · · , M)
and computes one output y using the weights wi (i = 1, 2, · · · , M) corresponding to
the input and the bias b.


It is difficult to approximate the relations between inputs and outputs using
only linear transformations when the relations between the inputs and outputs
are nonlinear. Following a linear transformation, the activation function can be
applied to perform a nonlinear transformation, making it possible to express
relations between inputs and outputs with nonlinear relationships. By introducing x0 = 1 and setting w0 = −b, x = (x0 , x1 , · · · , xM )⊤ , w = (w0 , w1 , · · · , wM )⊤ , Eq. (3.1) is
transformed to

y = f( Σ_{i=0}^{M} wi xi ) = f (w⊤ x).    (3.2)
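The computation of a single unit in Eq. (3.2) can be sketched as follows, with a sigmoid activation and hypothetical weights.

```python
import numpy as np

def sigm(a):
    """Sigmoid activation function."""
    return 1.0 / (1.0 + np.exp(-a))

def unit_output(x, w, b):
    """Eq. (3.2): prepend x0 = 1 and fold the bias into w0 = -b."""
    x_ext = np.r_[1.0, x]          # x = (x0, x1, ..., xM)
    w_ext = np.r_[-b, w]           # w = (w0, w1, ..., wM)
    return sigm(w_ext @ x_ext)     # y = f(w^T x)

x = np.array([0.5, -1.0])
w = np.array([2.0, 1.0])
print(unit_output(x, w, b=0.0))  # w.x = 0, so sigm(0) = 0.5
```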

For an input space X ⊆ RM and a latent space Z ⊆ RK (K ≪ M), deep anomaly
detection learns either a feature mapping function ϕ(·) : X 7→ Z that projects
original data points into the latent space Z or an anomaly score function τ(·) :
X 7→ R that directly computes anomaly scores from original data points so
that anomalies can be classified from normal data points, where both ϕ and τ
are neural networks with H(H ∈ N) middle layers and their weight matrices
Θ = {M1 , M2 , · · · , MH }. In the case of learning the feature mapping function ϕ,
after obtaining a set of features (also named a feature vector) from initial data
points using ϕ, an additional step is to compute anomaly scores in the latent
space Z.

3.1.1 RNN

Recurrent neural networks (RNN) [19] are a special type of neural networks with a
loop structure that can temporarily store past information and can handle time
series information and variable-length input/output as shown in Figure 3.5. The
basic structure of RNN is similar to that of a feedforward neural network with
three layers in Figure 3.3: an input layer, a middle layer, and an output layer,
with the significant difference being that the middle layer has a loop structure
and weight parameters for the connections. RNN receives one input xt at each
time t of the input time series data D = {x1 , x2 , · · · , xN } and generates an output
yt sequentially. The middle layer then receives not only the output xt from the
input layer but also the hidden state ht−1 of the middle layer at the last time (t − 1)
through a loop structure and computes yt . Similarly, the computation at time

(t − 1) takes into account the information at time (t − 2), so that yt theoretically
reflects all the inputs x1 , x2 , · · · , xt received so far.
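The recurrence described above can be sketched as follows; this is a minimal numpy forward pass with hypothetical dimensions, and the output layer is omitted so that only the hidden-state loop is shown.

```python
import numpy as np

rng = np.random.default_rng(1)
M, h = 3, 4                               # input dimension, hidden dimension
W_xh = rng.normal(size=(h, M)) * 0.1      # input-to-hidden weights
W_hh = rng.normal(size=(h, h)) * 0.1      # hidden-to-hidden (loop) weights

def rnn_forward(xs):
    """Sequentially update the hidden state; h_t depends on x_t and h_{t-1},
    so the final state reflects all inputs received so far."""
    h_t = np.zeros(h)
    for x_t in xs:
        h_t = np.tanh(W_xh @ x_t + W_hh @ h_t)
    return h_t

series = rng.normal(size=(5, M))          # D = {x1, ..., x5}
print(rnn_forward(series).shape)  # (4,)
```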

Figure 3.5: RNN with three layers: an input layer, a middle layer, and an output
layer. The middle layer receives not only the output xt from the input layer but
also the hidden state ht−1 of the middle layer at the last time (t − 1) through a loop
structure and computes yt .

3.1.2 LSTM

RNN with complex structures, such as long short-term memory (LSTM) [28,
21] and gated recurrent unit (GRU) [14, 15], have been proposed to predict
and learn longer time series data. The standard RNN can handle short-term
temporal information but cannot handle long-term temporal information well.
The propagation computation of RNN is expanded in the time direction. Therefore,
RNN that handle very long-term temporal information, like neural networks
that are too deep, can suffer from technical problems such as the vanishing
gradient problem.
LSTM is realized by replacing units in the middle layer of RNN with LSTM
blocks with a memory cell and three gates. LSTM uses the cell state ct as
the internal memory of the memory cell in addition to the hidden state ht of
RNN to hold the short-term information for a more extended period. The flow
of information in and out of memory cells is regulated by using three gate
mechanisms: a forget gate ft , an input gate it , and an output gate ot . The forget


gate ft controls the degree to which the previous cell state ct−1 is retained in the
cell state ct , the input gate it controls the amount of updating of the cell state ct ,
and the output gate ot controls the degree of output from the cell state ct to
determine the hidden state ht .


ft = sigm (Wfx xt + Wfh ht−1 + bf ) ,
it = sigm (Wix xt + Wih ht−1 + bi ) ,
ot = sigm (Wox xt + Woh ht−1 + bo ) ,
ct = ft ⊙ ct−1 + it ⊙ tanh (Wcx xt + Wch ht−1 + bc ) ,
ht = ot ⊙ tanh (ct ) ,    (3.3)
where Wx ∈ Rh×M and Wh ∈ Rh×h are weights, and b ∈ Rh is the bias. They are
learned during the training phase. M and h are the dimensions of the input data
points and the number of cells in the LSTM unit, respectively. ⊙ represents the
Hadamard product.
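One step of Eq. (3.3) can be sketched as follows; the numpy implementation uses hypothetical small dimensions, and ⊙ becomes elementwise multiplication.

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following Eq. (3.3); W holds the eight weight
    matrices and b the four biases, keyed by gate name."""
    f_t = sigm(W["fx"] @ x_t + W["fh"] @ h_prev + b["f"])   # forget gate
    i_t = sigm(W["ix"] @ x_t + W["ih"] @ h_prev + b["i"])   # input gate
    o_t = sigm(W["ox"] @ x_t + W["oh"] @ h_prev + b["o"])   # output gate
    c_t = f_t * c_prev + i_t * np.tanh(W["cx"] @ x_t + W["ch"] @ h_prev + b["c"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(2)
M, h = 3, 4
W = {k + s: rng.normal(size=(h, M if s == "x" else h)) * 0.1
     for k in "fioc" for s in "xh"}
b = {k: np.zeros(h) for k in "fioc"}
h_t, c_t = lstm_step(rng.normal(size=M), np.zeros(h), np.zeros(h), W, b)
print(h_t.shape, c_t.shape)  # (4,) (4,)
```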

3.1.3 GRU

GRU determines the hidden state ht using two types of gates, a reset gate rt and
an update gate zt , to store and forget past time series information. In Eq. (3.4),
fewer parameters than LSTM in Eq. (3.3) reduce the computational cost. The
reset gate rt controls the degree to which past hidden states are ignored. If rt = 0,
then the new hidden state h̃t is determined only from the input xt , and past
hidden states are completely ignored. The update gate zt controls the degree to
which the hidden state is updated. The structure of GRU plays the two roles of
the forget gate and the input gate in LSTM. The Whh (rt ⊙ ht−1 ) term in h̃t acts as
the forget gate, eliminating information to be forgotten from the past hidden
state, while the Whx xt term in h̃t acts as the input gate, weighting the newly
added information.


zt = sigm (Wzx xt + Wzh ht−1 ) ,
rt = sigm (Wrx xt + Wrh ht−1 ) ,
h̃t = tanh (Whx xt + Whh (rt ⊙ ht−1 )) ,
ht = zt ⊙ ht−1 + (1 − zt ) ⊙ h̃t ,    (3.4)
where the parameters in Eq. (3.4) are the same as in Eq. (3.3).
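Eq. (3.4) can be sketched analogously; the dimensions are hypothetical, and there are no biases, as in the equation.

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W):
    """One GRU step following Eq. (3.4): the reset gate r_t masks the past
    state, and the update gate z_t interpolates old and candidate states."""
    z_t = sigm(W["zx"] @ x_t + W["zh"] @ h_prev)            # update gate
    r_t = sigm(W["rx"] @ x_t + W["rh"] @ h_prev)            # reset gate
    h_tilde = np.tanh(W["hx"] @ x_t + W["hh"] @ (r_t * h_prev))
    return z_t * h_prev + (1 - z_t) * h_tilde

rng = np.random.default_rng(3)
M, h = 3, 4
W = {k + s: rng.normal(size=(h, M if s == "x" else h)) * 0.1
     for k in "zrh" for s in "xh"}
print(gru_step(rng.normal(size=M), np.zeros(h), W).shape)  # (4,)
```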

3.2 Deep Learning for Feature Extraction

Deep learning for feature extraction leverages neural networks for feature extraction and uses the extracted features for anomaly detection. Therefore, deep
learning and anomaly detection are completely independent. The features
preserve discriminative and representative information that helps to separate
anomalies from normal data. Feature extraction is a dimensionality reduction
technique that transforms the initial data from a high-dimensional space to a
low-dimensional latent space while preserving the properties of the original data.
Feature extraction creates new features as latent variables that cannot be directly
observed.
Formally, this approach can be represented as
Training of Neural Networks:

z = ϕ(x; Θ),    (3.5)

where ϕ(·) parameterized by Θ maps the initial data to the latent space. A function
that computes the anomaly score has no connection to ϕ(·).
Feature extraction is expected to reduce computational overhead, prevent
overfitting on the training dataset, and improve performance by reducing noise.
Feature extraction techniques include linear and nonlinear approaches. Unlike
linear data points, nonlinear data points cannot be fitted to a linear function,
making it difficult to capture the structure of the data points. However, since
most real-world data are nonlinear, various feature extraction techniques have


been proposed to extract nonlinear features. In particular, deep learning methods
are effective for the extraction of nonlinear features [6]. In [94], a variational
autoencoder (VAE) is used for feature extraction, and methods such as a local
outlier factor and a one-class support vector machine are used for anomaly
detection. Specifically, the VAE is learned in advance to reconstruct
using only normal data points. In anomaly detection, when unseen data points are
given to the learned VAE, low-dimensional features are obtained by the encoder
of the VAE, and traditional anomaly detection methods compute anomaly scores
from these features to detect anomalies.
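This two-stage pipeline can be sketched as follows; as stand-ins, a linear (PCA-style) projection replaces the learned VAE encoder and a k-nearest-neighbor distance replaces the traditional detector, with all data and names hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
X_normal = rng.normal(size=(300, 10))       # hypothetical normal training data

# Stage 1 -- feature extraction (stand-in for a learned VAE encoder):
# project onto the top principal directions of the normal data.
mu = X_normal.mean(axis=0)
_, _, Vt = np.linalg.svd(X_normal - mu, full_matrices=False)

def encode(X):
    return (X - mu) @ Vt[:3].T              # K = 3 latent dimensions

# Stage 2 -- a traditional detector (k-nearest-neighbor distance)
# computes anomaly scores in the latent space, independently of Stage 1.
Z_train = encode(X_normal)

def anomaly_score(x_new, k=5):
    d = np.sort(np.linalg.norm(Z_train - encode(x_new[None])[0], axis=1))
    return float(d[:k].mean())

print(anomaly_score(mu))                    # near the training data: low score
print(anomaly_score(np.full(10, 8.0)))      # far from the training data: high score
```

The key property illustrated is the independence of the two stages: the detector only sees the extracted features, never the original data points.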

3.3 Learning Feature Representations of Normality

Learning feature representations of normality aims to learn expressive normal representations using neural networks, where deep learning and anomaly detection
depend on each other in some way. The most critical problem in deep learning
is extracting important features from the original data that help solve a task.
Usually, when the dimensionality is reduced, not all the information can be
represented. If the properties of the original data are similar, the information to
be retained is small, and only the important information can be extracted using
neural networks for efficient learning. High-dimensional data poses many
problems for solving tasks, such as containing unnecessary information or
leaving important information unorganized. Therefore, learning feature representations that can
appropriately extract the information necessary to solve tasks is very important.
We further categorized learning feature representations of normality into
generic normality feature learning and anomaly measure-dependent feature
learning. The loss function of generic normality feature learning is specialized
for learning neural networks for feature extraction but not anomaly detection.
On the other hand, anomaly measure-dependent feature learning introduces
an anomaly measure used for anomaly detection to learn neural networks that
perform feature extraction.

3.3.1 Generic normality feature learning

Generic normality feature learning learns neural networks in advance to extract
normal features using only normal data in deep learning. If the neural networks
receive anomalous data points, they fail to extract the features. As a result,
anomalous data points are classified as anomalies in anomaly detection.
Formally, this approach can be represented as
Training of Neural Networks:

{Θ∗ , W∗ } = argmin_{Θ,W} Σ_{x∈X} L( ψ( ϕ(x; Θ); W ) ),    (3.6)

Anomaly Detection:

sx = A (x; Θ∗ , W∗ ) ,
where ϕ(·) parameterized by Θ maps the original data to the latent space, ψ(·)
parameterized by W is a surrogate learning task that operates on the latent space,
and L (·) is a loss function for learning the neural networks. A (·) is a function that
computes anomaly scores using Θ∗ and W∗ obtained during the training phase.
Autoencoder
The purpose of an autoencoder is to reconstruct data points so that the original
data points are accurately recovered [27]. An autoencoder for anomaly detection is
learned in advance, using only normal data, to reconstruct normal data points.
Given unseen data points, anomalies are declared when the reconstruction of the
data points fails. If normal data points are given as input, the learned
autoencoder reflects the properties of normality and thus reconstructs data
points that are close to the input. On the other hand, if anomalous data points are
given, the output data points cannot be reconstructed successfully because the
autoencoder has not been learned to extract the anomalous properties.
Autoencoder consists of an encoder and a decoder. The encoder realizes
dimensionality reduction and feature extraction, while the decoder realizes
data generation based on the extracted features. Specifically, the encoder maps
the input data points to the low-dimensional latent space while the decoder
reconstructs the data points from the projected latent space. The parameters of the


Figure 3.6: Autoencoder architecture. Autoencoder consists of an encoder and
a decoder. The encoder and decoder are learned to extract the feature z that
preserves important information for reconstruction.
two networks are learned with a reconstruction loss function. When mapping the
input data points to the latent space, two networks are learned to create features
that preserve important information for reconstruction. The difference between
the unseen input and the reconstructed data points gives an anomaly score.
The formulation of this method is given as follows.

Training of Neural Networks:

{Θ∗E , Θ∗D } = argmin_{ΘE ,ΘD } Σ_{x∈X} ∥ x − ϕD ( ϕE (x; ΘE ); ΘD ) ∥² ,    (3.7)

Anomaly Detection:

sx = ∥ x − ϕD ( ϕE (x; Θ∗E ); Θ∗D ) ∥² ,
where ϕE (·) parameterized by ΘE is an encoding function and ϕD (·) parameterized
by ΘD is a decoding function. sx is an anomaly score computed as the reconstruction
loss using Θ∗E and Θ∗D obtained during the training phase.
If anomalous data points are given, data points that deviate from initial data
points are reconstructed, resulting in larger anomaly scores.
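Reconstruction-based scoring can be sketched with a linear stand-in for the encoder and decoder: for the squared loss, the optimal linear pair is given in closed form by the top principal directions, so a PCA projection illustrates the mechanism without training a network; the data and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical normal training data lying near a 2-dimensional subspace of R^8.
Z = rng.normal(size=(500, 2))
A = rng.normal(size=(2, 8))
X_train = Z @ A + 0.05 * rng.normal(size=(500, 8))

# Closed-form optimal *linear* encoder/decoder for the squared loss:
# the top-K principal directions of the normal data.
mu = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
V = Vt[:2].T                                    # K = 2 latent dimensions

def score(x):
    """Squared reconstruction error as the anomaly score s_x."""
    z = (x - mu) @ V                            # encoder
    x_hat = mu + z @ V.T                        # decoder
    return float(np.sum((x - x_hat) ** 2))

x_normal = Z[0] @ A                             # consistent with the training data
x_anom = 3.0 * rng.normal(size=8)               # off the learned subspace
print(score(x_normal) < score(x_anom))  # True: anomalies reconstruct poorly
```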
Several autoencoder architectures have been proposed with promising results
in anomaly detection. Sparse autoencoder introduces sparsity constraints on
the middle-layer units [52]. Limiting the number of active units in the middle
layer, rather than reconstructing the input from all middle-layer units, extracts
valuable features for reconstructing the input data points. In the k-sparse
autoencoder, the k largest middle-layer units are used for reconstruction [46].

3.3 Learning Feature Representations of Normality
Denoising autoencoder is learned to recover the initial data points from input
data points corrupted by added noise [88]. Instead of mapping data points to the
low-dimensional latent space, the encoder of VAE outputs parameters (mean
µ and standard deviation σ) of a predefined distribution in the latent space for
every input [40, 18].

Figure 3.7: VAE architecture. An encoder outputs a mean µ and a standard
deviation σ of a predefined distribution in the latent space. A decoder reconstructs
data points corresponding to x using z sampled from the distribution with µ and
σ.
Replicator neural network (replicator NN) [25] first introduced the concept of
reconstruction to anomaly detection. Replicator NN consists of a feedforward
multi-layer perceptron with three middle layers. RandNet [12] learns multiple
independent autoencoders (called autoencoder ensemble) by ensemble learning
and integrates these results in anomaly detection. Then, the autoencoder ensemble
consists of multiple sparsely-connected autoencoders with different network
structures as shown in Figure 3.8. This is because the individual autoencoders can
become similar to one another if fully connected autoencoders are employed; in
that case, even if the autoencoders are integrated, there is no guarantee that they
will always perform better than the individual autoencoders.
The autoencoder ensemble has also been extended to time series data as
S-RNNs [36].
OmniAnomaly [78] is a probabilistic RNN model combining GRU and VAE.
Its core idea is to capture the robust normal representations of multivariate time
series data, reconstruct the input time window by the representations, and use
the reconstruction probabilities to determine anomalies.
USAD [5] is based on two autoencoders and learned within adversarial
training. As shown in Figure 3.9, in the two autoencoders (AE1 and AE2 ), an


Figure 3.8: Left: Two-layer fully connected autoencoder. Right: Two-layer
sparsely-connected autoencoder. Unlike the left figure, in the right figure, the
units in each layer are not all connected.
encoder is the same while decoders are different. By reconstructing the input
time window twice using two different autoencoders with different decoders,
anomalies similar to normal windows can also be detected. In the first stage, the
objective is to learn each autoencoder to reconstruct the input time window. In
the second stage, the objective is to learn AE2 to distinguish the real time window
from the time window coming from AE1 , and to learn AE1 to fool AE2 .

Generative Adversarial Networks
Generative Adversarial Networks (GANs) is an unsupervised deep learning
model based on a two-player minimax game between neural networks. As
shown in Figure 3.10, the standard GANs consists of two neural networks: i.e., a
generator (G) and a discriminator (D) [23]. G learns the data distribution over
real data points x by mapping from random latent variable z in the latent space to
the input space. If z is fed to G, realistic data point G(z) is generated. D classifies
whether a data point is real (authentic) or fake, i.e., generated data points by G.
The final goal of G is to generate realistic data points that can be classified as real
by D. On the other hand, D aims to distinguish the data points as either real or
fake. G and D are learned simultaneously and in competition with each other.


Figure 3.9: USAD architecture. In the first stage, the objective is to learn AE1
and AE2 to reconstruct the input window W. In the second stage, the objective is
to learn AE2 to distinguish the real window W from the time window AE1 (W)
coming from AE1 , and to learn AE1 to fool AE2 .
G does not have direct access to real data points; therefore, it learns
through the classification results of D.
Let pdata be a data distribution over initial data points and pz be a fixed latent
distribution over latent variables z. To learn G distribution pG over data points
x, G(z; ΘG ) represents the mapping of z to the input space X ⊆ RM , where G(·)
is differentiable function parameterized by ΘG . D(x; ΘD ) represents a second
differentiable function parameterized by ΘD . D(x) outputs the probabilities that
x came from pdata (close to one) rather than pG (close to zero). D is learned to
maximize the probability of assigning the correct label to x or G(z) from G, and G
is simultaneously learned to minimize log(1 − D(G(z))). In other words, D and G
play the two-player minimax game with the loss function min_G max_D L(D, G):

L(D, G) = Ex∼pdata (x) [log D(x)] + Ez∼pz (z) [log(1 − D(G(z)))],  (3.8)

where E[·] represents the expectation.


Figure 3.10: GANs architecture. A standard GANs consists of a generator and a
discriminator. The neural networks are learned through a two-player minimax
game.

In the training phase, while either ΘD or ΘG is being updated, the other
parameter is fixed. When G and D each perform ideally, pG = pdata , which is
equivalent to D(x) = 0.5 (see Appendix).
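This optimum can be checked numerically. The sketch below is a Monte-Carlo evaluation of Eq. (3.8) under the illustrative assumption that the generator has already matched the data, pG = pdata (both taken to be N(0, 1) here): the ideal discriminator D(x) = 0.5 attains the value −2 log 2, and any other constant discriminator scores lower.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumption: the generator has matched the data, p_G = p_data = N(0, 1).
x_real = rng.normal(size=100_000)   # x ~ p_data
x_fake = rng.normal(size=100_000)   # G(z) ~ p_G

def gan_value(d, x_real, x_fake):
    """Monte-Carlo estimate of L(D, G) in Eq. (3.8) for a discriminator d."""
    return np.mean(np.log(d(x_real))) + np.mean(np.log(1 - d(x_fake)))

# At the optimum D(x) = 0.5 everywhere, so L = log(1/2) + log(1/2) = -2 log 2.
L_opt = gan_value(lambda x: np.full_like(x, 0.5), x_real, x_fake)
assert abs(L_opt - (-2 * np.log(2))) < 1e-9

# Any other constant discriminator achieves a smaller value of L.
assert gan_value(lambda x: np.full_like(x, 0.7), x_real, x_fake) < L_opt
```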
GANs is utilized for anomaly detection [73, 42, 66, 67, 96, 2, 64, 24, 97, 3, 62,
72, 99, 38, 61, 50, 77, 34, 95]. AnoGAN performs anomaly detection using the
GANs model pre-learned with only normal data points [73]. Given data points
that do not follow the model, it is possible to classify them as anomalies. The
anomaly score in Eq. (3.9) for each unseen data point x is computed as

A (x) = αLG (x) + (1 − α)LD (x).  (3.9)

A (x) consists of two losses: one is the reconstruction loss LG (x) and the other
is the discrimination loss LD (x). The larger A (x) becomes, the more likely x is to be
anomalous. α is the coefficient weighting the two losses. The
larger the value of α, the greater the effect of the reconstruction loss, while the
larger the value of (1 − α), the greater the effect of the discrimination loss. LG (x)
is defined as
LG (x) = ∥x − G(z)∥1 ,  (3.10)

while LD (x) is defined as
LD (x) = σ(D(x), 1).  (3.11)

LG (x) is the L1 norm between a data point x and the reconstructed data point
G(z). After enough iterations for the similarity between G(z) and x to become
high, a latent variable z corresponding to the data point x is obtained, and G(z) is
generated by feeding z to G. Since the GANs model is learned using only normal
data points, it can generate data points that reflect the properties of normal data
points. On the other hand, when anomalous data points are received, they are
transformed into realistic normal data points to fool D. Therefore, if a normal
data point x is given to the learned GANs model, G(z) should become similar to x
and LG (x) will be small. If an anomalous data point x is given, the reconstruction
will not work, and the loss LG (x) will be large.
LD (x) is the cross-entropy loss σ(·) between the probability that D classifies a
data point x as real (i.e., a normal data point) and class one, where the class one
implies that the data point x is real. As the discriminator output D(x) approaches
zero, in other words, as D classifies x as fake, LD (x) becomes larger. Because D is
learned to correctly classify normal data points, given an anomalous data point x,
it is classified as fake, and D(x) approaches zero.
Anomaly scores for all data points x are computed using Eq. (3.9), and the data
points with high A (x) are classified as anomalies.
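The score computation of Eqs. (3.9)–(3.11) can be sketched as follows; the reconstruction G(z), discriminator output D(x), and α used here are hypothetical values, since the real quantities come from the learned GANs model.

```python
import numpy as np

def anomaly_score(x, g_of_z, d_prob, alpha=0.9):
    """AnoGAN score A(x) = alpha * L_G(x) + (1 - alpha) * L_D(x).

    g_of_z -- reconstruction G(z) found for x (hypothetical values here)
    d_prob -- discriminator output D(x), the probability that x is real
    """
    l_g = np.sum(np.abs(x - g_of_z))   # Eq. (3.10): L1 reconstruction loss
    l_d = -np.log(d_prob)              # Eq. (3.11): cross-entropy with class one
    return alpha * l_g + (1 - alpha) * l_d

x = np.array([1.0, 2.0])
# Normal case: good reconstruction and D is confident that x is real.
s_normal = anomaly_score(x, g_of_z=np.array([1.05, 1.95]), d_prob=0.95)
# Anomalous case: poor reconstruction and D classifies x as fake.
s_anom = anomaly_score(x, g_of_z=np.array([0.1, 0.3]), d_prob=0.05)
assert s_anom > s_normal
```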

Figure 3.11: AnoGAN architecture in the test phase. The anomaly score consists
of the reconstruction loss and the discrimination loss computed from the learned
generator and discriminator.
MAD-GAN is a GANs-based anomaly detection method for dealing with
multi-dimensional time series data [42]. In MAD-GAN, LSTM is employed for
both G and D of the standard GANs model to handle time series data.
G generates a time window W when it receives a latent variable z following the
Gaussian distribution as an input. D distinguishes the given time window W


Figure 3.12: MAD-GAN architecture in the training phase. GANs-based anomaly
detection is performed on time series data by introducing LSTM into a generator
and a discriminator of GANs.

Figure 3.13: MAD-GAN architecture in the test phase. When computing the
reconstruction loss, a latent variable z corresponding to a test window W is found
through inverse mapping.
as either real or fake. Reconstruction and discrimination losses are utilized to
compute anomaly scores for anomaly detection. As shown in Figure 3.13, the
latent variable z corresponding to the test time window is obtained by inverse
mapping. A latent variable z is first randomly sampled from the latent space and
fed to G, and the reconstruction loss of W is computed from the generated G(z).
The latent variable is then iteratively updated with the gradients obtained from the
loss function defined with W and G(z).
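This inverse-mapping step can be sketched with a toy differentiable generator (a hypothetical linear map standing in for the learned LSTM generator): z is initialized randomly and updated with the gradient of the reconstruction loss while G stays fixed.

```python
import numpy as np

# Toy differentiable generator G(z) = A z, a hypothetical stand-in for the
# learned LSTM generator of MAD-GAN.
A = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
G = lambda z: A @ z

w_test = G(np.array([0.5, -1.0]))   # a test window generated from a known z

# Inverse mapping: sample z randomly, then update it with the gradient of the
# reconstruction loss ||W - G(z)||^2 while G itself stays fixed.
rng = np.random.default_rng(0)
z = rng.normal(size=2)
for _ in range(200):
    grad = -2 * A.T @ (w_test - G(z))   # gradient of the loss w.r.t. z
    z -= 0.05 * grad

assert np.allclose(G(z), w_test, atol=1e-3)   # G(z) now reconstructs the window
```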
In addition to G and D in the standard GANs, bidirectional GAN (BiGAN)
introduces an encoder (E) that maps a data point x to a latent variable z. It avoids
the computationally expensive step of recovering latent variables during anomaly
detection. To learn E distribution pE over latent variables, E(x; ΘE ) represents
the mapping of x to the latent space Z ⊆ RK , where E(·) is differentiable function
parameterized by ΘE . Furthermore, D uses not only a data point (x or G(z)) but
also pairs of the data point and its latent variable (tuples (x, E(x)) or (G(z), z)) to

classify whether the data point is real or fake. Here, the latent variable is either
the output E(x) of E or the input z to G. Figure 3.14 shows the BiGAN architecture.

Figure 3.14: BiGAN architecture. It introduces an encoder that maps data points
to the latent space.
The final goal of E is to generate latent variables that create given data points.
In other words, E learns to map data points to latent space so that D erroneously
classifies the real data points as fake.
D(x, z; ΘD ) represents a differentiable function parameterized by ΘD . D(x, z)
outputs the probability that x is from pdata (real, close to 1) rather than pG (fake,
close to 0). D learns to maximize the probability of assigning the correct label
to (x, E(x)) and (G(z), z) from G. G is learned to minimize log(1 − D(G(z), z)).
Furthermore, E is learned to minimize log(D(x, E(x))) simultaneously with D and
G. In other words, D, E, and G play the two-player minimax game with loss
function min_{E,G} max_D L(D, E, G):

L(D, E, G) = Ex∼pdata (x) [log(D(x, E(x)))] + Ez∼pz (z) [log(1 − D(G(z), z))].  (3.12)
If E and G are optimal, then E = G−1 , i.e., G(E(x)) = x and E(G(z)) = z. E and G
cannot access each other: E cannot access data points generated by G, and similarly,
G cannot access latent variables generated by E. Nevertheless, to fool D, E and G
are learned to invert each other.


As shown in Figure 3.15, Efficient GAN-based anomaly detection is a BiGAN-based approach to anomaly detection [96].

Figure 3.15: Efficient GAN-based anomaly detection architecture. In anomaly
detection, G receives the latent variable E(x) of x, transformed by the learned E. D
receives the pair of x and its latent variable E(x).

D, E, and G of BiGAN are learned using only normal data points with the loss
defined in Eq. (3.12). Therefore, the learned GANs model reflects the properties of
the normal data points. Given data points that do not follow the model as input,
they are classified as anomalies.
The anomaly score A (x) in Eq. (3.13) for each unseen data point x is computed
as

A (x) = αLG (x) + (1 − α)LD (x).  (3.13)

Similar to Eq. (3.9), A (x) consists of two losses: one is the reconstruction loss
LG (x) and the other is the discrimination loss LD (x). The larger A (x) becomes,
the more likely x is to be anomalous. α is the coefficient weighting the two
losses. LG (x) is defined as
LG (x) = ∥x − G(E(x))∥1 ,  (3.14)

while LD (x) is defined as
LD (x) = σ(D(x, E(x)), 1).  (3.15)

Eq. (3.14) differs from Eq. (3.10) in that the unseen data point is mapped into
the latent space using the learned E and then reconstructed using the learned G.
Eq. (3.15) also differs from Eq. (3.11) in that the learned D is given a pair of the
unseen data point and its latent variable.

Transformer
Transformer is an encoder-decoder model in which the recurrence and convolution
mechanisms are replaced solely by the attention mechanism, which allows for
efficiently learning temporal dependencies [86]. Transformer is widely used in
the field of natural language processing. The architecture of Transformer is shown
in Figure 3.16. RNN and LSTM, frequently used to learn temporal dependencies,
cannot compute in parallel because they sequentially use the result computed at
the previous time step to compute the next one. Transformer achieves efficient
parallel computation through the attention mechanism.
Scaled dot-product attention
Attention is a mechanism for expressing where to focus attention on input
or for generating output according to focused data points. The inputs to scaled
dot-product attention, one of the attention mechanisms used in Transformer, are
queries and keys of dimension dk and values of dimension dv [86]. The inner
product between the query for which the feature (latent representation) is to be
obtained and all the keys is computed to obtain attention scores representing the
degree of association between the query and the keys, as shown in Figure 3.17.
The inner product is scaled by the square root of the dimension dk of the queries
and keys; if the dimension is large, the inner product may become too large,
and learning may not be successful. Then, the attention scores are converted so that
the sum is one using the softmax function. As shown in Figure 3.18, computing
the weighted sum of the converted attention score and values yields the feature
of the query.
Let Q be a matrix of queries, K be a matrix of keys, and V be a matrix of values.
The output of the scaled dot-product attention is presented as
attention(Q, K, V) = softmax(QK⊤/√dk) V.  (3.16)
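Eq. (3.16) can be implemented in a few lines; the matrix shapes below are illustrative.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))   # subtract max for stability
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention, Eq. (3.16)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # scaled query-key inner products
    return softmax(scores) @ V        # weighted sum of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 queries of dimension d_k = 4
K = rng.normal(size=(5, 4))   # 5 keys of dimension 4
V = rng.normal(size=(5, 2))   # 5 values of dimension d_v = 2
out = attention(Q, K, V)
assert out.shape == (3, 2)    # one d_v-dimensional feature per query
```

Each row of softmax(scores) sums to one, so every output row is a convex combination of the value vectors.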


Figure 3.16: Transformer architecture [86].
Multi-head attention
While a single attention mechanism is used in Eq. (3.16), multi-head attention
divides Q, K, and V into h parts Qi , Ki , Vi (1 ≤ i ≤ h) and uses h attention
mechanisms (heads) in parallel, each of which can obtain helpful information
from a different space of features [86]. Finally, the heads are converted into a
single vector using the projection weight WO as follows:

MultiHeadAtt(Q, K, V) = Concat(head1 , head2 , . . . , headh )WO ,  (3.17)

where headi = attention(Qi , Ki , Vi ).
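A sketch of Eq. (3.17), assuming for simplicity that Q, K, and V are already projected and that the model dimension is divisible by h:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention, Eq. (3.16).
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(Q, K, V, W_O, h):
    """Eq. (3.17): split Q, K, V into h heads, attend in parallel, concatenate, project."""
    heads = [attention(Qi, Ki, Vi)
             for Qi, Ki, Vi in zip(np.split(Q, h, axis=-1),
                                   np.split(K, h, axis=-1),
                                   np.split(V, h, axis=-1))]
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(6, 8))   # 6 positions, model dimension 8
W_O = rng.normal(size=(8, 8))         # projection weight
out = multi_head_attention(Q, K, V, W_O, h=2)
assert out.shape == (6, 8)
```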


Figure 3.17: Computation of attention scores. An attention score is obtained by
computing the inner product of the target query for which the feature is to be
obtained and all the keys.

Figure 3.18: Acquisition of query features. Computing the weighted sum of
the transformed attention scores and values by applying the softmax function
produces a feature of the target query.
TranAD [84] is a transformer-based anomaly detection method. It combines
two transformers, learns by adversarial training, and uses the reconstruction loss
to detect anomalies. Anomaly detection consists of two stages. The reconstruction
loss of the input time window in the first stage is given as input to the second
stage, and another decoder is used to reconstruct that input time window.


3.3.2 Anomaly measure-dependent feature learning

Anomaly measure-dependent feature learning learns neural networks to extract
features optimized for a measure of anomaly score. In generic normality feature
learning, neural networks to extract features are learned independently of the
anomaly measure. This approach uses the anomaly measure for anomaly detection
and learning feature extraction, as shown in Eq. (3.18). Therefore, feature
extraction that is more specialized for anomaly detection is expected.
Formally, this approach can be represented as

Training of Neural Networks:

{Θ∗ , W∗ } = argmin_{Θ,W} Σ_{x∈X} L( f (ϕ(x; Θ); W) ),  (3.18)

Anomaly Detection:

sx = f (ϕ(x; Θ∗ ); W∗ ),
where ϕ(·) parameterized by Θ maps original data points to the latent space, f (·)
parameterized by W is an anomaly measure that computes the anomaly score
on the latent space using Θ∗ and W∗ obtained during the training phase, and L is
the loss function used to learn the neural networks. In this section, we describe
methods that use three widely used anomaly measures: distance-based, one-class
classification-based, and clustering-based.
Distance-based anomaly detection
Distance-based anomaly detection learns neural networks to extract features
optimized for distance-based anomaly measures. In this approach, k-nearest
neighbor-based approach [65], average k-nearest neighbor-based approach [4],
local distance-based approach [98], and random nearest neighbor-based approach [57, 60, 79] have been proposed. While these traditional approaches
deal with the original data points as they are, the deep distance-based anomaly
detection approach computes the anomaly score using the distance between
samples (L1 norm or L2 norm) as the anomaly measure after transforming high-dimensional data points to the low-dimensional latent space. Suppose data points


with large dimensions are handled as they are. In that case, it is difficult to detect
anomalies because the distances between data points become closer due to the
curse of dimensionality.
The nearest neighbor-based approach in [89] learns neural networks for feature
extraction using the distance between two types of features of the same data
point: optimized and randomly projected. The anomaly measure used in learning
neural networks for feature extraction is also used to compute anomaly scores.
The formulation of this method is given as follows.

Training of Neural Networks:

Θ∗ = argmin_Θ Σ_{x∈X} ∥ϕ(x; Θ) − ϕ′ (x)∥² ,  (3.19)

Anomaly Detection:

sx = ∥ϕ(x; Θ∗ ) − ϕ′ (x)∥² ,
where ϕ(·) parameterized by Θ maps original data points to the lower-dimensional
latent space, and ϕ′ (·) is a random mapping function, i.e., ϕ(·) with fixed random Θ.
With a fixed ϕ′ (·), minimizing the loss function helps learn the frequency of
underlying patterns in the data. As a result, anomalies obtain larger anomaly scores
than normal data points, and this value is used directly for anomaly detection.
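The sketch below illustrates this idea with linear maps: a fixed random projection ϕ′ serves as the target, and a learnable map ϕ is trained to match it on synthetic normal data near the line y = x. All data, sizes, and rates are illustrative assumptions; off-manifold points then receive larger scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic normal training data concentrated near the line y = x.
t = rng.normal(size=(300, 1))
X = np.hstack([t, t]) + 0.05 * rng.normal(size=(300, 2))

# phi': a random projection whose weights stay fixed during training.
W_fixed = rng.normal(size=(2, 2))

# phi: a learnable linear map trained to match phi' on the normal data (Eq. (3.19)).
W = rng.normal(scale=0.1, size=(2, 2))
for _ in range(500):
    grad = 2 * X.T @ (X @ W - X @ W_fixed) / len(X)
    W -= 0.1 * grad

def anomaly_score(x):
    """s_x = ||phi(x) - phi'(x)||^2 with the learned W and fixed W_fixed."""
    return float(np.sum((x @ W - x @ W_fixed) ** 2))

# phi matches phi' only on the patterns seen during training, so points off
# the normal manifold receive larger scores.
assert anomaly_score(np.array([1.0, -1.0])) > anomaly_score(np.array([1.0, 1.0]))
```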

One-class classification-based anomaly detection
One-class classification-based anomaly detection learns neural networks to extract
features optimized for one-class classification [51, 68, 74, 81]. This approach
assumes that the training dataset consists of normal data points. It determines a
discriminative boundary so that the normal data points belong to a single class
and detects outliers based on that boundary. Support vector machines (SVM)
inspire most one-class classification models [16]. One-class SVM (OC-SVM or
ν-SVC) [74] and support vector data description (SVDD) [81] are widely used in
one-class classification models.
OC-SVM learns a hyperplane to separate all the data points from the origin
in a reproducing kernel Hilbert space (RKHS) and maximizes the distance from


this hyperplane to the origin. It allows data points located near the origin to
be considered anomalous. In particular, one-class neural networks (OC-NN),
which combines OC-SVM and neural networks, finds a hyperplane in the lowdimensional latent space rather than the high-dimensional input space [10, 54, 90].
The neural networks is learned so that the transformed features maximize the
distance (margin) from the origin to the hyperplane, as shown in Figure 3.19.

Figure 3.19: OC-NN architecture. OC-NN learns a neural network transformation ϕ(·; Θ) from the input space X to the latent space Z so that the distance from
the origin to the hyperplane is maximized.
For the input space X ⊆ RM and the latent space Z ⊆ RK , let ϕ(·; Θ) : X → Z
be a neural network with the weight matrix Θ from the input to the middle layers.
That is, ϕ(x; Θ) ∈ Z is the feature of x ∈ X given by network ϕ(·) with parameters
Θ. OC-NN aims to learn the parameters Θ of the neural networks and find the
hyperplane with a margin of maximization in the latent space Z. Given training
data points x1 , x2 , . . . , xN ∈ X, the formulation of OC-NN is defined as

Training of Neural Networks:

{w∗ , Θ∗ , r∗ } = argmin_{w,Θ,r} (1/2)∥w∥²₂ + (1/2)∥Θ∥²F + (1/(νN)) Σ_{i=1}^{N} max{0, r − w⊤ϕ(xi ; Θ)} − r,  (3.20)

Anomaly Detection:

sx = r∗ − w∗⊤ ϕ(x; Θ∗ ),

where w is the normal vector of the hyperplane, ∥ · ∥F denotes the Frobenius
norm, and r is the bias of the hyperplane. ν ∈ (0, 1] is a hyperparameter that controls
the number of data points allowed to cross the hyperplane to the origin side;
because, in practice, the training dataset may contain a small number of anomalies,
ν is introduced to loosen the condition. For a given test data point x ∈ X, an anomaly
score sx is computed using the learned w∗ , Θ∗ , and r∗ . If sx > 0, x is determined to be
anomalous.
In the approach in [54, 90], autoencoder is utilized for the neural networks to
enhance the representativeness of the features generated by ϕ(·; Θ), and the
reconstruction loss is added to Eq. (3.20).
One-class SVDD minimizes the volume of the hypersphere so that the normal
data points are enclosed within the hypersphere. As with OC-SVM, deep SVDD,
which combines SVDD and neural networks, finds a hypersphere in the low-dimensional latent space rather than the high-dimensional input space [69, 70].
The neural network is learned to minimize the volume of the sphere in which
the transformed features are enclosed, as shown in Figure 3.20.

Figure 3.20: Deep SVDD architecture. Deep SVDD learns a neural network
transformation ϕ(·; Θ) from the input space X to the latent space Z so that the
features are enclosed in a hypersphere with center c and radius R of minimum
volume. Normal data points are mapped near the center of the hypersphere,
while anomalies are mapped to the outside of the hypersphere [69].
For the input space X ⊆ RM and the latent space Z ⊆ RK , let ϕ(·; Θ) : X → Z be
a neural network with H ∈ N middle layers and a set of weights Θ = {W1 , W2 , . . . , WH }
where Wh is a weight of layer h ∈ {1, 2, . . . , H}. That is, ϕ(x; Θ) ∈ Z is a feature of
x ∈ X given by the neural networks ϕ(·) with parameters Θ. Deep SVDD aims
to learn the parameters Θ of the neural networks and minimize the volume of a


hypersphere with center c ∈ Z and radius R > 0 in the latent space Z, enclosing
features. Given training data points x1 , x2 , . . . , xN ∈ X, the formulation of deep
SVDD is defined as

Training of Neural Networks:

{R∗ , Θ∗ } = argmin_{R,Θ} R² + (1/(νN)) Σ_{i=1}^{N} max{0, ∥ϕ(xi ; Θ) − c∥² − R²} + (λ/2) Σ_{h=1}^{H} ∥Wh ∥²F ,  (3.21)

Anomaly Detection:

sx = ∥ϕ(x; Θ∗ ) − c∥² ,
where ν ∈ (0, 1] is a hyperparameter that controls the number of data points
mapped outside the hypersphere. The last term is a weight decay regularizer
on the neural networks parameters Θ with hyperparameter λ > 0, where ∥ · ∥F
denotes the Frobenius norm. For a given test data point x ∈ X, an anomaly score
sx is computed using the learned Θ∗ . If sx > R∗ ², x is determined to be anomalous.
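The scoring rule can be sketched as follows; the mapping ϕ, center c, and radius here are illustrative stand-ins for the quantities learned with Eq. (3.21) (the quantile-based radius is a common practical choice, not part of the formulation above).

```python
import numpy as np

rng = np.random.default_rng(0)

# Normal training data in the input space.
X = rng.normal(loc=[2.0, 2.0], scale=0.3, size=(200, 2))

# Hypothetical stand-in for the learned network phi: a fixed linear map.
W = np.array([[1.0, 0.5], [-0.5, 1.0]])
phi = lambda x: x @ W

# Center c: mean of the mapped training data (a common choice in practice).
c = phi(X).mean(axis=0)
# Radius: chosen so that a fraction nu of the training points falls outside.
nu = 0.05
R2 = np.quantile(np.sum((phi(X) - c) ** 2, axis=1), 1 - nu)

def anomaly_score(x):
    """s_x = ||phi(x) - c||^2; x is flagged anomalous when s_x exceeds R^2."""
    return float(np.sum((phi(x) - c) ** 2))

assert anomaly_score(np.array([2.0, 2.0])) <= R2   # close to the normal cluster
assert anomaly_score(np.array([5.0, -1.0])) > R2   # far from the normal cluster
```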
Clustering-based anomaly detection
Deep clustering-based anomaly detection learns neural networks that extract
features so that anomalies deviate from clusters in the learned latent space. When
clustering is performed on normal data points and anomalies, anomalies tend
to deviate from the clusters compared to normal data points. In this approach,
anomaly detection is based on clustering results using the distance between
cluster centers [33], the cluster size and distance from the cluster center [26], or the
cluster density [75, 9]. Another approach that has been proposed is to utilize a minimum
spanning tree that reflects the clustering results to perform anomaly detection [32].
Gaussian mixture model (GMM)-based anomaly detection is a type of this
approach, which uses a model combining multiple Gaussian distributions for
anomaly detection [20, 45]. Unlike standard clustering methods such as k-means,
which assign a single data point to a single cluster, clustering with GMM assumes
that the data comprise multiple clusters. Each cluster is generated based on a
Gaussian distribution, and the parameters of the Gaussian distributions (means,
variances) are estimated using the expectation–maximization (EM) algorithm [29].
Deep clustering learns neural networks that simultaneously compress the input
data points into low-dimensional features and cluster the compressed features.
Since clustering results are strongly dependent on the dataset, learning neural
networks to extract latent features that compress the properties of the input data
points and are optimized for clustering helps maintain clustering performance
across different datasets. Many methods based on deep clustering have been
proposed [8, 17, 22, 82, 91–93, 76, 100, 43].
Anomaly detection using unsupervised Gaussian mixture VAE (GMVAE)
consists of an encoder that generates features of input data points and a decoder
that reconstructs the data points [43]. This method combines VAE and GMM,
as shown in Figure 3.21. The encoder learns a posterior that generates features
of input data points to follow a prior, the standard Gaussian distribution. The
decoder learns a likelihood distribution to reconstruct the data points. In the test
phase in Figure 3.22, a mean µ and a standard deviation σ in the latent space
are obtained by feeding a test data point xtest to the learned encoder, and then
a feature z is sampled L times from the Gaussian distribution with µ and σ. A
reconstruction probability is obtained by feeding xtest to the Gaussian distribution
ˆ which is reconstructed by feeding each z to the
with mean µˆ and variance σ,
learned decoder. The average of the L reconstruction probabilities is an anomaly
score of xtest , and an anomaly is declared when the value is smaller than a
predefined threshold.

Figure 3.21: GMVAE architecture in the training phase. VAE-based GMM is
utilized for anomaly detection. Learning VAE and GMM enables the extraction
of features useful for anomaly detection.


Figure 3.22: GMVAE architecture in the test phase. Reconstruction probabilities
of test data points are obtained using the learned VAE and compared with a given
threshold to determine anomalies.

Figure 3.23: DAGMM architecture. DAGMM consists of a compression network
and an estimation network. DAGMM utilizes autoencoder to extract features of
input data points and GMM to perform anomaly detection.

Deep autoencoding GMM (DAGMM) utilizes autoencoder to extract features
of input data points and GMM to perform anomaly detection [100]. DAGMM can
extract features useful for anomaly detection by learning autoencoder and GMM
simultaneously. DAGMM consists of two main elements: a compression network
and an estimation network, as shown in Figure 3.23.
In the compression network, two types of processing are performed: extraction of the low-dimensional features by autoencoder and computation of the
reconstruction error. For the input space X ⊆ RM and the latent space Z ⊆ RK ,
given a data point x ∈ X, the compression network computes its low-dimensional

feature z ∈ Z shown as
zc = h(x; Θe ),
x′ = g(zc ; Θd ),
zr = f (x, x′ ),
z = [zc , zr ],
where h(·; Θe ) : X → Z parameterized by Θe and g(·; Θd ): Z → X parameterized
by Θd are an encoder and a decoder of autoencoder, respectively. zc is the
low-dimensional feature extracted by the encoder, and x′ is the counterpart of x
reconstructed by the decoder. f (·) is a function that computes the reconstruction
error, and its result is zr . Finally, zc and zr are concatenated and given as input to
the estimation network.
Given z, density estimation is performed under the GMM framework in the
estimation network. In the training phase with unknown Gaussian mixture
distribution, mixture means, and mixture covariances, the estimation network
estimates the parameters of the GMM without using the EM algorithm by using
multi-layer neural networks. Given a low-dimensional feature z and an integer
K as the number of Gaussian distributions, the estimation network uses the
multi-layer neural networks and the softmax function to compute the probability
that x corresponding to z belongs to each of the K Gaussian distributions, shown as

p = MLN(z; Θm ),
γˆ = softmax(p),
where γˆ is a K-dimensional vector representing the probability of belonging to
each of the K Gaussian distributions, obtained by applying the softmax function to the
output p of the multi-layer neural networks MLN(·; Θm ) : Rp → RK parameterized
by Θm . By using γˆ i (i = 1, 2, . . . , N) computed for N training data points x1 , x2 , . . . , xN ,
the parameters of K Gaussian distributions are estimated. An anomaly score
is computed using the estimated parameters after obtaining a low-dimensional
feature z in the test phase. An anomaly is declared if this value exceeds a
predefined threshold.
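The two networks can be sketched with hypothetical linear stand-ins for the learned encoder h, decoder g, and estimation network MLN (all weights below are illustrative, not learned):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear stand-ins for DAGMM's learned networks: encoder h,
# decoder g, and the estimation network's multi-layer network MLN.
W_e, W_d = rng.normal(size=(2, 4)), rng.normal(size=(4, 2))
W_m = rng.normal(size=(3, 3))                 # K = 3 Gaussian components
enc = lambda x: W_e @ x                       # h(x; Theta_e)
dec = lambda z_c: W_d @ z_c                   # g(z_c; Theta_d)
mln = lambda z: W_m @ z                       # MLN(z; Theta_m)

def compression_network(x):
    """z = [z_c, z_r]: low-dimensional feature plus reconstruction error."""
    z_c = enc(x)
    z_r = np.sum((x - dec(z_c)) ** 2)         # f(x, x')
    return np.concatenate([z_c, [z_r]])

def estimation_network(z):
    """gamma-hat = softmax(MLN(z)): soft membership over the K components."""
    p = mln(z)
    e = np.exp(p - p.max())
    return e / e.sum()

z = compression_network(rng.normal(size=4))
gamma = estimation_network(z)
assert z.shape == (3,) and np.isclose(gamma.sum(), 1.0)
```

The soft memberships γ̂ computed this way over the training set are then used to estimate the GMM parameters without the EM algorithm.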


3.4 End-to-End Anomaly Score Learning

End-to-end anomaly score learning aims to learn scalar anomaly scores in an
end-to-end fashion. Compared to anomaly measure-dependent feature learning,
it computes anomaly scores without relying on an anomaly measure, which means
it learns neural networks that output anomaly scores directly. Furthermore, this
approach learns to extract features and compute anomaly scores simultaneously.
In contrast, deep learning for feature extraction uses some heuristics to compute
anomaly scores after features are extracted.
For the input space X ⊆ RM , let τ(·; Θ) : X → R be a neural network parameterized by Θ to compute anomaly scores. That is, τ(x; Θ) ∈ R is the anomaly
score of x ∈ X given by the neural network τ. Formally, this approach can be
defined as

Training of Neural Networks:

Θ∗ = argmin_Θ Σ_{x∈X} L (τ(x; Θ)) ,  (3.22)

Anomaly Detection:

sx = τ(x; Θ∗ ).
For a given test data point x ∈ X, an anomaly score sx is computed using learned
τ with Θ∗ .
The critical points in this approach are the integration of the anomaly score
computation with the neural networks that extract features, and the design of
the loss function for learning to compute anomaly scores directly. This section
describes end-to-end one-class classification models applicable to time series
data.

3.4.1 End-to-end one-class classification models

End-to-end one-class classification models learn a one-class classifier that determines whether a given data point is normal or not [71]. Instead of learning
neural networks to extract features to optimize one-class classification, such as
OC-SVM or SVDD, as in the approach in Section 3.3.2, this approach learns neural


networks in an end-to-end fashion to output anomaly scores directly. Most of
this approach combines a GANs-style adversarial training framework with an
end-to-end framework for one-class classification, where a discriminator learns
to discriminate normal data points from adversarially generated anomalies. The
first difference from GANs-based anomaly detection in Section 3.3.1 is that the
generative model, which corresponds to a generator in GANs, generates new
data points when normal data points are given as input. While the discriminator
distinguishes between normal data points and data points generated by the
generative model, the generative model generates realistic data points to fool the
discriminator. Another difference is that GANs-based anomaly detection derives a separate indicator from the output of the discriminator to determine anomalies, whereas, in this approach, the discriminator directly determines anomalies.

The adversarially learned one-class classification shown in Figure 3.24 consists of two modules, a network R and a network D, and the two networks are learned using adversarial learning [71]. The network R generates R(x) to reconstruct the input data point x and tries to fool the network D into determining R(x) to be the original data. The network D tries to determine that x belongs to the normal class and that R(x) does not. The output of the network D is the likelihood that a given input data point belongs to the normal class. In the training phase, the input data point with Gaussian noise is given to the network R so that the model of these two networks is robust to noisy or corrupted inputs.

In contrast to the standard GANs, instead of mapping the latent space Z to a data point with the distribution p_data, the network R maps

x̃ = (x ∼ p_data) + (ζ ∼ N(0, σ²I)) → R(x̃) ∼ p_data,   (3.23)

where ζ is an added noise sampled from the normal distribution N(0, σ²I) with standard deviation σ.


The network R and the network D play the two-player minimax game with loss function min_R max_D L(D, R):

Training of Neural Networks:

L(D, R) = E_{x∼p_data(x)}[log D(x)] + E_{x̂∼p_x̂(x+N(0,σ²I))}[log(1 − D(R(x̂)))] + ∥x − R(x̂)∥²₂,   (3.24)

Anomaly Detection:

s_x = D(R(x)),
where p_data represents the real data distribution. The network R generates data
points with the probability distribution of pdata , and as a result its own distribution
is given by pR ∼ R(x ∼ pdata ; ΘR ), where ΘR is the parameter of the network R. For
a given test data point x ∈ X, an anomaly score sx is computed using the learned
networks R and D, and if it is less than a predefined threshold, it is determined to
be anomalous. The flow of the anomaly detection is shown in Figure 3.25.

Figure 3.24: Adversarially learned one-class classification in the training phase. It consists of two modules, the network R and the network D, and the two networks are learned using adversarial training. The network R is optimized to reconstruct data points belonging to the normal class, while the network D tries to classify input data points into normal and non-normal classes. The network D outputs the likelihood of the given input data point belonging to the normal class.
Fence GAN (FGAN) uses loss functions for a generator (G) and a discriminator (D) that differ from those of the standard GANs [53]. Whereas the standard GANs aims to generate p_G = p_data, that is, to generate data points in regions of high data
density, the loss function of G in FGAN aims to generate data points around the boundary of X.

Figure 3.25: Adversarially learned one-class classification in the test phase. An anomaly score is computed using the learned networks R and D. Given a test data point x, R(x) is reconstructed using the network R, and R(x) is given as input to the network D. The network D outputs the probability D(R(x)) that x belongs to the normal class, and if this value is less than a predefined threshold, an anomaly is declared.

Similar to the standard GANs, D in FGAN tries to classify real
data points correctly or classify generated data points from G correctly. FGAN
does not need to rely on the reconstruction loss from G and does not require
modifications to the standard GANs architecture, such as the introduction of the
encoder. The loss function of G consists of an encirclement loss and a dispersion
loss. In the encirclement loss, G is learned to generate data points around the
boundary. In addition, the dispersion loss is introduced to the loss function to
avoid the mode collapse problem that generates only certain types of data in
GANs. The dispersion loss maximizes the distance of generated data points from
their center. G and D play the minimax game with loss functions min_G L(G) and max_D L(D):

Training of Neural Networks:

L(G) = E_{z∼p_z}[log(|α − D(G(z; Θ_G); Θ_D)|)] + β / E_{z∼p_z}[∥G(z; Θ_G) − μ∥₂],

L(D) = E_{x∼p_data}[log(D(x; Θ_D))] + E_{z∼p_z}[γ log(1 − D(G(z; Θ_G); Θ_D))],   (3.25)

Anomaly Detection:

s_x = D(G(x; Θ*_G); Θ*_D),

where α ∈ (0, 1) is used for G to generate data points around the boundary, β ∈ R⁺ is the dispersion hyperparameter, and μ ∈ R^M is the center of the generated data points. γ ∈ (0, 1] is an anomaly hyperparameter: if γ is less than one, D focuses more on correctly classifying the real data points; conversely, as γ approaches one, D places relatively more weight on correctly classifying generated data points. For a given test data point x, an anomaly score s_x is computed using Θ*_G and Θ*_D obtained during the training phase.
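As a sketch of how the loss values in Eq. (3.25) could be computed for a mini-batch, the following assumes the discriminator outputs are already available as arrays; the batch values and hyperparameter settings here are hypothetical, not taken from FGAN's reference implementation.

```python
import numpy as np

def fgan_generator_loss(d_gen, g_samples, alpha=0.5, beta=15.0):
    """Generator loss of Eq. (3.25): an encirclement term that pulls
    D(G(z)) toward alpha, plus a dispersion term that rewards spread
    of the generated batch around its center mu."""
    encirclement = np.mean(np.log(np.abs(alpha - d_gen)))
    mu = g_samples.mean(axis=0)                      # batch center
    dispersion = beta / np.mean(np.linalg.norm(g_samples - mu, axis=1))
    return encirclement + dispersion

def fgan_discriminator_loss(d_real, d_gen, gamma=0.5):
    """Discriminator objective of Eq. (3.25); gamma < 1 down-weights
    the generated-data term so D focuses on the real data points."""
    return np.mean(np.log(d_real)) + gamma * np.mean(np.log(1.0 - d_gen))

g = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
near = fgan_generator_loss(np.array([0.45, 0.55, 0.45, 0.55]), g)
far = fgan_generator_loss(np.array([0.9, 0.9, 0.9, 0.9]), g)
# near < far: discriminator outputs closer to alpha give G a smaller loss
```

The comparison illustrates the encirclement behavior: G is rewarded for outputs that D scores near α rather than near 1, so generated points settle at the boundary rather than in high-density regions.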

Chapter 4
Anomaly Detection Model
Combining GANs and RNN
In this chapter, we propose a method that combines GANs and RNN to detect
anomalies in multivariate time series data with high accuracy. First, we describe
the architecture of the proposed method based on the features of RNN. We experimentally evaluate the proposed and existing methods’ performance on datasets
containing collective anomalies and confirm the effectiveness of combining GANs
and RNN for anomaly detection.

4.1 Introduction

In anomaly detection in multivariate data, it is necessary to examine a large number of variables simultaneously, which complicates the detection criteria and makes it challenging to determine anomalies. GANs, one of the deep learning models,
can handle complex data, such as multivariate data. However, GANs cannot
handle time series data. In this chapter, we propose a method to apply RNN to
the encoder, generator, and discriminator of BiGAN, a type of GANs, for anomaly
detection of multivariate time series data. We perform several experiments on
artificial datasets with collective anomalies created by swapping data points at
different times.


4.2 Problem Formulation

In this thesis, we deal with multivariate time series data. A time series data T is
a series of T consecutive data points.
T = {x1 , x2 , . . . , xT },
where x_t is the data point measured at a particular time t, and each data point has one or more variables. In particular, a one-dimensional time series x_1, x_2, ..., x_T consists of data points x_t with a single variable. In order to reflect the relationship between the data point x_t at time t and the data points at earlier times in the anomaly detection model, data points are treated as a window W_t of length L consisting of L data points, rather than individually.
Wt = {xt−L+1 , xt−L+2 , . . . , xt }.
Instead of the original time series data T, consecutive time windows W = {W_1, W_2, ..., W_T} are given to the anomaly detection model as a training dataset to learn this model. In the training phase, unsupervised learning is employed. In other words, by learning the anomaly detection model in advance using only W, the set of normal time windows W_t, it is possible to detect an anomaly when an unseen anomalous time window Ŵ_t (t > T) (corresponding to W_t above) that differs from this learned model is given.

Figure 4.1: Time windows of length L consisting of L data points transformed
from a time series data. This figure is shown for L = 3.


In anomaly detection, unseen time windows Ŵ (corresponding to W above) are given to the learned anomaly detection model as a test dataset. For each time window Ŵ_t, an anomaly label y_t (y_t ∈ {0, 1}) is predicted. Where y_t = 1, it means that Ŵ_t is anomalous, and where y_t = 0, it means that it is normal. The anomaly detection model is used to compute an anomaly score s_t, which represents the degree of the anomaly, to predict an anomaly label y_t for each unseen time window Ŵ_t. Suppose s_t exceeds the predefined threshold. In that case, it means that Ŵ_t is significantly different from the features of W and is determined to be anomalous (y_t = 1). Hereafter, for simplicity, let W be the training time window used in the training phase and Ŵ be the unseen window used in the test phase.
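As a small illustration of this windowing (a hypothetical sketch, not the thesis code), a series of shape (T, M) can be transformed into the overlapping windows of Figure 4.1:

```python
import numpy as np

def to_windows(series, L):
    """Transform a time series of shape (T, M) into overlapping
    windows W_t = {x_{t-L+1}, ..., x_t} of length L (Figure 4.1)."""
    T = series.shape[0]
    return np.stack([series[t - L + 1: t + 1] for t in range(L - 1, T)])

series = np.arange(12, dtype=float).reshape(6, 2)  # toy series, T=6, M=2
W = to_windows(series, L=3)                        # 4 windows of shape (3, 2)
```

For L = 3 and T = 6, this yields the four windows W_3, ..., W_6 (1-indexed), each sharing L − 1 data points with its neighbor.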

4.3 Data Preprocessing

In order to construct a highly robust model, each data point xt in the training and
test dataset was normalized to the range [0, 1).
x_t ← (x_t − min(T)) / (max(T) − min(T) + ϵ),

where min(T) and max(T) are the element-wise minimum and maximum vectors of the series, and ϵ is a small constant vector to avoid division by zero.
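The normalization above can be sketched as follows (a minimal illustration; the array X and the value of ϵ are hypothetical):

```python
import numpy as np

def normalize(X, eps=1e-8):
    """Element-wise min-max normalization of each variable to [0, 1),
    with eps avoiding division by zero for constant variables."""
    mn, mx = X.min(axis=0), X.max(axis=0)
    return (X - mn) / (mx - mn + eps)

X = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
Xn = normalize(X)   # each column now lies in [0, 1)
```

Note that the statistics here come from the series being normalized, following the text; a stricter deployment protocol would reuse the training-set minimum and maximum for test data.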

4.4 Multivariate Time Series Anomaly Detection Model

4.4.1 Overall architecture

We developed a method of multivariate anomaly detection with recurrent unit-GAN (MARU-GAN) consisting of three neural networks: an encoder (E), a generator (G), and a discriminator (D). Adversarial training is performed on the three neural networks. Introducing RNN into the three neural networks makes
it possible to reflect temporal dependencies in time series data.
The overview of MARU-GAN is shown in Figure 4.2. The training process is
summarized in Algorithm 1.


Figure 4.2: Proposed architecture in anomaly detection. It consists of an encoder,
a generator, and a discriminator. These three neural networks are learned by
adversarial training.
Algorithm 1 MARU-GAN training algorithm
Input: Encoder E, Generator G, Discriminator D,
    Normal windows for training W = {W_1, W_2, ..., W_T},
    Number of epochs N
Output: Learned E, G, D
  E, G, D ← initialize weights
  n ← 0
  while n < N do
    for t = 1 to T do
      W_t^E ← E(W_t)
      z ← random latent variable
      W_t^G ← G(z)
      L = E_{W_t∼p_data(W_t)}[log(D(W_t, W_t^E))] + E_{z∼p_z(z)}[log(1 − D(W_t^G, z))]
      update weights of E, G, D using L
    end for
    n ← n + 1
  end while
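To make the loss L in Algorithm 1 concrete, the following sketch evaluates it for one window with deliberately simple stand-in networks. The stubs E, G, and D are hypothetical (not the RNN architectures described below): D maximizes L while E and G minimize it during adversarial training.

```python
import numpy as np

def bigan_loss(d_real_pair, d_fake_pair):
    """L = log D(W_t, W_t^E) + log(1 - D(W_t^G, z)) for one step."""
    return np.log(d_real_pair) + np.log(1.0 - d_fake_pair)

# Stand-in networks: the encoder averages over time, the generator
# repeats the latent variable, and the discriminator compares the
# window's time-average with the latent variable it is paired with.
def E(W):
    return W.mean(axis=0)

def G(z):
    return np.tile(z, (3, 1))

def D(W, z):
    score = -np.abs(W.mean(axis=0) - z).sum()
    return 1.0 / (1.0 + np.exp(-score))     # probability of "real"

rng = np.random.default_rng(0)
W_t = rng.normal(size=(3, 2))               # one window, L=3, M=2
z = rng.normal(size=2)                      # random latent variable
L_val = bigan_loss(D(W_t, E(W_t)), D(G(z), z))
```

With these stubs both pairs receive D = 0.5, so L_val = 2 log 0.5; a stronger discriminator (higher output on the real pair, lower on the generated pair) raises L, which is exactly what D's maximization pursues.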

4.4.2 Encoder-decoder model

Encoder-decoder model [80] transforms an input time window into another. It
is commonly used for machine translations and media conversions. It consists
of an encoder and a decoder.

Figure 4.3: Encoder-decoder model with RNN architecture. An encoder receives an input time window ABC and produces WXY as the output time window at a decoder.

Using RNN in the encoder and decoder makes it possible to learn temporal dependencies among data points in the time window
and reconstruct the time window based on such features. First, the encoder
compresses the input time window into a low-dimensional hidden state vector h.
The compressed vector h includes essential features of the time window. Based
on the received hidden state vector h, the decoder gives the output and hidden
state of the previous time as input for the next time and then generates the data
points of the output time window.

Training of MARU-GAN
When receiving a random latent variable z, G generates a realistic time window
G(z). E maps a time window W to the latent space and obtains a latent variable
E(W), which plays the opposite role of G. When D receives a tuple of a time window and a latent variable ((W, E(W)) or (G(z), z)) from E or G, it determines whether the time window is real or fake. In the training phase, D, E, and G play the two-player minimax game with loss function min_{E,G} max_D L(D, E, G):

L(D, E, G) = E_{W∼p_data(W)}[log(D(W, E(W)))] + E_{z∼p_z(z)}[log(1 − D(G(z), z))].   (4.1)


Figure 4.4: Encoder architecture. When E receives a time window W, it outputs
a latent variable E(W) with compressed features of W using RNN with three
hidden layers.
Encoder training
E generates a hidden state that compresses the features of the input time window.
RNN is used for E in MARU-GAN to reflect temporal dependencies among
data points. Figure 4.4 shows the encoder network in MARU-GAN. E obtains
a time window W of fixed length T as input and outputs the last hidden states (h_T^(1), h_T^(2), ..., h_T^(M)) as E(W). The features of the time window are compressed in E(W). In this thesis, we implemented the RNN with M = 3 hidden layers, since adding hidden layers tended to improve accuracy.
E is learned to minimize the loss function of Eq. (4.2) so that W is determined to be fake by D.

L(E) = E_{W∼p_data(W)}[log(D(W, E(W)))].   (4.2)

Generator training
G is learned to generate a time window such that D is erroneously determined
to be real. RNN is used for G in MARU-GAN to reflect temporal dependencies
in time series data. Figure 4.5 shows the generator network in MARU-GAN. G obtains latent variables z = (s_1^(1), s_1^(2), ..., s_1^(M)) and sets them as the first hidden
states at t = 1. Then, the output data point and hidden state of the previous time are given as input for the next time, and the data point x′_t is generated at each time t. The output time window x′_1, x′_2, ..., x′_T is defined as G(z). In this thesis, we implemented RNN with M = 3 hidden layers as in E.

Figure 4.5: Generator architecture. G receives z = (s_1^(1), s_1^(2), s_1^(3)) and sets them as the first hidden state of RNN with three hidden layers at t = 1. G generates data point x′_t at each time t from the output data point and hidden state of the previous time.
G is learned to minimize the loss function of Eq. (4.3) so that G(z) is determined to be real by D.

L(G) = E_{z∼p_z(z)}[log(1 − D(G(z), z))].   (4.3)

Discriminator training
D outputs the probability that the given time window is real. RNN and FNN are
used for D in MARU-GAN. D receives a tuple of a time window and a latent
variable from E or G. Inputting the latent variable simultaneously with the time
window improves discrimination accuracy. RNN generates a latent variable with
compressed features of the given time window. FNN computes the output of
D using the latent variable from RNN and the input latent variable. Figure 4.6
shows the discriminator network in MARU-GAN.
In RNN, the last hidden state h_T^dis is obtained by inputting the time window W from E or G(z) from G. The features of the given time window are compressed in h_T^dis. A combination of the obtained h_T^dis and the latent variable, E(W) from E or z from G, is input to FNN. FNN outputs the probability that the given time window is real. This result is the output of D.

Figure 4.6: Discriminator architecture. D consists of RNN and FNN. RNN generates a latent variable with compressed features of the given time window, and FNN computes the probability that the given time window is real.

Algorithm 2 MARU-GAN anomaly detection algorithm
Input: Learned Encoder E, Generator G, Discriminator D,
    Unseen windows for testing Ŵ = {Ŵ_1, Ŵ_2, ..., Ŵ_T′},
    Parameters α, r
Output: Anomaly labels y_t (t = 1, 2, ..., T′, y_t ∈ {0, 1})
  for t = 1 to T′ do
    Ŵ_t^E ← E(Ŵ_t)
    Ŵ_t^G ← G(Ŵ_t^E)
    A(Ŵ_t) = α∥Ŵ_t − Ŵ_t^G∥₁ + (1 − α)σ(D(Ŵ_t, Ŵ_t^E), 1)
  end for
  Define the r% of time windows with the highest A(Ŵ_t) as anomalous (y_t ← 1)
  Define the other time windows as normal (y_t ← 0)
The discriminator is learned to correctly distinguish (W, E(W)) received from E
or (G(z), z) received from G as either real or fake by maximizing the loss function
of Eq. (4.4).
L(D) = E_{W∼p_data(W)}[log(D(W, E(W)))] + E_{z∼p_z(z)}[log(1 − D(G(z), z))].   (4.4)

Anomaly detection using learned model
E, G, and D are learned using a training dataset with only normal time windows,
and anomaly scores for unseen input time windows are computed using Eq. (4.5).
The anomaly detection process is summarized in Algorithm 2.


An anomaly score A(Ŵ) of a time window Ŵ is computed as the sum of two losses: one is the reconstruction loss, and the other is the discrimination loss. A(Ŵ) is defined as

A(Ŵ) = α∥Ŵ − G(E(Ŵ))∥₁ + (1 − α)σ(D(Ŵ, E(Ŵ)), 1),   (4.5)

where α is the coefficient weighting the two losses. The larger A(Ŵ) becomes, the more likely the time window is to be anomalous. In the reconstruction loss of MAD-GAN in Section 3.3.1, the latent variable z corresponding to the input time window Ŵ is obtained with enough iterations that G(z) and Ŵ are highly similar [42]. On the other hand, MARU-GAN can obtain a latent variable as E(Ŵ) directly from the learned E. E(Ŵ) is a feature representation that captures the important features of Ŵ because E has learned a mapping from time windows to the latent space through the training phase. Moreover, G generates a time window G(E(Ŵ)) from E(Ŵ) rather than from the random latent variables of MAD-GAN. In the discrimination loss, the input to D is not only a time window but also its latent variable. Because there is more information to determine whether a given time window is real or fake, the discrimination accuracy of D is improved compared to MAD-GAN.
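Eq. (4.5) can be sketched with the same kind of stand-in networks (hypothetical stubs, not the learned RNNs). The σ(·, 1) term is interpreted here as the cross-entropy of D's output against the "real" label, i.e. −log D, which is an assumption of this sketch.

```python
import numpy as np

def anomaly_score(W_hat, E, G, D, alpha=0.7):
    """Eq. (4.5): alpha * L1 reconstruction loss plus (1 - alpha) *
    discrimination loss, with sigma(D(.), 1) taken to be -log D."""
    z = E(W_hat)
    recon = np.abs(W_hat - G(z)).sum()       # ||W - G(E(W))||_1
    disc = -np.log(D(W_hat, z))              # cross-entropy vs label 1
    return alpha * recon + (1 - alpha) * disc

# Stand-in networks (hypothetical): mean-encoder, repeat-generator,
# and a discriminator that always outputs 0.5.
E = lambda W: W.mean(axis=0)
G = lambda z: np.tile(z, (3, 1))
D = lambda W, z: 0.5

W_normal = np.ones((3, 2))                     # perfectly repetitive window
W_anom = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])  # changed behavior
s_norm = anomaly_score(W_normal, E, G, D)
s_anom = anomaly_score(W_anom, E, G, D)
# s_anom > s_norm: the anomalous window reconstructs poorly
```

Even with a non-committal discriminator, the reconstruction term alone separates the two windows, illustrating why a failed reconstruction drives the score up.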

4.5 Experiments and Results

This section evaluates whether MARU-GAN can detect collective anomalies in
time series data. As anomaly datasets used in experiments, we created datasets
with collective anomalies by swapping the data points at different times for SWaT
and WADI datasets [49]. We used Efficient GAN-based AD using GANs, EncDec-AD with an encoder-decoder model, LSTM-AD with LSTM, and MAD-GAN
using GANs as comparative methods.

4.5.1 Public datasets

SWaT dataset
The SWaT dataset consists of 51-dimensional data points collected from a replica
of the latest water treatment plant. The water treatment consists of a six-stage


process. Stage 1 of the physical process begins by taking in raw water, followed
by chemical dosing (Stage 2), filtering it through an ultrafiltration (UF) system
(Stage 3), dechlorination using UV lamps (Stage 4), and then feeding it into a
reverse osmosis (RO) system (Stage 5). A backwash process (Stage 6) cleans the
membranes in UF using the RO permeate. In each process, sensors and actuators were used to measure indicators such as level indication transmitter, flow indication transmitter, conductivity, pH, oxidation-reduction potential, differential pressure indicator transmitter, and pressure indicator transmitter, once per second. As a result, 51 kinds of variables were measured in each process. In
this thesis, we conducted experiments on each method using 475,200 normal
data points of the dataset collected from an operational water treatment plant for
seven days.

WADI dataset
The WADI dataset consists of 123-dimensional data points collected from a WADI
testbed. WADI distributes water after water treatment processes by SWaT. WADI
consists of a three-stage process. The first process is raw water intake from SWaT,
the Public Utility Board inlet or the return water from WADI, and the raw water
storage in two tanks (Stage 1). Water from two elevated reservoir tanks and six
consumer tanks is distributed based on a demand pattern (Stage 2). Further,
water is recycled and sent back to Stage 1 (Stage 3). In each process, sensors
and actuators were used to measure indicators such as level indication transmitter, flow indication transmitter, conductivity, pH, oxidation-reduction potential, turbidity, total residual chlorine, pressure meter, flow totalizer, and modulating valve, once per second. As a result, 123 kinds of variables were measured
in each process. In this study, we conducted experiments on each method using
1,048,560 normal data points of the dataset collected from an operational WADI
testbed.

4.5.2 Generating collective anomalies

We created experimental datasets by artificially generating collective anomalies
and mixing them into the SWaT and WADI datasets.

Table 4.1: Characteristics of datasets. We utilized two multivariate time series datasets to evaluate the proposed method.

Datasets   Train     Validation   Test      Dimension
SWaT       380,160   47,520       47,520    51
WADI       838,848   104,856      104,856   123
As a preprocessing step, each dataset consisting of normal data points was
divided into a training, validation, and test dataset in a ratio of 8:1:1. Table 4.1
summarizes the characteristics of each dataset. Each of the three datasets was
transformed into a set of time windows of length L consisting of L data points.
Training datasets were used to learn the model. Validation datasets were used for
evaluation after each training epoch, and the model with the highest F1-score
was finally selected. Test datasets were used to evaluate anomaly detection with the model selected using the validation dataset in the training phase.
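The 8:1:1 split can be sketched as follows (a minimal, order-preserving illustration; the helper name is hypothetical):

```python
import numpy as np

def split_8_1_1(X):
    """Split a dataset into training, validation, and test parts in a
    ratio of 8:1:1, preserving temporal order (no shuffling)."""
    n = len(X)
    n_train, n_val = int(n * 0.8), int(n * 0.1)
    return X[:n_train], X[n_train:n_train + n_val], X[n_train + n_val:]

# SWaT's 475,200 normal points split into 380,160 / 47,520 / 47,520
train_set, val_set, test_set = split_8_1_1(np.zeros(475200))
```

Keeping the split contiguous in time avoids leaking future data points into the windows of the training set.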
Unlike the training dataset, the validation and test datasets are used for evaluation and need to contain collective anomalies. We generated collective anomalies in which the data points themselves are normal, but their behavior is anomalous, by swapping normal data points with other normal data points at different times. The process of generating collective anomalies is
shown in Figure 4.7.

Figure 4.7: Process of creating collective anomalies. Normal data points at
different times are swapped.

1. Randomly select r% of the time windows in each validation and test dataset.
2. Randomly swap data points among the selected time windows.
3. Define the r% of the time windows selected in step 1 as anomalous.
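The three steps above can be sketched as follows (a hypothetical illustration; pairing the selected windows and swapping a single data point per pair are simplifying assumptions of this sketch):

```python
import numpy as np

def inject_collective_anomalies(windows, r=0.2, seed=0):
    """Steps 1-3 above: select r% of the windows, swap data points
    among them, and label the selected windows as anomalous."""
    rng = np.random.default_rng(seed)
    W = windows.copy()
    n, L, _ = W.shape
    k = max(2, int(r * n))                        # need at least 2 to swap
    chosen = rng.choice(n, size=k, replace=False) # step 1
    for a, b in zip(chosen[::2], chosen[1::2]):   # step 2: pairwise swaps
        i, j = rng.integers(L), rng.integers(L)
        W[a, i], W[b, j] = W[b, j].copy(), W[a, i].copy()
    y = np.zeros(n, dtype=int)
    y[chosen] = 1                                 # step 3
    return W, y

windows = np.arange(80, dtype=float).reshape(10, 4, 2)  # 10 toy windows
W_mixed, y = inject_collective_anomalies(windows)
```

Every swapped value is itself a normal measurement; only its position in time changes, which is exactly what makes these anomalies collective rather than point anomalies.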


4.5.3 Comparative methods

Efficient GAN-based AD [96], EncDec-AD [47], LSTM-AD [48], and MAD-GAN [42] were used as comparative methods for the proposed MARU-GAN.
Efficient GAN-based AD detects point and contextual anomalies, whereas other
methods aim to detect collective anomalies.

Efficient GAN-based AD
As described in Section 3.3.1, Efficient GAN-based AD detects point and contextual
anomalies because it handles individual data points. Efficient GAN-based AD
determines anomalies at each data point. r% of the data points with the highest
anomaly score are classified as anomalous.

EncDec-AD
EncDec-AD is a method for anomaly detection using an encoder-decoder model.
Anomaly detection is performed based on the reconstruction capability of the encoder-decoder model. The encoder compresses the time window W to a fixed low-dimensional latent variable. The decoder reconstructs the time window W′ from the compressed latent variable.
In the training phase, EncDec-AD is learned to minimize the loss function Σ_{t=1}^{L} ∥x_t − x′_t∥² for the training time window W = {x_1, x_2, ..., x_L} and the reconstructed time window W′ = {x′_1, x′_2, ..., x′_L}.

In the test phase, the learned model computes the error vector et = |xt − xt ′ | of
data point xt in the test dataset and reconstructed data point x′t by the learned
model. The error vectors of the test dataset are used to estimate the mean µ and the covariance Σ of a Gaussian distribution by maximum likelihood estimation. In anomaly detection, the data point is determined to be anomalous if
the error vector is located at the edge of the estimated Gaussian distribution. It
means the rarity of the data point is high. For example, when the Mahalanobis
distance is used to determine anomaly, the anomaly score of data point xt is
A (xt ) = (et − µ)T Σ−1 (et − µ). r% of the data points with the highest anomaly score
are determined to be anomalous.
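The Gaussian fit and Mahalanobis scoring described above, together with the top-r% labeling, can be sketched as follows (a minimal illustration; the synthetic error vectors and the covariance regularizer are hypothetical):

```python
import numpy as np

def mahalanobis_scores(errors):
    """Fit N(mu, Sigma) to the error vectors by maximum likelihood and
    return A(x_t) = (e_t - mu)^T Sigma^{-1} (e_t - mu) for each row."""
    mu = errors.mean(axis=0)
    Sigma = np.cov(errors, rowvar=False) + 1e-6 * np.eye(errors.shape[1])
    diff = errors - mu
    return np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)

def top_r_labels(scores, r=0.2):
    """Label the r-fraction of points with the highest scores anomalous."""
    k = int(np.ceil(r * len(scores)))
    labels = np.zeros(len(scores), dtype=int)
    labels[np.argsort(scores)[-k:]] = 1
    return labels

rng = np.random.default_rng(0)
errors = rng.normal(scale=0.1, size=(20, 3))  # small hypothetical errors
errors[5] = [5.0, 5.0, 5.0]                   # one clearly rare error vector
labels = top_r_labels(mahalanobis_scores(errors), r=0.1)
```

The rare error vector sits far out in the tail of the fitted Gaussian, so its Mahalanobis distance dominates and it lands in the top-r% labeled anomalous.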

LSTM-AD
LSTM-AD uses LSTM to predict future data points and detects anomalies using
the difference from actual values.
In the training phase, LSTM is learned to predict l data points from d input data points using a normal training dataset. It is learned to minimize the loss function Σ_{t=1}^{l} ∥x_t − x′_t∥² for the predicted l data points {x′_1, x′_2, ..., x′_l} and the input data points {x_1, x_2, ..., x_l}.
In the test phase, the learned model computes the prediction error vector
et = |xt − xt ′ | of data point xt in the test dataset and predicted data point x′t by the
learned model. The error vectors of the test dataset are used to estimate the mean µ and the covariance Σ of a Gaussian distribution by maximum likelihood estimation. In anomaly detection, the data point is determined to be
anomalous if the error vector is located at the edge of the estimated Gaussian
distribution. It means the rarity of the data point is high. For example, when
the Mahalanobis distance is used to determine anomaly, the anomaly score of
data point xt is A (xt ) = (et − µ)T Σ−1 (et − µ). r% of the data points with the highest
anomaly score are determined to be anomalous.

MAD-GAN
MAD-GAN is GAN-based anomaly detection for multivariate time series data.
As described in Section 3.3.1, anomalies are detected using the standard GANs,
which consist of a generator and a discriminator.

4.5.4 Experimental settings

In this section, we discuss the hyperparameters that were used in the experiments.
Table 4.2 and Table 4.3 show the details of the encoder, the generator, and the
discriminator of MARU-GAN and the experimental settings, respectively. The
hyperparameters in each table were set based on Efficient GAN-based AD and
MAD-GAN. After training each of the 1000 epochs, the evaluation metrics were
computed using the validation dataset. The model with the highest F1-score
was used to evaluate the model on the test dataset. F1-score, accuracy, and false
positive rate were computed by comparing the r = 20% of time windows determined to be anomalous by MARU-GAN with the time windows defined as anomalous in Section 4.5.2.

Table 4.2: Hyperparameters of MARU-GAN.

                 Number of units   Number of layers   Dropout rate
Encoder
  E(W): RNN      100               3                  0.0
Generator
  G(z): RNN      100               3                  0.0
Discriminator
  D(W): RNN      100               1                  0.2
  D(W, z): FNN   1                 1                  0.0

Table 4.3: Experimental settings.

Gradient method                 Adam
Hyperparameters                 α = 1e−5, β1 = 0.5, β2 = 0.999, ε = 1e−8
Window size L                   12
Batch size                      50
Number of epochs                1000
Dimension of latent variables   100

4.5.5 Experimental results

MARU-GAN achieved the best value in all evaluation metrics compared with the existing methods. Table 4.4 and Table 4.5 show the experimental results.
It was impossible to detect collective anomalies by using Efficient GAN-based
AD. In Efficient GAN-based AD, the F1-score was close to 0.20, the proportion of anomalies in the experimental datasets, indicating that anomalies were effectively selected at random.
necessary to utilize neural networks that handle multiple data points.
EncDec-AD, LSTM-AD, and MAD-GAN failed to detect collective anomalies
despite being methods that handle multiple data points. EncDec-AD could not
detect collective anomalies because anomaly scores for anomalous data points
are almost the same as that for normal data points. EncDec-AD was learned to


Table 4.4: Experimental results (SWaT dataset). MARU-GAN received the highest F1-score.

Methods                  F1-score   Accuracy   False positive rate
Efficient GAN-based AD   0.18       0.67       0.21
EncDec-AD                0.19       0.68       0.20
LSTM-AD                  0.55       0.82       0.11
MAD-GAN                  0.46       0.67       0.33
MARU-GAN                 0.62       0.85       0.09

Table 4.5: Experimental results (WADI dataset). MARU-GAN received the highest F1-score.

Methods                  F1-score   Accuracy   False positive rate
Efficient GAN-based AD   0.20       0.68       0.20
EncDec-AD                0.19       0.68       0.20
LSTM-AD                  0.45       0.78       0.14
MAD-GAN                  0.32       0.59       0.38
MARU-GAN                 0.61       0.85       0.10

minimize the loss function Σ_{t=1}^{L} ∥x_t − x′_t∥² for the time window W = {x_1, x_2, ..., x_L} and the reconstructed time window W′ = {x′_1, x′_2, ..., x′_L} using the normal training
dataset. After optimizing the loss function, EncDec-AD could almost completely
reconstruct anomalies not present in the training dataset, even though the model
was learned using only normal data points. EncDec-AD reconstructs the time
series on the decoder side using the vector of compressed features of the input
time series on the encoder side. Regardless of whether normal or anomalous, the
compressed vector contains information on the time window to be reconstructed,
making it possible to reconstruct the time window almost completely, even in the
case of anomalies. Therefore, anomalies could not be detected because the error
vector e_t = |x_t − x′_t| used to compute the anomaly score was not different between
normal data points and anomalies. MARU-GAN could more clearly discriminate
between normal and anomalous data using GANs. As in EncDec-AD, anomaly detection in MARU-GAN uses the reconstruction loss ∥Ŵ − G(E(Ŵ))∥₁. However, unlike EncDec-AD, MARU-GAN is learned to generate time
windows that the discriminator determines to be normal. After the training


phase, when a normal time window is an input to the encoder, a similar time
window is an output from the decoder. It is because if the output time window is
similar to the real input data, there is a high probability that the discriminator
judges it to be real. On the other hand, when an anomalous time window is an
input to the encoder, it is converted into a realistic time window rather than the
input anomalous time window. It is because if the time window is similar to the
anomalous time window, there is a high probability that the discriminator will
judge it to be not real. As a result, for anomalous time windows, the difference
between G(E(W)) and W becomes larger, and the reconstruction loss worked
successfully, which enabled MARU-GAN to detect collective anomalies.
LSTM-AD failed to determine the data points in the first half of the anomalous
time window as anomalous. For data points in the first half of the time window,
it is challenging to predict subsequent data points because fewer data points
are input to LSTM, so less information is available for prediction. In addition,
when randomness, such as anomalies, existed between d normal data points, the
following l data points could not be predicted and were incorrectly determined
to be anomalous.
All evaluation metrics of MAD-GAN were worse than those of MARU-GAN. MAD-GAN does not necessarily ensure that the latent variables used to reconstruct the
unseen input time windows represent them. MAD-GAN cannot directly obtain
the latent variables corresponding to time windows because it does not have an
encoder. On the other hand, MARU-GAN employs the encoder, which makes it
possible to directly obtain latent variables with compressed features for the input
time window.

4.5.6 Analysis on latent space

Finally, we analyze the evaluation results of MARU-GAN. Figure 4.8 shows
the t-SNE [85] visualization for the latent variables obtained using the learned
encoder on the WADI test dataset. The blue dots represent normal time windows,
and the green dots represent anomalous ones. From Figure 4.8, it is clear that
normal and anomalous time windows have different features on the latent space.
We see that the anomalous time windows are located near the center, and the
normal ones are enclosed around them. Furthermore, there are some parts where

4.5 Experiments and Results

Figure 4.8: t-SNE visualization of the WADI test dataset in the latent space. The
blue and green dots represent normal and anomalous time windows, respectively.
normal and anomalous time windows are mixed, which makes anomaly detection
complicated.

67

68

Anomaly Detection Model Combining GANs and RNN

Figure 4.9: t-SNE visualization of the latent space with false positives and false
negatives. Red and yellow dots represent false positives and false negatives,
respectively. It can be seen that the parts with a mixture of normal and anomalous
time windows have more errors than the other parts.
Figure 4.9 shows the t-SNE [85] visualization obtained by adding false positives
and false negatives to Figure 4.8. The red dots indicate false positives, and the
yellow dots indicate false negatives. False positives mean that normal time
windows are erroneously determined to be anomalous, while false negatives
mean that anomalous time windows are erroneously determined to be normal.
In both cases, it can be seen that the errors are higher in the complex parts where
normal and anomalous time windows are mixed than in other areas.


Figure 4.10: t-SNE visualization of the latent space with anomaly scores. The
anomaly scores of false positives and false negatives in Figure 4.9 are around the
threshold, the borderline between normal and anomalous time windows.
Figure 4.10 shows the t-SNE visualization in the latent space with anomaly
scores. Visually, the anomaly scores of the time windows near the center are
higher than those around the center. Anomaly scores for false positives and
false negatives are near the threshold, which is the boundary between normal and
anomalous time windows. When MARU-GAN is used in actual operation, it is
necessary to present time windows near the threshold to operators and ask them
to determine whether they are normal or anomalous.


4.6 Conclusions

In this chapter, we proposed an anomaly detection method for multivariate time
series data by combining GANs and RNN and conducted evaluation experiments
using datasets containing collective anomalies. It was found that the anomaly
detection model is required to reflect the relationship between consecutive data
points in order to detect collective anomalies. In addition, it was found that the
reconstruction loss in computing the anomaly score using GANs works even
when anomalies are given as input. Since the standard reconstruction loss aims
to reconstruct the input data completely, anomalies can be reconstructed as well
as normal data. On the other hand, since the aim of the reconstruction loss
using GANs is to generate realistic normal data that fools a discriminator, the
reconstruction fails for anomalies, and the reconstruction loss is effective. Finally,
we confirmed the effectiveness of introducing an encoder into the anomaly
detection model. Compared to GANs consisting only of a generator and a
discriminator, it is possible to directly obtain latent variables with compressed
features of the input data using the learned encoder.

Chapter 5

Anomaly Detection Model Combining GANs and Transformer

In this chapter, we propose a method that combines GANs and Transformer to
detect anomalies with high accuracy for multidimensional time series data. First,
we describe the difference between Transformer and RNN and the architecture
of the proposed method based on the features of Transformer. Furthermore, to
detect anomalies that occur over a long period, we introduce sparse attention
into the attention mechanism of Transformer. Next, we experimentally evaluate
the proposed and existing methods’ performance using real-world data. Finally,
we perform experiments with different window sizes to confirm the effectiveness
of sparse attention in detecting anomalies that occur over a long period.

5.1 Introduction

RNN loses information on the first half of the data points in the time window.
Transformer, one of the deep learning models, dynamically determines the
weights for each data point so that it can focus on the critical data points in
the time window. In this chapter, we propose a method that introduces the
Transformer’s encoder and decoder into the GANs’ encoder and generator for
anomaly detection in multidimensional time series data.
The longer the time series, the smaller the weights assigned to the data points
of interest become. Therefore, we introduce sparse attention as the attention
mechanism of the Transformer, which can eliminate the importance of data points
that should be ignored in anomaly detection. To evaluate sparse attention, we
change the window size of the time window and check which data points the
model attends to for anomaly detection.

Figure 5.1: Proposed architecture in anomaly detection. It consists of an encoder,
a generator with an attention mechanism, and a discriminator. These three neural
networks are learned by adversarial training.

5.2 Multivariate Time Series Anomaly Detection Model

5.2.1 Overall architecture

Our proposed Transformer with a discriminator for anomaly detection in
multivariate time series (TDAD) and sparse TDAD (STDAD) consist of three neural
networks: an encoder, a generator, and a discriminator. Transformer's encoder
and decoder are introduced as the encoder and the generator in our method,
respectively, and adversarial training is performed on the three neural networks.
Figure 5.1 shows the flow of anomaly detection using the learned TDAD and
STDAD. The training process is summarized in Algorithm 3.

5.2.2 Differences between Transformer and encoder-decoder model with RNN

The differences between Transformer and encoder-decoder model with RNN are
summarized in two points [41].

Algorithm 3 TDAD and STDAD training algorithm
Input: Encoder E, Generator G, Discriminator D,
    T normal time windows for training W = {W1, W2, . . . , WT},
    number of epochs N
Output: Learned E, G, D
E, G, D ← initialize weights
n ← 0
while n < N do
    for t = 1 to T do
        WtE ← E(Wt)
        z ← random latent variable
        WtG ← G(z)
        L = EWt∼pdata(Wt)[log(D(Wt, WtE))] + Ez∼pz(z)[log(1 − D(WtG, z))]
        update weights of E, G, D using L
    end for
    n ← n + 1
end while
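As a minimal illustration, one evaluation of the value function L in Algorithm 3 can be sketched in numpy as follows. The three networks are toy stand-ins (the linear/tanh maps, their shapes, and the weight matrices `We`, `Wg`, `Wd` are hypothetical, not the thesis architecture), and no gradient update is shown:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the three networks in Algorithm 3: E maps a window
# (L x d) to a latent vector, G maps a latent vector back to a window,
# and D scores a (window, latent) tuple with a probability.
L, d, k = 10, 4, 8                       # window size, channels, latent dim
We = rng.normal(size=(L * d, k))         # "encoder" weights (hypothetical)
Wg = rng.normal(size=(k, L * d))         # "generator" weights (hypothetical)
Wd = rng.normal(size=L * d + k)          # "discriminator" weights (hypothetical)

def E(W):
    return np.tanh(W.reshape(-1) @ We)

def G(z):
    return np.tanh(z @ Wg).reshape(L, d)

def D(W, z):
    # probability that the (window, latent) tuple is real
    logit = np.concatenate([W.reshape(-1), z]) @ Wd
    return 1.0 / (1.0 + np.exp(-logit))

def value_function(W):
    # One evaluation of L in Algorithm 3:
    # L = log D(W, E(W)) + log(1 - D(G(z), z))
    z = rng.normal(size=k)               # random latent variable
    return np.log(D(W, E(W))) + np.log(1.0 - D(G(z), z))

W = rng.normal(size=(L, d))              # one normal training window
loss = value_function(W)                 # D maximizes this value;
                                         # E and G minimize their terms
```

In the adversarial setup, the discriminator updates its weights to maximize this quantity while the encoder and generator minimize their respective terms.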

• With or without an attention mechanism: Since the attention scores for each
data point are determined dynamically, Transformer can focus on important
data points in the time window regardless of their position. The process is
repeated as many times as the number of layers in each encoder and
decoder, and multiple feature vectors are handled using the multi-head
attention mechanism. RNN tends to lose information on the first half of
the data points in a time window because more and more information
on the data points is added when computing the output. In addition, the
encoder-decoder model with RNN handles a single hidden state vector only
once per input time window.

• Positional information for an input time window: Transformer has an
independent Positional Encoding module, which adds absolute positional
information to the input time window. RNN handles the relative positional
relationship of data points in a time window since the data points are
input sequentially.
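For illustration, the absolute Positional Encoding module can be sketched with the sinusoidal encoding of the original Transformer [86]; whether the thesis uses exactly this variant is an assumption:

```python
import numpy as np

def positional_encoding(length, dim):
    """Sinusoidal absolute positional encoding: one row per data point
    in the time window, added to the embedded input (dim must be even)."""
    pos = np.arange(length)[:, None]            # positions 0 .. length-1
    i = np.arange(0, dim, 2)[None, :]           # even channel indices
    angles = pos / np.power(10000.0, i / dim)
    pe = np.zeros((length, dim))
    pe[:, 0::2] = np.sin(angles)                # even channels: sin
    pe[:, 1::2] = np.cos(angles)                # odd channels: cos
    return pe

pe = positional_encoding(10, 16)                # a window of 10 data points
```

Because each position gets a fixed, distinct pattern, the model receives absolute positional information even though attention itself is order-agnostic.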

5.2.3 Sparse attention

We introduce two types of attention mechanisms: scaled dot-product attention [86]
and sparse attention [63][7]. With sparse attention, we can improve interpretability
and achieve better results on time series data with long-term dependencies by
increasing the influence of strongly relevant data points. Specifically, for
computing attention scores, we introduce the softmax function described in
Section 3.3.1 to TDAD and the 1.5-entmax to STDAD. When computing the attention score,
the softmax function cannot assign zero as a weight to less relevant query/key
combinations. On the other hand, the 1.5-entmax can assign a weight of zero,
which assigns a higher weight to relevant query/key combinations and is expected
to improve interpretability for time series data with long-term dependencies.
Let Q be a matrix of queries, K be a matrix of keys, and V be a matrix of values.
The attention score is computed using the α-entmax expressed by Eq. (5.1):

    attention(Q, K, V) := α-entmax(QK⊤ / √dk) V,    (5.1)

where dk is the dimension of queries and keys. The α-entmax is defined as

    α-entmax(z) := argmax_{p ∈ Δd} p⊤z + H⊤α(p),    (5.2)

where Δd := {p ∈ Rd : p ≥ 0, Σi pi = 1} is the probability simplex and, for α ≥ 1,
H⊤α(p) is the Tsallis family of entropies expressed by Eq. (5.3):

    H⊤α(p) := 1/(α(α − 1)) Σj (pj − pj^α)    (α ≠ 1),
    H⊤α(p) := −Σj pj log pj                  (α = 1).    (5.3)

Computing the weighted sum of the attention scores and values yields the query's
feature (latent representation). We introduce the 1-entmax and the 1.5-entmax
in the computation of the attention score, setting α to 1 and 1.5, respectively.
In Eq. (5.3), H⊤α(p) becomes the well-known Shannon entropy [83] as α → 1, and
the unique solution of Eq. (5.2) is then the softmax function proposed in the
scaled dot-product attention [86]. The 1-entmax is formalized as:

    1-entmax(z) = softmax(z) = exp(z) / Σj exp(zj).    (5.4)


If α is 1.5, the unique solution of Eq. (5.2) is formalized as:

    1.5-entmax(z) = [z/2 − τ1]+^2,    (5.5)

where [a]+ := max{a, 0} denotes the positive part, applied elementwise to vectors,
1 denotes the vector of all ones, and τ is a unique threshold for any z, namely
the Lagrange multiplier corresponding to the constraint Σi pi = 1.

While a single attention mechanism is used in Eq. (5.1), multi-head attention
divides Q, K, and V into h sub-matrices Qi, Ki, Vi (1 ≤ i ≤ h) and uses h attention
mechanisms (heads) in parallel, each of which can obtain useful information from
a different feature space [86]. Finally, the heads are converted into a single vector
using the projection weight WO as follows:

    MultiHeadAtt(Q, K, V) = Concat(head1, head2, . . . , headh)WO,
    where headi = attention(Qi, Ki, Vi).    (5.6)

Transformer stacks multiple layers (i.e., TransformerBlock modules), and
each layer except the first operates on the previous layer's output as
input. By repeating this composition of latent variables, the model obtains
richer representations as the layers get deeper.
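To make the sparsity of the 1.5-entmax in Eq. (5.5) concrete, the following sketch solves for the threshold τ by bisection. The published exact algorithm is sort-based; bisection is used here only as an illustrative approximation:

```python
import numpy as np

def entmax15(z, iters=60):
    """1.5-entmax of Eq. (5.5): p_i = [z_i/2 - tau]_+^2 with sum(p) = 1.
    tau is found by bisection: f(tau) = sum([z/2 - tau]_+^2) - 1 is
    monotonically decreasing and has a root in [max(z)/2 - 1, max(z)/2]."""
    z2 = np.asarray(z, dtype=float) / 2.0
    lo, hi = z2.max() - 1.0, z2.max()        # f(lo) >= 0 >= f(hi)
    for _ in range(iters):
        tau = (lo + hi) / 2.0
        if np.sum(np.maximum(z2 - tau, 0.0) ** 2) > 1.0:
            lo = tau
        else:
            hi = tau
    return np.maximum(z2 - (lo + hi) / 2.0, 0.0) ** 2

p = entmax15([2.0, 1.0, -1.0])
# The weakly relevant third entry receives exactly zero weight,
# whereas softmax would always assign it a small positive weight.
```

This zero-assignment property is exactly what lets STDAD ignore irrelevant data points in long time windows.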

5.2.4 Training of TDAD and STDAD

Encoder training
The encoder generates a latent variable that compresses the features of the input
time window. Fig. 5.2 shows the encoder training flow.

Figure 5.2: Encoder training flow. When a tuple (WE, E(WE)) of the input time
window WE and its latent variable E(WE) is given to the discriminator, the
encoder is learned to minimize the loss function so that WE is determined
to be false by the discriminator.

The input time window WE is dimensionally compressed in the Input Embedding
module, and relative or absolute positional information of the data points in the
time window is injected in the Positional Encoding module. Let WE1 be the input
to the TransformerBlock module obtained in this way; the module performs
the operations of Eq. (5.7). The TransformerBlock module consists of N identical
layers, each with two sub-layers: the first is a multi-head attention mechanism,
and the second is a feed-forward network.

    WE2 = LayerNorm(WE1 + MultiHeadAttE(WE1, WE1, WE1)),
    WE3 = LayerNorm(WE2 + FeedForwardE(WE2)),    (5.7)

where the LayerNorm function and the FeedForward function represent layer
normalization and a feed-forward network, respectively. In layer normalization
of each sub-layer, a residual connection is employed. In other words, the matrix
addition of the input and output of each sub-layer is the input to the LayerNorm
function. The input matrices of queries, keys, and values of the MultiHeadAtt
function are all WE1 . In the encoder, when a tuple (WE , WE3 ) of the input time
window WE and its latent variable E(WE ) = WE3 are given to the discriminator, the
encoder model is learned to minimize the loss function of Eq. (5.8) so that WE is
determined to be false by the discriminator.
    LEncoder = EWE∼pdata(WE)[log(D(WE, WE3))].    (5.8)
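The two sub-layer computations of Eq. (5.7) can be sketched in a few lines of numpy. This is a single-head simplification with hypothetical random weights; the actual encoder uses multi-head attention and N stacked layers:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each data point's feature vector (last axis)
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def attention(Q, K, V):
    # scaled dot-product attention with softmax (the alpha = 1 case)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def transformer_block(W1, rng):
    """One encoder layer of Eq. (5.7): self-attention then a feed-forward
    net, each wrapped in a residual connection and LayerNorm."""
    dim = W1.shape[-1]
    Wff1 = rng.normal(size=(dim, 2 * dim))     # hypothetical FFN weights
    Wff2 = rng.normal(size=(2 * dim, dim))
    W2 = layer_norm(W1 + attention(W1, W1, W1))              # sub-layer 1
    W3 = layer_norm(W2 + np.maximum(W2 @ Wff1, 0.0) @ Wff2)  # sub-layer 2
    return W3

rng = np.random.default_rng(0)
out = transformer_block(rng.normal(size=(10, 8)), rng)  # 10-point window
```

Note that the residual additions (`W1 + ...`, `W2 + ...`) are performed before each LayerNorm, matching the post-norm placement written in Eq. (5.7).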

Generator training
The generator is learned to generate a time window such that the discriminator
erroneously determines it to be real. Figure 5.3 shows the generator training flow.
When a time window representing the start token <START> is given, relative or
absolute positional information of the data points in the time window is injected
in the Positional Encoding module. Let WG1 be the input to the TransformerBlock
module, which performs the operations of Eq. (5.9). As in the encoder, the
TransformerBlock module of the generator consists of N identical layers, each
with two sub-layers.

    WG2 = LayerNorm(WG1 + MultiHeadAttG(WG1, z, z)),
    WG3 = LayerNorm(WG2 + FeedForwardG(WG2)).    (5.9)

Figure 5.3: Generator training flow. When a tuple (G(z), z) consisting of the time
window G(z) generated by the generator and latent variable z is given to the
discriminator, the generator is learned to minimize the loss function so that G(z)
is determined to be real.

As with the encoder, the residual connection is employed in the layer
normalization of each sub-layer. The input matrix of the queries of the
MultiHeadAtt function is WG1, and the input matrices of keys and values are the
latent variable z. When a tuple (WG3, z) consisting of the time window G(z) = WG3
generated by the generator and the latent variable z is given to the discriminator,
the generator is learned to minimize the loss function of Eq. (5.10) so that G(z) is
determined to be real.

    LGenerator = Ez∼p(z)[log(1 − D(WG3, z))].    (5.10)

Discriminator training
The discriminator receives a tuple of an unseen input time window and its latent
variable from the encoder or the generator. It is learned to determine whether the
time window is real or fake.

For the time window, a latent variable that compresses the features of the
time window using GRU is concatenated with the latent variable received
from the encoder or the generator and input to the linear network. The linear
network outputs the probability that the input time window is real, which is the
output of the discriminator. The discriminator is learned to correctly determine
whether (WE, WE3) from the encoder or (WG3, z) from the generator is real or fake,
by maximizing the loss function of Eq. (5.11).

    LDiscriminator = EWE∼pdata(WE)[log(D(WE, WE3))] + Ez∼p(z)[log(1 − D(WG3, z))].    (5.11)

Anomaly detection using learned model
The anomaly score of an unseen time window Ŵ is computed using the anomaly
detection model learned using only normal time series data. The anomaly
detection process is summarized in Algorithm 4.

The time window is considered anomalous if the anomaly score exceeds a
predefined threshold. To automatically set the threshold, we employ the Peak
Over Threshold method [84].

A(Ŵ) is computed as the sum of two losses: one is the reconstruction loss,
and the other is the discrimination loss. A(Ŵ) is defined as

    A(Ŵ) = α‖Ŵ − G(E(Ŵ))‖1 + βσ(D(Ŵ, E(Ŵ)), 1).    (5.12)

α and β (= 1 − α) are the coefficients that weight the two losses. The larger the
value of α, the greater the effect of the reconstruction loss, while the larger the
value of β, the greater the effect of the discrimination loss. The effects of different
coefficients are discussed in Section 5.3.3. The larger A(Ŵ) becomes, the more
likely the time window is to be anomalous.
The reconstruction loss is the L1 norm between a time window Ŵ and the
reconstructed time window G(E(Ŵ)). Since the anomaly detection model is
learned using only normal time series data, it can generate a time window with
its features. On the other hand, since the features of unseen time windows are
not reflected in this model, it is difficult to reconstruct them. Based on this, if
a normal time window is given, the reconstruction loss will be small, and if an
anomalous time window is given, the reconstruction will not work, and the loss
will be large.

Algorithm 4 TDAD and STDAD anomaly detection algorithm
Input: Learned Encoder E, Generator G, Discriminator D,
    T′ unseen time windows for test Ŵ = {Ŵ1, Ŵ2, . . . , ŴT′},
    Parameters α, β, Threshold λ
Output: Anomaly labels yt (t = 1, 2, . . . , T′, yt ∈ {0, 1})
for t = 1 to T′ do
    ŴtE ← E(Ŵt)
    ŴtG ← G(ŴtE)
    A(Ŵt) = α‖Ŵt − ŴtG‖1 + βσ(D(Ŵt, ŴtE), 1)
    if A(Ŵt) ≥ λ then
        yt ← 1 (Ŵt is anomalous)
    else
        yt ← 0 (Ŵt is normal)
    end if
end for
The discrimination loss is computed by feeding Ŵ into the learned discriminator
and letting it determine whether Ŵ is a real time window (i.e., a normal
time window) or fake. The more confidently Ŵ is determined to be fake, the
greater the value of the discrimination loss. σ is the cross-entropy loss between
the probability that the discriminator determines Ŵ to be real and one (the real
class).

5.3 Experiments and Results

This section describes the publicly available datasets, the experimental setup, and
the results of our evaluation experiments. All the experiments are implemented
in Python 3.8.5 and PyTorch 1.8.1 (https://pytorch.org/). We learned our methods
on one machine with 8 NVIDIA A100-PCIE-40GB GPUs.


5.3.1 Public datasets

Five publicly available datasets were used in the experiment. Each dataset
contains different proportions of point, contextual, and collective anomalies.
Table 5.1 summarizes the characteristics of each dataset, and the following is a
brief overview of the datasets.
Table 5.1: Characteristics of datasets. We utilized four multivariate time series
datasets and a univariate time series dataset to evaluate the proposed method.

Datasets   Train     Test     Dimension   Anomalous
SWaT       496800    449919   51          11.98 %
WADI       1048571   172801   123         5.99 %
SMD        708405    708420   38          4.16 %
SMAP       135183    427617   25          13.13 %
UCR        1600      5900     1           1.88 %

• Secure Water Treatment (SWaT) Dataset [49]: This dataset is collected from
a replica of the latest water treatment plant. This dataset consists of 11
days of continuous operation: seven under normal operation and four days
with attack scenarios. All the values are obtained from all 51 sensors and
actuators.
• Water Distribution (WADI) dataset [49]: This dataset is collected from a
water distribution testbed. This dataset consists of 16 days of continuous
operation: 14 under normal operation and two days with attack scenarios.
All the values are obtained from all 123 sensors and actuators.
• Server Machine Dataset (SMD) [78]: This dataset is a five-week-long dataset
collected from a large Internet company. It is made up of data from 28
different machines, each one monitored by 38 metrics.
• Soil Moisture Active Passive (SMAP) dataset [30]: This dataset consists of
spacecraft telemetry data from NASA. It contains anomalous data labeled
by experts.
• UCR dataset [35]: This dataset comprises multiple univariate time series. It
was used in the KDD Cup 2021.
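Since all five datasets are consumed as fixed-length time windows, the preprocessing can be sketched as below; the stride is not stated in this section, so stride = 1 is an assumption:

```python
import numpy as np

def sliding_windows(series, window=10, stride=1):
    """Cut a (T x d) multivariate series into overlapping windows of
    length `window`, the unit that the detection models operate on."""
    T = series.shape[0]
    starts = range(0, T - window + 1, stride)
    return np.stack([series[t:t + window] for t in starts])

# e.g. a toy 100-step, 3-channel series with the default window size L = 10
windows = sliding_windows(np.zeros((100, 3)), window=10)
```

With stride 1, a T-step series yields T − L + 1 windows, so window-level anomaly labels are densely aligned with the original time steps.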


5.3.2 Evaluation metrics

To evaluate the performance of the proposed methods compared to other methods,
precision (P), recall (R), F1-score (F1 ), and area under the ROC curve (AUC) were
used as evaluation indices.
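For reference, precision, recall, and F1 can be computed directly from the binary anomaly labels (1 = anomalous); this minimal sketch omits AUC, which additionally requires the raw anomaly scores:

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Precision P, recall R, and F1 from binary labels, matching the
    indices used in the evaluation (no averaging over runs here)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

p, r, f1 = precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

F1 is the harmonic mean of precision and recall, which is why it is the headline metric on these class-imbalanced datasets.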

5.3.3 Effects of parameters

In this section, we conducted experiments with different parameters to investigate
the items affecting TDAD. All experiments were performed using the SWaT
dataset.
First, we study the effect of the number of layers N in the TransformerBlock
module. Bigger models with more layers are expected to perform better.
Figure 5.4 shows the results of the experiments with five different numbers of
layers N ∈ {1, 2, 4, 8, 10}. The results show that the larger N is, the better the
F1-score; when N = 10, the F1-score is 0.0109 higher than when N = 1. Since
increasing N does not significantly improve performance, this parameter should
be determined considering the computation time.
Second, we consider the effect of the number of heads on multi-head attention.
Figure 5.5 shows the results of experiments using five different parameters
h ∈ {1, 2, 4, 10, 20}. The results show that the F1-score increases with the number
of heads, but the performance worsens when the number of heads exceeds a
certain large number. The F1-score increases from h = 1 to 10 but becomes worse
when h = 20. This is because the smaller the dimension handled by each head,
the less information is available to compute the attention score. The number of
heads should therefore be set to a moderate value.
The third item we investigate is the dimension dff of the latent space in the
Feed Forward module. Increasing the dimension is expected to improve
performance, but if the model is too big, information on the input may be lost.
Figure 5.6 shows the results of the experiment with five different dimensions
dff ∈ {128, 256, 512, 1024, 2048}. The results show that if dff is too large compared to
the dimension of the input, information on the input is lost, leading to poor
performance. The F1-score is the highest when dff = 512, but as dff increases
further, the F1-score decreases. When choosing dff, a value in the middle is
recommended.

As a fourth item, we examine the effect of dropout in the encoder and the
generator. Dropout avoids overfitting by disabling a certain percentage of neurons
in the model, reducing dependence on specific neurons. Figure 5.7 shows
the results of experiments with four different dropout rates Ddrop ∈ {0.1, 0.3, 0.4, 0.6}.
The results show that a somewhat larger dropout is more effective, but too large a
value reduces performance because information is lost. Dropout is an effective
tool for training, but its value should not be set too large.
As the fifth item, we consider the dimension dk of the latent space in the encoder
and the generator. Figure 5.8 shows the experimental results using five different
dimensions dk ∈ {40, 100, 400, 800, 1000}. The results show that the larger dk is, the
higher the F1-score. For Transformer, it has also been shown that the larger the
dimension of the model, the better the performance [86]. However, since the
computation time increases with the dimension, the dimension should be
determined considering the computation time.

Figure 5.4: Number of layers in TransformerBlock module N. The larger N, the
better F1-score, and when N = 10, F1-score is 0.0109 better than when N = 1.


Figure 5.5: Number of heads on multi-head attention h. F1-score increases from
h = 1 to 10, but it becomes worse when h = 20.

Figure 5.6: Dimension of the latent space in Feed Forward module dff. The F1-score
is the highest when dff = 512, but as dff increases, the F1-score decreases.


Figure 5.7: Dropout in the encoder and the generator Ddrop . F1-score is the
highest when Ddrop = 0.4, but as Ddrop increases, F1-score decreases.

Figure 5.8: Dimension of the latent space in the encoder and the generator dk .
The larger dk , the higher F1-score.


Table 5.2: F1-score, false positives, and false negatives with various computations
of the anomaly score. A stronger effect of the reconstruction loss results in fewer
FPs, while a stronger effect of the discrimination loss results in fewer FNs.

α     β     FP     FN      F1
0.0   1.0   1540   15346   0.8231
0.3   0.7   1350   15642   0.8243
0.5   0.5   1027   16090   0.8179
0.7   0.3   500    17214   0.8081
0.9   0.1   190    17229   0.8110
1.0   0.0   55     18018   0.8009

Finally, we examine the effect of the coefficients used in the computation
of the anomaly score in Eq. (5.12). The larger α is, the stronger the effect
of the reconstruction loss, while the larger β is, the stronger the effect of the
discrimination loss. The results for the F1-score (F1), false positives (FP), and
false negatives (FN) when α and β are varied are shown in Table 5.2.
FPs are normal time windows that are falsely determined to be anomalous.
FNs are anomalous time windows that are falsely determined to be normal. In
actual operation, FPs and FNs are sensible indicators for performance evaluation.
Operators can only respond to a few incidents per day. The larger the number of
FPs, the more difficult it is to respond to critical incidents; the larger the number
of FNs, the more likely they are to miss critical incidents. Also, in unsupervised
learning, it is desirable to have few FPs because the training data is normal. If a
normal time window is determined to be anomalous, it means that the model has
not been learned well.
The result shows that the larger the value of α, i.e., the more the reconstruction
loss is emphasized, the fewer FPs are, while the larger the value of β, i.e., the
more the discrimination loss is emphasized, the fewer FNs are.
There is a trade-off between FPs and FNs. Therefore, it is desirable to select
parameters to meet the requirements in actual operation, such as focusing on the
reconstruction loss when an anomaly is unlikely to lead to a critical incident and
focusing on the discrimination loss when the number of operators responding to
an incident is large and anomaly-sensitive.


5.3.4 Experimental setup

The number of epochs was set to 70 for the SWaT and WADI datasets and 250 for
the other datasets for comparison with previous studies [5]. We used the Adam [39]
optimizer with β1 = 0.5 and β2 = 0.999 and set the batch size to 128. We set the
learning rate for the encoder and the generator to 0.0001 and for the discriminator
to 0.000025. Hyperparameters other than the above in our method were
determined by grid search. The parameters shown below are common to all
datasets.
• Number of layers in TransformerBlock module N = 2
• Number of heads on multi-head attention h = 4
• Dimension in Feed Forward module dff = 256
• Dropout in encoder and generator Ddrop = 0.1
• Dimension in encoder and generator dk = 100
• Window size L = 10
For the weights of the reconstruction loss and the discrimination loss in the
computation of anomaly scores, we apply α = 0.3, β = 0.7 for the SWaT dataset
and α = 0.1, β = 0.9 for the other datasets.
We performed the evaluation five times in all experiments with the same
setup. All values in the table after Section 5.3.5 are the average of five times.

5.3.5 Experimental results

To evaluate the performance of TDAD and STDAD, we compared them with five
unsupervised methods for anomaly detection in time series data: LSTM-NDT [30],
MAD-GAN [42], OmniAnomaly [78], USAD [5], and TranAD [84].
Table 5.3 shows the average evaluation values across all datasets. P, R, AUC, and
F1 represent precision, recall, area under the ROC curve, and F1-score,
respectively. The highest F1-score and AUC are in bold fonts. In F1-score, TDAD
was the best, improving on the best existing method by 0.0304. STDAD, which
introduces sparse attention into the attention mechanism of TDAD, performed
second best.

Table 5.3: The average performance over all datasets. P, R, AUC, and F1 represent
precision, recall, area under the ROC curve, and F1-score, respectively. The
highest F1-score and AUC are in bold fonts. In F1-score, TDAD was the best.
STDAD performed second best.

Methods        P        R        AUC      F1
LSTM-NDT       0.8241   0.7423   0.8642   0.6946
MAD-GAN        0.7903   0.9006   0.9449   0.8200
OmniAnomaly    0.8994   0.8770   0.9357   0.8907
USAD           0.8190   0.9100   0.9463   0.8379
TranAD         0.9025   0.8984   0.9463   0.8917
TDAD           0.9458   0.9080   0.9528   0.9221
STDAD          0.9506   0.8810   0.9393   0.9102
Table 5.4 shows the performance results for TDAD, STDAD, and the other
methods on all datasets. TDAD performed better on all datasets except the SMD
dataset, on which it has the second-best F1-score.
As for the WADI dataset, the performance of TDAD was the best, while
the performance of the other methods is generally low. The WADI dataset is
high-dimensional, and generally, the higher the dimensionality of the data, the
more difficult it is to detect anomalies. LSTM-NDT predicts data points at each
time step using LSTM. If the training dataset contains noise, as in the WADI
dataset, it is difficult to construct an anomaly detection model with good
performance. Also, when an anomaly is similar to the trend of normal data points,
as in the SMD dataset, it is impossible to identify the anomaly by reconstructing
the anomalous data points. MAD-GAN, like TDAD, uses the GANs framework
and introduces LSTM in the generator and the discriminator to handle time
series data. As with LSTM-NDT, it is difficult to construct a good-performing
anomaly detection model if the training dataset contains noise. USAD, which uses
an autoencoder for reconstruction, performs well on datasets other than the WADI
dataset. At the decoder side, time windows are reconstructed with the feature
vectors of the time windows compressed at the encoder side. Therefore, because
this vector contains the information on the time window to be reconstructed,
it can be thought that anomalous time windows are also almost completely

reconstructed. However, USAD performs better overall because it uses two
different autoencoders to perform the reconstruction. Nevertheless, its F1-score on
the WADI dataset, a high-dimensional dataset, is significantly lower than that of
TDAD.

Table 5.4: Performance comparison. TDAD performed better on all datasets
except the SMD dataset, on which it has the second-best F1-score.

SWaT
Methods        P        R        AUC      F1
LSTM-NDT       0.9947   0.6764   0.8380   0.8053
MAD-GAN        0.9958   0.6759   0.8377   0.8052
OmniAnomaly    0.9153   0.7324   0.8615   0.8137
USAD           0.9720   0.7229   0.8600   0.8292
TranAD         0.9907   0.6624   0.8308   0.7939
TDAD           0.9906   0.7131   0.8561   0.8292
STDAD          0.9681   0.7186   0.8576   0.8243

WADI
Methods        P        R        AUC      F1
LSTM-NDT       0.2564   0.8296   0.8864   0.3917
MAD-GAN        0.4467   0.8296   0.9026   0.5807
OmniAnomaly    0.7814   0.6541   0.8249   0.7121
USAD           0.3300   0.8296   0.8949   0.4722
TranAD         0.7472   0.8296   0.9115   0.7862
TDAD           0.7938   0.8296   0.9122   0.8113
STDAD          0.8358   0.6892   0.8430   0.7541

SMD
Methods        P        R        AUC      F1
LSTM-NDT       0.9964   0.2077   0.6038   0.3438
MAD-GAN        1.0000   0.9974   0.9987   0.9987
OmniAnomaly    0.9903   0.9985   0.9987   0.9944
USAD           0.9787   0.9974   0.9976   0.9879
TranAD         0.9978   1.0000   0.9999   0.9989
TDAD           1.0000   0.9974   0.9987   0.9987
STDAD          1.0000   0.9974   0.9987   0.9987

SMAP
Methods        P        R        AUC      F1
LSTM-NDT       0.8728   1.0000   0.9930   0.9321
MAD-GAN        0.8320   1.0000   0.9903   0.9083
OmniAnomaly    0.9001   1.0000   0.9946   0.9474
USAD           0.8404   1.0000   0.9908   0.9133
TranAD         0.8283   1.0000   0.9900   0.9061
TDAD           0.9444   1.0000   0.9972   0.9714
STDAD          0.9444   1.0000   0.9972   0.9714

UCR
Methods        P        R        AUC      F1
LSTM-NDT       1.0000   1.0000   1.0000   1.0000
MAD-GAN        0.6768   1.0000   0.9954   0.8073
OmniAnomaly    0.9098   1.0000   0.9990   0.9528
USAD           0.9737   1.0000   0.9997   0.9867
TranAD         0.9487   1.0000   0.9995   0.9736
TDAD           1.0000   1.0000   1.0000   1.0000
STDAD          1.0000   1.0000   1.0000   1.0000


Figure 5.9: Anomaly detection results with various window sizes L for the SWaT
dataset using TDAD and STDAD. TDAD has higher F1-scores up to L = 30,
but when L = 40, STDAD reverses the trend. For longer window sizes such as
L = 100, 200, STDAD has better results.

5.3.6 Evaluation of sparse attention

In this section, we evaluated whether the introduction of sparse attention can better
capture long-term dependencies on the patterns of time series data. Experiments
were conducted using six different window sizes L ∈ {10, 30, 40, 50, 100, 200}. The
SWaT dataset was used as the dataset for evaluation. The longer the window size,
the easier it is to detect anomalies that occur over a long period and cannot be
detected with a shorter window size. However, capturing the features between
data points in the time window is more complicated. Figure 5.9 presents the
obtained results in six different window sizes. As long as the window size is
not too long, TDAD is an effective method, as in the experiment in Section 5.3.5.
As the window size increases, the performance of STDAD is better than TDAD.
TDAD has higher F1-scores up to L = 30, but when L = 40, STDAD reverses the
trend, and even for longer window sizes such as L = 100, 200, STDAD has better

results. Sparse attention makes it possible to assign zero as the attention score,
which increases the impact of strongly relevant data points in the time window. It
improves the interpretability of the patterns of time windows and achieves better
results. Since each system has different characteristics of anomalies likely to occur,
such as the duration of anomalies, it is necessary to select the appropriate model
and window size for each.

5.3.7 Roles of sparse attention

We confirmed that sparse attention assigned high attention scores between strongly
relevant data points. Figure 5.11 and Figure 5.12 show the attention pattern
for each head on the multi-head attention mechanism of the TransformerBlock
module at the encoder side when the same anomalous time window that consists
of 50 data points is given to STDAD or TDAD, respectively [87]. Each cell in the
figures shows the attention pattern for a particular head; indexed rows represent
layers, and columns represent heads. Figure 5.10 is the attention pattern of the
cell with Layer 0 and Head 0 in Figure 5.11, and the lines represent the attention
from one data point (left side) to another (right side) in the same time window.
The color intensity of the lines reflects the attention score (ranges from zero to
one), with a lighter color when the score is close to one and a darker color when
the score is close to zero. We employed STDAD and TDAD with two layers and
four heads, which resulted in eight independent attention patterns as shown in
Figure 5.11 and Figure 5.12.
The time window consisted of the first 40 normal data points and the following
ten anomalous data points. STDAD assigns higher attention scores between data
points than TDAD. Sparse attention can assign zero, assigning higher weights to
the strongly relevant data points. In particular, the value of the attention score for
anomalous data points is high. Therefore, when an anomalous time window is
given as input, the multi-head attention mechanism pays attention to anomalous
data points.
Focusing on layers, in Layer 0, both STDAD and TDAD concentrate on the
relationship between anomalous data points and each data point. On the other
hand, in Layer 1, the overall relationship between data points is considered
rather than the anomalous data points. In multi-head attention, when multiple

5.3 Experiments and Results
layers are stacked, each layer except the first layer operates on the previous
layer’s output as input. In Layer 0, the representation of each data point in the
time window contains no information about other data points than its own. As
shown in Eq. (5.1), the representation of each data point, which is the input of
Layer 1, contains information on the inner product results between the attention
score in Layer 0 and all data points in the time window. In addition, Layer 1 in
STDAD inherits more strongly the influence of anomalous data points, which
Layer 0 pays attention to, than TDAD because of the residual connection in the
TransformerBlock module.
Focusing on the heads in each layer, we see that each head learns a different
attention pattern. Multi-head attention allows the models to capture a broader
range of relationships between data points than a single attention mechanism
could.
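The per-head patterns discussed above arise because each head computes its own scaled dot-product attention matrix over the window. The sketch below, with randomly initialized rather than trained projection weights and invented sizes, shows how one L × L pattern per head is obtained; in the actual models the weights are learned, and STDAD additionally sparsifies the normalization step.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_patterns(X, n_heads):
    """Return one (L, L) attention matrix per head for a window X of shape (L, d)."""
    L, d = X.shape
    d_head = d // n_heads
    patterns = []
    for _ in range(n_heads):
        Wq = rng.normal(size=(d, d_head))       # query projection (random stand-in)
        Wk = rng.normal(size=(d, d_head))       # key projection (random stand-in)
        Q, K = X @ Wq, X @ Wk
        A = softmax(Q @ K.T / np.sqrt(d_head))  # each row is a distribution
        patterns.append(A)
    return patterns

window = rng.normal(size=(50, 8))   # 50 data points, 8 features (invented)
for head, A in enumerate(attention_patterns(window, n_heads=4)):
    print(head, A.shape)            # four independent (50, 50) patterns
```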


5.4 Conclusions

In this chapter, we proposed an anomaly detection method for multidimensional
time series data by combining GANs and Transformer and conducted evaluation
experiments using several publicly available datasets. Transformer’s mechanism,
including multi-head attention, enables us to extract temporal dependencies in
a time window with high accuracy. In addition, we proposed a method that
introduces sparse attention into the attention mechanism of Transformer. It allows
us to accurately detect longer anomalies with long window sizes by increasing
the influence of strongly relevant data points in a time window with long-term
dependencies.


Figure 5.10: Attention pattern for the cell with Layer 0, Head 0 in STDAD. The
lines represent the attention from one data point (left side) to another (right side)
in the time window.


Figure 5.11: Attention pattern between data points in the anomalous time window
per attention head in STDAD. Compared to TDAD, STDAD has higher attention
scores for strongly relevant data points, specifically anomalous data points.


Figure 5.12: Attention pattern between data points in the anomalous time window
per attention head in TDAD. Compared to STDAD, TDAD spreads its attention
scores more evenly across all data points.


Chapter 6
Conclusions and Future Work
In this thesis, we described two topics for anomaly detection in multivariate time
series data using models based on Generative Adversarial Networks (GANs) to
detect anomalies accurately.
1. We proposed GANs-based anomaly detection models for multivariate time
series data combining RNN or Transformer and verified the effectiveness of
the models through experiments.
2. In the model combining GANs and Transformer, sparse attention was
utilized for the attention mechanism of Transformer to detect anomalies that
occur over a long period with high accuracy, and the model’s effectiveness
was verified experimentally.

6.1 Conclusions

Chapter 2 summarized the foundations of anomaly detection. First, the types of
anomalies were outlined, and the metrics used in the performance evaluation of
anomaly detection were described. Finally, the learning approach for anomaly
detection in time series data was described.
Chapter 3 reviewed related work on semi-supervised anomaly detection methods
using deep learning. The methods were classified into three main categories, and
representative methods in each category were described.


In Chapters 4 and 5, we proposed GANs-based anomaly detection methods
for multivariate time series data combining RNN or Transformer, respectively.
In Chapter 4, we conducted experiments specifically on detecting collective
anomalies. The experiments showed that methods ignoring temporal dependencies
between data points could not detect collective anomalies, whereas the proposed
method detected them with a higher F1-score than any of the four existing
methods compared.
In Chapter 5, we investigated the effect of the parameters of the proposed method
by adjusting their values in experiments. Experiments on five public datasets,
comparing against five existing methods, showed that the proposed method
achieves the highest F1-score. Furthermore, we proposed a method that introduces
sparse attention as the attention mechanism of Transformer. Finally, experiments
with varying window sizes confirmed that standard attention achieves a better
F1-score at shorter window sizes, whereas sparse attention performs better at
longer window sizes.

6.2 Future Work

The following three points are directions for the future development of this thesis.
In this thesis, we aimed to construct an anomaly detection method that improves
the F1-score. In practical use, however, the training time of an anomaly detection
method is essential in addition to its detection accuracy. In [5], training time is
used as an evaluation metric alongside detection accuracy. Further research on
optimization methods that account for the proposed method's training time will
provide more practical knowledge.
In anomaly detection, much effort is also spent on identifying the causes of
anomalies [55, 56, 31, 58, 37]. The method in this thesis only detects whether or
not an anomaly occurred in multivariate time series data. A method that discovers
the specific variables, or combinations of variables, that cause an anomaly would
shorten the time needed to identify its causes and, in turn, the time needed to
recover from it.

Finally, as a more advanced form of anomaly detection, studies have also been
conducted on predictive detection, which detects the signs of an anomaly before
it occurs [44]. There is strong demand in the field for detecting such signs early,
so research on the practical application of predictive detection has great utility.
Although the objective of this thesis was to detect anomalies after they have
occurred, extending the findings to predictive detection is also a future challenge.


Bibliography
[1] Aggarwal, C. C. (2017). An introduction to outlier analysis. In Outlier analysis,
pages 1–34. Springer.
[2] Akcay, S., Atapour-Abarghouei, A., and Breckon, T. P. (2019). Ganomaly:
Semi-supervised anomaly detection via adversarial training. In Computer
Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia,
December 2–6, 2018, Revised Selected Papers, Part III 14, pages 622–637. Springer.
[3] Akçay, S., Atapour-Abarghouei, A., and Breckon, T. P. (2019). Skip-ganomaly:
Skip connected and adversarially trained encoder-decoder anomaly detection.
In 2019 International Joint Conference on Neural Networks (IJCNN), pages 1–8.
IEEE.
[4] Angiulli, F. and Pizzuti, C. (2002). Fast outlier detection in high dimensional
spaces. In European conference on principles of data mining and knowledge discovery,
pages 15–27. Springer.
[5] Audibert, J., Michiardi, P., Guyard, F., Marti, S., and Zuluaga, M. A. (2020).
Usad: Unsupervised anomaly detection on multivariate time series. In ACM,
editor, KDD 2020, 26th ACM SIGKDD Conference on Knowledge Discovery and
Data Mining, August 23-27, 2020, San Diego, USA (Virtual Conference).
[6] Bengio, Y., Courville, A., and Vincent, P. (2013). Representation learning: A
review and new perspectives. IEEE transactions on pattern analysis and machine
intelligence, 35(8):1798–1828.
[7] Blondel, M., Martins, A., and Niculae, V. (2019). Learning classifiers with
fenchel-young losses: Generalized entropies, margins, and algorithms. In
The 22nd International Conference on Artificial Intelligence and Statistics, pages
606–615. PMLR.
[8] Caron, M., Bojanowski, P., Joulin, A., and Douze, M. (2018). Deep clustering
for unsupervised learning of visual features. In Proceedings of the European
conference on computer vision (ECCV), pages 132–149.
[9] Çelik, M., Dadaşer-Çelik, F., and Dokuz, A. Ş. (2011). Anomaly detection in
temperature data using dbscan algorithm. In 2011 international symposium on
innovations in intelligent systems and applications, pages 91–95. IEEE.


[10] Chalapathy, R., Menon, A. K., and Chawla, S. (2018). Anomaly detection
using one-class neural networks. arXiv preprint arXiv:1802.06360.
[11] Chandola, V., Banerjee, A., and Kumar, V. (2009). Anomaly detection: A
survey. ACM computing surveys (CSUR), 41(3):1–58.
[12] Chen, J., Sathe, S., Aggarwal, C., and Turaga, D. (2017). Outlier detection
with autoencoder ensembles. In Proceedings of the 2017 SIAM international
conference on data mining, pages 90–98. SIAM.
[13] Chinchor, N. (1992). MUC-4 evaluation metrics. In Fourth Message Understanding Conference (MUC-4): Proceedings of a Conference Held in McLean, Virginia,
June 16-18, 1992.
[14] Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F.,
Schwenk, H., and Bengio, Y. (2014). Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint
arXiv:1406.1078.
[15] Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation
of gated recurrent neural networks on sequence modeling. arXiv preprint
arXiv:1412.3555.
[16] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine learning,
20(3):273–297.
[17] Dilokthanakul, N., Mediano, P. A., Garnelo, M., Lee, M. C., Salimbeni, H.,
Arulkumaran, K., and Shanahan, M. (2016). Deep unsupervised clustering with
gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648.
[18] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint
arXiv:1606.05908.
[19] Elman, J. L. (1990). Finding structure in time. Cognitive science, 14(2):179–211.
[20] Emmott, A. F., Das, S., Dietterich, T., Fern, A., and Wong, W.-K. (2013).
Systematic construction of anomaly detection benchmarks from real data. In
Proceedings of the ACM SIGKDD workshop on outlier detection and description,
pages 16–21.
[21] Gers, F. A., Schmidhuber, J., and Cummins, F. (2000). Learning to forget:
Continual prediction with lstm. Neural computation, 12(10):2451–2471.
[22] Ghasedi Dizaji, K., Herandi, A., Deng, C., Cai, W., and Huang, H. (2017).
Deep clustering via joint convolutional autoencoder embedding and relative
entropy minimization. In Proceedings of the IEEE international conference on
computer vision, pages 5736–5745.


[23] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair,
S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. Advances
in neural information processing systems, 27.
[24] Haloui, I., Gupta, J. S., and Feuillard, V. (2018). Anomaly detection with
wasserstein gan. arXiv preprint arXiv:1812.02463.
[25] Hawkins, S., He, H., Williams, G., and Baxter, R. (2002). Outlier detection
using replicator neural networks. In International Conference on Data Warehousing
and Knowledge Discovery, pages 170–180. Springer.
[26] He, Z., Xu, X., and Deng, S. (2003). Discovering cluster-based local outliers.
Pattern recognition letters, 24(9-10):1641–1650.
[27] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality
of data with neural networks. science, 313(5786):504–507.
[28] Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural
computation, 9(8):1735–1780.
[29] Huber, P. J. (2011). Robust statistics. In International encyclopedia of statistical
science, pages 1248–1251. Springer.
[30] Hundman, K., Constantinou, V., Laporte, C., Colwell, I., and Söderström,
T. (2018). Detecting Spacecraft Anomalies Using LSTMs and Nonparametric
Dynamic Thresholding. In Proceedings of the 24th International Conference on
Knowledge Discovery and Data Mining, pages 387–395. ACM.
[31] Hwang, C. and Lee, T. (2021). E-sfd: Explainable sensor fault detection in
the ics anomaly detection system. IEEE Access, 9:140470–140486.
[32] Jiang, M.-F., Tseng, S.-S., and Su, C.-M. (2001). Two-phase clustering process
for outliers detection. Pattern recognition letters, 22(6-7):691–700.
[33] Jiang, S., Song, X., Wang, H., Han, J.-J., and Li, Q.-H. (2006). A clustering-based
method for unsupervised intrusion detections. Pattern Recognition Letters,
27(7):802–810.
[34] Jiang, T., Xie, W., Li, Y., Lei, J., and Du, Q. (2021). Weakly supervised discriminative learning with spectral constrained generative adversarial network
for hyperspectral anomaly detection. IEEE Transactions on Neural Networks and
Learning Systems, 33(11):6504–6517.
[35] Keogh, E., Taposh, D. R., Naik, U., and Agrawal, A. (2021). Multi-dataset
time-series anomaly detection competition. In ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining. https://compete.hexagonml.com/practice/competition/39.


[36] Kieu, T., Yang, B., Guo, C., and Jensen, C. S. (2019). Outlier detection for
time series with recurrent autoencoder ensembles. In IJCAI, pages 2725–2732.
[37] Kim, D., Antariksa, G., Handayani, M. P., Lee, S., and Lee, J. (2021). Explainable anomaly detection framework for maritime main engine sensor data.
Sensors, 21(15):5200.
[38] Kimura, D., Chaudhury, S., Narita, M., Munawar, A., and Tachibana, R.
(2020). Adversarial discriminative attention for robust anomaly detection. In
Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision,
pages 2172–2181.
[39] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization.
arXiv preprint arXiv:1412.6980.
[40] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes.
arXiv preprint arXiv:1312.6114.
[41] Lakew, S. M., Cettolo, M., and Federico, M. (2018). A comparison of
transformer and recurrent neural networks on multilingual neural machine
translation. arXiv preprint arXiv:1806.06957.
[42] Li, D., Chen, D., Jin, B., Shi, L., Goh, J., and Ng, S.-K. (2019). Mad-gan:
Multivariate anomaly detection for time series data with generative adversarial
networks. In International conference on artificial neural networks, pages 703–716.
Springer.
[43] Liao, W., Guo, Y., Chen, X., and Li, P. (2018). A unified unsupervised gaussian
mixture variational autoencoder for high dimensional outlier detection. In
2018 IEEE International Conference on Big Data (Big Data), pages 1208–1217. IEEE.
[44] Lin, J., Fernández, J. A., Rayhana, R., Zaji, A., Zhang, R., Herrera, O. E.,
Liu, Z., and Mérida, W. (2022). Predictive analytics for building power
demand: Day-ahead forecasting and anomaly prediction. Energy and Buildings,
255:111670.
[45] Mahadevan, V., Li, W., Bhalodia, V., and Vasconcelos, N. (2010). Anomaly
detection in crowded scenes. In 2010 IEEE computer society conference on computer
vision and pattern recognition, pages 1975–1981. IEEE.
[46] Makhzani, A. and Frey, B. (2013). K-sparse autoencoders. arXiv preprint
arXiv:1312.5663.
[47] Malhotra, P., Ramakrishnan, A., Anand, G., Vig, L., Agarwal, P., and Shroff,
G. (2016). Lstm-based encoder-decoder for multi-sensor anomaly detection.
arXiv preprint arXiv:1607.00148.


[48] Malhotra, P., Vig, L., Shroff, G., Agarwal, P., et al. (2015). Long short
term memory networks for anomaly detection in time series. In Proceedings,
volume 89, pages 89–94.
[49] Mathur, A. P. and Tippenhauer, N. O. (2016). Swat: A water treatment
testbed for research and training on ics security. In 2016 international workshop
on cyber-physical systems for smart water networks (CySWater), pages 31–36. IEEE.
[50] Motamed, S., Rogalla, P., and Khalvati, F. (2021). Randgan: randomized
generative adversarial network for detection of covid-19 in chest x-ray. Scientific
Reports, 11(1):1–10.
[51] Moya, M. M., Koch, M. W., and Hostetler, L. D. (1993). One-class classifier
networks for target recognition applications. NASA STI/Recon Technical Report
N, 93:24043.
[52] Ng, A. et al. (2011). Sparse autoencoder. CS294A Lecture notes, 72(2011):1–19.
[53] Ngo, P. C., Winarto, A. A., Kou, C. K. L., Park, S., Akram, F., and Lee, H. K.
(2019). Fence gan: Towards better anomaly detection. In 2019 IEEE 31St
International Conference on tools with artificial intelligence (ICTAI), pages 141–148.
IEEE.
[54] Nguyen, M.-N. and Vien, N. A. (2018). Scalable and interpretable one-class
svms with deep learning and random fourier features. In Joint European
Conference on Machine Learning and Knowledge Discovery in Databases, pages
157–172. Springer.
[55] Nor, A. K. M., Pedapati, S. R., Muhammad, M., and Leiva, V. (2022).
Abnormality detection and failure prediction using explainable bayesian deep
learning: Methodology and case study with industrial data. Mathematics,
10(4):554.
[56] Pang, G. and Aggarwal, C. (2021). Toward explainable deep anomaly
detection. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge
Discovery & Data Mining, pages 4056–4057.
[57] Pang, G., Cao, L., Chen, L., and Liu, H. (2018). Learning representations
of ultrahigh-dimensional data for random distance-based outlier detection.
In Proceedings of the 24th ACM SIGKDD international conference on knowledge
discovery & data mining, pages 2041–2050.
[58] Pang, G., Ding, C., Shen, C., and Hengel, A. v. d. (2021a). Explainable
deep few-shot anomaly detection with deviation networks. arXiv preprint
arXiv:2108.00462.
[59] Pang, G., Shen, C., Cao, L., and Hengel, A. V. D. (2021b). Deep learning for
anomaly detection: A review. ACM Computing Surveys (CSUR), 54(2):1–38.


[60] Pang, G., Ting, K. M., and Albrecht, D. (2015). Lesinn: Detecting anomalies
by identifying least similar nearest neighbours. In 2015 IEEE international
conference on data mining workshop (ICDMW), pages 623–630. IEEE.
[61] Perera, P., Morariu, V. I., Jain, R., Manjunatha, V., Wigington, C., Ordonez,
V., and Patel, V. M. (2020). Generative-discriminative feature representations
for open-set recognition. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 11814–11823.
[62] Perera, P., Nallapati, R., and Xiang, B. (2019). Ocgan: One-class novelty
detection using gans with constrained latent representations. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages
2898–2906.
[63] Peters, B., Niculae, V., and Martins, A. F. T. (2019). Sparse sequence-to-sequence models. In Proceedings of the 57th Annual Meeting of the Association
for Computational Linguistics, pages 1504–1519. Association for Computational
Linguistics.
[64] Pidhorskyi, S., Almohsen, R., and Doretto, G. (2018). Generative probabilistic
novelty detection with adversarial autoencoders. Advances in neural information
processing systems, 31.
[65] Ramaswamy, S., Rastogi, R., and Shim, K. (2000). Efficient algorithms for
mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD
international conference on Management of data, pages 427–438.
[66] Ravanbakhsh, M., Nabi, M., Sangineto, E., Marcenaro, L., Regazzoni, C.,
and Sebe, N. (2017). Abnormal event detection in videos using generative
adversarial nets. In 2017 IEEE international conference on image processing (ICIP),
pages 1577–1581. IEEE.
[67] Ravanbakhsh, M., Sangineto, E., Nabi, M., and Sebe, N. (2019). Training
adversarial discriminators for cross-channel abnormal event detection in
crowds. In 2019 IEEE Winter Conference on Applications of Computer Vision
(WACV), pages 1896–1904. IEEE.
[68] Roth, V. (2004). Outlier detection with one-class kernel fisher discriminants.
Advances in Neural Information Processing Systems, 17.
[69] Ruff, L., Vandermeulen, R., Goernitz, N., Deecke, L., Siddiqui, S. A., Binder,
A., Müller, E., and Kloft, M. (2018). Deep one-class classification. In International
conference on machine learning, pages 4393–4402. PMLR.
[70] Ruff, L., Vandermeulen, R. A., Görnitz, N., Binder, A., Müller, E., Müller,
K.-R., and Kloft, M. (2019). Deep semi-supervised anomaly detection. arXiv
preprint arXiv:1906.02694.


[71] Sabokrou, M., Khalooei, M., Fathy, M., and Adeli, E. (2018). Adversarially
learned one-class classifier for novelty detection. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 3379–3388.
[72] Schlegl, T., Seeböck, P., Waldstein, S. M., Langs, G., and Schmidt-Erfurth,
U. (2019). f-anogan: Fast unsupervised anomaly detection with generative
adversarial networks. Medical image analysis, 54:30–44.
[73] Schlegl, T., Seeböck, P., Waldstein, S. M., Schmidt-Erfurth, U., and Langs, G.
(2017). Unsupervised anomaly detection with generative adversarial networks
to guide marker discovery. In International conference on information processing
in medical imaging, pages 146–157. Springer.
[74] Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., and Williamson,
R. C. (2001). Estimating the support of a high-dimensional distribution. Neural
computation, 13(7):1443–1471.
[75] Schubert, E., Sander, J., Ester, M., Kriegel, H. P., and Xu, X. (2017). Dbscan revisited, revisited: why and how you should (still) use dbscan. ACM
Transactions on Database Systems (TODS), 42(3):1–21.
[76] Song, H., Li, P., and Liu, H. (2021). Deep clustering based fair outlier
detection. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge
Discovery & Data Mining, page 1481–1489.
[77] Storey-Fisher, K., Huertas-Company, M., Ramachandra, N., Lanusse, F.,
Leauthaud, A., Luo, Y., Huang, S., and Prochaska, J. X. (2021). Anomaly
detection in hyper suprime-cam galaxy images with generative adversarial
networks. Monthly Notices of the Royal Astronomical Society, 508(2):2946–2963.
[78] Su, Y., Zhao, Y., Niu, C., Liu, R., Sun, W., and Pei, D. (2019). Robust anomaly
detection for multivariate time series through stochastic recurrent neural
network. In Proceedings of the 25th ACM SIGKDD international conference on
knowledge discovery & data mining, pages 2828–2837.
[79] Sugiyama, M. and Borgwardt, K. (2013). Rapid distance-based outlier
detection via sampling. Advances in neural information processing systems, 26.
[80] Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning
with neural networks. Advances in neural information processing systems, 27.
[81] Tax, D. M. and Duin, R. P. (2004). Support vector data description. Machine
learning, 54(1):45–66.
[82] Tian, F., Gao, B., Cui, Q., Chen, E., and Liu, T.-Y. (2014). Learning deep
representations for graph clustering. In Proceedings of the AAAI Conference on
Artificial Intelligence, volume 28.


[83] Tsallis, C. (1988). Possible generalization of boltzmann-gibbs statistics.
Journal of statistical physics, 52(1):479–487.
[84] Tuli, S., Casale, G., and Jennings, N. R. (2022). TranAD: Deep Transformer
Networks for Anomaly Detection in Multivariate Time Series Data. Proceedings
of VLDB, 15(6):1201–1214.
[85] Van der Maaten, L. and Hinton, G. (2008). Visualizing data using t-sne.
Journal of machine learning research, 9(11).
[86] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N.,
Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. Advances in
neural information processing systems, 30.
[87] Vig, J. (2019). A multiscale visualization of attention in the transformer
model. arXiv preprint arXiv:1906.05714.
[88] Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting
and composing robust features with denoising autoencoders. In Proceedings of
the 25th international conference on Machine learning, pages 1096–1103.
[89] Wang, H., Pang, G., Shen, C., and Ma, C. (2020). Unsupervised representation
learning by predicting random distances. In Bessiere, C., editor, Proceedings of
the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20,
pages 2950–2956. International Joint Conferences on Artificial Intelligence
Organization.
[90] Wu, P., Liu, J., and Shen, F. (2019). A deep one-class neural network for
anomalous event detection in complex scenes. IEEE transactions on neural
networks and learning systems, 31(7):2609–2622.
[91] Xie, J., Girshick, R., and Farhadi, A. (2016). Unsupervised deep embedding
for clustering analysis. In International conference on machine learning, pages
478–487. PMLR.
[92] Yang, J., Parikh, D., and Batra, D. (2016). Joint unsupervised learning of
deep representations and image clusters. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pages 5147–5156.
[93] Yang, X., Deng, C., Zheng, F., Yan, J., and Liu, W. (2019). Deep spectral
clustering using dual autoencoder network. In Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition, pages 4066–4075.
[94] Yao, R., Liu, C., Zhang, L., and Peng, P. (2019). Unsupervised anomaly
detection using variational auto-encoder based feature extraction. In 2019 IEEE
International Conference on Prognostics and Health Management (ICPHM), pages
1–7. IEEE.


[95] Zavrtanik, V., Kristan, M., and Skočaj, D. (2021). Draem - a discriminatively
trained reconstruction embedding for surface anomaly detection. In Proceedings
of the IEEE/CVF International Conference on Computer Vision, pages 8330–8339.
[96] Zenati, H., Foo, C. S., Lecouat, B., Manek, G., and Chandrasekhar, V. R.
(2018a). Efficient gan-based anomaly detection. arXiv preprint arXiv:1802.06222.
[97] Zenati, H., Romain, M., Foo, C.-S., Lecouat, B., and Chandrasekhar, V. (2018b).
Adversarially learned anomaly detection. In 2018 IEEE International conference
on data mining (ICDM), pages 727–736. IEEE.
[98] Zhang, K., Hutter, M., and Jin, H. (2009). A new local distance-based outlier
detection approach for scattered real-world data. In Pacific-Asia Conference on
Knowledge Discovery and Data Mining, pages 813–822. Springer.
[99] Zhou, K., Gao, S., Cheng, J., Gu, Z., Fu, H., Tu, Z., Yang, J., Zhao, Y., and
Liu, J. (2020). Sparse-gan: Sparsity-constrained generative adversarial network
for anomaly detection in retinal oct image. In 2020 IEEE 17th International
Symposium on Biomedical Imaging (ISBI), pages 1227–1231. IEEE.
[100] Zong, B., Song, Q., Min, M. R., Cheng, W., Lumezanu, C., Cho, D., and
Chen, H. (2018). Deep autoencoding gaussian mixture model for unsupervised
anomaly detection. In International conference on learning representations.

Appendix A

A.1 F-score

Definition of the F-score
The F-score measures accuracy in binary classification and is calculated from
precision and recall. The F-score $f_\beta$ is computed as
$$f_\beta = \frac{(\beta^2 + 1)PR}{\beta^2 P + R}, \tag{A.1}$$
where $P$ is precision, $R$ is recall, and $\beta$ is a parameter that controls the balance
between precision and recall [13]: recall is weighted $\beta$ times as heavily as
precision.
In particular, the F1-score $f_1$ is Eq. (A.1) with $\beta = 1$,
$$f_1 = \frac{2PR}{P + R}, \tag{A.2}$$
which is the harmonic mean of precision and recall.
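Eqs. (A.1) and (A.2) can be checked with a few lines of code; the true-positive, false-positive, and false-negative counts below are invented for illustration.

```python
def f_score(precision, recall, beta=1.0):
    """F-score of Eq. (A.1); recall is weighted beta times as heavily as precision."""
    return (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)

# Invented detector output: 8 true positives, 2 false positives, 4 false negatives.
tp, fp, fn = 8, 2, 4
precision = tp / (tp + fp)   # 0.8
recall = tp / (tp + fn)      # 2/3
print(f_score(precision, recall))             # F1, the harmonic mean: 8/11
print(f_score(precision, recall, beta=2.0))   # F2 weights recall more heavily
```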


A.2 GANs Optimization

The objective function of GANs with respect to the discriminator $D$ is

$$\max_D \mathcal{L}(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]. \tag{A.3}$$

Using the distribution $p_G$ of the data generated by the generator $G$, Eq. (A.3) is
transformed into

$$\begin{aligned}
\mathcal{L}(D, G) &= \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \\
&= \int_x p_{\mathrm{data}}(x) \log D(x)\,dx + \int_z p_z(z) \log(1 - D(G(z)))\,dz \\
&= \int_x \big( p_{\mathrm{data}}(x) \log D(x) + p_G(x) \log(1 - D(x)) \big)\,dx.
\end{aligned} \tag{A.4}$$

The discriminator $D = D^*_G$ that maximizes Eq. (A.4) is

$$D^*_G(x) = \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_G(x)}. \tag{A.5}$$

Substituting $D^*_G$ into the objective function $\mathcal{L}$ gives

$$\begin{aligned}
\mathcal{L}(D^*_G, G) &= \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log D^*_G(x)\right] + \mathbb{E}_{z \sim p_z}\!\left[\log\big(1 - D^*_G(G(z))\big)\right] \\
&= \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log D^*_G(x)\right] + \mathbb{E}_{x \sim p_G}\!\left[\log\big(1 - D^*_G(x)\big)\right] \\
&= \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_G(x)}\right] + \mathbb{E}_{x \sim p_G}\!\left[\log \frac{p_G(x)}{p_{\mathrm{data}}(x) + p_G(x)}\right].
\end{aligned} \tag{A.6}$$

Using the Kullback-Leibler divergence, Eq. (A.6) is rewritten as

$$\begin{aligned}
\mathcal{L}(D^*_G, G) &= \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log\!\left(\frac{p_{\mathrm{data}}(x)}{\frac{p_{\mathrm{data}}(x) + p_G(x)}{2}} \cdot \frac{1}{2}\right)\right] + \mathbb{E}_{x \sim p_G}\!\left[\log\!\left(\frac{p_G(x)}{\frac{p_{\mathrm{data}}(x) + p_G(x)}{2}} \cdot \frac{1}{2}\right)\right] \\
&= -\log 4 + \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log \frac{p_{\mathrm{data}}(x)}{\frac{p_{\mathrm{data}}(x) + p_G(x)}{2}}\right] + \mathbb{E}_{x \sim p_G}\!\left[\log \frac{p_G(x)}{\frac{p_{\mathrm{data}}(x) + p_G(x)}{2}}\right] \\
&= -\log 4 + \mathrm{KL}\!\left(p_{\mathrm{data}} \,\Big\|\, \frac{p_{\mathrm{data}} + p_G}{2}\right) + \mathrm{KL}\!\left(p_G \,\Big\|\, \frac{p_{\mathrm{data}} + p_G}{2}\right).
\end{aligned} \tag{A.7}$$

Rewriting Eq. (A.7) using the Jensen-Shannon divergence, we obtain

$$\mathcal{L}(D^*_G, G) = -\log 4 + 2 \cdot \mathrm{JS}\!\left(p_{\mathrm{data}} \,\|\, p_G\right). \tag{A.8}$$

Since the Jensen-Shannon divergence between two distributions is always
non-negative and zero only when they are equal, Eq. (A.8) attains its global
minimum $-\log 4$ if and only if $p_{\mathrm{data}} = p_G$. In other words, the generator that
minimizes $\max_D \mathcal{L}(D, G)$ is the one whose distribution matches the data
distribution exactly, which is the solution of the minimax game.
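The closed form of Eq. (A.8) can be verified numerically for discrete distributions by evaluating Eq. (A.6) at the optimal discriminator of Eq. (A.5) and comparing it with $-\log 4 + 2 \cdot \mathrm{JS}$; the two distributions below are arbitrary.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence between discrete distributions p and q."""
    return float(np.sum(p * np.log(p / q)))

def value_at_optimal_d(p_data, p_g):
    """L(D*_G, G) of Eq. (A.6) with D*(x) = p_data(x) / (p_data(x) + p_g(x))."""
    d_star = p_data / (p_data + p_g)
    return float(np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1 - d_star)))

p_data = np.array([0.5, 0.3, 0.2])   # arbitrary data distribution
p_g = np.array([0.2, 0.5, 0.3])      # arbitrary generator distribution
m = (p_data + p_g) / 2
js = 0.5 * kl(p_data, m) + 0.5 * kl(p_g, m)

print(value_at_optimal_d(p_data, p_g))     # Eq. (A.6) evaluated directly
print(-np.log(4) + 2 * js)                 # Eq. (A.8): the two values agree

# The global minimum -log 4 is attained only when p_g equals p_data.
print(value_at_optimal_d(p_data, p_data))  # -log 4, about -1.386
```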

この論文で使われている画像

参考文献

109

[95] Zavrtanik, V., Kristan, M., and Skočaj, D. (2021). DRAEM: a discriminatively trained reconstruction embedding for surface anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8330–8339.

[96] Zenati, H., Foo, C. S., Lecouat, B., Manek, G., and Chandrasekhar, V. R. (2018a). Efficient GAN-based anomaly detection. arXiv preprint arXiv:1802.06222.

[97] Zenati, H., Romain, M., Foo, C.-S., Lecouat, B., and Chandrasekhar, V. (2018b). Adversarially learned anomaly detection. In 2018 IEEE International Conference on Data Mining (ICDM), pages 727–736. IEEE.

[98] Zhang, K., Hutter, M., and Jin, H. (2009). A new local distance-based outlier detection approach for scattered real-world data. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 813–822. Springer.

[99] Zhou, K., Gao, S., Cheng, J., Gu, Z., Fu, H., Tu, Z., Yang, J., Zhao, Y., and Liu, J. (2020). Sparse-GAN: sparsity-constrained generative adversarial network for anomaly detection in retinal OCT image. In 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), pages 1227–1231. IEEE.

[100] Zong, B., Song, Q., Min, M. R., Cheng, W., Lumezanu, C., Cho, D., and Chen, H. (2018). Deep autoencoding Gaussian mixture model for unsupervised anomaly detection. In International Conference on Learning Representations.

Appendix A

A.1 Definition of the F-score

The F-score measures accuracy in binary classification, calculated from precision and recall. The formula for computing the F-score $f_\beta$ is

$$ f_\beta = \frac{(\beta^2 + 1)PR}{\beta^2 P + R} \tag{A.1} $$

where $P$ is precision, $R$ is recall, and $\beta$ is a parameter that controls the balance between precision and recall [13]; $\beta$ indicates how much more important recall is than precision.

In particular, the F1-score $f_1$ is defined as

$$ f_1 = \frac{2PR}{P + R} \tag{A.2} $$

by setting $\beta = 1$ in Eq. (A.1). Eq. (A.2) is the harmonic mean of precision and recall.
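As a concrete illustration of Eqs. (A.1) and (A.2), a minimal Python sketch (the function name `f_score` and the zero-division convention are ours, not from the thesis):

```python
def f_score(precision, recall, beta=1.0):
    """F-score f_beta of Eq. (A.1); beta > 1 weights recall more heavily."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # degenerate case: no true positives at all
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)
```

With `beta=1.0` this reduces to the harmonic mean of Eq. (A.2).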


A.2 GANs Optimization

The loss function of a discriminator $D$ is defined as

$$ \max_D \mathcal{L}(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]. \tag{A.3} $$

Using the data distribution $p_G$ of the data generated from a generator $G$, Eq. (A.3) is transformed to

$$
\begin{aligned}
\mathcal{L}(D, G) &= \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \\
&= \int_x p_{\mathrm{data}}(x) \log(D(x))\,dx + \int_z p_z(z) \log(1 - D(G(z)))\,dz \\
&= \int_x p_{\mathrm{data}}(x) \log(D(x)) + p_G(x) \log(1 - D(x))\,dx.
\end{aligned} \tag{A.4}
$$

The discriminator $D = D^*_G$ that maximizes Eq. (A.4) is given by

$$ D^*_G(x) = \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_G(x)}. \tag{A.5} $$
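Eq. (A.5) follows from maximizing the integrand of Eq. (A.4) pointwise. As a brief check (this intermediate step is our addition): for fixed $x$, write $a = p_{\mathrm{data}}(x)$ and $b = p_G(x)$; the integrand has the form $f(y) = a \log y + b \log(1-y)$, maximized over $y \in (0, 1)$ where its derivative vanishes:

```latex
% Pointwise maximization of f(y) = a log y + b log(1 - y), with a, b > 0:
\[
f'(y) = \frac{a}{y} - \frac{b}{1-y} = 0
\quad \Longrightarrow \quad
y = \frac{a}{a+b},
\]
% and f''(y) < 0, so this critical point is the maximum; substituting
% a = p_data(x), b = p_G(x) gives D*_G(x) of Eq. (A.5).
```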

Substituting $D^*_G$ into the GANs objective function $\mathcal{L}$, $\mathcal{L}(D^*_G, G)$ is represented as

$$
\begin{aligned}
\mathcal{L}(D^*_G, G) &= \mathbb{E}_{x \sim p_{\mathrm{data}}}\left[\log D^*_G(x)\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D^*_G(G(z))\right)\right] \\
&= \mathbb{E}_{x \sim p_{\mathrm{data}}}\left[\log D^*_G(x)\right] + \mathbb{E}_{x \sim p_G}\left[\log\left(1 - D^*_G(x)\right)\right] \\
&= \mathbb{E}_{x \sim p_{\mathrm{data}}}\left[\log \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_G(x)}\right] + \mathbb{E}_{x \sim p_G}\left[\log \frac{p_G(x)}{p_{\mathrm{data}}(x) + p_G(x)}\right].
\end{aligned} \tag{A.6}
$$


Using the Kullback-Leibler divergence, Eq. (A.6) is represented as

$$
\begin{aligned}
\mathcal{L}(D^*_G, G) &= \mathbb{E}_{x \sim p_{\mathrm{data}}}\left[\log D^*_G(x)\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D^*_G(G(z))\right)\right] \\
&= \mathbb{E}_{x \sim p_{\mathrm{data}}}\left[\log\left(\frac{p_{\mathrm{data}}(x)}{\frac{p_{\mathrm{data}}(x) + p_G(x)}{2}} \cdot \frac{1}{2}\right)\right] + \mathbb{E}_{x \sim p_G}\left[\log\left(\frac{p_G(x)}{\frac{p_{\mathrm{data}}(x) + p_G(x)}{2}} \cdot \frac{1}{2}\right)\right] \\
&= -\log 4 + \mathbb{E}_{x \sim p_{\mathrm{data}}}\left[\log \frac{p_{\mathrm{data}}(x)}{\frac{p_{\mathrm{data}}(x) + p_G(x)}{2}}\right] + \mathbb{E}_{x \sim p_G}\left[\log \frac{p_G(x)}{\frac{p_{\mathrm{data}}(x) + p_G(x)}{2}}\right] \\
&= -\log 4 + \mathrm{KL}\left(p_{\mathrm{data}} \,\Big\|\, \frac{p_{\mathrm{data}} + p_G}{2}\right) + \mathrm{KL}\left(p_G \,\Big\|\, \frac{p_{\mathrm{data}} + p_G}{2}\right).
\end{aligned} \tag{A.7}
$$

Rewriting Eq. (A.7) using the Jensen-Shannon divergence, we obtain

$$
\begin{aligned}
\mathcal{L}(D^*_G, G) &= -\log 4 + \mathrm{KL}\left(p_{\mathrm{data}} \,\Big\|\, \frac{p_{\mathrm{data}} + p_G}{2}\right) + \mathrm{KL}\left(p_G \,\Big\|\, \frac{p_{\mathrm{data}} + p_G}{2}\right) \\
&= -\log 4 + 2 \cdot \mathrm{JS}\left(p_{\mathrm{data}} \,\|\, p_G\right).
\end{aligned} \tag{A.8}
$$

Since the Jensen-Shannon divergence between two distributions is always non-negative and zero only when they are equal, Eq. (A.8) attains its global minimum $-\log 4$ exactly when $p_{\mathrm{data}} = p_G$. In other words, when $p_{\mathrm{data}} = p_G$, the GANs objective function $\mathcal{L}$ is maximized with respect to $D$ and minimized with respect to $G$, so this point is the minimax optimum.
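This conclusion can be checked numerically for discrete distributions. The sketch below (our own illustration, not part of the thesis) evaluates Eq. (A.8) and confirms that the objective at the optimal discriminator bottoms out at $-\log 4$ exactly when the two distributions coincide:

```python
import numpy as np

def kl(p, q):
    # Kullback-Leibler divergence KL(p || q) for discrete distributions
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    # Jensen-Shannon divergence via the mixture m = (p + q) / 2
    m = (np.asarray(p, dtype=float) + np.asarray(q, dtype=float)) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def objective_at_optimal_d(p_data, p_g):
    # L(D*_G, G) of Eq. (A.8): -log 4 + 2 * JS(p_data || p_G)
    return -np.log(4) + 2 * js(p_data, p_g)
```

For `p_data == p_g` the value is $-\log 4 \approx -1.386$; any mismatch raises it, matching the minimax argument above.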
