Classification methodologies with high-dimensional data in heterogeneity of sample size and covariance matrices
Abstract
The development of measurement instruments over the last few decades has confronted us with data sets whose number of variables is considerably larger than the sample size; these are called “high-dimension, low-sample-size” (HDLSS) data. Microarray data, image data, and spectroscopic map data are typical examples of HDLSS data. Classical statistical methodologies were developed under the assumption that the sample size exceeds the dimension. These methodologies generally perform better under this condition than when the sample size is smaller than the dimension, as illustrated by the law of large numbers and the central limit theorem. As another example, the inverse of the sample covariance matrix is guaranteed to exist only when the sample size is larger than the dimension; when the dimension exceeds the sample size, the sample covariance matrix is singular. Given this background, it was anticipated that conventional statistics could give irrelevant answers for HDLSS data, and it has indeed been shown that some statistical methodologies give biased results for high-dimensional data. To validate statistical methodologies for HDLSS data, we need to advance our current understanding of multivariate analysis to the HDLSS setting.
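The singularity of the sample covariance matrix in the HDLSS setting can be checked numerically. The following is a minimal sketch (not from the paper; the sample sizes and dimensions are made up for illustration) showing that with n samples in p > n dimensions, the sample covariance matrix has rank at most n - 1 and is therefore not invertible.

```python
import numpy as np

# Illustration: n = 10 samples in p = 50 dimensions. The sample
# covariance matrix then has rank at most n - 1 = 9 < p, so it is
# singular and its inverse does not exist.
rng = np.random.default_rng(0)
n, p = 10, 50
X = rng.standard_normal((n, p))      # one row per sample

Xc = X - X.mean(axis=0)              # center the data
S = Xc.T @ Xc / (n - 1)              # p x p sample covariance matrix

rank = np.linalg.matrix_rank(S)
eigmin = np.linalg.eigvalsh(S)[0]    # smallest eigenvalue of S
print(rank, np.isclose(eigmin, 0.0)) # rank <= 9 and S has a zero eigenvalue
```

Any method that requires S to be inverted, such as classical Fisher discriminant analysis, therefore breaks down as soon as the dimension exceeds the sample size.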
In the field of machine learning, there are many studies on discriminant analysis, which is a
type of supervised learning. In the HDLSS context, Hall et al. [17], Chan and Hall [11], and
Aoshima and Yata [4] considered distance-based classifiers. Aoshima and Yata [7] developed a
distance-based classifier by transforming data. Ishii et al. [22] proposed a quadratic classifier using
a data transformation technique. Aoshima and Yata [3, 5] considered geometric classifiers based on
a geometric representation of HDLSS data. In addition, Aoshima and Yata [8] considered quadratic
classifiers in general and discussed their optimality under high-dimension and non-sparse settings.
One of the representative tools of binary linear discriminant analysis is the support vector
machine (SVM) developed by Vapnik [35]. Hall et al. [16], Chan and Hall [11], and Nakayama et
al. [28, 29] examined the asymptotic properties of the SVM in the HDLSS context. Nakayama et al.
[28, 29] showed that the SVM is strongly inconsistent under unbalanced sample sizes, proposed bias-corrected SVMs, and demonstrated their superiority over the SVM. In contrast, Marron et al. [27] pointed out that the SVM suffers from the data-piling problem in the HDLSS context. Data piling is the phenomenon in which the projections of the training data onto the normal direction vector of a separating hyperplane take a common value within each class. Marron et al. [27] proposed distance-weighted discrimination (DWD) to
overcome this data piling issue. Unfortunately, the DWD is designed for balanced training data
sets. For imbalanced training data sets, Qiao et al. [32] developed the weighted DWD (WDWD),
which imposes different weights on two classes. However, the asymptotic properties of DWD have
not been sufficiently studied in the HDLSS context.
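The geometric essence of data piling can be seen in a few lines of numpy. The sketch below (hypothetical data, not from the paper) shows that whenever p > n, a direction always exists onto which every training point of each class projects to a single common value; Marron et al. [27] observed that the SVM normal vector behaves like such a direction in the HDLSS context.

```python
import numpy as np

# With p >> n, the n linear constraints "x_i . v = y_i" on v in R^p
# are solvable exactly, so the training data pile at +1 and -1 when
# projected onto v. (Illustrative data; classes are balanced.)
rng = np.random.default_rng(1)
n, p = 12, 300
X = rng.standard_normal((n, p))      # training data, one row per sample
y = np.repeat([1.0, -1.0], n // 2)   # two balanced classes

# Least-norm solution of X v = y; exact because rank(X) = n < p.
v = np.linalg.pinv(X) @ y

proj = X @ v                         # projections onto direction v
print(np.round(proj, 6))             # piles exactly at +1 and -1
```

Such directions separate the training data perfectly but carry little information about the population, which is why data piling degrades generalization.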
To apply binary discriminant analysis to multiclass classification, a series of binary classification problems is solved using the one-versus-one (OVO) or one-versus-rest (OVR) strategy. Instead
of regarding multiclass classification as a series of binary classifications, Lee et al. [25] proposed
multiclass SVM (MSVM) that simultaneously finds classifier functions. Further, Huang et al. [20]
proposed multiclass DWD (MDWD) by generalizing binary DWD. MSVM is computationally cheaper than MDWD. Nakayama et al. [28, 29] discovered a bias term in the discriminant function of the binary SVM in high-dimensional and unbalanced settings. Because MSVM generalizes the binary SVM, its discriminant functions are expected to contain a similar bias term.
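The OVO and OVR aggregation rules mentioned above can be sketched as follows. This is a hypothetical illustration, not the paper's method: simple nearest-mean rules stand in for the binary discriminant functions, and the class means are made up.

```python
import numpy as np

# Three classes in R^2 with made-up class means.
means = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])

def ovr_predict(x, means):
    # One-versus-rest: class k's score plays the role of a binary
    # "class k versus the rest" discriminant; the largest score wins.
    scores = -np.sum((means - x) ** 2, axis=1)
    return int(np.argmax(scores))

def ovo_predict(x, means):
    # One-versus-one: each pair (j, k) gets a binary classifier whose
    # vote goes to the nearer class mean; most votes wins overall.
    K = len(means)
    votes = np.zeros(K, dtype=int)
    for j in range(K):
        for k in range(j + 1, K):
            dj = np.sum((x - means[j]) ** 2)
            dk = np.sum((x - means[k]) ** 2)
            votes[j if dj < dk else k] += 1
    return int(np.argmax(votes))

x_new = np.array([3.5, 0.5])
print(ovr_predict(x_new, means), ovo_predict(x_new, means))  # both give 1
```

OVR trains K binary classifiers while OVO trains K(K-1)/2, which is one reason simultaneous formulations such as MSVM and MDWD are attractive for many classes.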
Clustering is another type of classification, one that is unsupervised. The aim is to group a set of data without supervision such that the data within each group are similar. Cluster analysis is divided into two types:
partitional and hierarchical. Partitional clustering, as its name suggests, splits data into a predetermined number of clusters. For discussions of non-hierarchical cluster analysis, see Everitt et al.
[15] and Hastie et al. [18], among others. Hierarchical clustering is a methodology to group a set of data by building a dendrogram based on a similarity or dissimilarity measure between clusters, such that the data in a cluster are similar in the sense of a predetermined linkage function (given later). ...
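The bottom-up merging that builds a dendrogram can be sketched as follows. This is a minimal illustration with made-up one-dimensional data; single linkage (the distance between the closest points of two clusters) is used here as one common choice of linkage function, not necessarily the one adopted later in the paper.

```python
import numpy as np

# Five one-dimensional points: two tight pairs and one outlier.
X = np.array([[0.0], [0.1], [5.0], [5.2], [9.0]])

clusters = [[i] for i in range(len(X))]   # start: every point alone
merges = []                               # record of merge order

def single_linkage(a, b):
    # dissimilarity between clusters = distance of their closest points
    return min(abs(X[i, 0] - X[j, 0]) for i in a for j in b)

while len(clusters) > 1:
    # find the pair of clusters with the smallest linkage value
    pairs = [(single_linkage(clusters[i], clusters[j]), i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))]
    d, i, j = min(pairs)
    merges.append(sorted(clusters[i] + clusters[j]))
    clusters = ([c for k, c in enumerate(clusters) if k not in (i, j)]
                + [clusters[i] + clusters[j]])

print(merges)   # the tight pairs {0,1} and {2,3} are merged first
```

Reading the merge record from first to last recovers the dendrogram: the nearest pairs of points join at the bottom, and the final merge joins everything into one cluster at the top.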