Classification methodologies with high-dimensional data in heterogeneity of sample size and covariance matrices
Abstract
The development of measurement instruments over the last few decades has confronted us with data sets whose number of variables is considerably larger than the sample size; these are called “high-dimension, low-sample-size” (HDLSS) data. Microarray data, image data, and spectroscopic map data are typical examples of HDLSS data. Classical statistical methodologies were developed under the assumption that the sample size exceeds the dimension. These methodologies generally perform better under this condition than when the sample size is smaller than the dimension, as illustrated by the law of large numbers and the central limit theorem. As another example, the inverse of the sample covariance matrix is guaranteed to exist only when the sample size is larger than the dimension; when the dimension exceeds the sample size, the sample covariance matrix is singular. Given this background, it was anticipated that conventional statistics could give irrelevant answers for HDLSS data, and it has indeed been shown that some statistical methodologies give biased results for high-dimensional data. To validate statistical methodologies for HDLSS data, we need to advance our current understanding of multivariate analysis to the HDLSS setting.
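The singularity of the sample covariance matrix in the HDLSS setting can be checked numerically. The following is a minimal sketch (not from the paper; the sample sizes and dimensions are made up for illustration) showing that with n samples in p > n dimensions, the sample covariance matrix has rank at most n - 1 and is therefore not invertible.

```python
import numpy as np

# Illustration: n = 10 samples in p = 50 dimensions. The sample
# covariance matrix then has rank at most n - 1 = 9 < p, so it is
# singular and its inverse does not exist.
rng = np.random.default_rng(0)
n, p = 10, 50
X = rng.standard_normal((n, p))      # one row per sample

Xc = X - X.mean(axis=0)              # center the data
S = Xc.T @ Xc / (n - 1)              # p x p sample covariance matrix

rank = np.linalg.matrix_rank(S)
eigmin = np.linalg.eigvalsh(S)[0]    # smallest eigenvalue of S
print(rank, np.isclose(eigmin, 0.0)) # rank <= 9 and S has a zero eigenvalue
```

Any method that requires S to be inverted, such as classical Fisher discriminant analysis, therefore breaks down as soon as the dimension exceeds the sample size.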
In the field of machine learning, there are many studies on discriminant analysis, which is a
type of supervised learning. In the HDLSS context, Hall et al. [17], Chan and Hall [11], and
Aoshima and Yata [4] considered distance-based classifiers. Aoshima and Yata [7] developed a
distance-based classifier by transforming data. Ishii et al. [22] proposed a quadratic classifier using
a data transformation technique. Aoshima and Yata [3, 5] considered geometric classifiers based on
a geometric representation of HDLSS data. In addition, Aoshima and Yata [8] considered quadratic
classifiers in general and discussed their optimality under high-dimension and non-sparse settings.
One of the representative tools of binary linear discriminant analysis is the support vector
machine (SVM) developed by Vapnik [35]. Hall et al. [16], Chan and Hall [11], and Nakayama et
al. [28, 29] examined the asymptotic properties of the SVM in the HDLSS context. Nakayama et al.
[28, 29] showed that the SVM is strongly inconsistent under unbalanced sample sizes, proposed bias-corrected SVMs, and demonstrated their superiority over the SVM. In contrast, Marron et al. [27] pointed out that the SVM suffers from the data-piling problem in the HDLSS context. Data piling is the phenomenon in which the projections of the training data onto the normal direction vector of a separating hyperplane take a common value within each class. Marron et al. [27] proposed distance-weighted discrimination (DWD) to
overcome this data piling issue. Unfortunately, the DWD is designed for balanced training data
sets. For imbalanced training data sets, Qiao et al. [32] developed the weighted DWD (WDWD),
which imposes different weights on two classes. However, the asymptotic properties of DWD have
not been sufficiently studied in the HDLSS context.
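The geometric essence of data piling can be seen in a few lines of numpy. The sketch below (hypothetical data, not from the paper) shows that whenever p > n, a direction always exists onto which every training point of each class projects to a single common value; Marron et al. [27] observed that the SVM normal vector behaves like such a direction in the HDLSS context.

```python
import numpy as np

# With p >> n, the n linear constraints "x_i . v = y_i" on v in R^p
# are solvable exactly, so the training data pile at +1 and -1 when
# projected onto v. (Illustrative data; classes are balanced.)
rng = np.random.default_rng(1)
n, p = 12, 300
X = rng.standard_normal((n, p))      # training data, one row per sample
y = np.repeat([1.0, -1.0], n // 2)   # two balanced classes

# Least-norm solution of X v = y; exact because rank(X) = n < p.
v = np.linalg.pinv(X) @ y

proj = X @ v                         # projections onto direction v
print(np.round(proj, 6))             # piles exactly at +1 and -1
```

Such directions separate the training data perfectly but carry little information about the population, which is why data piling degrades generalization.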
To apply binary discriminant analysis to multiclass classification, a series of binary classification problems is solved using the one-versus-one (OVO) or one-versus-rest (OVR) strategy. Instead
of regarding multiclass classification as a series of binary classifications, Lee et al. [25] proposed
multiclass SVM (MSVM) that simultaneously finds classifier functions. Further, Huang et al. [20]
proposed multiclass DWD (MDWD) by generalizing binary DWD. MSVM is computationally cheaper than MDWD. Nakayama et al. [28, 29] discovered a bias term in the discriminant function of the binary SVM in high-dimensional and unbalanced settings. Because MSVM generalizes the binary SVM, its discriminant functions are expected to contain a similar bias term.
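The OVO and OVR aggregation rules mentioned above can be sketched as follows. This is a hypothetical illustration, not the paper's method: simple nearest-mean rules stand in for the binary discriminant functions, and the class means are made up.

```python
import numpy as np

# Three classes in R^2 with made-up class means.
means = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])

def ovr_predict(x, means):
    # One-versus-rest: class k's score plays the role of a binary
    # "class k versus the rest" discriminant; the largest score wins.
    scores = -np.sum((means - x) ** 2, axis=1)
    return int(np.argmax(scores))

def ovo_predict(x, means):
    # One-versus-one: each pair (j, k) gets a binary classifier whose
    # vote goes to the nearer class mean; most votes wins overall.
    K = len(means)
    votes = np.zeros(K, dtype=int)
    for j in range(K):
        for k in range(j + 1, K):
            dj = np.sum((x - means[j]) ** 2)
            dk = np.sum((x - means[k]) ** 2)
            votes[j if dj < dk else k] += 1
    return int(np.argmax(votes))

x_new = np.array([3.5, 0.5])
print(ovr_predict(x_new, means), ovo_predict(x_new, means))  # both give 1
```

OVR trains K binary classifiers while OVO trains K(K-1)/2, which is one reason simultaneous formulations such as MSVM and MDWD are attractive for many classes.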
Clustering is another type of classification, one that is unsupervised. The aim is to group a set of data without supervision such that the data within each group are similar. Cluster analysis is divided into two types:
partitional and hierarchical. Partitional clustering, as its name suggests, splits data into a predetermined number of clusters. For discussions of non-hierarchical cluster analysis, see Everitt et al.
[15] and Hastie et al. [18], among others. Hierarchical clustering is a methodology to group a set of data by building a dendrogram based on a similarity or dissimilarity measure between clusters, such that the data in a cluster are similar in the sense of a predetermined linkage function (given later). ...
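The bottom-up merging that builds a dendrogram can be sketched as follows. This is a minimal illustration with made-up one-dimensional data; single linkage (the distance between the closest points of two clusters) is used here as one common choice of linkage function, not necessarily the one adopted later in the paper.

```python
import numpy as np

# Five one-dimensional points: two tight pairs and one outlier.
X = np.array([[0.0], [0.1], [5.0], [5.2], [9.0]])

clusters = [[i] for i in range(len(X))]   # start: every point alone
merges = []                               # record of merge order

def single_linkage(a, b):
    # dissimilarity between clusters = distance of their closest points
    return min(abs(X[i, 0] - X[j, 0]) for i in a for j in b)

while len(clusters) > 1:
    # find the pair of clusters with the smallest linkage value
    pairs = [(single_linkage(clusters[i], clusters[j]), i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))]
    d, i, j = min(pairs)
    merges.append(sorted(clusters[i] + clusters[j]))
    clusters = ([c for k, c in enumerate(clusters) if k not in (i, j)]
                + [clusters[i] + clusters[j]])

print(merges)   # the tight pairs {0,1} and {2,3} are merged first
```

Reading the merge record from first to last recovers the dendrogram: the nearest pairs of points join at the bottom, and the final merge joins everything into one cluster at the top.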