Classification methodologies with high-dimensional data in heterogeneity of sample size and covariance matrices

Egashira, Kento. University of Tsukuba. DOI: 10.15068/0002008286

2023.09.13

Abstract

The development of measurement instruments over the last few decades has led us to encounter data whose number of variables is considerably larger than the sample size; these are called “high-dimension, low-sample size” (HDLSS) data. Microarray data, image data, and spectroscopic map data are typical examples. Statistical methodologies were developed under the assumption that the sample size is greater than the dimension. Generally, methods behave better under this condition than when the sample size is smaller than the dimension, as seen in the law of large numbers and the central limit theorem. As another example, the inverse of the sample covariance matrix is implicitly guaranteed to exist only when the sample size exceeds the dimension. Against this background, it was anticipated that conventional statistics could give irrelevant answers for HDLSS data, and it was indeed revealed that some statistical methodologies give biased results for high-dimensional data. To give validity to the statistical methodologies used for HDLSS data, we need to advance our current understanding of multivariate analysis for HDLSS data.
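To make the covariance-matrix point concrete, the following sketch (a minimal numerical illustration, not taken from the thesis) shows that when the dimension exceeds the sample size, the sample covariance matrix is rank-deficient and hence has no inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 100                     # HDLSS setting: dimension d >> sample size n
X = rng.standard_normal((n, d))    # n observations of a d-dimensional vector

S = np.cov(X, rowvar=False)        # d x d sample covariance matrix
print(S.shape)                     # (100, 100)
print(np.linalg.matrix_rank(S))    # at most n - 1 = 9, so S is singular
```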
In the field of machine learning, there are many studies on discriminant analysis, which is a type of supervised learning. In the HDLSS context, Hall et al. [17], Chan and Hall [11], and Aoshima and Yata [4] considered distance-based classifiers. Aoshima and Yata [7] developed a distance-based classifier by transforming the data. Ishii et al. [22] proposed a quadratic classifier using a data transformation technique. Aoshima and Yata [3, 5] considered geometric classifiers based on a geometric representation of HDLSS data. In addition, Aoshima and Yata [8] considered quadratic classifiers in general and discussed their optimality in high-dimensional, non-sparse settings.
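As a point of reference for the classifiers cited above, here is a minimal sketch of a naive distance-based (nearest-centroid) rule; the cited works refine this idea with, e.g., misclassification-rate adjustments and data transformations, which are omitted here:

```python
import numpy as np

def nearest_centroid(X_train, y_train, x_new):
    """Assign x_new to the class whose training centroid is closest
    in squared Euclidean distance (no bias adjustment)."""
    y_train = np.asarray(y_train)
    labels = np.unique(y_train)
    dists = [np.sum((x_new - X_train[y_train == c].mean(axis=0)) ** 2)
             for c in labels]
    return labels[int(np.argmin(dists))]
```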
One of the representative tools for binary linear discriminant analysis is the support vector machine (SVM) developed by Vapnik [35]. Hall et al. [16], Chan and Hall [11], and Nakayama et al. [28, 29] examined the asymptotic properties of the SVM in the HDLSS context. Nakayama et al. [28, 29] indicated the strong inconsistency of the SVM with unbalanced sample sizes, proposed bias-corrected SVMs, and showed their superiority to the SVM. In contrast, Marron et al. [27] pointed out that the SVM suffers from the data piling problem in the HDLSS context. Data piling is the phenomenon in which the projections of the training data onto the normal direction vector of a separating hyperplane coincide within each class. Marron et al. [27] proposed distance-weighted discrimination (DWD) to overcome this data piling issue. Unfortunately, the DWD is designed for balanced training data sets. For imbalanced training data sets, Qiao et al. [32] developed the weighted DWD (WDWD), which imposes different weights on the two classes. However, the asymptotic properties of the DWD have not been sufficiently studied in the HDLSS context.
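Data piling can be observed directly in a small simulation. The sketch below (an assumed toy setup using scikit-learn, not an experiment from the thesis) fits a nearly hard-margin linear SVM to HDLSS data and projects the training data onto the normal vector of the separating hyperplane; the within-class spread of the projections is negligible compared with the gap between the classes:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, d = 10, 500                               # HDLSS: d >> n per class
X = rng.standard_normal((2 * n, d))
X[n:] += 0.5                                 # shift the second class
y = np.array([-1] * n + [1] * n)

svm = SVC(kernel="linear", C=1e6).fit(X, y)  # large C approximates a hard margin
w = svm.coef_.ravel()
proj = X @ w / np.linalg.norm(w)             # projections onto the normal vector

print(proj[y == -1].std(), proj[y == 1].std())     # tiny within-class spread
print(proj[y == 1].mean() - proj[y == -1].mean())  # large between-class gap
```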
To handle multiclass classification with discriminant analysis, a series of binary classifications needs to be solved using the one-versus-one (OVO) or one-versus-rest (OVR) methodology. Instead of regarding multiclass classification as a series of binary classifications, Lee et al. [25] proposed the multiclass SVM (MSVM), which finds all classifier functions simultaneously. Further, Huang et al. [20] proposed the multiclass DWD (MDWD) by generalizing the binary DWD. The MSVM has a lower computational cost than the MDWD. Nakayama et al. [28, 29] discovered a bias term in the discriminant function of the binary SVM in high-dimensional and unbalanced settings. Because the MSVM is a generalization of the binary SVM, its discriminant functions are expected to have a bias term as well.
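For reference, this is how the OVO and OVR reductions look in standard scikit-learn usage (a minimal sketch; this is not the simultaneous MSVM of Lee et al. [25], and the synthetic data are an assumption for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

# Synthetic 3-class data with more features than samples per class.
X, y = make_classification(n_samples=60, n_features=200, n_informative=20,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)   # k(k-1)/2 binary SVMs
ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)  # k binary SVMs
print(ovo.predict(X[:5]), ovr.predict(X[:5]))
```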
Clustering is another classification methodology, one that is unsupervised. The aim is to group a set of data without a supervisor such that the data within a group are similar. Cluster analysis is divided into two types: partitional and hierarchical. Partitional clustering, as its name suggests, splits the data into a predetermined number of clusters. For discussions of non-hierarchical cluster analysis, see Everitt et al. [15] and Hastie et al. [18], among others. Hierarchical clustering is a methodology that groups a set of data by building a dendrogram based on a similarity or dissimilarity measure between clusters, such that the data in a cluster are similar in the sense of a pre-determined linkage function (given later). ...
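As a concrete instance of the hierarchical side (a minimal sketch under an assumed two-cluster setup), agglomerative clustering repeatedly merges the two closest clusters under a chosen linkage function and records the merges in a dendrogram:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((5, 50)),          # cluster 1
               rng.standard_normal((5, 50)) + 2.0])   # cluster 2, shifted mean

Z = linkage(X, method="ward")    # merge history defining the dendrogram (Ward [36])
labels = fcluster(Z, t=2, criterion="maxclust")       # cut into two clusters
print(labels)                    # e.g. [1 1 1 1 1 2 2 2 2 2]
```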

References

[1] Ahn, J., Lee, M.H., Yoon, Y.J. (2012). Clustering high dimension, low sample size data using the maximal data piling distance. Statistica Sinica, 22, 443–464.
[2] Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., Levine, A.J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences of the United States of America, 96, 6745–6750.
[3] Aoshima, M., Yata, K. (2011). Two-stage procedures for high-dimensional data. Sequential Analysis (Editor’s special invited paper), 30, 356–399.
[4] Aoshima, M., Yata, K. (2014). A distance-based, misclassification rate adjusted classifier for multiclass, high-dimensional data. Annals of the Institute of Statistical Mathematics, 66, 983–1010.
[5] Aoshima, M., Yata, K. (2015). Geometric classifier for multiclass, high-dimensional data. Sequential Analysis, 34, 279–294.
[6] Aoshima, M., Yata, K. (2018). Two-sample tests for high-dimension, strongly spiked eigenvalue models. Statistica Sinica, 28, 43–62.
[7] Aoshima, M., Yata, K. (2019a). Distance-based classifier by data transformation for high-dimension, strongly spiked eigenvalue models. Annals of the Institute of Statistical Mathematics, 71, 473–503.
[8] Aoshima, M., Yata, K. (2019b). High-dimensional quadratic classifiers in non-sparse settings. Methodology and Computing in Applied Probability, 21, 663–682.
[9] Bhattacharjee, A., Richards, W.G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E.J., Lander, E.S., Wong, W., Johnson, B.E., Golub, T.R., Sugarbaker, D.J., Meyerson, M. (2001). Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proceedings of the National Academy of Sciences of the United States of America, 98, 13790–13795.
[10] Borysov, P., Hannig, J., Marron, J.S. (2014). Asymptotics of hierarchical clustering for growing dimension. Journal of Multivariate Analysis, 124, 465–479.
[11] Chan, Y.-B., Hall, P. (2009). Scale adjustments for classifiers in high-dimensional, low sample size settings. Biometrika, 96, 469–478.
[12] Egashira, K., Yata, K., Aoshima, M. (2021). Asymptotic properties of distance weighted discrimination and its bias correction for high-dimension, low-sample-size data. Japanese Journal of Statistics and Data Science, 4, 821–840.
[13] Egashira, K. (2022). Asymptotic properties of multiclass support vector machine under high dimensional settings. Communications in Statistics - Simulation and Computation, DOI: 10.1080/03610918.2022.2066693.
[14] Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences of the United States of America, 95, 14863–14868.


[15] Everitt, B.S., Landau, S., Leese, M. (2001). Cluster Analysis. Arnold.
[16] Hall, P., Marron, J.S., Neeman, A. (2005). Geometric representation of high dimension, low sample size data. Journal of the Royal Statistical Society, Series B, 67, 427–444.
[17] Hall, P., Pittelkow, Y., Ghosh, M. (2008). Theoretical measures of relative performance of classifiers for high dimensional data with small sample sizes. Journal of the Royal Statistical Society, Series B, 70, 159–173.
[18] Hastie, T., Tibshirani, R., Friedman, J. (2001). The Elements of Statistical Learning. Springer.
[19] Hippo, Y., Taniguchi, H., Tsutsumi, S., Machida, N., Chong, J.M., Fukayama, M., Kodama, T., Aburatani, H. (2002). Global gene expression analysis of gastric cancer by oligonucleotide microarrays. Cancer Research, 62, 233–240.
[20] Huang, H., Liu, Y., Du, Y., Perou, C.M., Hayes, D.N., Todd, M.J., Marron, J.S. (2013). Multiclass distance-weighted discrimination. Journal of Computational and Graphical Statistics, 22, 953–969.
[21] Huang, H., Liu, Y., Yuan, M., Marron, J.S. (2015). Statistical significance of clustering using soft thresholding. Journal of Computational and Graphical Statistics, 24, 975–993.
[22] Ishii, A., Yata, K., Aoshima, M. (2022). Geometric classifiers for high-dimensional noisy data. Journal of Multivariate Analysis, 188, 104850.
[23] Khan, J., Wei, J.S., Ringner, M., Saal, L.H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C.R., Peterson, C., Meltzer, P.S. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 7, 673–679.
[24] Kimes, P.K., Liu, Y., Neil Hayes, D., Marron, J.S. (2017). Statistical significance for hierarchical clustering. Biometrics, 73, 811–821.
[25] Lee, Y., Lin, Y., Wahba, G. (2004). Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99, 67–82.
[26] Liu, Y., Hayes, D.N., Nobel, A., Marron, J.S. (2008). Statistical significance of clustering for high-dimension, low-sample size data. Journal of the American Statistical Association, 103, 1281–1293.
[27] Marron, J.S., Todd, M.J., Ahn, J. (2007). Distance-weighted discrimination. Journal of the American Statistical Association, 102, 1267–1271.
[28] Nakayama, Y., Yata, K., Aoshima, M. (2017). Support vector machine and its bias correction in high-dimension, low-sample-size settings. Journal of Statistical Planning and Inference, 191, 88–100.
[29] Nakayama, Y., Yata, K., Aoshima, M. (2020). Bias-corrected support vector machine with Gaussian kernel in high-dimension, low-sample-size settings. Annals of the Institute of Statistical Mathematics, 72, 1257–1286.
[30] Nakayama, Y., Yata, K., Aoshima, M. (2021). Clustering by principal component analysis with Gaussian kernel in high-dimension, low-sample-size settings. Journal of Multivariate Analysis, 185, 104779.


[31] Perou, C.M., Sørlie, T., Eisen, M.B., van de Rijn, M., Jeffrey, S.S., Rees, C.A., Pollack, J.R., Ross, D.T., Johnsen, H., Akslen, L.A., Fluge, O., Pergamenschikov, A., Williams, C., Zhu, S.X., Lønning, P.E., Børresen-Dale, A.L., Brown, P.O., Botstein, D. (2000). Molecular portraits of human breast tumours. Nature, 406, 747–752.
[32] Qiao, X., Zhang, H.H., Liu, Y., Todd, M.J., Marron, J.S. (2010). Weighted distance weighted discrimination and its asymptotic properties. Journal of the American Statistical Association, 105, 401–414.
[33] Qiao, X., Zhang, L. (2015). Flexible high-dimensional classification machines and their asymptotic properties. Journal of Machine Learning Research, 16, 1547–1572.
[34] Ross, D.T., Scherf, U., Eisen, M.B., Perou, C.M., Rees, C., Spellman, P., Iyer, V., Jeffrey, S.S., Van de Rijn, M., Waltham, M., Pergamenschikov, A., Lee, J.C., Lashkari, D., Shalon, D., Myers, T.G., Weinstein, J.N., Botstein, D., Brown, P.O. (2000). Systematic variation in gene expression patterns in human cancer cell lines. Nature Genetics, 24, 227–235.
[35] Vapnik, V.N. (2000). The Nature of Statistical Learning Theory (second ed.). Springer.
[36] Ward, J.H. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58, 236–244.
[37] Yata, K., Aoshima, M. (2020). Geometric consistency of principal component scores for high-dimensional mixture models and its application. Scandinavian Journal of Statistics, 47, 899–921.

