Support vector machine in high-dimension, low-sample-size settings
Abstract
With the development of modern science, it has become possible to observe large-scale data. One feature of such data is a high dimension combined with a low sample size; we call such data HDLSS. For HDLSS data, the divergence condition d/n → ∞ holds, where d is the data dimension and n is the sample size. HDLSS data arise in many areas of modern science, such as genetic microarrays, medical imaging, text recognition, finance, and chemometrics. Research on HDLSS data has been actively conducted in various fields, including multivariate analysis and machine learning. Many methods of multivariate analysis rely on large-sample theory, so some of them cannot be applied to high-dimensional data. On the other hand, machine learning methods can be used for both low-dimensional and high-dimensional data; however, their asymptotic properties seem not to have been sufficiently studied in the HDLSS context. In order to analyze HDLSS data, further developments are needed in both multivariate analysis and machine learning.
Aoshima and Yata (2011) is one of the pioneering works in high-dimensional data analysis; they gave a broad perspective on high-dimensional statistical inference, such as tests of the equality of two covariance matrices and classification, along with sample size determination to ensure prespecified accuracy for each inference. Regarding the classification problem, Aoshima and Yata (2014) gave a misclassification-rate-adjusted classifier for multiclass, high-dimensional data whose misclassification rates are no more than specified thresholds. Aoshima and Yata (2011, 2015b) gave geometric classifiers based on a geometric representation of HDLSS data. Ahn and Marron (2010) considered a classifier based on the maximal data piling direction. Aoshima and Yata (2019a) considered a distance-based classifier using data transformation based on the eigenstructure. Noting that non-sparse situations often occur in high-dimensional settings, Aoshima and Yata (2019b) considered a family of quadratic classifiers and discussed asymptotic properties and optimality of the classifiers under high-dimension, non-sparse settings.
In the field of machine learning, there are many studies on classification in the context of supervised learning. For example, the support vector machine (SVM) has been an efficient tool for classification and pattern recognition in many areas. Hall et al. (2005) and Qiao and Zhang (2015) investigated the versatility of the linear SVM (LSVM) for high-dimensional data. Hall et al. (2005), Chan and Hall (2009) and Qiao and Zhang (2015) investigated asymptotic properties of the LSVM in the HDLSS context and showed a consistency property in the sense that the misclassification rates of the LSVM tend to zero as d → ∞ under certain strict conditions. Chan and Hall (2009) gave scale-adjusted versions of the average-distance, nearest-neighbor and distance-based classifiers, including the LSVM. Huang (2017) investigated the SVM in the high-dimension, large-sample-size context in which d/n → c > 0. As far as we know, asymptotic properties of nonlinear SVMs have not been sufficiently studied in the HDLSS context.
In this thesis, we consider tests of covariance matrix structures and asymptotic properties of the SVM in the HDLSS framework. This thesis consists of four chapters.
In Chapter 1, we consider a test of sphericity for high-dimensional covariance matrices. This chapter is based on the findings of Yata et al. (2018). We construct a test statistic by using the extended cross-data-matrix (ECDM) methodology proposed by Yata and Aoshima (2013). We show that the ECDM test statistic is based on an unbiased estimator of a sphericity measure. In addition, the ECDM test statistic enjoys consistency properties and asymptotic normality in high-dimensional settings. We propose a new test procedure based on the ECDM test statistic and evaluate its asymptotic size and power both theoretically and numerically. We give a two-stage sampling scheme so that the test procedure can ensure prespecified levels for both the size and the power. We apply the test procedure to detecting divergently spiked noise in high-dimensional statistical analysis, and we analyze gene expression data with the proposed procedure.
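The ECDM estimator itself is beyond the scope of this abstract, but the idea of a sphericity measure can be illustrated with a standard population-level quantity, d·tr(Σ²)/(tr Σ)², which equals 1 if and only if Σ is proportional to the identity and grows when eigenvalues are spiked. This is a sketch for intuition only; the exact measure and its unbiased ECDM estimator follow Yata et al. (2018):

```python
import numpy as np

d = 100
# spherical covariance: Sigma = c * I_d
Sigma_sph = 2.0 * np.eye(d)
# spiked covariance: one divergently large eigenvalue
Sigma_spk = np.eye(d)
Sigma_spk[0, 0] = d

def sphericity(Sigma):
    """Population sphericity measure d * tr(Sigma^2) / tr(Sigma)^2.

    Equals 1 iff Sigma is proportional to the identity; it exceeds 1
    when eigenvalues are spiked, which is what a sphericity test detects.
    """
    dim = Sigma.shape[0]
    return dim * np.trace(Sigma @ Sigma) / np.trace(Sigma) ** 2

print(sphericity(Sigma_sph))  # exactly 1 for a spherical matrix
print(sphericity(Sigma_spk))  # well above 1 for the spiked matrix
```

A test of sphericity then amounts to deciding whether this measure exceeds 1, using an estimator whose distribution is tractable as d → ∞.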
In Chapter 2, we consider asymptotic properties of the hard-margin LSVM (hmLSVM) in HDLSS settings. This chapter is based on the findings of Nakayama et al. (2017). We show that the hmLSVM holds a consistency property, in which misclassification rates tend to zero as the dimension goes to infinity, under certain severe conditions. We show that the hmLSVM is heavily biased in HDLSS settings and that its performance is directly affected by the bias. In order to overcome these difficulties, we propose a bias-corrected LSVM (BC-LSVM). We show that the BC-LSVM gives preferable performance in HDLSS settings. We also discuss LSVMs in multiclass HDLSS settings. Finally, we check the performance of the classifiers in real data analyses.
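In the HDLSS limit the hmLSVM is known to behave like a centroid-based rule (Hall et al., 2005), so the sample-size bias can be illustrated numerically without an SVM solver. The numpy sketch below, with hypothetical parameter values, shows how unequal sample sizes push the naive rule toward the larger class, and how adding back a trace-based estimate of the bias, in the spirit of the BC-LSVM, restores performance; the exact correction follows Nakayama et al. (2017):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n1, n2 = 2000, 5, 50            # HDLSS: d >> n1, n2, imbalanced classes
mu = np.zeros(d)
mu[0] = 10.0                       # ||mu1 - mu2||^2 = 100 < d/n1 - d/n2 = 360

X1 = rng.standard_normal((n1, d)) + mu   # class 1 (small sample)
X2 = rng.standard_normal((n2, d))        # class 2 (large sample)
xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)

def naive_score(x):
    # centroid rule: positive -> class 1, negative -> class 2
    return np.sum((x - xbar2) ** 2) - np.sum((x - xbar1) ** 2)

# estimated sample-size bias: tr(S1)/n1 - tr(S2)/n2
bias_hat = (X1.var(axis=0, ddof=1).sum() / n1
            - X2.var(axis=0, ddof=1).sum() / n2)

def corrected_score(x):
    # add the bias back so neither class is favored by its sample size
    return naive_score(x) + bias_hat

test1 = rng.standard_normal((100, d)) + mu      # new class-1 observations
err_naive = np.mean([naive_score(x) <= 0 for x in test1])
err_bc = np.mean([corrected_score(x) <= 0 for x in test1])
print(err_naive, err_bc)   # the naive rule misclassifies almost all class-1 points
```

Because the mean difference (100) is smaller than the bias term (about 360), the naive rule sends essentially every new class-1 observation to the larger class, while the corrected rule classifies most of them correctly.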
In Chapter 3, we investigate the behavior of the soft-margin LSVM (smLSVM) with respect to the regularization parameter. This chapter is based on the findings of Nakayama (2019). We show that the smLSVM cannot handle imbalanced classification and is heavily biased in HDLSS settings. In order to overcome these difficulties, we propose a robust LSVM (RSVM). We show that the RSVM gives preferable performance in HDLSS settings.
In Chapter 4, we study asymptotic properties of nonlinear SVMs in HDLSS settings. This chapter is based on the findings of Nakayama et al. (2019). We propose a bias-corrected SVM (BC-SVM) that is robust against imbalanced data in a general framework. In particular, we investigate asymptotic properties of the BC-SVM with the Gaussian kernel and compare them with those of the linear kernel. We show that the performance of the BC-SVM is influenced by the scale parameter of the Gaussian kernel. We discuss a choice of the scale parameter that yields high performance and examine the validity of the choice through numerical simulations and real data analyses.
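The sensitivity to the scale parameter is visible already at the level of the kernel itself: for HDLSS data, ||x − y||² grows on the order of d, so a fixed scale degenerates the Gaussian kernel toward zero, while a scale that grows with d keeps it informative. The following is a minimal numpy sketch of this phenomenon only; the actual choice of scale parameter follows Nakayama et al. (2019):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5000
x = rng.standard_normal(d)
y = rng.standard_normal(d)

def gaussian_kernel(u, v, s2):
    # k(u, v) = exp(-||u - v||^2 / (2 * s2)), where s2 is the squared scale
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * s2))

# ||x - y||^2 is about 2d here, so a fixed scale drives the kernel to 0 ...
k_fixed = gaussian_kernel(x, y, 1.0)
# ... while a scale proportional to d keeps it bounded away from 0 and 1
k_scaled = gaussian_kernel(x, y, float(d))
print(k_fixed, k_scaled)   # vanishing versus roughly exp(-1)
```

When the kernel degenerates, all pairwise similarities become indistinguishable and the kernel Gram matrix carries no class information, which is why the scale parameter governs the classifier's performance in high dimensions.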