Support vector machine in high-dimension, low-sample-size settings
Abstract
With the development of modern science, it has become possible to observe large-scale data. One feature of such data is a high dimension combined with a low sample size; we call such data HDLSS. For HDLSS data, the divergence condition d/n → ∞ holds, where d is the data dimension and n is the sample size. HDLSS data arise in many areas of modern science, such as genetic microarrays, medical imaging, text recognition, finance, and chemometrics. Research on HDLSS data has been actively conducted in various fields, including multivariate analysis and machine learning. Many methods of multivariate analysis rely on large-sample theory, so some of them cannot be applied to high-dimensional data. On the other hand, machine learning methods can be used for both low-dimensional and high-dimensional data; however, their asymptotic properties seem not to have been sufficiently studied in the HDLSS context. In order to analyze HDLSS data, further developments are needed in both multivariate analysis and machine learning.
Aoshima and Yata (2011) is one of the pioneering works in high-dimensional data analysis; they gave a broad perspective on high-dimensional statistical inference, such as tests of the equality of two covariance matrices and classification, along with sample size determination to ensure prespecified accuracy for each inference. Regarding the classification problem, Aoshima and Yata (2014) gave a misclassification-rate-adjusted classifier for multiclass, high-dimensional data whose misclassification rates are no more than specified thresholds. Aoshima and Yata (2011, 2015b) gave geometric classifiers based on a geometric representation of HDLSS data. Ahn and Marron (2010) considered a classifier based on the maximal data piling direction. Aoshima and Yata (2019a) considered a distance-based classifier using data transformation based on the eigenstructure. Noting that non-sparse situations often occur in high-dimensional settings, Aoshima and Yata (2019b) considered a family of quadratic classifiers and discussed asymptotic properties and optimality of the classifiers under high-dimension, non-sparse settings.
In the field of machine learning, there are many studies on classification in the context of supervised learning. For example, the support vector machine (SVM) has been an efficient tool for classification and pattern recognition in many areas. Hall et al. (2005) and Qiao and Zhang (2015) investigated the versatility of the linear SVM (LSVM) for high-dimensional data. Hall et al. (2005), Chan and Hall (2009) and Qiao and Zhang (2015) investigated asymptotic properties of the LSVM in the HDLSS context and showed a consistency property in the sense that the misclassification rates of the LSVM tend to zero as d → ∞ under certain strict conditions. Chan and Hall (2009) gave scale-adjusted versions of the average-distance, nearest-neighbor and distance-based classifiers, including the LSVM. Huang (2017) investigated the SVM in the high-dimension, large-sample-size context in which d/n → c > 0. As far as we know, asymptotic properties of nonlinear SVMs have not been sufficiently studied in the HDLSS context.
In this thesis, we consider tests of covariance matrix structures and asymptotic properties of the SVM in the HDLSS framework. This thesis consists of four chapters.
In Chapter 1, we consider a test of sphericity for high-dimensional covariance matrices. This chapter is based on the findings of Yata et al. (2018). We construct a test statistic by using the extended cross-data-matrix (ECDM) methodology proposed by Yata and Aoshima (2013). We show that the ECDM test statistic is based on an unbiased estimator of a sphericity measure. In addition, the ECDM test statistic enjoys consistency properties and asymptotic normality in high-dimensional settings. We propose a new test procedure based on the ECDM test statistic and evaluate its asymptotic size and power both theoretically and numerically. We give a two-stage sampling scheme so that the test procedure can ensure prespecified levels for both the size and the power. We apply the test procedure to detecting divergently spiked noise in high-dimensional statistical analysis, and we analyze gene expression data with the proposed procedure.
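The ECDM estimator itself is beyond the scope of this abstract, but the idea of a sphericity measure can be illustrated with a standard population-level quantity, d·tr(Σ²)/(tr Σ)², which equals 1 if and only if Σ is proportional to the identity and grows when eigenvalues are spiked. This is a sketch for intuition only; the exact measure and its unbiased ECDM estimator follow Yata et al. (2018):

```python
import numpy as np

d = 100
# spherical covariance: Sigma = c * I_d
Sigma_sph = 2.0 * np.eye(d)
# spiked covariance: one divergently large eigenvalue
Sigma_spk = np.eye(d)
Sigma_spk[0, 0] = d

def sphericity(Sigma):
    """Population sphericity measure d * tr(Sigma^2) / tr(Sigma)^2.

    Equals 1 iff Sigma is proportional to the identity; it exceeds 1
    when eigenvalues are spiked, which is what a sphericity test detects.
    """
    dim = Sigma.shape[0]
    return dim * np.trace(Sigma @ Sigma) / np.trace(Sigma) ** 2

print(sphericity(Sigma_sph))  # exactly 1 for a spherical matrix
print(sphericity(Sigma_spk))  # well above 1 for the spiked matrix
```

A test of sphericity then amounts to deciding whether this measure exceeds 1, using an estimator whose distribution is tractable as d → ∞.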
In Chapter 2, we consider asymptotic properties of the hard-margin LSVM (hmLSVM) in HDLSS settings. This chapter is based on the findings of Nakayama et al. (2017). We show that the hmLSVM holds a consistency property, in which misclassification rates tend to zero as the dimension goes to infinity, under certain severe conditions. We show that the hmLSVM is heavily biased in HDLSS settings and that its performance is directly affected by the bias. In order to overcome these difficulties, we propose a bias-corrected LSVM (BC-LSVM). We show that the BC-LSVM gives preferable performance in HDLSS settings. We also discuss LSVMs in multiclass HDLSS settings. Finally, we check the performance of the classifiers in real data analyses.
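In the HDLSS limit the hmLSVM is known to behave like a centroid-based rule (Hall et al., 2005), so the sample-size bias can be illustrated numerically without an SVM solver. The numpy sketch below, with hypothetical parameter values, shows how unequal sample sizes push the naive rule toward the larger class, and how adding back a trace-based estimate of the bias, in the spirit of the BC-LSVM, restores performance; the exact correction follows Nakayama et al. (2017):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n1, n2 = 2000, 5, 50            # HDLSS: d >> n1, n2, imbalanced classes
mu = np.zeros(d)
mu[0] = 10.0                       # ||mu1 - mu2||^2 = 100 < d/n1 - d/n2 = 360

X1 = rng.standard_normal((n1, d)) + mu   # class 1 (small sample)
X2 = rng.standard_normal((n2, d))        # class 2 (large sample)
xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)

def naive_score(x):
    # centroid rule: positive -> class 1, negative -> class 2
    return np.sum((x - xbar2) ** 2) - np.sum((x - xbar1) ** 2)

# estimated sample-size bias: tr(S1)/n1 - tr(S2)/n2
bias_hat = (X1.var(axis=0, ddof=1).sum() / n1
            - X2.var(axis=0, ddof=1).sum() / n2)

def corrected_score(x):
    # add the bias back so neither class is favored by its sample size
    return naive_score(x) + bias_hat

test1 = rng.standard_normal((100, d)) + mu      # new class-1 observations
err_naive = np.mean([naive_score(x) <= 0 for x in test1])
err_bc = np.mean([corrected_score(x) <= 0 for x in test1])
print(err_naive, err_bc)   # the naive rule misclassifies almost all class-1 points
```

Because the mean difference (100) is smaller than the bias term (about 360), the naive rule sends essentially every new class-1 observation to the larger class, while the corrected rule classifies most of them correctly.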
In Chapter 3, we investigate the behavior of the soft-margin LSVM (smLSVM) with respect to the regularization parameter. This chapter is based on the findings of Nakayama (2019). We show that the smLSVM cannot handle imbalanced classification and is heavily biased in HDLSS settings. In order to overcome these difficulties, we propose a robust LSVM (RSVM). We show that the RSVM gives preferable performance in HDLSS settings.
In Chapter 4, we study asymptotic properties of nonlinear SVMs in HDLSS settings. This chapter is based on the findings of Nakayama et al. (2019). We propose a bias-corrected SVM (BC-SVM) that is robust against imbalanced data in a general framework. In particular, we investigate asymptotic properties of the BC-SVM with the Gaussian kernel and compare them with those of the linear kernel. We show that the performance of the BC-SVM is influenced by the scale parameter of the Gaussian kernel. We discuss a choice of the scale parameter that yields high performance and examine the validity of the choice through numerical simulations and real data analyses.
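The sensitivity to the scale parameter is visible already at the level of the kernel itself: for HDLSS data, ||x − y||² grows on the order of d, so a fixed scale degenerates the Gaussian kernel toward zero, while a scale that grows with d keeps it informative. The following is a minimal numpy sketch of this phenomenon only; the actual choice of scale parameter follows Nakayama et al. (2019):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5000
x = rng.standard_normal(d)
y = rng.standard_normal(d)

def gaussian_kernel(u, v, s2):
    # k(u, v) = exp(-||u - v||^2 / (2 * s2)), where s2 is the squared scale
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * s2))

# ||x - y||^2 is about 2d here, so a fixed scale drives the kernel to 0 ...
k_fixed = gaussian_kernel(x, y, 1.0)
# ... while a scale proportional to d keeps it bounded away from 0 and 1
k_scaled = gaussian_kernel(x, y, float(d))
print(k_fixed, k_scaled)   # vanishing versus roughly exp(-1)
```

When the kernel degenerates, all pairwise similarities become indistinguishable and the kernel Gram matrix carries no class information, which is why the scale parameter governs the classifier's performance in high dimensions.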