Comprehensive evaluation of preprocessing methods for visualizing single-cell RNA-seq count data
Abstract
Introduction
(Chapter 1) Single-cell RNA sequencing (scRNA-seq) measures the RNA expression of large numbers of individual cells at high resolution. The technology can profile up to one million cells at a time, but the sequencing depth per cell is shallow. This results in false zero counts for expressed genes, often referred to as dropout events. The data also show high overall noise levels, mainly due to the low amounts of input RNA. Therefore, an important step in analyzing scRNA-seq count data is preprocessing (including imputation and smoothing), and a number of preprocessing methods, such as DrImpute and DCA (Deep Count Autoencoder), have been developed.
A typical scRNA-seq count matrix contains tens of thousands of genes (dimensions) in rows and hundreds to thousands of cells in columns. After preprocessing, the dimensionality is reduced to capture the underlying structure in the data and to visualize it. Commonly used dimensionality reduction methods include principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE). Although these methods are not designed specifically for scRNA-seq data with dropouts, preprocessed data have most often been visualized using t-SNE. Visual representations of the data in two or three dimensions (2D or 3D) have led to the discovery of new cell types and/or subtypes. However, as new methods are proposed one after another, it has become difficult to determine which method is suitable for a given dataset.
This study mainly focuses on preprocessing methods for visualizing scRNA-seq data. I developed an autoencoder-based preprocessing method (called DAE) in this framework. The purpose of this study is to evaluate ten preprocessing methods combined with t-SNE, as well as six other methods. A comprehensive analysis with both simulated and real datasets provides sound recommendations and an optimal guideline for the dataset at hand.
Materials and methods (Chapter 2)
Simulated data were obtained using two R packages (Splatter and powsimR). A total of 17 conditions were evaluated: eight from Splatter and nine from powsimR. Each Splatter dataset contained 1,000 cells split into 2–5 groups (or cell types) and 1,000–5,000 genes, of which 5–20% were differentially expressed (DE) across the groups. Each powsimR dataset contained 200–2,000 cells split into three groups and 10,000–30,000 genes, of which 10% were DE. For real data, a total of 15 raw count datasets (five from humans and ten from mice) were obtained from the original papers. Each real dataset contained 56–3,005 cells split into 3–32 groups and 19,020–41,480 genes.
A total of 16 pipelines that take raw count data as input and output dimensionally reduced data were evaluated. They can be divided into three categories. The first category, consisting of two methods (PCA and t-SNE), performs only dimensionality reduction (i.e., visualization). The second category, consisting of four methods (ZIFA, PHATE, CIDR, and MAGIC), performs both preprocessing and visualization in a single step. The third category, consisting of ten preprocessing methods (DCA, SAVER, scImpute, autoImpute, SAVER-X, DrImpute, LSimpute, kNN-smoothing, scRMD, and DAE), does not include a means of visualization by itself. Accordingly, t-SNE was subsequently applied to the outputs of these preprocessing methods to obtain the final 2D representation.
Similar to DCA, our DAE employs an autoencoder framework that learns a compact representation of high-dimensional data. It consists of four kinds of layers: an input layer, a dropout layer, hidden layers, and an output layer (Figure 1). The number of neurons/nodes (N) in both the input and output layers equals the number of dimensions/genes in the count matrix. The dropout layer converts entries in the matrix to zero at an arbitrary rate; the default dropout rate was set to 0.5. The hidden layers are further divided into two kinds of networks, an encoder and a decoder. The encoder networks are designed as fully connected. By adjusting the weights of the neural networks, the autoencoder learns in an unsupervised manner how to compress the data efficiently; this compression corresponds to the dimensionality reduction. The latent (or bottleneck) layer sits in the middle of the hidden layers. Since its size can be changed arbitrarily, five different sizes (10, 20, 30, 40, and 50 neurons/nodes) were investigated. The information in this layer corresponds to the low-dimensional representation of the original high-dimensional data and is the primary output of DAE. Although the decoder networks (to the right of the latent layer) reconstruct denoised count data, that output is not used in this study.
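The data flow through the layers described above can be illustrated with a minimal numpy sketch. The dimensions, the random weights, and the single-layer encoder/decoder are illustrative assumptions (untrained, not the actual DAE architecture or its learned parameters); the sketch only shows how a count matrix passes through the dropout corruption, the bottleneck of 30 nodes, and the reconstruction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy dimensions (not the study's data): 1,000 genes, 100 cells.
n_genes, n_cells, latent_dim = 1000, 100, 30
counts = rng.poisson(2.0, size=(n_cells, n_genes)).astype(float)

# Dropout layer: zero out entries at the default rate of 0.5, mimicking the
# additional corruption the autoencoder must learn to denoise.
dropout_rate = 0.5
mask = rng.random(counts.shape) >= dropout_rate
corrupted = counts * mask

# One fully connected encoder layer and one decoder layer with randomly
# initialized weights (a structural sketch only; training would adjust these).
W_enc = rng.normal(scale=0.01, size=(n_genes, latent_dim))
W_dec = rng.normal(scale=0.01, size=(latent_dim, n_genes))

latent = np.maximum(corrupted @ W_enc, 0.0)  # bottleneck: the primary output
reconstruction = latent @ W_dec              # decoder output (unused here)

print(latent.shape)          # (100, 30): low-dimensional representation
print(reconstruction.shape)  # (100, 1000): back to gene space
```

In the full method, the 30-dimensional `latent` matrix is what would be passed on to t-SNE for the final 2D visualization.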
The outputs of the competing methods were evaluated using k-means clustering, which measures how well the low-dimensional space allows a simple method (i.e., k-means) to recover the true groups (or cell types). To evaluate a best-case scenario, the number of clusters (k) was set to the number of known groups. Methods were assessed by the adjusted Rand index (ARI) between predicted and true group labels, as well as by normalized mutual information (NMI) and homogeneity (HOMO). For all criteria, values closer to 1 indicate a better method.
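The ARI criterion can be computed from the contingency table of the two labelings. The sketch below implements the standard permutation-model formula (the label sets and helper names are illustrative; libraries such as scikit-learn provide an equivalent `adjusted_rand_score`).

```python
import numpy as np
from math import comb

def adjusted_rand_index(true_labels, pred_labels):
    """Standard ARI: chance-corrected agreement between two labelings."""
    true_ids = {lab: i for i, lab in enumerate(set(true_labels))}
    pred_ids = {lab: i for i, lab in enumerate(set(pred_labels))}
    n = len(true_labels)
    # Contingency table: rows are true groups, columns are predicted clusters.
    table = np.zeros((len(true_ids), len(pred_ids)), dtype=int)
    for t, p in zip(true_labels, pred_labels):
        table[true_ids[t], pred_ids[p]] += 1
    sum_comb = sum(comb(int(x), 2) for x in table.ravel())
    sum_a = sum(comb(int(x), 2) for x in table.sum(axis=1))
    sum_b = sum(comb(int(x), 2) for x in table.sum(axis=0))
    expected = sum_a * sum_b / comb(n, 2)   # expected index under chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:               # degenerate case: identical trivial labelings
        return 1.0
    return (sum_comb - expected) / (max_index - expected)

true = [0, 0, 1, 1, 2, 2]
print(adjusted_rand_index(true, true))                            # 1.0: perfect recovery
print(adjusted_rand_index(true, ["a", "a", "b", "b", "c", "c"]))  # 1.0: names differ, partition identical
print(adjusted_rand_index(true, [0, 0, 0, 0, 0, 0]))              # 0.0: one big cluster, chance level
```

Note that ARI is invariant to cluster renaming, which is why k-means output can be compared directly against the known group labels.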
Results and discussion (Chapter 3)
In general, the methods are substantially sensitive to hyperparameters. In the case of our DAE model, the key hyperparameter is the latent layer size. I first evaluated the five candidate sizes on the Splatter simulations and determined that 30 neurons performed best. The rest of the analysis was performed using 30 neurons for our model and default settings for the other methods.
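The latent-size selection above amounts to a small grid search over the five candidates, keeping the size with the highest ARI. The scores below are illustrative placeholders, not the study's measured values; only the selection logic is the point.

```python
# Candidate latent layer sizes from the study; ARI scores are hypothetical
# placeholders standing in for the Splatter-simulation evaluation results.
candidate_sizes = [10, 20, 30, 40, 50]
ari_by_size = {10: 0.71, 20: 0.78, 30: 0.85, 40: 0.83, 50: 0.80}

# Pick the size whose (simulated) ARI is highest.
best_size = max(candidate_sizes, key=lambda s: ari_by_size[s])
print(best_size)  # 30
```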
Table 1 shows average ARI values obtained from the individual analyses. For the simulated data generated by Splatter, three methods (kNN-smoothing, SAVER, and SAVER-X) showed nearly perfect ARI values on average. The same trend was observed for the other metrics (NMI and HOMO). For the simulated data generated by powsimR, PHATE performed best overall. When the nine conditions were divided into three groups according to the number of cells (200, 1,000, and 2,000), PHATE performed well with many cells (2,000), whereas CIDR, the second best on average, performed well with few cells (200). The nine conditions can also be divided according to the number of genes (10,000, 20,000, and 30,000). Interestingly, the best-performing methods clearly differed: DrImpute and scImpute at 30,000 genes, PHATE at 20,000 genes, and kNN-smoothing at 10,000 genes. These results suggest that the high performance of kNN-smoothing on the Splatter simulations was simply because the small numbers of genes used there (1,000 and 5,000) suited the method.
Many previous studies have evaluated performance by simulation using Splatter with relatively small numbers of genes (≤ 5,000). However, the 15 real datasets contain many more genes (> 19,000, with 26,084 on average). For the real datasets, DrImpute performed best, followed by PHATE. This result is consistent with the powsimR results across different numbers of genes. Taken together, it is better to choose the method according to the number of genes in the data at hand. Although our DAE model outperformed DCA, its main competitor that also employs an autoencoder, it did not show outstanding performance. The unexciting results of DAE are probably due to parameter tuning that used only a Splatter simulation with a small number of genes; improvement would therefore be expected from parameter tuning with a large number of genes.
Conclusion (Chapter 4)
The main finding of this study is that preprocessing performance for visualizing scRNA-seq count data varies greatly depending on the number of genes. PHATE or DrImpute is recommended in practice. Although this conclusion was based on an investigation of 16 pipelines with both simulated and real datasets, further study is needed, especially on the numbers of genes and groups as simulation parameters.