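リケラボ論文検索 (Rikelab paper search) is a search service that covers, in a single query, the dissertations and faculty papers held in university repositories across Japan.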


Development of RNA informatics for RNA sequence and structure analysis (full text)

Akiyama, Manato (秋山 真那斗), Keio University

2022.03.23

Abstract

Non-coding RNAs (ncRNAs), which are not translated into proteins, were formerly considered junk regions. In recent years, however, they have been shown to have a variety of functions, ranging from roles in development and cell differentiation to causing disease. Elucidating ncRNA structural information is an indispensable step toward understanding ncRNA function through RNA informatics, the information science of RNA molecules. However, existing methods for obtaining structural information about RNAs are not accurate, and the development of better methods is an active field of study. In this dissertation, I set out to develop more accurate methods for two use cases: RNA secondary structure prediction and RNA sequence embedding. The background needed to explain these methods is given in Chapter 1.

The first method in this dissertation is a highly accurate RNA secondary structure prediction algorithm. Because the functions of ncRNAs are believed to be closely related to their structures, their biological functions can be inferred from those structures. A popular approach to predicting RNA secondary structure is the thermodynamic nearest-neighbor model, which finds the thermodynamically most stable secondary structure, that is, the one with minimum free energy (MFE). An alternative, machine learning-based approach can employ a fine-grained model with much richer feature representations, obtained by modeling more detailed substructures of the RNA secondary structure. Although the machine learning-based fine-grained model achieved extremely high prediction accuracy, a risk of overfitting has been reported. In Chapter 2 of this dissertation, I propose a novel algorithm for RNA secondary structure prediction that integrates the thermodynamic approach with the machine learning-based weighted approach. In my benchmark, the algorithm achieved the best prediction accuracy among the methods compared and resolved the heavy overfitting.
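As a rough illustration of how a thermodynamic term and a learned term can be combined in a single folding score, the sketch below runs a Nussinov-style dynamic program in which each base pair scores the sum of two tables. It is not the algorithm proposed in Chapter 2: the nearest-neighbor model scores loops and stacks rather than isolated pairs, and every name and number here (`PAIR_ENERGY`, `PAIR_WEIGHT`, `MIN_LOOP`, `fold`) is a hypothetical placeholder. Higher scores mean more stable, which is the sign-flipped analogue of minimizing free energy.

```python
# Toy Nussinov-style folding with a combined score (illustrative only).
# PAIR_ENERGY and PAIR_WEIGHT hold made-up values, not published parameters.

PAIR_ENERGY = {("G", "C"): 3.0, ("C", "G"): 3.0,
               ("A", "U"): 2.0, ("U", "A"): 2.0,
               ("G", "U"): 1.0, ("U", "G"): 1.0}  # stand-in for a thermodynamic term
PAIR_WEIGHT = {pair: 0.0 for pair in PAIR_ENERGY}  # stand-in for a learned correction

MIN_LOOP = 3  # minimum number of unpaired bases enclosed by a pair


def pair_score(a, b):
    """Combined score of pairing bases a and b, or None if they cannot pair."""
    if (a, b) not in PAIR_ENERGY:
        return None
    return PAIR_ENERGY[(a, b)] + PAIR_WEIGHT[(a, b)]


def fold(seq):
    """Return (best_score, dot_bracket) for seq under the toy score."""
    n = len(seq)
    best = [[0.0] * n for _ in range(n)]
    # Fill the table for increasingly long subsequences seq[i..j].
    for span in range(MIN_LOOP + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i + 1][j]  # case 1: base i stays unpaired
            for k in range(i + MIN_LOOP + 1, j + 1):  # case 2: i pairs with k
                s = pair_score(seq[i], seq[k])
                if s is None:
                    continue
                inner = best[i + 1][k - 1]
                right = best[k + 1][j] if k < j else 0.0
                score = max(score, s + inner + right)
            best[i][j] = score

    # Trace back one optimal structure by replaying the same two cases.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= MIN_LOOP:
            continue
        if best[i][j] == best[i + 1][j]:
            stack.append((i + 1, j))
            continue
        for k in range(i + MIN_LOOP + 1, j + 1):
            s = pair_score(seq[i], seq[k])
            if s is None:
                continue
            inner = best[i + 1][k - 1]
            right = best[k + 1][j] if k < j else 0.0
            if best[i][j] == s + inner + right:
                structure[i], structure[k] = "(", ")"
                stack.append((i + 1, k - 1))
                if k < j:
                    stack.append((k + 1, j))
                break
    return (best[0][n - 1] if n else 0.0), "".join(structure)


if __name__ == "__main__":
    print(fold("GGGAAACCC"))
```

In this toy setting, changing the entries of `PAIR_WEIGHT` shifts which structure wins without touching the thermodynamic table, which is the general idea behind adding a trained correction on top of a fixed energy model.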

 "Embedding" is a popular technique that vectorizes DNA sequences and amino acid sequences, and is known to be useful for detecting DNA sequence motifs and predicting protein functions but embedding for RNA sequences has not been developed so far. In Chapter 3 of the dissertation, I showcase the development of a pre-training algorithm with the aim of acquiring an embedded vector of an RNA sequence that contains abundant structural information and sequence context information. Finally, to verify the quality of embedding, I performed two basic RNA informatics tasks (structural alignment and gene clustering), and in the process, achieved greater accuracy than existing state-of- the-art methods.

To conclude, I have obtained effective methods for analyzing ncRNAs through two approaches: RNA secondary structure prediction and RNA sequence vectorization. Each approach can be applied to analyses in all fields of RNA informatics, including RNA-protein interaction and RNA-RNA interaction, and can be expected to have a large spillover effect. Chapter 4 describes the conclusions of this dissertation and its ripple effects in detail.
