
Prediction of protein-DNA interactions using deep learning method

崔, 菲菲 東京大学 DOI:10.15083/0002004921

2022.06.22

Abstract

Introduction (Chapter 1)
Protein-DNA interactions are fundamental to almost all biological processes, and they play an especially crucial role in gene expression. DNA-binding proteins (DBPs) can regulate and affect the processes of transcription, replication, denaturation and annealing of DNA, detection of DNA damage, and organization and condensation of the chromosome. Because of their great importance, DBPs are the primary focus of this study.

While both computational and experimental techniques have been developed to identify DBPs, using experimental techniques alone is both time-consuming and expensive. Therefore, using computational methods to predict DBPs is useful for annotating proteins and guiding experimental methods.

Machine-learning methods are the main computational methods for DBP prediction. Most machine-learning methods require numerical inputs in a fixed format. Machine-learning algorithms often take extracted features that represent the data as input, such as the position-specific scoring matrix and the physicochemical properties of amino acids, including hydrophobicity, polarizability, volume, helix probability, isoelectric point, and so on. For deep-learning algorithms, there are two main types of input: one is the extracted features described above, and the other is the encoded primary sequence, from which a deep neural network can automatically extract feature information. Because feature extraction requires many programs to obtain effective features, it can be complex and time-consuming, and it is usually suitable for small-scale rather than large-scale data. Therefore, with the large amount of biological data available today, the latter approach (extracting feature information from the encoded primary sequence using a deep neural network) is necessary.

This report describes the development of prediction systems for DBPs and for DBP classification using deep neural networks. The data representations suitable for these predictions were also investigated.

Prediction of DNA-binding proteins (Chapter 2)
A dataset of DBPs was constructed from UniProtKB. This dataset contained 8,414 chains of positive data (DBPs) and 8,414 chains of negative data (non-DBPs); 80% and 20% of the data, respectively, were randomly selected for training and testing the DBP prediction model.
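
As a minimal sketch of this split (not the thesis code), assuming the positive and negative chains have been loaded into Python lists by a hypothetical `load_chains` helper:

```python
# Minimal sketch, not the thesis code: 80/20 random split of the balanced
# DBP dataset. `load_chains` is a hypothetical loader returning two lists of
# amino-acid sequence strings (8,414 DBPs and 8,414 non-DBPs).
from sklearn.model_selection import train_test_split

dbp_seqs, non_dbp_seqs = load_chains()
sequences = dbp_seqs + non_dbp_seqs
labels = [1] * len(dbp_seqs) + [0] * len(non_dbp_seqs)

# 80% for training, 20% for testing, keeping the positive/negative balance.
train_seqs, test_seqs, y_train, y_test = train_test_split(
    sequences, labels, test_size=0.2, random_state=0, stratify=labels)
```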

Data representation. Two one-hot encoding representations were proposed, based on the overlapping 2-gram and overlapping 3-gram methods. The overlapping k-gram method is described in Figure 1. For one-hot encoding based on k-grams, an amino acid sequence is converted to numerical data in four steps: (1) assigning an integer to each unique k-gram formed from the 20 amino acids (20^k k-grams in total), (2) representing the amino acid sequence with k-grams (Figure 1), (3) encoding the k-gram-represented sequence into an integer sequence, and (4) encoding each integer in the integer sequence into a vector e = vW, where v is the |V|-dimensional one-hot vector of each integer in the integer sequence, W ∈ ℝ^(|V|×d) is a weight matrix that can be updated by fitting a deep neural network, |V| is the number of unique k-grams (= 20^k), and d is the dimension of the encoded vector e. Because each integer in the integer sequence is converted into a d-dimensional vector e, a protein sequence of length l is finally converted into numerical data with the shape (l−k+1, d). The process of one-hot encoding based on 3-grams is the same as that for 2-grams, except that the 20 amino acids form 8,000 3-grams. Because this dimension is too large for effective learning, the amino acids were classified into seven groups to form 343 3-grams. In this study, d was set to 100, which means each k-gram in a sequence was converted into a 100-dimensional vector.
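
The following is a minimal sketch of the four encoding steps for k = 2 and d = 100, assuming a Keras/TensorFlow embedding layer as the trainable weight matrix W; the framework, variable names, and toy sequence are assumptions for illustration, not the thesis code.

```python
# Minimal sketch (assumptions, not the thesis code): overlapping k-gram
# encoding of an amino-acid sequence into an integer sequence, followed by
# a trainable d-dimensional embedding (d = 100), as described above.
import itertools
import tensorflow as tf

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
K, D = 2, 100

# Step 1: an integer id for every unique k-gram (20^k of them for k = 2).
kgram_to_int = {"".join(kg): i + 1                 # 0 is reserved for padding
                for i, kg in enumerate(itertools.product(AMINO_ACIDS, repeat=K))}

def encode(seq):
    # Steps 2-3: slide a window of size k over the sequence (l - k + 1 k-grams)
    # and map each overlapping k-gram to its integer id.
    return [kgram_to_int[seq[i:i + K]] for i in range(len(seq) - K + 1)]

# Step 4: the embedding layer holds the weight matrix W of shape (|V|, d);
# each integer becomes a d-dimensional vector, so a sequence of length l
# becomes an array of shape (l - k + 1, d).
embedding = tf.keras.layers.Embedding(input_dim=len(kgram_to_int) + 1, output_dim=D)
ids = tf.constant([encode("MKVLAAGIVLA")])          # toy sequence
vectors = embedding(ids)
print(vectors.shape)                                # (1, l - k + 1, d)
```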

Deep neural networks. Convolutional neural networks (CNN) and a combination of a CNN and a long short-term memory (LSTM) network (denoted CNN-LSTM) were used with the one-hot encoding methods. Table 1 shows the DBP prediction performance of the different one-hot encoding methods with the different deep neural networks. Comparing the CNN models with the CNN-LSTM models under the same one-hot encoding method shows that the CNN-LSTM model performs much better than the CNN model. For one-hot encoding, deeper neural layers are suitable, and adding an LSTM further provides useful features automatically, as it can learn long-term dependencies between motifs.
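
The exact architecture and hyperparameters are not given in this summary, so the following Keras sketch is only a hypothetical illustration of the CNN-LSTM design (2-gram embedding, 1D convolution as a motif detector, LSTM over the feature maps, sigmoid output); all layer sizes are arbitrary.

```python
# Hypothetical CNN-LSTM sketch (layer sizes are illustrative, not the thesis
# hyperparameters): embedding of 2-gram ids, 1D convolution, LSTM for
# long-term dependencies between motifs, sigmoid output P(DBP).
import tensorflow as tf

VOCAB_SIZE = 20**2 + 1   # 400 overlapping 2-grams plus a padding id
MAX_LEN = 1000           # assumed maximum number of 2-grams per sequence
EMBED_DIM = 100

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(MAX_LEN,)),
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    tf.keras.layers.Conv1D(64, kernel_size=7, activation="relu"),
    tf.keras.layers.MaxPooling1D(pool_size=2),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```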

Classification prediction for DNA-binding proteins (Chapter 3)
DBPs were classified into five categories, “transcription regulatory region sequence-specific DNA binding,” “sequence-specific single-stranded DNA binding,” “chromatin DNA binding,” “damaged DNA binding,” and “others,” which were labeled Class 1, Class 2, Class 3, Class 4, and Other Class, respectively. For these five classes, 3,690, 103, 408, 2,124, and 6,830 chains, respectively, were collected from UniProtKB as positive data using their Gene Ontology annotations. For the negative dataset, 14,760, 412, 1,532, 8,496, and 6,830 chains, respectively, were randomly selected from the negative data for the five classes. For each category, 80% and 20% of the data in each dataset (positive and negative data of each category) were randomly selected for training and testing. The data not used for training were used as prediction data and consisted of 738, 18, 88, 408, and 1,368 chains for the five categories and 1,252 chains of negative data.

Two sequence-based prediction methods, the deep learning method and the homology-based method, were developed to predict the classification of the DBPs.

Deep learning method for classification. Classification was performed in two steps: (1) predicting whether the input sequence belongs to each category and (2) determining the category to which the input sequence belongs. For the first step, a predictor was created for each of the five categories using the overlapping 2-gram-based one-hot encoding method with the CNN-LSTM model, and these five predictors were trained using the data described above. The output of each predictor is the probability that the input sequence belongs to that category. For the second step, a predictor was created to predict the classification from the results of the above five predictors. This predictor predicts the category with the highest probability as the final classification result. When no category has a probability > 0.5, the input sequence is predicted to be an “unclassifiable protein.”
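
A minimal sketch of this two-step decision rule, assuming the five trained per-category models are held in a list named `predictors` (a hypothetical name) and that the input sequence has already been encoded as in Chapter 2:

```python
# Minimal sketch of the two-step decision rule described above. `predictors`
# is assumed to hold the five trained per-category CNN-LSTM models, each
# returning the probability that the sequence belongs to its category.
import numpy as np

CATEGORIES = ["Class 1", "Class 2", "Class 3", "Class 4", "Other Class"]

def classify(encoded_seq, predictors, threshold=0.5):
    # Step 1: probability that the sequence belongs to each category.
    probs = np.array([p.predict(encoded_seq)[0][0] for p in predictors])
    # Step 2: take the most probable category; if no probability exceeds the
    # threshold, report the sequence as an "unclassifiable protein".
    if probs.max() <= threshold:
        return "Unclassifiable"
    return CATEGORIES[int(probs.argmax())]
```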

Homology-based method for classification. The homology-based method predicts the classification of DBPs using sequence identity. The positive data of Class 1, Class 2, Class 3, Class 4, and Other Class and the negative data (labeled as “Unclassifiable”) described above were used as the prediction dataset. The positive dataset for training contained 2,952, 85, 320, 1,716, and 5,462 chains for the respective categories. The homology-based classification method consisted of three steps. (1) The BLASTP program was used to compare an input (query) sequence with each of the five datasets to calculate the similarity between the input sequence and the sequences of each dataset. (2) To calculate the similarity between the input sequence and each of the five category datasets, the mean value of the bit scores (mean bit score) between the input sequence and the resulting sequences with an E-value less than 0.01 was computed for each dataset. (3) The predictor predicted the category with the highest mean bit score as the final classification result. When no hits were found in any of the five datasets, the input sequence was predicted to be an “unclassifiable protein.”
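
A minimal sketch of these three steps is shown below; it assumes per-category BLAST+ databases have already been built (e.g., with makeblastdb) under hypothetical names, and it is not the thesis pipeline itself.

```python
# Minimal sketch of the homology-based rule. Database names and the query
# FASTA path are assumptions; blastp with -outfmt 6 is a standard NCBI
# BLAST+ invocation, but the rest is illustrative only.
import subprocess

CATEGORY_DBS = {                       # hypothetical per-category BLAST databases
    "Class 1": "class1_db", "Class 2": "class2_db", "Class 3": "class3_db",
    "Class 4": "class4_db", "Other Class": "other_db",
}

def mean_bit_score(query_fasta, db):
    # Tabular output (-outfmt 6): the 12th column is the bit score.
    # -evalue 0.01 keeps only hits at or below the 0.01 cutoff, approximating
    # the "E-value less than 0.01" filter described above.
    out = subprocess.run(
        ["blastp", "-query", query_fasta, "-db", db,
         "-evalue", "0.01", "-outfmt", "6"],
        capture_output=True, text=True, check=True).stdout
    scores = [float(line.split("\t")[11]) for line in out.splitlines() if line]
    return sum(scores) / len(scores) if scores else None

def classify_by_homology(query_fasta):
    scores = {c: mean_bit_score(query_fasta, db) for c, db in CATEGORY_DBS.items()}
    scores = {c: s for c, s in scores.items() if s is not None}
    # Step 3: highest mean bit score wins; no hits anywhere -> unclassifiable.
    return max(scores, key=scores.get) if scores else "Unclassifiable"
```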

The confusion matrices of the homology-based method and the deep learning method for classification are shown in Figures 3 and 4. The two confusion matrices indicate that the prediction trends of the two methods are similar. For both, the data of Class 1, Class 4, and Other Class were classified well, and most negative data were identified as unclassifiable (labeled as “Unclassifiable”). Notably, for the homology-based method, the values in the last column of the confusion matrix (except the last value) indicate that some query sequences of the five categories were incorrectly classified as “Unclassifiable” because these query sequences did not generate hits when BLASTP was used to search the DBP database. This suggests that in many cases the homology-based method can be used to classify DBPs, but it cannot be used when no homologue is found. The deep learning method, however, can be used in any case, especially for unknown or newly found sequences.

The integrated prediction system. In this study, a newly developed prediction system is described that combines the DBP predictor and the classification predictor. An input sequence is first evaluated by the trained DBP predictor to determine whether it is a DBP. If not, the input sequence is predicted to be a non-DBP. If positively predicted, the sequence is then passed to the classification predictor. Evaluation metrics were calculated as shown in Table 2 to assess the performance of the prediction system.
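
A minimal sketch of this two-stage pipeline, reusing the hypothetical `classify` helper from the classification sketch and a trained DBP predictor `dbp_model` (assumed names):

```python
# Minimal sketch of the integrated two-stage system. `dbp_model` is the
# trained DBP predictor and `classify` is the hypothetical helper from the
# classification sketch above; both names are assumptions.
def predict_pipeline(encoded_seq, dbp_model, predictors, threshold=0.5):
    # Stage 1: filter out sequences predicted not to be DNA-binding proteins.
    if dbp_model.predict(encoded_seq)[0][0] <= threshold:
        return "non-DBP"
    # Stage 2: only predicted DBPs are passed to the classification predictor.
    return classify(encoded_seq, predictors, threshold)
```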

The prediction system that combines DBP prediction and classification prediction performs better than the classification prediction model alone. The reason is that the prediction system first filters out some non-DNA-binding proteins through the DNA-binding protein prediction model, thereby reducing interference from the unclassifiable data.

Conclusion (Chapter 4)
This study describes a prediction system using deep neural networks that not only predicts DBPs but also predicts the classification of DBPs with high performance. This prediction system is based on the investigation of data representation methods and deep neural networks for DBP prediction.

