INTRODUCTION

Downloaded from rnajournal.cshlp.org on March

Gene2vec: gene subsequence embedding for prediction 6 of mammalian N -methyladenosine sites from mRNA

QUAN ZOU

0 2

PENGWEI XING

LEYI WEI

BIN LIU

1 0 Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China , 610051 Chengdu , China 1 School of Computer Science and Technology, Harbin Institute of Technology , 150001 Shenzhen , China 2 School of Computer Science and Technology, Tianjin University , 300350 Tianjin , China

2024

9 2024 205 218

N6-Methyladenosine (m6A) refers to methylation modification of the adenosine nucleotide acid at the nitrogen-6 position. Many conventional computational methods for identifying N6-methyladenosine sites are limited by the small amount of data available. Taking advantage of the thousands of m6A sites detected by high-throughput sequencing, it is now possible to discover the characteristics of m6A sequences using deep learning techniques. To the best of our knowledge, our work is the first attempt to use word embedding and deep neural networks for m6A prediction from mRNA sequences. Using four deep neural networks, we developed a model inferred from a larger sequence shifting window that can predict m6A accurately and robustly. Four prediction schemes were built with various RNA sequence representations and optimized convolutional neural networks. The soft voting results from the four deep networks were shown to outperform all of the stateof-the-art methods. We evaluated these predictors mentioned above on a rigorous independent test data set and proved that our proposed method outperforms the state-of-the-art predictors. The training, independent, and cross-species testing data sets are much larger than in previous studies, which could help to avoid the problem of overfitting. Furthermore, an online prediction web server implementing the four proposed predictors has been built and is available at http://server. malab.cn/Gene2vec/.

N6-methyladenosine machine learning deep learning RNA word embedding mRNA

INTRODUCTION

N6-Methyladenosine (m6A) is an RNA methylation modification at the nitrogen-6 position of the adenosine base. It has been identified as the most commonly modified base in the messenger RNA of most eukaryotes. Research has 6 shown that m A modification is involved in numerous biological activities, including the differentiation and reprogramming of stem cells (Yue et al. 2015), translation and alternative splicing (Geula et al. 2015; Xu et al. 2018), circadian clock (Fustin et al. 2013), and cerebellar development (Wang et al. 2018). Research in cancer biology has also 6 shown that m A mRNA modification plays a critical role in glioblastoma stem cell self-renewal and tumorigenesis (Cui et al. 2017; Zhang et al. 2017), and m6A modification was also shown to exert anti-leukemic activity in recent studies (Li et al. 2017b; Su et al. 2018). Moreover, Lichinchi et al. showed that viral infection triggers a massive 6 increase in m A in both host and viral mRNAs (Lichinchi et al. 2016a), and similar regulatory mechanisms of several other viruses have also been confirmed (Gokhale et al. 2016; Lichinchi et al. 2016b; Hesser et al. 2018). Furthermore, high-throughput analysis of m6A, for instance, using the RNA immunoprecipitation biotechnologies MeRIP-seq and m6A-seq (Dominissini et al. 2012; Meyer et al. 2012), has provided insights into the functions and topological 6 patterns of m A mRNA modification. Based on information from MeRIP-seq experiments, comprehensive databases of 6 m A modification have been built to help researchers determine the locations and effects of these modifications (Liu et al. 2017a; Xuan et al. 2017). The work of Wan et al.

6 (2015) showed that m A patterns are similar between plants and mammals, with both being abundant near stop codons and 3′ untranslated regions (UTRs) and having similar consensus m6A methylation motifs and similar frequencies of 6 m A sites per transcript in the transcriptome. Following the profiling of m6A distributions in mammalian transcriptomes (Dominissini et al. 2012; Meyer et al. 2012) and the 6 mapping of the yeast m A methylome (Schwartz et al. 2013), transcriptome-wide profiling of m6A in two accessions of Arabidopsis thaliana (Luo et al. 2014) was performed, which indicated that m6A is a highly conserved mRNA modification in plants and that there is a positive correlation between m6A deposition and mRNA abundance. In a recent study, it was also shown that m6A is regulated by microRNA via complementary sequence motifs (Chen et al. 2015a), which suggested that sequence signaling is very important for m6A sites. Recently, a novel, singlebase-resolution technique, miCLIP-seq (Ke et al. 2015; Linder et al. 2015), was also established, which prompted a 6 new wave of research on computational methods of m A identification (Xiang et al. 2016; Zhou et al. 2016).

Notably, m A sites have methylation-specific surroundings, with the topology of a DRACH (where D = A, G, or U; R = A or G; and H = A, C, or U) consensus motif and a GAC consensus motif localized near stop codons, in 3′UTRs, within long internal exons, and at 5′-UTRs (Meyer et al. 2012; Schwartz et al. 2013; Li et al. 2014; Luo et al. 2014; Zhou et al. 2016). Furthermore, the conserved Pu [G > A]m6AC[A/C/U] consensus motif dominates mammalian m6A sites (Dominissini et al. 2012). However, only ∼15% of all methylation Pu[G > A] m6AC[A/C/U] consensus 6 motifs are m A sites (Yue et al. 2015). Identification of the actual methylated m6A sites among these consensus motifs remains a problem. High-throughput sequencing and wet experiments could not solve this problem due to the cost and time-consuming nature of the research, as well as inaccuracy regarding the identified sites. Therefore, computational tools were required to guide the accurate prediction of modification sites and to help reduce the costs associated with high-throughput sequencing.

In the above context, computational tools were developed for detecting different modification sites, including protein methylation (Wei et al. 2018d), protein phosphorylation (Wei et al. 2017) and dephosphorylation (Jia et al. 2017), protein O-GlcNAcylation (Jia et al. 2018), histone crotonylation (Qiu et al. 2017), DNA N4-methylcytosine (Chen et al. 2017c; Wei et al. 2018b), RNA pseudouridine (Chen et al. 2016b), and various RNA adenosine modifications (Chen et al. 2018). However, it has been proven that sequence alignment (e.g., PSI-BLAST) cannot accurately identify the modification sites. Instead, machine learning techniques are used in approaches that are currently popular. For these, a sliding window is selected around the candidate modification sites. Then, sequences in the sliding window are collected for standard machine learning training and testing processes. Related features are then proposed for representing the sliding window sequences with equal length (Liu et al. 2015, 2018; He et al. 2018). However, in this context, there is a major problem regarding construction of the training data set. Specifically, low-quality negative samples cause low generalizability of the model, resulting in poor performance when applied to novel data. In addition, the compatibility of prediction methods in different species also remains a problem, so researchers cannot currently be certain that cross-species predictions are accurate.

The yeast data set (Schwartz et al. 2013) and Arabidopsis data set (Luo et al. 2014) are two benchmark data sets for the computational prediction of m6A. Focusing on the Arabidopsis data set, Chen and coworkers first proposed 6 a support vector machine-based method to identify m A sites. Soon afterwards, they proposed a predictor called “iRNA-Methyl” (Chen et al. 2015b) on a near singlenucleotide resolution yeast data set. Chen and coworkers represented RNA sequences using “pseudo-dinucleotide composition,” which focuses on the physiochemical properties of RNA. Other improved works in terms of prediction accuracy on the two data sets were also presented (Chen et al. 2017b; Xing et al. 2017). Recently, Zhou et al. estab6 lished an m A data set from published single-nucleotide6 resolution maps of human and mouse m A sites (Ke et al. 6 2015; Linder et al. 2015). They then developed an m A predictor named SRAMP (Zhou et al. 2016), which simply uses three sequence-derived features with Random Forest classifiers. Following this work, Xiang et al. (2016) improved the predictive performance by integrating multiple sequence features, including positional binary nucleotide sequence encoding, nucleotide pair spectrum encoding, positionspecific encoding, and k-mer nucleotide frequency encoding. These methods were used with conventional features in position and frequency statistics, with the sliding window being limited to a narrow region and focusing on only a single species. Additionally, there was a lack of independent testing, which resulted in overfitting.

To solve the above problems, we used convolutional neural network (CNN) for m6A prediction. CNNs have been applied in various fields of bioinformatics (Zhang et al. 2018), including regulatory genomics (Angermueller et al. 2016; Xu et al. 2017), drug discovery (Stephenson et al. 2018), protein subcellular localization (Almagro Armenteros et al. 2017; Wei et al. 2018a), protein function prediction (Cao et al. 2017), and single-cell DNA methylation states (Angermueller et al. 2017; Xu and Zhou 2018). These networks directly trained predictor models without predefined features and outperformed conventional predictors with larger sequence sliding windows. The combination of big data and larger sliding windows together with deep learning techniques appeared to solve the m6A overfitting problem. It has also become common to represent sequences with word embedding algorithms, instead of the sparse one-hot encoding (Dai et al. 2017; Min et al. 2017; Wei et al. 2018c).

In this paper, we report a neural embedding predictor named Gene2vec (gene subsequence to embedding vector). We extended the sliding window length to the thousand level, used word embedding to represent mRNA subsequences, which were parts of long sliding window sequences, and performed classification with CNNs. We progressively experienced various gene sequence representation schemes with their most effective CNN structures with library Keras (https://keras.io/) and proved that the gene-subsequence-based neural embedding method is the best option for various large-scale sequence data. It appeared that CNN together with Gene2vec can address the problem of overfitting in m6A prediction. To our knowledge, this is the first time to test ∼1000 nt length sliding frame, and we used much more training and testing sequences than before, which could help to avoid overfitting of deep learning techniques.

RESULTS AND DISCUSSION Optimization of parameters

Large sequence windows confer more contextual sequence information and greater GAC/AAC site coverage. It is an important ingredient of the contributions to effective performance. We evaluated the effect of different sequence windows on the prediction results with simple one-hot encoding using the two convolutional cell structures mentioned above (Fig. 1). The results in Figure 1 demonstrated that the AUROC and MCC have growth trend integrally with the increase of the sequence window length and no longer change drastically when it is up to 1001 nt.

The selection of RNA word split length is very important for embedding and Gene2vec prediction mode. Too short a length will lead to a small number of nucleotide letter combinations, which in extreme conditions will degenerate into one-hot encoding with assigning four types of a unique integral index to each individual nucleotide, while an excessive length will result in too many combinations, leading to complex vector representation and high computing costs. To determine the optimal RNA word split length, we compared the performance on a validated set with different slice lengths of RNA word from two to five nucleotides (Fig. 2). As shown in Figure 2, three nucleotides is an appropriate length in this range with metrics of both AUROC and MCC. We also compared the results of different embedding output dimensions of word embedding methods with two convolutional cell structures on a validation set (Fig. 3). We used AUROC and MCC metrics to compare performances of different embedding output dimensions of word embedding. As shown in Figure 3, the performances of 128 output dimensions reach a peak with both AUROC and MCC metrics in an appropriate range of [64,256], though the performance of 256 output dimensions with MCC metrics is slightly higher than the 128 output dimensions. As a result, we chose 128 as an appropriate word embedding output dimension parameter.

The main network hyperparameters of the four prediction modes are the convolution number of filters and size of the convolution kernel. For the convolution number of filters, we first empirically set this as decreasing power values of 2, and for the size of the convolution kernel, we set it as some decreasing odd values. The convolution network structures of all four prediction schemes are similarly constituted in terms of cell structures, the detailed hyperparameters of which are supplied in the Supplemental Material.

Learned and analyzed motifs from CNN kernel

Deep learning is to a certain extent a black box with the difficulty in tracing a prediction back to which features are important. And it is very meaningful to explain biological meaning in the process of training of CNN with visualization. Recently, many studies on biological computing prediction classifications (Liu and Li 2018; Li et al. 2017a; Liu et al. 2017c; Zeng et al. 2018) involving CNNs have used convolution kernels of the first layer to extract informative motifs from massive sequence data sets. These followed on from the heuristic work from Deepbind (Alipanahi et al. 2015) that generated a position weight matrix (PWM) by aligning all matched sequence segments and calculating the frequency for each kernel. We here apply this new method to achieve conversion from convolution kernels to PWMs on single training sets from humans and mice. For compatibility regarding representation of the four nucleotides in PWMs, we retrained a CNN model with 4-nt binary representation by randomly transforming the padding character into one of the four nucleotides.

We used 64 convolution kernels with a length of 11 binary representations for each kernel on the first CNN layer and generated 64 motifs after transforming kernels to PWMs. For further analysis of these motifs, we used the TOMTOM (Gupta et al. 2007) motif comparison tool to compare one or more motifs against a database of known RNA motifs. We picked out several representative examples of similar known motifs found in previous research (Ray et al. 2013) on humans and mice, as shown in Figure 4. The matching metrics shown are E-value and the number of overlaps, with E-value being the expected number of false positives in the matches up to this point. The comparative results in Figure 4 show that the CNN kernel or filter weight learning from an abstract representation of a deep neural network by scanning convolution operation corresponds to response function of consensus motifs in biological sequences. For instance, the 54th kernel in 64 convolution kernels is very similar to the RNCMPT00041 motif, which we analyzed from RNA-binding protein Musashi homolog 1 (MSI1), encoded by the MSI1 gene. On the one hand, the known consensus motifs matched with convolution kernels help researchers to study the relationship between them in biological experiments. On the other hand, the unmatched kernels may help to uncover new motifs. All 64 motifs of PWM logos and a detailed motif comparison report can be accessed in the Supplemental Material or on our web server.

Performance of the predictors

We compared four prediction modes with four different data preprocessing and encoding schemes on 1001-nt sequence data extracted from the unbalanced independent test set transcript ID of Zhou et al. (2016) (Fig. 5). For building a prediction model with an unbalanced training set with a positive-to-negative ratio of 1:10, the same numbers of negative samples and positive samples were randomly selected as training data sets. We used 80% of the training set for building a model and used the other 20% of the training set to verify the model and optimize the neural network parameters, so we evaluated the performance of our predictor on an unbalanced independent test set. With the same unbalanced independent test set, neural networks based on all four predictors achieved better prediction than SRAMP, with the Gene2vec method achieving the best results (Table 1).

To use these four learning algorithms to obtain better predictive performance, we also established a soft vote with average predicted probabilities. We compared the ensemble result using the metrics of accuracy (Acc), Sn, Sp, and MCC in the unbalanced independent test set. As shown in Figure 6, all four metrics of the soft vote showed better results than for any of the single prediction methods. Note that AUROC and AUPR metrics were not considered here due to their step-wise character.

Cross-validation (10-fold) was performed with mature mRNA data from two different species and the corresponding trained Gene2vec model. The different predictive AUROC values are shown in Figure 7. We also added a mixed model built using the above-mentioned human and mouse data to predict independent species test data. As the figure shows, the model built using mouse data was less effective at prediction than the model built using human data due to the smaller amount of mouse training data available. We also found in the cross-species validation that the prediction results were poorer than when using the species-consistent data and model, which indicates the specificity of our method for a particular species. Furthermore, we obtained almost the same AUROC values of 0.8415 and 0.8414 for human and mouse data predicted using the mixed model, which are consistent with the mixed test data prediction result obtained with Gene2vec as shown in Table 1.

Further assessment of robustness of prediction model was performed in the YTHDF binding site data set. YTHDF proteins are m6A readers. The above predictor could not only identify N6-methyladenosine sites, but should also predict YTHDF protein binding sites that recognize m6A-modified mRNAs by selection. Our method performed better with AUROC = 0.737 and AUPRC = 0.963 (Fig. 8) than SRAMP with AUROC = 0.720 and AUPRC = 0.251 as reported previously for the YTHDF binding site data set.

We also compared our method with the newest predictor of mammalian N6-methyladenosine sites of which we are aware, RNAMethPre, for a remapped unbalanced independent test set. We used the balanced training set from RNAMethPre with Gene2vec encoding and the same CNN structure to build a prediction model for the purpose of establishing consistent comparison conditions. As Table 2 shows, RNAMethPre was better than SRAMP in terms of predictive performance with three stringency thresholds corresponding to 90%, 85%, and 80% specificity in the independent data set tests. Our method achieved more effective prediction than the RNAMethPre predictor with these four thresholds.

Conclusions

In this study, we built four prediction modes for predicting mammalian N6-methyladenosine sites with various RNA sequence representations and optimized CNN structure. We achieved more effective prediction than in conventional methods with information on the context regarding flanking sequences for 1000 nucleotides. Specifically, we combined deep learning and word embedding technology to present an RNA N6-adenosine methylation predictor based on gene-subsequence-based neural embedding algorithms by splitting the sequence into pseudo-RNA words. We also used these pseudo-RNA words to train a corpus model. Using this model, we analyzed and explained semantic equivalence and semantic symmetry phenomena of RNA sequences with vector space presentation. We also trained an embedding model and a network model, and optimized encoding parameters and network parameters on a validation set. We evaluated our method on a rigorous unbalanced independent mammalian m6A site test set and YTHDF binding site test set and achieved better results than with the conventional method. We also established a user-friendly web server at http://server.malab.cn/Gene2vec/, where users can submit uncharacterized mRNA sequences for the prediction of 6 potential m A sites. In the future, we will pay more attention to the genomics data (e.g., tRNA, rRNA) besides mRNA sequences if related training dates are released. Supplemental Material can be accessed on our web server.

MATERIALS AND METHODS Data sets

Data sets were retrieved from Homo sapiens and Mus musculus complementary DNA (cDNA) FASTA data in Ensembl (Zerbino et al. 2017). The mature mRNA transcript IDs were derived from the work of Zhou et al. (2016), which involved annotation of the mammalian m6A sites. In our work, RNA sequences were derived from and were equal to cDNA. Besides the data sets of Zhou and coworkers, we built other data sets in accordance with the work of Xiang et al. (2016) for comparative experiments. For further confirming the validity of the prediction model, Zhou and coworkers also used YT521-B homology domain family (YTHDF) proteins which are m6A readers. RNA sequences with m6A sites would be bound by YTHDF. Therefore, YTHDF-binding RNAs could be selected as those with m6A sites.

For all RNA nucleotide sequence data, we cut out a 1001-nt local sequence window centered at an m6A/non-m6A site, removing the sequences whose centers were not GAC or AAC consensus motifs, and filled in the rest with the character “X” when the sequence was shorter than 1001 nt. Finally, we obtained 495,572 sequences in the training set and 128,561 in the testing set. For both of these, the positive-to-negative ratio was approximately 1:10.

The YTHDF-binding data set contained 57,516 testing sequences. There were 88,579 training sequences (positive-tonegative ratio of 1:1) and 88,227 testing sequences (positive-tonegative ratio 1:10) in the work of Xiang et al. (2016). The sizes of the different data sets are shown in Table 3. All of these sequences can be accessed in the Supplemental Material.

Data preprocessing and encoding

In this section, we introduce four sequence-encoding schemes, namely, one-hot encoding, neighboring methylation state encodIn one-hot encoding, there are five characters, standing for the four types of nucleotide along with the padding character “X.” One-hot encoding uses five-dimensional binary vectors, where A = [1, 0, 0, 0, 0], T = [0, 1, 0, 0, 0], G = [0, 0, 1, 0, 0], C = [0, 0, 0, 1, 0], and X = [0, 0, 0, 0, 1] and transforms the 1001 nt windows sequences centered at the prediction site into 5005-binary-long vector.

Neighboring methylation state encoding m6A sites appear to cluster in the chromosomes. Therefore, the positive sites may be close to other positive sites, instead of negative ones. For neighboring methylation state encoding, we counted the neighboring positive/negative site numbers as a kind of feature. We scanned GAC and AAC sites throughout the entire transcript sequences, and gave a label of 1 to known methylation (m6A) sites and 0 to unknown GAC/AAC sites. For every positive (label 1) or negative (label 0) site, we listed the 250 upstream GAC/AAC sites and 250 downstream GAC/AAC sites. Therefore, we extracted 501 0/1 codes for every site. If the end of the transcript was reached, we filled the vacancy with values of 0 in both sides to the 501-code-longvector.

RNA word embedding For RNA word embedding, heuristically, we shifted a 3-nt-long window along 1001-nt sample sequences to generate RNA subsequences that can be analogized into gene words. Embedding encoding identified possible 3-nt combinations (105 different combinations in our training data) with a unique integral index, and transformed sample sequences into integral sequences with the corresponding integral index. Then, we input them into the Keras embedding layer to transform each integral sequence into a data table S ∈ Rn×e, where n is the length of the integral sequence and e is the dimension of the dense embedding. Word2vec (Church 2017) is a statistical method for learning word embedding from a text corpus with neural-network-based training via Skip-gram and continuous bag-of-words (CBOW) models. The Skip-gram model predicts the surrounding words from the current word, while the CBOW model predicts the current word from its surroundings. Both models are focused on learning about words given their local usage context, where the context is defined by a window of neighboring words.

For Gene2vec, similar to the processing of embedding encoding, we regarded lengths of three RNA nucleotides as an RNA word, in an overlapping manner, analyzed them for sequence content as the RNA corpus, and used Word2vec in the Gensim tool package (https://radimrehurek.com/gensim/models/word2vec. html) with a five-word-long window of neighboring words to learn A B C the vector relationship of those RNA words and generate a 100-dimensional feature vector.

An RNA word co-occurrence network was built, as shown in Figure 9A, to find similarities between word pairs and the word patterns of latent structures and representations. This network was built using a square symmetric matrix, which was the result of multiplying a rectangular matrix by its transpose. Each row vector of this rectangular matrix represents a different RNA context unit (S1, S2, S3…Sn) corresponding to a sentence or paragraph, and each column-vector represents a different word (W1, W2, W3…Wn). Therefore, in the RNA word co-occurrence matrix, each row or column stands for an RNA word. Each cell stands for a context unit size, which is the number of RNA words Wi co-occurring with the RNA word Wj. Correlation function was computed to find the correlations, and then a t-test was used on the individual correlations using the following formulas in the Psych package (https://cran.r-project.org/web/packages/psych/): t = r ∗√(n − 2) ,

(1 − r 2) se = √((1n−−r22)) , where r is the matrix of correlations and n is the number of cases per correlation. We built the co-occurrence network by setting a correlation threshold of 0.7 and linked two points together with an edge when their correlation exceeded this threshold. For instance, the group [AGA, GAA, AAG] (green group in Fig. 9A) can be selected among 43 types of 3-nt-long RNA sequence permutations as an independent group, due to their strong correlation. We also removed RNA words containing the character “X” and transformed this 100-dimensional vector (Gene2vec output) of 43 types of RNA words to two components utilizing principal component analysis (PCA) to reveal the two-dimensional spatial correlation of those RNA words (Fig. 9B). Spatial distance is representative of word similarity. All 2D and 3D graphs with detailed annotations can be obtained on our web server. As can be seen in Figure 9B, 64 RNA words cluster into eight groups with axial symmetry. For instance, Group 1 [CCG, GCG] and Group 3 [CGG, CGC] are symmetrical about the vertical, which first indicates that CCG has a similar “meaning” for sequence composition to GCG due to their cluster in vector space, and second reveals that CCG is to CGG as GCG is to CGC (Fig. 9B). This is just like “king” is to “queen” and “man” is to “woman” in natural language. We call the two phenomena gene semantic equivalence and gene semantic symmetry (Fig. 9C). Two RNA words are said to have semantic equivalence when they exist among different sequences that have almost the same gene context, while two RNA words are said to have semantic symmetry when they exist within one sequence that has a unique biochemical property of not only positional symmetry, but also nucleotide-order symmetry, just like a pair of hands where each finger represents a nucleotide letter.

Network structure

We used CNNs with multiple cell structures that have two one-dimensional convolution layers, one pooling layer and one dropout layer. Convolution layers are designed to extract features with high-dimensional abstract representation. The pooling layer limits the number of model parameters tractable by pooling operations. The dropout layer prevents overfitting of the model by randomly setting some of the input units to a value of 0. Four prediction methods had been established based on four different network structures composed by the cell structures mentioned above. One-hot encoding data were fed into the network with four cell structures and fully connected layers as input, while neighboring methylation state encoding data, RNA word embedding data, and Gene2vec processing data were fed into networks with two cell structures (Fig. 10). The final result was obtained by a voting strategy from the four prediction probabilities.

Taking an example of the one-hot coding sequence, the input data matrix Xn was first fed into a 1D-convolutional layer, which used a convolutional filter Wf ∈ RH, where H is the length of the filter vector. The output feature Ai at the ith position was computed by

Ai = ReLU

H h=1

Wf Xn,i+h + bf , where ReLU(x) = max(0, x) is the rectified linear unit function and bf ∈ R is a bias (Mairal et al. 2014). These convolutional operations are similar to data block of H length in sequence filtered by a sliding filter window at each ith position.

Next, a max. pooling layer was used for reduction of the dimensions of output data generated by the multiple convolutional filter operations. A max. pooling layer is a form of nonlinear downsampling achieved by outputting the maximum of each subregion.

To reduce overfitting, we added a dropout layer in which individual nodes were either “dropped out” from the network with probability 1 − P or kept with probability P at each training stage. This not only prevented overfitting, but also led to integration of various deformed network structures to generate more robust features that are more generalizable to new data.

Finally, a flattening layer that “flattened” the input data was used, which transformed multidimensional data into a single dimension. Fully connected layers with an ReLU activation function and output layer predict the binary classification probability with activation function as follows (Han and Moraga 1995): yˆ (x) = sigmoid(x) =

1 1 + e−x .

Evaluation metrics

To assess the performance of our prediction model on an unbalanced independent test set, we used the following metrics (Liu et al. 2017b): sensitivity (Sn), specificity (Sp), and Matthew’s correlation coefficient (MCC), which are formulated as follows:

TP Sn = TP + FN × 100%,

TN Sp = TN + FP × 100%,

TP × TN − FP × FN MCC = √(TP + FN) + (TN + FP) + (TP + FP) + (TN + FN) , where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively.

The area under the ROC curve (AUROC) and the area under the precision–recall curve (AUPR) were calculated to evaluate the performance of the predictors. Receiver operating characteristic (ROC) curves can be plotted as sensitivity against 1 − specificity and precision–recall curves as precision (the proportion of true positives among all predicted positives) against recall (the proportion of relevant instances that have been retrieved among the total number of relevant instances). A precision–recall plot is more informative than an ROC plot when applied to unbalanced data sets (Song et al. 2014).

SUPPLEMENTAL MATERIAL

Supplemental material is available for this article.

ACKNOWLEDGMENTS

This work was supported by the National Key R&D Program of China (SQ2018YFC090002) and the Natural Science Foundation of China (nos. 61771331, 61822306, 61672184). The authors thank Liwen Bianji, Edanz Group China (www.liwenbianji.cn/ac), for editing the English text of a draft of this manuscript.

Received October 3, 2018; accepted November 1, 2018.

Gene2vec: prediction of m6A sites Zou et al. Quan Zou, Pengwei Xing, Leyi Wei, et al. RNA 2019 25: 205-218 originally published online November 13, 2018 Access the most recent version at doi:10.1261/rna.069112.118 Gene2vec: gene subsequence embedding for prediction of mammalian N6-methyladenosine sites from mRNA

Creative

Commons License

Email Alerting Service

This article cites 68 articles, 4 of which can be accessed free at: http://rnajournal.cshlp.org/content/25/2/205.full.html#ref-list-1 This article is distributed exclusively by the RNA Society for the first 12 months after the full-issue publication date (see http://rnajournal.cshlp.org/site/misc/terms.xhtml). After 12 months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/. Receive free email alerts when new articles cite this article - sign up in the box at the top right corner of the article or click here.