Deep Learning for Circular RNA Classification

Abstract

Introduction

Circular RNAs (circRNAs) are increasingly recognized as key regulators of gene expression due to their unique closed-loop structure and involvement in various cellular processes. This study investigates the utilization of machine learning algorithms in predicting circRNA-disease associations.

Methods

This study proposes a novel deep learning approach that uses an artificial neural network (ANN) for circRNA classification. The methodology involves data collection from circRNA databases, k-mer counting for feature extraction, Gaussian blurring for data smoothing, and ANN-based model training.

Results

The trained models reach an overall accuracy of 0.7511, with an average precision of 0.7982, recall of 0.7511, and F1-score of 0.7637.

Discussion

The results indicate that our ANN-based algorithm detects and classifies circRNA datasets with considerable accuracy. Compared to algorithms from past research, our algorithm also requires less computational power.

Conclusion

Comparative analysis shows improved performance over most previous algorithms. Combined with its reduced computational requirements and simpler implementation, this suggests the proposed algorithm has potential for widespread use.

Introduction

Circular RNAs (circRNAs) represent a category of single-stranded RNA molecules characterized by the absence of 5′ caps and 3′ polyadenylated tails, forming a covalently closed continuous loop structure. Recent research indicates that circRNAs play a distinct role in regulating human gene expression.

CircRNAs are emerging as valuable biomarkers due to their high stability and tissue-specific expression patterns, making them reliable indicators for various diseases.

The rapid development of machine learning has enhanced the prediction of circRNA-disease associations. Current models for predicting these associations are often complex, which can lead to overfitting and lack of generalization. This study proposes a simpler model using Artificial Neural Networks (ANN) to achieve high accuracy with improved computational efficiency.

Methods

1. Dataset Collection

Data was collected from circAtlas (https://ngdc.cncb.ac.cn/circatlas/links1.php) and processed with Python. The data consisted of sequences of non-coding circular RNA and of non-circular RNA (mRNA).

2. Feature Extraction with K-mers

K-mer counting counts the occurrences of every subsequence of length k in a set of RNA sequences. The length was set to k = 4, guided by this formula:

k = log_k(average of l)

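where l is the sequence length. With k = 4 and a four-letter nucleotide alphabet there are 4^4 = 256 possible 4-mers (e.g., ACGT, GTAA, CGTT), which is consistent with the 256-node input layer of the model. The counts of these 4-mers were then processed further.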
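The counting step can be illustrated with a minimal sketch. It is not the study's code: the function name `kmer_vector`, the lexicographic ordering of the 4-mers, and the mapping of U to T are assumptions made here for illustration.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: sequences use DNA-style letters (U is mapped to T)


def kmer_vector(sequence, k=4):
    """Count every possible k-mer in `sequence`, in fixed lexicographic order."""
    sequence = sequence.upper().replace("U", "T")
    # Slide a window of width k over the sequence, one position at a time.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Enumerate all 4**k possible k-mers so every sequence maps to the same layout.
    vocabulary = ("".join(letters) for letters in product(ALPHABET, repeat=k))
    return [counts.get(kmer, 0) for kmer in vocabulary]


vector = kmer_vector("ACGTACGTTA")
print(len(vector))  # 256 entries, one per possible 4-mer
print(sum(vector))  # 7 windows of width 4 in a 10-letter sequence
```

Because the vocabulary is enumerated in a fixed order, every sequence yields a vector of the same length regardless of its own length.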

3. Gaussian Blur Implementation

Blurring is a data-processing technique that smooths the values in a dataset. This study applied a Gaussian function to smooth the vectorized non-coding RNA sequence using this formula:

f(x) = 1/(σ√(2π))e^{-1/2((x-μ)/σ)²}

Sigma was set to 1, which provides a moderate amount of blurring at low computational cost and leaves room for later adjustment.
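As a rough sketch of how such smoothing could be applied to a one-dimensional feature vector, the example below convolves the vector with a discrete Gaussian kernel. The study does not describe its implementation, so the truncation at 3 sigma, the edge handling, and the function names are assumptions. The kernel is normalised to sum to 1, so the constant factor 1/(σ√(2π)) in the formula cancels out.

```python
import math


def gaussian_kernel(sigma=1.0, radius=None):
    """Discrete Gaussian weights centred on 0 (mu = 0), normalised to sum to 1."""
    if radius is None:
        radius = int(math.ceil(3 * sigma))  # assumption: truncate the kernel at 3 sigma
    weights = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def gaussian_blur(values, sigma=1.0):
    """Smooth a 1-D vector; positions past either end reuse the nearest edge value."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    last = len(values) - 1
    smoothed = []
    for i in range(len(values)):
        acc = 0.0
        for offset, weight in zip(range(-radius, radius + 1), kernel):
            j = min(max(i + offset, 0), last)  # clamp the index at the borders
            acc += weight * values[j]
        smoothed.append(acc)
    return smoothed


# A single spike is spread over its neighbours while the total is preserved.
blurred = gaussian_blur([0, 0, 0, 8, 0, 0, 0], sigma=1.0)
print([round(v, 3) for v in blurred])
```

A larger sigma widens the kernel and spreads each count over more neighbouring positions.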

4. Model Architecture

The artificial neural network (ANN) consists of an input layer (256 nodes), 5 hidden layers, and an output layer. Every hidden layer was followed by a dropout layer to prevent overfitting and a Leaky ReLU activation to add non-linearity.
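The layer layout can be sketched as a forward pass in plain Python. Only the 256 inputs and the 5 hidden layers come from the description above. The hidden widths, the two output classes, the dropout rate, the Leaky ReLU slope, and the order of activation and dropout are all hypothetical choices made for illustration, and a real implementation would normally use a deep learning framework.

```python
import random

# Hypothetical widths: the source gives only 256 inputs and 5 hidden layers.
LAYER_SIZES = [256, 128, 64, 32, 16, 8, 2]


def leaky_relu(x, slope=0.01):
    """Pass positive values through; scale negative values by a small slope."""
    return x if x > 0 else slope * x


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def dropout(values, rate, rng, training):
    """Inverted dropout: zero a fraction of nodes during training, rescale the rest."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def build_network(seed=0):
    """Random initial weights for each pair of consecutive layer sizes."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(LAYER_SIZES, LAYER_SIZES[1:]):
        scale = (2.0 / n_in) ** 0.5
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(network, features, training=False, rate=0.2, rng=None):
    """Run one feature vector through the hidden layers and the output layer."""
    rng = rng or random.Random(0)
    activations = list(features)
    for weights, biases in network[:-1]:  # the 5 hidden layers
        activations = [leaky_relu(z) for z in dense(activations, weights, biases)]
        activations = dropout(activations, rate, rng, training)
    weights, biases = network[-1]  # output layer: raw scores, one per class
    return dense(activations, weights, biases)


network = build_network()
scores = forward(network, [1.0] * 256)
print(len(scores))  # one raw score per (hypothetical) output class
```

Dropout is only active when `training=True`, so predictions made with the default arguments are deterministic.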

Results

Our neural network models for both detection and classification showed promising training progress with the following performance metrics:

Model Performance Comparison

Method	Accuracy	Precision	F1-Score	Time per Prediction
Our Algorithm	0.7511	0.7982	0.7637	0.14 ms
SVM	0.7328	0.7742	0.7921	0.84 ms
Random Forest	0.7186	0.7393	0.7918	0.62 ms
DeepCirCode	0.8129	0.9271	0.8365	6.74 ms
Att-CNN	0.7264	0.7452	0.7696	23.67 ms
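For readers who want to reproduce metrics of this kind, the sketch below computes accuracy together with support-weighted precision, recall, and F1-score from true and predicted labels. The source does not state which averaging scheme was used, so support weighting is an assumption. It is at least consistent with the reported recall (0.7511) being equal to the reported accuracy, because support-weighted recall always equals accuracy. The function name and the toy labels are hypothetical.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1 over all classes."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    total = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / total
    precision = recall = f1 = 0.0
    for label in labels:
        true_pos = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        actual = support[label]
        p_label = true_pos / predicted if predicted else 0.0
        r_label = true_pos / actual if actual else 0.0
        denom = p_label + r_label
        f_label = 2 * p_label * r_label / denom if denom else 0.0
        weight = actual / total  # weight each class by its share of true samples
        precision += weight * p_label
        recall += weight * r_label
        f1 += weight * f_label
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels only; these are not the study's data.
scores = weighted_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 1, 1])
print(scores)
```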

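The proposed method was substantially faster than the other algorithms, taking only 0.14 milliseconds to predict one RNA. It was at least ten times faster than similar deep learning models such as DeepCirCode and circDeep. Against the table above, this is roughly a 48-fold speedup over DeepCirCode (6.74 ms) and a 4- to 6-fold speedup over Random Forest (0.62 ms) and SVM (0.84 ms).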
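The source does not say how the time per prediction was measured. One common way to obtain such a figure is to time a batch of predictions several times and report the median per-sample time, as in the sketch below. The helper name, the repeat count, and the stand-in `dummy_predict` model are hypothetical.

```python
import time


def time_per_prediction_ms(predict, samples, repeats=5):
    """Median wall-clock time per prediction, in milliseconds."""
    per_call = []
    for _ in range(repeats):
        start = time.perf_counter()
        for sample in samples:
            predict(sample)
        elapsed = time.perf_counter() - start
        per_call.append(1000.0 * elapsed / len(samples))
    per_call.sort()
    return per_call[len(per_call) // 2]  # the median is robust to occasional slow runs


def dummy_predict(features):
    """Hypothetical stand-in for a trained model's predict function."""
    return sum(features) > 0


samples = [[0.0] * 256 for _ in range(1000)]
print(time_per_prediction_ms(dummy_predict, samples))
```

Measured values depend on hardware and batch size, so timings are only comparable when every method is run under the same conditions.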

Discussion

The objective of this experiment was to create a machine-learning model that can detect and classify different types of circular RNA with a high accuracy score by using an artificial neural network. The results show that our designed neural network was able to detect and classify the circular RNA dataset with good evaluation scores.

The similarity between the training and validation scores shows that the detection model suffered from neither underfitting nor overfitting. However, the large gap between the training and validation loss of the classification model indicates that overfitting occurred during its training.

The proposed algorithm would likely have lower computing power requirements and a simpler implementation than those of previous research, which would make it less costly and more efficient to deploy.

Runtime analysis also shows that the proposed method is more efficient in time and algorithmic complexity, and therefore needs less computational power. The proposed algorithm thus offers comparable accuracy at a lower computational cost, which makes it well suited to commercialization and wide deployment.

Conclusion

This study aimed to create an algorithm based on neural networks to detect and classify non-coding circular RNA with a high accuracy score. We implemented an ANN as the detection and classification model and evaluated its accuracy, precision, recall, and f1-score.

The results demonstrate that the proposed method is comparable to other similar algorithms. It achieved higher accuracy and precision than SVM, Random Forest, and Att-CNN, while DeepCirCode scored higher on both, and it required substantially fewer computational resources than all of them.

In conclusion, the proposed algorithm was able to detect and classify the non-coding circular RNA with a high accuracy score and exceptional computational efficiency.
