Research Paper

Deep Learning for Circular RNA Classification

Implementation with Gaussian Blur Data Preprocessing in Detection and Classification

A focused ANN pipeline translating circAtlas k-mer patterns into reliable disease classification while keeping the workflow light enough for rapid experimentation.

75.11% Model accuracy
0.14 ms Prediction runtime

Introduction

Circular RNAs (circRNAs) are increasingly recognized as key regulators of gene expression due to their unique closed-loop structure and involvement in various cellular processes. This study investigates the utilization of machine learning algorithms in predicting circRNA-disease associations.

Methods

This study proposes a novel deep learning approach that uses an artificial neural network (ANN) for circRNA classification. The methodology comprises data collection from circRNA databases, k-mer counting for feature extraction, Gaussian blur for data smoothing, and ANN-based model training.

Results

Evaluation of the trained models based on precision, recall, and f1-score metrics shows an overall accuracy of 0.7511, with an average precision score of 0.7982, recall of 0.7511, and f1-score of 0.7637.

Discussion

The results indicate that our ANN-based algorithm detects and classifies circRNA datasets with considerable accuracy. Compared with algorithms from past research, it also has significantly lower computational requirements.

Conclusion

Comparative analysis shows that the proposed model outperforms most previous algorithms on accuracy and all of them on runtime, suggesting its potential for widespread use thanks to reduced computational requirements and simpler implementation.

Circular RNAs (circRNAs) represent a category of single-stranded RNA molecules characterized by the absence of 5′ caps and 3′ polyadenylated tails, forming a covalently closed continuous loop structure. Recent research indicates that circRNAs play a distinct role in regulating human gene expression.

Key Insight

CircRNAs are emerging as valuable biomarkers due to their high stability and tissue-specific expression patterns, making them reliable indicators for various diseases.

The rapid development of machine learning has enhanced the prediction of circRNA-disease associations. Current models for predicting these associations are often complex, which can lead to overfitting and lack of generalization. This study proposes a simpler model using Artificial Neural Networks (ANN) to achieve high accuracy with improved computational efficiency.

Figure: Artificial Neural Network Architecture (5-layer ANN with 256 input nodes for k-mer features)

01 Dataset Collection

Data were collected from circAtlas (https://ngdc.cncb.ac.cn/circatlas/links1.php) and processed with Python. The dataset consisted of sequences of non-coding circular RNA and non-coding non-circular RNA (mRNA).

02 Feature Extraction with K-mers

K-mer counting tallies how often each subsequence of length k occurs within the RNA sequences of a dataset. The length was chosen by the user (k = 4) based on this formula:

$k = \log_k(\bar{l})$, where $\bar{l}$ is the average of the sequence lengths $l$

For k = 4, each sequence yields 4-mers, e.g., {ACGT, GTAA, CGTT}, which were then processed. Four bases give 4^4 = 256 possible 4-mers, which matches the 256 input nodes of the network.
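The counting step can be sketched as follows. This is a minimal illustration rather than the study's code: it assumes overlapping windows, a fixed lexicographic ordering of the 256 possible 4-mers, and that `U` is treated as `T`. The names `KMER_INDEX` and `kmer_counts` are hypothetical.

```python
from itertools import product

K = 4
ALPHABET = "ACGT"  # assumption: DNA-style letters, as in the example 4-mers

# Fixed ordering of all 4**K possible k-mers, so every sequence maps to the same 256 slots.
KMER_INDEX = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=K))}


def kmer_counts(sequence):
    """Return overlapping 4-mer counts as a list of length 256."""
    counts = [0] * len(KMER_INDEX)
    seq = sequence.upper().replace("U", "T")  # assumption: treat RNA 'U' as 'T'
    for start in range(len(seq) - K + 1):
        idx = KMER_INDEX.get(seq[start:start + K])
        if idx is not None:  # skip windows with ambiguous bases such as 'N'
            counts[idx] += 1
    return counts


features = kmer_counts("ACGTACGTTAAC")
```

A sequence of length n contributes n - 3 windows, so the 12-base toy sequence above produces nine counts spread over the 256 slots.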

03 Gaussian Blur Implementation

Blurring is a data-processing technique that smooths the values collected in a dataset. This study applied a Gaussian function to smooth the vectorized non-coding RNA sequences using this formula:

$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$

The value of σ was 1 because it provides a moderate amount of blurring, computational efficiency, and flexibility for adjustment.
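One plausible way to apply this function to a feature vector is a 1-D convolution with a sampled Gaussian kernel. The sketch below assumes σ = 1 as stated, a kernel truncated at three standard deviations, μ = 0 at the kernel centre, and zero padding at the vector edges. The source does not specify how the blur was applied, so these choices and the function names are assumptions.

```python
import numpy as np


def gaussian_kernel(sigma=1.0, radius=None):
    """Sample f(x) with mu = 0 on integer offsets and normalise the weights to sum to 1."""
    if radius is None:
        radius = int(np.ceil(3 * sigma))  # assumption: truncate at 3 sigma
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return kernel / kernel.sum()


def gaussian_blur(vector, sigma=1.0):
    """Smooth a 1-D feature vector; the output has the same length as the input."""
    kernel = gaussian_kernel(sigma)
    return np.convolve(np.asarray(vector, dtype=float), kernel, mode="same")


# Toy input: a single spike, so the output is the kernel itself centred on the spike.
spike = np.zeros(256)
spike[100] = 1.0
blurred = gaussian_blur(spike, sigma=1.0)
```

Normalising the kernel keeps the total count mass unchanged away from the edges, which makes the effect of changing σ easier to compare.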
04 Model Architecture

The Artificial Neural Network (ANN) consists of an input layer (256 nodes), 5 hidden layers, and an output layer. Every hidden layer was followed by a dropout layer to prevent overfitting and a Leaky ReLU activation to add non-linearity.
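The forward pass of such a network can be sketched in NumPy. Only the 256 inputs, the five hidden layers, dropout, and Leaky ReLU come from the description above. The hidden widths, the two-class softmax output, the dropout rate, the Leaky ReLU slope, and the activation-then-dropout ordering are all assumptions chosen for illustration.

```python
import numpy as np

LAYER_SIZES = [256, 128, 64, 32, 16, 8, 2]  # hidden widths and 2 outputs are assumptions
DROPOUT_RATE = 0.2   # assumption
LEAKY_SLOPE = 0.01   # assumption


def init_params(rng):
    """He-style random weights and zero biases for each dense layer."""
    return [
        (rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(LAYER_SIZES[:-1], LAYER_SIZES[1:])
    ]


def leaky_relu(z):
    return np.where(z > 0, z, LEAKY_SLOPE * z)


def forward(x, params, rng=None, training=False):
    """Forward pass; dropout is applied only when training=True."""
    h = x
    for w, b in params[:-1]:  # the five hidden layers
        h = leaky_relu(h @ w + b)
        if training:
            keep = rng.random(h.shape) >= DROPOUT_RATE
            h = h * keep / (1.0 - DROPOUT_RATE)  # inverted dropout
    w_out, b_out = params[-1]
    logits = h @ w_out + b_out
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)  # softmax class probabilities


rng = np.random.default_rng(0)
params = init_params(rng)
batch = rng.random((4, 256))  # four stand-in feature vectors, not real data
probs = forward(batch, params)
```

The weights here are random, so the probabilities carry no meaning; the sketch only shows how the layers fit together.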


Our neural network models for both detection and classification showed promising training progress with the following performance metrics:

Figure: Model Performance Comparison (accuracy scores across different algorithms)

Model Performance Comparison

Evaluation metrics and runtime performance across tested algorithms.

| Model | Accuracy | Precision | F1-Score | Runtime |
|---|---|---|---|---|
| ANN + Gaussian Blur (our algorithm) | 75.11% | 79.82% | 76.37% | 0.14 ms |
| DeepCirCode (deep learning) | 81.29% | 92.71% | 83.65% | 6.74 ms |
| Support Vector Machine (traditional ML) | 73.28% | 77.42% | 79.21% | 0.84 ms |
| Random Forest (ensemble) | 71.86% | 73.93% | 79.18% | 0.62 ms |
| Att-CNN (attention model) | 72.64% | 74.52% | 76.96% | 23.67 ms |

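Our algorithm achieves competitive accuracy (75.11%, second only to DeepCirCode at 81.29%) while predicting far faster than the deep learning approaches (0.14 ms versus 6.74 ms for DeepCirCode and 23.67 ms for Att-CNN), making it well suited to real-time applications.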

Performance Highlight

The proposed method ran significantly faster than other existing algorithms, taking only 0.14 milliseconds to predict one RNA sequence. It was at least 10× faster than similar models such as DeepCirCode and circDeep.
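A per-sequence runtime of this kind is typically obtained by averaging wall-clock time over many single-sample predictions. The sketch below shows one way to do that. It is not the study's benchmarking code: `mean_runtime_ms` and `dummy_predict` are hypothetical names, and the stand-in predictor does no real inference. The last two lines work out the speed-ups implied by the runtimes reported in the comparison table.

```python
import time


def mean_runtime_ms(predict, samples, repeats=100):
    """Average wall-clock time per single-sample prediction, in milliseconds."""
    predict(samples[0])  # warm-up call so one-time setup cost is not measured
    start = time.perf_counter()
    for _ in range(repeats):
        for sample in samples:
            predict(sample)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / (repeats * len(samples))


def dummy_predict(features):
    """Stand-in for a trained model's predict function (hypothetical)."""
    return sum(features) > 0


samples = [[float(i % 7)] * 256 for i in range(10)]  # made-up 256-value vectors
per_sample_ms = mean_runtime_ms(dummy_predict, samples)

# Speed-ups implied by the reported runtimes.
speedup_vs_deepcircode = 6.74 / 0.14  # about 48x
speedup_vs_att_cnn = 23.67 / 0.14     # about 169x
```

Measured times depend on hardware and batch size, so comparisons between models are only meaningful when all of them are timed under the same conditions.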

The objective of this experiment was to create a machine-learning model, based on an artificial neural network, that can detect and classify different types of circular RNA with high accuracy. The results show that the designed network detected and classified the circular RNA dataset with an overall accuracy of 0.7511 and an F1-score of 0.7637.
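The evaluation scores discussed here can be computed from true and predicted labels as sketched below. The reported recall (0.7511) equals the reported accuracy, which is what support-weighted averaging always produces. That the study used this averaging is an assumption, and the labels in the example are made up for illustration.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall and F1."""
    n = len(y_true)
    support = Counter(y_true)
    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        prec_c = tp / predicted if predicted else 0.0
        rec_c = tp / count
        f1_c = 2 * prec_c * rec_c / (prec_c + rec_c) if prec_c + rec_c else 0.0
        weight = count / n  # class share of the true labels
        precision += weight * prec_c
        recall += weight * rec_c
        f1 += weight * f1_c
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for illustration only; not data from the study.
y_true = ["circ", "circ", "circ", "linear", "linear", "linear", "linear", "circ"]
y_pred = ["circ", "linear", "circ", "linear", "linear", "circ", "linear", "circ"]
scores = weighted_scores(y_true, y_pred)
```

Because weighted recall reduces to the fraction of correctly classified samples, precision and F1 are the more informative figures when comparing models.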

The similarity between the training and validation evaluation scores indicates that the detection model suffered from neither underfitting nor overfitting. However, the large gap between training and validation loss in the classification model indicates that overfitting occurred during its training.

Key Advantage

The proposed algorithm would likely require less computing power and be simpler to implement than those in previous research, which would make it less costly and more efficient to deploy.

Runtime analysis also shows that the proposed method is more efficient in both time and algorithmic complexity, and therefore needs less computational power. The proposed algorithm thus offers comparable accuracy at a lower computational cost, which makes it a strong candidate for commercialization and wide deployment.

This study aimed to create an algorithm based on neural networks to detect and classify non-coding circular RNA with a high accuracy score. We implemented an ANN as the detection and classification model and evaluated its accuracy, precision, recall, and f1-score.

The results demonstrate that the proposed method is comparable to similar algorithms: it achieved better evaluation scores than most other existing algorithms while requiring significantly fewer computational resources.

In conclusion, the proposed algorithm was able to detect and classify the non-coding circular RNA with a high accuracy score and exceptional computational efficiency.