Research Paper

Deep Learning for Circular RNA Classification

Implementation with Gaussian Blur Data Preprocessing in Detection and Classification

Some RNAs loop back on themselves. We smooth the data with Gaussian blur, then let a small neural network spot and sort them.

A lightweight ANN pipeline that turns circAtlas k-mer frequencies into reliable disease predictions: fast enough for real-time screening, accurate enough for clinical relevance.

75.11% Model accuracy
0.14 ms Prediction runtime

Concept Overview

Introduction

Circular RNAs (circRNAs) are increasingly recognized as key regulators of gene expression due to their unique closed-loop structure and involvement in various cellular processes. This study investigates the utilization of machine learning algorithms in predicting circRNA-disease associations.

Most RNA in your cells is a straight strand. A small group, called circRNAs, loops into a closed ring instead. That shape lets them help your cells turn genes on and off in ways straight RNAs cannot. Our goal was simple: use machine learning to guess which circRNAs are tied to which diseases.

Methods

This study proposes a novel deep learning approach leveraging artificial neural networks (ANN) for circRNA classification. The methodology involves data collection from circRNA databases, k-mer counting for feature extraction, Gaussian blur for data smoothing, and ANN-based model training.

We built an artificial neural network (ANN), a machine-learning model loosely inspired by how brain cells connect. We grabbed circRNA sequences from public databases. Then we chopped each one into short fixed-length pieces called k-mers and counted how often each piece showed up. A Gaussian blur step gently smoothed those counts, like softening a photo, so the network could focus on real patterns instead of noise.

Results

Evaluation of the trained models based on precision, recall, and f1-score metrics shows an overall accuracy of 0.7511, with an average precision score of 0.7982, recall of 0.7511, and f1-score of 0.7637.

The finished model scored 0.7511 on accuracy, meaning it was right about three times out of four. Its precision was 0.7982, its recall 0.7511, and its f1-score 0.7637.
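To make the four scores concrete, here is a minimal sketch of how accuracy and support-weighted precision, recall, and f1-score can be computed from true and predicted labels. The function name `weighted_metrics` and the toy labels are hypothetical, and support-weighted averaging is an assumption: the study does not state its averaging scheme. One fact is certain, though: support-weighted recall always equals accuracy, which is consistent with both being reported as 0.7511.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    n = len(y_true)
    support = Counter(y_true)  # number of true examples per class
    pairs = list(zip(y_true, y_pred))
    scores = {
        "accuracy": sum(t == p for t, p in pairs) / n,
        "precision": 0.0,
        "recall": 0.0,
        "f1": 0.0,
    }
    for label, actual in support.items():
        tp = sum(t == label and p == label for t, p in pairs)
        predicted = sum(p == label for p in y_pred)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        weight = actual / n  # each class counts in proportion to its support
        scores["precision"] += weight * precision
        scores["recall"] += weight * recall
        scores["f1"] += weight * f1
    return scores

# Toy labels for illustration only (1 = circRNA, 0 = other RNA); not the study's data.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
metrics = weighted_metrics(y_true, y_pred)
print(metrics)
```

In this toy case precision comes out higher than recall while recall matches accuracy exactly, the same pattern as in the reported scores.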

Discussion

The results indicate that our ANN-based algorithm effectively detects and classifies circRNA datasets with considerable accuracy. Compared to algorithms from past research, our algorithm is also shown to have significantly lower computational requirements.

Our ANN spots and sorts circRNAs well. The bigger win: it does the job using far less computing power than earlier models. Same result, lighter machine.

Conclusion

Comparative analysis demonstrates improved performance over previous algorithms. Together with reduced computational requirements and a simpler implementation, this suggests the model has potential for widespread adoption.

Lined up against older approaches, our model comes out ahead. It needs less computing power and is simpler to set up, so labs and clinics can adopt it without a server farm.

Circular RNAs (circRNAs) represent a category of single-stranded RNA molecules characterized by the absence of 5′ caps and 3′ polyadenylated tails, forming a covalently closed continuous loop structure. Recent research indicates that circRNAs play a distinct role in regulating human gene expression.

A normal RNA strand has two distinct ends: a "head" called the 5′ cap and a "tail" called the 3′ poly-A tail. CircRNAs skip both. Their strand bends around and joins itself, forming a closed ring. Recent work shows this odd shape gives them a real job in steering how human genes turn on and off.

Key Insight

CircRNAs are emerging as valuable biomarkers due to their high stability and tissue-specific expression patterns, making them reliable indicators for various diseases.

CircRNAs are sturdy. They survive in the body longer than regular RNA. They also show up in different amounts in different tissues. Those two traits make them useful "biomarkers": natural signals doctors can read to spot disease early.

The rapid development of machine learning has enhanced the prediction of circRNA-disease associations. Current models for predicting these associations are often complex, which can lead to overfitting and lack of generalization. This study proposes a simpler model using Artificial Neural Networks (ANN) to achieve high accuracy with improved computational efficiency.

Machine learning has gotten better at guessing which circRNAs link to which diseases. But many of today's models are bloated. They memorize the training data instead of learning the real pattern (what researchers call "overfitting") and then stumble on new cases. We took the opposite path: a smaller, simpler ANN that stays accurate and runs lean.

Artificial Neural Network Architecture

ANN with 5 hidden layers and 256 input nodes for k-mer features

01

Dataset Collection

Data were collected from circAtlas (https://ngdc.cncb.ac.cn/circatlas/links1.php) and processed with the Python programming language. The data consisted of sequences of non-coding circular RNA and non-circular RNA (mRNA).

We pulled the sequences from circAtlas, an open database (https://ngdc.cncb.ac.cn/circatlas/links1.php), and processed them in Python. The collection mixed two kinds of RNA: the looped circRNAs we cared about, and ordinary straight-strand RNA (mRNA) so the model could learn to tell them apart.

Python circAtlas RNA Sequences
02

Feature Extraction with K-mers

K-mer counting is the process of counting every subsequence of length k within a set of RNA sequences. The length is chosen by the user; here k = 4, based on this formula:

A k-mer is just a small chunk of an RNA strand, k letters long. We slide along each sequence and count how often every chunk appears, like counting how often every four-letter word shows up in a book. We picked k = 4 using the formula below:

$k = \log_k(\bar{l})$, where $l$ is the sequence length and $\bar{l}$ is its average over the dataset

With k = 4, the sequences yield 4-mers such as {ACGT, GTAA, CGTT}, which were then processed.

At k = 4, you end up with four-letter chunks like {ACGT, GTAA, CGTT}. Those counts become the numbers we feed into the network.
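The counting step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the study's code: the names `kmer_vocabulary` and `kmer_counts` are hypothetical, the alphabet is assumed to be A/C/G/T as in the examples above (with U mapped to T), and windows are assumed to overlap. With four letters and k = 4 there are 4^4 = 256 possible chunks, which matches the 256 input nodes of the network.

```python
from itertools import product

ALPHABET = "ACGT"  # assumption: DNA-style letters, as in the 4-mer examples

def kmer_vocabulary(k=4):
    """Return all 4**k possible k-mers in a fixed lexicographic order."""
    return ["".join(letters) for letters in product(ALPHABET, repeat=k)]

def kmer_counts(sequence, k=4):
    """Slide a window of length k over the sequence and count each k-mer."""
    vocabulary = kmer_vocabulary(k)
    position = {kmer: i for i, kmer in enumerate(vocabulary)}
    counts = [0] * len(vocabulary)
    cleaned = sequence.upper().replace("U", "T")  # assumption: treat U as T
    for start in range(len(cleaned) - k + 1):
        window = cleaned[start:start + k]
        if window in position:  # skip windows with ambiguous symbols such as N
            counts[position[window]] += 1
    return counts

features = kmer_counts("ACGTACGTTA")
print(len(features))                    # one slot per possible 4-mer
print(features[kmer_vocabulary().index("ACGT")])
```

Keeping the vocabulary in a fixed order means every sequence maps to a vector with the same layout, so position i always refers to the same 4-mer.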

K-mer Analysis Feature Engineering
03

Gaussian Blur Implementation

Blurring is a data-processing technique that smooths the collected values in a dataset. This study applied a Gaussian function to smooth the vectorized non-coding RNA sequences using this formula:

Blurring is a trick from photo editing: it softens sharp jumps in the data so real patterns stand out over random noise. We applied a Gaussian blur (the same gentle smoothing used on images) to our k-mer counts. The formula sits below:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$$

The value of sigma was 1 because it provides a moderate amount of blurring, computational efficiency, and flexibility for adjustment.
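A minimal sketch of a one-dimensional Gaussian blur with sigma = 1 follows. Several details are assumptions the study does not spell out: the kernel is centered (mu = 0), truncated at three sigma, renormalized to sum to 1, and the ends of the vector are padded by repeating the edge value. The function names are hypothetical.

```python
import math

def gaussian_kernel(sigma=1.0, radius=None):
    """Sample the Gaussian density (mu = 0) at integer offsets, normalized to sum to 1."""
    if radius is None:
        radius = int(math.ceil(3 * sigma))  # assumption: truncate at 3 sigma
    weights = [
        math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
        for x in range(-radius, radius + 1)
    ]
    total = sum(weights)
    return [w / total for w in weights]

def gaussian_blur(values, sigma=1.0):
    """Smooth a 1-D vector; positions past either end reuse the nearest edge value."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(values)
    smoothed = []
    for i in range(n):
        acc = 0.0
        for offset, weight in zip(range(-radius, radius + 1), kernel):
            j = min(max(i + offset, 0), n - 1)  # clamp the index into range
            acc += weight * values[j]
        smoothed.append(acc)
    return smoothed

counts = [0, 0, 0, 10, 0, 0, 0]  # toy vector with a single spike
blurred = gaussian_blur(counts, sigma=1.0)
print([round(v, 2) for v in blurred])
```

The spike is spread symmetrically over its neighbors while the total stays the same, which is the "softening" effect described above.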
Data Preprocessing Gaussian Blur Signal Processing
04

Model Architecture

The Artificial Neural Network (ANN) consists of an input layer (256 nodes, one per possible 4-mer, since 4^4 = 256), 5 hidden layers, and an output layer. Each hidden layer is followed by a dropout layer to prevent overfitting and a Leaky ReLU activation to add non-linearity.

Think of the network as a stack of filters. The first layer takes in 256 numbers (our k-mer counts). Five middle layers then refine the signal, and a final layer spits out the answer. After each middle layer, two helpers kick in: a "dropout" step that randomly silences some neurons during training so the network doesn't just memorize, and a "Leaky ReLU" step that lets the network learn curved, more complex patterns, not just straight lines.
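The layer order described above (hidden layer, then dropout, then Leaky ReLU, repeated five times) can be sketched without any framework. Everything numeric here is an assumption for illustration: the hidden widths in `LAYER_SIZES`, the two output nodes, the dropout rate of 0.2, the Leaky ReLU slope of 0.01, and the random initialization are not taken from the study, and the sketch covers only the forward pass, not training.

```python
import random

# Hypothetical sizes: 256 inputs, five hidden layers, 2 outputs.
LAYER_SIZES = [256, 128, 64, 32, 16, 8, 2]

def leaky_relu(x, slope=0.01):
    """Pass positives through; scale negatives by a small slope."""
    return x if x > 0 else slope * x

def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def dropout(values, rate, rng, training):
    """Inverted dropout: zero units at random in training, identity otherwise."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

def build_network(sizes, seed=0):
    """Create random weights and zero biases for each consecutive layer pair."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        scale = (2.0 / n_in) ** 0.5
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers

def forward(network, features, rate=0.2, training=False, seed=0):
    """Run hidden layer -> dropout -> Leaky ReLU five times, then the output layer."""
    rng = random.Random(seed)
    activations = list(features)
    *hidden, output = network
    for weights, biases in hidden:
        activations = dense(activations, weights, biases)
        activations = dropout(activations, rate, rng, training)
        activations = [leaky_relu(a) for a in activations]
    return dense(activations, *output)

network = build_network(LAYER_SIZES)
logits = forward(network, [1.0 / 256] * 256)
print(len(logits))
```

In practice a deep learning library would handle the weights and training loop; the plain-Python version is only meant to show how the pieces stack.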

Neural Network Deep Learning Leaky ReLU Dropout

Our neural network models for both detection and classification showed promising training progress with the following performance metrics:

Both jobs (spotting circRNAs and sorting them by type) went smoothly during training. Here are the scores:

Model Performance Comparison

Evaluation metrics and runtime performance across tested algorithms.

| Model | Type | Accuracy | Precision | F1-Score | Runtime |
|---|---|---|---|---|---|
| ANN + Gaussian Blur | Our Algorithm | 75.11% | 79.82% | 76.37% | 0.14 ms |
| DeepCirCode | Deep Learning | 81.29% | 92.71% | 83.65% | 6.74 ms |
| Support Vector Machine | Traditional ML | 73.28% | 77.42% | 79.21% | 0.84 ms |
| Random Forest | Ensemble | 71.86% | 73.93% | 79.18% | 0.62 ms |
| Att-CNN | Attention Model | 72.64% | 74.52% | 76.96% | 23.67 ms |

Our algorithm achieves competitive performance while being significantly faster than deep learning approaches, making it ideal for real-time applications.

Performance Highlight

The proposed method's runtime was significantly faster than other existing algorithms, taking only 0.14 milliseconds to predict one RNA. It was at least 10× faster than similar models like DeepCirCode and circDeep.

Speed is where this approach really pulls ahead. Checking a single RNA takes just 0.14 milliseconds, far quicker than a blink. That makes the model at least 10× faster than similar tools like DeepCirCode and circDeep.
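The speed-up factors implied by the runtime column of the comparison table can be worked out directly. The sketch below only divides the reported per-prediction runtimes by the ANN's 0.14 ms; the dictionary and function names are hypothetical, and circDeep is left out because the table gives no runtime for it.

```python
# Per-prediction runtimes in milliseconds, copied from the comparison table.
RUNTIME_MS = {
    "ANN + Gaussian Blur": 0.14,
    "DeepCirCode": 6.74,
    "Support Vector Machine": 0.84,
    "Random Forest": 0.62,
    "Att-CNN": 23.67,
}

def speedups(runtimes, reference="ANN + Gaussian Blur"):
    """Return how many times slower each model is than the reference model."""
    base = runtimes[reference]
    return {name: ms / base for name, ms in runtimes.items() if name != reference}

factors = speedups(RUNTIME_MS)
for name, factor in sorted(factors.items(), key=lambda item: item[1]):
    print(f"{name}: {factor:.1f}x the ANN's runtime")
```

By these figures the two deep models in the table are roughly 48× (DeepCirCode) and 169× (Att-CNN) slower than the ANN, while the Support Vector Machine and Random Forest are about 6× and 4× slower.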

The objective of this experiment was to create a machine-learning model that can detect and classify different types of circular RNA with a high accuracy score by using an artificial neural network. The results show that our designed neural network was able to detect and classify the circular RNA dataset with good evaluation scores.

The plan was to build an ANN that could both spot circRNAs and sort them into types, without losing accuracy. The numbers say it worked: the network scored well on both jobs.

The similarity between the training and validation scores shows that the detection model suffered from neither underfitting nor overfitting. However, the large gap between the training and validation loss in the classification model indicates that overfitting occurred during its training.

For the detection model, scores on the training data and the held-out validation data lined up nicely, a sign the network learned the right pattern instead of memorizing or guessing. The classification model was a different story. Its training loss and validation loss ended up far apart, the classic fingerprint of overfitting: it had memorized examples instead of generalizing.
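The check described above, comparing training and validation loss, can be written as a small helper. This is a sketch under stated assumptions: the function names and the tolerance of 0.1 are hypothetical, and the loss curves are made-up numbers chosen to mimic the two situations, not values from the study.

```python
def generalization_gap(train_losses, val_losses):
    """Validation loss minus training loss at the final epoch."""
    return val_losses[-1] - train_losses[-1]

def looks_overfit(train_losses, val_losses, tolerance=0.1):
    """Flag a run whose final validation loss exceeds training loss by more than `tolerance`."""
    return generalization_gap(train_losses, val_losses) > tolerance

# Made-up loss curves for illustration only; these are NOT the study's numbers.
detection_like = ([0.70, 0.55, 0.48, 0.45], [0.72, 0.58, 0.50, 0.47])
classification_like = ([1.20, 0.80, 0.50, 0.30], [1.25, 0.95, 0.90, 0.92])

print(looks_overfit(*detection_like))       # curves stay close together
print(looks_overfit(*classification_like))  # validation loss stalls while training loss keeps falling
```

A fixed tolerance is a crude rule of thumb; in practice the gap is usually judged by looking at both curves over the whole training run.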

Key Advantage

The proposed algorithm would likely have lower computing requirements and a simpler implementation than those in previous research, which would make it less costly and more efficient to deploy.

Our algorithm asks less from the machine running it, and it is easier to set up than earlier models. That means cheaper hardware, quicker deployment, and lower running costs.

Runtime analysis also shows that the proposed method is more efficient in both time and algorithmic complexity, so it needs less computational power. The proposed algorithm therefore offers comparable accuracy at a lower computational cost, which makes it well suited to commercialization and wide deployment.

Timing tests back this up. The new approach is faster and structurally simpler, so it needs less computing power per prediction. You get accuracy on par with the heavier models at a fraction of the cost, a strong case for putting this into real products and using it widely.

This study aimed to create an algorithm based on neural networks to detect and classify non-coding circular RNA with a high accuracy score. We implemented an ANN as the detection and classification model and evaluated its accuracy, precision, recall, and f1-score.

The goal was a neural network that could both spot non-coding circRNAs and sort them into types, accurately. One ANN handled both jobs. We graded it on four standard yardsticks: accuracy, precision, recall, and f1-score.

The results demonstrate that the proposed method was comparable to other similar algorithms. However, our algorithm was determined to have a better evaluation score compared to most other existing algorithms while requiring significantly less computational resources.

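Head-to-head, our method holds its own against similar algorithms. On accuracy and precision it outscores three of the four baselines in the comparison table (only DeepCirCode does better), though its F1-score is the lowest of the five. It does this with far less computing power.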

In conclusion, the proposed algorithm was able to detect and classify the non-coding circular RNA with a high accuracy score and exceptional computational efficiency.

Bottom line: the algorithm spots and sorts non-coding circRNAs accurately, and it does the work with unusually little computing power.

Related Research