Cancer Leading Mutation DNA of P-53 Gene
Characteristic Analysis using Genetic Algorithms in a Closed Environment Model
Introduction
The P-53 tumor suppressor gene is one of the most extensively studied genes in cancer research, with numerous studies implicating mutations in the gene as a key driver of cancer development and progression (Soussi & Wiman, 2015). The P-53 gene is a critical component of the cell cycle mechanism: it regulates cell growth and division and acts as a checkpoint for DNA damage and other cellular stresses.
When the gene is mutated, it loses its ability to regulate the cell cycle, allowing for uncontrollable growth and proliferation of cancer cells. Several types of mutation have been identified in the P-53 gene, including point mutations, deletions, and insertions.
Loss of P-53 function is among the most common contributors to cancer and is often a prerequisite for tumor development, because the intact gene suppresses mitosis and cell growth so that the cell can repair its DNA (Di Leo et al., 2007).
Mutations in P-53 usually abrogate its tumor suppression function and foster cancer cell growth (Zambetti, 2007). A mutation is a change in genetic material that alters the structure or number of genes or entire chromosomes (Johnston, 2006). Spontaneous mutations result from errors during DNA replication, recombination, spontaneous lesions, and transposable elements (Vogelstein et al., 2000).
While the significance of these mutations in cancer development is well established, much remains unknown about the underlying mechanisms and how these mutations affect the structure and function of the P-53 protein. This paper aims to analyze different sequences of the P-53 tumor suppressor gene to find a common attribute using genetic algorithms in a closed environment model.
By applying genetic algorithms in a closed environment model, we can gain insights into the characteristics of P-53 mutations and their potential implications in cancer development. Genetic algorithms provide a computational approach to simulate evolutionary processes, enabling us to analyze complex patterns and identify common attributes in mutated P-53 gene sequences (Eiben & Smith, 2015).
We used a closed environment model to create a simulated environment that mimics the conditions within a cell. This environment includes the components needed for DNA replication, transcription, translation, and other cellular processes relevant to the functioning of the P-53 gene. Within this controlled setting, we can observe how different P-53 gene sequences interact with cellular components and track the behavior of mutated sequences in response to various stimuli.
Methods
The self-organizing map (SOM), also called a self-organizing matrix or Kohonen map, is a topographically organized vector quantization algorithm. We used it to determine the specific characteristics of DNA strands along the cancer-leading mutation path. All methods were implemented in the Python programming language using available open source libraries.
Data Gathering
The data, consisting of DNA strands of the P-53 tumor suppressor gene, were gathered from the NCBI database. The collected data comprise 25 DNA strands, each 2509 characters long, and each strand was saved as a separate txt file.
The data were then filtered to remove punctuation and spaces. In addition, a sample of cancerous and parental strand pairs was collected to determine the Levenshtein similarity value that characterizes cancerous child strands.
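The filtering step can be sketched as follows. This is a minimal illustration, not the code from the repository: the function name `clean_sequence` is hypothetical, and lowercasing the bases is an assumption made so that the output matches the lowercase k-mers reported in the results.

```python
import string

# Translation table that deletes all ASCII punctuation and whitespace.
_DROP = str.maketrans("", "", string.punctuation + string.whitespace)


def clean_sequence(raw: str) -> str:
    """Remove punctuation and whitespace from a raw strand.

    Lowercasing is an assumption of this sketch, not a step
    stated in the paper.
    """
    return raw.translate(_DROP).lower()


if __name__ == "__main__":
    print(clean_sequence("CAG CC,\nagc-ca."))
```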
Generate Mutational Pathway
A tree is a data structure used to maintain ordered data. A tree is either empty or consists of nodes, some of which are leaves. A binary tree is a tree in which each node has at most two children, a left child and a right child.
A binary tree was used to represent the self-multiplication of cells through mitosis. In this paper, each node in the binary tree contains an object that holds the ID, generation number, DNA strand, and cancer status of the node.
To build the tree, a generative recursion algorithm generated the mutational pathway by adding a new left and right child to each node until the 14th generation, with the 25 DNA strands as roots.
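The recursive construction described above might look like the sketch below. Several details are assumptions, since the text does not specify them: the `mutate` operator (random base substitution at a fixed rate), the rate itself, and the convention that the root is generation 1. The rule that sets the cancer status is not shown, so `is_cancer` simply defaults to `False`.

```python
import itertools
import random
from dataclasses import dataclass
from typing import Iterator, Optional

BASES = "acgt"


@dataclass
class Node:
    node_id: int
    generation: int
    dna: str
    is_cancer: bool = False
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def mutate(dna: str, rate: float, rng: random.Random) -> str:
    """Hypothetical mutation operator: each base is replaced by a random
    base with probability `rate` (the draw may return the same base)."""
    return "".join(rng.choice(BASES) if rng.random() < rate else b for b in dna)


def grow(node: Node, max_generation: int, rate: float,
         rng: random.Random, ids: Iterator[int]) -> None:
    """Recursively add a left and right child until max_generation."""
    if node.generation >= max_generation:
        return
    for side in ("left", "right"):
        child = Node(next(ids), node.generation + 1, mutate(node.dna, rate, rng))
        setattr(node, side, child)
        grow(child, max_generation, rate, rng, ids)


def build_tree(root_dna: str, max_generation: int = 14,
               rate: float = 0.001, seed: int = 0) -> Node:
    rng = random.Random(seed)
    ids = itertools.count()
    root = Node(next(ids), 1, root_dna)  # assumption: root is generation 1
    grow(root, max_generation, rate, rng, ids)
    return root


def count_nodes(node: Optional[Node]) -> int:
    if node is None:
        return 0
    return 1 + count_nodes(node.left) + count_nodes(node.right)
```

Under the generation-1 convention, a tree grown to generation g is a full binary tree with 2^g - 1 nodes, so each root strand would yield 16383 nodes at g = 14.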
Filter Cancer Leading Mutations
Depth-first search (DFS) is a traversal algorithm that starts from the root node and explores as far as possible along each branch, visiting the leftmost unvisited child first, before backtracking. This paper used depth-first search to find the path leading to a cancerous node, then saved the path as a list of nodes.
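A minimal sketch of such a search is given below, assuming a node type with `is_cancer`, `left`, and `right` fields. The `Node` class and the function name `find_cancer_path` are illustrative, and returning the first cancerous node reached is an assumption about how the path was chosen.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    node_id: int
    dna: str
    is_cancer: bool = False
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def find_cancer_path(root: Optional[Node]) -> List[Node]:
    """Return the root-to-node path of the first cancerous node reached
    by a left-first depth-first search, or an empty list if none exists."""
    if root is None:
        return []
    stack = [(root, [root])]
    while stack:
        node, path = stack.pop()
        if node.is_cancer:
            return path
        # Push the right child first so the left child is popped first.
        for child in (node.right, node.left):
            if child is not None:
                stack.append((child, path + [child]))
    return []
```

An explicit stack is used instead of recursion so that deep trees cannot exceed the interpreter's recursion limit.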
After the list was built, the DNA strands of its nodes were extracted and their Levenshtein similarities were computed. The resulting list of similarities was used to find the most mutated node on the path, excluding the cancerous node.
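The similarity step can be illustrated as follows. Two choices here are assumptions, because the text does not define them: similarity is normalized as 1 minus the edit distance divided by the longer length, and each strand is compared with its predecessor on the path. The helper name `most_mutated_index` is hypothetical.

```python
from typing import List, Optional


def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance using two rows."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / max(len(a), len(b))."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


def most_mutated_index(path_dna: List[str]) -> Optional[int]:
    """Index of the strand least similar to its predecessor on the path,
    skipping the root (no predecessor) and the final, cancerous strand."""
    candidates = range(1, len(path_dna) - 1)
    if len(candidates) == 0:
        return None
    return min(candidates,
               key=lambda i: levenshtein_similarity(path_dna[i - 1], path_dna[i]))
```

The two-row formulation keeps memory proportional to the shorter strand, although the running time is still quadratic in strand length.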
DNA Encoding
Counting k-mers (substrings of length k in DNA sequence data) is an essential component of many methods in bioinformatics, including genome and transcriptome assembly, metagenomic sequencing, and error correction of sequence reads.
The average length of all DNA strands was computed, and its base-4 logarithm was used to choose k. For strands of 2509 characters, log4(2509) ≈ 5.65, which rounds down to k = 5 and is consistent with the 4^5 = 1024 possible k-mers. The k-mer counts were then saved as a comma-separated values (CSV) file, resulting in a matrix with 1024 columns and 73475 rows.
Clustering Analysis
Correlation Analysis
Correlation analysis estimates the strength of the relationship between any pair of variables. Because the encoding yields 1024 features, it is necessary to identify the most informative features in the data.
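One way to rank features by correlation is sketched below. The text does not state which variable the features were correlated against, so the numeric `target` (for example a cancer status label), the use of Pearson correlation, and the function names are all assumptions of this illustration.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def top_features(rows: List[Sequence[float]], target: Sequence[float],
                 names: List[str], n_top: int) -> List[str]:
    """Rank feature columns by absolute correlation with a hypothetical
    target vector and return the names of the n_top strongest."""
    columns = list(zip(*rows))
    scored = [(abs(pearson(col, target)), name)
              for col, name in zip(columns, names)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:n_top]]
```

Constant columns, such as k-mers that never occur, score 0.0 and therefore fall to the bottom of the ranking.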
SOM Clustering
The self-organizing map (SOM), or Kohonen map, is a topographically organized vector quantization algorithm. It computes a mapping from a high-dimensional space onto a regular two-dimensional grid such that positions that are close on the grid correspond to positions that are close in the original high-dimensional space. The grid is referred to below as the matrix.
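The idea can be illustrated with a minimal SOM on a 1×n grid, written in plain Python. This is a sketch under stated assumptions and not the implementation used in the study: the linear decay of the learning rate and neighborhood width, the Gaussian neighborhood, the initialization from random samples, and all parameter names are choices made for this example.

```python
import math
import random
from typing import List, Optional, Sequence


def best_matching_unit(weights: List[List[float]], x: Sequence[float]) -> int:
    """Index of the unit whose weight vector is closest to x."""
    return min(range(len(weights)),
               key=lambda u: sum((w - v) ** 2 for w, v in zip(weights[u], x)))


def train_som(data: List[Sequence[float]], n_units: int, epochs: int = 50,
              lr0: float = 0.5, sigma0: Optional[float] = None,
              seed: int = 0) -> List[List[float]]:
    """Train a 1 x n_units SOM and return one weight vector per unit."""
    rng = random.Random(seed)
    dim = len(data[0])
    if sigma0 is None:
        sigma0 = max(n_units / 2.0, 1.0)
    # Initialize each unit from a randomly chosen data sample.
    weights = [list(rng.choice(data)) for _ in range(n_units)]
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)
        for i in order:
            x = data[i]
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                # decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 1e-3   # shrinking neighborhood
            bmu = best_matching_unit(weights, x)
            for u in range(n_units):
                # Gaussian neighborhood measured along the 1-D grid.
                h = math.exp(-((u - bmu) ** 2) / (2.0 * sigma ** 2))
                for d in range(dim):
                    weights[u][d] += lr * h * (x[d] - weights[u][d])
            step += 1
    return weights
```

After training, each sample is assigned to the cluster of its best matching unit, so a 1×6 grid yields up to six clusters whose weight vectors serve as cluster centers.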
Results
Cluster Center Analysis
Cluster | cagcc | agcca | cccag | ccagc | ccagg | ttttt | ctttt |
---|---|---|---|---|---|---|---|
0 | 15.22389 | 12.09806 | 13.99139 | 11.97066 | 12.25152 | 29.22354 | 10.94611 |
1 | 13.48352 | 10.90663 | 12.52781 | 10.88192 | 10.90285 | 21.99507 | 9.096031 |
2 | 11.72513 | 9.39021 | 10.94836 | 9.488609 | 9.453751 | 20.30923 | 8.312927 |
3 | 9.718946 | 7.800927 | 9.07718 | 7.909327 | 7.984004 | 19.97388 | 7.916305 |
4 | 6.554556 | 5.575858 | 6.706779 | 5.629877 | 5.999755 | 14.50365 | 5.980873 |
5 | 5.049533 | 4.946125 | 5.794455 | 4.906471 | 5.353537 | 3.63425 | 2.674657 |
Matrix Parameters and Clustering Quality
Matrix Width | Matrix Height | Total Clusters | Silhouette Score | Davies-Bouldin Score |
---|---|---|---|---|
1 | 2 | 2 | 0.397634 | 1.080491 |
1 | 3 | 3 | 0.279982 | 1.067024 |
1 | 4 | 4 | 0.445013 | 0.856827 |
1 | 5 | 5 | 0.415871 | 0.870465 |
1 | 6 | 6 | 0.451032 | 0.804099 |
1 | 7 | 7 | 0.329884 | 1.094722 |
2 | 4 | 8 | 0.338642 | 1.020047 |
1 | 9 | 9 | 0.312648 | 1.333904 |
2 | 5 | 10 | 0.331954 | 1.130516 |
1 | 11 | 11 | 0.286358 | 1.459598 |
The analysis identified six distinct DNA sequence patterns ("cagcc", "agcca", "cccag", "ccagg", "ttttt", "ctttt") that show the highest correlation among cancer leading mutations, with the optimal clustering achieved using a 1×6 matrix configuration.
Discussion
The results show six k-mer patterns with the highest correlation among the cancer-leading mutations: "cagcc", "agcca", "cccag", "ccagg", "ttttt", and "ctttt". Each cluster has its own characteristic pattern profile, which distinguishes it from the other clusters. The silhouette scores indicate a moderate degree of cluster separation.
Across the trials, the silhouette score rises overall, though not monotonically, from the 1 by 2 matrix to the 1 by 6 matrix, then drops sharply at the 1 by 7 matrix and remains lower as the total number of clusters increases. The Davies-Bouldin score follows the same pattern in reverse, reaching its lowest value of 0.804099 at the 1 by 6 matrix.
The best matrix size is the 1 by 6 matrix, which has six clusters and the highest silhouette score of 0.451032. Because the matrix is small, every unit in the matrix can be treated as a cluster center, which can be shown with a histogram plot.
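For reference, the silhouette score used to compare matrix sizes can be computed as in the sketch below. It assumes Euclidean distance and assigns a score of 0 to points in single-member clusters. It is written in plain Python for clarity and is not the code used to produce the table.

```python
import math
from typing import Dict, List, Sequence


def euclidean(p: Sequence[float], q: Sequence[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_score(points: List[Sequence[float]], labels: List[int]) -> float:
    """Mean silhouette coefficient over all points.

    For each point, a is the mean distance to the other members of its
    cluster and b is the lowest mean distance to any other cluster;
    the coefficient is (b - a) / max(a, b).
    """
    clusters: Dict[int, List[int]] = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # single-member cluster contributes 0
        a = sum(euclidean(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(sum(euclidean(points[i], points[j]) for j in members) / len(members)
                for other, members in clusters.items() if other != label)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)
```

The score ranges from -1 to 1, with higher values indicating better separated clusters. This direct version is quadratic in the number of points, so for a matrix with tens of thousands of rows a library implementation such as `sklearn.metrics.silhouette_score` would be the practical choice.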
Because this paper is based on a mathematical model simulation, the research could be improved by expanding its scope to include other types of cancer. This could provide a better understanding of cancer-leading mutations that are common to all cancer types. Additionally, future research could apply other machine learning methods, which may separate the clusters more clearly and yield better cluster characteristics.
The main limitation of this study is that the sample size was small and the data were simulated rather than observed in nature. Naturally observed data would better represent the nature of the mutations themselves. The choice of specific algorithms or methods also introduces limitations, as different algorithms may produce different results. Acknowledging these limitations provides context for the results and guides future research.
Conclusion
Based on the results, this paper shows that cancer-leading mutations share characteristics that may mark the beginning of the mutation path to cancer. The implementation of genetic algorithms in a closed environment model has allowed us to identify specific DNA patterns associated with cancer-causing mutations in the P-53 gene.
Through our clustering analysis, we identified six distinct DNA sequence patterns that appear to be strongly correlated with cancer progression. This finding suggests that certain mutation patterns in the P-53 gene may be more likely to lead to cancer development than others.
Our approach demonstrates the potential of computational methods in understanding the genetic basis of cancer. By simulating the evolutionary processes of mutation within a controlled environment, we can gain insights into the complex mechanisms of cancer development that would be difficult to observe in traditional experimental settings.
References
Di Leo, A., Tanner, M., Desmedt, C., Paesmans, M., Cardoso, F., Durbecq, V., Chan, S., Perren, T., Aapro, M., Sotiriou, C., Piccart, M., Larsimont, D., Isola, J., & TAX 303 translational study team. (2007). p-53 gene mutations as a predictive marker in a population of advanced breast cancer patients randomly treated with doxorubicin or docetaxel in the context of a phase III clinical trial. Annals of Oncology, 18(6), 997-1003.
Vogelstein, B., Lane, D., & Levine, A. J. (2000). Surfing the p53 network. Nature, 408, 307-310.
Zambetti, G. P. (2007). The p53 mutation "gradient effect" and its clinical implications. Journal of Cellular Physiology, 213(2), 370-373.
Johnston, M. O. (2006). Mutation and new variation: Overview. Encyclopedia of Life Sciences.
Soussi, T., & Wiman, K. G. (2015). Shaping genetic alterations in human cancer: the p53 mutation paradigm. Cancer Cell, 28(2), 150-161.
Eiben, A. E., & Smith, J. (2015). From evolutionary computation to the evolution of things. Nature, 521(7553), 476-482.
Holland, J.H. (1975). Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence. University of Michigan Press.
Vellido, A., Gibert, K., Angulo, C., & Martín Guerrero, J. D. (2020). Advances in self-organizing maps, learning vector quantization, clustering and data visualization: Proceedings of the 13th international workshop, WSOM+ 2019. Springer International Publishing.
Source Code Repository
The complete source code and implementation details for this research are available at: https://github.com/Evintkoo/Vant-149-project