BICOB2022: Papers with Abstracts

Papers
Abstract. Alternative splicing (AS) increases the diversity of transcriptomes and proteomes in plants. This work reports the identification and analysis of genes and their transcripts, with a focus on AS in soybean, by integrating mapping information from over 1.5 million mRNAs and expressed sequence tags (ESTs) with more than 6 billion mapped reads collected from 90 RNA-seq datasets obtained from multiple experiments. A total of 294,164 AS events were detected and categorized into basic events (151,710, 51.57%) and complex events (142,454, 48.43%). The basic AS events include intron retention (18.52%), alternative acceptor sites (16.33%), alternative donor sites (8.99%), and exon skipping (7.73%). The AS rate in intron-containing genes was estimated to be ~56.3% in soybean based on the current analysis. In addition, a total of 41,453 new genomic loci, which were not previously annotated in the genome, were detected by mapping transcripts to the genome. The annotated data can be accessed through a public database for searching and downloading. This work provides a resource for further detailed functional analysis of gene products in soybean plants.
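The counts and percentages reported above can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the abstract; the variable names are illustrative.

```python
# Figures quoted in the abstract above.
total_events = 294_164
basic_events = 151_710
complex_events = 142_454

# Shares of the basic AS event types, as percentages of all AS events.
basic_shares = {
    "intron retention": 18.52,
    "alternative acceptor sites": 16.33,
    "alternative donor sites": 8.99,
    "exon skipping": 7.73,
}

# Percentage of basic and complex events among all detected events.
basic_pct = round(100 * basic_events / total_events, 2)
complex_pct = round(100 * complex_events / total_events, 2)

# The four basic event types should add up to the basic-event share.
basic_sum = round(sum(basic_shares.values()), 2)

print(basic_pct, complex_pct, basic_sum)
```

The three values agree with the abstract: basic events are 51.57% of the total, complex events are 48.43%, and the four basic event types sum to 51.57%.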
Abstract. Interacting systems such as gene regulatory networks have the ability to respond to individual component changes, propagate these changes throughout the network, and affect the temporal trajectories of other network elements. Causality techniques are frequently employed to investigate the interconnection between variables in complex dynamical systems. However, the vast majority of causality models are rooted in regression techniques from the time-series framework, such as Vector Autoregression models and Bootstrap Elastic net regression, and there is very limited research in the space of deep learning, particularly graph neural networks. In this paper, we explore in more depth the concept of Granger causality in deep learning and propose a Granger causality deep learning framework that uses graph convolutions, LSTM, and nonlinear penalties to learn causal relationships between temporal elements in gene regulatory networks. The proposed deep learning architecture for studying causality in dynamic networks achieves strong results on simulated networks as well as on the more challenging DREAM3 gene regulatory network time-series datasets.
Abstract. Deep learning research, from ResNet to AlphaFold2, convincingly shows that deep learning can predict the native conformation of a given protein sequence with high accuracy. Accounting for the plasticity of protein molecules remains challenging, and powerful algorithms are needed to sample the conformation space of a given amino-acid sequence. In the complex and high-dimensional energy surface that accompanies this space, it is critical to explore a broad range of areas. In this paper, we present a novel evolutionary algorithm that guides its optimization process with a memory of the explored conformation space, so that it can avoid searching already explored regions and search in the unexplored regions. The algorithm periodically consults an evolving map that stores already sampled non-redundant conformations to enhance exploration during selection. Evaluation on diverse datasets shows superior performance of the algorithm over the state-of-the-art algorithms.
Abstract. Genome rearrangement distance problems are used to infer the evolutionary distance between genomes. These problems consider the number of mutations, called rearrangement events, necessary to transform one genome into another. Two commonly studied rearrangements are the reversal, which inverts a sequence of genes, and the transposition, which exchanges two consecutive sequences of genes. Seminal works on this topic looked only at the sequence of genes and assumed that no gene has more than one copy. More realistic models allow multiple copies of a gene or take the number of nucleotides in intergenic regions into account. This work combines these two generalizations, defining the Signed Intergenic Reversal Distance (SIRD) and the Signed Intergenic Reversal and Transposition Distance (SIRTD) problems. Using a relationship with a problem called Signed Minimum Common Intergenic String Partition, we show Θ(k)-approximation algorithms for the SIRD and the SIRTD problems, where k is the maximum number of copies of a gene in the genomes. Our experimental tests on simulated genomes show that the algorithms tend to find low distances despite the high theoretical approximation factor.
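As an illustration of the two rearrangement events described above, the following sketch applies a signed reversal and a transposition to a genome represented as a list of signed integers. This is a simplification: intergenic region sizes and duplicated genes, which are central to SIRD and SIRTD, are not modeled, and the function names and index conventions are assumptions made for this example.

```python
from typing import List


def reversal(genome: List[int], i: int, j: int) -> List[int]:
    """Reverse genome[i:j] and flip the sign (orientation) of each gene in it."""
    if not (0 <= i < j <= len(genome)):
        raise ValueError("require 0 <= i < j <= len(genome)")
    segment = [-g for g in reversed(genome[i:j])]
    return genome[:i] + segment + genome[j:]


def transposition(genome: List[int], i: int, j: int, k: int) -> List[int]:
    """Exchange the consecutive segments genome[i:j] and genome[j:k]."""
    if not (0 <= i < j < k <= len(genome)):
        raise ValueError("require 0 <= i < j < k <= len(genome)")
    return genome[:i] + genome[j:k] + genome[i:j] + genome[k:]


genome = [1, 2, 3, 4, 5]
print(reversal(genome, 1, 4))          # [1, -4, -3, -2, 5]
print(transposition(genome, 1, 3, 5))  # [1, 4, 5, 2, 3]
```

A rearrangement distance is then the minimum number of such operations needed to turn one genome into another.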
Abstract. The medical history information contained in electronic health records (EHR) is a valuable and largely untapped data mining source for predicting patient outcomes and thereby improving treatment. This paper presents a simple but novel evolutionary algorithm (EA) for identifying how various medical history and demographic factors predict clinical outcomes. For this initial study, our EA was tested using synthetic data concerning COVID-19 hospitalization rates and we show that the EA results are more informative than logistic regression, neural network, or decision tree results.
Abstract. ResNet and, more recently, AlphaFold2 have demonstrated that deep neural networks can now predict a tertiary structure of a given protein amino-acid sequence with high accuracy. This seminal development will allow molecular biology researchers to advance various studies linking sequence, structure, and function. Many studies will undoubtedly focus on the impact of sequence mutations on stability, fold, and function. In this paper, we evaluate the ability of AlphaFold2 to predict accurate tertiary structures of wildtype and mutated sequences of protein molecules. We do so on a benchmark dataset in mutation modeling studies. Our empirical evaluation utilizes global and local structure analyses and yields several interesting observations. It shows, for instance, that AlphaFold2 performs similarly on wildtype and variant sequences. The placement of the main chain of a protein molecule is highly accurate. However, while AlphaFold2 reports similar confidence in its predictions over wildtype and variant sequences, its performance on placements of the side chains suffers in comparison to main-chain predictions. The analysis overall supports the premise that AlphaFold2-predicted structures can be utilized in further downstream tasks, but that further refinement of these structures may be necessary.
Abstract. Since late 2019, the SARS-CoV-2 virus has spread globally, giving rise to several variants over time. These variants differ from the original sequence identified in Wuhan and thus risk compromising the efficacy of the vaccines developed. Some software has been released to recognize currently known and newly spread variants. However, some of these tools are not entirely automatic, and others do not return a detailed characterization of all the mutations in the samples. Such a characterization can help biologists understand the variability between samples. This paper presents a Machine Learning (ML) approach to identifying existing and new variants completely automatically. In addition, a detailed table showing all the alterations and mutations found in the samples is provided as output to the user. SARS-CoV-2 sequences are obtained from the GISAID database, and a list of custom-designed features (e.g., the number of mutations in each gene of the virus) is used to train the algorithm. Existing variants are recognized by a Random Forest classifier, while newly spread variants are identified by the DBSCAN algorithm. Both the Random Forest and DBSCAN techniques demonstrated high precision on a new variant that arose during the drafting of this paper (used only in the testing phase of the algorithm). Therefore, researchers will significantly benefit from the proposed algorithm and the detailed output with the main alterations of the samples.
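One of the features mentioned above is the number of mutations in each gene. A minimal sketch of that count is given below, assuming the sample is already aligned to the reference with no insertions or deletions. The gene names, coordinates, and sequences are toy values invented for the example, not real SARS-CoV-2 annotations, and this is not the authors' implementation.

```python
from typing import Dict, Tuple


def mutations_per_gene(
    reference: str,
    sample: str,
    genes: Dict[str, Tuple[int, int]],
) -> Dict[str, int]:
    """Count substitutions per gene.

    `genes` maps a gene name to (start, end) with `end` exclusive
    (hypothetical coordinates for illustration).
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return {
        name: sum(
            1 for r, s in zip(reference[start:end], sample[start:end]) if r != s
        )
        for name, (start, end) in genes.items()
    }


# Toy data: three substitutions, one in geneA and two in geneB.
reference = "ATGGCTAGCTTACGA"
sample = "ATGACTAGCTTTCGG"
toy_genes = {"geneA": (0, 6), "geneB": (6, 15)}

features = mutations_per_gene(reference, sample, toy_genes)
print(features)  # {'geneA': 1, 'geneB': 2}
```

Feature vectors of this kind could then be passed to a classifier and a clustering algorithm, as in the Random Forest and DBSCAN pipeline the abstract describes.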
Abstract. An organism’s transcriptome is the set of all transcripts within a cell at a certain time. We often analyze the transcriptome by quantifying gene expression and performing subsequent analyses such as differential expression or network analysis. Such analysis helps us understand and interpret the functional elements of the genome. Many challenges limit the ability to map all RNA-Seq reads correctly to the genome sequence. These challenges include reads that fall at exon junctions, reads containing polymorphisms or multiple insertions or deletions, and reads falling partially or wholly within introns. One of the most significant problems is the loss of data that occurs when reads align to multiple genomic locations and cannot be mapped, sometimes called ambiguous sequence mappings. In this paper, we present a novel method that improves the accuracy of gene expression estimation by using a statistical approach to map ambiguous reads to their proper locations within the genome. This approach allows us to better identify significantly expressed genomic locations, so that we can map ambiguous reads to their most likely genomic locations and define more precisely which genes are expressed throughout the genome. Due to its statistical nature, the approach can also be easily combined with other existing mapping tools and mechanisms.
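The abstract does not specify the statistical model. As a generic illustration only, the sketch below shows one common heuristic for ambiguous reads: each multi-mapped read is split among its candidate loci in proportion to the uniquely mapped read counts at those loci. The function name, input layout, and counts are hypothetical and should not be read as the authors' method.

```python
from typing import Dict, Iterable, Sequence


def allocate_ambiguous(
    unique_counts: Dict[str, float],
    ambiguous_reads: Iterable[Sequence[str]],
) -> Dict[str, float]:
    """Distribute each ambiguous read across its candidate loci.

    Each read contributes a total weight of 1, split in proportion to the
    uniquely mapped counts of its candidate loci.
    """
    totals = dict(unique_counts)
    for candidates in ambiguous_reads:
        weights = [unique_counts.get(locus, 0.0) for locus in candidates]
        weight_sum = sum(weights)
        for locus, weight in zip(candidates, weights):
            # Fall back to an even split when no candidate has unique support.
            if weight_sum > 0:
                share = weight / weight_sum
            else:
                share = 1.0 / len(candidates)
            totals[locus] = totals.get(locus, 0.0) + share
    return totals


# Toy example: four reads that map equally well to locusA and locusB.
unique = {"locusA": 30.0, "locusB": 10.0}
multi = [("locusA", "locusB")] * 4

estimates = allocate_ambiguous(unique, multi)
print(estimates)  # {'locusA': 33.0, 'locusB': 11.0}
```

Because the allocation works on counts rather than on the alignment itself, a step like this can sit downstream of any existing mapper, which matches the abstract's remark about combining the approach with other tools.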
Abstract. Massive amounts of data gathered over the last decade have contributed significantly to the applicability of deep neural networks. Deep learning is well suited to processing huge amounts of data because deep models improve as more data are fed into them. However, in the existing literature, a deep neural classifier is often treated as a “black box” technique because the process is not transparent and researchers cannot gain information about how the input is associated with the output. In many domains, such as medicine, interpretability is critical because of the nature of the application. Our research focuses on adding interpretability to the black box by integrating Formal Concept Analysis (FCA) into the image classification pipeline, converting it into a glass box. Our proposed approach produces a low-dimensional feature vector for an image dataset using an autoencoder, followed by supervised fine-tuning of features using a deep neural classifier and Linear Discriminant Analysis (LDA). The low-dimensional feature vector is then processed by an FCA-based classifier. The FCA framework helps us develop a glass box classifier from which the relationship between the target class and the low-dimensional feature set can be derived. Further, it helps researchers understand the classification task and refine it. We use the MNIST dataset to test the interfacing between deep neural networks and the FCA classifier. The classifier achieves an accuracy of 98.7% for binary classification and 97.38% for multi-class classification. We compare the performance of the proposed classifier with convolutional neural networks (CNNs) and random forests.
Abstract. MicroRNAs (miRNAs) are important regulators of gene expression in humans and many other organisms. Genetic variation in target sites potentially alters this regulation. A better understanding of the patterns of nucleotide changes at these sites can provide new insights into human diseases such as cancer and bacterial and viral diseases. Human variation in miRNA binding sites has been studied before. However, we focus our study on miRNA-mRNA pairs that are known to be co-expressed. Unlike previous studies, this work considers combinations of several genetic variants, identifies those that are potentially evolutionarily conserved, and considers how frequently alleles occur in the human population. A new algorithm for matching the target sites is described. Our findings confirm the putative functional significance of many alleles already reported in the cancer-related medical literature and suggest others not previously reported in the literature. These alleles may be worth investigating further.
Abstract. Compared with the natural-image datasets used in transfer learning, the effects of medical pre-training datasets are underexplored. In this study, we analyze the effect of the pre-training dataset on transfer learning in breast cancer imaging by evaluating three popular deep neural networks and one patch-based convolutional neural network on three target datasets under different fine-tuning configurations. Through a series of comparisons, we conclude that the pre-training dataset, DDSM, is effective on two other mammogram datasets. However, it is ineffective on an ultrasound dataset. Moreover, fine-tuning may mask the inefficacy of a pre-training dataset. In addition, the efficacy or inefficacy of DDSM on the target datasets is corroborated by a representational analysis. Finally, we show that hybrid transfer learning cannot mitigate the masking effect of fine-tuning.
Abstract. An essential task in antibody/nanobody therapeutics discovery is to rapidly identify whether an antibody/nanobody has specificity and cross-reactivity to one or multiple targets. Multiple target specificity and cross-reactivity of antibodies can be demonstrated by screening the third Complementarity Determining Region on the heavy chain (CDR-H3) of antibody sequences. However, the existing methods are costly and labor-intensive, as repetitive wet-lab experimentation is required to explore the sequence space. Here, we present a deep learning dimensionality reduction model based on a Variational Autoencoder (VAE) and a Residual Neural Network (ResNet), which we named VAEResDR. VAEResDR can efficiently learn the key features of the sequences while scaling down high-dimensional antibody sequences into a two-dimensional visualization representation for coherent analysis and rapid screening. We demonstrate that VAEResDR can provide a tool to precisely analyze hidden patterns within CDR-H3 sequences and effectively improve antibody/nanobody CDR-H3 sequence clustering.
Abstract. An essential step in understanding how the different functionalities of proteins work is to explore their conformational space. However, because of the fleeting nature of conformational changes in proteins, investigating protein conformational spaces experimentally is a challenging task. Nonetheless, computational methods have been shown to be practical for exploring these conformational pathways. In this work, we use Topological Data Analysis (TDA) methods to evaluate our previously introduced algorithm, RRTMC, which uses a combination of the Rapidly-exploring Random Trees algorithm and Monte Carlo criteria to explore these pathways. TDA is used to identify the intermediate conformations that are generated most often by RRTMC and to examine how close they are to known intermediate conformations. We conclude that the intermediate conformations generated by RRTMC are close to existing experimental data and that TDA can be a helpful tool for analyzing protein conformation sampling methods.
Abstract. In recent years, real-time control of prosthetic hands has gained a great deal of attention. In particular, real-time analysis of Electromyography (EMG) signals faces several challenges in achieving acceptable accuracy and execution delay. In this paper, we address some of these challenges by improving accuracy at shorter signal lengths. We first introduce a set of new feature extraction functions applied at each level of the wavelet decomposition. Then, we propose a postprocessing approach to process the neural network outputs. The experimental results illustrate that the proposed method enhances the accuracy of real-time classification of EMG signals up to 95.5 percent for an 800 msec signal length. The proposed postprocessing method achieves higher consistency compared with conventional majority voting and Bayesian fusion methods.
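The conventional majority-voting baseline mentioned above can be sketched as follows. This illustrates only the baseline, not the authors' proposed postprocessing method; the window size and the tie-breaking rule (prefer the most recent label) are assumptions made for the example.

```python
from collections import Counter
from typing import List


def majority_vote(labels: List[int], window: int) -> List[int]:
    """Smooth a stream of per-frame class labels with a sliding majority vote.

    For each time step, the output is the most frequent label among the last
    `window` classifier decisions; ties go to the most recent tied label.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    for t in range(len(labels)):
        recent = labels[max(0, t - window + 1): t + 1]
        counts = Counter(recent)
        best = max(counts.values())
        # Walk backwards so the most recent label wins a tie.
        for label in reversed(recent):
            if counts[label] == best:
                smoothed.append(label)
                break
    return smoothed


raw = [0, 0, 1, 0, 2, 2, 2, 0, 2]   # hypothetical per-frame classifier outputs
print(majority_vote(raw, window=3))  # [0, 0, 0, 0, 2, 2, 2, 2, 2]
```

The isolated errors at positions 2 and 7 are removed, at the cost of a short delay before a genuine class change is reflected in the output.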
Abstract. Biomaterials and biomedical implants have revolutionized the way medicine is practiced. Technologies such as 3D printing and electrospinning are currently employed to create novel biomaterials. Most synthesis techniques are ad hoc, time-consuming, and expensive. These shortcomings can be largely overcome by employing computational techniques. In this paper, we consider the problem of bone tissue engineering as an example and show the potential of machine learning approaches in biomaterial construction: different models were built to predict the elastic modulus of the scaffold given an arbitrary material composition. Likewise, the methodology was extended to predict cell-material interaction at arbitrary process parameters.
Abstract. Single-cell RNA-sequencing (scRNA-seq) is a high-resolution transcriptomic approach used to discover gene expression patterns among cell types to study precise biological functions. Unsupervised machine learning (clustering) is of central importance for the analysis of scRNA-seq data. It can identify putative cell types, uncover regulatory relationships, and track cell lineages and trajectories. A key issue in clustering scRNA-seq data is determining which clustering method is appropriate to use, since different methods can yield diverse results. Current approaches usually focus on one method and manually select a seemingly meaningful result. From a biological relevance perspective, it is vital to distinguish between normal and pathogenic cell types using marker genes. We present a learning framework for comparing the outcomes of multiple scRNA-seq clustering methods to determine the best results. We address the challenges of model selection and validation metrics in the context of traumatic brain injury (TBI) applications. We compare the clustering performance of five clustering algorithms and two dimensionality reduction techniques implemented in both the Seurat and Scanpy packages.