Network-based Machine Learning Approach for Structural Domain Identification in Proteins

EasyChair Preprint 4969
4 pages • Date: February 3, 2021

Abstract

In the era of structural genomics, with a large number of protein structures becoming available, identification of domains is an important problem in protein function analysis, as it forms the first step in protein classification. In the proposed network-based machine learning approach, NML-DIP, a combination of supervised (SVM) and unsupervised (k-means) machine learning techniques is used for domain identification in proteins. The algorithm first represents the protein structure as a protein contact network and uses topological properties, viz., length, density, and interaction strength (which assesses inter- and intra-domain interactions), as the feature vector for a first SVM that distinguishes between single-domain and multi-domain proteins. A second SVM identifies the number of domains in multi-domain proteins. The method therefore does not require prior knowledge of the number of domains. The domain boundaries are identified using the k-means algorithm and confirmed against CATH annotations. The performance of the proposed algorithm is evaluated on four benchmark datasets and compared with four state-of-the-art domain identification methods. Its performance is comparable to that of other domain identification tools, and it works well even when the domains are non-contiguous. Available at: https://bit.ly/NML-DIP.

Keyphrases: K-means, SVM, Structural domain identification in proteins, graph theory
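The network and clustering steps named in the abstract can be illustrated with a minimal sketch. This is not the NML-DIP implementation: the 7 Å contact cutoff, the use of one coordinate per residue, the toy coordinates, and the helper names `contact_network`, `density`, and `kmeans` are all assumptions made for illustration, and the two SVM stages are omitted (the number of clusters `k` is simply given here rather than predicted).

```python
import math

def contact_network(coords, cutoff=7.0):
    """Edges (i, j) for residue pairs within `cutoff` (assumed value, in angstroms)."""
    edges = set()
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                edges.add((i, j))
    return edges

def density(n_nodes, edges):
    """Graph density: fraction of all possible residue pairs that are in contact."""
    possible = n_nodes * (n_nodes - 1) / 2
    return len(edges) / possible if possible else 0.0

def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels

# Toy "protein": two well-separated runs of five residues, 3.8 angstroms apart (made-up data).
coords = [(3.8 * i, 0.0, 0.0) for i in range(5)] + [(50.0 + 3.8 * i, 0.0, 0.0) for i in range(5)]
edges = contact_network(coords)
labels = kmeans(coords, k=2)

# Count contacts within and between the two clusters found by k-means.
inter = sum(1 for i, j in edges if labels[i] != labels[j])
intra = len(edges) - inter
print(density(len(coords), edges), labels, intra, inter)
```

In this toy case the two residue runs share no contacts, so every edge is intra-cluster. In the approach described above, network-level quantities of this kind would feed the SVM classifiers, with k-means applied only once the number of domains is known.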