Classification of Cancer Subtypes Based on Imbalanced Data Sets

EasyChair Preprint 4040
16 pages • Date: August 16, 2020

Abstract

Cancer is an important factor affecting human health. Many cancers comprise several subtypes and are highly complex. Different subtypes arise through different mechanisms, so correct classification of cancer subtypes is essential for early diagnosis and preventive treatment. With the development of high-throughput technologies, The Cancer Genome Atlas (TCGA) project has been continuously improved to provide comprehensive cancer genome data. However, many of these data sets have an unbalanced sample distribution, high feature dimensionality, and considerable redundancy, which degrades classification of the minority classes and thereby the overall classification performance. In this paper, a model based on balanced feedback sampling and Tomek links is applied to DNA methylation data for three types of cancer: liver cancer, breast cancer, and gastric cancer. First, the balanced feedback sampling algorithm is used to sample the different subtypes, and then Tomek links are used to clean the data and eliminate noise to obtain the optimal sample data. Next, the equally divided Lasso algorithm is used for feature selection to remove redundant features and avoid overfitting. Finally, a support vector machine, a random forest, and a convolutional neural network are used for classification, and four commonly used classification performance metrics are used to verify the effect of the balancing method. The three sets of cancer data were classified by subtype, and the best classification performance was obtained with the gcForest model.

Keyphrases: Cancer Subtype Classification, DNA methylation, imbalance, multiple classification
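To make the Tomek link cleaning step concrete, the following is a minimal sketch in plain Python. It is not the paper's implementation: the function names `tomek_links` and `drop_majority_side` are illustrative, Euclidean distance is assumed, and removing only the majority-class member of each link is an assumption, since the abstract does not say which side is discarded. It also does not cover the balanced feedback sampling step. A Tomek link here is a pair of samples from different classes that are each other's nearest neighbour.

```python
from math import dist  # Euclidean distance, Python 3.8+


def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that form Tomek links.

    A pair is a Tomek link when the two samples are mutual nearest
    neighbours and carry different class labels.
    """
    n = len(X)
    # Nearest neighbour of every sample, excluding the sample itself.
    nearest = [
        min((j for j in range(n) if j != i), key=lambda j, i=i: dist(X[i], X[j]))
        for i in range(n)
    ]
    return [
        (i, j)
        for i, j in enumerate(nearest)
        if i < j and nearest[j] == i and y[i] != y[j]
    ]


def drop_majority_side(X, y, majority_label):
    """Remove the majority-class member of every Tomek link (assumed policy)."""
    drop = {k for pair in tomek_links(X, y) for k in pair if y[k] == majority_label}
    keep = [k for k in range(len(X)) if k not in drop]
    return [X[k] for k in keep], [y[k] for k in keep]


# Toy one-dimensional data: the majority sample at 2.0 sits right next to
# the minority sample at 2.4, so the two form a Tomek link.
X = [(0.0,), (1.0,), (2.0,), (2.4,), (5.0,), (6.0,)]
y = ["maj", "maj", "maj", "min", "min", "min"]

links = tomek_links(X, y)
X_clean, y_clean = drop_majority_side(X, y, majority_label="maj")
```

On this toy data the only link is the pair at indices 2 and 3, and cleaning removes the borderline majority sample while leaving every minority sample in place. The brute-force neighbour search is quadratic in the number of samples, which is acceptable for a sketch but not for large methylation data sets.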