| Down syndrome (DS) is a relatively common chromosomal disorder caused mainly by an extra copy of human chromosome 21. The resulting genetic aberration disrupts normal protein expression and impairs functions such as learning and memory in DS patients. DS has a high incidence among newborns, and no effective drug treatment currently exists. Exploring DS-related protein expression is therefore important for identifying effective drug targets and guiding the direction of drug therapy. This thesis studies a public mouse protein expression data set. Its main contents are as follows: (1) Data preprocessing and key protein extraction. Missing values in the mouse protein data are filled in, and the data range is normalized with Min-Max normalization. The Mann-Whitney U test is used for pairwise group comparisons within the normal mouse group, within the trisomy mouse group, and between normal and trisomy mice, yielding key proteins whose expression levels differ significantly under different stimulation conditions. The significance level is adjusted with the Bonferroni correction to control false positives arising from multiple comparisons. (2) Combining extremely randomized trees (ET) with t-SNE, this thesis proposes ET-tSNE, a visual dimensionality reduction algorithm for high-dimensional protein data. Because the distribution structure and internal relationships of high-dimensional data are difficult to understand, dimensionality reduction is used to visualize the protein data and make it more interpretable. Compared with other dimensionality reduction methods, ET-tSNE achieves better visualization results, allows the biological significance of the protein data to be explored in two-dimensional space, and verifies the correctness of the extracted key proteins. (3) The thesis proposes a semi-supervised stacking ensemble learning classification (SSSELC) algorithm for classifying the key proteins. Semi-supervised learning is well suited to scenarios where labels are scarce, such as computer-aided diagnosis and drug discovery. Ensemble learning is introduced to further improve classification performance and complements the semi-supervised method. SSSELC therefore combines semi-supervised learning with a stacking ensemble model. Compared with other methods, SSSELC significantly improves classification performance, and it also achieves better experimental results on multi-class data. |
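
The preprocessing and testing steps in item (1) can be sketched as follows. This is a minimal, standard-library-only illustration and not the thesis's implementation. The function names are hypothetical, and the p-value uses the large-sample normal approximation without tie or continuity correction, which is an assumption made to keep the example short.

```python
import math


def min_max_scale(values):
    """Rescale a list of numbers to the [0, 1] range (Min-Max normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test.

    Returns (U, p), where p comes from the normal approximation
    (assumption: no tie correction, no continuity correction).
    """
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u, p


def bonferroni_significant(p_values, alpha=0.05):
    """Keep the names whose p-value is below alpha / (number of tests)."""
    threshold = alpha / len(p_values)
    return [name for name, p in p_values.items() if p < threshold]
```

In use, each protein column would be scaled with `min_max_scale`, tested between two groups with `mann_whitney_u`, and the resulting dictionary of p-values passed to `bonferroni_significant`. For real analyses, an established statistics library with exact p-values and tie handling would be the safer choice.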