Automatic Categorization Of Bioscience Literature Based On Imbalanced Data

Posted on: 2015-11-16
Degree: Master
Type: Thesis
Country: China
Candidate: B Y Shen
Full Text: PDF
GTID: 2180330452450099
Subject: Communication and Information System
Abstract/Summary:
In recent years, continuing breakthroughs in bioinformatics experimental methods, together with the rapid development of data storage technologies, have brought an explosive increase in experimental data and research literature in bioscience. Retrieving knowledge of interest effectively and efficiently from ever-growing biomedical databases has become a great challenge. With the rise of text mining technologies, their use in bioinformatics literature mining has attracted much attention and is widely researched. By employing such techniques, researchers can not only discover knowledge and grasp the current state of research, but also build biological databases tailored to their own interests. Biological literature mining systems have become an important component of modern biological research.

In animal and plant production, many economically vital characters are quantitative traits, so the QTL (quantitative trait locus) was introduced to describe the genes that control quantitative traits. Several QTL information databases store information about the QTL of one or more species. However, most of them rely on manual effort to gather and manage materials, which may leave the information incomplete and delay updates. In this paper we therefore attempt to introduce machine-learning-based text classification methods into the construction of bioinformatics databases, pick out the targeted papers from a large corpus, and build a system that automatically classifies biomedical literature containing QTL information.

The object of our research is the biomedical research corpus of particular species. The ultimate goal of this paper is to select from it the papers that contain the species' QTL information, and so provide the source data for building our own QTL information database.
During the research, we chose the SVM (support vector machine) to perform the categorization. The text samples for SVM training come from several standard online biomedical databases; we wrote web crawlers to locate and download the papers. Because papers from different databases have different formats and may contain unrelated information, we applied several data cleaning steps. The corpus to be classified comes from PubMed, and a quantitative analysis led us to conclude that our categorization task rests on an imbalanced dataset. To address this problem at the sample representation stage, we propose a strategy that combines words with phrases to construct sample vectors, and the results show that this strategy clearly improves the performance of the classifier. At the data level, we ran careful experiments to compare different re-sampling methods and picked the one that best fits SVM. We also carefully compared kernels and parameter optimization algorithms and arrived at the most suitable solution. Finally, we classified different datasets, some containing research papers on a single species and others combining research papers on several species. The results validate the effectiveness and adaptability of our system.
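
As a rough illustration of two ideas in the abstract (combining words with phrases in the sample vectors, and re-sampling an imbalanced training set), the following is a minimal sketch and not the thesis's implementation. It assumes that a "phrase" can be approximated by an adjacent word pair and that random oversampling stands in for whichever re-sampling method was finally chosen; the abstract names neither. The function names and the toy corpus are hypothetical.

```python
import random
from collections import Counter


def word_and_phrase_features(text):
    """Count single words plus adjacent word pairs (an assumed stand-in for phrases)."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)


def to_vector(features, vocabulary):
    """Map a feature Counter onto a fixed, ordered vocabulary."""
    return [features.get(term, 0) for term in vocabulary]


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_label.values())
    out_samples, out_labels = [], []
    for label, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_samples.extend(group + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Hypothetical toy corpus: 1 = contains QTL information, 0 = does not.
docs = [
    "quantitative trait locus mapped on chromosome four",
    "protein structure prediction",
    "gene expression in liver tissue",
    "cell membrane transport study",
]
labels = [1, 0, 0, 0]

features = [word_and_phrase_features(d) for d in docs]
vocabulary = sorted(set().union(*features))
vectors = [to_vector(f, vocabulary) for f in features]
balanced_vectors, balanced_labels = random_oversample(vectors, labels)
```

The balanced vectors and labels could then be passed to any SVM implementation for training. Oversampling should be applied only to the training split, so that evaluation still reflects the original class distribution.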
Keywords/Search Tags: text categorization, SVM, QTL, imbalanced data