Automatic Categorization Of Bioscience Literature Based On Imbalanced Data

Posted on:2015-11-16

Degree:Master

Type:Thesis

Country:China

Candidate:B Y Shen

Full Text:PDF

GTID:2180330452450099

Subject:Communication and Information System

Abstract/Summary:

In recent years, continuing breakthroughs in bioinformatics experimental methods, together with the rapid development of data storage technologies, have brought an explosive increase in experimental data and research literature in bioscience. Retrieving knowledge of interest effectively and efficiently from ever-growing biomedical databases has become a great challenge. With the rise of text mining technologies, their use in bioinformatics literature mining has attracted much attention and is widely researched. By employing such techniques, researchers can not only discover knowledge and grasp the current state of research, but also build biological databases of their own interest. Biological literature mining systems have become an important component of modern biological research.

In animal and plant production, many economically important characters are quantitative traits, so the concept of the QTL (quantitative trait locus) was introduced to describe the genes that control quantitative traits. Several QTL information databases store QTL information for one or more species. However, most of them rely on manual effort to gather and manage material, which may lead to incomplete information and delayed updates. In this paper we therefore attempt to introduce machine-learning-based text classification methods into the construction of bioinformatics databases, to pick out targeted papers from a large corpus, and to build a system that automatically classifies biomedical literature containing QTL information.

The object of our research is the biomedical research corpus of particular species. The ultimate goal of this paper is to select from that corpus the papers that contain QTL information for those species, and so provide the source data for building our own QTL information database.

During the research, we chose the SVM (support vector machine) to perform the categorization. The text samples for SVM training come from several standard online biomedical databases; we programmed web crawlers to locate and download the papers. Because papers from different databases have different formats and may contain unrelated information, we performed data cleaning. The corpus to be classified comes from PubMed; through quantitative analysis, we concluded that our categorization task is based on an imbalanced dataset. To address this problem, in the sample representation phase we propose a strategy that combines words with phrases to construct sample vectors, and the results show that this strategy clearly improves the performance of the classifier. At the data level, we ran careful experiments to compare different re-sampling methods and picked the one best suited to SVM. We also carefully compared kernels and parameter optimization algorithms and arrived at the most suitable solution. Finally, we classified different datasets, some containing research papers on a single species and others combining research papers on several species. The results validate the effectiveness and adaptability of our system.
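
As a rough illustration of two ideas mentioned in the abstract (word-plus-phrase sample vectors and re-sampling of an imbalanced dataset), the following minimal sketch may help. It is not the thesis's implementation: the abstract does not say how phrases were extracted or which re-sampling method was chosen, so the sketch assumes contiguous word bigrams as "phrases" and plain random oversampling of the minority class. The function names `word_and_phrase_features` and `random_oversample` are hypothetical.

```python
import random
import re
from collections import Counter


def word_and_phrase_features(text, phrase_len=2):
    """Count single words plus contiguous word n-grams.

    Assumption: an n-gram of `phrase_len` words stands in for a "phrase";
    the thesis may define phrases differently.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    features = Counter(tokens)
    for i in range(len(tokens) - phrase_len + 1):
        features[" ".join(tokens[i:i + phrase_len])] += 1
    return features


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of smaller classes until every
    class is as large as the largest one.

    Assumption: random oversampling is only one of several possible
    re-sampling methods; the abstract does not name the one selected.
    """
    rng = random.Random(seed)
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_label.values())
    out_samples, out_labels = [], []
    for label, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


if __name__ == "__main__":
    # Toy corpus: label 1 = mentions QTL information (minority), 0 = other.
    docs = [
        "protein structure prediction methods",
        "gene expression in liver tissue",
        "sequence alignment benchmark",
        "quantitative trait locus mapping in pigs",
    ]
    labels = [0, 0, 0, 1]
    vectors = [word_and_phrase_features(d) for d in docs]
    balanced_vectors, balanced_labels = random_oversample(vectors, labels)
    print(Counter(balanced_labels))
```

The resulting feature counts could then be turned into numeric vectors and passed to an SVM implementation of the reader's choice; that step is omitted here because the abstract gives no details about kernels or parameters.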

Keywords/Search Tags:

text categorization, SVM, QTL, imbalanced data

Related items

1	The Research And Implement Of Incremental Chinese Text Automatic Categorization
2	Meteorological Text Categorization Feature Selection Method And Its Implementation On MapReduce
3	Research On Subject And Development Rule Of Statistics In China Based On Text Content Analysis
4	Improvement And Application Of Bayesian Logistic Regression Text Classification Model
5	Research On Classification Algorithm Of Typical Imbalanced Data Sets
6	Research On Classification Algorithm Of Meteorological Imbalanced Data
7	Research On Imbalanced Data Classification For Lithologic Identification Of Complex Reservoirs
8	Research On Interpretability Classification Method Based On Imbalanced Functional Data
9	Research On Credit Scoring Model Based On Imbalanced Data Sampling And Convolutional Neural Network
10	Statistical Analysis Of Massive Imbalanced Data With Multiclass Logistic Regression