
Research And Implementation Of Automatic Text Classification System Of Biomedical Informatics

Posted on: 2015-02-14  Degree: Master  Type: Thesis
Country: China  Candidate: S Wang  Full Text: PDF
GTID: 2298330467470415  Subject: Epidemiology and Health Statistics
Abstract/Summary:
Objective: With the development of computer and network technology, an enormous amount of biomedical information has accumulated on the Internet. Its sheer volume, wide distribution and variety make it increasingly difficult for users to use effectively. Faced with such a huge amount of online biomedical information, people urgently need to find and select the information that meets their needs. Building on research into automatic Chinese text classification, the author therefore aims to develop an automatic text classification system that classifies large volumes of biomedical information accurately, so as to give users active, timely and useful help, improve their work efficiency and provide an important reference for decision-making.
Methods: During system development, the widely used vector space model (VSM) was adopted for text representation; Visual Studio 2010 served as the development platform; C++ was the programming language; and the interface was designed with Qt. The ICTCLAS segmentation system of the Chinese Academy of Sciences was used in building the prototype system. The TF-IDF measure was used to calculate word weights. Information gain (IG) was the feature selection approach. k-Nearest-Neighbours (kNN) was the classification approach. Classification results were evaluated by MacroP, MacroR, MacroF1 and MicroF1. The first version of the system followed the original forms of the various algorithms, and it was then improved to eliminate their defects and shortcomings. The prototype system and the improved one were compared and the results discussed. Because no ready-made Chinese biomedical corpus was available, a self-built corpus was used to train and test the classification system.
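As an illustration of the pipeline named in the Methods (VSM representation, TF-IDF weighting, kNN classification), the following is a minimal Python sketch. It is not the thesis implementation, which was written in C++ and is not reproduced in this abstract. The function names, the plain log(N/df) form of IDF, cosine similarity as the neighbour measure and unweighted majority voting are all assumptions made for the example, and the documents are assumed to be already segmented into tokens.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn token lists into sparse TF-IDF vectors (dicts of term -> weight).

    Assumed weighting: raw term frequency times log(N / document frequency).
    Returns the vectors and the IDF table so queries can be weighted the same way.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_classify(query_tokens, vectors, labels, idf, k=3):
    """Label a query by majority vote among its k most similar training vectors."""
    # Terms never seen in training have no IDF and are ignored.
    query = {t: tf * idf[t] for t, tf in Counter(query_tokens).items() if t in idf}
    ranked = sorted(zip(vectors, labels),
                    key=lambda pair: cosine(query, pair[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

A typical call would build the vectors once from the training corpus with `tfidf_vectors`, then pass each new tokenised document to `knn_classify` together with the training labels.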
Result: In developing the system, adjustments and improvements were made to the TF-IDF feature weight calculation and the kNN classification algorithm. In addition, TF-IDF-DF, a new feature selection method based on TF-IDF, was proposed.
Conclusion: Testing and validation showed that the proposed selection approach outperforms IG feature selection and improves the performance of the classification system. The system can classify biomedical information rapidly and accurately and help with organizing and retrieving it. Combining the classification system with a search engine can present quick, accurate and timely classified information to interested users.
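For reference, the IG baseline named in the abstract can be sketched as below. This is the textbook information gain of a term with respect to the class labels (class entropy minus the entropy remaining after splitting documents on presence or absence of the term), written in Python under the assumption that documents are token lists. The abstract does not define TF-IDF-DF, so no attempt is made to reproduce it here; the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(docs, labels, term):
    """IG(term) = H(C) - [P(t) * H(C | t) + P(not t) * H(C | not t)]."""
    n = len(docs)
    with_term = Counter(l for d, l in zip(docs, labels) if term in d)
    without_term = Counter(l for d, l in zip(docs, labels) if term not in d)
    n_with = sum(with_term.values())
    n_without = n - n_with
    h_class = entropy(list(Counter(labels).values()))
    h_conditional = ((n_with / n) * entropy(list(with_term.values()))
                     + (n_without / n) * entropy(list(without_term.values())))
    return h_class - h_conditional
```

Feature selection with IG then amounts to scoring every vocabulary term this way and keeping the highest-scoring ones; how many to keep is a tuning choice that the abstract does not state.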
Keywords/Search Tags: Automatic text categorization, TF-IDF, Feature Selection, Classification algorithm, IG, KNN