
Research On Text Categorization Based On Modified Bayes Method And Its Application In NERMS

Posted on: 2007-02-22
Degree: Master
Type: Thesis
Country: China
Candidate: L Wang
Full Text: PDF
GTID: 2178360182496157
Subject: Computer application technology
Abstract/Summary:
Text categorization is an important research field in data mining and machine learning; its goal is to assign an appropriate label to each new document. Categorization more broadly has long been a central task in machine learning, pattern recognition and data mining, and it is widely used in image recognition, speech recognition, natural language processing, medical treatment and web page categorization.

NERMS (Network Educational Resource Management System) is one of the scientific research projects of the Science and Technology Development Plan of Jilin Province. Its main goal is to organize and manage network educational resources so that they are easy to share and easy to obtain, and thereby to speed up the development of educational resources. To organize and manage these resources better, NERMS divides them into 6 categories by content and into 12 categories by file type. As a result, administrators can manage the resources more easily, and users can select and download resources according to their content. At present, the development of NERMS is almost complete, but the system still lacks resources. Before the educational resources gathered by the NERMS collection system can be added to the database according to their content, each resource must be labeled with a subject. Manual labeling is accurate but very laborious, so a tool is needed to enter resources and make them accessible more quickly; the resource categorization tool was developed for this purpose. In this thesis, a resource categorization system based on a modified Bayes method was implemented, and the labeled resources can be used by NERMS directly.
In this thesis, the feature selection stage uses the TFIDF algorithm, the most prevalent standard for evaluating features, and the categorization stage uses the Bayes method, which is recognized as a simple and effective probabilistic classifier. In implementing the resource categorization tool, Chinese word segmentation and the improvement of categorization precision are the two key issues. To address them, we implement a Chinese word segmentation method based on a statistical model, and we add a relative weighting coefficient in both the feature selection stage and the categorization stage to improve categorization precision. The coefficient reflects Chinese writing conventions: it is calculated from the feature values of terms with different parts of speech or in different positions, so that features in important positions stand out. Such features usually indicate the category of the document. The experiments show that the Modified Bayes Method achieves higher precision and recall than the Bayes method.

The system includes three main modules: a pretreatment module, a training module and a categorization module. The pretreatment module selects, from a natural language document, the features that best represent its category. Both the training set and the unclassified set must be processed by the pretreatment module. After pretreatment, the training module computes feature statistics for each category, saves the results to the database, and builds the categorization system. The categorization module then uses this categorization system to classify and label the unclassified documents.
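The pipeline described above (position-weighted features, TFIDF-based feature selection, Bayes categorization) can be illustrated with a minimal sketch. It is not the thesis implementation: the weight values in POSITION_WEIGHT, the function names, the smoothed IDF formula and the Laplace smoothing are all assumptions made for illustration, and the input is assumed to be already segmented into (term, position) pairs. The thesis also weights by part of speech, which is omitted here.

```python
import math
from collections import Counter, defaultdict

# Hypothetical position weights; the thesis does not state its coefficient values.
POSITION_WEIGHT = {"title": 3.0, "body": 1.0}


def weighted_counts(doc):
    """Turn a list of (term, position) pairs into position-weighted term counts."""
    counts = Counter()
    for term, position in doc:
        counts[term] += POSITION_WEIGHT.get(position, 1.0)
    return counts


def select_features(docs, top_k):
    """Keep the top_k terms ranked by summed weighted TF-IDF (smoothed IDF, assumed)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update({term for term, _ in doc})
    scores = Counter()
    for doc in docs:
        for term, tf in weighted_counts(doc).items():
            idf = math.log((1 + n) / (1 + df[term])) + 1.0
            scores[term] += tf * idf
    # Sort by descending score, then alphabetically so ties are deterministic.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return set(ranked[:top_k])


def train(docs, labels, vocab):
    """Collect per-category weighted term statistics for a multinomial Bayes model."""
    term_counts = defaultdict(Counter)
    doc_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        for term, weight in weighted_counts(doc).items():
            if term in vocab:
                term_counts[label][term] += weight
    totals = {label: sum(term_counts[label].values()) for label in doc_counts}
    priors = {label: count / len(docs) for label, count in doc_counts.items()}
    return {"vocab": vocab, "term_counts": term_counts,
            "totals": totals, "priors": priors}


def classify(doc, model):
    """Return the category with the highest log posterior (Laplace smoothing)."""
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, prior in model["priors"].items():
        score = math.log(prior)
        for term, weight in weighted_counts(doc).items():
            if term not in model["vocab"]:
                continue
            count = model["term_counts"][label][term]
            prob = (count + 1.0) / (model["totals"][label] + vocab_size)
            score += weight * math.log(prob)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy usage with English placeholder tokens standing in for segmented Chinese words.
train_docs = [
    [("algebra", "title"), ("equation", "body"), ("number", "body")],
    [("poem", "title"), ("verse", "body"), ("number", "body")],
]
train_labels = ["math", "literature"]
vocab = select_features(train_docs, top_k=4)
model = train(train_docs, train_labels, vocab)
predicted = classify([("equation", "title"), ("number", "body")], model)
```

In this sketch the same position weight enters both the feature scores and the exponent of each term's likelihood, mirroring the abstract's statement that the coefficient is applied in both the feature selection and categorization stages.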
After all the resources gathered by the NERMS collection system have been labeled, the categorization system adds them to the NERMS resource database so that users can select and download them.

The system chiefly categorizes the resources that the NERMS collection system gathers from the Internet into 6 categories according to document content, and categorizes resources of other file types according to their resource descriptions. As a result, the system can greatly enrich the resources of NERMS, reduce the administrator's workload when adding resources to the database, and make it more convenient for users to select and download resources. So far, the system has categorized a large number of resources gathered by the NERMS collection system, with satisfactory results.
Keywords/Search Tags:Categorization