
Research And Implementation On Web Chinese Text Categorization Technology

Posted on: 2015-07-24
Degree: Master
Type: Thesis
Country: China
Candidate: X N Wang
GTID: 2298330452450109
Subject: Communication and Information System
Abstract/Summary:
In the information age, the rapid development of the Web and the spread of the internet have brought great convenience to daily life and work, and the internet has become the main source of information. However, the openness and heterogeneity of the internet also mean that much of this information is useless. How to manage the huge volume of network information effectively, and how to find potentially valuable knowledge in it quickly and accurately, has become an active research topic. Classification is an important way to handle complex Web content, and because text is still the main form in which Web pages present information, text classification is central to solving this problem. It is also a supporting technology for search engines and for information retrieval and filtering. The practical significance of studying text classification lies in its broad applicability. Web Chinese text classification combines internet technology with conventional text classification techniques. Briefly, it uses Web Chinese texts of known categories to build a classification model, which can then determine the category of an unknown text. The whole process includes Web Chinese text preprocessing, feature selection, text representation, term weighting, and sample classification.

This work summarizes the background and current state of the research, analyzes the basic ideas, and, after explaining the key techniques of Web Chinese text classification, studies both theory and implementation. On the theoretical side, several steps of the classification process were improved after a comprehensive analysis of the weaknesses of existing methods. Because the Web is a particular operating environment, text from different positions in a page should be processed in stages before feature selection, according to its region and the weight assigned to that region.
The conventional chi-square statistic has several drawbacks: it focuses on document frequency while neglecting word frequency; its formula lacks a proper penalty for terms that are uniformly distributed; and it tends to select unsuitable feature words that are common in non-target categories rather than words from the infrequent target category. Therefore, this work proposes a word frequency compensating factor, a category weighting factor, and an in-category distribution factor, which are appended to the conventional formula as compensation factors; the improved method gave satisfactory results. On the classification algorithm side, the KNN algorithm was studied in detail, and its strengths and weaknesses were summarized on the basis of a close examination of its principle. Because the inner product formula used in KNN yields only a rough measure of text similarity (examples are given in the body of the thesis), this work proposes a similarity approach coefficient to improve the method. Experiments showed that the improved methods achieved varying degrees of improvement in accuracy, recall, and F1 value. On the implementation side, a simple software tool for Web Chinese text classification experiments was designed. It includes a Web page gathering module for building a sample library, a classification module for processing and classifying text, and an assessment module for evaluating the final results. The main design and some key techniques used in the software are also presented.
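
For orientation, the two baseline techniques the abstract sets out to improve can be sketched as follows. This is a minimal Python illustration of the conventional chi-square score and of KNN with inner-product similarity, not the thesis's improved method: the three compensation factors and the similarity approach coefficient are not defined in the abstract and are not reproduced here. All function names, the sparse-dict vector representation, the similarity-weighted vote, and the sample data are assumptions made for illustration.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Conventional chi-square score of a term for one category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate contingency table, no evidence either way
    return n * (a * d - b * c) ** 2 / denom


def inner_product(u, v):
    """Inner product of two sparse term-weight vectors stored as dicts."""
    return sum(w * v.get(term, 0.0) for term, w in u.items())


def knn_classify(query, samples, k=3, similarity=inner_product):
    """Label a query vector by a similarity-weighted vote of its k nearest samples.

    samples is a list of (vector, label) pairs; the weighted vote is an
    illustrative choice, not something stated in the abstract.
    """
    ranked = sorted(samples, key=lambda s: similarity(query, s[0]), reverse=True)
    votes = Counter()
    for vector, label in ranked[:k]:
        votes[label] += similarity(query, vector)
    return votes.most_common(1)[0][0]


# Hypothetical toy data, for illustration only.
samples = [
    ({"sport": 1.0, "ball": 1.0}, "sports"),
    ({"stock": 1.0, "market": 1.0}, "finance"),
    ({"sport": 1.0}, "sports"),
]
label = knn_classify({"sport": 1.0, "ball": 1.0}, samples, k=2)
```

Note that chi_square(10, 0, 0, 10) and chi_square(0, 10, 10, 0) return the same score, because the formula squares the difference a*d - b*c. A term that appears only outside a category therefore ranks as highly as one that appears only inside it, which is related to the feature-selection drawback described above.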
Keywords/Search Tags: Web Chinese Text, Feature Selection, Chi-square statistics, KNN classification algorithm, Text Categorization