Research And Implementation On Web Chinese Text Categorization Technology

Posted on:2015-07-24

Degree:Master

Type:Thesis

Country:China

Candidate:X N Wang

Full Text:PDF

GTID:2298330452450109

Subject:Communication and Information System

Abstract/Summary:

In the information age, the rapid development of the Web and the spread of the internet have brought great convenience to daily life and work, and the internet has become the main source of information. However, the openness and heterogeneity of the internet also produce a large amount of useless information. How to manage the huge volume of network information effectively, and how to find potentially valuable knowledge in it quickly and accurately, has become a research hot spot. Classification is an important way to handle complex Web content, and text is still the main form in which Web pages present content. Text classification is therefore central to this problem, and it is also a supporting technology for search engines and for information retrieval and filtering. The practical significance of studying text classification lies in its broad applicability. Web Chinese text classification combines internet technology with conventional text classification techniques. Briefly, it uses Web Chinese texts of known category to build a classification model, which can then determine the category of an unknown text. The whole process includes Web Chinese text preprocessing, feature selection, text representation, term weighting, and sample classification.

The present work summarized the background and current state of the research, analyzed the basic ideas, and, after explaining the key techniques of Web Chinese text classification, carried out research on both theory and implementation.

On the theoretical side, several steps of the classification process were improved after a comprehensive analysis of the weaknesses of existing methods. Because the Web operating environment is particular, text in different positions of a page should be processed in steps, by region and with region-specific weights, before feature selection. The conventional chi-square statistic has several drawbacks: it focuses on document frequency and neglects word frequency; its formula lacks a proper penalty to correct for uniformly distributed terms; and it tends to select unsuitable feature words that are common in non-target categories but infrequent in the target category. The present work therefore proposed a word frequency compensating factor, a category weighting factor, and an in-category distribution factor, which were appended to the conventional formula as compensation factors, and a satisfactory result was obtained.

On the classification algorithm side, the KNN algorithm was studied in detail, and its merits and drawbacks were summarized on the basis of its principle. Because the text similarity that KNN computes with the inner product formula is coarse (examples are given in the thesis), the present work proposed a similarity approach coefficient to improve the method. Experiments showed that the improved method achieved varying degrees of improvement in accuracy rate, recall rate, and F1 value.

On the implementation side, a simple software tool for Web Chinese text classification experiments was designed. It includes a Web page gathering module for building a sample corpus, a classification module for processing and classifying text, and an assessment module for evaluating the final results. The main scheme and some key techniques used in designing the software were also presented.
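
As a rough illustration of the two baseline techniques named in the abstract, the sketch below computes the conventional chi-square statistic from document counts and classifies a text with KNN using inner-product similarity. It does not reproduce the thesis's improvements: the three compensation factors and the similarity approach coefficient are not defined in the abstract. The function names, the toy English tokens, and the unit term weights are all assumptions made for this example.

```python
from collections import Counter


def chi_square(term, category, docs):
    """Conventional chi-square score of `term` for `category`.

    docs is a list of (set_of_terms, label) pairs. Only the presence of a
    term in a document is counted (document frequency); how often the term
    occurs inside a document is ignored.
    """
    a = b = c = d = 0
    for terms, label in docs:
        has_term = term in terms
        in_category = label == category
        if has_term and in_category:
            a += 1  # in the category, contains the term
        elif has_term:
            b += 1  # outside the category, contains the term
        elif in_category:
            c += 1  # in the category, lacks the term
        else:
            d += 1  # outside the category, lacks the term
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def inner_product(u, v):
    """Inner product of two sparse vectors stored as {term: weight} dicts."""
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())


def knn_classify(query, training, k=3):
    """Majority vote among the k training vectors most similar to `query`."""
    ranked = sorted(
        training,
        key=lambda item: inner_product(query, item[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus; a real system would use segmented Chinese text.
DOCS = [
    ({"goal", "match"}, "sports"),
    ({"goal", "team"}, "sports"),
    ({"stock", "market"}, "finance"),
    ({"stock", "team"}, "finance"),
]

# The same documents as sparse vectors with unit weights (an assumption).
TRAINING = [({t: 1.0 for t in terms}, label) for terms, label in DOCS]

if __name__ == "__main__":
    print(chi_square("goal", "sports", DOCS))   # 4.0
    print(chi_square("team", "sports", DOCS))   # 0.0
    print(knn_classify({"goal": 1.0, "team": 1.0}, TRAINING, k=3))  # sports
```

In this toy corpus, "goal" appears only in sports documents and scores 4.0, while "team" is spread evenly across both categories and scores 0.0. Because the statistic is built from document counts alone, repeating a term many times inside one document would not change its score, which is the first drawback the abstract points out.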

Keywords/Search Tags:

Web Chinese Text, Feature Selection, Chi-square statistics, KNN classification algorithm, Text Categorization

Related items

1	χ² Statistics-based Chinese Text Categorization Feature Selection Method
2	Extraction Of Chi-square Features In Chinese Text Classification And Improvement Of TF-IDF Weight
3	Research And Implementation Of Chinese Text Classification, Feature Selection Method,
4	Research On Improved Feature Selection And Classification Algorithm For Chinese Text
5	Research And Implementation Of The Automatic Chinese Text Categorization
6	Design And Realization Of Automated Text Categorization System For Chinese Documents Based On Relevancy
7	Research Of Chinese Web Text Categorization Based On KNN Algorithm
8	Research On Chinese Text Categorization Algorithms Based On Technology Text
9	The Research And Implementation Of Automatic Text Categorization For Chinese Web Documents
10	Research On Local Feature Selection Of Chinese Text