The Research Of Chinese WEB Acquisition And Automatic Categorization

Posted on:2008-08-14

Degree:Master

Type:Thesis

Country:China

Candidate:H Z Wu

GTID:2178360215974393

Subject:Computer application technology

Abstract/Summary:

With its rapid development, the Internet contains more and more information and is becoming an important source of knowledge in work and daily life. However, because of its openness and heterogeneity, it is very difficult for users to obtain the information they need. How to process and organize the vast amount of online information has therefore become an important research field. Traditionally, web pages were classified manually: an expert read each page and assigned it to one or more categories. With the rapid growth of the web, manual classification is no longer practical. To help users locate information more easily, more and more researchers have begun to study web classification technology.

This dissertation studies Chinese web page acquisition and classification. The main content is as follows:

1. A web acquisition technique is presented that integrates the Google Web API into a Java application to search for and retrieve web pages, and uses regular expressions to find the other URLs contained in each page.
2. The main procedures of web classification are described in detail. Web page preprocessing, including page cleaning and Chinese word segmentation, is discussed. The various word segmentation techniques are analyzed, and ICTCLAS is then introduced.
3. Three Chinese text representation models are compared, and the vector space model is adopted. Various feature selection algorithms are then compared. Because words of different parts of speech play different roles in a text, feature selection based on part of speech is proposed to reduce the dimension of the feature vector. This method improves the efficiency of feature selection by eliminating noise before the weights of the feature words are computed.
4. Several text classification algorithms are compared, KNN is introduced in detail, and a corresponding improvement is proposed that rebuilds the feature vector of the text to improve efficiency.
5. The proposed Chinese web classifier is evaluated. The experiments indicate that the proposed method improves the recall and efficiency of classification without reducing its precision.
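
The classification steps summarized above (part-of-speech filtering, a vector space representation, and KNN) can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the input is assumed to be already segmented and tagged, the tag set in `KEPT_POS`, the raw term-frequency weighting, the cosine similarity measure, and all function names and sample data are assumptions made for illustration, and the dissertation's KNN improvement (rebuilding the feature vector) is not reproduced here.

```python
import math
from collections import Counter

# Assumption: keep only nouns ("n") and verbs ("v"); the tag names and the
# choice of which parts of speech to keep are illustrative, not from the thesis.
KEPT_POS = {"n", "v"}


def filter_by_pos(tagged_tokens, kept=KEPT_POS):
    """Drop words whose part-of-speech tag is not in `kept` (noise removal)."""
    return [word for word, pos in tagged_tokens if pos in kept]


def tf_vector(words):
    """Represent a document as a sparse term-frequency vector."""
    return Counter(words)


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(query, training, k=3):
    """Majority vote among the k training vectors most similar to `query`."""
    ranked = sorted(training, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical pre-segmented, pre-tagged documents as (word, tag) pairs.
tagged_training = [
    ([("球队", "n"), ("赢得", "v"), ("比赛", "n"), ("的", "u")], "sports"),
    ([("球员", "n"), ("参加", "v"), ("比赛", "n")], "sports"),
    ([("股票", "n"), ("上涨", "v"), ("了", "u")], "finance"),
    ([("银行", "n"), ("发布", "v"), ("股票", "n")], "finance"),
]
training = [(tf_vector(filter_by_pos(doc)), label) for doc, label in tagged_training]

query_tokens = [("比赛", "n"), ("的", "u"), ("球队", "n")]
query = tf_vector(filter_by_pos(query_tokens))
print(knn_classify(query, training, k=3))
```

Filtering by part of speech before building the vectors means function words such as 的 never enter the vocabulary, so the vectors are shorter and the similarity computation has fewer terms to compare.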

Keywords/Search Tags:

Web acquisition, Chinese word segmentation, feature selection, text classification algorithm

Related items

1	The Research And Application Of Chinese Web Text Classification
2	The Research Of Chinese Web Text Orientation Classification
3	Key Technologies Research And Implementation Of Chinese Text Automatic Classification
4	Automatic Classification Research On Chinese Web Document Orientation
5	Research On Short Text Classification Of Chinese News Based On Machine Learning
6	Research And Improvement Of Automatic Classification Technology For Chinese Text
7	Research On Core Technology Of The Chinese Text Classification
8	Research And Improvement Of Feature Selection Algorithm In Chinese Text Classification
9	Research On Improved Feature Selection And Classification Algorithm For Chinese Text
10	Research And Application Of Micro-blog Acquisition Method Based On Feature Words