
Web Document Automatic Classification Based On Keywords

Posted on: 2010-06-10
Degree: Master
Type: Thesis
Country: China
Candidate: Y Li
Full Text: PDF
GTID: 2178360275977852
Subject: Computer software and theory
Abstract/Summary:
With the growth of the Internet, a large number of electronic documents are being produced. Compared with manual classification, automatic classification of Web documents is fast, efficient, and objective, and its practical value is well established. Web document categorization has therefore been receiving increasing attention, particularly in network information retrieval. Traditional Web document classification computes the similarity between documents as the cosine of their term vectors. Because the number of terms in a document is very large, and because neither the structure of the Web page nor the semantics of the document is analyzed, the quality of the resulting categorization is not high. To overcome these shortcomings, this thesis proposes an automatic Web document classification method based on semantic relationships, built on keyword extraction that combines document structure analysis with an improved TF-IDF calculation. Web documents are the objects to be processed. Candidate words are obtained from them through Chinese word segmentation; the document structure is then analyzed, the weights of the candidate keywords are calculated, and the keywords are extracted. Using the hierarchical relationships of the semantic structure in HowNet, together with an improved parameter calculation, the semantic similarity between documents is computed from their keywords and a topology map is built. The clustering algorithm proposed in this thesis then applies partitioning and merging operations to optimize the clusters and produce the final classification of the Web documents. The keyword extraction method in this thesis represents document content well.
The improved vector space model represents article content accurately while reducing the dimensionality of Web document clustering. By using semantic relationships between documents, such as synonymous words, it strengthens the similarity between related documents and improves the efficiency of Web document categorization.
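
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation, and several parts are assumptions: the `FIELD_WEIGHT` values, the smoothed IDF formula, the `threshold`, and the function names are all hypothetical. Documents are assumed to be already segmented into `(field, term)` pairs. Plain cosine similarity over keyword weights stands in for the HowNet-based semantic similarity, and connected components of the thresholded similarity graph stand in for the thesis's partition-and-merge clustering algorithm, neither of which is specified in the abstract.

```python
import math
from collections import Counter, defaultdict

# Hypothetical structure weights: a term in a title counts more than in the body.
FIELD_WEIGHT = {"title": 3.0, "heading": 2.0, "body": 1.0}


def extract_keywords(docs, top_k=3):
    """docs: list of documents, each a list of (field, term) pairs.
    Returns one {term: weight} dict per document with its top_k terms."""
    n = len(docs)
    df = Counter()  # number of documents containing each term
    for doc in docs:
        df.update({term for _, term in doc})

    keywords = []
    for doc in docs:
        tf = defaultdict(float)  # structure-weighted term frequency
        for field, term in doc:
            tf[term] += FIELD_WEIGHT.get(field, 1.0)
        # Smoothed IDF (an assumed variant, not the thesis's improved formula).
        scores = {t: f * (math.log((1 + n) / (1 + df[t])) + 1.0)
                  for t, f in tf.items()}
        top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        keywords.append(dict(top))
    return keywords


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cluster(keywords, threshold=0.3):
    """Link documents whose similarity reaches the threshold, then return
    the connected components of that graph as sorted lists of indices."""
    parent = list(range(len(keywords)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(keywords)):
        for j in range(i + 1, len(keywords)):
            if cosine(keywords[i], keywords[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i in range(len(keywords)):
        groups[find(i)].append(i)
    return sorted(groups.values())


docs = [
    [("title", "football"), ("body", "match"), ("body", "goal"), ("body", "football")],
    [("title", "football"), ("body", "goal"), ("body", "league")],
    [("title", "stock"), ("body", "market"), ("body", "price")],
    [("title", "stock"), ("body", "price"), ("body", "bank")],
]
print(cluster(extract_keywords(docs)))
```

Keeping only the `top_k` keywords per document is what reduces the dimensionality of the comparison. A semantic similarity measure that scores synonyms above zero could replace `cosine` without changing the rest of the sketch.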
Keywords/Search Tags: keyphrase of document, semantic similarity, clustering algorithm, HowNet, topology network, Chinese word segmentation