Web Document Automatic Classification Based On Keywords

Posted on:2010-06-10

Degree:Master

Type:Thesis

Country:China

Candidate:Y Li

Full Text:PDF

GTID:2178360275977852

Subject:Computer software and theory

Abstract/Summary:

With the growth of the Internet, a large number of electronic documents are being produced. Compared with manual classification, automatic classification of Web documents is fast, efficient, and objective, and its practical value is now well established. Web document categorization has accordingly been gaining attention, particularly in network information retrieval. Traditional Web document classification computes the similarity between documents as the cosine of their term vectors. Because the number of terms in a document is very large, and because these methods do not analyze Web structure or document semantics, the quality of the resulting categorization is not high. To overcome these shortcomings, this thesis proposes an automatic classification method for Web documents based on semantic relationships, which extracts keywords using document structure analysis and an improved TF-IDF calculation. Web documents are the objects to be processed: candidate words are obtained from them through Chinese word segmentation, then the document structure is analyzed and the weights of the candidate keywords are calculated in order to extract the keywords. Using the hierarchical relationships of the semantic structure in HowNet, together with an improved calculation of the parameters, the semantic similarity between documents is computed from their candidate keywords and a topology map is built. The clustering algorithm proposed in this thesis then optimizes the partition through merging operations to obtain the final classification of the Web documents.
The keyword extraction method in this thesis represents document content well. The improved vector space model expresses article content accurately while reducing the dimensionality of Web document clustering. By using semantic relationships between documents, such as synonymous words, the similarity between documents is strengthened and the efficiency of Web document categorization is improved.
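
The pipeline described in the abstract (weight candidate keywords, keep the top ones, compare documents, group similar documents) can be illustrated with a minimal Python sketch. It is not the thesis's method: the abstract does not give the improved TF-IDF formula, the HowNet-based similarity, or the clustering algorithm, so the sketch assumes a simple title boost for structure, a smoothed IDF, plain cosine similarity in place of HowNet semantic similarity, and connected components of a thresholded similarity graph in place of the proposed clustering. All function names and parameters (`extract_keywords`, `title_boost`, `top_k`, `threshold`) are hypothetical, and documents are assumed to be already segmented into tokens.

```python
import math
from collections import Counter


def extract_keywords(docs, title_boost=2.0, top_k=3):
    """Return one {keyword: weight} dict per document.

    docs: list of {"title": [tokens], "body": [tokens]}, already segmented.
    title_boost is an assumed structural weight, not the thesis's formula.
    """
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for d in docs:
        df.update(set(d["title"]) | set(d["body"]))

    vectors = []
    for d in docs:
        # Structure-aware term frequency: title terms get extra weight.
        tf = Counter(d["body"])
        for t in d["title"]:
            tf[t] += title_boost
        total = sum(tf.values())
        # Smoothed IDF so that terms present in every document keep a weight.
        weights = {
            t: (c / total) * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        }
        # Keep only the top_k terms, which reduces the vector dimension.
        top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        vectors.append(dict(top))
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse keyword vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def threshold_clusters(vectors, threshold=0.5):
    """Link documents whose similarity reaches the threshold and return
    the connected components of that graph as sorted lists of indices."""
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    docs = [
        {"title": ["football"], "body": ["football", "match", "goal", "team"]},
        {"title": ["football"], "body": ["team", "goal", "league", "football"]},
        {"title": ["stock"], "body": ["stock", "market", "price", "bank"]},
    ]
    vectors = extract_keywords(docs)
    print(threshold_clusters(vectors))
```

A semantic similarity measure such as one derived from HowNet would replace `cosine` here, so that documents using synonymous but different keywords could still be linked.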

Keywords/Search Tags:

keyphrase of document, semantic similarity, clustering algorithm, HowNet, topology network, Chinese word segmentation

Related items

1	Subjective And Objective Combination Of Semantic Similarity Algorithm And Its Application
2	An Algorithm For Optimizing Word Similarity In "Knowledge Network"
3	Sentence Similarity Computing Combining Multi-features Based On HowNet
4	Chinese Semantic Similarity Dataset Construction And Word Embedding Fused Hownet
5	The Research Of Chinese Automatic Segmentation Method Based On HowNet Semantic Relevancy Computing
6	The Research Of Semantic Similarity Computing Algorithm Based On HowNet
7	Research On Chinese Spam Filtering Based On Semantic Body And Text Clustering
8	Research On Document Clustering Based On Semantic Similarity Of Hownet
9	The Research Of HowNet Based Word Similarity Computation And Its Application
10	A Chinese Word Level Segmentation Algorithm Based On Document Category