
The Research On Keywords Extraction From Chinese News Web Pages Based On Clustering

Posted on: 2010-11-30
Degree: Master
Type: Thesis
Country: China
Candidate: Q Yin
Full Text: PDF
GTID: 2178360275978026
Subject: Computer application technology
Abstract/Summary:
With the rapid growth of the Internet, the number of online documents is increasing quickly, and "information explosion" has become the defining feature of this period. As a brief summary of an article, keywords help users grasp its content quickly and improve their browsing speed. Keywords also play an important role in information retrieval, automatic summarization, document clustering and classification, and related tasks. Keyword extraction is therefore a key technique for coping with this growth.

Keywords can be regarded as the set of words or phrases in a document that are the most important and semantically cohesive, and that exhibit multiple properties. This thesis therefore proposes automatic keyword extraction methods based on clustering analysis of these properties, focusing on how to extract keywords from Chinese news web pages by clustering.

The main work is as follows:

(1) The ideas, principles, and implementations of various clustering methods are examined, and their application prospects in keyword extraction are analyzed.

(2) Given the advantages of hierarchical clustering and the weaknesses of existing keyword extraction methods, a keyword extraction algorithm for Chinese news web pages based on hierarchical clustering analysis (KECA) is proposed, which uses the semantic similarity between words as the clustering distance. By analyzing the important words in a text from a semantic viewpoint, the algorithm is feasible and also compensates for the shortcomings of purely mechanical statistical methods. It also avoids the limitations of machine learning approaches, in particular their dependence on labeled corpora, which are scarce.

(3) Because hierarchical clustering has difficulty handling outliers among the candidate words, a density-based clustering method (KEDC) that uses the co-occurrence of words is introduced. It can find clusters of arbitrary shape and also detect outliers.
In addition, to extract keywords with high precision, the rough clustering results, which are based only on strong co-occurrence information, are refined by addition and pruning. The co-occurrence link strength and the semantic similarity strength between each unclustered word and the sub-clusters are calculated. The addition step adds words with high co-occurrence or semantic link strength to the sub-clusters; the pruning step removes words that have strong co-occurrence information but are not keywords.

Both theoretical analysis and experimental results demonstrate the efficiency and effectiveness of the two algorithms.
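
The abstract does not give the details of KEDC, so the following is only a minimal sketch of the general idea it describes: density-based clustering over word co-occurrence, with unreached words treated as outliers. It is a generic DBSCAN-style procedure, not the thesis's algorithm. The function names, the sentence-level co-occurrence window, and the thresholds `min_cooc` and `min_neighbors` are all assumptions made for illustration, and the input is assumed to be text that has already been segmented into words.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count how often each unordered word pair appears in the same sentence."""
    counts = defaultdict(int)
    for words in sentences:
        for a, b in combinations(sorted(set(words)), 2):
            counts[(a, b)] += 1
    return counts


def density_cluster(sentences, min_cooc=2, min_neighbors=2):
    """DBSCAN-style clustering over a co-occurrence graph (illustrative only).

    Two words are neighbors when they co-occur in at least `min_cooc`
    sentences. A word is a core word when it has at least `min_neighbors`
    neighbors. Clusters grow outward from core words; words that no
    cluster reaches are returned as outliers.
    """
    counts = cooccurrence_counts(sentences)
    vocab = sorted({w for s in sentences for w in s})
    neighbors = {w: set() for w in vocab}
    for (a, b), n in counts.items():
        if n >= min_cooc:
            neighbors[a].add(b)
            neighbors[b].add(a)

    label = {}      # word -> cluster index
    clusters = []   # list of sorted word lists
    for w in vocab:
        # Start a new cluster only from an unlabeled core word.
        if w in label or len(neighbors[w]) < min_neighbors:
            continue
        cid = len(clusters)
        members = []
        stack = [w]
        while stack:
            cur = stack.pop()
            if cur in label:
                continue
            label[cur] = cid
            members.append(cur)
            # Only core words expand the cluster further.
            if len(neighbors[cur]) >= min_neighbors:
                stack.extend(n for n in neighbors[cur] if n not in label)
        clusters.append(sorted(members))

    outliers = [w for w in vocab if w not in label]
    return clusters, outliers


# Toy, pre-segmented input (hypothetical data, not from the thesis).
sentences = [
    ["market", "stock", "price"],
    ["market", "stock", "price"],
    ["stock", "price", "bank"],
    ["weather", "rain"],
    ["market", "bank"],
]
clusters, outliers = density_cluster(sentences)
```

In this toy input, the three words that repeatedly co-occur form one cluster, while the words with only weak co-occurrence links are left as outliers. An addition and pruning stage like the one described above would then operate on these rough sub-clusters and the unclustered words.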
Keywords/Search Tags: Clustering, Keyword Extraction, Semantic Similarity, Co-occurrence