The Research On Keywords Extraction From Chinese News Web Pages Based On Clustering

Posted on:2010-11-30

Degree:Master

Type:Thesis

Country:China

Candidate:Q Yin

Full Text:PDF

GTID:2178360275978026

Subject:Computer application technology

Abstract/Summary:

With the rapid growth of the Internet, the number of online documents is increasing rapidly, and "information explosion" has become the defining feature of this period. As a brief summary of an article, keywords help users grasp its content quickly and browse faster. Keywords also play an important role in information retrieval, automatic summarization, and document clustering and classification. Keyword extraction has therefore become a key technique for coping with this information overload.

Keywords can be regarded as the set of most important and semantically cohesive words or phrases in a document, and they exhibit multiple properties. This thesis therefore proposes automatic keyword extraction based on clustering analysis that exploits these properties, focusing on how to extract keywords from Chinese news web pages by clustering.

The main contributions are as follows:

(1) The ideas, principles, and implementations of several kinds of clustering methods are examined, and their prospects for application to keyword extraction are analyzed.

(2) Given the advantages of hierarchical clustering and the shortcomings of existing keyword extraction methods, a method for keyword extraction from Chinese news web pages based on hierarchical clustering analysis (KECA) is proposed, using the semantic similarity between words as the clustering distance. By analyzing the important words in a text from a semantic viewpoint, the algorithm is not only feasible but also compensates for the shortcomings of purely mechanical statistical methods. It also avoids the limitations of machine learning approaches, in particular the lack of labeled corpora.

(3) Because hierarchical clustering has difficulty handling outlier keywords, a density-based clustering method (KEDC) using the co-occurrence property of words is introduced; it can find clusters of arbitrary shape, including outliers. In addition, to extract keywords with high precision, the rough clustering results, which rely only on strong co-occurrence information, are refined by addition and pruning. The co-occurrence link strength and semantic similarity strength between each unclustered word and the sub-clusters are calculated. The addition step adds words with higher co-occurrence or semantic link strength to the sub-clusters, and the pruning step removes words that have strong co-occurrence information but are not keywords.

Both theoretical analysis and experimental results demonstrate the efficiency and effectiveness of the two algorithms.
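As a rough illustration of the hierarchical-clustering idea behind KECA described in the abstract, the following Python sketch groups candidate words bottom-up, using pairwise semantic similarity in place of a distance. It is not the thesis's implementation: the abstract does not specify the similarity measure, the linkage rule, or the stopping criterion, so the toy similarity table `TOY_SIMILARITY`, the average-linkage rule, and the `threshold` parameter are all assumptions made for this example.

```python
from itertools import combinations

# Hypothetical pairwise semantic similarities in [0, 1], for illustration only.
# A real system would compute these from a lexical resource or a corpus.
TOY_SIMILARITY = {
    frozenset(("economy", "market")): 0.8,
    frozenset(("economy", "finance")): 0.7,
    frozenset(("market", "finance")): 0.75,
    frozenset(("football", "match")): 0.85,
}


def similarity(a, b):
    """Look up the toy similarity; unknown pairs count as unrelated (0.0)."""
    if a == b:
        return 1.0
    return TOY_SIMILARITY.get(frozenset((a, b)), 0.0)


def cluster_similarity(c1, c2):
    """Average-linkage similarity between two clusters (an assumed choice)."""
    total = sum(similarity(a, b) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def agglomerative_clusters(words, threshold=0.5):
    """Repeatedly merge the most similar pair of clusters.

    Merging stops once no pair of clusters reaches `threshold`.
    """
    clusters = [[w] for w in words]
    while len(clusters) > 1:
        # Find the pair of clusters with the highest average similarity.
        best_sim, (i, j) = max(
            (
                (cluster_similarity(clusters[i], clusters[j]), (i, j))
                for i, j in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[0],
        )
        if best_sim < threshold:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


candidates = ["economy", "market", "finance", "football", "match", "weather"]
result = agglomerative_clusters(candidates, threshold=0.5)
```

In this toy run, a word with no similar neighbors ("weather") stays in a singleton cluster, which loosely mirrors the outlier problem that motivates the density-based KEDC method. How keywords are finally selected from the clusters is not described in the abstract and is left out of the sketch.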

Keywords/Search Tags:

Clustering, Keyword Extraction, Semantic Similarity, Co-occurrence

Related items

1	Keyword Extraction From News Web Pages
2	Research On Document Clustering Based On Semantic Similarity Of Hownet
3	Research On Terms Co-occurrence Based Models And Algorithms For Text Mining
4	Design And Implementation Of Information Extraction System Based On Improved TF-IDF Algorithm
5	Research On Keyword Extraction Algorithms Based On Semantic Features
6	Research And Application Of Text Feature Extraction Method Based On Word Co-occurrence Network
7	Keyword Automatic Extraction Based On Similar Documents
8	Research And Implementation Of News Keyword Extraction Method Based On Semantic Clustering And Weighted TextRank
9	Research On Keyword Extraction And Improved LSA Based On Co-occurrence Word
10	Research On Semantic Similarity Computation And Applications