Keyword Extraction From News Web Pages

Posted on: 2010-07-18
Degree: Master
Type: Thesis
Country: China
Candidate: X H Li
Full Text: PDF
GTID: 2178360275977543
Subject: Computer application technology
Abstract/Summary:
The rapid development of information technology and the popularity of the Internet have brought geometric growth in online information, and retrieving and using that information has become increasingly difficult. How to organize and compress large volumes of information, how to search for relevant information, and how to improve the efficiency of information access are now all important issues. In addition, many news web pages, which are major carriers of online information, provide no keywords. Extracting keywords from these pages and presenting them to users would improve browsing speed and the efficiency of information access.

From this point of view, this dissertation studies automatic keyword extraction from news web pages, an important research area in natural language processing and information retrieval. Building on existing research both within China and overseas, and on a detailed analysis and comparison of a variety of keyword extraction algorithms, this dissertation proposes two keyword extraction algorithms, KEUD and KELCC, based on lexical chains and word co-occurrence. Neither algorithm depends on a particular language or corpus. Experiments on randomly selected web pages were performed to demonstrate the quality of the keywords extracted by the proposed algorithms.

The main contributions of this dissertation are as follows:

(1) We demonstrate, through both theory and experiments, that keyword extraction algorithms based on semantic analysis have good application prospects. Building on our keyword extraction algorithm KEUD and adding word correlations to its analysis of word semantic similarity, we propose KELCC, a keyword extraction algorithm based on lexical chains and word co-occurrence that assesses word importance from the perspectives of both word correlation and word semantic similarity.

(2) Word ambiguity is resolved as part of the keyword extraction process. In natural language processing, the construction of a semantic structure depends on the meaning of each word, and representing a semantic structure with lexical chains requires identifying polysemous words. Using the information provided by the context and a knowledge base, and checking the meaning of each word against its whole context while constructing lexical chains, we resolve polysemy by judging the relations between the candidate senses of a polysemous word and its contextual environment.

(3) An algorithm's performance can be maximized by combining feature selection methods according to their specific applications. Based on comparative experiments, we select keywords from candidate words using effective features drawn from the articles, the lexical chains, and the knowledge base.

(4) We use word correlations to handle words that are not contained in the knowledge base. Because the KEUD algorithm needs to evaluate semantic similarity, and computing semantic similarity requires the support of a knowledge base, it is difficult for KEUD to extract words that the knowledge base does not contain. By adding a word co-occurrence module to the KEUD algorithm, the KELCC algorithm improves the extraction of such words by considering the importance of Chinese words from the viewpoints of both word semantic similarity and word correlation.
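The abstract does not specify how the word co-occurrence module is computed, so the following is only a minimal sketch of the general idea of co-occurrence-based scoring, not the KELCC algorithm itself. It assumes sentence-level co-occurrence windows, a tiny hypothetical stopword list, and English-style tokenization (the thesis targets Chinese text, which would need a word segmenter instead). The function name `cooccurrence_keywords` and its parameters are illustrative.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical, deliberately tiny stopword list used only for this illustration.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "after"}


def cooccurrence_keywords(text, top_k=3):
    """Rank words by how often they co-occur with other words in a sentence.

    Assumption: two words "co-occur" if they appear in the same sentence.
    A word's score is the sum of its pairwise co-occurrence counts.
    """
    sentences = [s for s in re.split(r"[.!?]+", text.lower()) if s.strip()]

    pair_counts = Counter()
    for sentence in sentences:
        # Unique non-stopword tokens, sorted so each pair has a canonical order.
        words = sorted(
            {w for w in re.findall(r"[a-z]+", sentence) if w not in STOPWORDS}
        )
        for a, b in combinations(words, 2):
            pair_counts[(a, b)] += 1

    scores = Counter()
    for (a, b), n in pair_counts.items():
        scores[a] += n
        scores[b] += n

    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]


sample = (
    "The market fell sharply. "
    "The market recovered after the bank acted. "
    "The bank cut rates."
)
print(cooccurrence_keywords(sample, top_k=2))
```

Because the score relies only on counts within the document, a word can be ranked even when no knowledge base entry exists for it, which is the property the abstract attributes to the co-occurrence module.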
Keywords/Search Tags: Keyword Extraction, Lexical Chains, Word Co-occurrence, Ambiguity Resolution, Semantic Similarity, Semantic Correlations