Method Of Webpage Keyword Extraction Based On Word Span

Posted on: 2016-01-06
Degree: Master
Type: Thesis
Country: China
Candidate: G Xu
Full Text: PDF
GTID: 2308330470960228
Subject: Computer Science and Technology
Abstract/Summary:
Keywords are commonly used to index the main content of papers, and information retrieval systems use keyword collections to help readers search. In today's Internet era, the volume of information on webpages is enormous and network applications are increasingly rich, so keywords matter more than ever.

Research on keyword extraction started earlier abroad. H. P. Luhn of IBM in the USA first proposed automatic keyword indexing, and the field has now developed for nearly 60 years. Turney was the first to apply machine learning methods, a genetic algorithm and the C4.5 decision tree, to the automatic extraction of key phrases. Methods designed specifically for webpages exploit the differences between webpages and ordinary text, making full use of the various markup tags in a webpage to extract its keywords automatically.

Commonly used keyword extraction algorithms include statistics-based methods, semantics-based methods, and word-network-based methods. Building on existing algorithms, this thesis proposes a webpage keyword extraction method based on word span. Relying on the special characteristics of webpages, the method makes full use of the page markup to analyse the page and identify the body text. It then uses the positions of a word's first and last occurrence in the text, together with the ratio of the number of paragraphs in which the word appears to the total number of paragraphs, to improve the weight formula, which helps reduce the influence of locally concentrated words on the extraction results. The method also takes into account word frequency, part of speech, word location, word length, whether the word appears after a cue word, and other feature factors, and extracts keywords by combining these factors in a weight calculation. In addition, combining high-frequency words into compound terms also helps improve the accuracy of the algorithm.
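The abstract does not give the weight formula, so the following is only a minimal sketch of the word-span idea under stated assumptions. It takes span as the distance between a word's first and last occurrence divided by the text length, and multiplies it by term frequency and by the share of paragraphs containing the word. The function name `word_span_scores` and the multiplicative combination are hypothetical, not the thesis's formula, and the other factors (part of speech, location, length, cue words) are left out.

```python
from collections import defaultdict


def word_span_scores(paragraphs):
    """Score words by frequency, span, and paragraph coverage.

    paragraphs: list of paragraphs, each a list of word tokens.
    Returns a dict mapping each word to an illustrative weight.
    The combination below is an assumption, not the thesis's formula.
    """
    total_words = sum(len(p) for p in paragraphs)
    total_paras = len(paragraphs)
    if total_words == 0:
        return {}

    first = {}                     # position of first occurrence
    last = {}                      # position of last occurrence
    freq = defaultdict(int)        # raw occurrence count
    paras_with = defaultdict(set)  # indices of paragraphs containing the word

    pos = 0
    for i, para in enumerate(paragraphs):
        for word in para:
            first.setdefault(word, pos)
            last[word] = pos
            freq[word] += 1
            paras_with[word].add(i)
            pos += 1

    scores = {}
    for word in freq:
        # Span: fraction of the text covered between first and last use.
        span = (last[word] - first[word] + 1) / total_words
        # Paragraph ratio: share of paragraphs in which the word appears.
        para_ratio = len(paras_with[word]) / total_paras
        tf = freq[word] / total_words
        scores[word] = tf * span * para_ratio
    return scores


if __name__ == "__main__":
    doc = [
        ["web", "page", "keyword"],
        ["keyword", "span"],
        ["web", "keyword"],
    ]
    ranked = sorted(word_span_scores(doc).items(), key=lambda kv: -kv[1])
    for word, score in ranked:
        print(f"{word}\t{score:.4f}")
```

In this sketch, a word that is repeated many times inside a single paragraph gets a small span and a small paragraph ratio, so it ranks below a word that recurs throughout the text. This is one way to read the abstract's claim that span reduces the influence of local words.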
Because the traditional methods consider fewer feature items, their overall effect is not as good as that of the proposed algorithm. The results show that, compared with traditional algorithms, the proposed algorithm significantly improves recall and precision, and the test results become more detailed as the size of the test set grows. At the same time, across texts of different lengths and types, the algorithm shows strong stability, and its results do not deteriorate sharply for any particular type of test set.
Keywords/Search Tags: keyword extraction, page keywords, weight calculation, word span