
Automatic Keyword Extraction Based On Word Networks And Its Application To Chinese Web Page Classification

Posted on: 2010-03-11
Degree: Master
Type: Thesis
Country: China
Candidate: A G Wen
Full Text: PDF
GTID: 2208360275491803
Subject: Computer application technology
Abstract/Summary:
Automatic keyword extraction algorithms fall into three groups: statistics-based, word co-occurrence-based, and terms network-based. The most basic statistical method is word frequency counting, which selects the highest-frequency words as keywords. It is simple and fast, but it cannot extract words that carry the content of the document yet occur with low frequency. Word co-occurrence-based methods require many parameters to be set and often run into boundary problems, so their stability and accuracy are difficult to control. Terms network-based methods rely on the average path length or the clustering coefficient of the document's terms network model. Both measures are defined on connected graphs, so measuring word importance becomes difficult when the terms network of a document is not connected.

With the development of network technology and the growing number of web pages, pages often need to be classified in order to manage web information. Manual classification can no longer meet demand, and many automatic web page classification methods have been proposed. Web page classification techniques are mainly used in search engines, information retrieval, public opinion monitoring, and website management.

Based on the problems of existing automatic keyword extraction methods and on the needs of web page classification, this thesis studies the following: 1. keyword extraction based on the terms network; 2. feature selection using page keywords, followed by web page classification.

The main research results are as follows:

(1) A new strategy for automatic text keyword extraction based on the terms network. The average inverse path length and the effective clustering coefficient are defined so that they apply to non-connected graphs. The importance of a word is measured by deleting it: keywords are selected according to the loss in these two new measures of the terms network after the word is deleted. This gives a better measure of word importance.

(2) Chinese word segmentation based on word similarity. Different segmentation results affect the accuracy of automatic keyword extraction. Many Chinese words share the same meaning; the similarity of two words is the similarity of their meanings, and the similarity of a word with itself is 1. This thesis sets a threshold based on experimental results: if the similarity between two or more words exceeds the threshold, they are treated as one word. This segmentation strategy effectively improves the accuracy of keyword extraction.

(3) Using web page keywords for feature selection effectively reduces the dimension of the document vector model. In the experiments, comparison with the classification results of other feature selection algorithms shows that keyword-based feature selection is feasible. Computing word importance from the two new terms network definitions makes keyword extraction more accurate. Using keywords for web page classification loses less information, and the classification results are satisfactory.
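The word-deletion measure in result (1) can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: the terms network is built from co-occurrence within a small window, the average inverse path length is taken as the mean of 1/d over all ordered word pairs (unreachable pairs contribute 0, which keeps the value defined on non-connected graphs), the effective clustering coefficient is approximated by the mean local clustering coefficient with words of degree below 2 contributing 0, and the two losses are simply added. All function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def build_graph(sentences, window=2):
    """Assumed construction: link words at most `window` positions apart."""
    graph = {}
    for words in sentences:
        for i, w in enumerate(words):
            graph.setdefault(w, set())
            for v in words[i + 1:i + 1 + window]:
                if v != w:
                    graph[w].add(v)
                    graph.setdefault(v, set()).add(w)
    return graph


def bfs_distances(graph, source):
    """Shortest hop counts from `source` to every reachable word."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def avg_inverse_path_length(graph):
    """Mean of 1/d over ordered pairs; unreachable pairs add 0 (assumed form)."""
    n = len(graph)
    if n < 2:
        return 0.0
    total = 0.0
    for u in graph:
        for v, d in bfs_distances(graph, u).items():
            if v != u:
                total += 1.0 / d
    return total / (n * (n - 1))


def clustering_coefficient(graph):
    """Mean local clustering; words with fewer than 2 neighbours add 0."""
    if not graph:
        return 0.0
    total = 0.0
    for nbrs in graph.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(graph)


def remove_word(graph, word):
    """Copy of the network with `word` and its edges deleted."""
    return {u: nbrs - {word} for u, nbrs in graph.items() if u != word}


def deletion_scores(graph):
    """Score each word by the loss in both measures when it is deleted.

    Adding the two losses is an assumption; the thesis may combine them
    differently.
    """
    base_l = avg_inverse_path_length(graph)
    base_c = clustering_coefficient(graph)
    scores = {}
    for w in graph:
        reduced = remove_word(graph, w)
        loss_l = base_l - avg_inverse_path_length(reduced)
        loss_c = base_c - clustering_coefficient(reduced)
        scores[w] = loss_l + loss_c
    return scores


if __name__ == "__main__":
    # Toy segmented sentences; real input would come from a Chinese segmenter.
    toy = [["network", "keyword"], ["network", "page"], ["network", "class"]]
    scores = deletion_scores(build_graph(toy, window=1))
    for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(word, round(score, 3))
```

In this toy network "network" is the hub of a star, so deleting it disconnects every other word and it receives the highest score. Words with the largest scores would be taken as keywords; how many to keep is left open here, as the abstract does not say.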
Keywords/Search Tags:word similarity, keywords automatic extraction, terms network, Chinese web pages classification