
Improving the HITS Algorithm for Acquiring Key Web Pages, and a Position-Based Information Extraction Method

Posted on: 2010-02-20
Degree: Master
Type: Thesis
Country: China
Candidate: H S Chen
Full Text: PDF
GTID: 2208360275491495
Subject: Computer application technology
Abstract/Summary:
With the growth of informatization, more and more data is shared on the Internet. In a collection as large as the Internet, obtaining the information one needs is a difficult problem, and it involves two core sub-problems: how to acquire the important web pages, and how to extract structured information from them. Search engines are the standard Internet retrieval tools, but because they are general-purpose they must treat every web page equally, and they are therefore poorly suited to information retrieval in specific domains. Moreover, a considerable portion of web documents are unstructured or semi-structured, while traditional information extraction methods generally assume structured data or plain text. How to extract information from web pages has thus become a research hotspot in recent years, giving rise to a new research sub-area: web information extraction.

This thesis studies a method for acquiring important web pages and extracting structured information from them. First, after analyzing the advantages and disadvantages of two link-analysis methods, HITS and PageRank, the thesis adopts HITS as its basis. Experiments show that traditional HITS has two flaws: it ignores new web pages, and it is vulnerable to "spam links", so it is unsuitable for applications that deal with up-to-date information, such as news. Building on earlier work that counters spam links with a filter, the thesis presents a new algorithm, TimeWeightedHits, which adds a time factor to improve HITS further. Experiments show that
it can filter unwanted web pages effectively and acquire appropriate, up-to-date important pages. Second, to extract information from these pages, the thesis presents a position-based information extraction method. By simulating the rendering process of a web browser, it obtains the exact on-screen position of each tag in an HTML document and derives position features from it. Part of the page set retrieved by TimeWeightedHits is then used as a training set for an SVM, which produces a classification model used to predict labels on the test set. Compared with a method based on manually defined heuristic rules, this approach greatly improves accuracy and reduces the effort required during training.
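The abstract does not give pseudo-code for TimeWeightedHits, but the core idea of adding a time factor to HITS can be sketched as follows. This is a minimal illustration, assuming the time factor is an exponential decay on page age applied when authority scores are updated; the function name, decay rate, and graph representation are illustrative, not taken from the thesis.

```python
import math

def time_weighted_hits(links, age_days, decay=0.05, iters=50):
    """Hub/authority iteration with a freshness weight on authorities.

    links:    dict mapping page -> list of pages it links to
    age_days: dict mapping page -> age in days (smaller = newer)
    decay:    assumed exponential decay rate for the time factor
    """
    pages = set(links) | {q for out in links.values() for q in out}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    # Time factor: fresh pages keep weight near 1, stale pages decay toward 0.
    w = {p: math.exp(-decay * age_days.get(p, 0)) for p in pages}
    for _ in range(iters):
        # Authority update: sum of hub scores of in-linking pages, scaled by freshness.
        auth = {p: w[p] * sum(hub[q] for q in pages if p in links.get(q, ()))
                for p in pages}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub update: sum of authority scores of out-linked pages.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth
```

With this weighting, a stale page that receives the same in-links as a fresh one ends up with a lower authority score, which matches the stated goal of favoring up-to-date pages such as news.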
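The position features fed to the SVM can likewise be illustrated with a small sketch. The abstract does not list the exact features used; the ones below (normalized center, area fraction, aspect ratio) are plausible assumptions, and the bounding boxes are presumed to come from an upstream step that simulates the browser's layout engine.

```python
def position_features(box, page_w, page_h):
    """Normalized layout features for one rendered HTML element.

    box: (x, y, width, height) of the element on the browser screen,
         as produced by an assumed upstream rendering-simulation step.
    Returns a feature vector suitable as input to an SVM classifier.
    """
    x, y, w, h = box
    cx = (x + w / 2) / page_w            # horizontal center, in [0, 1]
    cy = (y + h / 2) / page_h            # vertical center, in [0, 1]
    area = (w * h) / (page_w * page_h)   # fraction of the page occupied
    aspect = w / h if h else 0.0         # width-to-height ratio
    return [cx, cy, area, aspect]
```

Vectors like these, computed for each tag in the training pages, would then be paired with manual labels (e.g. "target content" vs. "noise") and passed to an off-the-shelf SVM trainer to produce the classification model described above.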
Keywords/Search Tags: Link Analysis, Web Information Extraction, SVM