
Improving the HITS Algorithm for Acquiring Key Web Pages, and a Position-Based Information Extraction Method

Posted on: 2010-02-20
Degree: Master
Type: Thesis
Country: China
Candidate: H S Chen
Full Text: PDF
GTID: 2208360275491495
Subject: Computer application technology
Abstract/Summary:
With the growth of informatization, more and more data is shared on the Internet. In a collection as large as the Internet, obtaining the information one needs is a difficult problem, and it involves two core sub-problems: how to acquire the important web pages, and how to extract structured information from them. Search engines are the standard Internet retrieval tools, but because they are general-purpose they must treat every web page equally, and they are therefore poorly suited to information retrieval in specific domains. Moreover, a considerable portion of web documents are unstructured or semi-structured, while traditional information extraction methods generally assume structured data or plain text. How to extract information from web pages has thus become a research hotspot in recent years, giving rise to a new research sub-area: web information extraction.

This thesis studies a method for acquiring important web pages and extracting structured information from them. First, after analyzing the advantages and disadvantages of two link-analysis methods, HITS and PageRank, the thesis adopts HITS as its basis. Experiments show that traditional HITS has two flaws: it ignores new web pages, and it is vulnerable to "spam links", so it is unsuitable for applications that deal with up-to-date information, such as news. Building on earlier work that counters spam links with a filter, the thesis presents a new algorithm, TimeWeightedHits, which adds a time factor to improve HITS further. Experiments show that
it can filter unwanted web pages effectively and acquire appropriate, up-to-date important pages. Second, to extract information from these pages, the thesis presents a position-based information extraction method. By simulating the rendering process of a web browser, it obtains the exact on-screen position of each tag in an HTML document and derives position features from it. Part of the page set retrieved by TimeWeightedHits is then used as a training set for an SVM, which produces a classification model used to predict labels on the test set. Compared with a method based on manually defined heuristic rules, this approach greatly improves accuracy and reduces the effort required during training.
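The abstract does not give pseudo-code for TimeWeightedHits, but the core idea of adding a time factor to HITS can be sketched as follows. This is a minimal illustration, assuming the time factor is an exponential decay on page age applied when authority scores are updated; the function name, decay rate, and graph representation are illustrative, not taken from the thesis.

```python
import math

def time_weighted_hits(links, age_days, decay=0.05, iters=50):
    """Hub/authority iteration with a freshness weight on authorities.

    links:    dict mapping page -> list of pages it links to
    age_days: dict mapping page -> age in days (smaller = newer)
    decay:    assumed exponential decay rate for the time factor
    """
    pages = set(links) | {q for out in links.values() for q in out}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    # Time factor: fresh pages keep weight near 1, stale pages decay toward 0.
    w = {p: math.exp(-decay * age_days.get(p, 0)) for p in pages}
    for _ in range(iters):
        # Authority update: sum of hub scores of in-linking pages, scaled by freshness.
        auth = {p: w[p] * sum(hub[q] for q in pages if p in links.get(q, ()))
                for p in pages}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub update: sum of authority scores of out-linked pages.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth
```

With this weighting, a stale page that receives the same in-links as a fresh one ends up with a lower authority score, which matches the stated goal of favoring up-to-date pages such as news.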
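The position features fed to the SVM can likewise be illustrated with a small sketch. The abstract does not list the exact features used; the ones below (normalized center, area fraction, aspect ratio) are plausible assumptions, and the bounding boxes are presumed to come from an upstream step that simulates the browser's layout engine.

```python
def position_features(box, page_w, page_h):
    """Normalized layout features for one rendered HTML element.

    box: (x, y, width, height) of the element on the browser screen,
         as produced by an assumed upstream rendering-simulation step.
    Returns a feature vector suitable as input to an SVM classifier.
    """
    x, y, w, h = box
    cx = (x + w / 2) / page_w            # horizontal center, in [0, 1]
    cy = (y + h / 2) / page_h            # vertical center, in [0, 1]
    area = (w * h) / (page_w * page_h)   # fraction of the page occupied
    aspect = w / h if h else 0.0         # width-to-height ratio
    return [cx, cy, area, aspect]
```

Vectors like these, computed for each tag in the training pages, would then be paired with manual labels (e.g. "target content" vs. "noise") and passed to an off-the-shelf SVM trainer to produce the classification model described above.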
Keywords/Search Tags: Link Analysis, Web Information Extraction, SVM