Research And Implementation Of Page Object Extraction Model For Vertical Search Engine

Posted on:2010-07-10

Degree:Master

Type:Thesis

Country:China

Candidate:S Wang

Full Text:PDF

GTID:2178360275484514

Subject:Software engineering

Abstract/Summary:

With the rapid development of Internet technologies, the amount of information on the Internet is growing exponentially. Because online resources are heterogeneous and dynamic, an important research question is how to retrieve and process this large volume of online documents. Web information extraction is the task of automatically extracting specific types of information from semi-structured documents such as web pages, converting it into structured data, and populating database slots for later queries. It is an important way to improve the performance of search engines, especially vertical search engines. This thesis studies algorithms for applying web information extraction to vertical search engines.

The thesis first summarizes the main techniques used in web information extraction, then analyzes the architecture of a web information extraction system in a vertical search engine and identifies its three core processes: Template Detection, Wrapper Generation and Data Extraction. It then proposes solutions to the limitations that traditional techniques show in each of these processes when used in vertical search engines.

For Template Detection, the thesis proposes a new algorithm that computes the structural similarity of pages based on DOM tree edit distance. The algorithm assigns different values to different nodes according to their weight in the page layout. The experimental results show that page clustering with the new algorithm works better than the traditional methods.

For Wrapper Generation, the thesis proposes an algorithm that joins wrapper generation with layout-based clustering. This algorithm combines the page structure similarity computation of Template Detection with Wrapper Generation to improve the process as a whole.

For Data Extraction, the thesis gives a definition of the page object and proposes a wrapper matching algorithm based on tree alignment. The experimental results show that the new algorithms save time and human effort while maintaining high precision and recall, which makes web information extraction better suited to commercial vertical search engines.

Finally, the thesis discusses process optimization for commercial search engines, including the optimization of crawling and of wrapper matching based on an analysis of URL patterns and of page information quality. It also presents the design and implementation of a web information extraction system for a commercial vertical search engine. The algorithms and design were applied in a vertical search engine, the GeeSeek BLOG search engine, built with Silverlight technology on the .NET Framework. In practice, the system improves the user's search experience.

At present, most research on web information extraction focuses on extracting information from constructed web pages. However, a great deal of information is stored in server-side databases located in different places, and extracting this valuable information is the subject of our future work.
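The weighted structural similarity used for Template Detection can be illustrated with a short sketch. The abstract does not give the thesis's actual edit operations or weights, so everything below is an assumption made for illustration: the names `LAYOUT_WEIGHT`, `edit_distance` and `similarity` are hypothetical, the weights are invented, and the distance is a simplified top-down (Selkow-style) tree edit distance rather than the algorithm proposed in the thesis. It only shows the general idea that layout-heavy nodes such as `div` or `table` contribute more to the distance than inline nodes.

```python
from html.parser import HTMLParser

# Hypothetical layout weights (not from the thesis): block-level
# containers count more than inline or unknown tags.
LAYOUT_WEIGHT = {"table": 3.0, "div": 3.0, "ul": 2.0, "tr": 2.0,
                 "td": 1.5, "li": 1.5, "p": 1.0}
DEFAULT_WEIGHT = 0.5
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag):
        self.tag = tag
        self.children = []


class DomBuilder(HTMLParser):
    """Builds a tag-only tree; text and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, tolerating sloppy HTML.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def parse(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def weight(node):
    return LAYOUT_WEIGHT.get(node.tag, DEFAULT_WEIGHT)


def subtree_cost(node):
    """Cost of inserting or deleting a whole subtree."""
    return weight(node) + sum(subtree_cost(c) for c in node.children)


def edit_distance(a, b):
    """Top-down distance: roots are matched or relabeled, then the
    two child lists are aligned with dynamic programming."""
    relabel = 0.0 if a.tag == b.tag else max(weight(a), weight(b))
    xs, ys = a.children, b.children
    dp = [[0.0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i in range(1, len(xs) + 1):
        dp[i][0] = dp[i - 1][0] + subtree_cost(xs[i - 1])
    for j in range(1, len(ys) + 1):
        dp[0][j] = dp[0][j - 1] + subtree_cost(ys[j - 1])
    for i in range(1, len(xs) + 1):
        for j in range(1, len(ys) + 1):
            dp[i][j] = min(
                dp[i - 1][j] + subtree_cost(xs[i - 1]),      # delete subtree
                dp[i][j - 1] + subtree_cost(ys[j - 1]),      # insert subtree
                dp[i - 1][j - 1] + edit_distance(xs[i - 1], ys[j - 1]),
            )
    return relabel + dp[-1][-1]


def similarity(html_a, html_b):
    """Normalized structural similarity in (0, 1]; 1.0 means same tag tree."""
    a, b = parse(html_a), parse(html_b)
    total = subtree_cost(a) + subtree_cost(b)
    return 1.0 - edit_distance(a, b) / total
```

Two pages generated from the same template but carrying different text have identical tag trees and therefore score 1.0, while pages with different layouts score lower. A clustering step for Template Detection could then group pages whose pairwise similarity exceeds a chosen threshold; the threshold, like the weights, would have to be tuned on real data.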

Keywords/Search Tags:

Web Information Extraction, Vertical Search Engine, Wrapper Detection, Wrapper Generation, DOM Tree

Related items

1	Research Of A Suffix Tree Based Automatic Wrapper Generation Method
2	Research For Information Extraction Based On Wrapper Model Algorithm
3	Web Page Attribute Extraction Method Research
4	Algorithm Research For Text Information Extraction Based On Wrapper Model
5	A Web News Extraction Method Based On Filtering Noise Wrapper
6	SRAM Wrapper Automatic Generation Tool Development And Software Implementation
7	Application of wrapper methods to non-invasive brain-state detection: An opto-electric approach
8	Research On Wrapper Adaptation In Web Data Integration
9	Research And Implementation Of Intelligent Comparison Shopping On Internet
10	Research On Automatic And Efficient Technologies For Web Information Extraction