Research And Implementation Of Page Object Extraction Model For Vertical Search Engine

Posted on: 2010-07-10
Degree: Master
Type: Thesis
Country: China
Candidate: S Wang
Full Text: PDF
GTID: 2178360275484514
Subject: Software engineering
Abstract/Summary:
With the rapid development of Internet technology, the amount of information on the Internet is growing exponentially. Because online resources are heterogeneous and dynamic, an important research question is how to retrieve and process this large volume of online documents. Web information extraction is the task of automatically extracting specific types of information from semi-structured documents, such as web pages and form-structured data, and using it to populate database slots for queries. It is an important means of improving the performance of search engines, especially vertical search engines. This thesis studies algorithms for applying web information extraction to vertical search engines.
This thesis first summarizes the main techniques used in web information extraction, then analyzes the architecture of a web information extraction system in a vertical search engine and identifies its three core processes: Template Detection, Wrapper Generation, and Data Extraction. It then proposes solutions to the limitations of traditional techniques when they are applied to vertical search engines.
For Template Detection, this thesis proposes a new algorithm that computes the structural similarity of pages based on DOM tree edit distance, assigning different values to nodes according to their weight in the page layout. Experimental results show that page clustering with the new algorithm works better than traditional methods.
For Wrapper Generation and Data Extraction, this thesis proposes a wrapper generation algorithm combined with layout-based clustering. The algorithm combines the page-structure similarity computation of Template Detection with Wrapper Generation to improve the whole process.
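The weighted structural similarity described for Template Detection can be illustrated with a minimal sketch. This is not the thesis's algorithm: it uses a simplified top-down edit distance in which whole subtrees are inserted or deleted, and the `Node` class, the `LAYOUT_WEIGHTS` table, and its values are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical layout weights: tags that shape the page layout cost more to edit.
LAYOUT_WEIGHTS: Dict[str, float] = {"table": 3.0, "div": 3.0, "ul": 2.0, "tr": 2.0}
DEFAULT_WEIGHT = 1.0


@dataclass
class Node:
    """A DOM element reduced to its tag name and child elements."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def subtree_weight(node: Node) -> float:
    """Total weight of a subtree, i.e. the cost of inserting or deleting it."""
    own = LAYOUT_WEIGHTS.get(node.tag, DEFAULT_WEIGHT)
    return own + sum(subtree_weight(c) for c in node.children)


def edit_distance(a: Node, b: Node) -> float:
    """Simplified top-down weighted edit distance between two trees."""
    if a.tag != b.tag:
        # Different roots: delete one subtree and insert the other.
        return subtree_weight(a) + subtree_weight(b)
    m, n = len(a.children), len(b.children)
    # dp[i][j] = cost of editing the first i children of a into the first j of b.
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + subtree_weight(a.children[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + subtree_weight(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + subtree_weight(a.children[i - 1]),  # delete
                dp[i][j - 1] + subtree_weight(b.children[j - 1]),  # insert
                dp[i - 1][j - 1] + edit_distance(a.children[i - 1], b.children[j - 1]),
            )
    return dp[m][n]


def similarity(a: Node, b: Node) -> float:
    """Normalize the distance to [0, 1]; 1.0 means identical structure."""
    return 1.0 - edit_distance(a, b) / (subtree_weight(a) + subtree_weight(b))


# Two pages that differ only by one layout-heavy <table> element.
page_a = Node("div", [Node("p"), Node("table")])
page_b = Node("div", [Node("p")])
print(round(similarity(page_a, page_b), 3))  # 0.727
```

Pages whose pairwise similarity exceeds some threshold could then be clustered as sharing a template. Because the missing `table` carries a high weight, it lowers the score more than a missing `p` would.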
For Data Extraction, this thesis defines the page object and proposes a wrapper matching algorithm based on tree alignment. Experimental results show that the new algorithms save time and human effort while maintaining high precision and recall, which makes web information extraction better suited to commercial vertical search engines.
Finally, this thesis discusses process optimization for a commercial search engine, including optimizing crawling and wrapper matching based on analysis of URL patterns and the quality of page information. It also presents the design and implementation of a web information extraction system for a commercial vertical search engine. The algorithms and design were applied in a vertical search engine, the GeeSeek BLOG search engine, which is built on Silverlight technology on the .NET Framework; in practice, the system improves the user's search experience.
Most current research on web information extraction focuses on extracting information from already constructed web pages. However, a great deal of information is stored in server databases located in different places, and extracting this valuable information is our future work.
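Wrapper matching based on tree alignment, as described for Data Extraction, can be sketched as follows. This is a simplified illustration and not the thesis's algorithm: the wrapper is assumed to be a template tree whose nodes may carry a field name, children are aligned with a longest-common-subsequence match on tag names, and the page object is represented as a plain dictionary. `PageNode`, `WrapperNode`, and `slot` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class PageNode:
    """An element of the page being extracted from."""
    tag: str
    text: str = ""
    children: List["PageNode"] = field(default_factory=list)


@dataclass
class WrapperNode:
    """A template element; `slot` names the field to extract, if any."""
    tag: str
    slot: Optional[str] = None
    children: List["WrapperNode"] = field(default_factory=list)


def align_children(
    w_children: List[WrapperNode], p_children: List[PageNode]
) -> List[Tuple[WrapperNode, PageNode]]:
    """Align two child sequences by longest common subsequence of tag names."""
    m, n = len(w_children), len(p_children)
    lcs = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if w_children[i].tag == p_children[j].tag:
                lcs[i][j] = 1 + lcs[i + 1][j + 1]
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    pairs = []
    i = j = 0
    while i < m and j < n:
        if w_children[i].tag == p_children[j].tag:
            pairs.append((w_children[i], p_children[j]))
            i, j = i + 1, j + 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            i += 1  # wrapper node has no counterpart on this page
        else:
            j += 1  # page node (e.g. an ad or image) is not in the wrapper
    return pairs


def extract(
    wrapper: WrapperNode, page: PageNode, out: Optional[Dict[str, str]] = None
) -> Dict[str, str]:
    """Walk both trees top-down and collect the text found at each slot."""
    if out is None:
        out = {}
    if wrapper.tag != page.tag:
        return out
    if wrapper.slot is not None:
        out[wrapper.slot] = page.text.strip()
    for w_child, p_child in align_children(wrapper.children, page.children):
        extract(w_child, p_child, out)
    return out


wrapper = WrapperNode("div", children=[
    WrapperNode("h2", slot="title"),
    WrapperNode("span", slot="author"),
    WrapperNode("p", slot="body"),
])
page = PageNode("div", children=[
    PageNode("h2", "Hello"),
    PageNode("img"),  # extra node that the wrapper does not mention
    PageNode("span", "alice"),
    PageNode("p", " First post. "),
])
print(extract(wrapper, page))
```

The alignment step lets the wrapper tolerate extra or missing nodes on a page, such as the `img` element here, instead of requiring an exact structural match.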
Keywords/Search Tags:Web Information Extraction, Vertical Search Engine, Wrapper Detection, Wrapper Generation, DOM Tree