Research On Object Extraction Based On The Object-level Vertical Search In Specific Field

Posted on:2016-08-27

Degree:Master

Type:Thesis

Country:China

Candidate:L Mei

GTID:2308330473454392

Subject:Computer system architecture

Abstract/Summary:

With the development of information technology, a wide variety of websites have appeared on the Internet, each providing different kinds of information. Finding exactly the information we need has therefore become a new challenge. Search engines such as Google emerged against this background to help users locate information quickly. Search engines have evolved from the original catalog search, which was partly maintained by hand, to today's mainstream full-text search engines and vertical search engines. Full-text search, currently the most mature technology, still has drawbacks: in some specific domains it cannot achieve ideal recall and precision. Vertical search engines make up for some of these drawbacks by collecting more domain-related information, but, like full-text search, they return only webpage links as results.

Hence a new search technology, object-level vertical search, has been created. It searches a specific domain at the level of objects, and its query results are a relatively small number of objects rather than a series of webpage links. However, the object information extraction module in existing object-level search engines is semi-automatic: it requires a great deal of manual effort to label webpages in order to obtain prior knowledge of them. Therefore, in this thesis we improve the RoadRunner automatic extraction algorithm, and design and implement an automatic information extraction module for an object-level vertical search engine.

In comparison with previous work, our contributions are summarized as follows:

(1) We improve the simple tree matching algorithm to make similarity detection more accurate. The original simple tree matching algorithm treats all tag nodes of a webpage's DOM tree uniformly, without considering the particularity of iterative tags. The improved algorithm first applies dedicated processing to iterative tags and then proceeds as the original does.

(2) We improve the attribute labeling module of the RoadRunner algorithm by cross-labeling the objects extracted by different wrappers according to their associations, which raises the attribute labeling rate of the extracted data. RoadRunner's attribute labeling technique chooses the best attribute name for each attribute value using the coordinate distance between attribute names and values. However, most attribute values on the Internet do not have attribute names, so this labeling method has some defects.

Finally, the improved algorithm is implemented in the object information extraction module and tested in the domain of book search.
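
To make the first contribution more concrete, the following is a minimal sketch of the classic simple tree matching recurrence on a toy tree type. The abstract does not say how iterative tags are processed, so the `collapse_repeats` step (merging consecutive siblings that share a tag before matching) is only an assumed stand-in, not the thesis's method. The `Node` class, the function names, and the normalization used in `similarity` are likewise hypothetical choices made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy DOM node: a tag name plus ordered children (hypothetical type)."""
    tag: str
    children: list[Node] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Classic simple tree matching: number of node pairs in a maximum
    order-preserving matching between the trees rooted at a and b."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # dp[i][j]: best matching between the first i children of a
    # and the first j children of b.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(
                dp[i - 1][j],
                dp[i][j - 1],
                dp[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    return dp[m][n] + 1  # +1 for the matched roots


def size(node: Node) -> int:
    """Number of nodes in the tree."""
    return 1 + sum(size(c) for c in node.children)


def similarity(a: Node, b: Node) -> float:
    """Matching size divided by the average tree size (assumed normalization)."""
    return 2 * simple_tree_matching(a, b) / (size(a) + size(b))


def collapse_repeats(node: Node) -> Node:
    """ASSUMPTION: treat runs of consecutive same-tag siblings as one
    iterative tag and keep only the first of each run."""
    kept: list[Node] = []
    for child in node.children:
        child = collapse_repeats(child)
        if kept and kept[-1].tag == child.tag:
            continue
        kept.append(child)
    return Node(node.tag, kept)


# Two list pages that differ only in how many items they contain.
page_a = Node("ul", [Node("li") for _ in range(2)])
page_b = Node("ul", [Node("li") for _ in range(5)])

print(round(similarity(page_a, page_b), 3))                # 0.667
print(similarity(collapse_repeats(page_a), collapse_repeats(page_b)))  # 1.0
```

In this toy case the plain algorithm penalizes the two pages for having different numbers of repeated items, while the collapsed trees match exactly. This illustrates why handling iterative tags separately can make similarity detection between same-template pages more reliable.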

Keywords/Search Tags:

Object level search engine, Web information extraction, RoadRunner algorithm, Property marking

Related items

1	Web Object Extraction Retrieval System Design And Implementation
2	Design And Implementation Of Vehicle Quality Issues Tracking Information System
3	Research On Web Information Extraction Technology In Vertical Search Engine
4	Research And Implementation Of Page Object Extraction Model For Vertical Search Engine
5	Research On Several Key Issues Of Crawler In Search Engine
6	Intellectual Property Search Engine Analysis And Design
7	Semantic Research And Implementation On Agricultural Vertical Search Engine
8	Research And Application Of An Elimination Algorithm For Redundant Information On Search Engine's Result
9	On The Research And Development Of A Video Search Engine For Chinese Web
10	Study On The Key Algorithm Of Vertical Search Engine In Silk Area