With the development of information technology, a wide variety of websites have become available on the Internet, each providing different kinds of information. Finding exactly the information we need has therefore become a new challenge. Search engines such as Google emerged against this background, helping users locate the information they need quickly. Search engines have evolved from the original directory-based search, which was partly maintained by hand, to today's mainstream full-text search engines and vertical search engines. Even the most mature full-text search engines have drawbacks: in some specific domains they cannot achieve ideal recall and precision. Vertical search engines make up for some of these drawbacks by collecting more relevant information within a specific domain, but, like full-text search, they still return only webpage links as results. Hence, a new search technology called object-level vertical search has been created. It searches a specific domain on the basis of objects, and its query results are a relatively small number of objects rather than a long list of webpage links. However, the object information extraction module in existing object-level search engines is semi-automatic and requires considerable manual effort to obtain prior knowledge of the webpages when annotating sample pages. Therefore, in this thesis we improve the RoadRunner automatic extraction algorithm, and design and implement an automatic information extraction module for an object-level vertical search engine.

In comparison with previous work, our contributions are summarized as follows:

(1) We improve the simple tree matching algorithm to make it more accurate when detecting similarity. The original simple tree matching algorithm treats all tag nodes of a webpage's DOM tree uniformly, without considering the particularity of iterative tags. The improved algorithm first applies dedicated processing to iterative tags and then proceeds like the original.

(2) We improve the attribute labeling module of the RoadRunner algorithm by cross-labeling the objects extracted by different wrappers according to their associations, which raises the attribute labeling rate of the extracted data. The RoadRunner algorithm labels attributes by choosing the best attribute name for each attribute value according to the coordinate distance between attribute names and values. However, most attribute values on the Internet do not have attribute names, so this labeling method has shortcomings.

Finally, the improved algorithm is implemented in the object information extraction module and tested in the domain of book search.
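
To make the tree matching discussion in contribution (1) more concrete, the following is a minimal Python sketch of the classic simple tree matching recurrence, together with one possible way of handling iterative tags. The `Node` class, the `collapse_repeats` preprocessing step (which merges consecutive sibling nodes that share a tag) and the normalized `similarity` score are illustrative assumptions, not the processing actually used in this thesis.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical minimal DOM node: a tag name plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def size(node: Node) -> int:
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(size(child) for child in node.children)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Classic simple tree matching: size of the maximum matching of a and b."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # dp[i][j]: best matching between the first i children of a
    # and the first j children of b, preserving sibling order.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            dp[i][j] = max(dp[i][j - 1], dp[i - 1][j], dp[i - 1][j - 1] + pair)
    return dp[m][n] + 1  # +1 for the matched roots


def collapse_repeats(node: Node) -> Node:
    """Assumed preprocessing: keep one node per run of same-tag siblings."""
    kept: List[Node] = []
    for child in node.children:
        child = collapse_repeats(child)
        if kept and kept[-1].tag == child.tag:
            continue  # treat as another iteration of the previous sibling
        kept.append(child)
    return Node(node.tag, kept)


def similarity(a: Node, b: Node, collapse: bool = False) -> float:
    """Matching size normalized by the average size of the two trees."""
    if collapse:
        a, b = collapse_repeats(a), collapse_repeats(b)
    return 2 * simple_tree_matching(a, b) / (size(a) + size(b))
```

Under these assumptions, two list pages that differ only in how many `li` items they contain score below 1 with the original recurrence but score exactly 1 once repeated siblings are collapsed, which is the kind of effect that special handling of iterative tags is meant to achieve.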