Research on Ontology-Based Customizable Intelligent Web Information Extraction Technology

Posted on:2007-07-09

Degree:Master

Type:Thesis

Country:China

Candidate:X D Wu

Full Text:PDF

GTID:2208360182493754

Subject:Computer application technology

Abstract/Summary:

Information on the World Wide Web is enormous, distributed, dynamic, heterogeneous and unstructured. Users lack a suitable way to make use of it, and traditional Internet information retrieval cannot satisfy their needs. Web mining technology is therefore needed to obtain detailed, structured information from the Internet: it aims to extract implicit patterns or information of interest to the user from large collections of web documents. However, most existing web mining systems have drawbacks: they can be applied to only a few websites, and they require a great deal of professional training. They are therefore not suitable for extracting information from different web sources and varied representations.

In this thesis, we propose an information extraction algorithm that overcomes these drawbacks, and we have implemented it successfully in the UTStarcom mobile phone information service system. The algorithm is based on HTML structure and ontology; it analyzes webpage structure and extracts information automatically. It is highly robust and adaptive.

Chapter 1 introduces the significance and background of the research and presents the topic of the thesis. Chapter 2 reviews the history of information extraction, analyzes several representative systems, and explains the concept of ontology and related work on using ontologies in information extraction systems. Chapter 3 presents ORM, the ontology model used in our system. We use the object-relation model to construct the target ontology; by parsing the ORM description, we obtain the target constants, keywords and database schema for later use. Chapter 4 focuses on eliminating noise from webpages. By simplifying and merging the HTML tag tree, we construct our HTML structure tree; we then use the similarity of noise blocks across different pages and additional features within a single block to purify the webpage. Chapter 5 proposes our information extraction algorithm. With the help of several heuristic hypotheses, we use the ontology to extract information from tables and general records and store the results in a database automatically. Chapter 6 describes the implementation details and the evaluation criteria. A performance test of our system and several existing products indicates that our system has certain advantages over the other products, which validates the work of this thesis in improving system performance. Chapter 7 summarizes the work and proposes future work.
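
The noise-elimination idea summarized for Chapter 4 can be illustrated with a minimal sketch. It is not the thesis's algorithm: it assumes well-formed HTML, merges inline tags into their enclosing block, and treats a block as noise only when its text is identical on every page, a much cruder test than the similarity measure the abstract mentions. The names `BLOCK_TAGS`, `BlockCollector`, `page_blocks` and `purify` are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: only these tags start a block; every other tag
# (a, b, span, ...) is merged into the enclosing block.
BLOCK_TAGS = {"div", "td", "li", "p"}


class BlockCollector(HTMLParser):
    """Collect the text of the innermost block-level elements."""

    def __init__(self):
        super().__init__()
        self.stack = []   # one text buffer per open block tag
        self.blocks = []  # normalized text of finished blocks

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.stack.append([])

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.stack:
            # Join the pieces, then collapse runs of whitespace.
            text = " ".join("".join(self.stack.pop()).split())
            if text:
                self.blocks.append(text)

    def handle_data(self, data):
        if self.stack:
            self.stack[-1].append(data)


def page_blocks(html):
    """Return the block texts of one page, in document order."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return parser.blocks


def purify(pages):
    """Drop blocks whose text occurs on every page (assumed noise)."""
    per_page = [page_blocks(p) for p in pages]
    if len(per_page) < 2:
        return per_page  # nothing to compare against
    common = set(per_page[0]).intersection(*map(set, per_page[1:]))
    return [[b for b in blocks if b not in common] for blocks in per_page]


nav = '<div><a href="/">Home</a> | <a href="/c">Contact</a></div>'
page_a = nav + "<div><p>Nokia 6230 <b>price</b>: 1999</p></div>"
page_b = nav + "<div><p>Motorola V3 <b>price</b>: 2999</p></div>"
print(purify([page_a, page_b]))
```

In this toy input the shared navigation bar is removed from both pages and only the product blocks remain. A fuller version would compare blocks by a similarity score rather than exact equality and would also use features inside a single block, as the abstract describes.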

Keywords/Search Tags:

Web information extraction, HTML structure tree, ontology, object-relation-model

Related items

1	Based On The Html Pages Of Web Information Extraction
2	Research Of Web Information Extraction Based On Table Structure
3	Research On The Technology Of The Web Employment Information Extraction Based On The HTML
4	The Research On Web Information Extraction Based On HMM
5	HTML tag tree generator for Web ontology extraction
6	Adaptive Web Information Extraction Method Research Based On Ontology
7	The Literature Information Retrieval And Matching From The Web
8	Whole Structure, Its Representation And Reasoning
9	Ontology-Based Structured Information Extraction From Web Pages
10	Pattern-Based Information Extraction From HTML Documents