
Research on Ontology-Based Customizable Intelligent Web Information Extraction Technology

Posted on: 2007-07-09
Degree: Master
Type: Thesis
Country: China
Candidate: X D Wu
Full Text: PDF
GTID: 2208360182493754
Subject: Computer application technology
Abstract/Summary:
Information on the WWW is enormous, distributed, dynamic, heterogeneous and unstructured. Users lack a suitable way to make use of it, and traditional Internet information retrieval cannot satisfy their needs. This creates demand for web mining technology that obtains detailed, structured information from the Internet. Web mining aims to extract implicit patterns or information of interest to the user from large collections of web documents. However, most existing web mining systems have drawbacks: they can be applied to only a few websites, and they require a lot of professional training. They are therefore not suitable for extracting information from different web sources and varied representations.

In this thesis, we propose an information extraction algorithm that overcomes these drawbacks, and we have implemented it successfully in the UTStarcom mobile phone information service system. The algorithm is based on HTML structure and ontology; it analyzes web page structure and extracts information automatically. It is highly robust and adaptive.

Chapter 1 introduces the significance and background of the research and states the topic of the thesis.

Chapter 2 reviews the history of information extraction and analyzes several representative systems. It also explains the concept of ontology and related work on using ontology in information extraction systems.

Chapter 3 presents the ontology model, ORM, used in our system. We use the object-relation-model to construct the target ontology. By parsing the ORM description, we obtain the target constants, keywords and database schema for later use.

Chapter 4 focuses on eliminating noise from web pages. By simplifying and merging the HTML tag tree, we construct our HTML structure tree. We then use the similarity of noise blocks across different pages, together with additional features within a single block, to purify the page.

Chapter 5 proposes our information extraction algorithm. With the help of several heuristic hypotheses, we use the ontology to extract information from tables and general records and store the results in a database automatically.

Chapter 6 describes the implementation details and the evaluation criteria. A performance test was run on our system and on several existing products; the results indicate that our system has certain advantages over the other products, which validates the contribution of this work to improving system performance.

Chapter 7 summarizes the work and proposes future work.
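
The noise-elimination idea summarized for Chapter 4 can be illustrated with a minimal sketch. It is not the thesis's algorithm: it assumes that "simplifying" the tag tree means keeping only block-level tags and folding inline tags into their parent, and it replaces the similarity measure with exact matching of blocks across pages. The names `BlockCollector`, `extract_blocks`, `remove_shared_noise`, the `BLOCK_TAGS` set and the `min_share` threshold are all hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumption: only these tags are kept as nodes of the simplified structure.
BLOCK_TAGS = {"div", "table", "tr", "td", "ul", "li", "p", "h1", "h2", "h3"}


class BlockCollector(HTMLParser):
    """Collects (block tag path, text) pairs; inline tags are merged into the parent."""

    def __init__(self):
        super().__init__()
        self.path = []    # stack of currently open block-level tags
        self.buffer = []  # text pieces seen since the last block boundary
        self.blocks = []  # collected (path, text) pairs

    def _flush(self):
        # Normalize whitespace and record the text under the current path.
        text = " ".join("".join(self.buffer).split())
        if text and self.path:
            self.blocks.append(("/".join(self.path), text))
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
            self.path.append(tag)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and tag in self.path:
            self._flush()
            # Pop up to the matching tag, tolerating unclosed inner tags.
            while self.path.pop() != tag:
                pass

    def handle_data(self, data):
        self.buffer.append(data)


def extract_blocks(html):
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    return collector.blocks


def remove_shared_noise(pages, min_share=0.5):
    """Drop blocks that occur in more than min_share of the given pages."""
    per_page = [extract_blocks(page) for page in pages]
    counts = Counter()
    for blocks in per_page:
        counts.update(set(blocks))  # count each block once per page
    threshold = min_share * len(pages)
    return [[b for b in blocks if counts[b] <= threshold] for blocks in per_page]
```

Given two pages from the same site that share a navigation list, the shared list items are dropped and only the page-specific block remains on each page. A real system would need a fuzzier similarity test, since noise blocks such as advertisements rarely repeat verbatim.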
Keywords/Search Tags:Web information extraction, HTML structure tree, ontology, object-relation-model