Ontology-Based Structured Information Extraction From Web Pages

Posted on:2008-12-05

Degree:Master

Type:Thesis

Country:China

Candidate:G W Yue

Full Text:PDF

GTID:2178360242956652

Subject:Computer application technology

Abstract/Summary:

With the constant development of the Internet, a large amount of useful data is accumulating on the Web, and extracting and integrating information from the Web has gradually become a research hotspot. Information on Web pages usually appears in the form of HTML, but data represented in HTML lacks strict standards and restrictions as well as a definite structure and schema, which makes it hard for computers to analyze its semantics. It is therefore necessary to extract and integrate information from the Web. One advantage of integrating information in a uniform format is that it facilitates automated processing, inspection and comparison of the data. Information extraction technologies do not attempt to fully understand an entire document; they analyze only the parts of the document that contain relevant information.

Based on the data characteristics of knowledge-intensive Web sites such as electronic commerce sites, we present an ontology-based information extraction model that can extract structured information from Web pages. The main tasks are as follows:

(1) Compare information extraction with information retrieval, introduce the principles, major tasks and evaluation metrics of information extraction, and analyze the characteristics and existing problems of current information extraction systems.

(2) Introduce the basic knowledge of ontology, discuss the theories and methods of ontology-based information extraction, and present a common ontology-based information extraction model. Using ontology technology in information extraction systems can eliminate semantic heterogeneity. Because it is independent of the data model, an ontology can serve as a stable conceptual interface to the data source.

(3) Introduce PAT tree technology and construct PAT trees of the sample pages, from which the data schema of the Web pages can be extracted. A PAT tree is an improved "suffix tree" that is used to store all the possible substrings of the source string. For the phase of extracting the data schema from sample pages, three principles are presented: regularity, compactness and distribution.

(4) Using ontology learning methods and the ontology editor Protégé, construct a simple domain ontology about books. The ontology is then exported to an OWL file, so that the domain ontology is formally described in the ontology representation language OWL.

(5) Present an algorithm for rule generation. The algorithm generates extraction rules with the help of the domain ontology, and the rules direct the specific actions of information extraction. Extraction rules are generated from the domain ontology and in turn supervise the process of domain ontology construction; meanwhile, the domain ontology is used as a guide to standardize extraction rules and exclude invalid ones. Extraction rules and the domain ontology thus learn from and influence each other.

(6) Take the Web pages of the site "China Books" as experimental objects. The rules generated by the above algorithm are applied to extract structured information into a database, which allows the model's performance to be tested and analyzed. The experimental results show that the system performs well on the evaluation metrics of recall and precision.
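
The idea in task (3), finding the repeated structure of a page by indexing the substrings of its tag sequence, can be illustrated with a minimal sketch. It is not the thesis's PAT tree implementation: it swaps in a naive count of every contiguous token run, which yields the same occurrence counts on small inputs but without the tree's efficiency. The token names, the function `repeated_patterns`, and the selection heuristic (most frequent first, then longest) are assumptions made for illustration and are not the regularity, compactness and distribution principles themselves.

```python
from collections import defaultdict


def repeated_patterns(tokens, min_len=2, min_count=2):
    """Count every contiguous token run of at least min_len tokens and
    return the runs that occur at least min_count times.

    A PAT tree indexes the same substrings compactly; this naive version
    enumerates them directly and is only suitable for short sequences.
    """
    counts = defaultdict(int)
    n = len(tokens)
    for start in range(n):                         # every suffix of the sequence
        for end in range(start + min_len, n + 1):  # every prefix of that suffix
            counts[tuple(tokens[start:end])] += 1
    return {p: c for p, c in counts.items() if c >= min_count}


# Hypothetical tag sequence of a page that lists three records;
# text content is abstracted to the placeholder token "TEXT".
record = ["<li>", "<b>", "TEXT", "</b>", "TEXT", "</li>"]
tokens = record * 3

patterns = repeated_patterns(tokens)

# Assumed heuristic: prefer the most frequent pattern, then the longest one.
best_pattern, best_count = max(
    patterns.items(), key=lambda kv: (kv[1], len(kv[0]))
)
print(best_pattern, best_count)
```

On this input the selected pattern is the full record template, which is the kind of candidate a data schema could be derived from. On real pages the sequence would come from an HTML tokenizer, and the selection criteria would need to be considerably more careful.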

Keywords/Search Tags:

Web Information Extraction, PAT Tree, Data Schema, Ontology Learning, Domain Ontology, Extraction Rules

Related items

1	Adaptive Web Information Extraction Method Research Based On Ontology
2	Construction And Implementation Of Domain Ontology Based On Plain Text
3	Query Results Dealing Technology Of Deep Web Based On Ontology And Conceptual Schema
4	Domain Ontology-based Web Information Extraction Technology
5	A Research On Chinese Information Extraction Based On Construction Of Domain Ontology
6	The Application Research Of A Non-Taxonomic Relation Extraction Method Of Ontology
7	Research On Web Information Extraction Based On Domain Knowledge
8	Study On The Methods Of Discipline Terminology Ontology Learning Based On Chinese Unstructured Text From Digital Library Domain
9	Research Of Domain Ontology Concept Extraction Based On Association Rules
10	Research On Domain Ontology Learning Based On Chinese Texts