
Ontology-Based Structured Information Extraction From Web Pages

Posted on: 2008-12-05
Degree: Master
Type: Thesis
Country: China
Candidate: G W Yue
Full Text: PDF
GTID: 2178360242956652
Subject: Computer application technology
Abstract/Summary:
With the continuing development of the Internet, a large amount of useful data is accumulating on the Web, and extracting and integrating information from the Web has gradually become a research hotspot. Information on Web pages usually appears as HTML, but data represented in HTML lacks strict standards and constraints as well as a definite structure and schema, which makes it hard for computers to analyze its semantics. It is therefore necessary to extract and integrate information from the Web. One advantage of integrating information in a uniform format is that it facilitates automated processing, inspection, and comparison of the data. Information extraction technologies do not attempt to fully understand the entire document; they analyze only the parts of the document that contain relevant information.

Based on the data characteristics of knowledge-intensive Web sites such as electronic commerce sites, we present an ontology-based information extraction model that can extract structured information from Web pages. The main tasks are as follows:

(1) Compare information extraction with information retrieval, introduce the principles, major tasks, and evaluation metrics of information extraction, and analyze the characteristics and existing problems of current information extraction systems.

(2) Introduce the basic knowledge about ontology, discuss the theories and methods of ontology-based information extraction, and present a common ontology-based information extraction model. Using ontology technology in information extraction systems can eliminate semantic heterogeneity. Because an ontology is independent of the data model, it can serve as a stable conceptual interface to the data source.

(3) Introduce PAT tree technology and construct PAT trees of the sample pages, from which the data schema of the Web pages can be extracted. A PAT tree is an improved "suffix tree" used to store all possible substrings of the source string. For the phase of extracting the data schema from sample pages, three principles are presented: regularity, compactness, and distribution.

(4) Using ontology learning methods and the ontology editor Protégé, construct a simple domain ontology about books. The ontology is then exported to an OWL file, so that the domain ontology is formally described in the ontology representation language OWL.

(5) Present an algorithm for generating extraction rules. The algorithm generates extraction rules with the help of the domain ontology, and the rules direct the concrete actions of information extraction. Extraction rules are generated from the domain ontology and in turn supervise the construction of the domain ontology; meanwhile, the domain ontology serves as a guide to standardize extraction rules and exclude invalid ones. Extraction rules and the domain ontology thus learn from and influence each other.

(6) Take the Web pages of the site "China Books" as experimental objects. The rules generated by the above algorithm are applied to extract structured information into a database, which allows us to test and analyze the model's performance. The experimental results show that the system performs well on the evaluation metrics of recall and precision.
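
The idea in task (3), storing all substrings of a page's tag sequence so that repeated structures reveal the data schema, can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the thesis: it uses a plain, uncompressed suffix trie rather than a true PAT (Patricia) tree, it reduces a page to a sequence of tag names, and it does not implement the regularity, compactness, and distribution principles. The names `tag_tokens`, `build_suffix_trie`, and `repeated_patterns` are hypothetical.

```python
import re


class TrieNode:
    """Node of an uncompressed suffix trie over tag tokens."""

    def __init__(self):
        self.children = {}
        self.count = 0  # number of suffixes whose path passes through this node


def tag_tokens(html):
    """Reduce an HTML string to its sequence of tag names (assumed tokenization)."""
    return [m.group(1).lower()
            for m in re.finditer(r"<(/?[a-zA-Z][a-zA-Z0-9]*)", html)]


def build_suffix_trie(tokens):
    """Insert every suffix of the token sequence, so every substring has a path."""
    root = TrieNode()
    for start in range(len(tokens)):
        node = root
        for tok in tokens[start:]:
            node = node.children.setdefault(tok, TrieNode())
            node.count += 1
    return root


def repeated_patterns(root, min_len=2, min_count=2):
    """Return (pattern, occurrences) for substrings seen at least min_count times."""
    results = []
    stack = [(root, ())]
    while stack:
        node, path = stack.pop()
        for tok, child in node.children.items():
            if child.count >= min_count:
                new_path = path + (tok,)
                if len(new_path) >= min_len:
                    results.append((new_path, child.count))
                stack.append((child, new_path))
    return results


if __name__ == "__main__":
    page = "<table><tr><td>A</td></tr><tr><td>B</td></tr></table>"
    trie = build_suffix_trie(tag_tokens(page))
    # Report the longest repeated tag patterns first.
    for pattern, occurrences in sorted(repeated_patterns(trie),
                                       key=lambda item: -len(item[0])):
        print(occurrences, " ".join(pattern))
```

In this toy page the longest repeated tag pattern is the table row, which is the kind of candidate record structure a schema-extraction step would then filter further. A real PAT tree compresses single-child paths, so the quadratic memory use of this sketch should not be taken as representative.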
Keywords/Search Tags: Web Information Extraction, PAT Tree, Data Schema, Ontology Learning, Domain Ontology, Extraction Rules