Adaptive Web Information Extraction Method Research Based On Ontology

Posted on: 2013-01-12
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C X Li
GTID: 1228330377951695
Subject: Pattern Recognition and Intelligent Systems

Abstract/Summary:
The rapid development of the Internet has produced a huge amount of web information. The retrieval and processing of this information are considerably limited by the diversity and heterogeneity of web pages. Information extraction converts web pages into structured data for applications such as vertical search engines and data mining. At the same time, realizing the aim of the Semantic Web requires converting web documents into a "Web of data". OBIE (ontology-based information extraction) can automatically generate semantic annotations for documents and make web information machine-understandable, that is, semantically enabled data.

In the agriculture domain, a vast amount of agricultural information has accumulated on the Internet, including supply-demand information, price information, agricultural technology, market dynamics, agricultural news, agribusiness, and agricultural video. However, it is difficult to take full advantage of these resources because they lack a consistent semantic representation. Because of the limited knowledge level of agricultural users, it is especially hard for them to retrieve the information they need by themselves. A vertical search engine for agriculture can overcome this problem: it integrates heterogeneous, distributed data sources to meet users' demands and breaks through the information bottleneck that "three rural" users face when confronted with massive agricultural web resources. This dissertation presents an adaptive web information extraction model based on ontology, provides enriched structured data for a vertical search engine for agriculture (http://www.sounong.net) as well as for structured-data mining applications, and ultimately serves the construction of national agricultural informatization. The main contents of the dissertation are summarized as follows.

1. For web page data characterized by openness, heterogeneity, and evolvability, an adaptive ontology-based model is constructed for web information extraction. The model is implemented with a modular structure: it separates the algorithms from the domain ontology knowledge, reuses function modules, facilitates dynamic updates of system functions, and reduces the cost of porting the system to other domains.

2. Because building and maintaining an ontology manually by experts is difficult, and in order to use web resources effectively, an unsupervised ontology learning method based on pattern matching is proposed. It queries the web to retrieve related resources and analyzes them with syntactic parsing. This method relieves the scalability limitation of corpora and updates the ontology automatically to keep up with the evolution of web resources. Based on the proposed method, a relation ontology for protein-protein interaction is constructed and evaluated.

3. To meet the requirements of the Semantic Web and Linked Data, namely annotating the metadata of web pages and mining the relationships among various types of data, an ontology-based relation extraction method is presented. The method extracts relations by analyzing the syntactic structure of sentences and the interaction relation words. It is validated on public biological literature, and the results show that it yields significant results. The proposed algorithm traverses each sentence in a single pass, which makes relation extraction on web-scale documents computationally efficient.

4. AJAX has been widely adopted, and traditional web crawlers cannot retrieve, analyze, or process AJAX data. Therefore, a model that extracts multi-record AJAX data based on domain ontology is proposed to explore, extract, and annotate dynamic AJAX data. Experiments on supply-demand entities and price entities for agricultural products validate the effectiveness of the method.

5. For extracting data from single-record web pages, wrapper-based or rule-based methods cannot adapt to variations in web page structure. Therefore, an information extraction and annotation model based on entity attribute classification is proposed for single-record web pages. The model analyzes the page structure characteristics of the information content and constructs attribute classifiers to extract and annotate entity attributes adaptively.

6. A platform is implemented based on ontology and the proposed information extraction methods. The platform consists of two components: the adaptive web information extraction system based on agricultural ontology, which is applied to the vertical search engine for agriculture and to related data mining applications in the agriculture domain; and a prototype system for named entity relation extraction, which is the basis for relation extraction applications in the agriculture domain.
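
As a rough illustration of the single-pass idea in item 3, the sketch below scans a sentence once and emits (entity, interaction word, entity) triples. It is not the dissertation's algorithm: the abstract describes syntactic analysis, whereas this sketch assumes plain whitespace tokens and two small hand-written lexicons. The names ENTITIES, INTERACTION_WORDS, and extract_relations are hypothetical stand-ins for knowledge that would come from the domain ontology.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical lexicons standing in for ontology-derived knowledge (assumption).
ENTITIES: Set[str] = {"RAD51", "BRCA2", "TP53", "MDM2"}
INTERACTION_WORDS: Set[str] = {"binds", "inhibits", "activates", "phosphorylates"}


def extract_relations(sentence: str) -> List[Tuple[str, str, str]]:
    """Scan the tokens once and emit (entity, interaction word, entity) triples."""
    triples: List[Tuple[str, str, str]] = []
    left_entity: Optional[str] = None   # most recent entity seen so far
    pending_word: Optional[str] = None  # interaction word seen after left_entity
    for raw in sentence.split():
        token = raw.strip(".,;:()")  # drop surrounding punctuation
        if token in ENTITIES:
            # An entity closes a pending "entity + interaction word" pattern.
            if left_entity is not None and pending_word is not None:
                triples.append((left_entity, pending_word, token))
            left_entity, pending_word = token, None
        elif token.lower() in INTERACTION_WORDS and left_entity is not None:
            pending_word = token.lower()
    return triples


relations = extract_relations("BRCA2 binds RAD51 and MDM2 inhibits TP53.")
```

Each token is visited exactly once, so the cost grows linearly with sentence length. A fuller version would replace the lexicon lookups with ontology queries and use a syntactic parse rather than token adjacency.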
Keywords/Search Tags: information retrieval, information extraction, ontology learning, adaptive information extraction, domain resources discovery, relation extraction, ontology-based information extraction