Research On Web Information Extraction Based On Domain Knowledge

Posted on: 2017-05-31
Degree: Master
Type: Thesis
Country: China
Candidate: W Yu
Full Text: PDF
GTID: 2308330485964241
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, the Web has become a major source of global technology information and holds a huge and growing amount of data. However, the dynamism of Web pages, the diversity of their content, and the complexity of their structure make it very difficult for people to obtain the desired information from the Web. Web information extraction technology, which accurately extracts unstructured or semi-structured information from Web pages and presents it in a structured form, provides an effective way for people to obtain useful information. However, traditional Web information extraction technology cannot formally describe the semantics of the information in pages, which leaves the semantics of the extracted results ambiguous and the accuracy poor.
An ontology can formally describe the knowledge of a specific domain and thereby improve the performance of Web information extraction. Therefore, Web information extraction technology based on domain ontology has been proposed. It combines domain ontology with information extraction technology: describing the information of a particular field with ontology knowledge lets Web information extraction achieve markedly better results in that field.
This paper discusses the theories of Web information extraction, the Semantic Web, and ontology, and analyzes and summarizes the main ways of building a domain ontology. It chooses an ontology construction method based on knowledge engineering and illustrates the methods and rules of building a domain ontology using Sina Weibo as an example.
On this basis, this paper studies Web information extraction based on domain knowledge in the contexts of Sina Weibo and the Deep Web respectively. The main work is as follows:
(1) This paper analyzes seven main methods of building a domain ontology and compares the advantages, disadvantages, and applications of each. For the scenarios in this paper, it selects the construction method based on knowledge engineering. Using Sina Weibo as an example, it describes in detail the common basic principles for collecting the concepts, relationships, and class attributes of the domain ontology, elaborates on the construction tools, construction rules, and construction process, and tests the formal description and consistency of the ontology.
(2) Traditional ontology-based Web information extraction uses a single information item as the smallest unit, which leads to problems such as weak semantic relationships among the extracted information and poor extraction accuracy. A Weibo user information extraction method based on ontology is therefore proposed. It uses a two-level matching method that divides pages into semantic information at different levels and extracts information using the information block as the minimum extraction unit. The experimental results show that, compared with the traditional information extraction method, the proposed method effectively improves both the accuracy and the recall of information extraction.
(3) It determines entity regions in Deep Web pages.
First, it uses a breadth-first algorithm to remove the noise information contained in the page DOM tree. Then, it locates the page data area more precisely according to the DOM tree node similarity principle. Finally, it uses the vector space model (VSM) to compute the cosine similarity that determines the entity region.
(4) Most Deep Web information extraction methods depend on the structure of Web pages and ignore the semantic meanings and relations the pages contain, so their extraction results are poor. This paper therefore presents a post-processing method for Deep Web entity information extraction based on domain ontology. It semantically annotates the entity under the guidance of the domain ontology and then adds the quantified annotation results to the computation of entity-ontology similarity. The improved algorithm then obtains the sub-tree with the maximum entity-ontology similarity. Experiments on extracting entity information from weather, book, and shopping sites show that the algorithm achieves a higher F value than existing algorithms.
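The entity-region steps described above (breadth-first noise removal, then VSM cosine scoring of candidate sub-trees) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the dict-based node format, the `noise_tags` set, the plain term-frequency weighting, and the function names `prune_noise_bfs`, `term_vector`, `cosine_similarity`, `subtree_text`, and `best_region` are all assumptions made for illustration, and the semantic annotation step and the node similarity principle are omitted.

```python
import math
import re
from collections import Counter, deque

# Hypothetical node format (an assumption, not from the thesis):
# {"tag": "div", "text": "...", "children": [...]}


def prune_noise_bfs(root, noise_tags=frozenset({"script", "style", "nav", "footer"})):
    """Walk the tree breadth-first and drop children whose tag is a noise tag."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        node["children"] = [
            child for child in node.get("children", [])
            if child["tag"] not in noise_tags
        ]
        queue.extend(node["children"])
    return root


def term_vector(text):
    """Build a raw term-frequency vector over lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def subtree_text(node):
    """Concatenate the text of a node and all of its descendants."""
    parts = [node.get("text", "")]
    parts.extend(subtree_text(child) for child in node.get("children", []))
    return " ".join(part for part in parts if part)


def best_region(root, ontology_terms):
    """Return the sub-tree whose text is most similar to the ontology terms."""
    target = term_vector(" ".join(ontology_terms))
    best_node, best_score = None, -1.0
    queue = deque([root])
    while queue:
        node = queue.popleft()
        score = cosine_similarity(term_vector(subtree_text(node)), target)
        if score > best_score:
            best_node, best_score = node, score
        queue.extend(node.get("children", []))
    return best_node, best_score
```

Calling `prune_noise_bfs` on a page tree and then `best_region(page, ["title", "author", "price"])` returns the node whose text overlaps most with the supplied terms. Raw term frequency is used here only to keep the sketch short; a weighted scheme such as TF-IDF could be substituted without changing the structure.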
Keywords/Search Tags: Domain Ontology, MicroBlog, Deep Web, Ontology Building, Determining Entity Region, Web Information Extraction