
Study Of The Semantic Data Extraction Method Based On Structural Analysis Of The Webpage

Posted on: 2016-02-04
Degree: Master
Type: Thesis
Country: China
Candidate: S Y Wang
GTID: 2348330488974107
Subject: Software engineering
Abstract/Summary:
As the information on the Internet grows explosively, it becomes increasingly difficult for users to obtain domain-specific knowledge from the mass of data. The development of knowledge base technology makes querying based on domain knowledge possible, and building a better knowledge base requires making full use of the information on the Web, so Web semantic data extraction is currently an active research topic. A variety of Web semantic data extraction techniques have emerged in recent years. These methods solve some of the problems in Web semantic data extraction, but shortcomings remain. Extraction methods based on domain ontology, which use an ontology to describe domain knowledge, achieve good results, but constructing a domain ontology requires domain experts to compile the ontology knowledge base for that domain. Extraction methods based on the DOM tree can extract semantic data using the DOM tree, but the rules they generate are not universal. To address these problems, and based on an analysis of a large number of pages, this thesis presents a semantic data extraction method based on structural analysis of the page. The method uses the Web page structure together with a standard domain ontology to extract semantic data. It takes full advantage of the accuracy and comprehensiveness with which a domain ontology describes knowledge, mines the structural characteristics of the DOM tree, and improves the accuracy and recall of semantic data extraction.
The specific research results are as follows:
1) The thesis analyzes Semantic Web and knowledge base technology, surveys and summarizes typical Web semantic information extraction techniques, proposes a semantic data extraction method based on Web page structure analysis, and divides semantic data extraction into two phases. The first phase constructs a core ontology for a specific website based on the standard domain ontology and the HTML structure. The second phase expands the core ontology based on page structure analysis.
2) Combining the standard domain ontology with the page structure, a method is designed for constructing the core ontology for a specific website using the domain ontology and the Web page structure information. Because mapping a standard domain ontology onto Web page content is not flexible enough, a method is proposed that uses the TF-IDF algorithm to expand the domain ontology. To compute the similarity of two words, the thesis uses the Word2vec method, which computes semantic similarity from a corpus. To map the concepts in the ontology accurately onto the content of page nodes, a semantic similarity algorithm between ontology concepts and node text is designed. Because Web pages are often not well-formed, existing open-source tools are used to correct errors in the HTML before obtaining the page's DOM tree.
3) Based on the existing information, a core ontology expansion algorithm based on page structure analysis is designed. To mine the structural information in the page, a definition of a DOM subtree is put forward, an algorithm is designed that partitions pages into data regions using the site's structural information in DOM tree form, an algorithm is proposed for computing the structural similarity of two DOM subtrees, and the overall procedure of the core ontology expansion algorithm based on page structure analysis is given.
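
The abstract does not spell out how the structural similarity of two DOM subtrees is computed. The following Python sketch illustrates one plausible measure only, and it is an assumption rather than the thesis's algorithm: each subtree is reduced to the multiset of its root-to-node tag paths, and two subtrees are compared with a multiset Jaccard ratio. The names `TagPathCollector`, `tag_paths`, and `structure_similarity` are hypothetical, and the standard-library `html.parser` stands in for whatever open-source HTML repair tool the thesis used.

```python
from collections import Counter
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the root-to-node tag path of every element in a fragment."""

    def __init__(self):
        super().__init__()
        self.stack = []         # tags that are currently open
        self.paths = Counter()  # tag path -> number of occurrences

    def handle_starttag(self, tag, attrs):
        self.paths["/".join(self.stack + [tag])] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the innermost matching open tag; ignore stray closers.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html_fragment):
    """Return the multiset of tag paths found in an HTML fragment."""
    collector = TagPathCollector()
    collector.feed(html_fragment)
    collector.close()
    return collector.paths


def structure_similarity(fragment_a, fragment_b):
    """Multiset Jaccard similarity of tag paths, a value in [0, 1].

    Assumed measure for illustration; text and attributes are ignored.
    """
    a, b = tag_paths(fragment_a), tag_paths(fragment_b)
    union = sum((a | b).values())      # element-wise maximum of counts
    if union == 0:
        return 1.0                     # two fragments without any tags
    return sum((a & b).values()) / union  # element-wise minimum of counts


# Two list rows with the same layout, and an unrelated navigation block.
row_1 = "<li><a href='/a'>Title A</a><span>2015</span></li>"
row_2 = "<li><a href='/b'>Title B</a><span>2016</span></li>"
nav = "<div><ul><li>Home</li></ul></div>"

print(structure_similarity(row_1, row_2))
print(structure_similarity(row_1, nav))
```

Under this assumed measure, subtrees that repeat the same layout (such as rows of a record list) score close to 1 and can be grouped into one data region, while structurally unrelated blocks score close to 0. A tree edit distance would be a stricter alternative, at higher computational cost.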
Keywords/Search Tags: Knowledge base, Semantic Web technology, Domain ontology, Word vector