
Study On Data Extraction And Semantic Annotation For Specific Field Deep Web

Posted on: 2012-11-02
Degree: Master
Type: Thesis
Country: China
Candidate: Z Yang
GTID: 2218330368492245
Subject: Computer application technology
Abstract/Summary:
With the rapid development of Internet technology, information resources hidden in web databases have received extensive attention because of their large volume and structural integrity. These resources are presented to users as HTML pages after a search query is submitted through a web query page, and researchers usually call them the Deep Web. To make full use of Deep Web resources, the semi-structured and unstructured data on these pages must be extracted through a variety of technical means. In addition, to give the extracted data higher value, semantic annotations must be added so that machines can understand the data.

This thesis studies information extraction and data annotation for the Deep Web in a specific field. It first introduces node type information into the extraction of data records and then performs ontology-based semantic annotation. Finally, a prototype system is designed, drawing on the author's project experience. The main research work includes:

1) A brief presentation of the development history, evaluation criteria, and related technologies of Web information extraction, together with an in-depth analysis of existing information extraction methods.
2) A page purification method that combines the characteristics of Deep Web result pages with the visual and content features of the page layout. The method applies a tag filter, a visual feature filter, and a content rule filter. Experiments show that the approach effectively improves the efficiency and precision of subsequent data extraction.
3) A new data record extraction method based on node type. First, HTML nodes are divided into four types: block, style, text, and image. Second, each type is assigned a weight. Third, an entropy value is calculated for each property node in the data records of the result page according to its node type. Finally, the entropy values are used to determine which nodes represent the corresponding data records, and the nodes in the data records are extracted. Compared with other methods, this method has higher efficiency.
4) Treating the domain ontology as the global schema followed by web databases, and mapping from ontology to schema using methods such as kernel density estimation and K-L divergence. The experiment shows that the approach has certain advantages.
5) The design of an information integration platform for the biomedical field based on the above work.
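The abstract names K-L divergence as one ingredient of the ontology-to-schema mapping in item 4 but does not describe the procedure. The following is a minimal sketch of one way such a comparison could work, not the thesis's actual method. It assumes that each ontology attribute comes with a few sample values and that an extracted column is assigned to the attribute whose value distribution is closest in K-L divergence. The length-bucket feature, the function names, and the sample data are all hypothetical, and an additively smoothed histogram stands in for the kernel density step.

```python
import math
from collections import Counter

def smoothed_distribution(values, support, alpha=1.0):
    """Additively smoothed relative frequencies of `values` over `support`."""
    counts = Counter(values)
    total = len(values) + alpha * len(support)
    return {s: (counts[s] + alpha) / total for s in support}

def kl_divergence(p, q):
    """D_KL(P || Q) in nats; p and q share keys and q is strictly positive."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p if p[k] > 0)

def length_bucket(text):
    """Hypothetical feature: bucket a value by character length (0..4)."""
    return min(len(text) // 5, 4)

def best_attribute(column_values, ontology_samples):
    """Pick the ontology attribute with the smallest divergence from the column."""
    support = range(5)
    p = smoothed_distribution([length_bucket(v) for v in column_values], support)
    scores = {}
    for attribute, samples in ontology_samples.items():
        q = smoothed_distribution([length_bucket(v) for v in samples], support)
        scores[attribute] = kl_divergence(p, q)
    return min(scores, key=scores.get), scores

# Toy, made-up sample values for two hypothetical ontology attributes.
ontology_samples = {
    "gene_symbol": ["TP53", "BRCA1", "EGFR", "MYC"],
    "description": ["cellular tumor antigen p53",
                    "epidermal growth factor receptor"],
}
column = ["KRAS", "PTEN", "AKT1"]  # values extracted from one result-page column

best, scores = best_attribute(column, ontology_samples)
print(best, scores)  # best is "gene_symbol" for this toy data
```

Smoothing keeps every probability in Q strictly positive, so the divergence stays finite even when a bucket is empty in the samples. A real system would likely combine several features rather than value length alone.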
Keywords/Search Tags: Deep Web, Information Integration, Page Purification, Information Extraction, Semantic Annotation