
Research On Key Techniques Of Entity Search For Deep Web

Posted on: 2010-10-28   Degree: Doctor   Type: Dissertation
Country: China   Candidate: Y Kou   Full Text: PDF
GTID: 1228330371450192   Subject: Computer software and theory
Abstract/Summary:
The immense scale of the Web has made it the most important information resource for enterprises and users. As the number of Web databases grows, accessing the Deep Web is becoming the main way to acquire information. However, the data supplied by the Deep Web is unavailable to traditional search engines, and there is currently no search engine suited to the Deep Web. The large-scale unstructured content, heterogeneous results and dynamic data in the Deep Web pose new challenges for Deep Web search, so it is important to solve the problem of finding valuable data in Deep Web pages. Our goal is to find, among large-scale Deep Web pages, the target entities that best meet users' needs. This dissertation studies approaches to relationship knowledge construction, entity extraction, entity ranking and entity deduplication. Our work covers the following major aspects:

(1) A Deep Web entity search mechanism called DWESM is presented. By analyzing the technical characteristics of traditional page-level search and vertical search, a hierarchical model of DWESM is presented, comprising components for relationship knowledge construction, entity extraction, entity ranking and entity deduplication. Inheriting the basic idea of vertical search, DWESM treats the entities contained in Web pages as the operational units, which makes the search more specialized and thorough.

(2) A relationship knowledge construction model based on semantics and statistical analysis, called SS-KCM, is presented. It includes a text matching model, a semantic analysis model and a group statistics model. A three-phase gradual refinement strategy is adopted, consisting of initial text matching, semantic relationship abstraction and group statistics analysis. The relationships among entities are identified based on text characteristics, semantic information and constraints. A self-adaptive knowledge maintenance strategy makes the content of the entity relationship knowledge database more complete and effective. The experiments demonstrate the feasibility and effectiveness of the key techniques of SS-KCM.

(3) A DOM-tree based Deep Web entity extraction model called D-EEM is presented. D-EEM performs a DOM-tree based automatic entity extraction strategy that determines first the data regions and then the entity regions. It improves extraction accuracy by considering both the textual content and the hierarchical structure of DOM-trees. In addition, a semantic annotation method based on Web context and co-occurrence is proposed to support data integration. The experiments show that D-EEM is superior in the accuracy and efficiency of extraction.

(4) An entity-level ranking model for Deep Web queries, called LG-ERM, is presented, based on local and global scoring. Additional rank-influencing factors are considered and quantified, including the characteristics of entities, the importance of Web sources and the relationships among entities. By combining local and global scoring in ranking, the query results can meet users' needs more accurately and effectively. The experiments demonstrate the feasibility and effectiveness of the key techniques of LG-ERM.

(5) An entity deduplication model based on multiple similarity calculators is presented. A series of similarity calculators is defined to suit the different characteristics of different attribute types. Strategies for similarity calculation and for processing uncertain records are also proposed. The experiments show that our approach is superior in the accuracy and efficiency of duplicate identification.

(6) We design and implement a prototype system of DWESM, which applies the theories and approaches to relationship knowledge construction, entity extraction, entity ranking and entity deduplication proposed in this dissertation. The system demonstrates the validity and efficiency of these theories and approaches.

In summary, this dissertation studies fundamental problems in relationship knowledge construction, entity extraction, entity ranking and entity deduplication, and presents an entity search mechanism for the Deep Web that can effectively solve the problems of result extraction, ranking, deduplication and consolidation. Extensive theoretical analysis and experiments show that these approaches are efficient and effective. We hope that these approaches and techniques can contribute to the development of Deep Web search systems.
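To make the idea in item (5) concrete, the following is a minimal illustrative sketch of deduplication with one similarity calculator per attribute type and a separate band for uncertain record pairs. It is not the dissertation's model: the attribute names (`title`, `price`, `year`), the three calculators, the weights and the two thresholds are all assumptions chosen for illustration, since the abstract does not specify them.

```python
from difflib import SequenceMatcher


def text_similarity(a: str, b: str) -> float:
    """Text attributes: character-level match ratio on case-folded strings."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def numeric_similarity(a: float, b: float) -> float:
    """Numeric attributes: 1 minus the relative difference, floored at 0."""
    if a == b:
        return 1.0
    # a != b guarantees the denominator is non-zero.
    return max(0.0, 1.0 - abs(a - b) / max(abs(a), abs(b)))


def exact_similarity(a, b) -> float:
    """Categorical attributes: 1 if equal, otherwise 0."""
    return 1.0 if a == b else 0.0


# Hypothetical schema (assumption): attribute -> (calculator, weight).
CALCULATORS = {
    "title": (text_similarity, 0.6),
    "price": (numeric_similarity, 0.2),
    "year": (exact_similarity, 0.2),
}


def record_similarity(r1: dict, r2: dict) -> float:
    """Weighted average over the attributes present in both records."""
    total = 0.0
    weight_sum = 0.0
    for attr, (calc, weight) in CALCULATORS.items():
        if attr in r1 and attr in r2:
            total += weight * calc(r1[attr], r2[attr])
            weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


def classify(r1: dict, r2: dict, dup: float = 0.85, non_dup: float = 0.5) -> str:
    """Label a pair; scores between the two thresholds are left as uncertain."""
    score = record_similarity(r1, r2)
    if score >= dup:
        return "duplicate"
    if score < non_dup:
        return "distinct"
    return "uncertain"
```

In a sketch like this, pairs labelled `uncertain` would be handed to a separate step (for example, manual review or a stricter comparison) rather than being merged or discarded automatically.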
Keywords/Search Tags: Deep Web, entity search, relationship knowledge, entity extraction, entity ranking, entity deduplication