Deep Web Entity Identification Method Used In The Field Of Online Books

Posted on: 2011-03-31  Degree: Master  Type: Thesis
Country: China  Candidate: Y H Li  Full Text: PDF
GTID: 2178360308454331  Subject: Computer software and theory
Abstract/Summary:
Building deep web integration systems is a current research focus. However, different sites describe the same entity in different forms, which produces a large amount of redundant information and inconveniences users who query the data. Entity identification is a vital step in a deep web integration system: its purpose is to eliminate duplicate results and thereby reduce data redundancy. This thesis studies entity identification for deep web data integration in depth.

For deep web entity identification in the Chinese book domain, a careful analysis of how each site describes book information shows that, across online bookstores, descriptions of the same book often differ, while descriptions of different books may be very similar. In light of this, the thesis proposes a deep web entity identification method based on an improved Jaccard coefficient and a domain ontology. Applying the Jaccard coefficient directly to the text attributes of a book cannot resolve the case in which the values of one attribute are very similar while all other attributes are exactly the same. The thesis therefore makes two improvements to the Jaccard coefficient for entity identification: first, single words produced by word segmentation of the text are given an increased weight coefficient; second, containment relationships between matched strings are handled through a coefficient m. The improved Jaccard coefficient is used to compute the similarity of text attributes, and it identifies book entities well. The thesis also uses the thesaurus of a domain ontology to match book attributes, which addresses author attributes that also appear in English and publisher attributes that appear in abbreviated form.
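To make the limitation of the unmodified measure concrete, the following is a minimal sketch of the plain Jaccard coefficient only. The abstract does not give the thesis's weight coefficient or the definition of the coefficient m, so neither improvement is reproduced here. The function name `jaccard` and the two English token lists are hypothetical stand-ins for segmented Chinese book titles.

```python
def jaccard(tokens_a, tokens_b):
    """Plain Jaccard coefficient |A ∩ B| / |A ∪ B| over two token collections."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        # Convention chosen for this sketch: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical, already segmented titles of two different volumes of one series.
title_1 = ["data", "structures", "and", "algorithms", "volume", "1"]
title_2 = ["data", "structures", "and", "algorithms", "volume", "2"]

# Five shared tokens out of seven distinct tokens, so the score is 5/7.
score = jaccard(title_1, title_2)
print(round(score, 3))
```

The two titles refer to different books, yet they share most of their tokens and score high. If the author and publisher attributes are also identical, a plain Jaccard comparison gives little basis for telling the two entities apart, which is the kind of case the improvements described above are meant to handle.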
Because each attribute has a different importance during identification, the thesis obtains the weight of each attribute with the Analytic Hierarchy Process (AHP), and then computes the weighted sum of the attribute similarities as the entity similarity, which is used to merge duplicate entities. Experimental results show that the method achieves high accuracy in the Chinese book domain of the deep web.
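The weighting step can be sketched as follows, under stated assumptions. The abstract gives neither the attribute set, the pairwise comparison matrix, nor the decision threshold, so the three attributes (title, author, publisher), the matrix values, the per-attribute similarities, and the threshold below are all hypothetical. The weights are derived with the row geometric mean method, a common approximation of the AHP priority vector; the thesis may have used a different derivation.

```python
import math


def ahp_weights(pairwise):
    """Approximate AHP priority weights using the row geometric mean method."""
    n = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def entity_similarity(attr_sims, weights):
    """Weighted sum of per-attribute similarities."""
    return sum(w * s for w, s in zip(weights, attr_sims))


# Hypothetical reciprocal comparison matrix for (title, author, publisher):
# entry [i][j] states how much more important attribute i is than attribute j.
pairwise = [
    [1.0, 3.0, 5.0],
    [1 / 3, 1.0, 3.0],
    [1 / 5, 1 / 3, 1.0],
]
weights = ahp_weights(pairwise)

# Hypothetical similarities between two book records, one value per attribute.
attr_sims = [0.9, 1.0, 0.5]
score = entity_similarity(attr_sims, weights)

# Hypothetical threshold: records scoring at or above it are treated as duplicates.
is_duplicate = score >= 0.8
print(weights, score, is_duplicate)
```

Because the weights are normalized to sum to 1 and each attribute similarity lies in [0, 1], the entity similarity also stays in [0, 1], which makes a single threshold meaningful across record pairs.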
Keywords/Search Tags:Deep Web, Entity identification, Jaccard coefficients, Domain ontology, AHP