
Research On Deep Web Query Relaxation And Entity Identification

Posted on: 2013-10-08
Degree: Master
Type: Thesis
Country: China
Candidate: X Yu
GTID: 2248330395952410
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, the amount of information in the Deep Web is growing explosively, and more and more scholars are turning their attention to Deep Web research. Querying the Deep Web has become the main way for users to obtain professional information. However, failures cannot be avoided during retrieval: a query may return no results or only a few. Research on Deep Web query relaxation is therefore extremely important. The Deep Web contains a large number of data sources covering a very wide range of domains, and even within a single domain multiple data sources provide results, which leads to massive redundant data. Because the results come from different data sources, descriptions of the same entity differ. Users do not want duplicated results. The goal of entity identification is to recognize records that refer to the same entity and remove the duplicates from the results, thereby reducing data redundancy and improving user satisfaction.

Based on an analysis of existing Deep Web research, and to address the problem that existing query relaxation methods may return a large number of irrelevant results, this paper proposes a solution for flexible query relaxation in the Deep Web. First, the experience value base is searched: if an experience value exists in the base, the query is relaxed by that value; otherwise, the query is relaxed by attribute, the results are filtered, and the resulting experience value is extracted into the base. Second, a relationship graph of the data sources is constructed and the importance of each attribute is obtained. Building this graph during query relaxation avoids passive relaxation. The graph is then used to relax attribute values flexibly, starting from the least important attribute. The entities in the returned result set are sorted by their similarity to the user's query requirement, and the entities most similar to that requirement are returned.

In the book domain, a book's description normally contains special symbols and publisher abbreviations, which current entity identification methods cannot handle well. After extensive study of related methods, this paper proposes a Deep Web entity identification method based on common substrings to handle special symbols, and builds a synonym database to preprocess abbreviated attribute values. First, the attribute weights, the similarity threshold, and the dissimilarity threshold are obtained by iterative training on a training set. Next, every attribute value is preprocessed and the similarity of the preprocessed attribute values is calculated. Last, records describing the same entity are identified through the weighted sum of the attribute similarities.

Finally, experiments verify that flexible query relaxation based on experience values achieves higher user satisfaction and that entity identification based on common substrings achieves higher accuracy.
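
The entity identification steps summarized above (preprocessing, per-attribute similarity, and a weighted sum compared against two thresholds) might be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the method as implemented in the work: the synonym table, the similarity formula (twice the longest common substring length divided by the combined string length), the attribute names, the weights, and both threshold values are all hypothetical, whereas the abstract states that weights and thresholds are obtained by iterative training.

```python
import re

# Hypothetical synonym table for abbreviated attribute values (an assumption).
SYNONYMS = {"oup": "oxford university press"}


def preprocess(value):
    """Lowercase, replace special symbols with spaces, expand known abbreviations."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", value.lower())
    cleaned = " ".join(cleaned.split())  # collapse repeated whitespace
    return SYNONYMS.get(cleaned, cleaned)


def longest_common_substring(a, b):
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def attribute_similarity(a, b):
    """Assumed similarity in [0, 1]: 2 * LCS length / combined length."""
    a, b = preprocess(a), preprocess(b)
    if not a or not b:
        return 0.0
    return 2 * longest_common_substring(a, b) / (len(a) + len(b))


def classify(rec1, rec2, weights, sim_threshold, dissim_threshold):
    """Compare the weighted sum of attribute similarities against two thresholds."""
    score = sum(
        w * attribute_similarity(rec1[attr], rec2[attr])
        for attr, w in weights.items()
    )
    if score >= sim_threshold:
        return "same"
    if score <= dissim_threshold:
        return "different"
    return "undecided"


# Hypothetical weights and thresholds; the abstract obtains these by training.
WEIGHTS = {"title": 0.7, "publisher": 0.3}
book_a = {"title": "Deep Web: Query Relaxation", "publisher": "OUP"}
book_b = {"title": "deep web query relaxation", "publisher": "Oxford University Press"}
print(classify(book_a, book_b, WEIGHTS, sim_threshold=0.8, dissim_threshold=0.3))
```

Keeping two thresholds rather than one leaves a middle band of record pairs that are neither clearly the same nor clearly different; how such pairs are resolved is not stated in the abstract, so the sketch simply reports them as undecided.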
Keywords/Search Tags:Deep Web, query relaxation, entity identification, flexible relaxation, Web entity