
Research On Methods Of Entity Resolution In Dataspaces

Posted on: 2020-05-25
Degree: Master
Type: Thesis
Country: China
Candidate: C Su
GTID: 2428330575461954
Subject: Computer Science and Technology
Abstract/Summary:
Entity resolution is widely used in database management, data integration, and information retrieval to identify different records that refer to the same entity. Traditional entity resolution methods mainly target data sets for which semantic mappings are available. In the information age, however, entity information is distributed across many data sources, and the semantics of these sources are difficult to unify. Information that users collect may contain duplicates; storing it directly in the dataspace wastes storage, and processing it wastes time and hardware resources. Entity resolution deduplicates the data and thereby serves as a data cleaning step.

To suit multi-source heterogeneous data environments such as dataspaces, this paper proposes a record partitioning (blocking) method based on a record graph. Data preprocessing places records that may match into the same block, and exact record matching is carried out only within a block, which improves computational efficiency. The record graph is constructed by computing a weighted sum of the label similarity and the relational similarity between records: each data record is a node, and the similarity between two records is the weight of the edge that connects them. Many record pairs have low similarity and are unlikely to match. A pruning method suited to the characteristics and application requirements of the data set is therefore selected to trim the record graph, which reduces redundant comparisons in entity resolution, and the pruned graph is divided into blocks. The blocking method is evaluated on a real data set, and the experimental results are analyzed.

Within a block, differently named attributes often carry the same attribute values, so the attribute values are used to map the attributes of the records in the block to one another, yielding attribute mapping clusters. Because assigning higher weights to high-quality attributes helps improve the accuracy of entity resolution, the weight of each attribute mapping set is assigned according to its computed goodness; the weighted sum of the similarities of the mapped attributes is then calculated and compared with a preset threshold. When computing the attribute similarity of a record pair, the edit distance between attribute values is calculated using an expression-based method, and the information of matched record pairs is integrated. The principle of integration is to merge the information the two records have in common and retain the information specific to each, so that the most comprehensive entity content is returned to users. The method is evaluated on a real data set, and the analysis of the experimental results indicates that it adapts well to the dataspace environment.
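The record-graph blocking step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the similarity measures, the weighting, or the pruning rule, so everything concrete here is an assumption made for illustration: Jaccard similarity for both label and relational similarity, a mixing weight `alpha`, a single global edge-weight threshold for pruning, and connected components of the pruned graph as blocks. The function names and the record layout (`labels`, `related`) are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed measure, not from the thesis)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_record_graph(records, alpha=0.5):
    """Return {(u, v): weight} with weight = alpha*label_sim + (1-alpha)*rel_sim."""
    edges = {}
    for u, v in combinations(sorted(records), 2):
        label_sim = jaccard(records[u]["labels"], records[v]["labels"])
        rel_sim = jaccard(records[u]["related"], records[v]["related"])
        weight = alpha * label_sim + (1 - alpha) * rel_sim
        if weight > 0:
            edges[(u, v)] = weight
    return edges


def prune(edges, threshold):
    """Drop low-weight edges (one possible pruning rule: a global threshold)."""
    return {pair: w for pair, w in edges.items() if w >= threshold}


def blocks(nodes, edges):
    """Group nodes into blocks = connected components, via union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return sorted(groups.values(), key=sorted)


# Hypothetical toy records: label tokens plus ids of related entities.
records = {
    "r1": {"labels": {"john", "smith", "beijing"}, "related": {"acme"}},
    "r2": {"labels": {"john", "smith"}, "related": {"acme"}},
    "r3": {"labels": {"mary", "jones"}, "related": {"globex"}},
}
graph = build_record_graph(records, alpha=0.5)
result = blocks(records, prune(graph, threshold=0.5))
```

On this toy input only the edge between `r1` and `r2` survives pruning, so those two records share a block and `r3` is left alone; pairwise matching would then run only inside each block.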
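The in-block matching and merging steps can likewise be sketched under stated assumptions. The abstract says only that mapped attributes are compared by edit distance, combined as a weighted sum, checked against a preset threshold, and then merged. The normalization of edit distance to a similarity in [0, 1], the weight values, the threshold, and the "keep the longer value" merge rule below are illustrative choices, and all function and attribute names are hypothetical.

```python
def edit_distance(s, t):
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]


def string_similarity(s, t):
    """Map edit distance to [0, 1] (assumed normalization by the longer length)."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - edit_distance(s, t) / longest


def record_similarity(rec_a, rec_b, mapping_weights):
    """Weighted sum of similarities over mapped attribute pairs, weights normalized."""
    total = sum(mapping_weights.values())
    if total == 0:
        return 0.0
    score = sum(w * string_similarity(rec_a.get(a, ""), rec_b.get(b, ""))
                for (a, b), w in mapping_weights.items())
    return score / total


def merge_records(rec_a, rec_b, mapping_weights):
    """Merge mapped (common) attributes and retain attributes unique to each record."""
    merged = dict(rec_a)
    b_to_a = {b: a for a, b in mapping_weights}
    for attr, value in rec_b.items():
        if attr in b_to_a:
            target = b_to_a[attr]
            # Assumed tie-break: keep the longer of the two values.
            if len(value) > len(merged.get(target, "")):
                merged[target] = value
        else:
            merged.setdefault(attr, value)
    return merged


# Hypothetical record pair from two sources with differently named attributes.
rec_a = {"name": "Jon Smith", "city": "Beijing"}
rec_b = {"full_name": "John Smith", "phone": "555-0100"}
weights = {("name", "full_name"): 1.0}  # stand-in for a goodness-derived weight

score = record_similarity(rec_a, rec_b, weights)
merged = merge_records(rec_a, rec_b, weights) if score >= 0.8 else None
```

Here the pair is treated as a match because its score exceeds the assumed threshold of 0.8, and the merged record keeps the mapped name once while retaining `city` and `phone` from the individual records.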
Keywords/Search Tags: Entity Resolution, Dataspace, Tag-style Blocking, Property Mapping, Information Merging