Research On Parallel Optimization Clustering Method Based Document Organization Name Disambiguation

Posted on: 2021-09-11
Degree: Master
Type: Thesis
Country: China
Candidate: Y Y He
GTID: 2518306548990459
Subject: Master of Engineering
Abstract/Summary:
Constructing a knowledge graph of papers in a professional field can provide useful knowledge services for researchers in that field. Because organization names are important in a paper knowledge graph, extracting them is a key step in constructing the graph. The extracted organization entities often appear in multiple ambiguous forms, so the organization names must be disambiguated. The popular solution is Synonym Polymerization (SP). Since the extracted organization entities are unlabeled data, SP relies mainly on text clustering, and both the text similarity measure and the clustering algorithm affect its results. This thesis therefore focuses on feature weighting, text clustering, and clustering acceleration for organization name representations. The main innovations are as follows:

1) To address the low frequency and high noise of terms in short text, a term weighting algorithm based on Density-Distribution-Based Term Frequency (DDBTF) is proposed and applied to the text similarity task. The algorithm studies the distribution relating term frequency to the number of term frequency categories. It first performs non-homogeneous compression based on the term frequency density distribution, then fits the compressed feature distribution with a compound Gaussian function, and finally obtains the weighting function of the term. Experiments show that, on the text similarity task over short-text datasets, the Pearson correlation coefficient of DDBTF is on average 24.1% higher than that of the unweighted algorithm, 4.9% higher than that of TF-IDF, and 1% higher than that of SIF.

2) To reduce the high time overhead of density clustering, a GPU-accelerated density clustering algorithm is proposed. The main idea is to extract the parallel blocks of the distance calculation by designing the entity name vector, to calculate on the GPU the distance matrix between the current point and all other points in the dataset, and finally to generate the cluster of the current point from the resulting distance matrix. Experimental results show that when the dataset size is 1,000,000, the speedup ratio is 6.40 relative to the original density clustering algorithm.

3) To address the tendency of density clustering to merge adjacent clusters, a secondary classification algorithm based on regional features and density is proposed. The algorithm uses the geographical constraints in organization names as a feature to divide the first-pass clustering result, and designs the distance function of the density classification algorithm to amplify the effect that differences in expression between organization entity descriptions have on the entity representation. This resolves the problem of multiple different entities falling into the same cluster because of density distance generalization, and improves the accuracy of organization name aggregation. Experiments show that, compared with the single-pass clustering algorithm, the accuracy of this algorithm is improved by 20.81%.

Based on the above research, this thesis constructs an organization name disambiguation system that implements text preprocessing, entity word recognition, feature weighting, clustering, and organization name standardization. The results of organization name disambiguation have been successfully applied to the construction of medical literature knowledge graphs. Experiments show that after processing by the system, 276,882 organization name expressions are reduced to 182,547, eliminating 34.07% of organization entity descriptions.
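The secondary classification idea in contribution 3 can be illustrated with a minimal sketch. It is not the thesis's algorithm: the region lexicon `REGIONS`, the helper names `region_of` and `split_by_region`, and the sample organization names are all hypothetical. The abstract does not specify the redesigned distance function, so that step is omitted. The sketch shows only the geographic split of a first-pass cluster.

```python
from collections import defaultdict

# Hypothetical region lexicon; the abstract does not list its geographic features.
REGIONS = ("beijing", "shanghai", "boston")


def region_of(name):
    """Return the first known region token found in the name, or None."""
    tokens = name.lower().replace(",", " ").split()
    for token in tokens:
        if token in REGIONS:
            return token
    return None


def split_by_region(clusters):
    """Split each first-pass cluster into sub-clusters that share one region.

    Names with no recognizable region are grouped together under None.
    """
    refined = []
    for cluster in clusters:
        groups = defaultdict(list)
        for name in cluster:
            groups[region_of(name)].append(name)
        refined.extend(groups.values())
    return refined


# One first-pass cluster that wrongly merges two hospitals in different cities.
first_pass = [[
    "Dept of Cardiology, General Hospital, Beijing",
    "General Hospital Beijing Cardiology Dept",
    "Dept of Cardiology, General Hospital, Shanghai",
]]
refined = split_by_region(first_pass)
```

Here the two Beijing variants stay together while the Shanghai name moves to its own cluster, which is the kind of adjacent-cluster merge that a geographic constraint is meant to undo.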
Keywords/Search Tags: organization name disambiguation, term weight algorithm, parallel optimization, density-based clustering method, secondary classification