
Text Similarity Measuring Method Based On Heterogeneous Information Network

Posted on: 2022-07-08
Degree: Master
Type: Thesis
Country: China
Candidate: Q W Ma
GTID: 2480306476483164
Subject: Computer Science and Technology
Abstract/Summary:
Text similarity measurement is a basic task that underpins text classification, clustering, and ranking. Existing text similarity measuring methods often ignore the structured information and background information in unstructured text data, and considering only word-level or phrase-level granularity does not serve the task well. To address these problems, this dissertation proposes HINSim, a text similarity measuring method based on heterogeneous information networks, which transforms text similarity measurement into node similarity measurement in a weighted heterogeneous information network. Expanding the feature granularity of the text and using its explicit semantic information, combined with the structural characteristics of the heterogeneous information network, provides a new approach to text similarity measurement. The main work of this dissertation includes the following aspects:

(1) A weighted text heterogeneous information network is constructed. Combined with a world knowledge base, a weighted text heterogeneous information network (Text-WHIN) is built, in which each text is represented as a specific type of node. First, semantic analysis is performed on a given text set, and the results are semantically filtered to generate entity-type nodes. Then, text preprocessing and feature weighting methods are used to generate word-type nodes. Finally, the links in the network are weighted: the PMI value between words or entities, and the TF-IDF value between a word or entity and a text, serve as the link weights between nodes of different types. Unstructured text is thereby expressed as a structured heterogeneous information network. This enlarges the feature granularity of the text, makes full use of its structured and explicit semantic information, and enhances the interpretability of the text information.

(2) A meta-path-based ?-PageRank-Nibble subgraph partitioning algorithm is proposed. A pruning strategy is adopted to partition subgraphs of the weighted text heterogeneous information network, whose network schema is complex. First, the meta-paths of text-type nodes in the network are mined. Then, the meta-path-based ?-PageRank-Nibble subgraph partitioning algorithm is used to obtain the partial graph of a given set of text nodes. Finally, the commuting matrix of each meta-path is calculated from the partial graph and stored. Compared with existing algorithms, this algorithm reduces the space complexity and the time complexity of the subsequent similarity calculations.

(3) A node similarity measuring method for heterogeneous information networks is proposed. The similarity of text-type nodes in the weighted text heterogeneous information network is measured from the commuting matrices of the meta-paths. First, the OnePathSim node similarity measure, based on a specific meta-path, is proposed to measure the similarity of text nodes under each individual meta-path. Then, a weight is assigned to each meta-path according to its path instances. Finally, the AllPathSim measure, based on the meta-path set, combines the weights of the multiple meta-paths to measure the similarity of text-type nodes. Compared with other node similarity measuring methods, the correlation coefficient of this method's results improves, to varying degrees, on different text data sets.

(4) The text similarity measuring method based on heterogeneous information networks is verified and analyzed. The proposed algorithm is evaluated on two text data sets, 20NG and GCAT, and two English sentence-pair data sets, SICK and MSRP, combined with the world knowledge base Freebase. In terms of meta-paths, the influence of different meta-path lengths on the similarity results is explored; the experimental results show that the optimal meta-path length is 4. In terms of graph pruning, dividing subgraphs with the meta-path-based ?-PageRank-Nibble algorithm saves significant time and space compared with processing the entire graph. In terms of node similarity measurement, the AllPathSim coupled similarity measure has strong advantages over other node similarity measuring methods. In addition, compared with other typical text similarity measuring methods, HINSim improves the measurement results on different data sets. The experimental results show that HINSim makes full use of the semantic and structural information of text and obtains more effective text similarity measurements.
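
The abstract does not give the formulas for OnePathSim or AllPathSim, so the following is only a minimal sketch of the general idea in item (3): measuring text-node similarity from the commuting matrix of a meta-path and then combining several meta-paths by weight. It substitutes the standard PathSim measure for the per-path step and a plain weighted average for the combination step; both are assumptions, not the dissertation's definitions. The function names, the meta-path labels "T-W-T" and "T-E-T", the toy link weights, and the equal meta-path weights are all hypothetical.

```python
from typing import Dict, List

Matrix = List[List[float]]


def commuting_matrix(half: Matrix) -> Matrix:
    """Commuting matrix of a symmetric meta-path, M = W * W^T.

    `half` is the weighted adjacency matrix for the first half of the path,
    e.g. text-by-word link weights for the meta-path Text-Word-Text.
    """
    n = len(half)
    return [
        [sum(a * b for a, b in zip(half[i], half[j])) for j in range(n)]
        for i in range(n)
    ]


def path_sim(m: Matrix, x: int, y: int) -> float:
    """Standard PathSim: 2 * M[x][y] / (M[x][x] + M[y][y])."""
    denom = m[x][x] + m[y][y]
    return 2.0 * m[x][y] / denom if denom > 0 else 0.0


def combined_sim(
    mats: Dict[str, Matrix], weights: Dict[str, float], x: int, y: int
) -> float:
    """Weighted average of per-meta-path scores (assumed combination rule)."""
    total = sum(weights[p] for p in mats)
    if total <= 0:
        return 0.0
    return sum(weights[p] * path_sim(mats[p], x, y) for p in mats) / total


# Toy network with 3 texts. The numbers are made-up stand-ins for the
# TF-IDF link weights between texts and words or entities.
text_word = [
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
]
text_entity = [
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, 1.0],
]

mats = {
    "T-W-T": commuting_matrix(text_word),    # Text-Word-Text
    "T-E-T": commuting_matrix(text_entity),  # Text-Entity-Text
}
weights = {"T-W-T": 0.5, "T-E-T": 0.5}  # hypothetical meta-path weights

if __name__ == "__main__":
    for x, y in [(0, 1), (0, 2), (1, 2)]:
        print(x, y, round(combined_sim(mats, weights, x, y), 4))
```

In this toy network, texts 0 and 1 share all their words and one entity, so they score highest; texts 0 and 2 share nothing and score zero. Storing one commuting matrix per meta-path, as item (2) describes, means each pairwise similarity reduces to three matrix lookups.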
Keywords/Search Tags:Similarity measurement, Weighted heterogeneous information network, Meta-path, Text mining
Related items