Short Text Representation And Similarity Measurement Method Based On Heterogeneous Information Network

Posted on: 2022-12-15
Degree: Master
Type: Thesis
Country: China
Candidate: X F Lv
GTID: 2518306746986309
Subject: Computer Science and Technology
Abstract/Summary:
With the deep integration of computer technology into social life, more and more short text is spread across network platforms. Short text similarity measurement has a wide range of applications in text information retrieval, intelligent question answering systems, and other fields. However, short texts are difficult to represent and compare because their content is brief and their semantic features are insufficient. To solve this problem, we take the perspective of heterogeneous information networks: we convert short text similarity measurement into node similarity measurement on a heterogeneous information network and calculate node similarity based on meta-paths. The approach consists of three parts: constructing a short text heterogeneous information network; mining meta-paths on that network; and computing the explicit semantic similarity between short texts based on the mined meta-paths. Our research contents are as follows:

(1) Short Text with Enriched Feature Representation Method

To address the data sparsity problem of short text, we transform the short text into a short text heterogeneous information network and combine it with external knowledge to enrich the short text features. First, an external knowledge base and the LDA (Latent Dirichlet Allocation) model are used to obtain the entity and topic information of the short text, and six different short text expansion methods are designed. Then, a robust heterogeneous information network framework, HTE (HIN Text Enrichment), is constructed on the basis of the six expansion methods to fully integrate external knowledge with the short text. Finally, experimental results on two datasets show that the framework is robust and can integrate multiple types of additional information, which effectively alleviates the data sparsity problem of short texts.

(2) Mining Significant Meta-path in Schema-rich Heterogeneous Information Networks

Complex heterogeneous information networks contain a large number of meta-paths, and mining them incurs substantial time and space costs. To reduce the computational cost, we propose an efficient pruning mining algorithm, FPPM (Fast Path Pruning Mining), which rapidly mines the meta-paths of the short text heterogeneous information network and obtains the commuting matrix and the path instance matrix of each meta-path. First, we design an efficient meta-path mining algorithm, FPM (Fast Path Mining), which reduces the time cost of mining by screening for high-weight meta-paths while the meta-paths are being generated. Then, we prune the short text heterogeneous information network based on the generated meta-paths and obtain, for each meta-path, a small network containing only that meta-path, which reduces the space cost of the subsequent meta-path-based similarity measurement. Finally, time efficiency analysis, storage comparison experiments, and a comprehensive comparison of the algorithms show that FPPM effectively improves time efficiency and reduces space cost.

(3) Weighted Similarity Measurement Method Based on Multi-source Information Fusion

Because traditional meta-path measures do not fully account for differences in link attributes, they cannot accurately measure node similarity in the short text heterogeneous information network. We therefore adopt a weighted similarity measurement method to measure the similarity of short texts. First, we use a multi-granularity weighting method based on meta-paths and path instances, which fuses entity information from multiple external knowledge bases with text information through three weighting methods. Then, we use WASim, a weighted similarity measure based on multi-source information fusion, which combines different object link weights and different meta-path weights to calculate the similarity of short text nodes under different meta-paths. Finally, experimental results show that the method adopted in this thesis outperforms the mainstream BERT method by 7.43% in accuracy and can measure the similarity of short texts with high accuracy.
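
The abstract does not give the formulas behind the commuting matrix or WASim, so the following is only a minimal sketch of the general idea under stated assumptions. It builds commuting matrices for two symmetric meta-paths (text-topic-text and text-entity-text) by multiplying adjacency matrices, scores node pairs with the well-known PathSim measure as a stand-in for WASim, and fuses the per-meta-path scores with fixed weights. The function names, the toy adjacency data, and the weights 0.6 and 0.4 are all hypothetical and chosen for illustration.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product; a is (n x k), b is (k x m)."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def commuting_matrix(adjacency_chain: List[Matrix]) -> Matrix:
    """Multiply the adjacency matrices along a meta-path.

    Entry (i, j) counts the path instances between node i and node j.
    """
    result = adjacency_chain[0]
    for adjacency in adjacency_chain[1:]:
        result = matmul(result, adjacency)
    return result


def pathsim(m: Matrix, i: int, j: int) -> float:
    """PathSim for a symmetric meta-path (stand-in; not the thesis's WASim)."""
    denom = m[i][i] + m[j][j]
    return 2.0 * m[i][j] / denom if denom else 0.0


def fused_similarity(matrices: List[Matrix], weights: List[float],
                     i: int, j: int) -> float:
    """Weighted average of per-meta-path similarities (assumed fusion rule)."""
    total = sum(weights)
    return sum(w * pathsim(m, i, j) for m, w in zip(matrices, weights)) / total


# Toy network (hypothetical): 3 short texts, 2 LDA topics, 3 entities.
text_topic = [[1, 0], [1, 1], [0, 1]]
text_entity = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]

# Commuting matrices for text-topic-text and text-entity-text.
m_topic = commuting_matrix([text_topic, transpose(text_topic)])
m_entity = commuting_matrix([text_entity, transpose(text_entity)])

# Similarity of text 0 and text 1, with assumed meta-path weights.
score = fused_similarity([m_topic, m_entity], [0.6, 0.4], 0, 1)
```

In this toy example, texts 0 and 1 share one topic and both entities, so they score higher under the entity meta-path than under the topic meta-path, and the fused score lies between the two. A real system would learn or derive the weights rather than fix them by hand.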
Keywords/Search Tags: Text mining, Short text, Heterogeneous information network, Similarity measure, Meta path