Research On Text Clustering And Same Name Disambiguation Algorithm Based On Hybrid Feature And Meta-path

Posted on: 2022-08-21
Degree: Master
Type: Thesis
Country: China
Candidate: S Zhou
Full Text: PDF
GTID: 2518306539492094
Subject: Computer Science and Technology
Abstract/Summary:
Existing text clustering algorithms suffer from inaccurate partitioning, and same-name disambiguation algorithms suffer from the cold start problem. In response, this thesis combines several feature extraction algorithms and proposes a long text clustering algorithm based on Word2vec-Textrank and DMM. It then introduces the meta-path information in the text and proposes a K-means clustering algorithm based on meta-paths. Finally, it fuses the word vector information and the meta-path information in the text and proposes a same-name disambiguation algorithm based on Word2vec and meta-paths. The main research work and results are as follows:

(1) A long text clustering algorithm, WTDMM, based on Word2vec-Textrank and DMM is proposed. To extract feature information from long text, the algorithm first uses the Word2vec model to construct word vectors for the text, then ranks the sentences with the Textrank key sentence extraction algorithm to remove useless information, and finally uses DMM to cluster the processed text. Experiments on the Sohu and 20NG news datasets show that WTDMM can effectively extract key information from long texts.

(2) A K-means clustering algorithm, MP-KMS, based on meta-paths is proposed. To mine the latent information in the text and improve clustering quality, the algorithm first extracts the entities in the text, then builds a meta-path network from the relationships between entities, uses the meta-path information to calculate the similarity between texts, and finally clusters the texts with K-means. Experiments on the Disambiguation and Metapath2vec datasets show that MP-KMS can effectively mine the latent information in the text.

(3) A same-name disambiguation algorithm, WP-ND, based on Word2vec and meta-path information is proposed. To address the cold start problem of same-name disambiguation, the method fuses word vector and meta-path information. It uses Word2vec to construct word vectors for the texts and derives one similarity matrix between the texts, uses the meta-path information to calculate a second similarity matrix, combines the two matrices, and then applies the OPTICS clustering algorithm to the texts to achieve disambiguation. Experiments on the OAG-WhoIsWho and AMiner datasets show that WP-ND achieves a better same-name disambiguation result.

Main research contributions: a key information extraction algorithm for long texts is designed and implemented by combining multiple feature extraction algorithms; meta-path information is constructed from the connections between entities in the texts in order to mine the hidden information in the text; and an effective algorithm is proposed for fusing the word vector information and meta-path information in the text. The feasibility and effectiveness of the algorithms are verified through experiments.
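The fusion step described for WP-ND can be illustrated with a minimal sketch. The abstract does not say how the two similarity matrices are mixed or which meta-path similarity measure is used, so everything specific here is an assumption: cosine similarity over per-document vectors, a PathSim-style score along a Paper-Entity-Paper meta-path, a weighted sum with a hypothetical weight `alpha`, and invented toy data. The function names are illustrative and are not taken from the thesis.

```python
import math


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between document vectors (assumed measure)."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    n = len(vectors)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            denom = norms[i] * norms[j]
            sim[i][j] = dot / denom if denom else 0.0
    return sim


def pathsim_matrix(doc_entities):
    """PathSim-style similarity along a Paper-Entity-Paper meta-path (assumed).

    doc_entities is a list of sets, one per document (e.g. co-author names).
    With binary links, the number of path instances between two documents is
    the number of entities they share.
    """
    n = len(doc_entities)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            shared = len(doc_entities[i] & doc_entities[j])
            denom = len(doc_entities[i]) + len(doc_entities[j])
            sim[i][j] = 2.0 * shared / denom if denom else 0.0
    return sim


def fuse(sim_a, sim_b, alpha=0.5):
    """Weighted sum of two similarity matrices; alpha is a hypothetical weight."""
    n = len(sim_a)
    return [
        [alpha * sim_a[i][j] + (1.0 - alpha) * sim_b[i][j] for j in range(n)]
        for i in range(n)
    ]


# Toy data (invented): three papers published under the same author name.
doc_vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
doc_entities = [{"a", "b"}, {"b", "c"}, {"d"}]

fused = fuse(cosine_similarity_matrix(doc_vectors), pathsim_matrix(doc_entities))

# Convert similarity to distance; clip at 0 to guard against rounding error.
distance = [[max(0.0, 1.0 - s) for s in row] for row in fused]
```

A distance matrix of this kind could then be handed to a density-based clusterer, for example scikit-learn's `OPTICS` with `metric="precomputed"`, so that each resulting cluster is read as one real-world author.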
Keywords/Search Tags: Text clustering, Same-name disambiguation, Word2vec, Meta-path, K-means, OPTICS