Research On Text Clustering And Same Name Disambiguation Algorithm Based On Hybrid Feature And Meta-path

Posted on:2022-08-21

Degree:Master

Type:Thesis

Country:China

Candidate:S Zhou

GTID:2518306539492094

Subject:Computer Science and Technology

Abstract/Summary:

Existing text clustering algorithms suffer from inaccurate partitioning, and same-name disambiguation algorithms suffer from the cold-start problem. In response to these problems, this thesis combines several feature extraction algorithms and proposes a long text clustering algorithm based on Word2vec-TextRank and DMM. The meta-path information in the text is then introduced, and a K-means clustering algorithm based on meta-paths is proposed. Finally, the word vector information and the meta-path information in the text are fused, and a same-name disambiguation algorithm based on Word2vec and meta-paths is proposed. The main research work and results are as follows:

(1) A long text clustering algorithm, WTDMM, based on Word2vec-TextRank and DMM is proposed. To extract feature information from long text, the algorithm first uses the Word2vec model to construct word vectors for the long text, then ranks the sentences in the text with the TextRank key sentence extraction algorithm to remove useless information, and finally uses DMM to cluster the processed text data. Experiments on the Sohu and 20NG news datasets show that the WTDMM algorithm can effectively extract key information from long texts.

(2) A K-means clustering algorithm, MP-KMS, based on meta-paths is proposed to mine the latent information in the text and so improve the quality of text clustering. The algorithm first extracts the entity information in the text, then builds a meta-path network from the relationships between entities, uses the meta-path information to calculate the similarity between texts, and finally uses the K-means clustering algorithm to cluster the texts. Experiments on the Disambiguation and Metapath2vec datasets show that the MP-KMS algorithm can effectively mine the latent information in the text.

(3) A same-name disambiguation algorithm, WP-ND, based on Word2vec and meta-path information is proposed. To address the cold-start problem of same-name disambiguation, the method fuses word vector and meta-path information. It uses Word2vec to construct the word vector information of the text and obtain one similarity matrix between the texts, uses the meta-path information to calculate a second similarity matrix between the texts, mixes the two similarity matrices, and then applies the OPTICS clustering algorithm to the texts to achieve disambiguation. Experiments on the OAG-WhoIsWho and AMiner datasets show that the WP-ND algorithm achieves a better same-name disambiguation effect.

Main research contributions: several feature extraction algorithms are combined to extract key information from long texts, and a key information extraction algorithm is designed and implemented; meta-path information is constructed from the connections between entities in the texts, making it possible to mine the hidden information in the text; and an effective fusion algorithm is proposed for the word vector information and meta-path information in the text. The feasibility and effectiveness of the algorithms are then verified through experiments.
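The mixing of the two similarity matrices in WP-ND can be illustrated with a minimal sketch. The abstract does not say how the matrices are mixed, so the weighted average, the weight `alpha`, the function names `fuse_similarity` and `to_distance`, and the toy matrices below are all assumptions made for illustration, not the thesis's actual method.

```python
# Hypothetical sketch: fuse a word-vector similarity matrix with a
# meta-path similarity matrix by weighted averaging (an assumed rule).

def fuse_similarity(sim_a, sim_b, alpha=0.5):
    """Return alpha * sim_a + (1 - alpha) * sim_b for two square matrices."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    n = len(sim_a)
    same_shape = len(sim_b) == n and all(
        len(row_a) == n and len(row_b) == n
        for row_a, row_b in zip(sim_a, sim_b)
    )
    if not same_shape:
        raise ValueError("both matrices must be square and of equal size")
    return [
        [alpha * sim_a[i][j] + (1.0 - alpha) * sim_b[i][j] for j in range(n)]
        for i in range(n)
    ]


def to_distance(sim):
    """Convert similarities in [0, 1] to distances as 1 - similarity."""
    return [[1.0 - s for s in row] for row in sim]


# Toy data (invented): pairwise similarities between three documents.
word_sim = [[1.0, 0.8, 0.1],
            [0.8, 1.0, 0.2],
            [0.1, 0.2, 1.0]]
path_sim = [[1.0, 0.6, 0.0],
            [0.6, 1.0, 0.1],
            [0.0, 0.1, 1.0]]

fused = fuse_similarity(word_sim, path_sim, alpha=0.5)
dist = to_distance(fused)
```

A distance matrix of this form is the kind of input a density-based clusterer such as OPTICS can consume when configured for precomputed distances; the clustering step itself is left out of the sketch.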

Keywords/Search Tags:

Text clustering, Same-name disambiguation, Word2vec, Meta-path, K-means, OPTICS

Related items

1	Research On The Application Of Person Name Disambiguation Based On Improved Agglomerative Hierarchical Clustering
2	Text Mining Based On Clustering Algorithm
3	Design And Implementation Of Author Name Disambiguation System Based On Two Step Clustering
4	Research On Short Text Clustering Of Social Networks Based On Word2vec
5	Research On Graph Neural Network-Based Name Disambiguation Algorithm
6	Research On Text Clustering Algorithm Based On Deep Learning Feature Extraction
7	Research On Entity Disambiguation Technology Based On Information Network Representation Learning
8	The Research And Application Of Text Clustering Based On Improved K-means Algorithm
9	Text Clustering Based On K-means Algorithm And Realization
10	Research On Disambiguation Of Same Authors In Academic Collaboration Network