
Research On Text Clustering Algorithm Based On ALBERT

Posted on: 2022-03-25
Degree: Master
Type: Thesis
Country: China
Candidate: L Liu
Full Text: PDF
GTID: 2518306737956479
Subject: Computer Science and Technology
Abstract/Summary:
As the computer industry grows, the volume of personal information and collected data keeps increasing. Clustering, a common data mining tool, can reveal relationships within data and process massive data sets efficiently. This paper studies text clustering algorithms in depth, covering both text distance computation methods and partition-based clustering algorithms.

Most previous work uses Word2vec for text vectorization, which does not capture the multi-layered characteristics of words and cannot handle polysemy. To address these problems, this paper uses ALBERT for text vectorization, which represents text features better. In addition, most document distance measures used in clustering are traditional mathematical distance formulas, which cannot accurately represent the distance between documents. This paper therefore proposes a long-text distance computation model based on ALBERT. Following the characteristics of the texts in the THUCNews data set, each text is split into segments and processed by ALBERT to obtain a segment matrix, and a BiLSTM generates a position matrix. The sum of the two matrices is fed to a Transformer encoder for feature extraction. Finally, the matrices of the two texts are pooled, concatenated, and passed to a fully connected layer, whose activation function outputs the distance between the two documents.

The vectors that ALBERT produces are high-dimensional. Density-based clustering algorithms perform poorly on high-dimensional data sets, whereas K-means, a partition-based algorithm, is suited to them, but its performance depends heavily on the choice of initial cluster centers. To address the instability caused by random initialization in K-means, a text clustering algorithm that combines density and partitioning is proposed. Text density is defined from the text distance, and density is used to select a set of texts suitable as initial cluster centers. The initial centers are then chosen one by one from this set using a farthest-distance criterion. Finally, the data set is partitioned by assigning each text to its nearest center, and the centers are updated and the data re-partitioned until the clustering result is stable.

Experiments on the THUCNews news data set show that the ALBERT model represents text features well, that the ALBERT-based long-text distance model represents the distance between two texts more accurately, and that the text clustering algorithm combining density and partitioning performs well on text clustering problems.
Keywords/Search Tags:Clustering algorithm, Document distance, Text clustering, Text vectorization, ALBERT
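
The density-based initialization and partitioning steps described in the abstract can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the thesis's implementation: Euclidean distance stands in for the learned ALBERT-based document distance, density is taken to be the number of neighbours within a hypothetical `radius`, the candidate set is every point whose density is at least the mean, and the first center is the densest candidate. The names `euclidean`, `select_initial_centers`, and `cluster` are illustrative only.

```python
import math


def euclidean(a, b):
    # Assumption: stand-in for the learned ALBERT-based document distance.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_initial_centers(points, k, radius, dist=euclidean):
    n = len(points)
    # Assumed density definition: number of other points within `radius`.
    density = [
        sum(1 for j in range(n) if j != i and dist(points[i], points[j]) <= radius)
        for i in range(n)
    ]
    mean_density = sum(density) / n
    # Assumed candidate rule: keep points at least as dense as the average.
    candidates = [i for i in range(n) if density[i] >= mean_density]
    if len(candidates) < k:
        raise ValueError("too few dense candidates; adjust radius or k")
    # Assumption: the first center is the densest candidate.
    chosen = [max(candidates, key=lambda i: density[i])]
    # Farthest-distance selection: each new center maximizes its distance
    # to the nearest center chosen so far.
    while len(chosen) < k:
        nxt = max(
            (i for i in candidates if i not in chosen),
            key=lambda i: min(dist(points[i], points[c]) for c in chosen),
        )
        chosen.append(nxt)
    return [points[i] for i in chosen]


def cluster(points, k, radius, dist=euclidean, max_iter=100):
    centers = select_initial_centers(points, k, radius, dist)
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Update each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers
```

Because the initial centers are chosen deterministically from dense regions, repeated runs on the same data give the same partition, which is the stability property the abstract attributes to the combined approach. With a pairwise learned distance and no vector space to average in, the mean update would need to be replaced, for example by a medoid update; that substitution is also an assumption and is not stated in the abstract.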