Research On Text Clustering Algorithm Based On ALBERT

Posted on:2022-03-25

Degree:Master

Type:Thesis

Country:China

Candidate:L Liu

Full Text:PDF

GTID:2518306737956479

Subject:Computer Science and Technology

Abstract/Summary:

The computer industry is thriving, and the volume of personal information and collected data continues to grow. As a common data mining tool, clustering can effectively analyze the relationships within data and process massive data sets efficiently. This paper studies text clustering algorithms in depth, covering both text distance computation methods and partition-based clustering algorithms. Most previous work uses Word2vec for text vectorization, which does not reflect the multi-layered characteristics of words and cannot handle polysemy. To address these problems, this paper uses ALBERT for text vectorization, which represents text features better. Most document distance measures used in clustering are traditional mathematical distance formulas, which cannot accurately represent the distance between documents. This paper therefore proposes a long-text distance computation model based on ALBERT. Following the characteristics of the texts in the THUCNews data set, each text is split into segments; ALBERT encodes the segments into a segment matrix, and a BiLSTM generates a position matrix. The sum of the two matrices is fed to a Transformer encoder for feature extraction. Finally, the feature matrices of the two texts are pooled, concatenated, and passed to a fully connected layer, which outputs the distance between the two documents through an activation function. The vectors that ALBERT produces are high-dimensional, and density-based clustering algorithms do not perform well on high-dimensional data sets. K-means, a partition-based clustering algorithm, is suitable for high-dimensional data, but its performance depends heavily on the choice of the initial cluster centers. To address the instability caused by the random initialization of K-means, a text clustering algorithm that combines density and partitioning is proposed. Text density is defined in terms of text distance, and the set of texts suitable as initial cluster centers is selected by density. The initial cluster centers are then chosen from this set one at a time using a farthest-distance rule. Finally, the data set is partitioned by assigning each text to its nearest center, and the cluster centers are updated and the data re-partitioned until the clustering result is stable. Experiments on the THUCNews news data set show that the ALBERT model represents text features very well, that the ALBERT-based long-text distance computation model represents the distance between two texts more accurately, and that the text clustering algorithm combining density and partitioning performs excellently on text clustering problems.
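The density-plus-partition procedure described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation; it only follows the steps named above under several assumptions. Density is taken to be the number of other documents within a radius `eps`, the candidate set is the densest `candidate_ratio` share of the documents, and the first center is the densest candidate. Because the sketch works only from a pairwise distance matrix (standing in for the output of the distance model), the center update uses the medoid of each cluster instead of the mean used by standard K-means. The names `local_density`, `pick_initial_centers`, `cluster`, `eps`, and `candidate_ratio` are all hypothetical.

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def local_density(dist: Matrix, eps: float) -> List[int]:
    """Assumed density: number of other documents within distance eps."""
    n = len(dist)
    return [sum(1 for j in range(n) if j != i and dist[i][j] <= eps)
            for i in range(n)]


def pick_initial_centers(dist: Matrix, k: int, eps: float,
                         candidate_ratio: float = 0.5) -> List[int]:
    """Select k initial centers: dense candidates first, then farthest-first."""
    n = len(dist)
    if k > n:
        raise ValueError("k must not exceed the number of documents")
    density = local_density(dist, eps)
    # Keep the densest documents as candidates (the ratio is an assumption).
    n_candidates = max(k, int(n * candidate_ratio))
    candidates = sorted(range(n), key=lambda i: (-density[i], i))[:n_candidates]
    centers = [candidates[0]]  # assumed start: the densest candidate
    while len(centers) < k:
        remaining = [c for c in candidates if c not in centers]
        # Farthest-distance rule: maximize the distance to the nearest chosen center.
        centers.append(max(remaining,
                           key=lambda c: min(dist[c][m] for m in centers)))
    return centers


def cluster(dist: Matrix, k: int, eps: float, candidate_ratio: float = 0.5,
            max_iter: int = 100) -> Tuple[List[int], List[int]]:
    """Assign each document to its nearest center and update until stable."""
    n = len(dist)
    centers = pick_initial_centers(dist, k, eps, candidate_ratio)
    labels: List[int] = []
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda c: dist[i][centers[c]])
                  for i in range(n)]
        new_centers = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old center if a cluster becomes empty
                new_centers.append(centers[c])
                continue
            # Medoid update (swapped in for the K-means mean update).
            new_centers.append(min(members,
                                   key=lambda m: sum(dist[m][j] for j in members)))
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Toy usage: six "documents" on a line, with absolute difference as the distance.
points = [0, 1, 2, 50, 51, 52]
dist = [[abs(a - b) for b in points] for a in points]
labels, centers = cluster(dist, k=2, eps=1)
print(labels, centers)
```

On this toy input the two groups are recovered, with the middle point of each group (indices 1 and 4) serving as its center. In practice `dist` would hold the distances produced by the document distance model, and `eps` and `candidate_ratio` would need tuning for the data set.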

Keywords/Search Tags:

Clustering algorithm, Document distance, Text clustering, Text vectorization, ALBERT

Related items

1	Text Clustering And Its Application Based On CFSFDP Algorithm
2	The implementation of dynamic document organization using the integration of text clustering and text categorization
3	Chinese Text Clustering Based On Text Similarity
4	Research On Text Spectral Clustering Algorithm Based On Normalized Compression Distance
5	Research On The Telecommunication Complaint Text Clustering Based On Improved CFSFDP Algorithm
6	Research Of Text Clustering Based On NMF Algorithm
7	Text Clustering Research Based On Semantic Distance
8	The Research And Application Of Text Clustering Based On Improved K-means Algorithm
9	The Study And Application Of New Clustering Algorithms In Image Processing And Text Clustering
10	The Research On Text Document Clustering Technology Based On Ant Colony