
Research And Application Of Feature Selection And Text Representation In Text Clustering

Posted on: 2022-12-19
Degree: Master
Type: Thesis
Country: China
Candidate: Z G Luo
GTID: 2518306764477054
Subject: Automation Technology
Abstract/Summary:
With the emergence of 5G and the rapid development of computer technology, publishing and obtaining information through the Internet has become easier. Humans struggle to cope with the resulting explosive growth of information. Text clustering can efficiently manage and organize texts and mine the internal relationships between them, which helps address the problems caused by information explosion and information redundancy. Many factors affect the quality of text clustering. This thesis focuses on two key technologies, feature selection and text representation, and completes the following three parts:

(1) A text clustering method based on a novel feature selection method (TC-NFSM) is proposed. To address the high dimensionality of text clustering, this thesis analyzes the feature selection problem from a mathematical point of view and explores the application of the binary particle swarm optimization (BPSO) algorithm to text feature selection. To counter the algorithm's shortcomings, such as premature convergence and insufficient search ability, dynamically controlled learning, inertia, and constriction factors are introduced, and genetic operators are integrated to improve its search ability. Combined with a fitness function based on the weights of feature words, a novel feature selection method (NFSM) is proposed. NFSM is then combined with the traditional clustering algorithm K-Means to form the TC-NFSM text clustering method, which effectively reduces the number of features.

(2) A text clustering method based on a novel text representation model and deep clustering (TC-NTRM-DC) is proposed. To address the lack of semantics in text clustering, this thesis considers the context information in the text together with the entities and feature words it contains. It introduces the BERT pre-trained model and combines it with a feature weighting scheme based on entities and feature words to propose a novel text representation model (NTRM). On this basis, NTRM is combined with a deep clustering method to form the TC-NTRM-DC text clustering method, which effectively reduces the impact of the lack of semantics in traditional text clustering.

(3) A text clustering system is designed and implemented. Based on the two clustering methods proposed in this thesis, TC-NFSM and TC-NTRM-DC, and on the needs of practical scientific research projects, the system comprises six modules: data normalization, text preprocessing, text clustering, data management, system interaction, and system management.

Both text clustering methods proposed in this thesis effectively improve clustering accuracy and other indicators, and have reference value for the development of text clustering research. The text clustering system built on the two methods provides a text clustering function and has application value in fields such as document management and public opinion analysis.
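To make the feature selection idea in part (1) more concrete, the following is a minimal sketch of plain binary particle swarm optimization over a 0/1 feature mask. It is not the thesis's NFSM: the fitness function (sum of feature weights minus a per-feature `penalty`), the linearly decreasing inertia schedule, the toy weights, and the name `bpso_select` are all assumptions made for illustration, and the constriction factor and genetic operators mentioned in the abstract are omitted.

```python
import math
import random


def bpso_select(weights, penalty=0.3, n_particles=20, n_iters=60, seed=0):
    """Toy binary PSO: search for a 0/1 mask over features that maximizes fitness."""
    rng = random.Random(seed)
    dim = len(weights)

    def fitness(mask):
        # Hypothetical fitness: weight of each kept feature minus a fixed cost.
        return sum(w - penalty for w, bit in zip(weights, mask) if bit)

    # Random initial bit masks and velocities.
    pos = [[rng.randint(0, 1) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    c1 = c2 = 2.0   # learning factors (kept constant in this sketch)
    v_max = 4.0     # velocity clamp keeps the sigmoid away from saturation
    for t in range(n_iters):
        # Assumed schedule: inertia decreases linearly from 0.9 to 0.4.
        w_inertia = 0.9 - 0.5 * t / max(1, n_iters - 1)
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                v = (w_inertia * vel[i][d]
                     + c1 * r1 * (pbest[i][d] - pos[i][d])
                     + c2 * r2 * (gbest[d] - pos[i][d]))
                v = max(-v_max, min(v_max, v))
                vel[i][d] = v
                # Sigmoid transfer: velocity becomes the probability that the bit is 1.
                prob = 1.0 / (1.0 + math.exp(-v))
                pos[i][d] = 1 if rng.random() < prob else 0
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit


# Usage: toy per-feature weights (for example, normalized term weights).
mask, score = bpso_select([0.9, 0.1, 0.8, 0.05, 0.7, 0.2])
print(mask, score)
```

In a pipeline like the one the abstract describes, the returned mask would select the columns of a document-term matrix before a clustering step such as K-Means; that step is not shown here.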
Keywords/Search Tags: Text Clustering, Feature Selection, Text Representation, BPSO algorithm, BERT