With the rapid development of the Internet, the amount of text data has grown exponentially, and how to efficiently obtain useful information and knowledge from these data has become a research hotspot. Text mining automates the discovery of hidden patterns, trends, and knowledge in text data. It can extract useful information from large amounts of unstructured text and has broad application prospects. Its most important and fundamental applications are text classification and text clustering. This thesis studies text classification and text clustering algorithms in the context of news reporting. The main work of this thesis includes:

(1) To account for the roles of both the title and the body text in news articles, this thesis proposes a dual-channel news text classification algorithm that integrates title and text. Existing news text classification models either classify headlines alone or treat the news text as an undifferentiated whole. Both approaches can easily overlook the distinctive topic features contained in titles, as well as the rich contextual and complex semantic features present in the body text. To address these issues, this thesis presents an optimized model for extracting news features. The model applies different feature extraction algorithms to the title and the body text, yielding a probability for each category from each of the two channels. To obtain the final probabilities over all categories, the model takes into account the different features captured by each channel and weights the per-category probabilities output by the two channels using an attention mechanism. Experimental results on the THUCNews dataset show that the proposed model outperforms other hybrid models, demonstrating its effectiveness.

(2) To address the inability of the VAE to extract effective sequence information and cluster news articles accurately, this thesis proposes a news text clustering algorithm based on VAE-GRU. In news text clustering, sequence information can have a significant impact on the results. In natural language, words appear in a certain order, and changing this order can give a sentence a different meaning and context. In other words, the same combination of words may carry different semantics when rearranged. Text clustering therefore needs to consider sequence information in order to better capture the semantics and internal structure of the text. To this end, this thesis uses the BERT pre-trained model to obtain text features that contain context information, and uses GRU networks to implement the encoder and decoder of the VAE model, which both reduces dimensionality and captures sequence information more effectively. The model performed better than the other hybrid models on the Today’s Headlines News Dataset, THUCNews Dataset, People’s Daily News Dataset, and Sohu News Dataset.

In summary, this thesis explores text mining in the context of news reporting and focuses on two specific tasks: news text classification and news text clustering. By applying these methods to a massive corpus of news reports, valuable information can be extracted automatically, enabling rapid and accurate classification of news articles. This is of significant importance for further understanding the underlying information and trends in news events.
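
The attention-weighted fusion of the title and body-text channels described in contribution (1) can be illustrated with a minimal sketch. The abstract does not specify the form of the attention mechanism, so everything here is an assumption for illustration only: the function name `fuse_channels`, the scoring vector `attn_vector` (standing in for learned parameters), and the choice of one scalar weight per channel normalized by a softmax.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def fuse_channels(title_probs, body_probs, attn_vector):
    """Fuse per-category probabilities from two channels.

    title_probs, body_probs: arrays of shape (n_classes,), each summing to 1.
    attn_vector: array of shape (n_classes,); a hypothetical stand-in for
        learned attention parameters (not taken from the thesis).
    Returns the fused probabilities and the two channel weights.
    """
    channels = np.stack([title_probs, body_probs])  # shape (2, n_classes)
    scores = channels @ attn_vector                 # one score per channel
    weights = softmax(scores)                       # channel weights sum to 1
    fused = weights @ channels                      # convex combination
    return fused, weights


# Toy example with three categories (values are made up for illustration).
title_probs = softmax(np.array([2.0, 0.5, 0.1]))
body_probs = softmax(np.array([1.0, 1.2, 0.3]))
attn_vector = np.array([0.8, -0.2, 0.1])

fused, weights = fuse_channels(title_probs, body_probs, attn_vector)
predicted_category = int(np.argmax(fused))
```

Because the channel weights are non-negative and sum to one, the fused vector remains a valid probability distribution. A trained model would learn the attention parameters jointly with the two feature extractors rather than fixing them by hand as done here.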