Research On News Texts Classification Based On Keyword Extraction And BERT Word Embedding

Posted on: 2022-10-02
Degree: Master
Type: Thesis
Country: China
Candidate: B Zhang
GTID: 2518306608467434
Subject: Computer technology
Abstract/Summary:
In recent years, with the rapid growth in the number of Internet users, the volume of online news text has grown explosively, and classifying and managing this text efficiently has become an active research topic. Online news, however, is structured differently from ordinary text. Traditional approaches treat the headline as part of the body and ignore its distinct role, which leads to unsatisfactory classification results. A classification algorithm suited to news text is therefore needed to organize the text and mine valuable information from it. To improve the accuracy of news text classification, this dissertation studies news text classification based on keyword extraction and BERT word vectors. The main work of this dissertation is as follows:

(1) Acquisition of news topic centers based on BERT word vectors and text feature extraction. News texts and their categories are crawled from news websites and analyzed, and the texts are labeled to obtain a supervised corpus; the number of topics is determined from the labels. The feature words of each topic and their weights are then obtained through TF-IDF computed between topics, and a feature word weight set is constructed for each topic. The BERT model transforms each feature word weight set into a feature vector weight set, and the weighted sum of these vectors gives the news topic center of that topic.

(2) Keyword extraction from news text based on TF-HF-IDF and the LDA model. Considering the particular structure of news text, a new feature extraction method for news text, TF-HF-IDF, is proposed and combined with the traditional LDA model to form tLDA, a model for the news domain. After extracting the text features of the news text, tLDA clusters the texts according to the previously determined number of topics. From the resulting topic distribution and topic-word distribution, the first n words of the word distribution are selected as keywords, and their weights are adjusted according to their probabilities, yielding the keywords of each news text and their corresponding weights.

(3) News text classification based on BERT word vectors and keywords. The keywords of each news text are transformed into word vectors through the BERT model, and a word vector weight set is constructed for each news text. The center vector of the keyword set, obtained as a weighted sum of these vectors, serves as the topic vector of the article. Finally, the cosine similarity between the topic vector and each news topic center vector is used to judge whether the article belongs to the corresponding news category, and the best similarity threshold for assigning an article to a category is found experimentally.

Experiments were designed according to these methods. Compared with the traditional LDA model and the BERT model, the proposed algorithm improves macro accuracy, macro recall and macro F1 by 11.2%, 11.8%, 11.6% and by 1.9%, 2.1%, 2.1%, respectively. Figure[24] table[17] reference[68]...
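The weighted-sum and cosine-similarity steps described in (1) and (3) can be sketched in a few lines of Python. This is a minimal illustration, not the dissertation's implementation: the three-dimensional toy vectors stand in for BERT word vectors, the function names `weighted_center`, `cosine_similarity` and `classify` are hypothetical, and assigning an article to the single most similar topic center above a threshold is an assumption about how the threshold is applied.

```python
import math


def weighted_center(vectors, weights):
    """Weighted sum of word vectors (a topic center or an article topic vector).

    The sum is left unnormalized; cosine similarity ignores vector length,
    so dividing by the total weight would not change the comparison below.
    """
    dim = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(article_vector, topic_centers, threshold):
    """Return (label, similarity) for the most similar topic center.

    Assumption: the label is None when the best similarity is below the
    threshold, i.e. the article is not assigned to any category.
    """
    best_label, best_sim = None, -1.0
    for label, center in topic_centers.items():
        sim = cosine_similarity(article_vector, center)
        if sim > best_sim:
            best_label, best_sim = label, sim
    if best_sim < threshold:
        return None, best_sim
    return best_label, best_sim


if __name__ == "__main__":
    # Toy stand-ins for BERT vectors of topic feature words and their weights.
    centers = {
        "sports": weighted_center([[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]], [0.6, 0.4]),
        "finance": weighted_center([[0.0, 1.0, 0.0], [0.0, 0.9, 0.1]], [0.5, 0.5]),
    }
    # Toy article: two keywords with weights, as keyword extraction would supply.
    article = weighted_center([[0.9, 0.1, 0.0], [0.7, 0.3, 0.0]], [0.7, 0.3])
    print(classify(article, centers, threshold=0.8))
```

In practice the threshold would be tuned on held-out labeled data, as the abstract notes that the best value is found experimentally.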
Keywords/Search Tags: Text classification, The field of public information, LDA model, BERT model