
Research On Government Text Classification Algorithm Based On Ensemble Learning

Posted on: 2022-03-25
Degree: Master
Type: Thesis
Country: China
Candidate: K S Ji
Full Text: PDF
GTID: 2518306332952419
Subject: Software engineering
Abstract/Summary:
With the development of Internet technology and the state's vigorous promotion of open, digitized government information, more and more government texts are being published on government websites. Since the State Council issued and implemented the "New Generation Artificial Intelligence Development Plan" in July 2017, artificial intelligence has been incorporated into national strategic planning, and building an intelligent, digital government has become an urgent task. According to the 47th "Statistical Report on China's Internet Development Status" released by the China Internet Network Information Center (CNNIC), as of December 2020 the number of Internet users in China had reached 989 million, an increase of 85.4 million from March 2020, and the Internet penetration rate had reached 70.4%. Information on the Internet, however, is large in volume and varied in type, so identifying government texts promptly and efficiently from their government text number or title is a problem that urgently needs solving.

Currently there is no system dedicated to government text recognition. A government text number or title can be searched for on the Internet, but not all of the returned results are the government texts actually required. The results mainly fall into the following categories: news reports about government texts, reposts of government texts by companies, and interpretations of government texts by self-media public accounts. Genuine government texts often make up only a small part of the returned results, so a dedicated recognition system for this special kind of text is particularly important.

To address the large amount of noise returned when searching by government text number or title, this thesis proposes a government text recognition algorithm based on ensemble learning (EnGTC) and builds a government text recognition system on top of the original search results to improve the accuracy of the government texts returned. First, data crawled from the Internet is processed to obtain a dedicated government text dataset. Second, because government texts follow a fixed format, TextCNN, which is good at capturing local features, is used as one of the base classifiers. To add more semantic information, the text is represented with word2vec, and an attention mechanism is added to TextCNN to supply contextual information for each word, which improves the expressive power of the model. However, government texts vary in length, and using TextCNN alone causes information loss. BiLSTM, which adapts better to variable-length text, is therefore adopted as the other base classifier; it both alleviates the long-term dependency problem and adds contextual information from the text. Finally, the base classifiers are trained following the bagging idea of ensemble learning. Because there are too few base classifiers for the voting method to be used directly, this thesis combines the idea of voting with practical application requirements and proposes a conditional-judgment combination strategy for ensemble learning.

The algorithm is tested and verified on the self-built government text dataset. The reported recognition rates are 90.53% and 97.12% (both given as "accuracy rate"), with an F1 score of 95.71%, the best overall result. For the original search by government text number, HIT@10 rises from 81% to 86.53% and HIT@3 from 72.12% to 81.32%; for search by government text title, HIT@10 rises from 83.28% to 87.94% and HIT@3 from 76.83% to 82.51%.
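The combination step and the HIT@k figures above can be illustrated with a minimal Python sketch. It makes two assumptions that the abstract does not confirm. First, HIT@k is taken to mean the share of queries whose top k results contain at least one genuine government text, which is the usual definition. Second, the conditional rule shown here (`combine_conditionally`, with its hypothetical `threshold` and `margin` parameters) is only one possible way to merge two base classifiers when majority voting is unavailable; it is not the thesis's actual strategy, whose conditions are not given.

```python
from typing import Sequence


def combine_conditionally(p_cnn: float, p_lstm: float,
                          threshold: float = 0.5, margin: float = 0.2) -> bool:
    """Hypothetical conditional-judgment rule for two base classifiers.

    p_cnn and p_lstm are each model's probability that a page is a
    government text. The rule and both parameters are illustrative
    assumptions; the abstract does not specify the actual conditions.
    """
    vote_cnn = p_cnn >= threshold
    vote_lstm = p_lstm >= threshold
    if vote_cnn == vote_lstm:
        # Both classifiers agree, so accept the shared decision.
        return vote_cnn
    # They disagree: look at the model whose score is further from the
    # threshold, and accept a positive only if it clears the margin.
    if abs(p_cnn - threshold) >= abs(p_lstm - threshold):
        stronger = p_cnn
    else:
        stronger = p_lstm
    return stronger - threshold >= margin


def hit_at_k(ranked_labels: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of queries with at least one true government text in the top k.

    Each inner sequence holds the relevance labels of one query's results,
    in ranked order (True means the result is a genuine government text).
    """
    if not ranked_labels:
        return 0.0
    hits = sum(1 for labels in ranked_labels if any(labels[:k]))
    return hits / len(ranked_labels)
```

With only two base classifiers a plain majority vote cannot break ties, which is why a sketch like this falls back on an explicit condition when the two models disagree.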
Keywords/Search Tags: Government Text Classification, EnGTC, Heterogeneous Ensemble Learning, BiLSTM, Attention+TextCNN