Research On Chinese Web Forum Based On Natural Language Processing

Posted on: 2022-03-23
Degree: Master
Type: Thesis
Country: China
Candidate: T C Xu
GTID: 2517306494973089
Subject: Statistics
Abstract/Summary:
Previous analyses of web forums fall mainly into two kinds of methods, qualitative and quantitative. Quantitative research mostly relies on constructing indicators or on text statistics tools to mine information. In this thesis, we combine recent natural language processing models with the characteristics of web forum data, and construct Text Classification and Named Entity Recognition (NER) tasks to extract deep user features from forum text. Observation shows that user behavior in web forums falls into two types: posting topics and replying with comments. The topics a user posts contain that user's views and opinions, while comment replies carry the emotional relationship between the two users involved. We collect all topic data posted from May 1 to May 27, 2020 in the Shanghai Stock Exchange Forum, together with the comment reply data under each topic.

For the topic data, we design two tasks, text classification and NER, to extract user characteristics from two perspectives: the category of the topic and the user's opinion on stocks. For the text classification task, topic data were manually annotated into eight categories, with 10,230 items labeled in total. On this labeled data set we compare three models, BiLSTM, BiLSTM+Attention and the BERT Text Classifier, and the results show that the BERT Text Classifier is significantly better than the other two. For the NER task, the topic data from May 1 to May 10 were annotated. Because few entity names were obtained, three different data augmentation methods were adopted, finally yielding 5,297 favorable entities and 1,501 unfavorable entities. BiLSTM, BiLSTM+CRF and BERT+BiLSTM+CRF are compared, and the results show that the BERT+BiLSTM+CRF model extracts the two types of labeled entities better.

For the comment reply data, we observe that the relationship between users can be divided into three types: support, neutral and opposition. The collected comment reply data were therefore divided into these three categories, with 8,000 groups of comment reply data labeled in total. As before, the classification performance of BiLSTM, BiLSTM+Attention and the BERT Text Classifier is compared, and the BERT Text Classifier again performs better.

Finally, from the user characteristics extracted from the topic data and comment reply data, together with the relationships between users, the User-Face database of the Shanghai Stock Exchange Forum is constructed. The database is then used to explore user community discovery, where the Louvain algorithm divides users into eight communities. By averaging the theme distribution characteristics of all users in a community, we obtain the theme distribution characteristics of that community, and thereby achieve a quantitative study of the theme publishing behavior of users in the Shanghai Stock Exchange Forum during May 1 to May 10.
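The community-level averaging described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation. It assumes that a user's theme distribution is the normalized count of the predicted categories of that user's posts (the abstract does not state how it is computed), and that community assignments, for example from a Louvain run, are already available as a plain mapping. The function names and toy data are hypothetical.

```python
from collections import Counter

N_CATEGORIES = 8  # the abstract reports eight topic categories


def user_theme_distribution(post_labels, n_categories=N_CATEGORIES):
    """Turn one user's predicted post categories (ints in [0, n_categories))
    into a normalized theme distribution. Assumption: simple relative counts."""
    total = len(post_labels)
    if total == 0:
        return [0.0] * n_categories
    counts = Counter(post_labels)
    return [counts[k] / total for k in range(n_categories)]


def community_theme_distribution(user_themes, user_community):
    """Average the theme distributions of all users in each community.

    user_themes:    {user_id: [p_0, ..., p_{K-1}]}
    user_community: {user_id: community_id}, e.g. output of community detection
    """
    sums, sizes = {}, {}
    for user, dist in user_themes.items():
        comm = user_community.get(user)
        if comm is None:  # user not assigned to any community: skip
            continue
        if comm not in sums:
            sums[comm] = [0.0] * len(dist)
            sizes[comm] = 0
        for i, p in enumerate(dist):
            sums[comm][i] += p
        sizes[comm] += 1
    return {c: [s / sizes[c] for s in vec] for c, vec in sums.items()}


# Toy data (hypothetical): predicted categories of each user's posts.
posts = {"u1": [0, 0, 1, 7], "u2": [1, 1], "u3": [7, 7, 7, 7]}
themes = {u: user_theme_distribution(labels) for u, labels in posts.items()}
communities = {"u1": 0, "u2": 0, "u3": 1}
community_themes = community_theme_distribution(themes, communities)
```

Here `community_themes[0]` is the element-wise mean of the distributions of `u1` and `u2`, which gives each community a profile over the eight categories that can be compared across communities.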
Keywords/Search Tags: Chinese Web Forum, Natural Language Processing, text classification, Named Entity Recognition, complex network