
Research On Web Information Extraction And Sentiment Classification Based On Forum

Posted on: 2020-02-27
Degree: Master
Type: Thesis
Country: China
Candidate: X Ban
GTID: 2518306518964049
Subject: Control Science and Engineering
Abstract/Summary:
The rapid development of Internet technology has multiplied the channels of expression, and forums have increasingly become a window onto public opinion. The emotional expression found online, typified by comments, is massive and scattered across web pages. It is therefore important to mine this data accurately and to eliminate the redundancy in such massive information. The processed content lays a good foundation for public opinion analysis, which is the motivation for this research. The paper takes the forum page as its research object and studies a forum information extraction algorithm and a sentiment classification algorithm. The main work of this paper includes the following points:

(1) Forum pages contain a large amount of noise, so extraction accuracy tends to be low. To address this, the paper first uses a page blocking algorithm based on HTML tags to divide the forum page into blocks, and then calculates the link density ratio of each text block to identify the target text block, which effectively removes noise such as advertisements and navigation bars. The paper then introduces the concept of a standard value, which refers to the number of floors (numbered posts) that identify the comment information in the forum page. Finally, the paper combines two features: the similarity of the positional structure of comments in forum pages, and the observation that the deeper nodes in the DOM tree can represent the overall similarity. On this basis it presents a depth-weighted DOM subtree similarity algorithm to extract the comments, and the number of extracted comments is compared with the standard value in order to improve extraction accuracy.

(2) Sentiment analysis based on traditional neural networks cannot fully exploit contextual semantic knowledge. The paper therefore presents a sentiment classification model, BiGRU+Multi-attention, which is a BiGRU (Bi-directional Gated Recurrent Unit) based on multiple attention mechanisms. First, the model uses the word2vec model to convert the pre-processed web text into vectors so that it can be fed into the BiGRU model to extract semantic features. Attention mechanisms over three kinds of sentiment linguistic knowledge (sentiment words, intensity words and negation words) capture the underlying emotional features in the sentence and highlight the words that are critical for sentiment classification, making up for the shortcomings of a single attention mechanism. The best classification performance is then obtained by tuning the neural network model parameters. Finally, a public dataset is used to verify the feasibility and effectiveness of the proposed model.

Extensive experimental results demonstrate that the web information extraction algorithm based on page blocking and depth-weighted DOM subtrees removes as much noise as possible from the forum page, and that extraction accuracy is clearly improved. Meanwhile, the F value of the BiGRU sentiment classification model based on multiple attention mechanisms reached 94.5%, which is 4.5% higher than that of the sentiment classification model based on BiGRU alone, effectively improving the accuracy of sentiment classification.
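The link density ratio used in step (1) to separate comment text from navigation bars and advertisements can be illustrated with a short sketch. The abstract does not give the exact formula or threshold, so the definition below (characters of text inside `<a>` tags divided by all text characters in the block), the names `link_density` and `is_content_block`, and the 0.5 cutoff are assumptions chosen for illustration, not the thesis's implementation.

```python
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Counts non-whitespace text characters, in total and inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0   # > 0 while the parser is inside an <a> element
        self.total_chars = 0
        self.link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        n = len("".join(data.split()))  # ignore whitespace
        self.total_chars += n
        if self.anchor_depth > 0:
            self.link_chars += n


def link_density(block_html: str) -> float:
    """Assumed definition: anchor-text characters / all text characters."""
    counter = _TextCounter()
    counter.feed(block_html)
    counter.close()
    if counter.total_chars == 0:
        return 1.0  # assumption: a block with no text is treated as noise
    return counter.link_chars / counter.total_chars


def is_content_block(block_html: str, threshold: float = 0.5) -> bool:
    """Keep a block when its link density is below the (assumed) threshold."""
    return link_density(block_html) < threshold


# Hypothetical blocks: a navigation bar and a forum comment.
nav = '<div><a href="/">Home</a> <a href="/new">New posts</a></div>'
post = '<div>I really like this phone. <a href="/u/1">reply</a></div>'
print(is_content_block(nav), is_content_block(post))
```

A navigation bar consists almost entirely of link text, so its ratio is close to 1, while a comment with a single "reply" link scores low. In practice the threshold would need tuning on real forum pages.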
Keywords: Forum, Page Blocking, DOM Tree, BiGRU, Attention Mechanism, Sentiment Classification