| Social media is becoming increasingly ingrained in people’s daily lives in the digital age. Because users can share their comments and life events on social media anytime and anywhere, the volume of social media text data is growing at an exponential rate. Applying text classification algorithms to manage and organize this enormous text resource efficiently, and then to mine the rich and useful information contained in social media texts, is of considerable scientific and practical importance. However, the peculiarities of social media texts, such as their short length and irregular linguistic expression, pose significant challenges for current text classification methods. Previous researchers have proposed several viable solutions to the text classification problem, but several problems remain, primarily the following: (1) Manual feature engineering relies on specialist knowledge and consumes significant human and material resources, which hinders the adoption and deployment of text classification techniques. (2) When modeling text, sequential models treat text simply as a sequence and cannot capture the latent structured information in it. (3) The short length of social media texts easily leads to a lack of semantic integrity, and some current methods cannot learn a complete knowledge space from these texts. (4) Social media text data contains a considerable volume of comments that express users’ opinions and carry their subjective emotions, and some existing approaches cannot identify and classify these emotions correctly. To address these issues, this dissertation proposes a social media text classification method based on syntactic knowledge and sememe knowledge, approaching the problem from two angles: how the text is modeled and the knowledge space carried by the text 
classification model. The primary contributions of this dissertation are as follows: (1) A text classification method based on syntactic knowledge is proposed. Some existing solutions approach text classification by building large deep neural networks that model and learn the text sequentially. Although these methods have achieved some success, they disregard the fact that the underlying organization of text is not entirely sequential. In contrast to its serialized written form, text contains rich, position-independent structured information between words. This work presents a text classification approach based on syntactic structure modeling, with the goal of fully exploiting the structured information in text by incorporating syntactic knowledge. To obtain an embedding representation of the syntactic structure, this work employs the distributed tree embedding technique to recursively map syntactic trees and their subtrees to a high-dimensional vector space. To extract the latent features in the text more comprehensively, this dissertation also applies the Transformer encoder to learn the sequence information in the text. The sentence embedding is obtained by combining the structured and sequential information, and the classification result is then produced by a multilayer perceptron classifier. The experimental results show that the proposed model outperforms most traditional models in classification accuracy on four datasets, including a 1.66% improvement on the AG’s News dataset. Case studies were then used to examine the role that syntactic information plays in the text classification process. (2) A knowledge-based text classification method is proposed to address the subjective sentiment and the lack of semantic integrity in social media texts. Some methods rely excessively on complex deep neural network models and attempt to mine as much latent 
information as possible from short texts. However, learning and recognizing sentiment often relies on human knowledge, and the human knowledge carried by a text is limited. Hence, the knowledge space of a text classification model learned on a specific corpus is also limited. This dissertation therefore proposes incorporating sememe knowledge from external knowledge bases into text classification models, aiming to enrich the human knowledge contained in the models and thus improve the accuracy of sentiment recognition. In terms of natural language understanding, sememe knowledge is closer to the nature of language. To incorporate the sememe knowledge effectively, we employ the Sememe-LSTM model to integrate sememe information into the sentence-level embedding vector. Syntactic knowledge is also effective explicit knowledge that helps expand the knowledge space of the model, so we use a syntax encoder to learn and integrate it. The experiments demonstrate the effectiveness of the proposed method: on the SARC dataset, it outperforms the current state-of-the-art methods on four standard classification evaluation metrics. Ablation experiments and case studies show that the added semantic knowledge is useful and effective. Therefore, the research presented in this dissertation on social media text classification has important theoretical and practical value for addressing the imperfect modeling of social media text, the difficulty of recognizing emotions in text, and the inaccurate classification of semantically incomplete text. |
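
The fusion step described in contribution (1), where the structured and sequential representations are combined into a sentence embedding and passed to a multilayer perceptron classifier, might look roughly like the following. This is only a minimal sketch under stated assumptions, not the dissertation's implementation: the abstract does not say how the two representations are combined, so concatenation is assumed here; the two input vectors are random stand-ins for the outputs of the tree embedding and the Transformer encoder; and all names, dimensions, and the number of classes are hypothetical.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output value per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    # Subtract the maximum for numerical stability.
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Randomly initialised (untrained) weights and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def classify(syntax_vec, sequence_vec, hidden, output):
    # Assumption: the sentence embedding is the concatenation of both views.
    sentence_vec = syntax_vec + sequence_vec
    h = relu(linear(sentence_vec, *hidden))
    return softmax(linear(h, *output))


rng = random.Random(0)
syntax_vec = [rng.gauss(0, 1) for _ in range(8)]    # stand-in for the syntactic tree embedding
sequence_vec = [rng.gauss(0, 1) for _ in range(8)]  # stand-in for the Transformer encoder output
hidden = init_layer(16, 12, rng)                    # hypothetical hidden layer size
output = init_layer(12, 4, rng)                     # four hypothetical classes

probs = classify(syntax_vec, sequence_vec, hidden, output)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted, probs)
```

The weights here are untrained, so the predicted class is meaningless; the sketch only illustrates the data flow from two embeddings to a class distribution.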