
Research On Text Processing And Mining Algorithms For Social Media

Posted on: 2023-06-22
Degree: Master
Type: Thesis
Country: China
Candidate: Q F Yao
Full Text: PDF
GTID: 2558306914472824
Subject: Computer Science and Technology
Abstract/Summary:
With the explosive growth of social media text, existing text processing and mining algorithms face several problems, such as low deduplication accuracy, high labeling cost, poor personalization, and a severe long-tail phenomenon. Building on the specific application scenarios of social media and on classical text processing and mining algorithms, the exploration of high-performance, high-accuracy algorithms for social media text has broad application prospects and important practical significance. Considering the characteristics of text in social media, this thesis designs an efficient text deduplication algorithm and a detection method for social media spam text, and further realizes a feature-fusion-based personalized news recommendation algorithm and an accurate social media text classification algorithm for long-tailed topics. The specific work completed is as follows:

(1) Efficient text deduplication. In view of the real-time processing requirements of social media text data, this thesis proposes a content- and structure-aware text deduplication algorithm based on locality-sensitive hashing (LSH). An LSH-based text representation is used to compute text similarity at high speed. In addition, heuristic deduplication strategies are designed around the content and structure characteristics of social media text, which further improve the deduplication results while still meeting the high-performance requirement. In terms of computational performance, compared with the semantic vector model, the proposed deduplication method is 17% and 13% faster on two datasets of different sizes, while maintaining the same accuracy.

(2) Spam text detection for social media. To detect all kinds of spam text in social media automatically, and to reduce labeling cost and subjective errors as much as possible, a likelihood-aware spam text detection algorithm based on a deep flow-based generative model is proposed. The deep flow-based generative model, originally designed for continuous data, is extended to discrete unstructured text, and normal text is modeled on the basis of text embeddings. The likelihood is used directly as the anomaly score, which avoids the need to define a separate anomaly score. On all classes of the public standard dataset 20 Newsgroups, the proposed anomalous text detection algorithm outperforms the currently popular deep anomaly detection model, with a maximum AUC improvement of 44%.

(3) Personalized news recommendation based on feature fusion. To capture users’ multiple interests and to model the interaction between the user behavior sequence and candidate news, this thesis proposes a personalized news recommendation algorithm based on feature fusion for multiple user interests. The feature fusion mechanism performs feature interaction through networks of multiple experts. On the open standard dataset MIND, the proposed news recommendation algorithm improves AUC, MRR, NDCG@5 and NDCG@10 by 2.1%, 2.8%, 2.7% and 1.3% respectively, compared with the currently popular personalized news recommendation model.

(4) Accurate social media text classification for long-tailed topics. In view of the long-tailed distribution of topics in social media text, this thesis takes the differing value of labels as its starting point and proposes a two-stage text classification algorithm based on self-supervised and semi-supervised learning. In the self-supervised training stage, the negative value of labels is considered, and the inherent label bias problem is overcome by learning label-free initial feature information. In the semi-supervised training stage, the positive value of labels is considered, and the information in unlabeled text data is fully mined. On an enterprise-level multi-topic dataset, the accuracy of the proposed two-stage algorithm is 3.4% higher than that of a standard TextCNN.
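
To make the idea behind contribution (1) concrete, the following is a minimal Python sketch of near-duplicate detection with a locality-sensitive fingerprint. It is an illustration only, not the algorithm of the thesis: the abstract does not say which hash family is used, so SimHash is assumed here, and the names `tokens`, `simhash`, `hamming`, `deduplicate` and the threshold `max_distance` are hypothetical. The thesis's content- and structure-aware heuristics are not reproduced.

```python
import hashlib
import re


def tokens(text):
    """Lowercase word tokens; a simple stand-in for a real social media tokenizer."""
    return re.findall(r"\w+", text.lower())


def simhash(text, bits=64):
    """Return a SimHash fingerprint (assumed hash family, bits <= 64).

    Texts sharing most tokens tend to get fingerprints that differ in few bits.
    """
    weights = [0] * bits
    for tok in tokens(text):
        # Take the first 8 bytes of MD5 as a stable 64-bit token hash.
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def deduplicate(texts, max_distance=3):
    """Keep a text only if its fingerprint is far from every kept fingerprint.

    max_distance is a hypothetical threshold that would need tuning per dataset.
    """
    kept, fingerprints = [], []
    for text in texts:
        fp = simhash(text)
        if all(hamming(fp, other) > max_distance for other in fingerprints):
            kept.append(text)
            fingerprints.append(fp)
    return kept


if __name__ == "__main__":
    posts = ["Big sale today, click here!", "big sale TODAY click here", "Weather is nice."]
    print(deduplicate(posts))
```

The linear scan over kept fingerprints is used only for brevity. A real-time system would more plausibly index fingerprints by bit blocks so that candidate near-duplicates can be looked up without comparing against every stored item.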
Keywords/Search Tags: social media, text deduplication, spam text detection, personalized news recommendation, text topic classification