Research On Multi-granularity Microblog User Interest Portrait Based On NWD Integrated Algorithm

Posted on: 2021-02-27 | Degree: Master | Type: Thesis
Country: China | Candidate: S Zhang | Full Text: PDF
GTID: 2518306470964089 | Subject: Management Science and Engineering
Abstract/Summary:
With the advent of the mobile internet era, social media platforms such as Weibo have sprung up, and the number of connected users and the volume of user-generated data have grown explosively, giving rise to social media big data. How to mine the valuable information and knowledge contained in this data source has long been a focus of researchers in industry and academia, and the user interest portrait is one effective way to use it. For enterprises, the user interest portrait is the basis for personalized recommendation, precision marketing, service upgrades, and even strategic decisions; for users, it is an effective way to avoid information bombardment. Researching user interest portraits is therefore of significance and value to both enterprises and users. However, microblog texts are informal, short, and subject to "information overload", and microblog data are difficult to obtain, which has made it difficult to build microblog user interest portraits effectively. To solve or mitigate these problems, this thesis makes the following three contributions and draws some related conclusions:

(1) More than 20 million Sina microblog posts were collected with a web crawler, and an original user interest portrait dataset of more than 100 thousand microblogs was constructed using hashtags to support the experiments in this thesis. The dataset can also be used by other scholars in future research on microblog user interest portraits.

(2) A new word discovery (NWD) algorithm from the perspective of support is proposed to deal with the informality of microblog text. It identifies the ubiquitous internet phrases and new words, enabling more accurate word segmentation and semantic understanding. The experimental results show that the proposed algorithm outperforms the existing mainstream new word discovery algorithms based on pointwise mutual information and branch entropy.

(3) Taking the specific characteristics of microblog text into account, a supervised combinatorial algorithm framework integrating the NWD algorithm is proposed and named NWD-Bi-LSTM-XGBoost. It introduces the Simhash algorithm to tackle "information overload" and uses bidirectional long short-term memory networks to extract semantic features, addressing the conciseness of microblog text. The experimental results show that the macro-average F1 score and AUC value of the coarse-granularity (primary) interest tag model reach 88.1% and 83.8%, and those of the fine-granularity (secondary) interest tag model reach 74.5% and 67.4%, respectively, indicating that the NWD-Bi-LSTM-XGBoost framework can effectively construct multi-granularity microblog user interest portraits. Compared with other benchmark models, the macro-average F1 score and AUC value increase by 3% to 5% owing to the integration of the NWD algorithm, which is superior to the existing new word discovery methods. Additionally, compared with static word vectors trained by the skip-gram algorithm, the dynamic word vectors generated by BERT-Base perform better in multi-granularity microblog user interest portrait, with maximum improvements in macro-average F1 score (mF1) and AUC value of 4.5% and 4.1%, respectively.
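The abstract names Simhash as the step used against "information overload" but does not describe how it is applied. The following is a minimal sketch of generic Simhash near-duplicate detection, not the thesis's implementation. The function names, the MD5-based token hash, the 64-bit fingerprint width, and the Hamming-distance threshold of 3 are all assumptions made for illustration; the token lists are assumed to come from a word segmenter run beforehand.

```python
import hashlib
from collections import Counter


def simhash(tokens, bits=64):
    """Return a `bits`-bit Simhash fingerprint for a list of tokens.

    Assumption: tokens are already segmented words; term frequency is the weight.
    """
    weights = [0] * bits
    for token, count in Counter(tokens).items():
        # Stable per-token hash (MD5 truncated to `bits` bits; an illustrative choice).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each set bit votes +weight, each cleared bit votes -weight.
            weights[i] += count if (h >> i) & 1 else -count
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(tokens_a, tokens_b, max_distance=3):
    """Treat two posts as near-duplicates if their fingerprints are close.

    Assumption: the threshold of 3 is a common default, not a value from the thesis.
    """
    return hamming_distance(simhash(tokens_a), simhash(tokens_b)) <= max_distance
```

In a pipeline of this kind, posts whose fingerprints fall within the threshold would typically be collapsed to a single representative before feature extraction, so that reposts and lightly edited copies do not dominate a user's text.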
Keywords/Search Tags: New Word Discovery, Bidirectional Long Short-Term Memory, Extreme Gradient Boosting, Multi-granularity, Microblog User Interest Portrait