Research On Microblog Topic Extraction Method Based On Text Semantic Information

Posted on: 2022-10-17
Degree: Master
Type: Thesis
Country: China
Candidate: Q Q Zou
GTID: 2518306506489664
Subject: Computer Science and Technology
Abstract/Summary:
Microblog is characterized by strong interactivity, diverse forms of expression, and varied content. It attracted a large number of users as soon as it was released: within just a few years the number of users reached 100 million, and it is still rising. It is one of the mainstream social media platforms in China. Users of different ages and fields have broadened the information dimension of microblog data, and complex social networks have increased the speed of information dissemination. These characteristics make microblog a key place for information sharing and dissemination, where massive amounts of data accumulate. In the era of big data, it is important to study how to obtain, store, and extract valuable information from these data. Topic extraction, which mines the topic information hidden in text, is one of the research directions of data mining. Accurately identifying the topic of a microblog text provides a solid foundation for downstream applications such as public opinion monitoring, public opinion guidance, information retrieval, and personalized recommendation, so the problem has research significance.

However, compared with traditional long texts, microblog texts are shorter, their language is not standardized, and their content is fragmented, so traditional topic extraction techniques cannot accurately obtain the topic of a microblog text. To solve this problem, this paper analyzes the characteristics of microblog data, studies existing techniques, and proposes the Microblog Topic Extraction method based on Text Semantic Information, which consists of the Text Clustering model based on Semantic Information and the Microblog Keyword Extraction model based on Directed Co-word Network. The Text Clustering model based on Semantic Information mines the deep semantics of texts, clusters the texts on that basis, and groups microblog texts with similar semantics into the same cluster so that they supplement one another's information; the Microblog Keyword Extraction model based on Directed Co-word Network uses the clusters obtained in the previous step to construct directed co-word networks and extracts core words to represent the text topic. The specific process is as follows.

First, the deep semantic information of each text is mined, the microblog text is represented by low-dimensional vectors that contain this information, and text clustering is performed. The Text Clustering model based on Semantic Information uses a Latent Dirichlet Allocation model with a weighting strategy to extract the topic probability distribution of each microblog text, yielding text-topic vectors that contain the text's topic distribution information. It uses the Word2Vec model to mine the latent semantic associations between words and obtain the corresponding word vectors, builds a matrix of word vectors to represent each text, and then applies Spatial Pyramid Pooling to the matrices to extract features at different scales, yielding low-dimensional text-word meaning vectors that contain the latent semantic associations between words. The text-topic vector and the text-word meaning vector are two types of low-dimensional vectors containing deep semantic information of the text. The model uses both together to represent the microblog text and to calculate text similarity, and the text clusters are then formed by the Single-Pass Clustering algorithm.

Second, the text clusters obtained in the previous step are used to construct directed co-word networks, and core words are extracted to represent the microblog topic. In the Microblog Keyword Extraction model based on Directed Co-word Network, each text cluster is treated as a unit and a single text is used as the co-occurrence window; the semantics, the order, and the co-occurrence frequency of words are jointly considered to construct the directed co-word networks. The model then uses the weight of each directed edge as the index for distributing importance, uses the PageRank algorithm to calculate the weight of each word, sorts the words by weight, and outputs several words with high importance to represent the text topic.

To test whether the proposed Microblog Topic Extraction method based on Text Semantic Information is effective, this paper implements the model framework and runs experiments on a real microblog dataset. The experimental results show that representing texts with text-topic vectors and text-word meaning vectors not only enriches the semantic information of the original vectors but also effectively reduces their dimensionality, improving the accuracy and efficiency of text clustering. Representing texts with directed co-word networks retains richer information, and evaluating word weights with the improved PageRank algorithm effectively raises the importance of words that are paired with important words and increases the weight of words that are more relevant to the topic, so that the finally extracted keywords represent the text topic more accurately.
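The abstract describes applying Spatial Pyramid Pooling to a matrix of word vectors so that texts of different lengths yield vectors of the same size. The sketch below is one possible reading of that step, not the thesis's implementation: it assumes max-pooling along the word axis only, and the function name `spatial_pyramid_pool` and the pyramid `levels` are hypothetical choices made for illustration.

```python
def spatial_pyramid_pool(matrix, levels=(1, 2, 4)):
    """Max-pool a (num_words x dim) matrix into a fixed-length vector.

    Assumption: pooling runs along the word axis only. For each pyramid
    level k, the rows are split into k contiguous bins and every bin is
    max-pooled column by column.
    """
    if not matrix:
        raise ValueError("matrix must contain at least one word vector")
    n, dim = len(matrix), len(matrix[0])
    pooled = []
    for k in levels:
        for b in range(k):
            # Bin boundaries; each bin covers at least one row even when n < k.
            start = (b * n) // k
            end = max(start + 1, ((b + 1) * n) // k)
            rows = matrix[start:end]
            pooled.extend(max(row[j] for row in rows) for j in range(dim))
    return pooled


# Two texts with different word counts map to vectors of equal length.
short_text = [[1.0, 2.0], [3.0, 0.0]]
long_text = [[float(i), float(-i)] for i in range(9)]
print(len(spatial_pyramid_pool(short_text)), len(spatial_pyramid_pool(long_text)))
```

The output length is `sum(levels) * dim` regardless of how many words the text has, which is what makes the pooled vectors directly comparable across microblog posts.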
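The Single-Pass Clustering step can be illustrated with a minimal sketch. It assumes each text is already represented by one vector (for example, a text-topic vector concatenated with a text-word meaning vector, which is an assumption about how the two are combined), that similarity is cosine similarity, and that clusters are summarized by a running-mean centroid. The `threshold` value is illustrative, not taken from the thesis.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def single_pass(vectors, threshold=0.8):
    """Process vectors once, in order; return clusters as lists of indices."""
    centroids, members = [], []
    for idx, vec in enumerate(vectors):
        best, best_sim = -1, threshold
        for c, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim >= best_sim:
                best, best_sim = c, sim
        if best < 0:
            # No existing cluster is similar enough: start a new one.
            centroids.append(list(vec))
            members.append([idx])
        else:
            members[best].append(idx)
            n = len(members[best])
            # Running mean keeps the centroid representative of all members.
            centroids[best] = [
                (c * (n - 1) + x) / n for c, x in zip(centroids[best], vec)
            ]
    return members


vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.95, 0.05]]
clusters = single_pass(vectors, threshold=0.8)
print(clusters)
```

Because every text is visited exactly once, the result depends on input order and on the threshold; both would need tuning on real microblog data.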
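The keyword-extraction step can be sketched as a directed co-word network scored with an edge-weighted PageRank. This is a simplified illustration under stated assumptions: an edge points from an earlier word to a later word in the same text, and its weight is only the co-occurrence count. The semantic component of the edge weight and the specific PageRank improvement described in the abstract are not reproduced here, and all function names are hypothetical.

```python
from collections import defaultdict


def build_directed_coword(texts):
    """Edge (u, v) counts how often u precedes v within a single text."""
    edges = defaultdict(float)
    for tokens in texts:
        for i, u in enumerate(tokens):
            for v in tokens[i + 1:]:
                if u != v:
                    edges[(u, v)] += 1.0
    return edges


def weighted_pagerank(edges, damping=0.85, iterations=50):
    """PageRank in which a node shares its rank in proportion to edge weight."""
    nodes = sorted({n for edge in edges for n in edge})
    out_weight = defaultdict(float)
    for (u, _), w in edges.items():
        out_weight[u] += w
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0.0)
        new_rank = {
            n: (1.0 - damping + damping * dangling) / len(nodes) for n in nodes
        }
        for (u, v), w in edges.items():
            new_rank[v] += damping * rank[u] * w / out_weight[u]
        rank = new_rank
    return rank


def top_keywords(texts, k=3):
    """Return the k highest-ranked words; ties are broken alphabetically."""
    rank = weighted_pagerank(build_directed_coword(texts))
    return sorted(rank, key=lambda n: (-rank[n], n))[:k]


# One cluster of tokenized texts; each text is its own co-occurrence window.
cluster = [
    ["flood", "rescue", "team"],
    ["flood", "rescue", "donation"],
    ["flood", "team"],
]
ranks = weighted_pagerank(build_directed_coword(cluster))
print(top_keywords(cluster, k=2))
```

With this edge direction, words that tend to follow many other words accumulate rank, so the choice of direction and of the weighting formula strongly shapes which words surface as keywords.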
Keywords/Search Tags: microblog, topic extraction, semantic information, text clustering, directed co-word network