
Research Of Topic Evolution Analysis On Short Text Streams

Posted on: 2020-01-27
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W Gao
Full Text: PDF
GTID: 1368330590954123
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, applications such as Weibo, e-commerce platforms, forums, and WeChat have become increasingly popular. Many of the data streams these platforms produce are generated dynamically and updated in real time. These rapidly updated short text streams create an urgent need for an effective tool that monitors topics and their evolution in real time, which is of great significance for public opinion guidance, social network analysis, hot event detection, emerging topic tracking, and so forth. Topic evolution analysis for short text streams is an important method for extracting topics and their evolution process. It can analyze hot events on social media in real time, thus helping monitoring departments respond in a timely manner.

However, because short texts are brief and use non-standard terminology, traditional topic evolution analysis methods face the following four problems:
(1) Storing and mining large volumes of short text streams with high redundancy and noise is time- and memory-consuming.
(2) Existing pseudo-document-based topic models usually need external auxiliary information, so they cannot be generalized to more general forms of short texts.
(3) Because short texts are limited in length and their context features are sparse, topic modeling for short texts is ineffective.
(4) Existing topic evolution analysis methods mainly focus on long texts such as news and web pages, and use only text features to measure the relevance of topics. Short texts, however, lack textual features, so it is difficult to measure the relationship between topics accurately by textual features alone.

This dissertation proposes a framework for topic evolution analysis of short text streams, which attempts to solve the problems of compression sampling, clustering, topic extraction, and topic evolution analysis. Its main contributions can be summarized in the following four points:

(1) To address the massive volume and low quality of short text streams, this dissertation proposes a compression sampling framework for short text streams based on compressed sensing. The framework first selects high-quality short texts through a greedy algorithm based on Shannon entropy, then compresses the short text streams using compressed sensing theory, and finally restores them using a redundant dictionary. A MapReduce-based parallel algorithm is also proposed to improve compression efficiency. Experimental results on large-scale short text datasets show that the proposed method outperforms baseline methods in terms of running time, compression ratio, and other measures, and that its output can be used directly in short text analysis tasks.

(2) To address the poor generalization ability of existing pseudo-document-based topic models, this dissertation proposes a short text clustering method based on word embeddings. First, a new short text similarity measure is proposed, which decomposes the distance between short texts into sparse distances between words and thereby captures semantically related word pairs. Second, building on the K-medoids algorithm, a clustering algorithm referred to as K-same is proposed, which assigns the same number of short texts to each cluster. This algorithm further alleviates the sparsity problem and lays the foundation for high-quality topic extraction. Experimental results show that the proposed method is a general solution to the sparsity problem of short text topic modeling.

(3) To address the sparseness of short texts and the poor performance of traditional topic models on them, this dissertation proposes a new topic model. The model uses global semantic correlations to encourage related words to share the same topic label, which improves the coherence of the learned topics. Local semantic correlations are used to identify the senses of ambiguous words, so that irrelevant words can be filtered out. Experimental results on two real-world short text datasets show that the proposed model outperforms other models on several evaluation metrics, such as topic coherence and text classification accuracy.

(4) To address the lack of correlation analysis among topics and the poor interpretability of existing topic evolution analysis methods, this dissertation proposes a new topic evolution model, OCCTM. The model first divides short texts into their corresponding time slices, then extracts high-quality topics and the relationships between topics in each time slice. Finally, the topic evolution relationship between different time slices is measured by KL divergence, and the topic evolution graph is generated automatically. Experimental results on real-world short text datasets show that the topics extracted by OCCTM are of higher quality than those of state-of-the-art models, and that the topic evolution graph generated by OCCTM helps people quickly understand the evolution of hot events and the relationship between core topics and subtopics.
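
The last step of contribution (4), measuring topic evolution between time slices by KL divergence, can be illustrated with a minimal sketch. This is not the OCCTM implementation, which the abstract does not specify. The function names, the smoothing constant `eps`, the direction of the divergence, and the linking threshold are all assumptions made for illustration. Each topic is represented here as a probability distribution over a shared vocabulary, and two topics in adjacent time slices are linked when their divergence falls below the threshold.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary.

    eps is an assumed smoothing constant that avoids division by zero
    when q assigns zero probability to a word that p uses.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def link_topics(prev_topics, next_topics, threshold=0.5):
    """Return (i, j, divergence) edges between topics of adjacent time slices.

    An edge is created when KL(prev_topics[i] || next_topics[j]) is below
    the threshold (a hypothetical value; a real system would tune it).
    """
    edges = []
    for i, p in enumerate(prev_topics):
        for j, q in enumerate(next_topics):
            d = kl_divergence(p, q)
            if d < threshold:
                edges.append((i, j, d))
    return edges


# Toy topic-word distributions over a four-word vocabulary.
slice_t = [[0.7, 0.2, 0.1, 0.0], [0.0, 0.1, 0.2, 0.7]]
slice_t1 = [[0.6, 0.3, 0.1, 0.0], [0.25, 0.25, 0.25, 0.25]]

edges = link_topics(slice_t, slice_t1)
for i, j, d in edges:
    print(f"topic {i} (slice t) -> topic {j} (slice t+1), KL = {d:.4f}")
```

In this toy input only topic 0 of the earlier slice is linked to topic 0 of the later slice; the remaining pairs diverge too much to form an edge. Because KL divergence is asymmetric, a symmetric variant such as Jensen-Shannon divergence could be substituted if the direction of comparison should not matter.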
Keywords/Search Tags: Short text stream, topic extraction, topic evolution analysis, topic model, compressed sensing, Transformer model