With the development of the internet, a huge amount of text has accumulated online, and analyzing it is an important natural language processing task. Topic modeling is a common family of methods for such analysis: these models construct a joint probability distribution over documents, topics, and words, and generate topics according to that distribution. The resulting topics are widely used in applications such as search engines, knowledge graphs, advertisement recommendation, and public opinion analysis. Short texts from social media are one common kind of text on the internet. Unlike regular texts, their average length is no more than 20 words. Conventional topic models perform poorly on such texts because they rely on word co-occurrence information to sample topics, and this information is very sparse in short texts. To overcome this problem, researchers have proposed various topic models specialized for short texts, but none of these methods provides word co-occurrence information that is both sufficient and semantically related. Among them, only self-aggregated methods can provide sufficient word co-occurrence information, but they cannot avoid incorporating non-semantic word co-occurrences. Therefore, we adopt the idea of self-aggregation and propose a series of methods that provide sufficient and semantically related word co-occurrence information.

1. Self-aggregated topic models must explicitly define the number of long (pseudo-)documents, and an inappropriate choice leads to poor performance: if the number is too large, the model cannot provide sufficient word co-occurrence; if it is too small, the model incorporates much non-semantic word co-occurrence information. We therefore use the Dirichlet process to determine the number of long documents automatically, so that an appropriate number is chosen according to the scale of the short-text corpus. We further analyze short texts and discover that the semantic relationships between them exhibit power-law behavior. Inspired by this phenomenon, we propose a Pitman-Yor process self-aggregated topic model (PYSTM), which aggregates short texts following a power-law distribution through a Pitman-Yor process. Compared to the multinomial distribution, the power-law distribution reflects the inner structure of short texts and avoids aggregating dissimilar short texts together. Experimental results show that our model is superior to other state-of-the-art methods.

2. PYSTM aggregates short texts following a power-law distribution, but every short-text corpus contains some texts that do not follow this distribution, and these texts introduce noise into PYSTM. To overcome this problem, we propose a nested Dirichlet process topic model incorporating document embeddings (DESTM), which uses document embeddings instead of the power-law distribution to aggregate short texts. Document embeddings are computed from the short texts and the words they contain, so each short text is converted into a vector, and the similarity between short texts is represented by the distance between their vectors. However, because word co-occurrence information is sparse, the document embeddings of short texts contain a lot of noise. We therefore discard this noise by decomposing the document embeddings into global and local semantic information: the global semantic information is the similarity probability distribution over the entire corpus, and the local semantic information is the set of distances between similar short texts. We then define a threshold and discard distances below it. Finally, we adopt a nested Dirichlet process to incorporate these two kinds of information. Experimental results show that our model performs better than other methods, including PYSTM.

3. DESTM incorporates document embeddings, but because word co-occurrence in short texts is sparse, the document-embedding information is itself insufficient: the semantic information about short texts is still not enough, and the model cannot avoid incorporating non-semantic word co-occurrence information. To overcome this problem, we propose a mixture model (WDETM) that mixes local and global embeddings, combining document embeddings and word embeddings through a probability distribution. Document embeddings aggregate short texts that are semantically similar, while word embeddings aggregate short texts that contribute more semantically related word co-occurrences, so the mixture of the two kinds of embeddings provides more sufficient information. WDETM also incorporates word embeddings into the sampling procedure for the topic-word distribution. However, word embeddings are computed from word co-occurrence information, and its sparsity reduces the similarity between words; we therefore define a threshold and discard similarities below it. Finally, we use the nested Dirichlet process to incorporate the mixture embeddings and the Polya urn model to incorporate the word embeddings. Experimental results show that our model is superior to DESTM, PYSTM, and other state-of-the-art methods.
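To make the self-aggregation idea concrete, the following is a minimal sketch of the Pitman-Yor (Chinese-restaurant-style) assignment prior that underlies PYSTM-style aggregation. It simulates only the prior (topic likelihood terms are omitted), and the function name and parameter defaults are illustrative assumptions, not the thesis implementation: each short text joins an existing pseudo long document k with probability proportional to (n_k − d), or opens a new one with probability proportional to (α + d·K), so the number of pseudo-documents grows automatically with corpus size and their sizes follow a power law.

```python
import random


def pitman_yor_assign(num_texts, alpha=1.0, d=0.5, seed=0):
    """Illustrative Pitman-Yor CRP prior for self-aggregation.

    Assigns each short text to a pseudo long document:
      - existing pseudo-document k: weight (n_k - d)
      - new pseudo-document:        weight (alpha + d * K)
    where n_k is the current size of pseudo-document k and K is the
    current number of pseudo-documents. With discount d > 0 the
    resulting size distribution is heavy-tailed (power-law-like).
    """
    rng = random.Random(seed)
    counts = []        # counts[k] = number of short texts in pseudo-doc k
    assignments = []   # assignments[i] = pseudo-doc index of short text i
    for _ in range(num_texts):
        # Weight of each existing pseudo-document, plus one slot for "new".
        weights = [n - d for n in counts] + [alpha + d * len(counts)]
        r = rng.random() * sum(weights)
        acc = 0.0
        for k, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if k == len(counts):   # the "new pseudo-document" slot was drawn
            counts.append(1)
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments, counts
```

Note that the full model would multiply each prior weight by the likelihood of the short text's words under the pseudo-document's topics; this sketch only shows how the Pitman-Yor prior lets the number of long documents adapt to the corpus scale instead of being fixed in advance.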