With recent advances in deep learning, text mining, an important sub-field of software engineering, is gradually moving toward intelligent automation. However, limited by traditional information processing and mining techniques, its progress remains slow. Since the Web 2.0 era in particular, large amounts of information have been conveyed by text published on the Internet, such as newswire on online news websites, entries in online encyclopedias, tweets on social media, and product reviews on online shopping platforms. These texts often contain various semantic elements, such as "topics", "product aspects", and "events". Reading and analyzing such corpora manually costs too much manpower, and manual analysis cannot keep up with the growing scale of text corpora. Therefore, approaches that automatically mine hidden topics and events from large-scale unstructured and unannotated text will help make the text mining field more intelligent, and will further promote the transformation of the entire software engineering discipline from informatization to intelligentization.

Topic models aim to discover knowledge from text automatically. As an important technique for mining and understanding text content, topic modeling has been successfully applied to various software engineering tasks, such as information extraction and text mining. However, traditional topic models still face several challenges: 1) they use only the word co-occurrence information contained in the text, and it is not easy to incorporate external semantic knowledge into the modeling process; 2) they are usually learned with approximate inference methods, such as mean-field variational inference and collapsed Gibbs sampling, which require sophisticated derivation and have limited extensibility; 3) most of them can only mine independent topics and cannot accurately capture the semantic correlations between the extracted topics, which is not conducive to a macro-level understanding of a text corpus.

To address these challenges, this paper takes word embeddings trained by neural word representation techniques as external semantic knowledge, and uses the Weighted Polya Urn (WPU) scheme and generative neural networks as the main learning frameworks to devise topic models for extracting topics, product aspects, and events from online texts. The main work and contributions are summarized as follows:

(1) To address the limitation that traditional topic models use only word co-occurrence information and thus cannot obtain high-quality topics, this paper devises a novel Weighted Polya Urn (WPU) sampling scheme and further proposes WPU-LDA by incorporating WPU into the learning framework of Latent Dirichlet Allocation. With the help of word embeddings trained by a neural word representation technique and the WPU sampling strategy, WPU-LDA dynamically considers the semantic relations between each word and the different topics during inference, and groups words with similar semantic meanings into the same topic to improve topic quality.

(2) To address the limitation that the inference algorithms of traditional topic models require sophisticated derivation and that such models often have limited extensibility, this paper proposes, under the framework of Generative Adversarial Nets, a novel Adversarial-neural Topic Model (ATM), which is the first attempt to apply adversarial training to neural topic modeling. ATM utilizes a generator network to build the projection from the document-topic distribution to the document-word distribution, and a discriminator network to distinguish real inputs from fake ones. During adversarial training, the output signal of the discriminator helps the generator automatically capture the semantic patterns behind the text. Meanwhile, ATM can generate a semantic representation for each word in the vocabulary.

(3) To address the limitations that traditional topic models need sophisticated inference procedures and that ATM cannot provide document-topic distributions for unseen text, this paper proposes, under the framework of Bidirectional Generative Adversarial Nets, a novel Bidirectional Adversarial-neural Topic (BAT) model. By incorporating an encoder network from the document-word distribution to the document-topic distribution, BAT overcomes the drawback of ATM and is better suited to downstream tasks such as text clustering. To further improve topic quality and model topic correlations, BAT is extended to the Bidirectional Adversarial Topic model with Gaussian (Gaussian-BAT). Gaussian-BAT models each topic as a multivariate Gaussian in the word embedding space to incorporate word relatedness, and the correlations between topics can be captured by these multivariate Gaussian distributions.

(4) To address the limitations that traditional topic models often obtain low-quality topics, need sophisticated inference procedures, and cannot model topic correlations accurately, this paper proposes a Variational Gaussian-neural Topic Model (VaGTM) based on the variational autoencoder. VaGTM models each topic as a multivariate Gaussian in the word embedding space to incorporate external word relatedness in the decoder network, and the semantic correlations between topics can also be captured by these topic-associated Gaussians. Furthermore, to address the issue that the word embeddings of topic-associated words may not exactly follow a multivariate Gaussian, VaGTM is extended to the Variational Gaussian-neural Topic Model with Invertible neural Projections (VaGTM-IP). VaGTM-IP incorporates a flow-based invertible neural projection that transforms word embeddings into a new representation space better suited to neural topic modeling, further improving the quality of extracted topics.

(5) Finally, to further validate that adversarial-training-based neural topic models have good extensibility, this paper proposes a novel Adversarial-neural Event (AEM) model to extract open-domain hot events from event texts (tweets and news articles). In AEM, each event is defined as a 4-tuple <date, location, person, keyword>, and each element is represented by a topic. To mine hot events, AEM utilizes a generator network to build the projection from the document-event distribution to the document-entity, document-location, document-date, and document-keyword distributions. It also uses a discriminator network to provide the supervision signal that guides the training of the generator. After adversarial training, the generator can mine the event-related entity, location, date, and keyword distributions from the text. Moreover, thanks to GPU acceleration, AEM achieves higher execution efficiency than traditional event extraction models.
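The intuition behind the WPU sampling scheme in contribution (1) can be illustrated with a minimal sketch: when a word is assigned to a topic during Gibbs sampling, semantically similar words (measured by embedding cosine similarity) also receive a fractional "extra ball" in that topic's urn. The toy embeddings, the similarity threshold, and the exact promotion rule below are illustrative assumptions, not the thesis's precise formulation.

```python
import math

# Toy 2-d word embeddings (illustrative; real models use trained vectors).
emb = {
    "apple":  [1.0, 0.1],
    "orange": [0.9, 0.2],
    "stock":  [0.1, 1.0],
}

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def wpu_increment(topic_word_counts, topic, word, sim_threshold=0.8):
    """Weighted Polya Urn update (sketch): assigning `word` to `topic` also
    promotes semantically similar words in that topic, weighted by
    embedding similarity, instead of incrementing a single count."""
    for other, vec in emb.items():
        if other == word:
            topic_word_counts[topic][other] += 1.0      # standard count
        else:
            s = cos(emb[word], vec)
            if s >= sim_threshold:
                topic_word_counts[topic][other] += s    # weighted extra count

counts = {0: {w: 0.0 for w in emb}}
wpu_increment(counts, 0, "apple")
# "orange" gains a fractional count because its embedding is close to
# "apple"; "stock" is unaffected.
```

This is how semantically related words end up grouped into the same topic: the urn is biased so that observing one word raises the topic probability of its embedding neighbors.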
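The generator/discriminator wiring described in contribution (2) can be sketched in NumPy under heavy simplifying assumptions: a single linear layer with softmax as the generator, a linear logistic scorer as the discriminator, and toy dimensions. All sizes and variable names are illustrative; the actual ATM networks are deeper and are trained adversarially.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V = 4, 10            # number of topics, vocabulary size (toy sizes)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Generator: projects a document-topic distribution theta onto a
# document-word distribution over the vocabulary.
W_gen = rng.normal(size=(V, K))
def generator(theta):
    return softmax(W_gen @ theta)

# Discriminator: scores a document-word distribution; during training its
# output signal would push the generator toward real corpus statistics.
w_dis = rng.normal(size=V)
def discriminator(p_word):
    return 1.0 / (1.0 + np.exp(-(w_dis @ p_word)))

theta = rng.dirichlet(np.ones(K))   # topic proportions from a Dirichlet prior
fake = generator(theta)             # generated document-word distribution
score = discriminator(fake)         # "real vs. fake" score in (0, 1)
```

The sketch shows only the forward pass; adversarial training alternates gradient updates so the discriminator separates real and generated word distributions while the generator learns to fool it.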
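Contributions (3) and (4) both represent a topic as a multivariate Gaussian over word embeddings; a topic's word distribution is then the Gaussian density evaluated at each word's embedding, normalized over the vocabulary. The following sketch uses toy 2-d embeddings and hand-picked means and covariances purely for illustration.

```python
import numpy as np

# Toy 2-d embeddings for a 4-word vocabulary: words 0-1 cluster together,
# as do words 2-3 (illustrative values).
emb = np.array([[0.0, 0.0],
                [0.1, 0.0],
                [3.0, 3.0],
                [3.1, 2.9]])

def gaussian_density(x, mu, cov):
    d = x - mu
    inv = np.linalg.inv(cov)
    norm = 1.0 / np.sqrt(((2 * np.pi) ** len(mu)) * np.linalg.det(cov))
    return norm * np.exp(-0.5 * d @ inv @ d)

def topic_word_dist(mu, cov):
    """Topic-word distribution in the Gaussian-BAT / VaGTM style (sketch):
    evaluate the topic's Gaussian at every word embedding, then normalize."""
    dens = np.array([gaussian_density(e, mu, cov) for e in emb])
    return dens / dens.sum()

# One topic centered near words 0-1, another near words 2-3.
t0 = topic_word_dist(np.array([0.0, 0.0]), np.eye(2) * 0.5)
t1 = topic_word_dist(np.array([3.0, 3.0]), np.eye(2) * 0.5)
```

Because topics live in the same embedding space, the overlap (or distance) between their Gaussians directly encodes topic correlation, which is how these models capture semantic relations between topics.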
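The AEM generator in contribution (5) differs from the topic-model generators in that it maps one document-event distribution to four element distributions, one per slot of the <date, location, person, keyword> tuple. A minimal sketch with one illustrative linear projection per slot (all sizes and names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
E = 3                                    # number of events (toy size)
sizes = {"date": 5, "location": 6, "person": 7, "keyword": 8}  # toy vocabularies

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# One projection matrix per tuple element; a shared document-event
# distribution drives all four output distributions.
W = {name: rng.normal(size=(n, E)) for name, n in sizes.items()}

def aem_generator(doc_event):
    """Map a document-event distribution to the four element
    distributions of the <date, location, person, keyword> tuple."""
    return {name: softmax(W[name] @ doc_event) for name in sizes}

out = aem_generator(rng.dirichlet(np.ones(E)))
```

As in ATM, a discriminator (omitted here) would score the concatenated real and generated distributions and supply the training signal for these projections.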