Hierarchical Semantic Structure Based Text Stream Mining

Posted on: 2017-10-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: D Tu
Full Text: PDF
GTID: 1318330512983428
Subject: Computer Science and Technology
Abstract/Summary:
As a basic form of human communication, text is an important part of unstructured data. Compared with data in other forms, text data are usually more valuable, so automatic text analysis and mining has long been an active research topic in computer science. The amount of text on the internet keeps growing quickly and continuously, and this text can be viewed as a collection of text streams. Unlike traditional text data, text streams have two characteristics: 1) much of the data is of low quality, which makes semantic relations hard to extract; 2) the patterns in a text stream evolve dynamically, and a text stream mining method should detect these changes. These characteristics challenge existing text mining methods. Present text stream mining methods are incomplete, and new methods are needed to meet these challenges.

As a common way of organizing data, a hierarchical structure can reflect the inherent structure of the data more precisely. It is also an important way to implement a non-parametric method, which can adapt automatically to the evolving patterns in a text stream. This dissertation applies hierarchical structures and proposes three methods covering concept hierarchy construction, rare category detection, and topic modeling. In addition, a semi-supervised online hierarchical topic model built on these methods is proposed for analyzing text streams. The contributions are as follows:

1) To improve the quality of the semantic relations that existing concept hierarchy construction methods extract from short and informal data, e.g., blogs and reviews, a multi-way concept hierarchy construction method is proposed based on a compound semantic distance metric. The compound semantic distance combines the advantages of semantic dictionary distance and context distance, which secures both its scope of application and the quality of the relations. An improved multi-way agglomerative clustering method is also presented; it preserves the relative distances between concept pairs, which traditional agglomerative clustering methods do not. In addition, a concept hierarchy similarity metric is extended to solve the problem of duplicate matchings. The experimental results show that the proposed construction method generates concept hierarchies with higher similarity to the ground truth.

2) To detect new patterns in concept hierarchies or topic hierarchies, a rare category detection method is proposed that exploits hierarchical density-based clustering. Detecting novel documents or topics is often useful in social networks and news streams, and anomaly detection methods play an important role in mining novel data. To overcome the drawbacks of existing methods, a relative-constraints-based kernel mean shift clustering method, RKMS, is presented, which scales better and is better suited to hierarchical clustering scenarios than its original form. A new rare category detection method is then proposed based on RKMS. Compared with the baseline methods, it does not need a predefined number of classes and can gradually optimize its models by combining active learning and semi-supervised learning. The experimental results show that the proposed rare category detection method outperforms the baseline methods in both linear and non-linear cases.

3) To detect and track topics in a continuous text stream, an online hierarchical topic model, HONMF (hierarchical online non-negative matrix factorization), is proposed. Most existing online topic models arrange the latent topics they find in a flat structure and treat each discovered topic as a distinct element. This ignores the potential relationships between topics and limits the representational power of these methods. To address this, a hierarchical online sparse NMF is presented by extending online dictionary learning, and a mechanism drawing on the mean shift clustering method is proposed to control the structure of the topic hierarchy. In addition, metrics are proposed to detect emerging and fading topics in an existing topic hierarchy, and the evolution of the topic hierarchy can be traced with these metrics. The experimental results show that HONMF generates topics of better quality in less time and can track the evolution of topic hierarchies.

4) To validate the research line of this dissertation and improve the performance of HONMF, a semantic-relation-based semi-supervised hierarchical online NMF, SSHONMF, is proposed, which integrates the work above into a single process. First, it uses a semantic dictionary and training documents to generate a task-specific concept hierarchy, and adjusts the original input document matrix according to the semantic relations in that hierarchy. Second, it runs HONMF to detect topic hierarchies in a text stream and selects hint documents from the topic hierarchies according to the criteria of the proposed rare category detection method. Third, it uses the hint documents to learn a similarity metric and integrates that metric into the subsequent topic detection process. The experimental results show that SSHONMF achieves better topic quality than HONMF, which supports the rationality and effectiveness of the proposed research line.
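
As an illustration of the compound semantic distance described in contribution 1, the following Python sketch blends a dictionary-based distance with a context-based distance. The abstract does not state the combination rule, so everything specific here is an assumption made for illustration rather than the dissertation's actual formulation: the weighted average, the fallback to context distance when a word pair is missing from the dictionary, the weight `alpha`, and all names and toy data.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two sparse vectors (dict: term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # no context evidence: treat the pair as maximally distant
    return 1.0 - dot / (norm_u * norm_v)

def compound_distance(a, b, dict_dist, contexts, alpha=0.5):
    """Blend dictionary distance and context distance (assumed rule).

    dict_dist: maps frozenset({a, b}) -> distance in [0, 1] from a semantic dictionary.
    contexts:  maps word -> sparse co-occurrence vector built from the corpus.
    alpha:     hypothetical weight given to the dictionary distance.
    """
    ctx = cosine_distance(contexts.get(a, {}), contexts.get(b, {}))
    key = frozenset((a, b))
    if key in dict_dist:
        # Both sources are available: take a weighted average.
        return alpha * dict_dist[key] + (1.0 - alpha) * ctx
    # Informal words are often absent from the dictionary: use context only.
    return ctx

# Toy data, invented for this example.
DICT_DIST = {frozenset(("battery", "charger")): 0.2}
CONTEXTS = {
    "battery": {"life": 2.0, "charge": 1.0},
    "charger": {"charge": 2.0, "cable": 1.0},
    "selfie": {"camera": 1.0, "photo": 1.0},
}

if __name__ == "__main__":
    print(compound_distance("battery", "charger", DICT_DIST, CONTEXTS))
    print(compound_distance("battery", "selfie", DICT_DIST, CONTEXTS))
```

In this sketch the dictionary term contributes precision for words it covers, while the context term keeps the metric defined for out-of-dictionary words such as slang, which is one plausible reading of how the combination preserves both relation quality and scope of application.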
Keywords/Search Tags: text mining, concept hierarchy construction, hierarchical clustering, rare category detection, topic model