
Research On Text Mining In Complex Information Networks

Posted on: 2018-09-13
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H F Sun
GTID: 1318330518997027
Subject: Computer Science and Technology
Abstract/Summary:
With the development of computer and internet technologies, our capacity to generate and collect data has grown substantially. Data mining helps transform vast amounts of data into useful knowledge and information by automatically extracting patterns that are implicitly stored in large databases. Text mining is an important research field within data mining, and text mining technology is now widely used in fields such as news and finance.

More and more information is composed of interrelated, multi-typed data, as in social media. In such information, much of the text is short, carrying little key information and a lot of noise. This dissertation refers to such data as complex information networks. Traditional text mining techniques perform poorly on complex information networks because they cannot learn the relationships between different types of data and use them to improve text analysis. This dissertation summarizes the author's main research on complex information networks, including semantic annotation of short text, embedding of multi-typed data, document clustering, and monitoring of target events. The major contributions are as follows.

(1) We propose a semantic annotation system for short text. To make text retrieval and indexing more accurate, several systems have been proposed that annotate the semantics of keywords using Wikipedia as a catalog. Short text, however, lacks sufficient content, which reduces annotation accuracy. To address this problem, we consider not only the correlation between the semantics of named entities but also the prior probability of each named entity's most commonly used sense.
The proposed method is more accurate than other existing algorithms and has O(n) time complexity, which allows the system to respond online in real time.

(2) We propose a multi-typed embedding algorithm for the complex information network environment. Vectorized representation is an important way of representing data and plays an essential role in many data mining applications. More and more applications are built on complex information networks, yet most traditional embedding methods are designed for single-typed data and cannot be applied directly to data with network structure. We propose an embedding method, Multi-Typed Data Embedding (MTDE), that produces vectorized representations of the data in a complex information network. Using a probabilistic model based on Gibbs sampling, it learns a latent space for each data type together with a multi-typed latent translational space. First, it embeds the objects in the network by considering both the relationships among data of the same type and the network structure. Second, it provides a translational space that makes data of different types comparable. MTDE can therefore be used to compare data of different types in further data mining applications. Our experiments on DBLP show that MTDE learns high-quality embeddings. Moreover, other data mining tasks built on MTDE, such as clustering, outperform the state-of-the-art methods.

(3) We propose a text clustering algorithm suited to the complex information network environment. Document clustering is a core problem in document-based research. Most research uses the bag-of-words model to represent documents. Conventional clustering algorithms seek the most representative feature space, reducing the high-dimensional space to improve clustering performance. However, many words are useless for the clustering task.
These words decrease the similarity between documents on the same topic, and dimensionality reduction techniques cannot avoid this negative effect. To address this, we propose a novel document clustering framework based on a newly defined document representation, called discriminative indexing based on centrality measure (DICM), to group documents into meaningful semantic categories. Unlike conventional clustering methods, DICM attempts to discover the most discriminative features for document representation. We use a centrality measure of terms in the term co-occurrence graph to identify each term's discrimination ability. Once the term features are transformed into a low-dimensional discriminative space, documents related to the same semantic definitions tend to lie close to each other. A series of evaluations shows that DICM is intuitively appealing and achieves superior experimental results compared with four other state-of-the-art algorithms on seven data sets in terms of clustering accuracy and mutual information. A comparison between term frequency-inverse document frequency (tf-idf) and discriminative indexing shows that DICM provides better discriminative power for document clustering.

(4) We propose a specific-event monitoring system for the complex information network environment. The proposed method uses Twitter as a real-time social sensor. To address the challenge of detecting a targeted event from fragmented and noisy tweets, we devise a probabilistic framework that integrates textual, temporal, and spatial information to identify the event. To improve the accuracy of outage detection, we propose a supervised topic model with a heterogeneous information network. The proposed technique is tested with real tweets and outage cases, and the numerical results demonstrate the effectiveness of the proposed methodology.
A comparison of the proposed method with a support vector machine and a statistical Bayesian method shows the accuracy of the developed model.
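
As an illustration of the idea behind contribution (3), the following is a minimal sketch of scoring terms by their centrality in a term co-occurrence graph. It is not the DICM algorithm itself: the abstract does not specify which centrality measure is used or how centrality is mapped to discrimination ability, so normalized degree centrality, whitespace tokenization, document-level co-occurrence, and the helper names `cooccurrence_graph` and `degree_centrality` are all assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_graph(docs):
    """Map each term to the set of terms it shares at least one document with.

    Assumption: two terms co-occur if they appear in the same document.
    """
    neighbors = defaultdict(set)
    for doc in docs:
        terms = set(doc.lower().split())  # naive whitespace tokenization
        for term in terms:
            neighbors[term]  # touch the key so isolated terms are kept
        for a, b in combinations(sorted(terms), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return dict(neighbors)


def degree_centrality(graph):
    """Normalized degree centrality: degree / (number of nodes - 1)."""
    n = len(graph)
    if n <= 1:
        return {term: 0.0 for term in graph}
    return {term: len(adj) / (n - 1) for term, adj in graph.items()}


# Toy corpus, invented for this example only.
docs = [
    "the network embedding model",
    "the document clustering model",
    "the outage event tweets",
]
centrality = degree_centrality(cooccurrence_graph(docs))
for term, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {score:.3f}")
```

In this toy corpus the term "the" co-occurs with every other term and therefore receives the highest centrality, while topic-specific terms such as "outage" are connected only to terms from their own document. How such scores would be turned into a discriminative document representation is left open here, since the abstract does not describe that step.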
Keywords/Search Tags: heterogeneous information networks, semantic annotation, text clustering, embedded representation, event monitoring