Research On Text Mining In Complex Information Networks

Posted on:2018-09-13

Degree:Doctor

Type:Dissertation

Country:China

Candidate:H F Sun

GTID:1318330518997027

Subject:Computer Science and Technology

Abstract/Summary:

With the development of computer and Internet technologies, our capabilities for both generating and collecting data have been substantially enhanced. Data mining can intelligently help us transform vast amounts of data into useful knowledge and information: it can automatically and conveniently extract patterns, which represent knowledge, that are implicitly stored or captured in large databases. Text mining is an important research field in data mining and is now widely used in fields such as news and finance.

More and more information is constructed from interrelated multi-typed data, as in social media. In such information, some of the text is short, with little key information and much noise; this paper defines such data as complex information networks. Traditional text mining techniques are not well suited to complex information networks, because they cannot learn the relationships between different types of data to improve text analysis. This paper summarizes the author's main research work on complex information networks, including semantic annotation of short text content, embedding of multi-typed data, document clustering, and monitoring of target events. Through in-depth study, we make the following major contributions:

(1) We propose a semantic annotation system for short text. To make text retrieval and indexing more accurate, several systems have been proposed that annotate the semantics of keywords using Wikipedia as a catalog. For short text, however, the lack of sufficient content reduces annotation accuracy. To solve this problem, we consider not only the correlation between the semantics of the named entities, but also the probability of the most commonly used semantics of each named entity.
The proposed method not only has better accuracy than other existing algorithms, but also has low time complexity, O(n), which allows the system to respond in real time online.

(2) We propose a multi-typed embedding algorithm for the complex information network environment. As an important way of representing data, vectorized representation plays an essential role in many data mining applications. More and more applications are based on complex information networks, yet most traditional embedding methods are based on single-typed data and cannot be directly applied to data with network structure. In this paper, we propose an embedding method, named Multi-Typed Data Embedding (MTDE), that produces vectorized representations of the data in a complex information network. It obtains a latent space for each data type and a multi-typed latent translational space through a probabilistic model based on Gibbs sampling. First, it embeds the objects in the network by considering not only the relationships within data of the same type, but also the network structure. Second, it provides a translational space that makes data of different types comparable. Thus, MTDE can be used to compare data of different types in more data mining applications. Our experiments on DBLP show that MTDE learns high-quality embeddings. Moreover, other data mining tasks based on MTDE, e.g., clustering, achieve better performance than state-of-the-art methods.

(3) We propose a text clustering algorithm for the complex information network environment. Document clustering is a core problem in document-based research. Most research uses the bag-of-words model to represent documents. Conventional clustering algorithms seek the most representative feature space, reducing the high-dimensional space to improve clustering performance. However, many words are useless for the clustering task.
They decrease the similarity between documents on the same topic, and dimensionality reduction techniques cannot avoid this negative influence. For this purpose, we propose a novel document clustering framework based on a newly defined document representation, called discriminative indexing based on centrality measure (DICM), to group documents into meaningful semantic categories. Unlike conventional clustering methods, DICM attempts to discover the most discriminative features for document representation. We use a centrality measure of terms in the term co-occurrence graph to identify each term's discriminative ability. After the term features are transformed into a low-dimensional discriminative space, documents related to the same semantic definitions are usually close to each other. A series of evaluations shows that DICM is intuitively appealing and achieves superior experimental results compared with four other state-of-the-art algorithms on seven data sets in terms of clustering accuracy and mutual information. A comparison between term frequency-inverse document frequency (tf-idf) and the discriminative indexing shows that DICM provides better discriminative power for document clustering.

(4) We propose a specific-event monitoring system for the complex information network environment. Twitter is used as a real-time social sensor in the proposed method. To address the challenges of detecting a targeted event from fragmented and noisy tweets, we devise a probabilistic framework that integrates textual, temporal, and spatial information to identify the event. To improve the accuracy of outage detection, we propose a supervised topic model with a heterogeneous information network. The proposed technique is tested with real tweets and outage cases. The numerical results demonstrate the effectiveness of the proposed methodology.
A comparison of the proposed method with a support vector machine and a statistical Bayesian method shows the accuracy of the developed model.
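
As a rough illustration of the idea behind contribution (3), the sketch below builds a term co-occurrence graph from a few documents and scores each term with a centrality measure. It is not the dissertation's DICM implementation: the abstract does not say which centrality measure is used or how the score is turned into a term weight, so normalized degree centrality, whitespace tokenization, document-level co-occurrence, and the function name `term_degree_centrality` are all assumptions made here for illustration.

```python
from collections import defaultdict
from itertools import combinations


def term_degree_centrality(documents):
    """Score terms by normalized degree centrality in a co-occurrence graph.

    Assumptions (not taken from the abstract): two terms are linked when
    they appear in the same document, tokens are whitespace-separated and
    lowercased, and centrality is degree / (number of terms - 1).
    """
    neighbors = defaultdict(set)
    for doc in documents:
        terms = set(doc.lower().split())
        for term in terms:
            neighbors[term]  # register the term even if it has no neighbors
        for a, b in combinations(sorted(terms), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)

    n = len(neighbors)
    if n < 2:
        return {term: 0.0 for term in neighbors}
    return {term: len(adj) / (n - 1) for term, adj in neighbors.items()}


if __name__ == "__main__":
    # Toy documents invented for this example.
    docs = [
        "power outage downtown",
        "power outage storm",
        "football match downtown",
    ]
    scores = term_degree_centrality(docs)
    for term in sorted(scores, key=scores.get, reverse=True):
        print(term, scores[term])
```

In this toy corpus, "downtown" co-occurs with terms from both topics and therefore receives the highest centrality. Whether high or low centrality should count as discriminative, and how the scores feed the low-dimensional representation, is defined in the full text rather than in the abstract.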

Keywords/Search Tags:

heterogeneous information networks, semantic annotation, text clustering, embedded representation, event monitoring

Related items

1	A Study And Implementation Of Semantic Annotation For Chinese Text
2	Research On Information Retrieval Of Heterogeneous Information Networks
3	Design And Implementation Of Semantic Analysis And Annotation Function For Restful Services
4	Study On Routings Of Large-scale WSN For Event Monitoring Applications
5	Research On Event-Semantic-Oriented Network Representation Learning In Heterogeneous Information Networks
6	Network Resources Annotation Based On Chinese FrameNet Ontology
7	Study Of Chinese Event Information Extraction Based On Hownet Semantic Relation
8	The Full-Text Semantic Annotation System Based On Chinese Wikipedia
9	Heterogeneous Information Based Financial Event Detection
10	Event-oriented Text Knowledge Discovery And Representation