Research On Some Key Issues In Short Text Information Extraction

Posted on: 2017-05-19
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L Z Zheng
GTID: 1108330485951626
Subject: Computer application technology
Abstract/Summary:
With the rapid development of information technology, the Internet has become an increasingly important part of people's everyday lives. It produces a vast volume of information that is growing explosively, and how to access, store, and make full use of such data effectively has become a major research issue. Among this mass of Internet information, one type is growing at a remarkable speed: messages published by users on social network platforms. Compared with traditional web articles, this type of information has several new characteristics, which we summarize as short text length, irregular grammar, and arbitrary language use. We call such information "short text" information. The most representative short texts are microblog posts, product reviews, and BBS forum comments. The emergence of short texts brings both opportunities and challenges to traditional Web information extraction tasks. The massive and wide-ranging information stored in short texts makes information extraction on them valuable; representative tasks include event extraction and event analysis, sentiment analysis, and knowledge graph mining. However, the characteristics of short texts pose challenges to these traditional tasks, so proposing new methods to extract valuable information from short texts has become a research focus in recent years.

In this dissertation we focus on several key issues in short text information extraction: microblog event extraction, microblog event semantic element extraction, and product review sentiment analysis. Microblog event extraction aims to extract events from microblogs according to user requirements. The large number of users on microblog platforms produces a vast volume of microblog posts around the clock. These posts contain a great number of events, which makes microblog platforms a better medium for spreading news than traditional news publishers. From this point of view, finding effective ways to extract events from microblogs is a meaningful task. For an extracted microblog event, how to describe the event completely and intuitively is another important issue. We use the concept of 5W1H from journalism studies to describe an event, since this representation can fully describe a microblog event. Extracting the 5W1H semantic elements of an event from microblog texts, which are full of arbitrary writing styles, is therefore of great value. Product review sentiment analysis aims to extract the sentiment tendency in product reviews published by users. Online shopping is now the first choice for many consumers. Mining the sentiment tendency in product reviews can not only help consumers make decisions but also guide sellers in improving their products and increasing profits.

We propose a series of solutions to the key issues described above. Our contributions can be summarized as follows:

1. For microblog event extraction, we observe that named entities are the key component of an event. The distribution of named entity types may vary across different kinds of events, so using named entity information may improve event extraction performance. Based on this observation, we define the "event type" as the probability distribution over the different types of named entities in an event. We propose a machine learning based method to extract event types from microblogs automatically, and then extract microblog events with clustering methods. Using the extracted event types during event clustering improves clustering performance.

2. To address the problem that existing event representations cannot completely describe an event, we use the 5W1H semantic elements for event description. Because of the characteristics of microblog texts, traditional methods perform poorly on microblog event semantic element extraction, so we propose new methods for this task. For the When and Where elements, we propose a granularity-based coarse-to-fine method that considers time and location information at different granularities and extracts semantic elements from coarse granularities to fine ones. For the Who, What, and Whom elements, we propose a term clustering and linking method. This method addresses the problem that the same entity may be expressed in various ways by clustering the terms within each sentence constituent. We then link clusters from different sentence constituents into semantic element tuples using term co-occurrence information. Our methods improve event semantic element extraction because they resolve the problem of entity expression variation to some extent.

3. For product review sentiment analysis, we find that a user may hold different sentiment tendencies toward different product attributes within a single review. Traditional sentiment analysis at the level of sentences or paragraphs is therefore no longer applicable to product reviews. We propose a framework for multi-dimensional sentiment analysis of e-commerce reviews to resolve this issue. Given an input product review, we first split the review text into short sentences using a convolutional neural network based method. After splitting, each short sentence describes at most one product attribute. We map each short sentence to a product dimension and then perform sentiment classification under that dimension. To handle the problem that the same sentiment word may show different sentiment polarities under different product dimensions, we use a semi-automatic approach to construct a dimensional sentiment lexicon, which improves the performance of product sentiment analysis.

The studies in this dissertation resolve, to some extent, the problems that the characteristics of short texts cause in information extraction tasks. We propose several new methods: a microblog event type extraction method, an event type-based microblog event extraction method, a microblog event 5W1H semantic element extraction method, and a multi-dimensional product review sentiment analysis method. We validate our methods on real datasets. Our studies can offer new references for information extraction techniques on short texts.
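
The definition of "event type" given in the first contribution, a probability distribution over the named entity types in an event, can be illustrated with a minimal Python sketch. This is not the dissertation's implementation: the entity label set, the function names, the uniform fallback for events without entities, and the use of cosine similarity to compare two event types are all assumptions made for illustration, and the abstract does not say how event types enter the clustering step.

```python
from collections import Counter
from math import sqrt

# Hypothetical label set; the abstract does not list the entity types used.
ENTITY_TYPES = ("PERSON", "LOCATION", "ORGANIZATION", "TIME")


def event_type(entity_labels, types=ENTITY_TYPES):
    """Return the distribution of named entity types observed in one event."""
    counts = Counter(label for label in entity_labels if label in types)
    total = sum(counts.values())
    if total == 0:
        # Assumption: fall back to a uniform distribution when no entity is found.
        return {t: 1.0 / len(types) for t in types}
    return {t: counts[t] / total for t in types}


def cosine_similarity(p, q):
    """Compare two event-type distributions (an assumed similarity measure)."""
    dot = sum(p[t] * q[t] for t in p)
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)


# Entity labels as they might come from a named entity tagger (made-up data).
sports_event = event_type(["PERSON", "PERSON", "ORGANIZATION", "LOCATION"])
disaster_event = event_type(["LOCATION", "LOCATION", "TIME", "ORGANIZATION"])

print(sports_event)
print(cosine_similarity(sports_event, disaster_event))
```

In a clustering pipeline, a score like this could be combined with text similarity so that posts whose entity-type profiles differ strongly are less likely to be merged into one event; that combination is likewise an assumption, not a detail stated in the abstract.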
Keywords/Search Tags: Information Extraction, Short Texts, Microblog Event Extraction, 5W1H, Sentiment Analysis, Product Reviews, Sentiment Word Enlargement