
Text Representation And Classification With Deep Learning

Posted on: 2017-01-23    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Y Yan    Full Text: PDF
GTID: 1108330485450013    Subject: Computer Science and Technology
Abstract/Summary:
With the widespread application of information technology, textual information is growing explosively, and how to obtain valid information from so many sources has become a central concern. The extraction and classification of textual content are the basic approaches to managing textual information, and among the related studies, text representation is of great importance to text classification. However, traditional text representations ignore semantic relations, introduce many manual factors during feature selection, and suffer from high dimensionality and high sparsity, so they cannot effectively capture high-level textual information. Moreover, the diversity of forms of textual information, the richness of text topics, and the uneven distribution of datasets bring great challenges to text categorization, and conventional classification methods are no longer adequate. Therefore, the design of new high-level semantic representations and text classification algorithms has become a hot topic.

In recent years, deep learning, with its hierarchical networks that extract features from low-level information, has become a good solution to these problems, providing strong support for extracting high-level text representations and building efficient text classification models. Deep learning has made significant progress in image, speech, and natural language processing and has demonstrated its potential value. Therefore, this dissertation studies text representation and text classification based on deep learning models and obtains the following results:
1. An HDBN model for text representation and classification based on deep belief networks. Bag-of-words (BOW) representations contain only word-frequency information, so we explore word-embedding-based representations of text keywords to address the high dimensionality and high sparsity of BOW features. On top of this semantic representation, we propose a new model called HDBN (Hybrid Deep Belief Network), which integrates DBN and DBM to better learn high-level text representations. Results on text categorization and text retrieval, as well as two-dimensional visualizations, show that the high-level text representations learned by the HDBN model perform well.

2. A B-CNN model, based on a CNN fused with a DBM, for text representation and classification. Because biomedical abstracts contain many technical terms and abbreviations, are short, and cover varied topics, we design a new input representation of text (DSE) that extends documents with Wikipedia and named-entity features and embeds them as word vectors. On this semantic representation we propose the B-CNN model: a CNN extracts local features from the document and combines them with global features fused by a DBM, learning a better high-level document representation. By clustering labels and exploiting label co-occurrence relationships, the model builds a hierarchical label tree and implements it with an effective architecture. We also derive the error propagation of the model so that the whole network can be trained and fine-tuned in a supervised manner. Experimental results show that the B-CNN model performs well not only on biomedical text but also in other domains.
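The abstract does not give the HDBN's internals, but its DBN component is conventionally a greedy stack of Restricted Boltzmann Machines trained with contrastive divergence. As a hedged illustration only, a minimal NumPy sketch of such a stack applied to a toy binary bag-of-words matrix might look like this; the class and variable names and the toy data are hypothetical, not the author's code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """A single Restricted Boltzmann Machine layer with binary units."""
    def __init__(self, n_visible, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0, 0.01, size=(n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)   # visible bias
        self.b_h = np.zeros(n_hidden)    # hidden bias
        self.rng = rng

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b_v)

    def cd1_step(self, v0, lr=0.05):
        """One CD-1 contrastive-divergence update on a batch of documents."""
        h0 = self.hidden_probs(v0)
        h0_sample = (self.rng.random(h0.shape) < h0).astype(float)
        v1 = self.visible_probs(h0_sample)   # reconstruction
        h1 = self.hidden_probs(v1)
        self.W   += lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
        self.b_v += lr * (v0 - v1).mean(axis=0)
        self.b_h += lr * (h0 - h1).mean(axis=0)

# Greedy layer-wise stacking gives a DBN-style high-level representation.
docs = (np.random.default_rng(1).random((8, 20)) < 0.3).astype(float)  # toy binary BOW
rbm1 = RBM(20, 10)
for _ in range(50):
    rbm1.cd1_step(docs)
h_repr = rbm1.hidden_probs(docs)                       # first-level features
rbm2 = RBM(10, 5, seed=2)
for _ in range(50):
    rbm2.cd1_step((h_repr > 0.5).astype(float))
top_repr = rbm2.hidden_probs(h_repr)                   # higher-level text representation
```

The actual HDBN additionally integrates a DBM and consumes word-embedding keyword features rather than raw BOW; this sketch shows only the generic stacked-RBM pretraining idea.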
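The B-CNN's exact architecture is likewise not specified here. As an assumption-laden sketch, the standard text-CNN pattern it builds on (narrow 1D convolution over a word-embedding matrix, ReLU, max-over-time pooling, then concatenation of the local features with a global feature vector) can be written in NumPy as below; the mean-of-embeddings stand-in for the DBM's global features, and all names, are hypothetical:

```python
import numpy as np

def conv1d_text(E, filters):
    """Valid 1D convolution over a (seq_len, emb_dim) embedding matrix.

    filters: (n_filters, width, emb_dim). Each filter slides over token
    windows; ReLU and max-over-time pooling reduce each filter to a scalar.
    Returns a (n_filters,) local-feature vector.
    """
    seq_len, emb_dim = E.shape
    n_f, w, _ = filters.shape
    feats = np.empty(n_f)
    for k in range(n_f):
        acts = [np.sum(E[i:i + w] * filters[k]) for i in range(seq_len - w + 1)]
        feats[k] = max(0.0, max(acts))   # ReLU + max pooling
    return feats

rng = np.random.default_rng(0)
E = rng.normal(size=(12, 8))             # 12 tokens, 8-dim embeddings
filters = rng.normal(size=(4, 3, 8))     # 4 trigram filters
local = conv1d_text(E, filters)          # CNN local features
global_feat = E.mean(axis=0)             # stand-in for DBM-fused global features
doc_repr = np.concatenate([local, global_feat])  # fused document representation
```

In the dissertation's model the global branch is a DBM and the fused vector feeds a hierarchical label tree; this sketch covers only the local/global fusion step.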
3. An LSTM2 model for text representation and classification. Traditional text representations require a fixed input dimension, which hurts performance on datasets where document lengths vary widely. This is especially severe for multi-label datasets, where the sample distribution is extremely uneven and sparse labels account for a large proportion of the data, seriously degrading classification performance. For this problem we propose the LSTM2 model, based on sequence prediction. Taking word sequences with word embeddings as input, the model removes the fixed-length constraint and effectively extracts high-level text representations. By analyzing latent relationships between labels and documents and adding local features to documents, it improves the prediction probability of sparse labels. In addition, the model learns to build a semantic label tree with a parser, reorders document labels accordingly, and predicts categories with an LSTM. Experimental results show that the LSTM2 model accepts text input of arbitrary length, effectively alleviates the sparse-label prediction problem, and mitigates exploding and vanishing gradients during back-propagation.
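The key property item 3 relies on, that an LSTM maps sequences of any length to a fixed-size representation, can be illustrated with a minimal NumPy forward pass. This is a generic single-layer LSTM encoder under my own naming, not the LSTM2 architecture itself:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMEncoder:
    """Minimal LSTM; the final hidden state serves as the text representation."""
    def __init__(self, emb_dim, hidden, seed=0):
        rng = np.random.default_rng(seed)
        # One stacked weight matrix for the 4 gates: input, forget, cell, output.
        self.W = rng.normal(0, 0.1, size=(emb_dim + hidden, 4 * hidden))
        self.b = np.zeros(4 * hidden)
        self.hidden = hidden

    def encode(self, embeddings):
        """embeddings: (seq_len, emb_dim); seq_len may differ per document."""
        H = self.hidden
        h = np.zeros(H)
        c = np.zeros(H)
        for x in embeddings:
            z = np.concatenate([x, h]) @ self.W + self.b
            i, f, g, o = z[:H], z[H:2 * H], z[2 * H:3 * H], z[3 * H:]
            c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # gated cell update
            h = sigmoid(o) * np.tanh(c)                   # hidden state
        return h

rng = np.random.default_rng(1)
enc = LSTMEncoder(emb_dim=8, hidden=6)
short_doc = rng.normal(size=(5, 8))    # 5-token document
long_doc = rng.normal(size=(40, 8))    # 40-token document
r1 = enc.encode(short_doc)             # both map to fixed 6-dim vectors
r2 = enc.encode(long_doc)
```

The gated cell update is also why the gradient problems mentioned above are mitigated: the additive path through `c` avoids repeated squashing. The dissertation's model adds label-tree construction and sparse-label handling on top of this encoder.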
Keywords/Search Tags: Text classification, Text representation, Deep learning, Word embedding