Research On Analysis And Computation Methods For Short Text With Deep Learning

Posted on: 2017-02-12
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Li
Full Text: PDF
GTID: 1108330485450025
Subject: Computer application technology
Abstract/Summary:
With the development of the Internet and mobile devices, users can easily express their emotions, opinions, and comments online and on mobile platforms, which produces a huge amount of text. Within this data, short texts have become the main carrier through which users transmit information, and the analysis and computation of short text have gradually become an active research topic in natural language processing. However, because short texts are casually expressed and grammatically irregular, traditional processing methods suffer from three problems: sparse representations and loss of semantic information in short text computation; word matching failures and out-of-vocabulary words in Chinese word segmentation; and a lack of semantic representation for words and characters. Traditional methods are therefore not fully suited to short text computation. With the development of deep learning, feature learning is becoming a new branch of machine learning. It is therefore important, and significant for short text applications, to study these problems using semantic representation and deep learning.

To address the above problems, and in light of the characteristics of short text, this dissertation studies semantic representation, Chinese word segmentation, and short text similarity computation using the theories and methods of deep learning, and forms a complete short text computing framework. The main contents and innovations of the dissertation are as follows:

(1) To extract semantic representations of Chinese characters and words, a semantic vector representation method based on local and global context is proposed. Using the semantic relations between a word and its context, the method constructs a neural network model that computes semantics over both the local context and the global context. The model learns the semantic vectors of characters and words in an unsupervised way, so that the semantics of each character or word are irreplaceable in its context. Two groups of representations with wide coverage are trained with the model, one for Chinese characters and one for Chinese words. Experimental results show that the learned vector representations capture effective semantic relations, and that low-dimensional continuous vectors are better suited to short text computation.

(2) To avoid the word matching failures and out-of-vocabulary words of traditional Chinese word segmentation methods, a Chinese word segmentation method based on Chinese character vector representations is proposed. The method takes each character's position within its word as the prediction target, which converts word segmentation into a sequence annotation problem. A neural network model is constructed as the annotation classifier and analyzes the semantics of the context. Segmentation is then performed from the estimated in-word position of each character. In comparison with ICTCLAS, the cloud platform of HIT, and the Paodingjieniu Chinese word segmentation tool, experimental results show that this method achieves higher accuracy and recall.

(3) To address the sparse representation and semantic loss of traditional methods for representing short texts, a short text representation method based on pooling is proposed. Taking into account the semantically similar words shared between the target text and the candidate text, short texts are represented by weighted pooling over their word vectors. In addition, features obtained from a recursive auto-encoder are fused in to construct a short text similarity computing framework. Experimental results show that the proposed framework effectively improves short text retrieval results.

Finally, to meet the practical needs of the biomedical information retrieval task, the text representation method is applied to query expansion in order to compensate for the lack of a domain dictionary and a synonym thesaurus. A biomedical information retrieval system based on the semantic representations and short text representations is designed. In the BioASQ evaluations, the system won first place twice and second place twice in document retrieval, and second place four times in snippet retrieval. This application further demonstrates the validity of the method.
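
As an illustration of the sequence annotation formulation in contribution (2), the following minimal sketch shows how per-character position tags map to and from a segmentation. The abstract does not name the tag set, so the four-tag B/M/E/S scheme (begin, middle, end, single) used here is an assumption, and the function names `tags_from_words` and `words_from_tags` are hypothetical. The neural classifier that would predict the tags is omitted; only the labeling and decoding steps are shown.

```python
def tags_from_words(words):
    """Convert a segmented sentence into per-character position tags.

    Assumed tag set: B (begin), M (middle), E (end), S (single-character word).
    """
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def words_from_tags(chars, tags):
    """Rebuild words from characters and their (predicted) position tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        # A new word starting while one is still open: close the open word first.
        if tag in ("B", "S") and current:
            words.append(current)
            current = ""
        current += ch
        # E and S both mark the last character of a word.
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends without E or S
        words.append(current)
    return words


if __name__ == "__main__":
    gold = ["短文本", "的", "分析"]
    tags = tags_from_words(gold)
    print(tags)
    print(words_from_tags("".join(gold), tags))
```

Because every character receives a tag regardless of whether its word appears in a dictionary, this formulation does not depend on dictionary matching, which is the motivation the abstract gives for it.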
Keywords/Search Tags:Short Text, Deep Learning, Semantic Representation, Chinese Word Segmentation, Similarity Computation