
The Research On Chinese Sentential Semantic Model Parsing And Text Representation

Posted on: 2017-03-30
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L Han
Full Text: PDF
GTID: 1108330503955255
Subject: Information and Communication Engineering
Abstract/Summary:
With the rapid development of the mobile Internet and information technology, text data such as news, commentary, and Weibo posts have grown explosively. This places higher demands on the ability of computers to process large quantities of text. Text representation is one of the most significant topics in basic research and plays an important role in text processing. Meanwhile, the demand for semantic analysis is growing, and semantic information needs to be mined from language. Chinese text processing is more difficult than English text processing, and Chinese semantic analysis in particular is a long-standing challenge.

This dissertation focuses on the Chinese sentential semantic model (CSM) and its application in text representation. It puts forward a parsing method for Chinese sentential semantic structure, improves long and short text representation using the rich semantic information provided by CSM, and attempts to promote the development of Chinese semantic analysis theory and technology.

The main contributions of this dissertation are listed below:

1. Based on conditional random fields and dependency parsing, a five-link method for Chinese sentential semantic structure analysis is proposed. It acquires 28 kinds of sentential semantic components and 3 kinds of relations among the sentential semantic components, and enriches the machine-processable semantic features.

CSM is an abstract representation of sentence meaning. It is regarded as a significant method for Chinese semantic analysis and as a machine-comprehensible model representation for sentential semantic computation. The proposed analysis method divides the recognition of sentential semantic structure into five links. Each link obtains distinct semantic features and can be selected flexibly according to need. The five-link method is evaluated on the Beijing Forest Studio Chinese tagged corpus, where the F-value reaches 0.787. The method can identify all the sentential semantic components and relations at once, transform plain sentences into a computable sentential semantic structure, enrich the machine-processable semantic features, and advance research on Chinese semantic parsing.

2. A long text representation method that fuses the relations among the sentential semantic components with a topic model is proposed. It uses the semantic relations to control the word generation process in the topic model and relaxes the bag-of-words assumption. The proposed representation makes full use of textual semantic features and can be applied to improve long text classification and clustering.

Statistical long text representation has developed rapidly in recent years, and topic models in particular are considered a leading approach. However, existing topic models neglect the semantic relationships between words, which weakens the resulting text representation. To address this problem, this dissertation puts forward relational latent Dirichlet allocation (RLDA), which fuses the relations among the sentential semantic components with latent Dirichlet allocation (LDA). RLDA maps the relations among the sentential semantic components to lexical semantic relations and represents each word by the word itself together with its semantically correlated word. In this way, it integrates lexical semantic relations into the word generation process and relaxes the bag-of-words assumption. Perplexity, text classification, and text clustering experiments are carried out on the Sogou corpus. The results show that the perplexity reaches 480.319, the accuracy of text classification reaches 0.907, and the ARI value of text clustering reaches 0.4537.

3. A short text representation method using the sentential semantic components is proposed. The method designs topic selection rules that depend on the Topic and the Comment of CSM and, based on these rules, expands the short text with semantically correlated words. It mitigates the sparseness problem of short text representation and improves short text classification and clustering.

In short text representation, sparseness degrades the performance of classification and clustering. To resolve this problem, this dissertation presents a short text representation method using the sentential semantic components. Without changing the dimension of the feature space, the method uses the sentential semantic components and a topic model to obtain semantically correlated words, and then expands the short text with those words according to the topic selection rules. This reduces the number of zero-valued dimensions in the feature vectors and mitigates the sparseness problem. Short text classification and clustering experiments are carried out on the Sogou corpus. The results show that the accuracy of short text classification reaches 0.8031 and the ARI value of short text clustering is 0.2728.

4. A Chinese sentential semantic structure analysis and application platform is constructed. It provides automatic parsing of sentential semantic structure and CSM corpus annotation. Being extensible, the platform can support basic and applied research based on CSM.

To promote research on CSM, the platform is built on the LNMP architecture and a remote procedure call protocol. Its main functions include automatic parsing of Chinese sentential semantic structure and corpus annotation, among others. The platform is stable, reliable, maintainable, and extensible, and it lays a solid foundation for further research on CSM.
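
The short text expansion described in contribution 3 can be illustrated with a toy sketch. The Python below is not the dissertation's implementation: the vocabulary, the correlated-word table, and the weight of 0.5 are hypothetical. The hand-written table stands in for the words that the method obtains from a topic model under its Topic- and Comment-based selection rules. The sketch only shows how adding correlated words fills zero-valued dimensions while the feature space keeps its size.

```python
from collections import Counter

# Hypothetical fixed vocabulary: the feature space never grows.
VOCAB = ["phone", "screen", "battery", "camera", "price", "match", "team", "goal"]

# Hypothetical correlated-word table, hand-written for illustration only.
# In the dissertation, such words come from a topic model filtered by
# selection rules over the CSM Topic and Comment (not reproduced here).
CORRELATED = {
    "phone": ["screen", "battery", "camera"],
    "match": ["team", "goal"],
}


def to_vector(tokens, vocab=VOCAB):
    """Plain bag-of-words count vector over the fixed vocabulary."""
    counts = Counter(tokens)
    return [float(counts[w]) for w in vocab]


def expand(tokens, correlated=CORRELATED, vocab=VOCAB, weight=0.5):
    """Add a down-weighted count for each correlated word of each token."""
    vec = to_vector(tokens, vocab)
    index = {w: i for i, w in enumerate(vocab)}
    for tok in tokens:
        for rel in correlated.get(tok, []):
            if rel in index:  # stay inside the existing feature space
                vec[index[rel]] += weight
    return vec


def zero_dims(vec):
    """Number of zero-valued dimensions, a simple sparseness measure."""
    return sum(1 for v in vec if v == 0.0)


short_text = ["phone", "price"]
plain = to_vector(short_text)
expanded = expand(short_text)

print(len(plain), len(expanded))               # same dimensionality
print(zero_dims(plain), zero_dims(expanded))   # fewer zeros after expansion
```

Down-weighting the added words (the assumed `weight` parameter) keeps the original tokens dominant in the vector. How the dissertation weights expanded words is not stated in the abstract.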
Keywords/Search Tags:Chinese Sentential Semantic Model, Sentential Semantic Structure, Text Representation, Topic Model, Semantic Analysis, Text Classification, Text Clustering, Natural Language Processing