Research On Modeling And Representation Method Of Intra- and Extra-text Information Based On Chinese Language Cognition

Posted on:2024-07-12

Degree:Doctor

Type:Dissertation

Country:China

Candidate:H Q Tao

Full Text:PDF

GTID:1528306929992729

Subject:Computer application technology

Abstract/Summary:

Language is a social phenomenon unique to humans and a key to the mystery of human intelligence. Since text is the most important visual carrier of language, how to model and represent text reasonably in computers has always been a research focus of artificial intelligence (AI) and natural language processing (NLP). Although existing research on text representation has achieved certain results in practical applications, most methods focus only on the string-level mapping from "text presentation" to "mathematical representation by computer", ignoring the cognitive processing of text-specific features and the abstraction and transformation of external information, which are the aspects most closely related to human intelligence. These include the characteristics of abstract concepts expressed in different languages and their differences in text features, as well as the impact of human cognitive behavior on text comprehension. Current text representation research therefore still faces three challenges: difficulty in constructing an adaptive organization of literal features within text, difficulty in representing and utilizing key information outside text, and difficulty in jointly modeling the semantics inside and outside text together with human cognitive processes. To explore the complex linguistic cognitive processes behind textual semantics, this dissertation starts from Chinese, which is the only highly developed ideographic language inherited throughout human history, and carries out exploratory research on text information modeling and representation through Chinese word embedding and Chinese text classification tasks, from three perspectives of language cognition: visual perception, associative thinking, and empirical organization. The main work and contributions can be summarized as follows:

(1) This dissertation investigates methods for modeling and representing intra-text features for visual perception. The first step in human text understanding is to observe the text. Therefore, from the visual perception perspective of Chinese language cognition, this dissertation first examines the similarities and differences between Chinese and alphabetic languages such as English in how morphology assists semantic transmission and understanding. It then delves into the most basic stroke granularity and the intuitive two-dimensional character form that make up Chinese characters, and explores the roles of strokes and character forms as two types of morphological information in Chinese word embedding tasks. On this basis, the Dual-channel Word Embedding model (DWE), designed specifically for Chinese text, is proposed. By learning the stroke information of Chinese characters in a sequential channel and their glyph information in a spatial channel, DWE models and mines Chinese morphological information. In evaluations on "word similarity" and "word analogy", DWE shows a significant advantage in capturing Chinese morphological information, which demonstrates the rationality and effectiveness of morphological information for enriching Chinese word embeddings and provides new insights for the study of Chinese word embedding. Second, moving from micro-level strokes to macro-level granular features, this dissertation further explores the use of characters, words, character-level radicals, and word-level radicals, and discusses how radicals are introduced and their effect on semantic classification. It then combines radicals with the most relevant task, Chinese text classification, and puts forward a Radical-aware Attention-based Four-Granularity model (RAFG), which can reasonably and effectively integrate the meaning of radical features in a specific Chinese context. Finally, through multiple verifications and evaluations, the experimental results not only demonstrate the superiority of RAFG but also validate the effectiveness of radicals in Chinese text classification. These findings also lay a solid foundation for the subsequent work of this dissertation.

(2) This dissertation investigates methods for modeling and representing extra-text prior information for associative thinking. After observing a text, it is natural and instinctive for humans to assist understanding by associating key prior information from outside the text. Therefore, from the associative thinking perspective of Chinese language cognition, methods for modeling and representing external prior information are studied. First, based on the prior information carried by the radicals of Chinese phono-semantic compound characters, this dissertation investigates strategies for modeling and utilizing "association information that relies on features within the text" (Text-dependent Associative Concepts), and proposes a Radical-guided Association Model (RAM). RAM consists of two coupled spaces, a literal space and an associative space, which imitate the interaction and matching process that occurs when people read Chinese text and think of relevant information based on its literal features. Then, drawing inspiration from schema theory in psychology, this dissertation explores the modeling and utilization of "schema information that does not depend on features within the text" (Text-independent Schemata). Specifically, it adds a schema space with a new loss function paradigm to RAM, yielding a Schema-aware Radical-guided Associative Model (SRAM). SRAM uses the label information of supervised classification datasets to introduce necessary label semantics into text representation modeling, which reasonably imitates the function of prior information outside the text and of schema scenarios in the human mind. Finally, extensive experiments are conducted on three real datasets with different characteristics. The results not only verify the effectiveness of RAM and SRAM in Chinese text classification but also relate key deep learning techniques to interdisciplinary principles of language cognition, so that the models combine performance with rationality.

(3) This dissertation investigates a unified, generalized representation method for both internal and external text information, aimed at empirical organization. In complex language scenarios, organically unifying the information inside and outside the text and integrating prior and posterior information is the key to maintaining the generalization ability of human language cognition. Therefore, based on the cognitive characteristics of Chinese and the commonality of human language cognition from the perspective of empirical organization, and considering that current NLP research is constrained by the construction of large models and expensive computing resources, this dissertation first reviews the two-stage development of NLP from statistical learning to deep learning. It then analyzes the influence of latent human thinking and empirical organization on the understanding of text semantics, and the need for a unified and efficient generalization method for the semantic information inside and outside text. On this basis, it designs a Statistics-based Label Interactive Model (SLIM). Specifically, based on the social nature of language and the communication of experience, this dissertation proposes a two-stage approach of "pre-classification (coarse classification)" + "enhanced classification (fine classification)" that imitates the human cognitive processes of experience recall, probability analysis, and sequential analysis, and effectively unifies the learning of prior and posterior information inside and outside the text. This two-stage strategy does not rely on large-scale databases or graph construction. It needs only offline statistical learning on labeled multi-domain category data to perform pre-classification on new data, and it performs enhanced classification by being appended to existing deep learning models. It is therefore a lightweight and transferable framework for enhancing text representation and classification. Finally, extensive experiments on three datasets with different characteristics demonstrate that the designed strategy can effectively improve the ability of existing models to understand text semantics.
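
To make the two-stage idea in (3) more concrete, the following is a minimal illustrative sketch and not the dissertation's SLIM implementation. It assumes that the offline statistics are simple per-label character counts, that pre-classification is a smoothed naive-Bayes-style score, and that "appending" to an existing model means concatenating the coarse label distribution onto that model's feature vector. The function names `fit_label_statistics`, `pre_classify`, and `enhance_features`, and the toy data, are hypothetical.

```python
import math
from collections import Counter, defaultdict


def fit_label_statistics(texts, labels):
    """Offline stage (assumed form): count characters seen under each label."""
    char_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in zip(texts, labels):
        char_counts[label].update(text)  # a str is iterated character by character
        label_counts[label] += 1
    vocab = {ch for counts in char_counts.values() for ch in counts}
    return char_counts, label_counts, vocab


def pre_classify(text, stats):
    """Coarse classification: a smoothed distribution over labels for new text."""
    char_counts, label_counts, vocab = stats
    total_docs = sum(label_counts.values())
    log_scores = {}
    for label, doc_count in label_counts.items():
        total_chars = sum(char_counts[label].values())
        score = math.log(doc_count / total_docs)
        for ch in text:
            # Add-one smoothing so unseen characters do not zero out a label.
            score += math.log(
                (char_counts[label][ch] + 1) / (total_chars + len(vocab) + 1)
            )
        log_scores[label] = score
    # Normalise the log scores into probabilities (numerically stable softmax).
    peak = max(log_scores.values())
    weights = {label: math.exp(s - peak) for label, s in log_scores.items()}
    norm = sum(weights.values())
    return {label: w / norm for label, w in weights.items()}


def enhance_features(text_features, label_prior, label_order):
    """Fine stage input: append the coarse label distribution to model features."""
    return list(text_features) + [label_prior[label] for label in label_order]


# Toy usage with made-up data; a real setting would use a labeled corpus.
stats = fit_label_statistics(
    ["股票上涨", "基金股票", "球队比赛", "比赛进球"],
    ["finance", "finance", "sports", "sports"],
)
prior = pre_classify("股票基金", stats)
features = enhance_features([0.1, 0.2], prior, ["finance", "sports"])
```

In this sketch the first stage needs only counting over labeled data, and the second stage leaves the downstream classifier unchanged apart from a slightly wider input, which is one simple way to read "lightweight and transferable"; the actual interaction mechanism in SLIM may differ.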

Keywords/Search Tags:

Chinese Language Cognition, Text Representation, Intra-text Feature Modeling, Extra-text Information Modeling, Unified Generalization Modeling of Inner and Outer Text Information

Related items

1	Research On Topic Modeling For Short Text With Enriched Feature Representation
2	The Research On Local Smooth Preserving Of Manifold Regularization Auto Encoder For Text Representation
3	Research On Scene Text Extraction Techniques With Parametric Text Shape Modeling
4	Research On Multilingual Text Recognition In Complex Scenes And System Design
5	Text Representation And Algorithms For Chinese Text Classification
6	Algorithm Research On Text Classification And Named Entity Recognition Based On Deep Text Feature Representation
7	Learning Representations of Text through Language and Discourse Modeling: From Characters to Sentences
8	Research On Video Text Extraction And The Application In Virtual Karaoke
9	Research On Text Steganography
10	Web Text Classification System For Chinese Pretreatment Technology