Construction And Application Of Chinese Euphemism Language Resource Based On Natural Language Processing Techniques

Posted on: 2022-06-10
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C L Zhang
GTID: 1485306728464904
Subject: Linguistics and Applied Linguistics
Abstract/Summary:
Euphemism is a common language phenomenon and an essential skill for smooth communication. It has long been a hot topic in linguistics, but no relevant research has been found in Natural Language Processing. Constructing a Chinese Euphemism Language Resource is of great significance to machine translation, metaphor recognition, sentiment analysis, and post-editing in natural language generation.

In linguistics, the study of euphemism covers its definition, classification, language taboo, cognitive motivation, and more. Most research focuses on the Principle of Conversation, Sociolinguistics, Cognitive Linguistics, Language Teaching, and so on. However, most of these studies are theoretical and qualitative, and they lack quantitative analysis and corpora. Natural Language Processing offers a large number of mature automatic or semi-automatic techniques for dealing with language issues, but it lacks formalized processing methods and manually tagged basic language resources for euphemism. The paper attempts to combine the theories and techniques of Linguistics and Natural Language Processing. Based on Natural Language Processing techniques, a dynamic Chinese Euphemism Language Resource is constructed and preliminarily applied. The work of the paper mainly comprises the following aspects:

1. Constructing a Chinese Euphemism Language Resource that can be used in Natural Language Processing. By comparing four commonly used Chinese euphemism dictionaries, the scale of commonly used euphemisms is determined. After estimating the effect of automatic Chinese word segmentation on example sentences, we choose People's Daily as the source of euphemism sentences. In this step, 63,159 sentences are collected for 923 euphemisms. To ensure the quality of the resource at this early stage, we annotate all sentences by five-expert voting and manual tagging, covering changes in the semantics, usage, and sentiment of the euphemisms. Building on existing linguistic research and on relevant tasks in Natural Language Processing, we classify euphemisms in detail at the semantic level. The classification includes 11 categories with 2-5 subcategories each. Relevant arguments and examples are also tagged for each euphemism.

2. Recognizing euphemism automatically with Natural Language Processing techniques, so that the resource can be updated automatically and large numbers of sentences can be added to the corpus. We use word embeddings to generate sentence vectors, by both arithmetic averaging and TF-IDF weighted averaging. K-means and spectral clustering are used for unsupervised clustering of euphemism sentences. Analysis and visualization of the results show that the contextual features of euphemism are difficult to learn by unsupervised clustering; the automatic semantic judgement of euphemism needs the prior knowledge provided by manual tagging. We therefore experiment with two supervised classification models, KNN and SVM, evaluated with ten-fold cross-validation, and achieve good results: the highest accuracy of the supervised classifiers is 96.29%, with an F1 value of 0.9167. We also use the trained classifiers to judge euphemisms that are not included in the corpus. We use under-sampling to compensate for the imbalance among sample types in the training set, which greatly affects the prediction performance of the classifiers. The experiment has achieved some results but still needs to be improved, and the scale of Chinese euphemism resources needs to be expanded further.

3. Applying automatic euphemism recognition to massive text, amounting to billions of Chinese characters, and analyzing the trend and causes of the diachronic change of euphemism. We use the trained classifier to recognize euphemisms automatically. Through automatic recognition and quantitative statistical analysis of millions of euphemism sentences extracted from the People's Daily corpus from 1946 to 2017, the paper studies the diachronic change and development of euphemism and analyzes the reasons behind it. We use large amounts of data to show the covariance among the development and change of euphemism, society, and people's concepts. From the perspective of quantitative research, the paper verifies Gresham's Law and the Law of Succession in language change.

4. Making a preliminary attempt at automatically rewriting sentences into euphemistic expressions. The paper attempts to rewrite Chinese sentences to make them euphemistic, specifically by changing sentences that express attitudes and opinions into euphemistic expressions. To achieve this, we start from the annotation of dictionaries and refer to previous linguistic research. Stanford CoreNLP is used to generate a syntactic tree, and rewriting rules are set according to the part of speech of the replaced word, its position in the sentence, and the context. For problems that linguistic rules cannot solve, KenLM is used to train a language model that scores the rewritten sentences. The difference (D-value) between the scores before and after rewriting is calculated, and a threshold on this difference is used for filtering, in order to achieve the automatic rewriting of euphemistic sentences.
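The score-difference filter described in item 4 can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: `toy_score`, `filter_rewrites`, the toy log-probabilities, the English example tokens, and the threshold value are all hypothetical. The direction of the D-value (rewritten score minus original score, keeping rewrites at or above the threshold) is also an assumption, since the abstract does not specify it. In practice the scoring function would come from a trained language model, such as one built with KenLM, applied to segmented Chinese sentences.

```python
from typing import Callable, Iterable, List, Tuple

# Toy stand-in for a language model: invented unigram log10 probabilities.
# A real system would replace this with scores from a trained model.
TOY_LOGPROBS = {
    "he": -1.0,
    "died": -2.0,
    "passed": -2.2,
    "away": -1.8,
    "yesterday": -1.5,
}
UNKNOWN_LOGPROB = -6.0  # assumed penalty for tokens the toy model has not seen


def toy_score(sentence: str) -> float:
    """Sum per-token log10 probabilities; higher means more fluent."""
    return sum(TOY_LOGPROBS.get(tok, UNKNOWN_LOGPROB) for tok in sentence.split())


def filter_rewrites(
    pairs: Iterable[Tuple[str, str]],
    score: Callable[[str], float],
    threshold: float,
) -> List[Tuple[str, str, float]]:
    """Keep (original, rewritten) pairs whose score difference passes the threshold.

    The D-value is assumed to be score(rewritten) - score(original), so a
    rewrite is kept when it does not lose more fluency than the threshold allows.
    """
    kept = []
    for original, rewritten in pairs:
        d_value = score(rewritten) - score(original)
        if d_value >= threshold:
            kept.append((original, rewritten, d_value))
    return kept


if __name__ == "__main__":
    candidates = [
        ("he died yesterday", "he passed away yesterday"),
        ("he died yesterday", "he expired yesterday"),
    ]
    for original, rewritten, d_value in filter_rewrites(candidates, toy_score, -3.0):
        print(f"{original!r} -> {rewritten!r} (D-value {d_value:.2f})")
```

Summed log-probabilities penalize longer sentences, and euphemistic rewrites are often longer than the originals, so a per-token normalization of the scores may be a reasonable variation; whether the dissertation normalizes is not stated in the abstract.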
Keywords: Euphemism, Construction of Language Resources, Corpus, Diachronic, Covariance