
Comparison Of Chinese And English Textual Features Based On Quantitative Linguistics Indicators

Posted on: 2018-01-25
Degree: Doctor
Type: Dissertation
Country: China
Candidate: R N Chen
GTID: 1365330629482376
Subject: Foreign Language and Literature
Abstract/Summary:
Quantitative linguistics is a branch of linguistics that aims to uncover the laws underlying language phenomena through precise quantitative techniques. In recent years, with the steady growth in quantitative indicators for measuring textual features, we can not only use these indicators to verify conclusions from traditional qualitative research, but also find linguistic laws underpinning different texts that other research methods may miss or fail to explain well. This dissertation selects indicators that comprehensively reflect the quantitative linguistic features of the language system, and conducts synchronic and typological comparisons of representative text types in Chinese and English. Our research corpora are LCMC (the Lancaster Corpus of Mandarin Chinese) and Frown (the Freiburg-Brown Corpus of American English), both suited to comparative study of Chinese and English. We quantify textual features as data, then compute frequencies and frequency distributions using statistical techniques, with the aim of discovering the mathematical laws underlying language phenomena, interpreting their intrinsic causes from a linguistic perspective, and providing ideas for the study of language.

The dissertation has six chapters. The first chapter is the introduction. We first review qualitative and quantitative studies of textual features in the domestic and international literature, as well as comparative research on Chinese and English, and then put forward our research questions. The second chapter introduces the data and the research method, including the LCMC and Frown corpora, some important quantitative indicators (entropy, h-point, and thematic concentration), and some measuring tools. In the third, fourth, and fifth chapters, we conduct empirical studies with specific quantitative indicators to compare different text types, focusing on syntactic variation, lexical richness, and the degree of thematic concentration, respectively. The sixth chapter is the conclusion. We summarize the main findings, note the issues left unexplored, and point out directions for further research.

The third chapter adopts entropy as an indicator to investigate the syntactic variation and aspect markers of different text types. Entropy is an important parameter in information theory that characterizes linguistic information in mathematical terms. The three entropy-based indicators employed in this chapter, namely word positional relative entropy, POS (Part-of-Speech) positional relative entropy, and relative entropy of aspect markers, may cast some light on the syntactic variation of different text types. The most distinctive feature of using entropy-based indicators to explore syntactic variation is that this paradigm takes into account the mutual influence and correlational strength of consecutive words or POS tags in a sentence, which other research methods can seldom accomplish. The mean of POS positional relative entropy may be a more reliable indicator of syntactic variation. Statistical tests verify that POS positional relative entropy and relative entropy of aspect markers can distinguish different texts, especially the dichotomous “narrative vs. expository” texts, in both Chinese and English.

The fourth chapter compares different text types from the perspective of lexical richness, again using entropy as an indicator. We define a difference in lexical richness between two text types as a difference in their word type probability distributions. Along this line, the concept of “lexical richness” is discussed in terms of three interrelated distribution functions, namely the TTR (Type-to-Token Ratio) distribution, the TTR-Entropy distribution, and the word frequency distribution profile. We find that the TTR distribution and the TTR-Entropy distribution can distinguish different text types in both Chinese and English. The TTR-Entropy distribution may be considered a type of Lorenz curve, specifically one of the “Lorenz curves for scale-free networks”. We then borrow a parameter of Lorenz curves (indicating the degree of upward convexity of the power function) to compare TTR differences, which is the most direct way to compare differences in lexical richness across text types. In both LCMC and Frown, Official Document has the highest lexical richness, Fiction the lowest, and News falls somewhere between the two.

The fifth chapter compares different text types from the perspective of thematic concentration via three quantitative indicators, namely Thematic Concentration (TC), Secondary Thematic Concentration (STC), and Proportional Thematic Concentration (PTC). This method of measuring thematic concentration differs from other forms of content analysis in that it quantifies exactly how concentrated the themes of a text are through its thematic words, so that more advanced statistical tests can then be conducted. Taking the values of the three indicators for three representative text types (News, Official Document, and Fiction) as feature vectors, we compare the thematic characteristics of these texts using PAM (Partitioning Around Medoids) and HA (Hierarchical Agglomerative) clustering. The results show that the feature vectors representing the thematic characteristics of the three text types can be clustered into their corresponding categories in both Chinese and English. Two contributing factors are identified for the clustering results: (1) the hierarchical differences manifested both in the values of each indicator within each text type and in the values of the three indicators across the three text types; (2) the differences in “thematic words”, both in their number and in their POS types. The differences in the percentages of nouns in the pre-h-point and pre-2h-point domains contribute to the thematic differentiation of the three text types. The ranking of noun percentages in descending order is the same in both LCMC and Frown, that is, “Official Document > News > Fiction”, which corresponds to their tripartite “intensive-balanced-dispersive” contrast. This characterization also holds cross-linguistically in both Chinese and English.
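
The abstract names entropy, TTR, and the h-point without defining them. As a rough illustration only, the Python sketch below computes textbook versions of all three for a list of tokens. It is not the dissertation's own procedure: the function names and the sample sentence are hypothetical, the entropy shown is plain Shannon entropy over word types rather than the positional relative entropy used in the third chapter, and the h-point is a simplified variant (the largest rank whose frequency is at least that rank) with no interpolation. All functions assume a non-empty token list.

```python
import math
from collections import Counter


def shannon_entropy(tokens):
    """Shannon entropy (in bits) of the empirical word-type distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def type_token_ratio(tokens):
    """Number of distinct word types divided by the number of tokens."""
    return len(set(tokens)) / len(tokens)


def h_point(tokens):
    """Simplified h-point: the largest rank r whose frequency is at least r.

    Assumption: no interpolation is applied when no rank exactly equals
    its frequency; fuller definitions in the literature handle that case.
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    h = 0
    for rank, freq in enumerate(freqs, start=1):
        if freq >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical toy input; a real study would use tokenized corpus texts.
sample = "the cat saw the dog and the dog saw the cat".split()
print(shannon_entropy(sample), type_token_ratio(sample), h_point(sample))
```

On a real corpus these values depend strongly on text length, so comparisons across text types would presumably need texts of comparable size or a length-normalized variant.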
Keywords: Quantitative Indicators, Entropy, Word Richness, Thematic Concentration, Text Type Differences