The Applicability Of Zipf's Law In Chinese Language Based On Words' Frequency Statistics

Posted on: 2012-01-06
Degree: Master
Type: Thesis
Country: China
Candidate: F Y He
Full Text: PDF
GTID: 2218330338470700
Subject: Chinese Philology
Abstract/Summary:
This thesis uses word frequency statistics and analysis of large-scale Chinese text corpora to test whether Zipf's law, in both its first and second forms, applies to Chinese. The work is divided into five chapters.

Chapter 1 is a general introduction and an overview of word frequency statistics. It clarifies the definition and characteristics of word frequency, describes the development of word frequency statistics abroad and in China, and states the purpose, significance, and content of the thesis.

Chapter 2 reviews the development of research on Zipf's law, the subject of this study. It explains the theoretical and philosophical background of the law, presents its mathematical derivation, and reviews research in China on Zipf's law and on its applicability to Chinese.

Chapter 3 verifies the applicability of Zipf's first law to Chinese through word frequency statistics and analysis of large-scale text corpora. It first defines and distinguishes word rank and word order, which clears an obstacle for the later experiments. Experiment 1 then verifies and compares methods for determining word rank and selects the more suitable one. Experiment 2 segments and counts one test text manually and compares the result with the computer-based count, in order to verify the feasibility and reliability of the computer-based approach. Finally, Experiment 3 performs word frequency statistics and analysis on large-scale text corpora, plots the Zipf distribution curve and the logarithmic distribution curve for six sub-corpora, and compares them with the ideal Zipf distribution curve and the ideal logarithmic distribution curve given by the first law, in order to determine whether Zipf's first law applies to Chinese.

Chapter 4 verifies and analyzes Zipf's second law on large-scale text corpora, in order to determine the distribution pattern of the low-frequency range of Chinese word frequencies and the applicability of the second law. It first describes the development of Zipf's second law and its differences from and connections with the first law. Experiment 4 then counts, for the first five sub-corpora, the number of words that share each frequency and the logarithm of that number. It next uses the theoretical formula of Zipf's second law to calculate the predicted number of same-frequency words and the logarithm of the predicted number. Finally, for each of the five sub-corpora it plots four curves for comparison: the distribution of the number of same-frequency words, the distribution of its logarithm, the distribution of the predicted number, and the distribution of the logarithm of the predicted number. The comparison is used to determine whether Zipf's second law applies to Chinese.

Chapter 5 is the conclusion. It summarizes the statistics and verification carried out in the thesis, reflects on its shortcomings, and gives an outlook on future work.

From the Zipf distributions of the six sub-corpora, we find that the word frequency distribution of large-scale Chinese text corpora is consistent with Zipf's first law in the high-frequency and mid-frequency ranges, while the low-frequency range is more consistent with Zipf's second law. In the low-frequency range, the distribution deviates significantly from the linear decline described by Zipf's first law and instead shows a parabolic decline. In the high-frequency range, the distribution does not resemble the staircase-like decline described by the second law. This reflects the respective scopes of Zipf's first and second laws. The thesis concludes that the word frequency distribution of large-scale Chinese text corpora is consistent with Zipf's law.
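The two checks described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the thesis's own procedure: the abstract does not state its formulas, so the code assumes the commonly cited forms of the two laws, namely f(r) ≈ C / r for the first law (a slope near -1 on a log-log plot of frequency against rank) and I_n = 2 · I_1 / (n(n + 1)) for the second law, where I_n is the number of distinct words occurring exactly n times. All function names are hypothetical, the input is assumed to be an already segmented list of words, and words with tied frequencies are simply given consecutive ranks, which is one of several possible ranking methods.

```python
from collections import Counter
import math


def rank_frequency(tokens):
    """Return (rank, frequency) pairs, most frequent word first.

    Assumption: tied frequencies receive consecutive ranks.
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(freqs, start=1))


def log_log_slope(pairs):
    """Least-squares slope of log(frequency) against log(rank).

    Under the assumed form of Zipf's first law the slope is close to -1.
    Requires at least two distinct ranks.
    """
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(freq) for _, freq in pairs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def same_frequency_counts(tokens):
    """Map n -> I_n, the number of distinct words occurring exactly n times."""
    return Counter(Counter(tokens).values())


def predicted_same_frequency(i1, n):
    """Predicted I_n under the assumed form of Zipf's second law.

    i1 is the observed number of words that occur exactly once.
    """
    return i1 * 2 / (n * (n + 1))
```

To compare observed and predicted values for one sub-corpus, one would call `same_frequency_counts` on its segmented tokens, take the count for n = 1 as `i1`, and evaluate `predicted_same_frequency(i1, n)` for each small n; the logarithms of both series give the log curves mentioned above.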
Keywords/Search Tags: word frequency statistics, Zipf's first law, Zipf's second law, applicability to the Chinese language