| Word frequency statistics is a research method in lexical analysis. It reveals patterns of word usage by counting the occurrences and calculating the frequency of each word in a defined corpus. It is applied in linguistics, informatics, bibliometrics and other fields, where high-frequency words are regarded as particularly important. High-frequency words are used constantly in daily life, and they have significant influence and research value for the usage and development of a language. The main work of this paper is a comparative analysis of the usage and development of high-frequency words in corpora from different periods, in order to improve understanding of the rules of Chinese language development. In addition, verifying the continuity and inheritance of Chinese language development helps reveal the historical distribution mechanism of Chinese word frequency by explaining individual differences in the use of high-frequency words. This paper is divided into six chapters. Chapter one briefly introduces the development of corpora and word frequency research at home and abroad, and explains the research purpose, significance and content structure of the paper. Chapter two introduces Chinese word segmentation, including its basic definition, main difficulties and major algorithms, and describes the three standards for evaluating segmentation quality: Precision, Recall and the F Index. These standards are used as criteria to show that the chosen word segmentation software is accurate enough to be used in this study. Chapter three describes the calculation of word frequency statistics. Because the Chinese corpora in this paper cover a long time span, they are divided into two periods according to the rules of Chinese language evolution.
Different word frequency methods are used for the two periods. The first period runs from the pre-Qin era to the Song Dynasty. Words in that period were mostly monosyllabic, which means that a character is a word. For this period, MyZiCiFrep is used as the word frequency tool: it counts each character in the corpus and automatically outputs the results ordered by frequency. The second period covers the Ming and Qing Dynasties and later, when the proportion of disyllabic words had increased. For this period, a programmed algorithm calculates and ranks the word frequencies. The chapter closes by introducing the word frequency algorithm. Chapter four is the focus and main part of this paper. It briefly introduces the source corpora, which are divided into Pre-Qin, Han Dynasty, Wei, Jin and the Northern and Southern Dynasties, Tang, Song, and Ming and Qing Dynasties, following the order of the works. TONP is the method used to define high-frequency and low-frequency words. First, the data from Pre-Qin to the Song Dynasty are analyzed to build a correlation table of word frequencies. Methods such as overall contrast and variance contrast show that most of the high-frequency words in this table are stable throughout the development of the language. Finally, high-frequency words from the Ming and Qing Dynasties are compared with the data obtained previously. Using the ANOVA data analysis tool in Excel, we conclude that the ranks of high-frequency words have only a weak connection with time period. This suggests that high-frequency words do not change drastically over time, and that their use changes little because of the inheritance and continuity of the language. Chapter five makes a simple correlation analysis of English and Chinese.
It finds a roughly linear relation between the usage rates of the top 1000 words in English and in Chinese, with a certain positive correlation. This also supports the view that different languages develop in similar ways. Chapter six is the conclusion and outlook. It summarizes the conclusions of this paper, discusses the shortcomings of this work, and outlines future directions for the topic. |
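
As an illustration of two computations summarized above, the following is a minimal Python sketch, not the tooling used in this paper. It assumes that, for the earlier period, one character counts as one word, and that segmentation quality is scored by comparing word boundaries against a reference segmentation. The function names `char_frequencies` and `segmentation_prf` are hypothetical, and the restriction to the basic CJK Unified Ideographs range is a simplifying assumption.

```python
from collections import Counter


def char_frequencies(text):
    """Rank characters by frequency, treating each character as one word.

    Assumption: only the basic CJK Unified Ideographs block
    (U+4E00..U+9FFF) is counted; punctuation and whitespace are skipped.
    """
    chars = [ch for ch in text if "\u4e00" <= ch <= "\u9fff"]
    counts = Counter(chars)
    total = sum(counts.values())
    # most_common() orders entries by count, highest first.
    return [(ch, n, n / total) for ch, n in counts.most_common()]


def segmentation_prf(gold, predicted):
    """Precision, recall and F for a segmentation of the same sentence.

    Both arguments are lists of words. A predicted word is correct only
    if its (start, end) character span also appears in the reference.
    """
    def spans(words):
        result, start = set(), 0
        for word in words:
            result.add((start, start + len(word)))
            start += len(word)
        return result

    gold_spans, pred_spans = spans(gold), spans(predicted)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


if __name__ == "__main__":
    for ch, n, rate in char_frequencies("不亦说乎，不亦乐乎。"):
        print(ch, n, round(rate, 3))
    print(segmentation_prf(["研究", "生命", "起源"], ["研究生", "命", "起源"]))
```

For text from the later period, where disyllabic words are common, the character-level count would be replaced by counting the output of a word segmenter; the ranking step stays the same.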