| The Hanyu Shuiping Kaoshi (HSK) has become an internationally recognized standardized test of Chinese proficiency, which motivates this study of automatic text readability analysis. By combining machine learning with readability analysis of HSK essay reading test texts, this research explores a way to automatically evaluate the readability of such texts based on multi-level Chinese text features, so as to provide a basis for selecting HSK essay reading test texts and thereby streamline the HSK test-writing process. Building on existing work, first, a feature system for the readability of HSK essay reading test texts is constructed at four levels: Chinese characters, vocabulary, syntax and context. This yields 34 initial feature indices that potentially affect the readability of these texts; the specific study of literary form is one of the innovations of this research. Second, independent-samples t-tests and chi-square tests are used to extract 25 feature indices with significant differences as input variables for subsequent modeling. Third, with the feature indices determined, support vector machine and decision tree algorithms are applied to build readability models, with the aim of selecting the best method for classifying HSK essay reading test texts. A comparison shows that, on the test set, the support vector machine model with a Gaussian kernel function performs best, reaching an accuracy of 97.83%, a precision of 95.83%, a recall of 100% and an F1 value of 97.87% in classifying HSK essay reading test texts. Therefore, given the existing datasets and feature indices, this research considers the support vector machine model with a Gaussian kernel function the best for evaluating the readability of HSK essay reading test texts. In addition, the performance of the established model on simulated HSK essay reading test texts and HSK textbooks is studied, so as to further analyze its generalization ability. The results show that the model also classifies simulated HSK essay reading test texts well, with an F1 value of 85.39%, but performs relatively poorly on HSK textbooks. Finally, the reasons for these evaluation results are discussed from the perspective of textual features, and suggestions are proposed for compiling simulated HSK essay reading test texts and HSK textbooks. It is concluded that a reasonable distribution of vocabulary and literary form should be maintained when compiling simulated HSK essay reading test texts, while the importance of literary form should be emphasized when compiling HSK textbooks. |
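
The four test-set figures reported above are linked through the standard confusion-matrix definitions, and the following minimal sketch shows how they fit together. The abstract does not state the test-set size or the confusion matrix, so the counts used here (23 true positives, 1 false positive, 0 false negatives, 22 true negatives) are an assumption: they are simply one set of counts that reproduces the reported percentages. The helper names `classification_metrics` and `as_percent` are likewise hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators so the function is safe on edge cases.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def as_percent(value):
    """Express a ratio as a percentage rounded to two decimals."""
    return round(100 * value, 2)


# ASSUMPTION: illustrative counts only; the source does not report them.
example = classification_metrics(tp=23, fp=1, fn=0, tn=22)
reported = {name: as_percent(value) for name, value in example.items()}
print(reported)
```

Under these assumed counts the sketch gives 97.83% accuracy, 95.83% precision, 100% recall and an F1 of 97.87%, which suggests the reported figures are mutually consistent. A recall of 100% alongside a precision below 100% means that every positive text was found and that the only errors were false positives.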