In this information-oriented era, question answering (QA) systems play an increasingly important role in human-computer interaction, and the growing demand for information exchange keeps raising the functional requirements placed on them. As a combined product of information retrieval and natural language processing, a QA system can respond correctly to questions that users pose in natural language. However, questions containing grammatical errors often lead to wrong responses, so grammatical error detection is a necessary function of a QA system. Because Chinese is the natural language used in our daily life and work, studying the automatic detection of Chinese grammatical errors is of great significance.
Traditionally, research on the automatic diagnosis of grammatical errors has focused mainly on text editing, text recognition, speech input, language learning, and similar applications. In recent years, the technology has also been applied to machine translation, to the preprocessing of input questions and the generation of answers in QA systems, and to many other fields. The methods used in current studies are relatively simple: existing methods are based on rules alone, on statistics alone, or on a single machine learning method, and few methods integrate the three.
To address this problem, this research proposes a method based on the n-gram model, the part-of-speech tags of the words in a sentence, and the structure of the dependency syntax tree, and carries out automatic grammatical error diagnosis from two perspectives: classification and sequence labeling.
The research consists of three parts: corpus analysis and expansion, grammatical error diagnosis based on classification, and grammatical error diagnosis based on sequence labeling. First, the characteristics of the different types of grammatical errors in the corpus were analyzed, and heuristic rules were built from this analysis to expand the corpus. Second, for diagnosis based on classification, sentence-level part-of-speech bigrams and trigrams and a statistical n-gram model over part-of-speech tags were extracted as features to construct base classifiers and ensemble classifiers; a convolutional neural network was then used to build a classification model from a different perspective. Last, for diagnosis based on sequence labeling, the research mainly used features of each sentence's syntactic dependency tree and built a conditional random field (CRF) model to diagnose grammatical errors. This approach not only diagnoses the error type of a sentence but also locates the error within the sentence at the same time.
Among the models constructed, the model based on the linear weighted ensemble classification method achieved the highest F-score (36.28%), and the F-score rose further (to 37.87%) when this model was fused with a rule-based model. The SVM-based method achieved the highest recall (44.11%), and the CRF-based method achieved the highest accuracy (40.00%).
Taking the characteristics of the models above into account, this research constructed an effective model that detects errors in a sentence and classifies each error into the correct type. On this basis, the research implemented a simple platform for the automatic diagnosis of Chinese grammatical errors, which can help optimize both the questions posed by users and the answers generated by a QA system.
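As a rough illustration of two ideas mentioned above, sentence-level part-of-speech bigram and trigram features and a linear weighted ensemble of base classifiers, the following minimal Python sketch may help. It is not the implementation used in this research: the function names, the boundary markers `<s>` and `</s>`, the example tags, the class labels, and the weights are all hypothetical, and the base classifiers are represented only by the probability tables they would output.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def pos_ngrams(tags: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams over a POS tag sequence.

    The sequence is padded with hypothetical boundary markers so that
    sentence-initial and sentence-final tags also appear in n-grams.
    """
    padded = ["<s>"] + list(tags) + ["</s>"]
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


def pos_ngram_features(tags: Sequence[str]) -> Counter:
    """Count POS bigrams and trigrams of one sentence as sparse features."""
    feats: Counter = Counter()
    for n in (2, 3):
        for gram in pos_ngrams(tags, n):
            feats["_".join(gram)] += 1
    return feats


def weighted_ensemble(
    probs: Sequence[Dict[str, float]], weights: Sequence[float]
) -> Tuple[str, Dict[str, float]]:
    """Combine per-classifier class probabilities with a linear weighting.

    probs[i] maps each class label to the probability assigned by base
    classifier i; weights[i] is that classifier's (assumed) weight.
    """
    total = sum(weights)
    combined = {
        label: sum(w * p[label] for w, p in zip(weights, probs)) / total
        for label in probs[0]
    }
    best = max(combined, key=combined.get)
    return best, combined


# Hypothetical POS tags for a three-word sentence.
features = pos_ngram_features(["PN", "VV", "NN"])

# Hypothetical outputs of two base classifiers and their weights.
label, scores = weighted_ensemble(
    [{"correct": 0.6, "error": 0.4}, {"correct": 0.2, "error": 0.8}],
    [1.0, 3.0],
)
```

In practice the weights would presumably be tuned on held-out data, and the feature counts would be fed to the base classifiers rather than used directly.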