With the development of Web 2.0, interaction on the Internet has become increasingly convenient and popular. A growing number of users pose their questions in natural language, and other users respond in natural language. Askers usually have to wait for answers. Because similar questions may already have been asked and answered in some other form, it is valuable to recommend suitable answers to a new question quickly by reusing existing question-answer pairs. A system built for this purpose is called community question answering (CQA) based on question-answer pairs. It has become a valuable application and has attracted growing attention.

To address the shortcomings of current CQA research, this thesis applies comprehensive information theory to develop key CQA techniques, including similar question retrieval and candidate answer ranking, and builds a comprehensive information based CQA system with these techniques. The main contributions of the thesis are as follows.

A question similarity model based on the similarity transferring assumption is proposed. The assumption can be stated as "if the questions are similar, so are their answers". Existing work usually takes this assumption as a default premise and does not check whether it holds. In this thesis, the assumption is used for the first time as a constraint for obtaining a good similarity measure, and an average correlation index based on the Pearson correlation coefficient is defined to measure the extent to which the assumption is satisfied.
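One way to read the average correlation index described above is sketched below. The abstract does not give the exact definition, so this is an assumption: for each query question, the Pearson correlation is computed between its question-similarity scores and the corresponding answer-similarity scores, and the per-query correlations are averaged. The function names `pearson` and `average_correlation` and the input layout are hypothetical.

```python
from math import sqrt
from typing import List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Assumption: constant scores carry no evidence, so count them as 0.
        return 0.0
    return cov / sqrt(var_x * var_y)


def average_correlation(
    per_query_scores: List[Tuple[Sequence[float], Sequence[float]]]
) -> float:
    """Mean Pearson correlation over queries.

    Each item pairs the question-similarity scores of one query against its
    candidate questions with the similarity scores of the matching answers.
    """
    if not per_query_scores:
        raise ValueError("need at least one query")
    correlations = [pearson(q_sims, a_sims) for q_sims, a_sims in per_query_scores]
    return sum(correlations) / len(correlations)
```

Under this reading, a value close to 1 means that question pairs ranked as similar also have similar answers, which is what the similarity transferring assumption requires.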
Results show that models with a higher average correlation tend to achieve higher precision, and the best model increases the average correlation by 16.79%. This indicates that constraining a similarity measure to satisfy the similarity transferring assumption helps to obtain a better measure.

A question similarity model based on comprehensive information is proposed. The model integrates syntactic, pragmatic, and semantic information into the similarity measure. Based on the results of the integrated models, a simplified question retrieval model, VSM expanded with word2vec, is further proposed to reduce the parameter training work. Among the integrated models based on comprehensive information, the best model achieves an avgP@1 (average precision at 1) of 0.4586, an increase of 8.986% over VSM. The result shows that lexical features augmented with dictionary expansion features and part of the output of a sentence parser yield better performance. Of these two kinds of beneficial features, the former contributes more. The best simplified model increases avgP@3 by 33.60%.

A candidate answer ranking model based on pragmatic information is proposed. In CQA, the authority of the answerer has a direct impact on the quality of the answer, and the evaluations that other users give to an answer also help to predict its quality. Both kinds of information are user-related pragmatic information, and the model integrates them with contextual features to rank candidate answers. Results on the Yahoo! Answers dataset show that pragmatic information markedly improves the ranking performance of the basic ranking model, which uses contextual features only. The improvement grows and then converges as the model parameters are adjusted. The larger the number of candidate answers, the greater the improvement in ranking precision.
Models that integrate both kinds of pragmatic information outperform models that integrate only one kind. The best integrated model improves on the two single-information models by 50% and 40%, respectively.

A prototype system for comprehensive information based community question answering is designed and developed. The question similarity models and candidate answer ranking models are integrated into the system to recommend suitable answers to questions.

When a completely new question is posed, information from outside the CQA archive is required, so the closed CQA system should be extended into an open one. Question intent understanding is one of the most important techniques for achieving this goal. A comprehensive information based question intent understanding method is proposed. It decomposes the abstract question intent into four elements: question type, question keywords, question focus, and question domain. Methods of mining the four elements are further discussed.
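The four-element decomposition of question intent can be pictured as a small record type. The sketch below is illustrative only: the class name `QuestionIntent`, the field names, and the example values are hypothetical, and the thesis's actual mining methods are not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QuestionIntent:
    """Container for the four intent elements named in the abstract."""

    question_type: str = ""                             # e.g. a "how-to" question
    keywords: List[str] = field(default_factory=list)   # question keywords
    focus: str = ""                                     # what the asker wants to know
    domain: str = ""                                    # topical domain of the question

    def missing_elements(self) -> List[str]:
        """Names of the elements that have not been filled in yet."""
        values = {
            "question_type": self.question_type,
            "keywords": self.keywords,
            "focus": self.focus,
            "domain": self.domain,
        }
        return [name for name, value in values.items() if not value]


# Hand-filled example for "How do I reset a router password?"
# (all labels below are made up for illustration).
intent = QuestionIntent(
    question_type="how-to",
    keywords=["reset", "router", "password"],
    focus="reset procedure",
)
print(intent.missing_elements())
```

A record like this makes it explicit which elements a mining step has already produced and which still have to be extracted before an open-domain lookup.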