With the rapid development of the Internet, online information is growing geometrically, and a huge amount of new information is created every day. It is becoming increasingly difficult to retrieve useful information, and the time and energy spent doing so keep increasing. Because online content is dispersed, redundant, and often out of date, traditional ways of accessing information are increasingly unable to meet users' complex and diverse needs. Finding an effective way to connect users with useful information, extracting valuable information from the Internet, and providing knowledge-based services have therefore become major concerns for Internet users. The rise of social Q&A platforms such as Yahoo! Answers provides excellent material for our research. Using a crawler, we collect the questions and answers from 2013 under the Allergies directory, a sub-directory of Diseases & Conditions under Health, and use them to build a Q&A knowledge base that supports three knowledge-based services: search, question-based recommendation, and question answering. The main contents cover the following four points:

(1) Research and realization of the crawler. This paper introduces the working mechanism of crawlers and common solutions to anti-crawler mechanisms, develops a crawler based on Selenium-RC, an open-source automated testing tool released by ThoughtWorks, and uses it to obtain the original data.

(2) Text preprocessing techniques. We study text preprocessing techniques including word segmentation, stop-word removal, stemming, and lemmatization. We also analyze the similarities and differences between stemming and lemmatization, how each technique works, and when each should be used. The raw data is then processed with these techniques.

(3) Similarity algorithms. In order to build the knowledge base and provide knowledge-based services, we study several common similarity algorithms at both the word level and the sentence level, and compute the similarity of the collected items at both levels.

(4) Realization of knowledge-based services. Based on a relational database and Lucene, a full-text search engine toolkit, we build a health-care Q&A knowledge base. We describe the working principles and development status of three knowledge-based services, namely search, question-based recommendation, and question answering, and implement all three.
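
As a rough illustration of the sentence-level similarity step in point (3), the sketch below compares two questions with Jaccard overlap and with cosine similarity over bag-of-words counts. The stop-word list, the tokenizer, and the function names are assumptions made for illustration only; the abstract does not state which similarity algorithms or preprocessing rules were actually used.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "for", "what", "how", "i", "my", "do"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def jaccard(a, b):
    """Jaccard similarity: shared distinct tokens over all distinct tokens."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def cosine(a, b):
    """Cosine similarity between the term-count vectors of two sentences."""
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    q1 = "What is the best treatment for a pollen allergy?"
    q2 = "How do I treat my pollen allergy?"
    print(jaccard(q1, q2), cosine(q1, q2))
```

Note that "treatment" and "treat" are counted as different tokens here; applying the stemming or lemmatization described in point (2) before comparison would let such variants match.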