
The Research And Implementation Method For The Answer Source On Open-domain Question Answering System

Posted on: 2013-04-13; Degree: Master; Type: Thesis
Country: China; Candidate: J Y Jia; Full Text: PDF
GTID: 2248330371490547; Subject: Computer application technology
Abstract/Summary:
Today, the Internet offers a wide and rich body of knowledge resources that provide help and information when we encounter problems in daily work and study. The main way to obtain information from the Internet is through search engines such as Google and Baidu. However, as users' requirements for accuracy and timeliness have risen, these traditional search engines have exposed some drawbacks. For example, they analyze the user's question as a combination of keywords, which can cause the results to deviate from what the user intended. Moreover, what they return is a large collection of web pages that the user must sift through, whereas users hope to get accurate and concise answers. Building on traditional search engines, the new generation of automatic question answering systems has become a research focus and trend in the field of information retrieval because of its high efficiency. On the one hand, users can conveniently pose questions in natural language. On the other hand, what is returned to the user is the final answer. Such systems therefore have high theoretical research value and broad application prospects.

An automatic question answering system mainly consists of three parts: question analysis, information retrieval, and answer extraction. Answer extraction is the last key step, and how well it is performed directly affects the system's accuracy and efficiency. This thesis focuses on that last step, specifically on how to obtain the answer source. Building on the results of previous researchers, we study how to crawl web pages, how to remove duplicate web pages, and how to extract information from web pages. The main research results are as follows:

(1) Given a user's question, we search for the corresponding answers on the Web. Using a traditional search engine as the platform, we save the relevant web pages locally. In our experimental design, taking the Baidu Zhidao knowledge base as the basis, a crawler program with a corresponding crawling algorithm follows URL links in breadth and depth to fetch a certain number of web pages. These pages form the answer-source library for the subsequent information extraction.

(2) When fetching documents from web pages, identical or similar pages on the Web increase the system's cost and reduce its efficiency. Drawing on existing research on web page deduplication, we introduce methods based on text blocks, the shingle method, and methods based on statistics over the page collection, and we give evaluation criteria.

(3) When extracting information from a web document, we filter out HTML tags, advertisements, pictures, and other useless information. Exploiting the semi-structured nature of HTML pages, we use the node structure of the DOM tree to represent the page content and extract information from the document, in preparation for the answer extraction that follows. We design the extraction scheme and give related explanations.
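
The abstract names the shingle method for detecting near-duplicate pages but does not give the thesis's implementation. As an illustration only, the following is a minimal Python sketch of the standard idea: represent each page's text as a set of k-word shingles and treat two pages as duplicates when the Jaccard similarity of their sets reaches a threshold. The function names, the shingle size `k = 3`, and the threshold `0.8` are assumptions chosen for this example, not values from the thesis. The sketch splits on word characters, so Chinese text such as Baidu Zhidao pages would first need word segmentation or character-level shingles.

```python
import re


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of k-word shingles (contiguous word windows) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Text shorter than one window: use the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(pages: list[str], k: int = 3, threshold: float = 0.8) -> list[int]:
    """Return indices of pages kept after dropping near-duplicates of earlier pages.

    k and threshold are illustrative defaults (assumptions), not thesis values.
    """
    kept: list[int] = []
    kept_shingles: list[set] = []
    for idx, text in enumerate(pages):
        current = shingles(text, k)
        # Keep the page only if it is dissimilar to every page kept so far.
        if all(jaccard(current, other) < threshold for other in kept_shingles):
            kept.append(idx)
            kept_shingles.append(current)
    return kept
```

This pairwise comparison is quadratic in the number of pages, which is acceptable for a small crawl. Larger collections usually hash or sample the shingle sets rather than comparing them in full.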
Keywords/Search Tags: automatic question answering system, getting the answer source, Web crawlers, removing repetitive web pages, information extraction, DOM tree