
Preliminary Research On Intelligent Retrieval Of Topic-relevant Documents Based On Event Frame

Posted on: 2005-09-14
Degree: Master
Type: Thesis
Country: China
Candidate: P B Wu
Full Text: PDF
GTID: 2168360152967702
Subject: Computer Science and Technology
Abstract/Summary:
The development of Internet technology has opened up a huge information space. How to acquire the topical information a user cares about from this space rapidly, accurately, and completely has become a hot topic in modern information retrieval research. The capability of a retrieval system depends not only on improving the retrieval method but also on gathering and processing topic-relevant information that can serve as an efficient retrieval resource. To improve the quality of Web information retrieval services, this thesis, oriented toward the documents that attract users, focuses on intelligent retrieval of topic-relevant information. The main tasks are evaluating and removing duplicated web pages, gathering event-relevant documents, and extracting and integrating key event information. The main results are as follows. First, the text is precisely extracted from web pages, and feature-code strings used to evaluate duplicated web pages are extracted from that text. On this basis, a fast algorithm is proposed for removing duplicated Chinese web pages at large scale by comparing feature-code strings. The algorithm makes efficient use of the content and structure of the web page text and employs fuzzy matching to complete the task. The approach is optimized for efficiency so that it is suitable for removing duplicated copies of Chinese web pages at large scale. In large-scale testing, the recall rate for duplicated web pages reaches 97.3%, and the precision rate of duplicate removal reaches 99.5%. Second, a retrieval method for event-relevant documents is presented that is based on event frame knowledge and driven by several documents the user has read. In this method, the event frame is used to predict relevant documents, the event body is used to reduce interference from similar events, and the system's evaluation function is improved.
Experiments show that the new method improves the retrieval of event-relevant documents: the F-measure of the new retrieval system increases by 31.5% compared with a system that does not use event knowledge and event body information. Third, a practical system that extracts and integrates key event information is implemented. The new features of the system are as follows: (1) Extraction rules are built from sentence patterns. Event information is extracted directly from texts in which base phrases have been recognized, and temporal phrases (TP) and space phrases (SP) are recognized and normalized separately. Because complex syntactic parsing is skipped, the extraction system is easy to implement. (2) The same event in different documents is linked through its normalized TP and SP, and the information belonging to that event is then merged. (3) Because the discourse structure of event texts is loose, a text is segmented whenever a new event appears in it, so that isolated pieces of event information within a segment can be merged into the event they belong to. Experiments show that the methods and strategies in this thesis are feasible, and that the system largely reaches an internationally advanced level in event extraction.
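As a rough illustration of feature (2) above, the following minimal Python sketch groups event mentions from different documents by their normalized temporal and space phrases and merges the information attached to each group. It is not the thesis's implementation: the `EventMention` class, the `merge_events` function, the slot names, and the sample values are all hypothetical, and the sketch assumes that normalization has already reduced each TP and SP to a plain string that can be compared for equality.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class EventMention:
    """One extracted event mention (hypothetical structure)."""
    doc_id: str
    time: str    # normalized temporal phrase (assumed to be a plain string)
    place: str   # normalized space phrase (assumed to be a plain string)
    facts: dict = field(default_factory=dict)  # slot name -> extracted value


def merge_events(mentions):
    """Group mentions that share (time, place) and merge their facts.

    Returns a dict keyed by (time, place). Each entry lists the source
    documents in input order and, per slot, the set of values seen.
    """
    merged = defaultdict(lambda: {"docs": [], "facts": {}})
    for m in mentions:
        entry = merged[(m.time, m.place)]
        entry["docs"].append(m.doc_id)
        for slot, value in m.facts.items():
            # Keep every distinct value; conflict resolution is out of scope.
            entry["facts"].setdefault(slot, set()).add(value)
    return dict(merged)


# Hypothetical sample data, for illustration only.
mentions = [
    EventMention("doc1", "2005-01-10", "city-A", {"type": "flood"}),
    EventMention("doc2", "2005-01-10", "city-A", {"casualties": "none"}),
    EventMention("doc3", "2005-02-03", "city-B", {"type": "fire"}),
]
events = merge_events(mentions)
```

Exact matching on the normalized pair is the simplest possible linking rule. A real system would probably need to tolerate differing granularity (for example a day versus a month, or a district versus a city), which this sketch does not attempt.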
Keywords/Search Tags:intelligent retrieval, remove duplicated pages, event frame, information extraction, event merging