A Research And Implementation Of Vertical Search Technology In Archives Domain

Posted on: 2012-12-25
Degree: Master
Type: Thesis
Country: China
Candidate: Z H Wang
GTID: 2178330332485790
Subject: Computer application technology
Abstract/Summary:
Archives are important records that concern every country and every individual. Although China has made some progress, the construction of its archival information systems still lags far behind that of developed countries. How to accelerate this construction and raise the utilization of archives is therefore a critical research topic for the nation.

Search engines have won users' favor because they serve requests in real time and extract the information users want, and they have become a major tool for obtaining information. However, general-purpose search engines have broad coverage and return imprecise results, so they cannot satisfy users' needs; vertical search engines emerged and grew rapidly in response. Unlike general search, vertical search targets a specific domain, so it can be more focused and more professional, and it can retrieve deeper information within that domain. Nevertheless, vertical search engines are still not satisfactory, and research on improving them is active worldwide. This thesis studies the characteristics of the archives domain, then studies and improves vertical search technologies based on those characteristics and applies them to the archives domain.

First, a topic crawler based on the characteristics of the archives domain is studied and implemented to collect archival information. Archives are special files with many unique features, such as originality, standardized storage formats, the ability to reproduce history, unified administration, and consistent identification. They are stored on designated websites, which provide access to the public or to specific audiences. As a result, a topic crawler for archives can be restricted to a limited range of sites from which it fetches documents for analysis. A domain-oriented link analysis algorithm is proposed for this purpose. A strategy for finding relevant files by way of irrelevant ones is also given. Files collected by the topic crawler then undergo content analysis, in which weighted keywords are calculated and abstracts are extracted.
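The crawling ideas described above (a crawl restricted to known archive sites, a priority-ordered frontier, and following a limited number of irrelevant pages to reach relevant ones) can be illustrated with a minimal sketch. This is not the thesis's algorithm: the term-overlap `relevance` score, the `threshold` and `max_tunnel` parameters, and the in-memory `PAGES` data standing in for real HTTP fetches are all assumptions made for illustration.

```python
import heapq
from urllib.parse import urlparse

# Hypothetical in-memory "web": url -> (page text, outgoing links).
PAGES = {
    "http://archives.example/index": (
        "archive catalogue of records",
        ["http://archives.example/about", "http://other.example/x"],
    ),
    "http://archives.example/about": (
        "contact us opening hours",
        ["http://archives.example/doc1"],
    ),
    "http://archives.example/doc1": ("archive record file number", []),
    "http://other.example/x": ("archive archive archive", []),
}
TOPIC_TERMS = {"archive", "record", "records"}  # assumed topic vocabulary


def relevance(text, topic_terms):
    """Fraction of words that are topic terms (a stand-in scoring function)."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(1 for w in words if w in topic_terms) / len(words)


def crawl(pages, seeds, allowed_hosts, topic_terms, threshold=0.1, max_tunnel=1):
    """Best-first crawl limited to allowed_hosts.

    Links inherit the score of the page they were found on. A page below
    the threshold is not collected, but its links are still followed as
    long as no more than max_tunnel irrelevant pages occur in a row.
    """
    frontier = [(-1.0, url, 0) for url in seeds]  # (-priority, url, irrelevant hops)
    heapq.heapify(frontier)
    seen = set(seeds)
    collected = []
    while frontier:
        _, url, tunnel = heapq.heappop(frontier)
        if urlparse(url).hostname not in allowed_hosts or url not in pages:
            continue  # outside the archive sites, or not fetchable
        text, links = pages[url]
        score = relevance(text, topic_terms)
        if score >= threshold:
            collected.append(url)
            next_tunnel = 0
        else:
            next_tunnel = tunnel + 1
            if next_tunnel > max_tunnel:
                continue  # too many irrelevant pages in a row: stop here
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link, next_tunnel))
    return collected


found = crawl(PAGES, ["http://archives.example/index"], {"archives.example"}, TOPIC_TERMS)
```

In this toy data the only path to `doc1` runs through the irrelevant `about` page, so the document is reached with `max_tunnel=1` and missed with `max_tunnel=0`; the page on `other.example` is never collected because its host is outside the allowed set.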
An improved TF-IDF (Term Frequency-Inverse Document Frequency) algorithm is used to calculate keyword weights. It takes advantage of instruction documents, which, where they exist, contain very important information such as the keywords and the owner. When such a document exists, the keywords it lists are assigned a weight of 1; otherwise, different weights are assigned according to where a keyword appears: the title, the body, the abstract, or elsewhere. In addition, archives and related files are converted into structured XML files using text analysis technology, so as to supply more accurate search results. Both static and dynamic abstracts are used in the search process to provide more appropriate document summaries. If an archive contains an abstract, it is used as the static abstract. If not, a dynamic abstract is assembled from the sentences that contain the keywords the user entered. Those sentences can be found quickly using the position information in the index. After a search, users can vote on the results, and the votes are used to optimize the system.

Additionally, a vertical search engine for the archives domain is designed and its flow chart is given. The crawler algorithm, the crawl strategies, and the improved TF-IDF are implemented. For comparison, a Best-First Search algorithm and the original TF-IDF algorithm are implemented as well. According to the research and experiments, these improvements yield better results: the topic crawler collects more relevant files, and the indexer calculates keyword weights more precisely. The techniques suggested in this thesis can serve as a reference for the nation's archival information construction and for vertical search in this domain.
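The keyword-weighting rule described above can be sketched as follows. Only two things come from the abstract: keywords declared in an instruction document get weight 1, and other keywords are weighted by where they appear. Everything else is an assumption for illustration, including the `FIELD_WEIGHTS` values, the `declared_keywords` field name, the smoothed IDF formula, and the scaling of weights so that each document's top term has weight 1.

```python
import math
from collections import Counter

# Assumed location weights; the abstract does not give the actual values.
FIELD_WEIGHTS = {"title": 3.0, "abstract": 2.0, "body": 1.0}


def weighted_tf(doc):
    """Term frequency in which each occurrence counts by its field weight."""
    tf = Counter()
    for field, weight in FIELD_WEIGHTS.items():
        for term in doc.get(field, "").lower().split():
            tf[term] += weight
    return tf


def keyword_weights(docs):
    """Return one {term: weight} dict per document, weights in (0, 1]."""
    n = len(docs)
    tfs = [weighted_tf(d) for d in docs]
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())  # document frequency: one count per document

    results = []
    for doc, tf in zip(docs, tfs):
        declared = doc.get("declared_keywords")  # hypothetical field name
        if declared:
            # Keywords from an instruction document get the full weight of 1.
            results.append({k.lower(): 1.0 for k in declared})
            continue
        # Smoothed IDF (an assumed variant) so that no term gets weight zero.
        raw = {t: f * (math.log((1 + n) / (1 + df[t])) + 1) for t, f in tf.items()}
        top = max(raw.values(), default=1.0)
        results.append({t: w / top for t, w in raw.items()})
    return results


docs = [
    {"title": "land deed", "body": "deed for farm land"},
    {"title": "census", "body": "population census record",
     "declared_keywords": ["Census", "Population"]},
    {"title": "tax record", "body": "land tax record"},
]
weights = keyword_weights(docs)
```

In this example `deed` outranks `land` in the first document even though both occur equally often in the same fields, because `land` also appears in another document and so has a lower IDF.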
Keywords/Search Tags: Archives, Vertical Search, Topic Crawler, Query Sort