Research On Temporal-Textual Indexes For Web Pages

Posted on: 2012-05-22
Degree: Master
Type: Thesis
Country: China
Candidate: H Chen
Full Text: PDF
GTID: 2178330338992039
Subject: Computer application technology

Abstract/Summary:
With the rapid development of Internet technology, search engines have become an important part of people's daily life and work. However, because Web information is growing rapidly and users' demands on Web search keep increasing, general-purpose search engines cannot fully satisfy users' information retrieval needs. It is therefore urgent to improve both the effectiveness and the efficiency of Web search engines. To this end, much research has focused on the time information in the Web. Most Web pages contain time information in their content, such as business news, publication information, and promotions in online stores, and it is worthwhile to integrate this time information into the Web search process. However, major search engines support search only on the update time (or crawl time) of Web pages. Because the content time of Web pages lies outside their search scope, they cannot adequately handle temporal-textual queries. Hence, building a temporal-textual Web search engine that supports both temporal and textual information is important for improving the efficiency and effectiveness of Web search.

In this thesis, we study the time information and keyword information in the Web environment and propose effective temporal-textual index structures. Based on an analysis of traditional keyword index structures and temporal index structures, we present several candidate temporal-textual index structures. We compare their performance both theoretically and experimentally to identify the best one, which we then improve further by introducing hashing.
As a result, we propose a hash-based temporal-textual index structure, and the experimental results show that it outperforms the previously proposed one.

The main contributions of this thesis can be summarized as follows:

(1) We propose hybrid index structures for temporal-textual Web search. Based on the characteristics of time information in temporal-textual search engines, we divide the time information of Web pages into two parts, namely update time and content time. We then introduce the concept of primary time and take it as the basic element of temporal-textual index structures. We study and compare five different ways of combining the B+-tree, the inverted file, and the MAP21-tree into a hybrid temporal-textual index structure. We conduct experiments on both simulated and real data sets and measure performance in terms of index size, number of page I/Os, and query time. The results show that the "first inverted file then MAP21-tree" index structure has the best query performance and is thus an acceptable choice for indexing temporal-textual information in Web search engines.

(2) We introduce hashing to improve the "first inverted file then MAP21-tree" index structure and present the hash-based temporal-textual index structure. Based on an analysis of Web time characteristics, especially the content time and primary time of Web pages, we convert the original content time, which is a time interval, into a time instant. We then replace the MAP21-tree with a hash table to construct a new temporal-textual index structure with better query performance. We conduct experiments on a real data set and compare index size, rebuild time, and query time across five types of queries. The experimental results demonstrate that the hash-based temporal-textual index structure outperforms the "first inverted file then MAP21-tree" index structure.
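
The layout described in contribution (2), an inverted file whose per-keyword entries point into a hash table keyed by a time instant, can be illustrated with a minimal Python sketch. This is not the thesis's implementation. The class name `HashTemporalTextIndex`, its methods, the day-level granularity of the time instant, and the way a range query enumerates days are all assumptions made for illustration. The abstract also does not say how a content-time interval is reduced to an instant, so the sketch takes the primary time as given.

```python
from collections import defaultdict
from datetime import date, timedelta


class HashTemporalTextIndex:
    """Hypothetical sketch: inverted file first, then a hash table on time."""

    def __init__(self):
        # keyword -> {primary-time instant (assumed: one day) -> set of page ids}
        self._postings = defaultdict(lambda: defaultdict(set))

    def add_page(self, page_id, keywords, primary_time):
        # primary_time is assumed to be already reduced to a single instant.
        for kw in keywords:
            self._postings[kw.lower()][primary_time].add(page_id)

    def query_instant(self, keyword, instant):
        # .get avoids creating empty entries in the defaultdicts on lookup.
        by_time = self._postings.get(keyword.lower(), {})
        return set(by_time.get(instant, set()))

    def query_range(self, keyword, start, end):
        # A hash table has no key order, so this sketch probes each day
        # in the inclusive range [start, end] one by one.
        by_time = self._postings.get(keyword.lower(), {})
        result = set()
        day = start
        while day <= end:
            result |= by_time.get(day, set())
            day += timedelta(days=1)
        return result


index = HashTemporalTextIndex()
index.add_page("p1", ["sale", "laptop"], date(2012, 5, 1))
index.add_page("p2", ["sale"], date(2012, 5, 3))
index.add_page("p3", ["laptop"], date(2012, 5, 3))

print(index.query_instant("sale", date(2012, 5, 3)))
print(index.query_range("sale", date(2012, 5, 1), date(2012, 5, 3)))
```

In this sketch a keyword-plus-instant lookup costs two hash probes, which is one plausible reading of why replacing a tree with a hash table could reduce query time. The cost of a range query grows with the number of days probed, a trade-off the abstract does not discuss.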
Keywords/Search Tags: Web search, temporal information, hybrid index structure, temporal-textual query