Research and Implementation of Index Technology in a Domain-Specific Search Engine

Posted on: 2012-05-23  Degree: Master  Type: Thesis
Country: China  Candidate: H J Tang  Full Text: PDF
GTID: 2218330344950307  Subject: Computer application technology
Abstract/Summary:
As an information carrier, the Internet has become an indispensable tool for obtaining information. Thanks to its development, people enjoy the convenience of large-scale information sharing across time and space. However, as the Internet grows, web resources change rapidly and become ever richer in content, and finding useful information quickly and effectively has become a common problem for most Internet users. Universal (general-purpose) search engines emerged to meet this need. Now widely used, they have significantly improved the efficiency of information retrieval. According to the 26th Internet survey by CNNIC (China Internet Network Information Center), search, at 76.3%, is by a clear margin the main way users obtain information from the Internet. In almost all surveys of Internet usage worldwide, search engines rank second only to e-mail. But as the range of available information grows, universal search engines cannot meet users' needs in either retrieval precision or retrieval efficiency when the query concerns a particular subject or topic. The reason is that a universal search engine returns the same results to every user who enters the same keywords. It does not take into account the differences in interests and needs that often exist between users. For example, dentists and ceramics enthusiasts have different concerns when they search for the term "ceramic".
To retrieve information on a particular subject or theme more quickly, accurately and efficiently, it is essential to develop information retrieval systems for specific areas, that is, domain-specific search engines. Compared with a universal search engine, a domain-specific search engine collects only part of the information on the web: it judges the subject relevance of each page and saves only the pages relevant to the predetermined subject. A domain-specific search engine also uses intelligent strategies such as domain-specific knowledge, relevance computation and machine learning to compensate for the shortcomings of universal search engines, such as very large result sets and low relevance. In this way it significantly improves query accuracy and efficiency over a universal search engine. Index technology is one of the core technologies of a search engine, and its quality directly influences the precision and response time of the search engine. It is therefore essential to study search engine index technology. Based on an in-depth study of the relevant index technology, this paper studies the source code of the open source Lucene project by analyzing its architecture, its basic data types, the logical and physical structure of its index database, its indexing mechanism, its control of index weight, and its index optimization. On this basis, this paper takes the computer domain as an example and uses the API provided by Lucene to make several improvements, as follows. First, this paper improves the structure of the index dictionary file.
When users query a computer domain-specific search engine, most of the search terms are computer terminology. If the engine loaded the whole index dictionary file into memory, as Lucene does, retrieval response time would be longer than necessary. If instead the keywords are classified into computer terminology and non-computer vocabulary, stored in two separate index dictionary files, and sorted within each file, then only the dictionary file holding computer terminology needs to be loaded into memory, and the unnecessary retrieval response time is cut. Second, this paper sets a weight on the documents to be indexed. Because Lucene does not target any one subject or topic, its document scoring mechanism is not subject-specific and it does not set a weight on the documents to be indexed. To meet the needs of computer-subject retrieval, a weight must be set according to the type of document, for both indexing and retrieval, so that computer-related documents are effectively weighted higher. This improves the accuracy of retrieving information on computer subjects. Third, this paper changes the indexing method. By default, Lucene builds index files on disk with a single indexer. This causes frequent I/O operations, which make indexing inefficient.
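The first two improvements described above can be illustrated with a minimal sketch. This is not the thesis's implementation and it does not use the Lucene API. The vocabulary in `DOMAIN_TERMS`, the weights in `TYPE_BOOST`, and the names `SplitTermDictionary` and `search` are hypothetical. Loading the non-domain dictionary lazily on a miss is also an assumption, since the abstract only says that the domain dictionary is the one kept in memory.

```python
from bisect import bisect_left

# Hypothetical vocabulary and weights; the abstract does not give actual values.
DOMAIN_TERMS = {"compiler", "kernel", "lucene", "tcp"}
TYPE_BOOST = {"computer": 2.0, "general": 1.0}


class SplitTermDictionary:
    """Two sorted term dictionaries: domain terms in memory, the rest loaded on demand."""

    def __init__(self, postings, domain_terms):
        # postings maps each term to its list of document ids.
        domain = {t: p for t, p in postings.items() if t in domain_terms}
        self._domain = self._build(domain)
        # Stands in for the second dictionary file, which stays on disk at first.
        self._general_source = {t: p for t, p in postings.items() if t not in domain_terms}
        self._general = None
        self.general_loads = 0

    @staticmethod
    def _build(items):
        terms = sorted(items)
        return terms, [items[t] for t in terms]

    @staticmethod
    def _find(table, term):
        # Binary search over the sorted term list.
        terms, posting_lists = table
        i = bisect_left(terms, term)
        if i < len(terms) and terms[i] == term:
            return posting_lists[i]
        return None

    def lookup(self, term):
        hit = self._find(self._domain, term)
        if hit is not None:
            return hit
        if self._general is None:
            # Assumed fallback: load the non-domain dictionary only when needed.
            self._general = self._build(self._general_source)
            self.general_loads += 1
        return self._find(self._general, term) or []


def search(dictionary, doc_types, term):
    """Return matching document ids, highest document-type weight first."""
    hits = dictionary.lookup(term)
    return sorted(hits, key=lambda d: (-TYPE_BOOST.get(doc_types.get(d, "general"), 1.0), d))
```

A query for a domain term is answered from the in-memory dictionary alone, and documents typed as computer-related rank ahead of general ones for the same term.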
To combine the advantages of FSDirectory (a path in the file system) and RAMDirectory (an area in memory), this paper builds index files with a memory-buffered, distributed parallel indexing technique so as to shorten index creation time. Finally, this paper implements the indexer of a domain-specific search engine using the improved methods, builds a full-text retrieval system suited to computer-subject retrieval, and compares it with the Lucene full-text retrieval system to verify the feasibility and effectiveness of the indexer. The results show that, compared with Lucene, the indexer implemented in this paper is better suited to retrieving information on computer subjects in both retrieval response time and accuracy of results, and that its index creation is clearly more efficient than Lucene's.
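The memory-buffered parallel indexing idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the thesis's indexer and not Lucene code. Worker threads in one process stand in for distributed indexers, a plain dictionary stands in for a RAMDirectory buffer, and a single JSON file stands in for the FSDirectory index. The names `index_shard` and `build_index` are hypothetical.

```python
import json
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def index_shard(docs):
    """Build an inverted index for one shard entirely in memory."""
    buffer = defaultdict(list)
    for doc_id, text in docs:
        # Naive whitespace tokenizer, used only for illustration.
        for term in set(text.lower().split()):
            buffer[term].append(doc_id)
    return buffer


def build_index(docs, path, workers=2):
    """Index shards in parallel, merge the memory buffers, then write to disk once."""
    shards = [docs[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        buffers = list(pool.map(index_shard, shards))

    merged = defaultdict(list)
    for buffer in buffers:
        for term, doc_ids in buffer.items():
            merged[term].extend(doc_ids)

    index = {term: sorted(doc_ids) for term, doc_ids in sorted(merged.items())}
    # One write at the end replaces the per-document disk I/O of a single on-disk indexer.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(index, f)
    return index
```

The design point is that disk is touched once, after all in-memory buffers are merged. The sketch makes no claim about actual speedup, which would depend on the runtime and on how the work is distributed.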
Keywords/Search Tags: Domain-specific search engine, Index technology, Lucene, Distributed parallel index, Subject correlation