
A Lightweight Search Engine Based On Text-mining

Posted on: 2016-08-20
Degree: Master
Type: Thesis
Country: China
Candidate: C Liu
GTID: 2308330461468316
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of Internet technology, the amount of information on the Internet keeps growing, and how to use these information resources more effectively has attracted increasing attention. Internet information comes from a wide range of sources and in many forms, such as text, images, video and audio. Faced with massive information from different sources and in different forms, finding the information we need accurately and quickly has become a problem, so developing a search engine is necessary. A search engine is an information retrieval system that helps users quickly find information useful to them. It uses a web crawler to fetch pages, extracts information from those pages, and stores it in an index database; a query interface then retrieves the information held in that index. As an Internet application, the search engine is crucial to the development of the Internet and has become a strategic high ground in many fields. Search is one of the few technologies regarded as a core technology of the Internet.
The most successful commercial search engine is Google in the United States, followed by Baidu, the largest Chinese search engine company. Bing, developed through the cooperation of Microsoft and Yahoo, accounts for a 30% market share in the United States. In China, 360's market share is second only to Baidu's; 360 is a newcomer that emerged suddenly in the search engine field. The search engines listed above are the mainstream, successful and mature commercial products in this area. Currently, the Memcached systems used by the major search engines have hit a major performance bottleneck, and Memcached clusters numbering in the thousands are common at some companies.
Memcached and Redis are both memory-based. Redis acts more like a database and supports more data types, while Memcached is purely a cache. In terms of per-core performance, when the data set is not very large, Redis performs better, because Redis is single-threaded and uses only one core. Memcached is multi-threaded, so its per-core performance does not match that of Redis. Unlike Memcached, Redis does not use libevent. Memcached, through libevent, builds in a great deal of generality to serve common use cases, at the cost of a larger code base (Redis's code is less than one third the size) and lost performance on particular platforms. Redis instead adapted two files from libevent to implement its own epoll event loop. Therefore, in this thesis Redis serves more as a database than as a cache system. With small data volumes and a single-threaded model, its efficiency is far ahead of the traditional Memcached memory object caching system.
The engine described in this thesis is a directory search engine, designed to address caching and performance optimization. It provides a site search service for an educational resources website for primary and secondary schools. The optimizations, from the architecture design down to the underlying details at each level, are as follows:
(1) Use the TCP protocol and the epoll multiplexer. Traditional search engine servers on POSIX platforms adopt the UDP protocol and the poll multiplexer; their connections are unstable, their data transmission is unreliable, and handling tasks consumes many system resources because of frequent copying between user space and kernel space. The connection used by the search engine in this thesis is reliable and stable, and monitoring connection events with epoll greatly reduces the server load.
This makes it better than traditional search engines at the architecture and execution level.
(2) Adopt a query-word correction mechanism for error correction and associated recommendations. It provides a candidate word set to reduce training cost and improve user efficiency, whereas the traditional search engine provides no such function.
(3) Use a multiple-index coordination mechanism instead of the traditional single index. It can rapidly and precisely locate a web document in the web page library, and it also makes computing the relevance of web page documents during text mining more efficient.
(4) Use Redis instead of the traditional Memcached system to handle the query-history cache. Redis is memory-based, whereas Memcached is a distributed caching system; Redis performs better than Memcached when handling this cache.
(5) Use a hash map instead of the traditional map to handle data, which greatly reduces the cost of data storage and lookup.
The experimental data is the Fudan University corpus. The test method combines longitudinal and horizontal comparison tests. The average service time falls from 5 ms to less than 1 ms, and the reported performance is nearly 100 times better than the previous average result.
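The candidate word set in item (2) can be illustrated with a minimal Python sketch. The abstract does not say how the thesis generates candidates, so everything here is an assumption: the sketch uses Levenshtein edit distance, and the names `edit_distance`, `candidate_words` and `VOCABULARY`, the frequencies, and the thresholds are all hypothetical. The vocabulary is a Python `dict`, which is a hash map, in the spirit of item (5).

```python
from typing import Dict, List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, computed with a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def candidate_words(query: str, vocabulary: Dict[str, int],
                    max_distance: int = 2, limit: int = 5) -> List[str]:
    """Return up to `limit` vocabulary words within `max_distance` edits of
    `query`, closest first; ties go to the more frequent word."""
    scored = []
    for word, freq in vocabulary.items():
        d = edit_distance(query, word)
        if d <= max_distance:
            scored.append((d, -freq, word))
    scored.sort()
    return [word for _, _, word in scored[:limit]]


# Hypothetical vocabulary mapping word -> made-up frequency (assumption, not thesis data).
VOCABULARY = {"search": 120, "engine": 95, "english": 60, "redis": 40, "reading": 30}
```

With this vocabulary, `candidate_words("serch", VOCABULARY)` returns `["search"]`. Scanning the whole vocabulary is linear in its size, so a real engine would likely narrow the candidates first, for example with an n-gram index; that refinement is left out to keep the sketch short.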
Keywords/Search Tags:Linux, search engine, TCP, thread_pool, Redis