Design And Implementation Of News Search Engine Based On MySQL

Posted on: 2014-08-01
Degree: Master
Type: Thesis
Country: China
Candidate: J X Chen
Full Text: PDF
GTID: 2308330464457786
Subject: Software engineering
Abstract/Summary:
With the rapid development of modern information technology, the amount and variety of information on the Internet are growing explosively, which greatly facilitates people's daily life, work, and study. At the same time, this explosion of information has brought new problems: how to manage such a vast amount of information in a unified way, how to index distributed resources, and how to accurately retrieve the information people need from the mass of Web resources. Search engines are the key technology for solving these problems, but a traditional general-purpose search engine collects all kinds of information without selection and serves users at every level. This attempt to cover every topic on the Web faces a great challenge as the volume of information keeps growing, and a real breakthrough is hard to achieve. The interests of ordinary users, however, tend to be concentrated in a narrow range of topics, so search engine researchers have created specific-purpose search engines that focus on the particular needs of particular domains. Unlike a traditional general-purpose search engine, a specific-purpose search engine collects only topic-related information from the Internet: content analysis determines whether a page is related to the target topic, and only relevant pages are passed on for further processing. As a result, a specific-purpose search engine consumes fewer resources and achieves considerably higher query accuracy than a traditional general-purpose search engine. This paper studies specific-purpose search engines, taking news as the target topic. The work included a thorough theoretical study of the key search engine technologies, which gave a deeper understanding of the field.
In this paper, the Sina News website is chosen as the entry point for the web crawler, whose main goal is to collect a large corpus of news pages. Collection is done by a specific-purpose web crawler. It starts from the website's index page, downloads each page to the local disk, and extracts all the news links through content analysis. The newly found links are appended to the URL queue to be crawled later. A depth-first algorithm is used to traverse the whole website, with the crawl depth set to three. Because the collected original pages contain a lot of HTML tags and JavaScript code, a purification algorithm is then applied to extract the useful information. Finally, the search engine indexer builds a database known as the inverted index. Because a search engine ultimately serves ordinary users, it should also provide a well-designed web page and a query interface with a good user experience. This paper describes in detail the design and implementation of the web crawler, web information extraction and purification, and the construction of the inverted index. These technologies are currently hot topics in natural language processing and artificial intelligence. The search engine was built starting from the simplest technologies, with the more complex modules implemented gradually. The final experimental results show that the system achieves a certain degree of accuracy and gives good results.
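The pipeline described above (depth-limited depth-first crawl, purification of HTML and JavaScript, inverted index construction) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the names `crawl_depth_first`, `parse_page`, `build_inverted_index`, `search`, and the in-memory `FAKE_SITE` are all hypothetical, the `fetch` callable stands in for a real HTTP download, and the index is kept in a dictionary rather than in MySQL. The sketch also assumes that "depth three" means pages up to three links away from the index page, and it splits text on word characters, which would have to be replaced by a proper Chinese word segmenter for Sina News pages.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class LinkAndTextParser(HTMLParser):
    """Collects link targets and visible text, skipping script/style bodies."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def parse_page(html):
    """Purification step: return (links, plain_text) for one HTML page."""
    parser = LinkAndTextParser()
    parser.feed(html)
    parser.close()
    return parser.links, " ".join(parser.text_parts)


def crawl_depth_first(start_url, fetch, max_depth=3):
    """Depth-first traversal from start_url; returns {url: cleaned_text}."""
    pages = {}
    visited = set()
    stack = [(start_url, 0)]  # LIFO stack gives depth-first order
    while stack:
        url, depth = stack.pop()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        html = fetch(url)
        if html is None:  # page could not be downloaded
            continue
        links, text = parse_page(html)
        pages[url] = text
        # Push in reverse so the first link on the page is explored first.
        for link in reversed(links):
            stack.append((link, depth + 1))
    return pages


def build_inverted_index(pages):
    """Map each lower-cased term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every term of the query (boolean AND)."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical in-memory site used instead of real network access.
FAKE_SITE = {
    "/index": '<script>var x = "ignored";</script><a href="/a">A</a><a href="/b">B</a>',
    "/a": "<p>Stock markets rally</p><a href='/a1'>more</a>",
    "/a1": "<p>Markets close higher</p><a href='/a2'>more</a>",
    "/a2": "<p>Deep page level three</p><a href='/a3'>more</a>",
    "/a3": "<p>Too deep to be crawled</p>",
    "/b": "<p>Football results and markets</p><a href='/index'>home</a>",
}

if __name__ == "__main__":
    pages = crawl_depth_first("/index", FAKE_SITE.get, max_depth=3)
    index = build_inverted_index(pages)
    print(sorted(pages))
    print(sorted(search(index, "markets")))
```

In this sketch the page `/a3` is four links away from `/index` and is therefore never fetched, and the JavaScript inside the `<script>` tag never reaches the index. A relational design, as the thesis title suggests, would presumably store the term-to-document postings in a MySQL table instead of a Python dictionary; that schema is not given in the abstract.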
Keywords/Search Tags: Information Retrieval, Web Spider, Inverted Index, Specialized Search Engines