
Research And Implementation On Search Engine Based On Internet In Network Educational Resource Management System

Posted on: 2005-01-21
Degree: Master
Type: Thesis
Country: China
Candidate: A J Yu
Full Text: PDF
GTID: 2168360125950934
Subject: Computer applications
Abstract/Summary:
NERMS (Network Educational Resource Management System) is a major project sponsored by the Science Committee of Jilin Province and assigned to the Knowledge Engineering Lab of the Institute of Computer Science and Technology at Jilin University. The aim of the project is to organize and manage various kinds of educational resources effectively, so that people can share and obtain them efficiently, and thereby to speed up the development of network education. The search engine designed and implemented here is part of NERMS, and it can extend the educational resources dynamically. We explore the engineering details and algorithmic issues behind search engines.

First, we introduce the basic concepts and main techniques of search engines and describe the general architecture of a search engine. In the second section we study the key techniques of crawling the Web and how to improve the crawling program, called a spider. In the third section the most important components, the Parser and the Indexer, are discussed. Finally, we discuss the problems involved in querying and the techniques used to rank results. The paper mainly covers three subjects:

Crawling the Web: the spider

A spider collects pages from the Internet over the HTTP protocol. It begins its search with one or more popular known pages, which may be hand-selected by a human guide. When a new page is retrieved, the robot extracts all URLs in the new page and adds them to its growing URL database. Thereafter, the spider "automatically traverses the Web's hypertext structure by retrieving a document and recursively retrieving all documents that are referenced by URL in the retrieved document."

Firstly, the structure of the Web is both complex and non-uniform. On the Internet there are many types of pages, such as plain text, HTML, and XML. Plain-text documents are easy to index, and XML documents are well structured and can also be processed easily. Most pages on the Internet, however, are HTML pages, and although these documents are HTML-tagged, the syntax of HTML is not strict; for example, a tag may lack its end tag and instead be terminated implicitly by another tag.

Secondly, there are many active pages, which contain forms, JavaScript, and other active features, so a spider is required to handle the forms in a page; in other words, it must be able to post a form or execute JavaScript code. In order to maintain a session between the browser and the Web server on top of the stateless HTTP protocol, the spider has to read and store the cookies returned from the Web server. The spider should not store a cookie permanently, because the next time it visits the page the session will be a new one.

Thirdly, some sites require the connection between browser and server to be based on HTTPS, so the spider should be able to construct an HTTPS connection to the server.

Finally, the spider should be high-performance, so the multithreading technique was adopted here. A spider has many SpiderWorkers; each SpiderWorker extends the Thread class, so the SpiderWorkers can run concurrently. When a worker completes a task, it is assigned another task by the spider. Two illustrative sketches of these mechanisms follow below.
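As a minimal sketch of the multithreaded crawling design just described, under the assumption of a shared synchronized URL frontier (the class name Spider and methods such as nextTask and addUrls are illustrative, not taken from the NERMS source):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.HashSet;
    import java.util.LinkedList;
    import java.util.Queue;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Hypothetical sketch of the spider's worker pool; names are illustrative.
    public class Spider {
        private final Queue<String> frontier = new LinkedList<>();
        private final Set<String> seen = new HashSet<>();

        public Spider(String... seedUrls) {
            for (String u : seedUrls) { frontier.add(u); seen.add(u); }
        }

        // Called by a worker that has finished its task: hand out the next URL.
        public synchronized String nextTask() {
            return frontier.poll();
        }

        // Called by a worker after parsing a page: enqueue newly found URLs.
        public synchronized void addUrls(Iterable<String> urls) {
            for (String u : urls) if (seen.add(u)) frontier.add(u);
        }

        public void crawl(int workerCount) throws InterruptedException {
            Thread[] workers = new Thread[workerCount];
            for (int i = 0; i < workerCount; i++) {
                workers[i] = new SpiderWorker(this);
                workers[i].start();
            }
            for (Thread w : workers) w.join();
        }
    }

    // Each SpiderWorker extends Thread, so the workers run concurrently.
    class SpiderWorker extends Thread {
        private static final Pattern HREF =
            Pattern.compile("href=\"(http[^\"]+)\"", Pattern.CASE_INSENSITIVE);
        private final Spider spider;

        SpiderWorker(Spider spider) { this.spider = spider; }

        @Override public void run() {
            String url;
            // Simplification: a worker exits when the frontier is momentarily
            // empty; a production spider would block until new URLs arrive.
            while ((url = spider.nextTask()) != null) {
                try {
                    StringBuilder page = new StringBuilder();
                    BufferedReader in = new BufferedReader(
                        new InputStreamReader(new URL(url).openStream(), "UTF-8"));
                    String line;
                    while ((line = in.readLine()) != null) page.append(line).append('\n');
                    in.close();
                    // Extract all URLs in the new page and add them to the database.
                    Set<String> found = new HashSet<>();
                    Matcher m = HREF.matcher(page);
                    while (m.find()) found.add(m.group(1));
                    spider.addUrls(found);
                } catch (Exception e) {
                    // A dead link or network error must not stop the worker.
                }
            }
        }
    }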
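The session-cookie and HTTPS requirements above can be illustrated with the standard java.net API; the SessionFetcher class below is a hypothetical sketch, not the thesis implementation. Because openConnection() returns an HttpsURLConnection for https:// URLs, the same code covers sites that require HTTPS, and because the cookie jar is an in-memory map, cookies live only for the current crawl session, as the text requires.

    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative sketch: per-session cookie handling over HTTP/HTTPS.
    public class SessionFetcher {
        private final Map<String, String> cookieJar = new HashMap<>();

        public String fetch(String url) throws Exception {
            // For an https:// URL this is in fact an HttpsURLConnection.
            HttpURLConnection conn =
                (HttpURLConnection) new URL(url).openConnection();
            if (!cookieJar.isEmpty()) {
                StringBuilder header = new StringBuilder();
                for (Map.Entry<String, String> c : cookieJar.entrySet()) {
                    if (header.length() > 0) header.append("; ");
                    header.append(c.getKey()).append('=').append(c.getValue());
                }
                conn.setRequestProperty("Cookie", header.toString());
            }
            conn.connect();
            // Read the Set-Cookie headers returned by the server and keep
            // them in memory for the rest of this session only.
            List<String> setCookies = conn.getHeaderFields().get("Set-Cookie");
            if (setCookies != null) {
                for (String c : setCookies) {
                    String pair = c.split(";", 2)[0];
                    int eq = pair.indexOf('=');
                    if (eq > 0) cookieJar.put(pair.substring(0, eq), pair.substring(eq + 1));
                }
            }
            return new String(conn.getInputStream().readAllBytes(), "UTF-8");
        }
    }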
Parsing and indexing documents

Once a database of pages has been accumulated, the Parser begins to parse the documents and then the Indexer indexes them. In this process an HTML document must be parsed first; after that, the text extracted from the HTML page must be pre-processed, and then, if the document is a Chinese page, the Chinese words must be extracted. Once the words in a document are available, the Indexer selects keywords as index entries.

Firstly, since Web pages are HTML-tagged, the parser can use these tags as guides. It may generate index terms from the components of the page itself, e.g., a component tagged "title" or a component labeled "description." The parser may also take HTML tags into account when weighting index terms, e.g., by giving a higher weight to a keyword in a component with the tag "title" (a small sketch of such weighting follows below).

Secondly, the parser uses conventional stemming techniques to normalize index terms in English and word segmentation to extract words from Chinese documents (a sketch of a baseline segmentation method also follows below).

Two parts compose...
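As an illustration of the tag-aware term weighting described above (the class name TagWeighter and the boost factor are assumptions for this sketch; the thesis does not state its actual weights):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Illustrative sketch of tag-aware weighting: a word found in the
    // <title> component counts more than the same word in the body text.
    public class TagWeighter {
        private static final Pattern TITLE =
            Pattern.compile("<title>(.*?)</title>",
                            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        private static final double TITLE_BOOST = 5.0;  // assumed value

        public static Map<String, Double> weigh(String html) {
            Map<String, Double> weights = new HashMap<>();
            Matcher m = TITLE.matcher(html);
            String title = m.find() ? m.group(1) : "";
            String body = html.replaceAll("<[^>]*>", " ");  // strip all tags
            // Title words get the boost; they are also counted again below
            // at weight 1.0, since stripping tags leaves the title text in.
            for (String w : title.toLowerCase().split("\\W+"))
                if (!w.isEmpty()) weights.merge(w, TITLE_BOOST, Double::sum);
            for (String w : body.toLowerCase().split("\\W+"))
                if (!w.isEmpty()) weights.merge(w, 1.0, Double::sum);
            return weights;
        }
    }

A term occurring in the title therefore always outweighs the same term occurring only in the body, which matches the weighting policy the abstract describes.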
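The abstract does not name the segmentation algorithm used for Chinese pages. As a baseline illustration only, the sketch below implements forward maximum matching, the classic dictionary-based method that repeatedly takes the longest dictionary word starting at the current position (the tiny dictionary here is hypothetical):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Illustrative sketch of forward maximum matching (FMM) segmentation;
    // not necessarily the method used in the thesis.
    public class Segmenter {
        private final Set<String> dict;
        private final int maxLen;

        public Segmenter(Set<String> dict) {
            this.dict = dict;
            int m = 1;
            for (String w : dict) m = Math.max(m, w.length());
            this.maxLen = m;
        }

        public List<String> segment(String text) {
            List<String> words = new ArrayList<>();
            int i = 0;
            while (i < text.length()) {
                int end = Math.min(i + maxLen, text.length());
                // Shrink the window until it matches a dictionary word;
                // fall back to a single character if nothing matches.
                while (end > i + 1 && !dict.contains(text.substring(i, end))) end--;
                words.add(text.substring(i, end));
                i = end;
            }
            return words;
        }

        public static void main(String[] args) {
            Set<String> dict = new HashSet<>(
                Arrays.asList("网络", "教育", "资源", "管理", "系统"));
            // Prints: [网络, 教育, 资源, 管理, 系统]
            System.out.println(new Segmenter(dict).segment("网络教育资源管理系统"));
        }
    }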
Keywords/Search Tags: Implementation