The Design And Implementation Of Thematic Crawler Based On The Subject Ontology Of Basic Education

Posted on: 2018-04-15
Degree: Master
Type: Thesis
Country: China
Candidate: Z P Huang
Full Text: PDF
GTID: 2438330518992499
Subject: Education Technology
Abstract/Summary:
The number of web pages on the Internet is growing at an alarming rate, and the gap between the growth of online information and users' capacity to access it is increasingly prominent. Traditional search engines no longer meet users' need for high-quality, personalized information retrieval. Vertical search engines emerged for this reason: they are designed to retrieve information in a specific domain and can return more comprehensive results within it. The core of a vertical search engine is the focused crawler. Web resources for basic education now exist in considerable number, but crawling them with a traditional web crawler does not give satisfactory results. Through the study of focused crawling methods, I therefore explore the implementation of a focused crawler based on a basic education subject ontology, so that the crawler can automatically index basic education resources scattered across different websites and provide a high-quality resource retrieval service for teachers, students, and researchers.

This thesis uses junior high school physics as an example. First, I explore how to build the junior high school physics ontology. Drawing on the "seven steps" method proposed by Stanford, I propose a construction method that divides the building of a subject ontology into two parts: frame construction and entity extraction. The thesis then introduces the design of the focused crawler based on this ontology. The crawler is developed on top of the open source web crawler Nutch, and the code implementing the proposed focused crawling method is added to the core of the crawler's ParserJob module. By analyzing link anchor text and calculating web page similarity, the method filters web pages and fetches only those that are needed. Link anchor text analysis is implemented by keyword matching; the keyword sets used in this study are constructed by semantic extension of the ontology. Similarity calculation is implemented with the naive Bayes algorithm, and the ontology is used to filter feature terms, which reduces the term space and optimizes the Bayes algorithm.

Experiments show that the average accuracy of the ontology-based crawler implemented in this thesis reaches 68%, higher than that of both the crawler based on the Bayes algorithm and a conventional crawler. An analysis of the recognition accuracy for each theme shows that the crawler's performance is stable: the accuracy differs between themes by no more than 5%. An analysis of the pages fetched by the crawler shows that most of them contain courseware, lesson plans, or exercises.
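The two filtering steps described in the abstract (keyword matching on link anchor text, then naive Bayes classification over ontology-filtered feature terms) can be illustrated with a minimal Python sketch. This is not the thesis code, which lives in Nutch's ParserJob module. The term set `ONTOLOGY_TERMS`, the English tokenizer, the class labels, and the choice of a multinomial model with Laplace smoothing are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical ontology-derived term set (illustrative, not from the thesis).
ONTOLOGY_TERMS = {"force", "velocity", "circuit", "voltage", "lens", "pressure"}


def tokenize(text):
    """Lowercase and split on non-letters (assumes English text)."""
    return re.findall(r"[a-z]+", text.lower())


def anchor_matches(anchor_text, keywords=ONTOLOGY_TERMS):
    """Link anchor analysis by keyword matching: follow the link only if
    the anchor text contains at least one ontology keyword."""
    return any(tok in keywords for tok in tokenize(anchor_text))


class OntologyNaiveBayes:
    """Multinomial naive Bayes whose feature space is restricted to
    ontology terms (the term space reduction step)."""

    def __init__(self, vocabulary):
        self.vocab = set(vocabulary)
        self.class_docs = Counter()             # documents seen per class
        self.term_counts = defaultdict(Counter)  # term counts per class

    def _features(self, text):
        # Discard every token that is not an ontology term.
        return [t for t in tokenize(text) if t in self.vocab]

    def fit(self, docs, labels):
        for text, label in zip(docs, labels):
            self.class_docs[label] += 1
            self.term_counts[label].update(self._features(text))
        return self

    def predict(self, text):
        feats = self._features(text)
        if not feats:
            return None  # assumption: no ontology evidence, no decision
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for t in feats:
                # Laplace-smoothed log likelihood of each feature term.
                score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (made up for the example).
clf = OntologyNaiveBayes(ONTOLOGY_TERMS).fit(
    [
        "force and pressure on a lens",
        "voltage across the circuit",
        "blood pressure tips",
        "tire pressure guide",
    ],
    ["relevant", "relevant", "irrelevant", "irrelevant"],
)
```

In this sketch a crawler would first call `anchor_matches` on each outgoing link and then call `clf.predict` on the fetched page text, keeping only pages labelled `"relevant"`. Restricting features to ontology terms keeps the per-class count tables small, which is the effect the abstract calls term space reduction.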
Keywords/Search Tags: Focused crawler, Basic education subject ontology, Naive Bayes algorithm