Primary School And Secondary School Website Education Informationization Topic Discovery And Trend Analysis

Posted on: 2017-02-22
Degree: Master
Type: Thesis
Country: China
Candidate: H J Zhao
GTID: 2357330491452263
Subject: Education Technology
Abstract/Summary:
Education informatization is an important indicator of national and regional education development. With the rapid development of Internet technology and the growing demand for education informatization, China's primary and secondary schools have established websites as platforms for publicity and communication. Quickly and efficiently discovering relevant topics in the massive, frequently updated news reports on school websites, and then continuing to track them, is a current research focus.

Based on Topic Detection and Tracking, this thesis proposes an education informatization topic detection system that can handle big data. The system has two main parts: local event detection and topic detection. The first is a filtering process that uses pattern matching, while the second is an incremental hierarchical clustering process. Each topic contains a series of related documents and is expressed as a piece of potential knowledge. The major contributions of this thesis are:

1. It addresses the collection and storage of large amounts of unstructured data. A Hadoop distributed cluster and a web crawler were set up to handle frequently updated web page data. The distributed cluster and Nutch address the speed of data collection, and HBase makes it easy to store large amounts of unstructured web data.
2. To extract web page information, the thesis proposes a strategy that combines three techniques: the open-source toolkit Jsoup, regular expressions (pattern matching), and a line-block distribution function. Jsoup mainly extracts the title, keywords, and description, while the line-block distribution function mainly handles the time and the body of the page. Each page is represented as a Java class.
3. To address big-data computation, and based on research into and analysis of the MapReduce distributed programming model, the TF-IDF weight calculation, the cosine of the angle between document vectors, and the clustering algorithm are designed to run on the MapReduce programming model, which lays the foundation for topic detection.

Finally, a comparative experiment was carried out between primary and secondary school websites and the Chinese education information website. The experimental results were analyzed in terms of time and of changes in the frequency of topic content. They show that education informatization events appear on primary and secondary school websites slightly later than on the Chinese education information website. The results also show that the proposed method is effective.
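The abstract names TF-IDF weighting, cosine similarity, and incremental clustering as the basis for topic detection, but gives no formulas or code. The following is only a minimal single-machine sketch of how those three pieces can fit together; it is not the thesis's MapReduce implementation, and it uses a flat single-pass clustering rather than the incremental hierarchical clustering the thesis describes. The function names, the smoothed IDF variant, the summed-vector centroid, and the threshold value are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """docs: list of token lists. Returns one sparse {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed variant: relative term frequency times a smoothed IDF.
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    if len(a) > len(b):
        a, b = b, a
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def incremental_cluster(vectors, threshold=0.3):
    """Single-pass clustering: each document joins the most similar existing
    cluster if the similarity reaches the (hypothetical) threshold, otherwise
    it starts a new cluster. Returns lists of document indices."""
    clusters = []   # document indices per cluster
    centroids = []  # summed member vectors per cluster (scale does not affect cosine)
    for index, vector in enumerate(vectors):
        best_cluster, best_sim = None, 0.0
        for cluster_id, centroid in enumerate(centroids):
            sim = cosine(vector, centroid)
            if sim > best_sim:
                best_cluster, best_sim = cluster_id, sim
        if best_cluster is not None and best_sim >= threshold:
            clusters[best_cluster].append(index)
            for term, weight in vector.items():
                centroids[best_cluster][term] = centroids[best_cluster].get(term, 0.0) + weight
        else:
            clusters.append([index])
            centroids.append(dict(vector))
    return clusters


# Toy, made-up token lists standing in for segmented news reports.
docs = [
    ["school", "website", "launch"],
    ["school", "website", "launch", "news"],
    ["exam", "schedule", "released"],
    ["exam", "schedule", "update"],
]
clusters = incremental_cluster(tf_idf_vectors(docs))
print(clusters)
```

Because each document is compared only against the current cluster centroids, new reports can be added without re-clustering earlier ones, which is the property that makes an incremental scheme attractive for frequently updated sites.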
Keywords/Search Tags:Education Information, hot topic detection, Distributed Computation, Big Data technologies