
Research On The Key Technology Of Ajax Depth Information Acquisition And Clustering

Posted on: 2016-08-28
Degree: Master
Type: Thesis
Country: China
Candidate: D M You
Full Text: PDF
GTID: 2308330467972458
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet, people's sources of knowledge are no longer limited to books and classrooms. A growing number of online course resources, such as Netease Open Class and MOOCs, have entered modern life, and viewers can communicate with each other and share their perspectives after watching the videos. Studying information collection and text clustering algorithms is therefore important for improving the quality of open class construction. This thesis addresses these two key technologies: information collection and text clustering. An analysis of the characteristics of these websites shows that reviews are presented in Ajax dynamic pages rather than in traditional static HTML pages. The review texts are highly colloquial and their implicit topics are widely dispersed, which poses new challenges for traditional information collection and text clustering techniques.

The main work of this thesis is as follows. Firstly, for information collection, we use HtmlUnit to simulate a Firefox browser and call the browser API to simulate page requests, obtaining the full page after interaction. Representing page state changes by events rather than by URLs removes the traditional web crawler's dependence on URLs and ensures that collection is complete. The collected reviews provide the data source needed for clustering. Secondly, text preprocessing consists of Chinese word segmentation with NLPIR, construction of a user dictionary, and compilation of a list of 1,205 stop words. The texts then need to be converted into a data model that a computer can process, and we introduce the LDA topic model to take semantic correlation into account. Thirdly, we take the choice of initial centers as the point of improvement for K-means. The texts are first clustered on a few important topic dimensions, and the converged center points are then used as the initial centers for clustering on all topic dimensions. With this method the choice of initial cluster centers follows a certain probability, which avoids clustering instability.
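
The two-stage choice of initial centers described above can be sketched as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the abstract does not say how the important topic dimensions are selected, how the first stage is seeded, or how the first-stage centers are carried over to all dimensions. Here the important topics are assumed to be those with the largest mean weight, the first stage is seeded with the first k documents, and each first-stage cluster is carried over as the mean of its members over all topics. The names `kmeans` and `two_stage_kmeans` are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, centers, iters=100):
    """Plain Lloyd's K-means on lists of floats; returns (centers, labels)."""
    labels = []
    for _ in range(iters):
        # Assign every point to its nearest center.
        labels = [min(range(len(centers)), key=lambda c: dist(p, centers[c]))
                  for p in points]
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # An empty cluster keeps its previous center.
            new_centers.append(
                [sum(col) / len(members) for col in zip(*members)]
                if members else list(centers[c]))
        if new_centers == centers:  # assignments have stabilised
            break
        centers = new_centers
    return centers, labels


def two_stage_kmeans(docs, k, n_key_topics):
    """docs: one topic-weight vector per document (e.g. LDA output)."""
    n_topics = len(docs[0])
    # Assumption: "important" topics are those with the largest mean weight.
    mean_weight = [sum(d[t] for d in docs) / len(docs) for t in range(n_topics)]
    key = sorted(range(n_topics), key=lambda t: mean_weight[t],
                 reverse=True)[:n_key_topics]
    reduced = [[d[t] for t in key] for d in docs]

    # Stage 1: cluster on the key topics only.
    # Assumption: seeded with the first k documents.
    _, labels = kmeans(reduced, [list(r) for r in reduced[:k]])

    # Carry each stage-1 cluster over to all topics as the mean of its members.
    seeds = []
    for c in range(k):
        members = [d for d, lab in zip(docs, labels) if lab == c]
        seeds.append([sum(col) / len(members) for col in zip(*members)]
                     if members else list(docs[c]))

    # Stage 2: cluster on all topic dimensions from the carried-over seeds.
    return kmeans([list(d) for d in docs], seeds)


docs = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.2, 0.6, 0.1, 0.1],
]
centers, labels = two_stage_kmeans(docs, k=2, n_key_topics=2)
```

Because the second stage starts from centers that already separate the documents on the dominant topics, its result depends less on an arbitrary initial draw than plain K-means does.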
In K-means, we also improve the text similarity formula by combining LDA and VSM linearly, with the optimal parameters of the linear equation obtained from training samples.

Experiments showed that collecting review information with the event-driven acquisition mechanism is more efficient. The formula that combines VSM with LDA improves the accuracy of text similarity, and applying it in the improved K-means algorithm gives better clustering results.
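
A linear combination of LDA and VSM similarity might look like the sketch below. The abstract does not give the formula, so several things here are assumptions: the combination is taken to be a single-weight blend `lam * sim_LDA + (1 - lam) * sim_VSM`, both similarities are cosine similarities, and the weight is fitted by a simple grid search that minimises squared error against labelled training pairs. The function names are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def combined_similarity(vsm_a, vsm_b, lda_a, lda_b, lam):
    """Assumed form: lam * LDA similarity + (1 - lam) * VSM similarity."""
    return lam * cosine(lda_a, lda_b) + (1 - lam) * cosine(vsm_a, vsm_b)


def fit_lambda(pairs, steps=10):
    """Grid-search lam in [0, 1].

    pairs: list of (vsm_a, vsm_b, lda_a, lda_b, target), where target is the
    labelled similarity of the two texts (an assumed training format).
    """
    best_lam, best_err = 0.0, float("inf")
    for i in range(steps + 1):
        lam = i / steps
        err = sum((combined_similarity(va, vb, la, lb, lam) - t) ** 2
                  for va, vb, la, lb, t in pairs)
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam


# Two texts with no shared terms (VSM) but the same topic mixture (LDA).
half = combined_similarity([1, 0], [0, 1], [1, 0], [1, 0], lam=0.5)
lam_hi = fit_lambda([([1, 0], [0, 1], [1, 0], [1, 0], 1.0)])
lam_lo = fit_lambda([([1, 0], [0, 1], [1, 0], [1, 0], 0.0)])
```

The blend lets two colloquial reviews that share few words still score as similar when their topic distributions agree, which is the motivation the abstract gives for bringing LDA into the similarity measure.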
Keywords: Open Class, Ajax, Deep Crawl, Topic Model, Clustering