Research On The Key Technology Of Ajax Depth Information Acquisition And Clustering

Posted on:2016-08-28

Degree:Master

Type:Thesis

Country:China

Candidate:D M You

Full Text:PDF

GTID:2308330467972458

Subject:Computer Science and Technology

Abstract/Summary:

With the rapid development of the Internet, people's sources of knowledge are no longer limited to books and classrooms. A growing number of online course resources, such as Netease Open Class and MOOCs, have entered modern life, and viewers can communicate with each other and share their perspectives after watching the videos. Studying information collection and text clustering algorithms is therefore important for improving the quality of open class construction. This thesis addresses these two key technologies: information collection and text clustering. An analysis of these websites shows that reviews are presented in Ajax dynamic pages rather than traditional static HTML pages, that the review texts are highly colloquial, and that their implicit topics are widely dispersed. These characteristics pose new challenges for traditional information collection and text clustering techniques. The research work completed independently in this thesis is as follows.

Firstly, for information collection, we use HtmlUnit to simulate a Firefox browser and call the browser API to simulate page requests, obtaining the full page after interaction. Page state changes are represented by events instead of URLs, which removes the traditional web crawler's dependence on URLs and ensures the completeness of collection. The collected reviews provide the necessary data source for clustering.

Secondly, text preprocessing comprises NLPIR Chinese word segmentation, construction of a user dictionary, and compilation of a list of 1205 stop words. Because texts must be converted into a data model that a computer can process, we introduce the LDA topic model to take semantic correlation into account.

Thirdly, we take the choice of initial centers as the breakthrough point. Texts are first clustered on a subset of important topic dimensions, and the resulting converged center points are then used as the initial centers for clustering on all topic dimensions. With this method, the choice of initial cluster centers follows a certain probability, which avoids clustering instability. Within K-means, we also improve the text similarity formula by combining LDA with VSM linearly and obtaining the optimal parameters of the linear equation from training samples.

Experiments show that collecting review information with the event-driven acquisition mechanism is more efficient, that the formula combining VSM with LDA improves text similarity accuracy, and that applying this formula in the improved K-means algorithm yields better clustering results.
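
The linear combination of VSM and LDA similarity mentioned in the abstract can be read as a weighted blend of two similarity scores. The Python sketch below is illustrative only: the function names, the use of cosine similarity for both the term vectors and the topic distributions, and the single `weight` parameter are assumptions, not the formula given in the thesis. The thesis obtains the parameters of the linear equation from training samples; here the weight is simply passed in.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_similarity(vsm_a, vsm_b, lda_a, lda_b, weight=0.5):
    """Hypothetical blend: weight * VSM cosine + (1 - weight) * LDA cosine.

    vsm_a, vsm_b: term-weight vectors (for example TF-IDF) of two reviews.
    lda_a, lda_b: topic distributions of the same two reviews.
    weight:       assumed mixing parameter in [0, 1]; the thesis fits its
                  parameters on training samples instead of fixing them.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * cosine(vsm_a, vsm_b) + (1.0 - weight) * cosine(lda_a, lda_b)
```

Two colloquial reviews that share no words have a VSM cosine of zero, yet they can still score as similar when their topic distributions agree. That is the kind of semantic correlation the LDA term is meant to contribute.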

Keywords/Search Tags:

Open Class, Ajax, Deep Crawl, Topic Model, Clustering

Related items

1	Research Of Deep Web Crawler Supporting Ajax
2	Research On Subject-Based Incremental Parallel Crawling
3	Research And Implementation Of The Topic Web Crawlers
4	Research On Topic Clustering Algorithm Based On Topic Models
5	Network Hot Topic Discovery Based On Topic Model And Clustering Algorithm
6	Crawl Model Based Testing Of Soap Protocol
7	Sphere Topic Model Based On Word Embedding In Text Clustering Field
8	News Topic Detection Based On LDA Fusion Model And Multi-layer Clustering
9	Design And Implementation Of An Ajax-supported DEEP WEB Crawler (Shanghai Jiao Tong University)
10	A Study Of Web 2.0 Community Oriented Crawling Techniques