Research On The Mining Technology Of Hot Accessing Topics From The Network Information Stream

Posted on: 2008-12-20
Degree: Master
Type: Thesis
Country: China
Candidate: W S Li
GTID: 2178360245997822
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet, online public opinion is becoming increasingly important. Almost every major social issue surfaces on the Internet, so studying online public opinion is of great significance for improving people's standard of living and building a harmonious society.
In this thesis, after studying the definition, characteristics, origin and propagation paths of online public opinion, we design a prototype of an online public opinion system. Such a system draws on several fields, including computer networks, web text mining, data stream mining and natural language processing. It performs real-time monitoring of the network data stream, web page gathering, topic detection and tracking, incident analysis, news trend prediction and statistical report generation, all for the purpose of early warning.
Building a hot topic mining system requires natural language processing techniques. In this thesis, we select appropriate algorithms for word segmentation and sentence similarity computation and optimize them.
A data stream is continuous, unbounded, rapid and time-varying. Frequent item mining algorithms must therefore scan the stream as few times as possible while using a limited amount of memory. The lossy counting algorithm is a classical frequent item mining algorithm of this kind. In this thesis, we pre-process web page topics with natural language processing techniques so that the lossy counting algorithm can be applied to text data streams. We then use the summary maintained by the lossy counting algorithm to mine recently accessed hot topics.
Based on the above, we developed a complete system for mining hot accessing topics from the network data stream. It includes modules for public opinion collection, web page feature extraction, topic keyword stemming, similarity comparison and frequent item mining.
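As an illustration of the frequent item mining step, here is a minimal Python sketch of lossy counting in its commonly described textbook form. It is not the thesis's implementation. The class name `LossyCounter`, its parameters `epsilon` and `support`, and the assumption that each page topic has already been normalized to a comparable string key (for example by word segmentation and similarity matching) are all introduced here for illustration only.

```python
import math


class LossyCounter:
    """Approximate frequent-item counter (hypothetical helper, textbook lossy counting)."""

    def __init__(self, epsilon):
        if not 0 < epsilon < 1:
            raise ValueError("epsilon must be between 0 and 1")
        self.epsilon = epsilon
        # Bucket width: the stream is conceptually cut into buckets of this many items.
        self.width = math.ceil(1 / epsilon)
        self.n = 0          # number of items seen so far
        self.entries = {}   # item -> [observed count, maximum possible undercount]

    def add(self, item):
        """Process one stream item in a single pass."""
        self.n += 1
        bucket = math.ceil(self.n / self.width)
        if item in self.entries:
            self.entries[item][0] += 1
        else:
            # A new item may have been seen (and pruned) in earlier buckets,
            # so its count can be too low by at most bucket - 1.
            self.entries[item] = [1, bucket - 1]
        if self.n % self.width == 0:
            # Bucket boundary: drop entries that cannot be frequent.
            self.entries = {
                key: value
                for key, value in self.entries.items()
                if value[0] + value[1] > bucket
            }

    def frequent(self, support):
        """Return items whose observed count is at least (support - epsilon) * n."""
        threshold = (support - self.epsilon) * self.n
        return {
            key: value[0]
            for key, value in self.entries.items()
            if value[0] >= threshold
        }
```

In this sketch, each normalized topic key would be passed to `add` as pages arrive, and `frequent(support)` would return the current hot topic candidates. A smaller `epsilon` gives tighter counts at the cost of keeping more entries in memory.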
By testing the system under different parameter settings, we studied how each parameter affects performance and how to choose parameters that give the best results. Accuracy and operating efficiency are the two main criteria used to evaluate the system. These tests also showed that the system produces satisfactory hot topic mining results, with good differentiation between topics and relatively strong performance on high-speed data streams.
Keywords/Search Tags:net public opinion, hot accessing topic, data stream, sentence similarity computation