Research On Automatic Detection Technology Of Network Topic Based On K-means

Posted on:2012-05-17

Degree:Master

Type:Thesis

Country:China

Candidate:S Chai

Full Text:PDF

GTID:2178330338493385

Subject:Computer technology

Abstract/Summary:

With the rapid development of modern communication and Internet technologies, a central problem in processing massive volumes of network information is how computers can discover newly emerging hot topics and hot-button issues as early as possible and automatically track their subsequent development. This thesis studies automatic detection technologies for network topics, covering network data collection, extraction of the main text of web pages, named entity extraction, and automatic topic detection. The four main results are as follows.

(1) Network data collection. Building on a traditional web data collection system, this thesis constructs a distributed collection system for large-scale, dynamic network data. The system adopts a distributed architecture described as "master-slave distribution with autonomous cooperation" and combines multiple collection strategies to collect large-scale, dynamic network data efficiently.

(2) Web page main-text extraction. This thesis proposes a method for extracting the main text of a web page based on anchor text computation. The method targets local noise within a page. First, it analyzes the structure of the page's source code and identifies content blocks by extracting tags. Second, it calculates the proportion of anchor text in each content block and applies a threshold to decide whether the block belongs to the main text. Finally, it computes a ratio relating each content block to the page title, which yields a complete extraction of the page's main text. Experimental results show that, compared with traditional extraction methods, this method considerably improves the completeness and accuracy of main-text extraction.

(3) Named entity extraction. This thesis proposes a named entity extraction method based on custom rules. The method targets the named-entity labeling errors that occur under the Chinese named-entity labeling specification: it uses regular expressions to define rules that correct the Chinese word segmentation results, and on that basis extracts named entities from network text accurately. Experimental results show that it outperforms direct extraction, making it an effective method for extracting named entities from network text.

(4) Automatic detection of network topics. This thesis proposes a dynamic automatic detection method for network topics based on named entities. The method addresses the difficulty that traditional topic detection methods have in distinguishing similar topics: starting from the vector representation of the topic center, it rebuilds the topic center from named entities together with a set of feature words. It then applies maximum-minimum self-similarity clustering to detect network topics automatically, and is an effective method for this task.
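
The anchor-text proportion test in step (2) can be illustrated with a short Python sketch. This is not the thesis's implementation: the names `anchor_text_ratio` and `select_body_blocks`, the 0.3 threshold, the choice to count non-whitespace characters, and the assumption that content blocks are already available as separate HTML fragments are all hypothetical. The comparison with the page title is omitted.

```python
from html.parser import HTMLParser


class _AnchorTextCounter(HTMLParser):
    """Counts non-whitespace text characters overall and inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0   # > 0 while the parser is inside an <a> element
        self.total_chars = 0
        self.anchor_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        n = len("".join(data.split()))  # ignore whitespace (assumed convention)
        self.total_chars += n
        if self.anchor_depth > 0:
            self.anchor_chars += n


def anchor_text_ratio(block_html):
    """Return the share of a block's text that sits inside links (0.0 to 1.0)."""
    counter = _AnchorTextCounter()
    counter.feed(block_html)
    counter.close()
    if counter.total_chars == 0:
        return 1.0  # assumption: a block with no text is treated as noise
    return counter.anchor_chars / counter.total_chars


def select_body_blocks(blocks, threshold=0.3):
    """Keep blocks whose anchor-text ratio does not exceed the (assumed) threshold."""
    return [b for b in blocks if anchor_text_ratio(b) <= threshold]


if __name__ == "__main__":
    nav = '<div><a href="/a">Home</a> <a href="/b">News</a></div>'
    body = '<div>The storm hit the coast on Monday. <a href="/x">More</a></div>'
    for block in select_body_blocks([nav, body]):
        print(block)
```

In this sketch a navigation bar, whose text is almost entirely links, scores close to 1.0 and is dropped, while a paragraph with a single trailing link scores low and is kept. A suitable threshold would have to be tuned on real pages.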

Keywords/Search Tags:

topic detection, data collection, named entity, time stamp, self-similarity clustering

Related items

1	Research For Algorithm Of Chinese Entity Linking Technology Based On Topic Relation Graph
2	The Design And Implementation Of WeChat Mini Program For Stamp-collecting Communication
3	Research On Topic Clustering Algorithm Based On Topic Models
4	Research On Social Network Community Detecting By Integrating With Topic Attributes
5	Research On Quality Management Of Stamp Collecting Management Software Development Project Of Harbin Postal Company
6	Research And Implementation Of Online Entity Disambiguation Based On Entity Gene
7	The Hot-topic Discovery Based On Density Clustering Of Feature Words And Similarity Calculation
8	Research On The Similarity-Based Clustering Of Time Series
9	An Improved Digital Signature Scheme Based On Time-Stamp
10	Theory And Key Techniques Of Entity Retrieval