Research On Automatic Detection Technology Of Network Topic Based On K-means

Posted on: 2012-05-17
Degree: Master
Type: Thesis
Country: China
Candidate: S Chai
GTID: 2178330338493385
Subject: Computer technology
Abstract/Summary:
With the rapid development of modern communication and Internet technologies, a central problem in processing massive network information is how computers can discover newly emerging hot topics and hot-button issues as early as possible and automatically track how they develop. This thesis studies automatic detection of network topics, covering network data collection, extraction of the main text of web pages, named entity extraction, and automatic topic detection. The four main results are as follows:

(1) Network data collection. Building on a traditional web data collection system, this thesis constructs a distributed collection system for large-scale, dynamic network data. The system uses a distributed architecture described as "master-slave distribution with autonomous cooperation" and combines several collection policies to collect such data efficiently.

(2) Web page main text extraction. This thesis proposes a method for extracting the main text of a web page based on anchor text computation. The method targets local noise within a page. First, it analyzes the structure of the page source code and identifies content blocks by extracting tags. Second, it computes the proportion of anchor text in each content block and applies a threshold to decide which blocks belong to the main text. Finally, it computes a proportion relating each content block to the page title, which completes the extraction of the main text. Experimental results show that the method considerably improves the completeness and accuracy of main text extraction compared with traditional extraction methods.

(3) Named entity extraction. This thesis proposes a method for extracting named entities based on custom rules. The method targets the labeling errors that arise under the Chinese named entity labeling specification. It uses regular expressions to define rules that correct the Chinese word segmentation results, and on that basis extracts network named entities accurately. Experimental results show that the method performs better than direct extraction, making it an effective way to extract network named entities.

(4) Automatic detection of network topics. This thesis proposes an automatic, dynamic topic detection method based on named entities. The method targets the difficulty that traditional topic detection methods have in distinguishing similar topics. Starting from the vector representation of the topic center, it rebuilds the topic center using named entities together with a set of feature words. It then applies max-min self-similarity clustering to detect network topics automatically. This is an effective method for automatic detection of network topics.
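The anchor-text step in result (2) can be illustrated with a minimal sketch. This is not the thesis implementation: the set of block-level tags, the function name extract_main_text, and the threshold value 0.3 are all assumptions chosen for illustration, and the final title-comparison step is omitted. The idea shown is only that blocks whose text is mostly link text (navigation bars, "related" lists) are treated as noise.

```python
from html.parser import HTMLParser

# Assumed set of tags that delimit content blocks (not taken from the thesis).
BLOCK_TAGS = {"div", "p", "td", "li", "article", "section"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Split a page into text blocks and count anchor-text characters per block."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # list of (total_chars, anchor_chars, text)
        self._pieces = []       # text pieces of the block being built
        self._anchor_chars = 0  # characters of the current block inside <a>
        self._in_anchor = 0
        self._skip = 0

    def flush(self):
        """Close the current block, if it contains any text."""
        if self._pieces:
            total = sum(len(p) for p in self._pieces)
            self.blocks.append((total, self._anchor_chars, " ".join(self._pieces)))
        self._pieces, self._anchor_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_anchor += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip:
            self._skip -= 1
        elif tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a" and self._in_anchor:
            self._in_anchor -= 1

    def handle_data(self, data):
        piece = data.strip()
        if not piece or self._skip:
            return
        self._pieces.append(piece)
        if self._in_anchor:
            self._anchor_chars += len(piece)


def extract_main_text(html, max_anchor_ratio=0.3):
    """Return the text of blocks whose anchor-text proportion is at most the threshold."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()  # close a trailing block that has no end tag
    return [text for total, anchor, text in parser.blocks
            if anchor / total <= max_anchor_ratio]
```

Calling extract_main_text on a page with a navigation block, an article paragraph, and a "related story" link block would keep only the paragraph, because the other two blocks consist entirely of anchor text. In practice the threshold would need tuning on real pages, and the title-comparison step described in result (2) would further filter the surviving blocks.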
Keywords/Search Tags: topic detection, data collection, named entity, time stamp, self-similarity clustering