Research On Micro-Blog Data Extraction And Topic Detection Method

Posted on: 2014-02-25
Degree: Master
Type: Thesis
Country: China
Candidate: Y Qiu
GTID: 2248330398950219
Subject: Computer software and theory
Abstract/Summary:
Micro-Blog has rapidly become a popular network application in recent years and is increasingly part of everyday life. Because content can be published not only from computers but also from mobile phones and other portable devices, Micro-Blog content is real-time and fragmented. Meanwhile, Micro-Blog users can follow and be followed by one another, and a post can be commented on and forwarded, so Micro-Blog is also interactive and flexible. Based on these characteristics, this paper addresses data extraction and topic detection.

Traditional web text extraction treats the web as a graph and uses a web crawler to traverse it and collect information. In contrast, this paper discusses data acquisition through the Micro-Blog open API. It first analyzes the principles of OAuth 1.0 and authentication, and then studies the authentication process. Obtaining this authorization is a prerequisite for using the open interfaces; its purpose is to let third-party applications access data on the service side without disclosing personal information. Finally, this paper uses the Sina Micro-Blog open interfaces to extract data and saves the results in the more efficient JSON format. Experiments show that this method is more efficient than the traditional method and produces smaller files for an equivalent amount of data.

Topic detection has been studied in depth in the field of data mining. It extracts a small number of distinct themes from many scattered text files, so that an overview of the data can be shown more clearly. In topic detection, traditional modeling based on the vector space model tends to lose semantic information, so this paper improves the existing feature-weighting and similarity calculation methods by incorporating semantic information.
Meanwhile, to account for the real-time nature of Micro-Blog, this paper adds time parameters in the pre-modeling phase to ensure the correctness of topic detection. Traditional topic detection focuses mainly on unstructured text; this paper also takes the forwarding function of Micro-Blog into account. Finally, it selects an improved single-pass clustering method to perform topic detection. Comparison experiments show that the proposed method achieves good results on standard topic detection measures such as the missed detection rate and the false detection rate.
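
To make the single-pass idea concrete, the following is a minimal sketch of single-pass clustering with a time factor. It is not the thesis's improved method. The abstract does not give the semantic feature weights, the exact form of the time parameter, or the handling of forwarded posts, so this sketch assumes raw term counts, cosine similarity, and an exponential decay with a hypothetical `half_life_hours` parameter. The function names and the `threshold` value are also illustrative assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (dicts)."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass(posts, threshold=0.3, half_life_hours=24.0):
    """Cluster posts in one pass.

    posts: list of (time_in_hours, tokens), sorted by time.
    Each post joins the most similar existing topic if the
    time-decayed similarity reaches the threshold; otherwise
    it starts a new topic.
    """
    clusters = []
    for idx, (t, tokens) in enumerate(posts):
        vec = Counter(tokens)
        best, best_score = None, 0.0
        for c in clusters:
            # Assumed time factor: similarity halves every half_life_hours
            # since the topic last received a post.
            decay = 0.5 ** ((t - c["last_time"]) / half_life_hours)
            score = cosine(vec, c["centroid"]) * decay
            if score > best_score:
                best, best_score = c, score
        if best is not None and best_score >= threshold:
            best["centroid"].update(vec)  # accumulate term counts
            best["last_time"] = t
            best["members"].append(idx)
        else:
            clusters.append(
                {"centroid": Counter(vec), "last_time": t, "members": [idx]}
            )
    return clusters
```

Because each post is compared only against the topics seen so far, the result depends on the order of arrival, which fits a stream of time-stamped posts. In this sketch the decay keeps a late post from joining a topic that has been inactive for a long time, even when the wording matches.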
Keywords/Search Tags: Micro-Blog, Data Extraction, Topic Detection, Vector Space Model, Single-Pass