Research On New Words Discovery And Topic Detection Technology For Microblogging

Posted on: 2016-07-08
Degree: Master
Type: Thesis
Country: China
Candidate: W K Li
Full Text: PDF
GTID: 2208330452970733
Subject: Computer technology
Abstract/Summary:
With the development of Internet technology and the spread of mobile terminal services, micro-blogging, a new type of social media, has grown rapidly and won favor with individuals, businesses, and government. Today, much news and many hot topics are published and spread through micro-blogs. Discovering important information in a large volume of micro-blogs is therefore of great significance to individuals, businesses, and even government. This paper studies new word detection and topic detection technology for micro-blogs. The main research content is as follows.

(1) Research on micro-blog data collection. This paper introduces the principles of the traditional data collection method and of the data collection method based on the micro-blog API, and analyzes the advantages and disadvantages of the two methods for collecting micro-blog data. Based on the structural features of micro-blog sites, it then proposes a data collection method suited to micro-blogs. Using this method, 3 million micro-blog posts were collected, providing a rich corpus for micro-blog topic detection.

(2) Research on new word detection for micro-blogs. This paper reviews current research at home and abroad on new word detection. It also introduces the statistics and algorithms commonly used in this field, and analyzes the principles, advantages, and disadvantages of each kind of method. Finally, it detects new words by calculating the inner combination degree and the boundary freedom degree of each candidate word. We participated in the COAE2014 evaluation with this method and achieved good results.

(3) Research on micro-blog topic detection. This paper reviews current research at home and abroad on micro-blog topic detection. It also introduces the clustering and similarity calculation methods commonly used in this field, as well as the principle of the LDA topic model. Finally, it proposes a micro-blog topic detection method based on the LDA model and hierarchical clustering. First, the method models the micro-blog corpus with the LDA model to extract semantic information from micro-blogs. Second, it improves the traditional Single-Pass algorithm by taking the time sequence of micro-blogs into account. The improved Single-Pass clustering and hierarchical clustering are then used together to detect micro-blog topics.

(4) Research on topic keyword extraction. At present, there is little research on topic keyword extraction. In this paper, the results of the multi-level clustering are used as the corpus for topic keyword extraction. First, the corpus is segmented and common words are removed. Second, the TF value of each word is computed within the topic in which it appears. Third, the IDF value of each word is computed over the whole corpus. Finally, the TF-IDF value is calculated from the TF and IDF values, and the top three words are taken as the topic keywords. Experiments show that this method is effective.
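The keyword extraction steps in (4) can be illustrated with a minimal Python sketch. It is not the thesis implementation, and the abstract does not give the exact formulas. The sketch assumes that posts are already segmented into tokens, that TF is a word's relative frequency within one topic, and that IDF is log(N / df) counted over all posts in the corpus. The function name `topic_keywords`, its parameters, and the alphabetical tie-breaking are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def topic_keywords(topics, stop_words=frozenset(), top_n=3):
    """Return the top_n TF-IDF words for each topic.

    topics maps a topic id to a list of posts; each post is a list of
    already-segmented tokens (segmentation is assumed to happen earlier).
    """
    # Step 1: remove common (stop) words from every post.
    cleaned = {
        tid: [[w for w in post if w not in stop_words] for post in posts]
        for tid, posts in topics.items()
    }

    # Step 3 (assumed form): IDF over the whole corpus, with document
    # frequency counted once per post.
    all_posts = [post for posts in cleaned.values() for post in posts]
    n_posts = len(all_posts)
    df = Counter(w for post in all_posts for w in set(post))
    idf = {w: math.log(n_posts / count) for w, count in df.items()}

    keywords = {}
    for tid, posts in cleaned.items():
        # Step 2 (assumed form): TF as relative frequency inside the topic.
        counts = Counter(w for post in posts for w in post)
        total = sum(counts.values())
        # Step 4: combine TF and IDF, then keep the highest-scoring words.
        scores = {w: (c / total) * idf[w] for w, c in counts.items()}
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        keywords[tid] = ranked[:top_n]
    return keywords


if __name__ == "__main__":
    demo = {
        "quake": [["earthquake", "rescue", "today"],
                  ["earthquake", "relief", "today"],
                  ["earthquake", "rescue", "team"]],
        "match": [["football", "final", "today"],
                  ["football", "goal", "today"],
                  ["football", "final", "score"]],
    }
    print(topic_keywords(demo))
```

With this definition, a word that appears in most posts of the corpus, such as "today" in the demo data, gets a low IDF and is unlikely to be chosen as a keyword, even when it is frequent inside a topic.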
Keywords/Search Tags: data collection, new word detection, topic detection, LDA model, keyword extraction