
Summarization Method On Social Network Corpora

Posted on: 2018-02-17
Degree: Master
Type: Thesis
Country: China
Candidate: X J Guan
Full Text: PDF
GTID: 2428330512494294
Subject: Computer Science and Technology
Abstract/Summary:
In the social network era, everyone is a knowledge publisher. Knowledge comes mainly from two sources: user-generated content (Twitter posts, blogs) and interactions between users (tags, comments, replies). This content holds a large amount of information waiting to be mined, such as public opinion and users' personal hobbies and preferences. However, the sheer volume of these corpora makes data mining on them very challenging, so summarizing them is particularly important. Text summarization is a long-standing subject in natural language processing. Earlier methods are effective and widely used on plain-text corpora, but they perform poorly on social network corpora. The main reason is that they neglect the "social" attributes of such corpora, which are reflected mainly in short sentences, informal and colloquial wording, newly coined words, and so on. In addition, traditional methods cannot use the information contained in comments, replies, and author-added tags, even though this information plays an important role in inferring the topics of such corpora. For these reasons, we propose a text summarization method suited to social network corpora. The method meets the summarization needs of three scenarios: tag-driven summarization, comparative summarization, and real-time summarization. The model not only estimates the number of topics automatically, "letting the data speak", but also exhibits the "rich get richer" property in the topic evolution of social networks. Furthermore, online debate data, an important carrier of public opinion, also needs to be summarized. Users may join only a few debates, which leaves many missing entries, and the resulting sparsity makes regular feature selection difficult. This thesis proposes a topic selection algorithm based on ensemble learning, which summarizes all topics according to the attributes of different groups and puts forward the subset of topics that best distinguishes the groups. This subset can be viewed as the main focus of the group. The thesis presents summarization algorithms for labeled corpus data and for online debate data, respectively, and experimental results verify the effectiveness of these algorithms.
Keywords/Search Tags:Automatic Summarization, Dirichlet Process, Ensemble Learning
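
The abstract says the model estimates the number of topics from the data and shows a "rich get richer" effect, and the keywords name the Dirichlet process. The following is a minimal sketch of the Chinese restaurant process, a standard sequential view of the Dirichlet process that has both properties. It is an illustration only, not the thesis's model: the function name `chinese_restaurant_process`, the concentration parameter `alpha`, and the sample sizes are assumptions chosen for this example.

```python
import random


def chinese_restaurant_process(n_items, alpha, seed=0):
    """Sequentially assign n_items to clusters under a CRP.

    alpha (> 0) is the concentration parameter; it is a hypothetical
    setting for this sketch, not a value taken from the thesis.
    """
    rng = random.Random(seed)
    counts = []       # counts[k] = number of items currently in cluster k
    assignments = []  # assignments[i] = cluster index of item i
    for _ in range(n_items):
        # An existing cluster k is picked with weight counts[k]
        # ("rich get richer"); a new cluster is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new cluster
        else:
            counts[k] += 1     # join an existing cluster
        assignments.append(k)
    return assignments, counts


if __name__ == "__main__":
    assignments, counts = chinese_restaurant_process(1000, alpha=2.0)
    # The number of clusters is not fixed in advance; it grows with the data.
    print("clusters:", len(counts))
    print("largest clusters:", sorted(counts, reverse=True)[:5])
```

In this sketch the number of clusters is never specified up front: it is whatever the sampling process produces, and larger clusters attract new items in proportion to their size.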