Research on a Topic-Model-Based Method for News Headline Classification

Posted on: 2017-02-19
Degree: Master
Type: Thesis
Country: China
Candidate: H P Zhu
GTID: 2308330485463991
Subject: Computer technology
Abstract/Summary:
The advent of the big data era has brought unprecedented impact and challenges to journalism, and data is becoming the most important resource for news reporting. This kind of journalism uses techniques such as data mining and statistical analysis to reveal to the public, through visualization, the relationship between massive, complex data and the development of society as a whole. Huge amounts of news appear on the Internet every day, and news classification is a basic task that must be completed before news of all kinds can be collected and mined for useful information. Automatic classification of online news has therefore become a research hotspot in the context of "data-driven journalism".

Text classification is the process of assigning a document to one of a set of predefined categories. At present, most methods classify the body text of news articles. Because an article can run to hundreds or even thousands of words, classifying news content is a long-text classification problem, and processing such huge volumes of news data is cumbersome. Because a headline summarizes the content of its article, classifying headlines instead is very efficient when facing vast amounts of data.

A news headline is a kind of short text. However, unlike other short texts such as Weibo posts, which can sometimes contain hundreds of words, a news headline is usually no more than 30 words long. Its features are therefore sparse, which makes headline classification a major challenge within short-text classification and exposes the defects and insufficiencies of traditional methods on this task.
Because a news headline summarizes the article it belongs to, this paper classifies news by headline. The main work is as follows:
(1) Headlines were extracted from a Tencent news data set found on the Internet to build a headline corpus for classification. The corpus contains seven categories: politics, economy, education, science and technology, sports, society, and people's livelihood. The people's livelihood category contains three further subcategories (transportation, medical, housing).
(2) Because news headlines have few linguistic features, existing word segmentation techniques do not segment them well, which in turn degrades the final classification result. To address this problem, this paper uses a news domain dictionary during word segmentation. Experiments show that this method is effective and improves classification performance.
(3) News headlines typically have little content but strong descriptive power, so traditional approaches (e.g., TF-IDF) perform quite unsatisfactorily. This paper uses the LDA (Latent Dirichlet Allocation) and BTM (Biterm Topic Model) topic models, both proposed in recent years, to mine the implicit semantic relations in headlines and, combined with the news domain dictionary, to improve headline classification. Experiments show that this method performs much better than other classification methods.
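
The abstract gives no implementation details, so the following is only a minimal sketch of the idea behind BTM: a biterm is an unordered pair of words that co-occur in the same short text. The sketch assumes each headline has already been segmented into tokens. The function names `extract_biterms` and `count_biterms` and the sample headlines are hypothetical and do not come from the thesis.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return all unordered word pairs (biterms) from one short text."""
    # Sort each pair so (a, b) and (b, a) count as the same biterm.
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def count_biterms(headlines):
    """Aggregate biterm counts over a corpus of tokenized headlines."""
    counts = Counter()
    for tokens in headlines:
        counts.update(extract_biterms(tokens))
    return counts


# Hypothetical, already-segmented headlines (not from the thesis corpus).
headlines = [
    ["central", "bank", "cuts", "rates"],
    ["bank", "rates", "fall"],
]
biterm_counts = count_biterms(headlines)
```

BTM models topics over the biterms pooled from the whole corpus rather than over the word counts of each individual document, which is why it is less affected by the sparsity of a single headline.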
Keywords/Search Tags:News Headline, Short Text, Domain Dictionary, BTM, Topic Model, Classification Method