
Research And Implementation Of Long-tail Topic Mining Algorithm Based On Matrix Factorization

Posted on: 2019-06-23 | Degree: Master | Type: Thesis
Country: China | Candidate: S X Sun | Full Text: PDF
GTID: 2438330545993148 | Subject: Communication and Information System
Abstract/Summary:
Internet technology is now widespread both nationally and worldwide. It has become an indispensable carrier of information and has rapidly worked its way into every aspect of people's lives and work. With its rapid development, the volume of information has grown geometrically, far beyond what people can read and analyze on their own. An effective way to address this problem is to use computers to automatically discover valuable content in large amounts of text data for people to understand and use. Data mining is itself a relatively new field. Because traditional information processing and text mining methods handle massive data poorly, topic mining has become increasingly important. As one of the most basic tasks in the image, text, and signal domains, topic mining plays an important role in helping people obtain and understand information, and it has great application value in search engines, intelligent question answering, and natural language processing. As a commonly used data mining tool, the topic model can effectively extract topic information from large-scale data and help people quickly obtain useful information.
The Long Tail theory, which emerged in the early 21st century, proposes that as long as storage and distribution channels are large enough, products with low demand or poor sales can collectively hold a market share comparable to, or even larger than, that of the few best-selling products (that is, a long-tailed distribution). Researchers later found that the distribution of topics in digital information such as images and text also shows a long-tail effect. Intuitively, given enough unpopular products, the total benefit they bring is comparable to that of popular products, and together they can form a niche market much larger than that of the best-selling products. Likewise, a large body of long-tail topic information, taken together, can rival the popular topics and bring considerable value to society.
Although the long-tail theory was proposed some time ago, most non-hierarchical topic models still ignore long-tail information. Hierarchical topic models have some long-tail mining ability because they introduce a topic hierarchy, but they are highly complex. Based on this analysis, the thesis investigates whether a non-hierarchical topic model based on matrix factorization can, with effective constraints added, achieve training results equal to or better than those of hierarchical topic models. The result is NMF-LT, a long-tail topic mining algorithm based on non-negative matrix factorization. NMF-LT adds a soft orthogonality constraint to the feature-topic matrix to keep the topics independent of one another, and a long-tail constraint to the topic-document matrix to improve the model's robustness and its ability to characterize the long-tailed topic distribution. The combination of soft orthogonality, sparsity, and long-tail constraints lets the model better mine long-tail topic information in the data while preserving topic quality. Experiments show that NMF-LT achieves better topic mining results than commonly used topic models such as LDA, NMF, and pLSA. Finally, the thesis summarizes the work done and the open problems of the study, and points out directions worth pursuing in the next stage.
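The constrained factorization described above can be illustrated with a small sketch. The abstract does not state the NMF-LT objective or its update rules, so everything below is an assumption made for illustration: the function name `nmf_soft_orth`, the penalty weights `alpha` and `beta`, the particular penalty forms, and the heuristic multiplicative updates are all hypothetical. The sketch covers only a soft orthogonality penalty on the feature-topic matrix `W` and an L1 sparsity penalty on the topic-document matrix `H`. The long-tail constraint is left out because the abstract does not define it.

```python
import numpy as np


def nmf_soft_orth(V, k, alpha=0.1, beta=0.1, n_iter=200, seed=0, eps=1e-9):
    """Illustrative NMF with soft orthogonality on W and L1 sparsity on H.

    Assumed objective (a stand-in, not taken from the thesis):
        0.5 * ||V - W H||_F^2 + (alpha / 4) * ||W^T W - I||_F^2 + beta * sum(H)
    subject to W >= 0 and H >= 0.

    V: non-negative (features x documents) matrix.
    W: (features x k) feature-topic matrix.
    H: (k x documents) topic-document matrix.
    """
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps

    for _ in range(n_iter):
        # H update: Lee-Seung style rule with the L1 weight beta added to the
        # denominator, which shrinks small entries of H toward zero.
        H *= (W.T @ V) / (W.T @ W @ H + beta + eps)

        # W update: the gradient of the orthogonality penalty is
        # alpha * (W W^T W - W); its positive part goes in the denominator
        # and its negative part in the numerator (a heuristic split).
        numer = V @ H.T + alpha * W
        denom = W @ (H @ H.T) + alpha * W @ (W.T @ W) + eps
        W *= numer / denom

    return W, H


# Small synthetic example: a non-negative matrix of rank 3.
rng = np.random.default_rng(1)
V = rng.random((20, 3)) @ rng.random((3, 30))
W, H = nmf_soft_orth(V, k=3)
print("relative reconstruction error:", np.linalg.norm(V - W @ H) / np.linalg.norm(V))
print("off-diagonal mass of W^T W:", np.abs(W.T @ W - np.diag(np.diag(W.T @ W))).sum())
```

The penalty is "soft" in the sense that orthogonality of the topic columns is encouraged through `alpha` rather than enforced exactly, so larger values of `alpha` trade reconstruction accuracy for more distinct topics. Multiplicative updates keep both factors non-negative, but with the added penalty terms they are not guaranteed to decrease the objective at every step.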
Keywords/Search Tags: Topic Model, Non-negative Matrix Factorization, Long-tail Topic