Text Classification Algorithm Based On Chinese And English Topic Space

Posted on: 2019-05-28
Degree: Master
Type: Thesis
Country: China
Candidate: X H Wu
GTID: 2428330542494093
Subject: Information and Communication Engineering
Abstract/Summary:
With the development of Internet technology, the amount of information on the Internet has grown explosively. Unstructured text is a very important part of this information, and it touches all aspects of people's lives, such as politics, economy, technology, culture and entertainment. Text has existed for thousands of years, and the Internet has given it more functions and meanings. The amount of text accumulated on the Internet now greatly exceeds the textual information accumulated by traditional means in the past. Extracting the information we need from this mass of text is one of the important tasks of natural language processing, and it is of great significance for further improving production efficiency and providing people with a better life experience. Text classification is a key technology for extracting useful information from massive text, and it can be applied in many fields of text information processing, such as information retrieval, sentiment analysis and news recommendation. Therefore, it is of great theoretical significance and practical value to study text classification algorithms with high classification accuracy and low time complexity. The main research contents of this thesis include:

1. Construction of Chinese and English Topic Vector Space
To overcome the high complexity of natural language processing algorithms that work at word granularity, this thesis considers a three-layer text representation structure of text-topic-word and studies natural language processing algorithms at the topic level. The global word co-occurrence matrix extracted from a corpus is used to construct a variable-dimension, real-valued topic vector space through the Poisson infinite relational model. This space gives each topic a vectorized representation, which allows a topic to participate directly in numerical calculation as a vector. In addition, this thesis proposes a multi-level topic vector space construction method for the Chinese case, in which different levels of the topic vector space correspond to topics of different depth. Experiments with actual Chinese and English classification datasets show that the topic vector space extracted by this algorithm can be better applied in text classification algorithms.

2. Text Classification Algorithm Based on Topic Vector Space
Because word-based text classification algorithms have high time complexity, this thesis proposes a new text distance measure, the topic mover's distance, built on the topic vector space constructed above. The distance between two texts is defined as the minimum cumulative distance that the topics in one text need to travel to reach the topics in the other text. The distance between two topics is measured by the Euclidean distance between their topic vectors. Experiments using actual Chinese and English classification datasets show that this algorithm can achieve much lower time complexity while ensuring higher classification accuracy.
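The definition of the topic mover's distance above can be illustrated with a minimal sketch. It assumes, beyond what the abstract states, that each text is represented as a list of non-negative topic weights summing to 1, and it solves the resulting transport problem with a small successive-shortest-path min-cost flow written in plain Python. The names `euclidean`, `topic_movers_distance`, `p`, `q`, `topic_vectors` and `eps` are hypothetical, and the solver is a stand-in chosen for self-containment, not the method used in the thesis.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two topic vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def topic_movers_distance(p, q, topic_vectors, eps=1e-12):
    """Minimum total distance the topic mass of one text must travel to match another.

    p, q: assumed topic-weight lists (non-negative, each summing to 1).
    topic_vectors: one real-valued vector per topic.
    """
    n = len(topic_vectors)
    source, sink = 2 * n, 2 * n + 1
    # Each edge is [to, capacity, cost, index of the reverse edge in graph[to]].
    graph = [[] for _ in range(2 * n + 2)]

    def add_edge(u, v, cap, cost):
        graph[u].append([v, cap, cost, len(graph[v])])
        graph[v].append([u, 0.0, -cost, len(graph[u]) - 1])

    for i in range(n):
        add_edge(source, i, p[i], 0.0)    # mass available in topic i of text 1
        add_edge(n + i, sink, q[i], 0.0)  # mass required in topic i of text 2
        for j in range(n):
            cost = euclidean(topic_vectors[i], topic_vectors[j])
            add_edge(i, n + j, math.inf, cost)

    total_cost = 0.0
    while True:
        # Bellman-Ford: cheapest augmenting path in the residual graph.
        dist = [math.inf] * len(graph)
        parent = [None] * len(graph)
        dist[source] = 0.0
        for _ in range(len(graph) - 1):
            changed = False
            for u in range(len(graph)):
                if dist[u] == math.inf:
                    continue
                for k, (v, cap, cost, _rev) in enumerate(graph[u]):
                    if cap > eps and dist[u] + cost < dist[v] - eps:
                        dist[v] = dist[u] + cost
                        parent[v] = (u, k)
                        changed = True
            if not changed:
                break
        if parent[sink] is None:
            break  # no remaining mass can be moved

        # Find the bottleneck capacity along the path.
        push = math.inf
        v = sink
        while v != source:
            u, k = parent[v]
            push = min(push, graph[u][k][1])
            v = u

        # Move the mass and accumulate its travel cost.
        v = sink
        while v != source:
            u, k = parent[v]
            edge = graph[u][k]
            edge[1] -= push
            graph[v][edge[3]][1] += push
            total_cost += push * edge[2]
            v = u
    return total_cost


# Toy example: three topics on a line in a 2-D topic vector space.
topic_vectors = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
print(topic_movers_distance([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], topic_vectors))  # 5.0
print(topic_movers_distance([0.5, 0.5, 0.0], [0.5, 0.0, 0.5], topic_vectors))  # 2.5
```

In the second call, half of the mass already sits on a shared topic and costs nothing, so only the remaining half travels a Euclidean distance of 5. Because the transport problem is posed over topics and not over individual words, its size depends on the number of topics, which is consistent with the lower time complexity the abstract reports.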
Keywords/Search Tags:text information, text classification, Poisson infinite relational model, topic vector space, topic mover's distance