Tibetan-Chinese Cross-language Topic Detection And Tracking

Posted on: 2017-07-28
Degree: Master
Type: Thesis
Country: China
Candidate: Q Zhao
GTID: 2358330485955846
Subject: Computer Science and Technology
Abstract/Summary:
The rapid development of the Internet has made the network the main source of information. As online information grows ever more varied, helping people find useful information quickly has become a pressing problem. Topic detection and tracking, a key technology for addressing this problem, is designed to identify topics in large news streams and to track the subsequent development and evolution of known topics, helping people cope with today's information explosion. It has become an important research direction in natural language processing and information processing, and has great practical value in public opinion monitoring, information extraction, and related applications. As communication around the world strengthens, the languages used on the Internet have become more diverse. Topic detection and tracking is therefore no longer limited to a single language, and researchers have begun to study related cross-language technologies. The work carried out in this thesis includes the following.

We present a Tibetan-Chinese cross-language text similarity calculation method based on word vectors, and build a Tibetan-Chinese comparable news corpus by calculating the similarity between Chinese news texts and Tibetan news texts. To calculate the similarity, we first preprocess the Tibetan and Chinese texts, then use the traditional TF-IDF method to select each text's keywords, and then train word vectors to extend the keywords semantically. Experiments show that this method is feasible and improves the accuracy of the similarity calculation.

To extract Tibetan topics and Chinese topics, we build an LDA topic model on the Tibetan-Chinese comparable corpus and use Gibbs sampling to estimate the model parameters. To align Tibetan topics with Chinese topics, we calculate the similarity between Tibetan topics and Chinese topics based on the text-topic distributions generated by the LDA topic model.
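The TF-IDF keyword selection step mentioned above can be illustrated with a minimal sketch. It assumes the texts are already preprocessed into token lists; the function name `tfidf_keywords`, the `top_k` parameter, and the toy documents are hypothetical and not taken from the thesis. The word-vector expansion and the Tibetan-Chinese matching are not shown, since the abstract does not describe how they are done.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=3):
    """Return the top_k highest-scoring TF-IDF terms for each tokenized document."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    keywords = []
    for doc in docs:
        tf = Counter(doc)
        # Plain TF-IDF: relative term frequency times log inverse document frequency.
        scores = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by descending score, then alphabetically so ties are stable.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords.append(ranked[:top_k])
    return keywords


# Hypothetical, already-segmented toy documents (placeholders, not thesis data).
docs = [
    ["festival", "lhasa", "festival", "news"],
    ["economy", "market", "news"],
    ["lhasa", "tourism", "news"],
]
print(tfidf_keywords(docs, top_k=2))
```

A term such as "news" that occurs in every document gets a score of zero and is never selected ahead of more distinctive terms, which is the property that makes TF-IDF a reasonable keyword filter before any semantic expansion.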
To judge topic similarity, we propose a voting method based on cosine distance, Euclidean distance, Hellinger distance, and KL distance. We use each of the four measures to find the Chinese topic with the maximum similarity to each Tibetan topic. We then analyze the results of each measure and, when the vote is invalid, take the result of the better-performing measure as the voting result. The voting method improves the accuracy of topic alignment.

After aligning the Tibetan topics and Chinese topics, we build a cross-language LDA topic model that can detect existing topics and discover new topics in both languages. In cross-language topic tracking, we also apply the model to sample news to track the development trends of topics related to specific news events.
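The voting step over the four distance measures can be sketched as follows. This is an illustration under stated assumptions, not the thesis implementation: each topic is assumed to be a probability vector of equal length, a vote is assumed to be "invalid" when the top vote count is tied, the fallback measure (`hellinger` here) is a hypothetical choice, and the direction of the KL divergence and the smoothing constant `EPS` are also assumptions.

```python
import math
from collections import Counter

EPS = 1e-12  # assumed smoothing constant so the KL ratio never divides by zero


def cosine_distance(p, q):
    # Assumes neither vector is all zeros (true for probability vectors).
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / norm


def euclidean_distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def hellinger_distance(p, q):
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2.0)


def kl_divergence(p, q):
    # KL(p || q); the direction is an assumption, since KL is not symmetric.
    return sum(a * math.log((a + EPS) / (b + EPS)) for a, b in zip(p, q) if a > 0)


MEASURES = {
    "cosine": cosine_distance,
    "euclidean": euclidean_distance,
    "hellinger": hellinger_distance,
    "kl": kl_divergence,
}


def vote_alignment(tibetan_topic, chinese_topics, fallback="hellinger"):
    """Return the index of the Chinese topic chosen by voting over the four measures."""
    # Each measure nominates the Chinese topic at the smallest distance.
    picks = {
        name: min(range(len(chinese_topics)),
                  key=lambda j: fn(tibetan_topic, chinese_topics[j]))
        for name, fn in MEASURES.items()
    }
    counts = Counter(picks.values()).most_common()
    # Assumed rule: the vote is invalid when the two leading candidates are tied.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return picks[fallback]
    return counts[0][0]


# Hypothetical toy distributions (placeholders, not thesis data).
tibetan = [0.7, 0.2, 0.1]
chinese = [[0.1, 0.2, 0.7], [0.6, 0.3, 0.1], [0.3, 0.4, 0.3]]
print(vote_alignment(tibetan, chinese))
```

Passing the fallback as a parameter keeps the sketch neutral about which single measure performs best; the abstract says only that the better-performing measure is used when the vote is invalid.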
Keywords/Search Tags: Cross-language Text Similarity Calculation, LDA Topic Model, Topic Detection, Topic Tracking