Tibetan-Chinese Cross-language Topic Detection And Tracking

Posted on:2017-07-28

Degree:Master

Type:Thesis

Country:China

Candidate:Q Zhao

Full Text:PDF

GTID:2358330485955846

Subject:Computer Science and Technology

Abstract/Summary:

The rapid development of the Internet has made the network the main source of information. As network information grows increasingly varied, it is necessary to solve the problem of how to help people find useful information quickly. As a key technology for solving this problem, topic detection and tracking is designed to identify topics in large news streams and to track the subsequent development and evolution of known topics, helping people cope with today's information explosion on the Internet. It has become an important research direction in natural language processing and information processing, and has great practical value in public opinion monitoring, information extraction, and related areas. As communication around the world continues to strengthen, the languages used on the Internet have become diverse. Topic detection and tracking is therefore no longer limited to a single language, and experts and scholars have begun to study the related cross-language technologies. The work carried out in this paper includes the following.

We present a Tibetan-Chinese cross-language text similarity calculation method based on word vectors. We build a Tibetan-Chinese comparable news corpus by calculating the similarity between Chinese news texts and Tibetan news texts. To calculate the similarity, after preprocessing the Tibetan and Chinese texts, we use the traditional TF-IDF method to select each text's keywords and then train word vectors to extend the keywords semantically. Experiments show that this method is feasible and improves the accuracy of the similarity calculation.

To extract Tibetan topics and Chinese topics, we build an LDA topic model on the Tibetan-Chinese comparable corpus and use Gibbs sampling to estimate the model parameters. To align Tibetan topics with Chinese topics, we calculate the similarity between each Tibetan topic and each Chinese topic based on the document-topic distribution generated by the LDA topic model.
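The parameter estimation step mentioned above can be illustrated with a minimal collapsed Gibbs sampler for LDA. This is only a sketch under stated assumptions, not the thesis's implementation: the function name `lda_gibbs`, the hyperparameter values `alpha` and `beta`, the iteration count, and the toy documents are all hypothetical, and the documents are assumed to be already preprocessed into integer word ids.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch).

    docs: list of documents, each a list of integer word ids in [0, vocab_size).
    alpha, beta: symmetric Dirichlet priors (assumed values, not from the thesis).
    Returns (theta, phi): document-topic and topic-word distributions.
    """
    rng = random.Random(seed)
    n_dk = [[0] * num_topics for _ in docs]               # topic counts per document
    n_kw = [[0] * vocab_size for _ in range(num_topics)]  # word counts per topic
    n_k = [0] * num_topics                                # total tokens per topic
    z = []                                                # topic assignment per token

    # Randomly initialise the topic assignment of every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the current token from all counts.
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Full conditional p(z = t | all other assignments), unnormalised.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                # Record the newly sampled topic.
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    # Point estimates of the two distributions from the final counts.
    theta = [
        [(n_dk[d][t] + alpha) / (len(doc) + num_topics * alpha) for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta) for w in range(vocab_size)]
        for t in range(num_topics)
    ]
    return theta, phi


# Toy corpus (hypothetical): four short documents over a four-word vocabulary.
toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 1], [3, 2, 2]]
theta, phi = lda_gibbs(toy_docs, num_topics=2, vocab_size=4)
```

Here `theta` holds one topic distribution per document and `phi` one word distribution per topic; the document-topic rows are the kind of quantity the alignment step above would compare across the two languages.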
We propose a voting method based on cosine distance, Euclidean distance, Hellinger distance and KL (Kullback-Leibler) distance for judging topic similarity. We use each of the four measures to find the Chinese topic with the maximum similarity to each Tibetan topic. We analyze the results of each measure and, when the vote is invalid, take the result of the better-performing measure as the voting result. The voting method improves the accuracy of topic alignment.

After aligning the Tibetan topics and Chinese topics, we build a cross-language LDA topic model, which can detect existing topics and discover new topics in the two languages. We also use it to extrapolate from sample news stories in order to track the development trends of topics related to specific news events in cross-language topic tracking.
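The voting step over the four distance measures can be sketched as follows. This is an illustration under assumptions rather than the method as implemented in the thesis: topics are assumed to be represented as probability vectors of equal length, an "invalid" vote is assumed to mean a tie for first place, and the `fallback` measure (Hellinger here) is an arbitrary placeholder, since the abstract does not say which measure performed best. All function names are hypothetical.

```python
import math
from collections import Counter

EPS = 1e-12  # smoothing constant to avoid log(0) in the KL term (assumed value)


def cosine_distance(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / norm


def euclidean_distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def hellinger_distance(p, q):
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2.0)


def kl_distance(p, q):
    # KL divergence D(p || q); asymmetric, smoothed with EPS.
    return sum(a * math.log((a + EPS) / (b + EPS)) for a, b in zip(p, q) if a > 0)


MEASURES = [cosine_distance, euclidean_distance, hellinger_distance, kl_distance]


def closest(source, candidates, measure):
    """Index of the candidate with the smallest distance to source."""
    return min(range(len(candidates)), key=lambda j: measure(source, candidates[j]))


def vote_best_match(source, candidates, fallback=hellinger_distance):
    """Each measure votes for its closest candidate; ties defer to `fallback`."""
    picks = [closest(source, candidates, m) for m in MEASURES]
    ranked = Counter(picks).most_common()
    winner, count = ranked[0]
    # Assumption: the vote is "invalid" when two candidates share the top count.
    if len(ranked) > 1 and ranked[1][1] == count:
        return closest(source, candidates, fallback)
    return winner


# Hypothetical example: one Tibetan topic vector against three Chinese topic vectors.
tibetan_topic = [0.7, 0.2, 0.1]
chinese_topics = [[0.1, 0.2, 0.7], [0.6, 0.3, 0.1], [0.3, 0.4, 0.3]]
best = vote_best_match(tibetan_topic, chinese_topics)
```

Because each measure contributes only its top choice, the vote is insensitive to the different scales of the four distances, which is one plausible reason to combine them by voting instead of averaging.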

Keywords/Search Tags:

Cross-language Text Similarity Calculation, LDA Topic Model, Topic Detection, Topic Tracking

Related items

1	Research On The Method And Technique Of Chinese And Thai Cross-Language Topic Detection
2	Research On BBS Topic Detection And Tracking
3	Research On Topic Detection And Tracking Of Micro-blog Based On Topic Model
4	Research On Bilingual Topic Model And Its Algorithm In Cross-language Information Retrieval
5	Research On Topic Clustering Algorithm Based On Topic Models
6	Research And Realization On Chinese Text Topic Analysis Technology
7	Research On Short Text Topic Information Mining Technology
8	Research And Design On Hot Topic Detection And Tracking System In Internet
9	Research On The Method Of Auto-discovery And Verification Of Topic-Websites
10	Research On 'Topic+View' Extraction Method Based On WSO-LDA For Micro Blog Topic