Chinese Information Retrieval Based On Term Dependency Information

Posted on: 2015-02-20
Degree: Master
Type: Thesis
Country: China
Candidate: H C Yin
Full Text: PDF
GTID: 2268330428468449
Subject: Computer application technology
Abstract/Summary:
With the popularity of smartphones, the number of mobile Internet users keeps growing, which drives the rapid development of the Internet. The volume of data stored on the Internet grows exponentially every day. How to obtain useful information from this vast amount of data quickly and efficiently using information retrieval technology is a challenging problem facing both industry and academia.

Most traditional information retrieval techniques are based on the bag-of-words model and ignore basic text features such as dependence relations between terms. In this thesis, we develop a method that captures term dependence from statistical information and incorporates these relations into Chinese information retrieval built on the traditional bag-of-words model. The main contributions of the thesis are as follows.

First, this thesis proposes three term dependence relations based on statistical information: full independence, sequential dependence, and full dependence. We construct three kinds of text features from these three relations and propose a ranking function based on statistical information. Considering the average text length in the test collection and the particular characteristics of Chinese sentences, we take the distance between terms into account when forming phrases from ordered or unordered terms. We conduct two experiments on the NTCIR-5 Chinese information retrieval test collection. In the first experiment, we construct text features from all three dependence relations, considering only two terms at a time when building features from ordered and unordered terms. In the second experiment, we construct sequential dependence and full dependence features that incorporate more than two terms while limiting the query text, and we compare the results with those of the first experiment.
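The three dependence relations can be illustrated with a small counting sketch. This is not the thesis's implementation: the function names, the `max_gap` and `window` parameters and their default values are assumptions made for illustration, and the query and document are assumed to be already segmented into terms. The ranking function itself is not reproduced here.

```python
from itertools import combinations


def count_ordered(doc, terms, max_gap=1):
    """Count start positions where `terms` occur in order, with each
    consecutive pair at most `max_gap` positions apart (assumed rule)."""
    def match_from(pos, k):
        if k == len(terms):
            return True
        stop = min(pos + 1 + max_gap, len(doc))
        for j in range(pos + 1, stop):
            if doc[j] == terms[k] and match_from(j, k + 1):
                return True
        return False

    return sum(1 for i, tok in enumerate(doc)
               if tok == terms[0] and match_from(i, 1))


def count_unordered(doc, terms, window=8):
    """Count start positions where a window of `window` tokens begins
    with one of `terms` and contains all of them, in any order."""
    wanted = set(terms)
    return sum(1 for i, tok in enumerate(doc)
               if tok in wanted and wanted <= set(doc[i:i + window]))


def dependence_features(query, doc, max_gap=1, window=8):
    """Return (full independence, sequential dependence, full dependence)
    feature counts for a segmented query and document."""
    # Full independence: each query term is counted on its own.
    fi = {t: doc.count(t) for t in query}
    # Sequential dependence: only adjacent query terms are paired;
    # each pair gets an (ordered, unordered) count.
    sd = {(a, b): (count_ordered(doc, (a, b), max_gap),
                   count_unordered(doc, (a, b), window))
          for a, b in zip(query, query[1:])}
    # Full dependence: every pair of query terms is considered,
    # matched as an unordered window.
    fd = {pair: count_unordered(doc, pair, window)
          for pair in combinations(query, 2)}
    return fi, sd, fd
```

Only two-term features are built here, which corresponds to the setting of the first experiment. Extending `combinations(query, 2)` to larger subsets would give features over more than two terms, as in the second experiment.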
Our results show that incorporating term dependence into Chinese information retrieval improves both mean average precision (MAP) and P@10, and that features built from more than two terms perform better than features built from only two terms.

Second, we design a Chinese full-text information retrieval system based on term dependence information. The system consists of three modules: the corpus parsing module, the indexing module, and the user interaction module. In the corpus parsing module, we use the NTCIR-5 Chinese collection as the base corpus, parse it into four fields, and store each field in a separate text file. The indexing module performs text preprocessing as well as index creation and update. The user interaction module processes the user's query text, applies the ranking function based on term dependence relations, and presents the results.
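The two evaluation measures named in the results, MAP and P@10, follow standard definitions and can be sketched as below. The function names and the input format (a ranked list of document IDs plus a set of relevant IDs per query) are assumptions for illustration and are not taken from the thesis or from the NTCIR-5 evaluation tools.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant
    document is retrieved, divided by the number of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """Average of average_precision over (ranked, relevant) pairs,
    one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For a ranking `["d1", "d2", "d3", "d4"]` with relevant set `{"d1", "d3"}`, the average precision is (1/1 + 2/3) / 2, since the relevant documents are found at ranks 1 and 3.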
Keywords/Search Tags: Term Dependency, information retrieval, full independence, sequential dependence, full dependence