Font Size: a A A

On The Models And Theory Of Information Retrieval Based On Kolmogorov Complexity

Posted on:2007-01-27Degree:DoctorType:Dissertation
Country:ChinaCandidate:X L WangFull Text:PDF
GTID:1118360185468408Subject:Linguistics and Applied Linguistics
Abstract/Summary:PDF Full Text Request
In this thesis, we discuss and program to validate several models-some models based on NID(NCD) of kolmogorov complexity, Graphical Model, simply relevant model. Further, we compare them with VSM, and try to solve the ad hoc theoretic and empirical problems of Information Retrieval in two aspects-deriving retrieval model and explaining empirical models with universal theory.Three jobs concerning NCD theory and model, are done: approximate realization and experiment of NCD model, explanation of Information Retrieval with NCD, comparison of empirical model with NCD model and explanation of empirical model by NCD model.We derived approximately two compression algorithm based models from NCD theory, and do some experiments on structural algorithm information of texts.Experimental results indicate that the bigger the compression ratio is, the better the retrieval result will be.Besides,encoding texts such that every word can be seen as a unit or encoding words in the same length, the retrieval results also become better. It indicates that further improving the compression algorithm and increasing the compression ratio can achieve better retrieval results; modifying the program that realizes the compression algorithm to compress the texts by words can also improve the retrieval results. We compose a simple programm according to lz algorithm that indeed realizes compression by words (regarding word as incompressible unit), and conducted experiments to test. The result shows that the retrieval performance was greatly improved. The simplified algorithm did not achieve the best compression result of LZ algorithm because we approximately realized the algorithm. We expect that we can realize it exactly in the future, and do some modification on LZ, BWT algorithm to achieve better compressibility. In this way, it is hopeful to get a better performance. Besides, we can estimate the retrieval performance just depending on the text structure.Because of experimental condition, we can not conduct large-scaled experiments to estimate the effect that algorithm information of word itself has on information retrieve. We derived a IR formula on the assumption of VSM about relations between words , and composed a program based on small corpus to validate without tuning the parameters. Results show its feasibility. The algorithm has much similarity with the typical formula...
Keywords/Search Tags:text informatin retrieval, kolmogorov complexity, NCD(normalized compression distance), graph model, simple associated model, empirical model, the ad hoc problem of empirical models, ad hoc problem of information retrieval
PDF Full Text Request
Related items