Research On Chinese XML Compression Technology

Posted on: 2012-08-17
Degree: Master
Type: Thesis
Country: China
Candidate: S S Zhang
GTID: 2218330362956546
Subject: Computer application technology
Abstract:
With the extensive application of XML (Extensible Markup Language), large numbers of Chinese XML documents are exchanged and stored on the Internet and on local computers. XML is a meta-language that can be used to describe other data, and it offers simplicity, flexibility, and platform independence. However, XML carries significant structural redundancy and therefore has a low space utilization ratio. To improve the space utilization ratio and to accelerate data exchange in XML format, an efficient compressor for Chinese XML, COX (Chinese-Oriented-XML Compressor), is introduced.
Based on the features of XML, the redundancies in its structure and contents are analyzed, and COX is presented to address them. Compressing a Chinese XML document with COX comprises the following stages: creating the dictionary, searching the dictionary, encoding, and integrated compression. The dictionary is created in three steps. First, the XML documents are segmented into words and the frequency of each word is counted. Then the dictionary obtained in the previous step is filtered: short words, low-frequency words, or both are removed. Lastly, the words in the dictionary are sorted by frequency, with high-frequency words placed at the beginning and low-frequency words at the end. The sorted dictionary serves as the encoding dictionary.
After the dictionary is established, the document is scanned again, the data is classified according to certain conditions, containers are created, and each item is looked up in the dictionary. Dictionary words are encoded with a prefix code, while other types of data are encoded with the coding technique appropriate to each type. Lastly, the encoded containers are compressed by the LZMA compressor. COX draws on several technologies, including Chinese word segmentation, dictionary creation, and the classification of data into containers. The critical technology of COX is dictionary creation.
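The three dictionary-creation steps described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the `segment` function is a stand-in that merely splits text into runs of CJK ideographs or ASCII alphanumerics (a real Chinese word segmenter would go here), and the names `build_dictionary`, `min_len`, and `min_freq`, along with their default thresholds, are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def segment(text):
    """Stand-in segmenter (assumption): runs of CJK ideographs or ASCII alphanumerics."""
    return re.findall(r"[\u4e00-\u9fff]+|[A-Za-z0-9]+", text)


def build_dictionary(xml_text, min_len=2, min_freq=2):
    """Count words, filter short/rare ones, and sort by descending frequency."""
    root = ET.fromstring(xml_text)

    # Step 1: segment the text content and count word frequencies.
    counts = Counter()
    for chunk in root.itertext():
        counts.update(segment(chunk))

    # Step 2: drop short words and low-frequency words (thresholds are hypothetical).
    kept = [w for w, n in counts.items() if len(w) >= min_len and n >= min_freq]

    # Step 3: high-frequency words first; ties broken alphabetically for determinism.
    return sorted(kept, key=lambda w: (-counts[w], w))


doc = "<库><书>数据压缩</书><书>数据压缩</书><书>字典</书><书>X</书></库>"
print(build_dictionary(doc))
```

Placing frequent words at the front means they receive the smallest indices, which is what lets a prefix code assign them the shortest codewords.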
Comparative experiments are carried out between COX and other popular compressors. The results show that COX achieves the highest compression ratio on all data sets under the same experimental environment. COX improves the space utilization ratio and provides a good solution for Chinese XML document compression.
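The abstract does not state how the compression ratio is computed. As a rough sketch, assuming it means original size divided by compressed size (so that higher is better), a baseline figure for plain LZMA can be measured with Python's standard `lzma` module. The function name `compression_ratio` and the sample document are hypothetical and are not data from the thesis.

```python
import lzma


def compression_ratio(data: bytes) -> float:
    """Original size divided by LZMA-compressed size (assumed definition)."""
    compressed = lzma.compress(data)
    return len(data) / len(compressed)


# Hypothetical, highly repetitive Chinese XML sample for illustration only.
sample = ("<书><标题>数据压缩</标题></书>" * 200).encode("utf-8")
print(round(compression_ratio(sample), 1))
```

A compressor such as COX would be compared against baselines like this one by running each on the same input and comparing the resulting ratios.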
Keywords: Chinese XML Document, Data Compression, Chinese Word Segmentation, Dictionary