With the development of computer and Internet technology, the Web has become the largest integrated information base, with the greatest number and variety of resources. This information falls into two categories: structured data and unstructured data. According to statistics, unstructured data account for more than 80% of all information, and 80% of the time in the information transmission process is spent obtaining information. How to obtain information legitimately and efficiently from the Web online is therefore the motivation for this paper. XML database technology and Web search engine technology offer hope for improving the efficiency of Web information retrieval, especially the retrieval of unstructured data, because an XML database provides technical support for information storage and management, and a search engine provides a platform for Web information retrieval. For this reason, this paper studies XML data management techniques and classification-based Web search technology in depth. The main research and new ideas of this paper are as follows:

First, this paper reviews and analyzes the management and indexing mechanisms of native XML databases and XML-enabled databases. After summarizing the characteristics of the various data models, it analyzes the advantages of using a relational database as the data source for storing information and an extended XQuery model as the data model. It then proposes SBXI, an XML data storage and index structure based on Schema constraints, obtained by extending the XQuery model. It also defines a user-level XML document update language, XUL, and implements the key techniques of XML document updating using the Kweelt query system and Java.

Second, this paper addresses the key problem in XML page classification: the information retrieval model.
Because the traditional vector space model cannot be applied to similarity comparison of XML documents, this paper builds the Frequent Structure Vector Model, which consists of the TreeMiner algorithm, a document feature matrix representation, and a document similarity function. It then extends this model into the Frequent Structure Hierarchy Vector Model (FSHVM), which improves the precision of the similarity measure by extracting keyword information in addition to mining structural information. To make TreeMiner more suitable for mining frequent structures from large document collections, we improve the algorithm. Experiments show that the retrieval model based on frequent structures performs very well for classifying XML pages.

Finally, this paper proposes a two-stage search that combines classification-based retrieval with full-text retrieval. From the system design perspective, we build the framework of a full-text search engine for Web documents based on theme classification, which adopts FSHVM as the information retrieval model and SBXI as the index structure, and we discuss the functions and workflows of its main components.
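
To make the idea of a structure-based vector model concrete, the following is a minimal sketch and not the method of this paper. It assumes two simplifications: root-to-node tag paths stand in for the frequent subtrees that TreeMiner would mine, and cosine similarity stands in for the document similarity function, which the abstract does not specify. All function names are hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def tag_paths(xml_text):
    """Count every root-to-node tag path in one XML document.

    Assumption: tag paths are a simplified stand-in for the frequent
    subtrees that TreeMiner would produce.
    """
    counts = Counter()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        counts[path] += 1
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return counts


def frequent_structures(docs_paths, min_support):
    """Keep the paths that occur in at least min_support documents."""
    support = Counter()
    for paths in docs_paths:
        support.update(paths.keys())  # each document counts once per path
    return sorted(p for p, s in support.items() if s >= min_support)


def feature_matrix(docs_paths, structures):
    """One row per document, one column per frequent structure."""
    return [[paths.get(s, 0) for s in structures] for paths in docs_paths]


def cosine(u, v):
    """Cosine similarity (assumed here); 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "<book><title/><author/></book>",
    "<book><title/><author/><author/></book>",
    "<page><p/></page>",
]
paths = [tag_paths(d) for d in docs]
structures = frequent_structures(paths, min_support=2)
matrix = feature_matrix(paths, structures)
print(round(cosine(matrix[0], matrix[1]), 3))  # ≈ 0.943
print(cosine(matrix[0], matrix[2]))            # 0.0
```

The two book documents share all three frequent structures and score close to 1, while the page document shares none and scores 0. A keyword-only vector space model would not separate these documents, since none of them contains text content.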