As the amount of information on the internet grows, more and more people are coming to know and use the web. They treat the internet as an essential tool for gathering knowledge, but as the volume of web information increases, users find that they often receive useless information when they search for something of interest. Information retrieval is therefore becoming increasingly important.

Many web information retrieval systems exist today, but they cannot satisfy users' needs, because most users are non-specialists. They cannot express their needs well and cannot use the advanced search features that retrieval systems offer, so the retrieval results do not match what they actually mean. In addition, because results are ranked only by their relevance to the query, the results that users receive are often not what they need.

To solve this problem, an information retrieval system was developed for a specific domain. The system first gathers information from the web. It selects information according to what users need and, after filtering out redundancy, stores it hierarchically by source. It avoids endless crawling loops caused by web pages that cite one another. It uses semantic techniques to analyze text, combining text segmentation and keyword extraction to extract key grams from which the index is built. The system also provides a source retrieval service. It uses the index to find results quickly for a user's query and clusters those results. It provides a label for each cluster and then ranks the clusters by their relationship to the query.

In this paper, the following work has been done:
1. It crawls documents quickly from websites and preserves the information that users designate for retrieval.
It can also remove redundant information and store the information hierarchically.
2. It retrieves domain-specific information. The system consists of a front-end component and a back-end component: the front end performs retrieval and the back end builds the index. Experiments demonstrate that this index construction method outperforms methods based on word frequency and word position.
3. It uses clustering to organize the retrieval results and uses grams to represent the topic of each cluster, which improves retrieval efficiency. It also uses topics to obtain more precise cluster words.
4. Compared with general search engines such as Baidu and Google, this system focuses on a specific domain. It builds an index that reflects topics more precisely and extracts more precise words as index terms. It also clusters the retrieval results so that users can find information more efficiently.

This system is already in use at Harbin Kai Bo Company, where it stores information about the Chi Pian Canal component and the companies associated with it. This component is used in the heat transfer field. The system helps customers organize large amounts of information and quickly find relevant information about the companies and prices of this component, which improves the company's work efficiency.
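
The loop-prevention behaviour described above (avoiding endless crawling when web pages cite one another) can be illustrated with a minimal sketch. It is not the system's actual implementation: the function name `crawl`, the `get_links` callback, the `max_pages` limit, and the in-memory `LINKS` graph are all hypothetical stand-ins for real page fetching, and the sketch simply assumes that a set of already-seen URLs is kept during a breadth-first traversal.

```python
from collections import deque


def crawl(start, get_links, max_pages=1000):
    """Breadth-first traversal that visits each URL at most once.

    get_links is a hypothetical callback returning the outgoing links
    of a page; a real crawler would fetch and parse the page here.
    """
    visited = {start}       # URLs already queued, so cycles are never re-entered
    order = []              # pages in the order they were visited
    queue = deque([start])
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return order


# Hypothetical link graph in which every page cites the others.
LINKS = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a"],
}

pages = crawl("a", lambda url: LINKS.get(url, []))
print(pages)
```

Although the three pages in this toy graph all link back to one another, each is visited exactly once and the traversal terminates.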