With the rapid spread of the Internet, users frequently upload and share resources, but search engines return enormous result sets presented as a linear list, which is inconvenient to use. How to search and retrieve the necessary information more accurately and precisely has become an urgent problem. With its flexibility and capacity for process automation, text clustering has become an indispensable means of effectively organizing and navigating massive amounts of text information.

In this thesis we conduct in-depth research on document clustering algorithms. Taking into account the impact of cluster labels, we use the Lingo clustering algorithm as the main framework to explore possible applications of document clustering in the search engine field. Our research focused mainly on the following:

Many crucial text preprocessing techniques directly determine the final clustering result. We therefore studied document clustering technology in depth and implemented a multifunctional preprocessing subsystem that includes page denoising, stemming, and stop-word removal.

Traditional TF-IDF weighting is the most commonly used method in the vector space model. It raises the weight of words that occur frequently within a document and lowers the weight of words that occur frequently across the whole document set, since such words carry little document-specific information. However, it ignores both the position of a word and its part of speech, so we modified the TF-IDF formula by introducing a position factor and a part-of-speech factor, and obtained satisfactory results.

We studied the Lingo document clustering algorithm in depth, and a comparison in a controlled experiment shows its superiority in clustering and label induction.
We therefore use the Lingo clustering algorithm as the main framework and organize the clusters of Web search results into a tree structure, using the same organizing method as the HSTC algorithm; we then proposed a plan to improve performance and resolve technical problems. Test results showed that this new method performs better than the traditional HSTC method and that the expected goals were achieved.

In general, POS tagging must consider the context of words, so a clustering system that takes POS tagging into account has to tag word classes online. Because of its high complexity and computational cost, POS tagging can significantly affect system performance. We conducted an in-depth study of part-of-speech taggers and designed and implemented an XML-based dictionary that efficiently reduces the high cost of integrating a part-of-speech tagger into the clustering system.

In addition, we integrated the Nutch search engine into our clustering system, so that it can cluster both the results returned by other search engines and the results of local platform searches. Furthermore, it is a multifunctional engine with open query portals.
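
As an illustration of the modified TF-IDF weighting summarized above, the following is a minimal sketch of how a position factor and a part-of-speech factor might enter the term weight. The abstract does not state the actual formula, so the multiplicative form, the factor values, the position labels, and all names (`POSITION_FACTOR`, `POS_FACTOR`, `POS_DICTIONARY`, `weighted_tf_idf`) are assumptions made for illustration only. The part-of-speech lookup uses a small in-memory dictionary as a stand-in for the XML-based dictionary, so that no context-sensitive tagger has to run at query time.

```python
import math
from collections import Counter

# Hypothetical factor values; the source does not specify them.
POSITION_FACTOR = {"title": 2.0, "body": 1.0}
POS_FACTOR = {"noun": 1.5, "verb": 1.0, "other": 0.5}

# Stand-in for a precomputed part-of-speech dictionary (assumed content).
POS_DICTIONARY = {"cluster": "noun", "engine": "noun", "label": "noun", "search": "verb"}


def weighted_tf_idf(docs):
    """Compute TF-IDF weights scaled by position and part-of-speech factors.

    docs: list of documents, each a list of (term, position) pairs.
    Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for doc in docs:
        df.update({term for term, _ in doc})

    weights = []
    for doc in docs:
        tf = Counter(term for term, _ in doc)

        # For each term, keep the strongest position it appears in.
        best_position = {}
        for term, position in doc:
            factor = POSITION_FACTOR.get(position, 1.0)
            best_position[term] = max(best_position.get(term, 0.0), factor)

        doc_weights = {}
        for term, count in tf.items():
            idf = math.log(n_docs / df[term])
            pos_tag = POS_DICTIONARY.get(term, "other")  # dictionary lookup, no tagger
            doc_weights[term] = count * idf * best_position[term] * POS_FACTOR[pos_tag]
        weights.append(doc_weights)
    return weights


# Usage: two tiny documents given as (term, position) pairs.
example_docs = [
    [("cluster", "title"), ("cluster", "body"), ("search", "body")],
    [("search", "body"), ("engine", "body")],
]
example_weights = weighted_tf_idf(example_docs)
```

In this sketch a term that appears in every document still receives weight zero, as in plain TF-IDF; the two extra factors only rescale terms that already carry some discriminating power.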