Font Size: a A A

Fundamental Research And Application Discussion Of String

Posted on:2009-10-12Degree:MasterType:Thesis
Country:ChinaCandidate:X N ZhaoFull Text:PDF
GTID:2178360242980656Subject:Software engineering
Abstract/Summary:PDF Full Text Request
Alongwiththetechnical progress,thenetworkis playingmoreandmoreimportant role in the information field. The people are familiar withgradually gain the information from the network. At the same time, internetis growing in depth and breadth day by day. Under this situation, how canpeople access to the valuable information accurately and convenientlybecome a big headache for people to the issue, this is also the search enginemanifestsitsvalueplace.Search engine satisfied people to carry on the demand of informationsearch in the Internet, and greatly reduced in the information retrievalprocess time-consuming. However, it also has the place which needs to beimproved. Especially there is no explicit space between Chinese word and word,the search engine needs to rely on the participle dictionary to come to thekey word to carry on the recognition and the explanation. Various influence,the dictionary usually is imperfect .We have discussed through the analysisinterconnection internet pages extraction repeat words and expressions,carries on the method of the expansion to the participle dictionary In thispaper.First, we have to extract text from the web site information. This partof work divide into two parts: Traversing website content and internet pagecontent analysis. As the following work's foundation, we need a largenumber of website information. Usuallywhenwevisitthewebsite,wetransmittherequests and receive responses. But we can't we need to get large amount ofWeb page information by the manual click .We need one tool to simulatemanual visit .About this question, we use two java.net packages namedHttpURLConnectionandURLkindproducewiththetransmissionrequested,andgainpages. After having completed the above operation, we need to carry on thepages content the analysis. At present, most of the pages are compiled byHTML language, what mixed the semantics and the grammar .we need thedata message usually exists by the subject or the main text form in theHTML files .We need to analyze the HTML documents only then may carryon the next step operation.In a concrete HTML text, there are lots of juxtaposes between the HTMLelements.HTML has a distinct hierarchical structure. This kind of structuremay regard as the HTML tree. Obviously, in the HTML tree, the informationwhich we need to gain comes from the title and the main text part. At thistime, we need a way to extract what we are interested in accurately.HTMLParser has provided such function. This article has carried on thedetailed introduction to HTML and HTMLParser, and has given the concreteanalysis method and the code.The next step is to discuss the problem of how to gain repeat strings .Sofar, people has already proposed many calculate algorithm of repeat strings.They can be attributed to two broad categories .First, based on suffix tree ST;Second, based on suffix array SA. In this paper, we propose a new method–Horizontal-Division Method that is visualized. By it we can compute all the MaximalRepeats of string x using only the suffix-array SAx and lcp-array LCPx. Based on themethod, we analyzed the situations and locations which the maximal NE-Repeats andSNE-Repeats of x. Then we designed three algorithms that compute all maximalRepeats, maximal NE-Repeats, and maximal SNE-Repeats in a string x[1 .. n]. All ofthem only using SAx and LCPx, all of them requires O(n) time and space. Compareswith the other existing methods, the Horizontal-Division Method has thefollowing merit:Only suffix array SAx and LCP array LCPx would be used;It is easier to realize than algorithms that based on suffix tree, and it ismore economical in space; This algorithm's time and the spatial complicated grade are O(n);Only need to scan the array LCPx for one time, then calculates allMaximal(NE/SNE) Repeat of string x;Available for visual chart of the algorithm described it easier tounderstand.In this part, first of all, we will give a kind of algorithm for x MaximalRepeats, which is based on the calculation of all the Maximal NE-Repeatsthe algorithm, and then get all of the Maximal SNE-Repeats. Thesealgorithms and the graphical representation have carried on the detailedintroduction in the main text.After the above steps, we come to the new word extraction's last link -discrimination. We may find that some of the repeats were only because ofsome fortuitous factors exist in the results. They do not have the value regardingthe search engine, and they need to be discarded. In this part, we wouldintroduced participle dictionary's physics and the logical organization first,and described the dictionary retrieval method as following. As the differentsearch engine take different new word insertion method, but related tocommercially sensitive matters of companies, in this article, the author hasnot obtained the accurate new word discrimination and the insertion method,but we have carried on the preliminary tentative plan and infers. This is alsothe question which needs to be discussed in following work.It is noteworthy that this article describes a new word extraction methodwhich has the quite ideal time and space complexity. But this method alsohas the place that needs to be consummated. There is also a huge room forgrowth for this technology. I believe that through future study and work, butalso continue to improve on this method, and new results will be achieved.
Keywords/Search Tags:Fundamental
PDF Full Text Request
Related items