| In May 1999, Shawn Fanning, then a freshman at Northeastern University, launched the first “peer-to-peer” or P2P file-sharing application—Napster. Napster allowed individual end-users (called peers) to share the MP3-encoded music stored on their local computers directly with one another over the Internet. Within a year, Napster had grown to a user population of over 50 million users making it the fastest growing Internet application to date. Three years later, despite the closure of Napster, the phenomenon of file-sharing continues its dramatic growth and appears set to remain an important feature of the Internet for the foreseeable future. The sheer scale of these file-sharing applications make them important in their own right. And yet, as this thesis will argue, P2P is much more than just a way to trade MP3s over the Internet. The P2P architecture with its use of low-cost, grass-roots resources and its decentralized nature that does not rely on any form of centrally managed infrastructure, represents a significant departure from the client-server architecture of the Web. These unique characteristics, we believe, allow P2P systems to support the rapid and low-cost deployment of powerful large-scale applications in a manner that would not be possible with the current architecture of the Web.; There are two key pieces to a P2P system: the lookup mechanism used to locate a desired file and the actual file download. The decentralized storage in P2P systems makes the file transfer process inherently scalable; the hard part is finding the peer(s) from which to retrieve the desired file. This thesis addresses this problem of scalable indexing in P2P systems, i.e., given a file identifier, how can we find the IP address of the peer(s) holding the file? Ideally, a solution to this indexing problem must be scalable to millions of users, must find files quickly, and must be resilient to the frequent arrival and departure of participant peers. As a solution, we introduce the concept of a Content-Addressable Network (CAN) as a distributed system that provides hash table functionality—mapping “keys” onto “values”—on Internet-like scales. Our CAN design is completely distributed (requiring no form of centralized control, coordination or configuration), scalable (nodes maintain only a small amount of control state that is independent of the number of nodes in the system), and fault-tolerant (nodes can route around failures).; The Distributed Hash Table (DHT) functionality supported by CAN serves as a useful substrate for a range of large distributed systems; for example, Internet-scale facilities such as global file systems, application-layer multicast, event notification, and chat services can all be layered over a DHT system such as CAN. |