| The performance of the input/output subsystem is becoming increasingly important for many applications. Commercial I/O-intensive applications are a fast-growing market segment and face constantly increasing performance demands. Many of these applications exploit concurrency to overlap the latency of I/O operations and thereby improve throughput. At the same time, semiconductor technology trends result in a growing gap between application and operating system performance. Consequently, operating system overhead increasingly limits the efficiency of latency-hiding techniques for improving throughput. This dissertation develops and evaluates a novel I/O architecture that, by providing user-level access to the I/O subsystem, minimizes I/O overhead while maintaining the level of protection and programming flexibility of conventional kernel-based architectures. Inexpensive hardware mechanisms in the I/O device and host processor implement protected user-level request initiation, user-space data transfers, and user-level notifications. Together, these mechanisms reduce I/O overhead by up to two orders of magnitude. As a result, applications can efficiently overlap long-latency I/O operations to maximize throughput and exploit the scalable bandwidth of next-generation distributed I/O architectures. Because the architecture does not restrict the allocation and use of I/O buffers, the basic mechanisms are flexible enough to support low-overhead library implementations of a variety of standard I/O programming models. A prototype of the user-level I/O architecture is implemented and evaluated in an execution-driven system simulator. The simulation system combines detailed models of a modern microprocessor and caches (based on an existing simulator), a memory controller, and I/O devices with a UNIX-compatible operating system.
Validation of the simulator against a real workstation shows that the tool accurately captures the performance characteristics of existing computer systems. Synthetic benchmarks demonstrate that the user-level I/O architecture achieves twice the aggregate bandwidth of kernel-based I/O on 23 request streams, while at the same time reducing CPU occupancy by 98 percent. The MySQL database server improves throughput by up to 25 percent without requiring any program modifications. |