Work scope for the Linux Scalability Project
Chuck Lever, Niels Provos, Naomaru Itoi
Abstract
We outline areas of research for the Linux Scalability research project, and list specific deliverables.
This working document is Copyright © 1998, 1999 Netscape Communications Corp.,
all rights reserved. Trademarked material referenced in this document is copyright
by its respective owner.
This document outlines areas of research and development for the Linux Scalability research project. The primary goal of this research is to help make the Linux operating system enterprise-ready by improving its performance and scalability. Linux is beginning to compete with other UNIX and non-UNIX operating systems in the server market, and is becoming more popular among ISPs and ESPs who are using it to provide enterprise-class network services to their customers.
We are specifically interested in finding immediate and practical improvements to Linux that will increase the performance of Netscape's network server suite, which includes an LDAP directory server, an IMAP electronic mail server, and a web server, among others. To achieve our primary goal, we will select a number of areas of potential improvement to the Linux operating system, prioritize those areas based on their estimated pay-off versus their implementation cost, and then implement the highest-priority improvements. We will evaluate each improvement using server and OS benchmarking methodologies that are as close to standard as possible, to allow scientific comparison with other research in this area. Finally, we will work with Linux developers to incorporate our improvements into the baseline Linux source code. Freeware operating systems such as Linux afford independent researchers the opportunity to improve existing operating system features, and to add new ones, specifically to boost application performance.
In the following sections, we outline the areas where we think we can have the most success. In addition to technical achievements, part of our work will include building collaborative relationships with freeware advocates, and with system and software vendors.
Network server performance issues
There are several common performance and scalability issues recognized by researchers and server developers. Here we summarize some of the most important factors we have considered.
File descriptor scalability
As the number of network users and clients grows, the number of concurrent open file descriptors on network servers can easily exhaust system limits. The number of file descriptors maintained on servers often grows proportionally with the number of concurrent clients served. For example, IMAP servers need to maintain a socket to connect with each client, and an open file descriptor for the client's mailbox. A system-wide file descriptor limit of 1024 prevents such a server from supporting more than about 500 concurrent users.
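At the application level, a server can at least check and attempt to raise its own per-process descriptor limit. The hypothetical helper below (raise_fd_limit() is our own name, not a system interface) is a minimal sketch using getrlimit() and setrlimit(); neither call can exceed the kernel's compile-time ceiling, which is the limit we ultimately need to address.

    #include <stdio.h>
    #include <sys/resource.h>

    /* Sketch: raise this process's open-file limit as far as its hard
     * limit allows.  On kernels of this era the hard limit is itself
     * capped by a compile-time kernel constant. */
    int raise_fd_limit(void)
    {
        struct rlimit rl;

        if (getrlimit(RLIMIT_NOFILE, &rl) < 0) {
            perror("getrlimit");
            return -1;
        }
        printf("soft limit: %ld, hard limit: %ld\n",
               (long) rl.rlim_cur, (long) rl.rlim_max);

        rl.rlim_cur = rl.rlim_max;    /* request the maximum allowed */
        if (setrlimit(RLIMIT_NOFILE, &rl) < 0) {
            perror("setrlimit");
            return -1;
        }
        return 0;
    }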
Data throughput
Service scalability depends on the ability of network servers to deliver more and more data at higher and higher rates. Operating system architecture and implementation can have significant effects on data bandwidth. To improve a server's effectiveness, we need to address issues in the operating system and application that limit the amount and rate of data flowing from the server's disk to the network.
On some types of network servers, such as mail servers, the disk read-write ratio is significantly skewed towards writes. Metadata updates and data writes are among the most expensive disk and file system operations. Careful analysis of these operations may be of great benefit.
Memory bandwidth is also important in this regard. Memory allocation and system memory management can be optimized to make good use of hardware memory caches. In addition, keeping I/O data cached in main memory can improve overall server efficiency.
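One common way to keep file data cached without a private copy in every process is to map it. The sketch below is a hypothetical helper (map_file() is our own name) that maps a file read-only so the kernel's page cache backs every process that serves it; it assumes the file fits comfortably in the address space.

    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Sketch: map a file read-only so that every process serving it
     * shares the same page-cache copy instead of private read buffers. */
    char *map_file(const char *path, size_t *len)
    {
        struct stat st;
        char *p;
        int fd;

        fd = open(path, O_RDONLY);
        if (fd < 0)
            return NULL;
        if (fstat(fd, &st) < 0) {
            close(fd);
            return NULL;
        }
        p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);    /* the mapping keeps the pages referenced */
        if (p == MAP_FAILED)
            return NULL;
        *len = st.st_size;
        return p;
    }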
Network traffic generated by heavily-used network servers exhibits unique characteristics that are not easily reproduced when analyzing server performance. Clients are often situated behind high-latency network connections, resulting in a high degree of server packet retransmission. Packet retransmission creates unnecessary levels of network congestion. Furthermore, servers often maintain an increasingly large number of concurrent connections because most clients retrieve data slowly and therefore hold their connections open longer.
Research has suggested ways to improve TCP congestion management and startup behavior. The good news is that these changes can be implemented on the server, benefiting server network data throughput without dependencies on client networking software.
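The kernel-level congestion control and slow-start changes themselves are not visible from application code. As a rough illustration of a related server-side knob, though, the sketch below enlarges a socket's send buffer so more data can remain in flight to a high-latency client; the buffer size chosen is arbitrary, and the kernel may clamp the requested value.

    #include <sys/socket.h>

    /* Sketch: ask for a larger send buffer so the server can keep more
     * unacknowledged data in flight to a high-latency client.  The
     * kernel may silently limit the requested value. */
    int widen_send_buffer(int sock)
    {
        int size = 128 * 1024;    /* arbitrary illustrative value */

        return setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size));
    }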
Specific OS dependencies
Lock contention
Locks are used extensively in server applications, so the performance of an OS's lock primitives is very important. Also, support for mutexes that can be shared among processes is required.
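The kind of cross-process mutex a server needs can be expressed with POSIX threads attributes. The sketch below (create_shared_mutex() is a hypothetical helper) assumes a pthreads library that honors PTHREAD_PROCESS_SHARED, which Linux's LinuxThreads library did not yet provide at the time; that gap is exactly the sort of requirement this section describes.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sys/mman.h>

    /* Sketch: place a mutex in an anonymous shared mapping so a parent
     * and its forked children can all take the same lock.  Requires a
     * pthreads library that honors PTHREAD_PROCESS_SHARED. */
    pthread_mutex_t *create_shared_mutex(void)
    {
        pthread_mutexattr_t attr;
        pthread_mutex_t *m;

        m = mmap(NULL, sizeof(*m), PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (m == MAP_FAILED)
            return NULL;

        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(m, &attr);
        pthread_mutexattr_destroy(&attr);
        return m;
    }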
Memory management
Server applications make heavy use of shared regions, anonymous maps, and mapped files. Special features such as locking down regions so they aren't swapped, a fast mmap(), support for allocating very large shared regions and memory areas, and efficient memory allocation are especially useful. For example, malloc() in the C run-time library appears not to scale well across multiple processors, since sharing the heap requires a single global heap lock.
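As one hypothetical illustration of these features, a server might allocate a large shared region up front and ask the kernel to pin it in memory. The helper name below is our own; whether mlock() succeeds depends on privilege (normally root) and on system limits, so the sketch simply falls back to pageable memory on failure.

    #define _GNU_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>

    /* Sketch: allocate a large shared, anonymous region and try to pin
     * it in physical memory.  mlock() normally requires root privilege,
     * so failure here just leaves the region pageable. */
    void *alloc_pinned_region(size_t len)
    {
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
            return NULL;
        (void) mlock(p, len);    /* best effort: ignore failure */
        return p;
    }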
MP scalability
To provide more processing power to a server application, we can add more CPUs to a server, but first we must be sure that the operating system and the server application can take full advantage of more than one or two CPUs at a time. Network servers are generally I/O bound, however. Increasing the number of CPUs, while not directly increasing the I/O bandwidth of a system, may have other benefits, such as increasing the amount of CPU available for handling interrupts and processing network protocols. The very latest versions of Linux use MP hardware significantly more efficiently than some earlier versions do. However, there is still room to improve.
Asynchronous events and thread dispatching
Network servers require an integrated approach to asynchronous I/O and thread dispatching. Most modern server architectures make heavy use of both asynchronous I/O and threads. Asynchronous I/O support helps keep the amount of kernel resources and the number of outstanding read buffers to a minimum. An asynchronous I/O model that is easy to program with and allows reuse of server software among various OS platforms is a big win. Most importantly, an OS-provided, integrated asynchronous I/O and event dispatching facility has been shown by researchers to be critical to the performance and scalability of internet servers.
More flexible and efficient system call interfaces
Under some circumstances, Netscape server products appear to perform better on Windows NT than on UNIX platforms. Many have conjectured that NT has better system call interfaces for network servers than UNIX does. One way of improving server performance and scalability is to help the server application itself make more efficient use of the operating system and the resources it provides. We can do this by adding improved interfaces, or by making the current interfaces, such as poll(), more efficient. System interfaces should also support 64-bit files and filesystems, as well as very large address spaces and more than a few gigabytes of physical RAM.
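To make the cost of interfaces such as poll() concrete, the hypothetical event loop below shows the pattern servers typically use today: every pass scans an array with one entry per open connection, so dispatch cost grows with the number of connections even when only a few are active. handle_io() is an assumed per-connection handler, not a real API.

    #include <poll.h>

    /* Sketch of the classic poll()-based dispatch loop.  handle_io() is
     * a hypothetical per-connection handler; fds[] holds one entry per
     * open connection, so every pass through the loop costs time
     * proportional to the number of connections. */
    extern void handle_io(int fd);

    void event_loop(struct pollfd *fds, int nfds)
    {
        int i, ready;

        for (;;) {
            ready = poll(fds, nfds, -1);    /* block until activity */
            if (ready < 0)
                continue;                   /* e.g. interrupted by a signal */
            for (i = 0; i < nfds && ready > 0; i++) {
                if (fds[i].revents & (POLLIN | POLLOUT)) {
                    handle_io(fds[i].fd);
                    ready--;
                }
            }
        }
    }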
Prioritizing our work
To decide which improvements provide the most benefit, we will rate each potential improvement in the following categories.
Estimated pay-offs
Estimated implementation costs
The following efforts help build collaborative relationships among research and corporate entities and further the mission of our research.
EECS research on performance and NT-like system call APIs
Graduate research at U-M's College of Engineering may already be addressing several of these issues. We would like to coordinate with this work so we don't duplicate it.
Input from Netscape server product teams
We will consult with members of Netscape server product teams to itemize and prioritize the work the teams would like to see done to improve Linux/server product performance.
Collaboration with Linux development community
We will co-operate with members of the Linux development community to determine the current state of Linux development, and how that work affects Netscape server product performance. We will offer development resources for work on scalability and performance issues.
ISV relationships
We will work with interested system and software vendors to build support for Linux as a server platform. This work may range from helping Veritas create a Linux version of their VxFS file system, to working with Intel's performance engineers, to working with the makers of Purify to port their products to Linux.
Staged delivery
We've broken our project goals and deliverables into three stages. Project prioritization is based on what expertise and resources are available to our project, and on what has the highest pay-off and probability of success. In other words, we will start with "low-hanging fruit," and as our resource base and experience grow, we will tackle more difficult and riskier problems.
Stage one
Initially, we are interested in providing improvements that require no changes to application architecture or to the system interface. These changes are easy and have a high probability of pay-off, with little risk of introducing new bugs or performance problems. This is a period during which we will build our expertise and create ties to the Linux development community. We also anticipate forming relationships with several ISVs. Finally, we will construct and benchmark a small local test harness in preparation for measuring later implementations.
Our deliverables during stage one include a finalized version of this work scope document, scholarly papers and status reports describing our progress, and the construction of our local test harness. We will also establish OS and application benchmark baselines with microbenchmarks and application-level benchmarks. The benchmark results will provide a base-level performance measurement and suggest specific improvement initiatives.
Specific stage one projects include:
Stage two
In stage two, we will explore solutions that may involve some changes to Linux's system API and/or to server application architecture. We may also attempt some of the riskier or more complicated improvements, now that we have some experience under our belts. By then we will also have built constructive relationships with some ISVs and with parts of the Linux development community.
Our deliverables during stage two include the improvements themselves, along with scholarly papers and reports describing them. Specific stage two projects include:
Some such areas might include improving the performance of creating many files in the same directory, providing support for swapping memory-based filesystems, improving the efficiency of metadata operations and data writes, and supporting very large filesystems via variable block sizes (for use with RAID subsystems).
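As a hypothetical example of how we might measure one of these areas, the sketch below times the creation of many small files in a single directory; the file count and naming scheme are arbitrary choices, not part of any standard benchmark.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/time.h>
    #include <unistd.h>

    /* Sketch: time the creation of NFILES empty files in the current
     * directory.  ext2 searches directories linearly, so per-file cost
     * grows as the directory fills, which is what we want to observe. */
    #define NFILES 10000

    int main(void)
    {
        struct timeval start, end;
        char name[64];
        double elapsed;
        int i, fd;

        gettimeofday(&start, NULL);
        for (i = 0; i < NFILES; i++) {
            sprintf(name, "bench.%d", i);
            fd = open(name, O_CREAT | O_WRONLY, 0644);
            if (fd < 0) {
                perror("open");
                return 1;
            }
            close(fd);
        }
        gettimeofday(&end, NULL);

        elapsed = (end.tv_sec - start.tv_sec)
            + (end.tv_usec - start.tv_usec) / 1e6;
        printf("created %d files in %.2f seconds\n", NFILES, elapsed);
        return 0;
    }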
Stage three
During stage three, we will subject our most successful improvements from the earlier stages to more stringent performance and scalability testing by working directly with Netscape's server development teams, and by providing some of our work to service providers we know well, such as Netcenter. As part of our stage three efforts, we might also focus on creating a Linux Center of Excellence at CITI. This Center of Excellence could host other researchers and provide hardware testbeds for advanced research and development of the Linux operating system.
Our deliverables during stage three include scholarly papers and reports describing the deployment and performance measurement results. We will also identify a comprehensive set of performance measurement tools and methodologies. Finally, we will complete a Linux server "Best Practices" guide that describes Linux-specific configuration options and enhancements to help customers get the most out of their Linux-based network servers.
Specific stage three projects include:
Benchmark methodologies
The current Linux development kernel (v2.1) is about to be rolled over into the next version of the stable kernel (v2.2). The 2.1 kernel has been "feature-frozen" since the spring of 1998, meaning that bug fixes are gladly accepted, but new features generally are not. Since the 2.2 kernel is so close, it is likely that most or all of our enhancements will be added to the 2.3 kernel when it arrives. Wherever possible, we will work with the current development tree, since it contains a number of enhancements that are required by the Netscape server products. The development tree contains many improvements to the kernel, but is sometimes made unusable by work in progress, so it may be a source of delays.
To provide truly useful measurements of performance and scalability, we will choose benchmarking systems that performance researchers and Netscape's own performance engineers use most often. This permits comparison and repetition of our work, increasing its value over time. At the same time, we recognize that standard benchmarks are often inadequate for measuring certain types of performance problems, so we will use other benchmarks as well.
There will also be cases where we want to examine directly the effects of certain modifications to operating system features. For analyzing OS-specific modifications, McVoy's microbenchmarks and the Byte Linux Benchmarks will be useful. File system benchmarks, such as Bonnie, the Modified Andrew Benchmark, and SPEC's SDET and KENBUS benchmarks, will provide cross-sections of overall system performance.
We are especially interested in application performance, so application-specific benchmarks will also be used to measure our progress. Webstone and SPECweb96 appear to be the standard web server benchmarks. However, S-client and httperf have features that would exercise pathological network behavior, and may be useful in judging networking improvements. Directory-Mark is Netscape's directory server benchmark of choice.
We have a strong bias towards Web-server benchmarks, even though our work will initially be focused on the Netscape Directory and Messaging Servers, for several reasons:
Hardware
High-speed networking technologies will be an integral part of our test harness. Either switched fast Ethernet or ATM will make up our test harness network.
We will also have multiprocessor hardware on hand to implement and test SMP changes. It may be advisable to use the more powerful machines to drive server loads against smaller machines in the test harness, to approach server performance limits more quickly and repeatably.
Testing and evaluation of large-scale server configurations is beyond the scope of this project. We can go as far as understanding compatibility and Linux-specific performance issues with large-scale and esoteric configurations, but our expertise is focused on software optimization. Moreover, we believe that all of our operating system optimizations will benefit both moderate and large-scale server deployments. As our work progresses, we will be better positioned to investigate large-scale server performance issues later.
Milestones
In this section, we list project milestones by date.
If you have comments or suggestions, email linux-scalability@citi.umich.edu.