Interview: Ingo Molnar
Submitted by Jeremy
KernelTrap
December 03, 2002
Ingo Molnar has been contributing to Linux kernel development since 1995 with an
impressive list of accomplishments. Most recently his O(1) scheduler was merged
into the 2.5 development kernel, as well as much work to enhance the handling of
threads. Other highly visible contributions include software-RAID support and the
in-kernel Tux web and FTP servers.
In this interview, Ingo explores how he started working on the Linux kernel, noting,
"It might sound a bit strange but i installed my first Linux box for the sole purpose
of looking at the kernel source." He goes on to explain the concepts behind his
new O(1) scheduler, and to describe many of his other kernel efforts. This interview
was conducted over several months, and covers a wide range of interesting topics...
Jeremy Andrews: When did you get started with Linux?
Ingo Molnar: i think i first heard about Linux around 1993, but i truly got hooked
on kernel development in 1995 when i bought the German edition of the 'Linux Kernel
Internals' book. It might sound a bit strange but i installed my first Linux box
for the sole purpose of looking at the kernel source - which i found (and still
find) fascinating. So i guess i'm one of the few people who started out as a kernel
developer, later on learned their way to be a Linux admin and then finally learned
their way around as a Linux user ;-)
JA: What was your first contribution to the kernel?
Ingo Molnar: my very first contribution was a trivial #ifdef bugfix to the networking
code, which was reviewed and merged by Alan Cox. At that point i had been lurking
on the kernel mailing list for a couple of months already. My first bigger patch
was to arch/i386/kernel/time.c, where i implemented timestamp-counter based gettimeofday()
on Pentiums (which sped up the gettimeofday() syscall by a factor of ~4) - that
code is still alive in current kernels. This patch too was first reviewed by Alan
Cox.
I strongly believe that a positive 'first contact' between kernel newbies and kernel
oldbies is perhaps the single most important factor in attracting new developers
to Linux. Besides having the ability to code, kernel developers also need the ability
to talk and listen to other developers.
JA: Do you participate in other mailing lists beyond the lkml? Or is this the only
place where newbies and oldbies alike will find you?
Ingo Molnar: i'm subscribed to many mailing lists, but for kernel development it's
the vger list(s) where most of the stuff happens.
JA: Your most recent contribution to the Linux kernel was the O(1) scheduler, merged
into the 2.5 development tree in early January. When did you first start working
on this project and what was the inspiration?
Ingo Molnar: one of the core ideas used within the O(1) scheduler - the use of
two sets of priority arrays to achieve fairness - in fact originates from around
1998; i even wrote a preliminary patch back then. I couldn't solve some
O(N) problems at the time, so i stopped working on it. I started working on the current
O(1) scheduler late last year (2001), sometime in December. The inspiration was
what the name suggests - to create a good scheduler for Linux that is O(1) in its
entirety: wakeup, context-switching and timer interrupt overhead.
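To make the two-array idea concrete, here is a minimal compilable sketch. The names and data layout are illustrative only; the real 2.5 sched.c uses per-priority task lists plus a bitmap and a find-first-bit operation, but the principle is the same:

#define MAX_PRIO 140                /* 100 RT levels + 40 nice levels */

struct prio_array {
    int nr_active;                  /* total runnable tasks in this array */
    int queued[MAX_PRIO];           /* tasks queued per priority level */
};

struct runqueue {                   /* one per CPU */
    struct prio_array *active;      /* tasks with timeslice left */
    struct prio_array *expired;     /* tasks that used up their timeslice */
    struct prio_array arrays[2];
};

/* O(1) in the number of tasks: the scan is bounded by the constant
 * MAX_PRIO, no matter how many tasks are runnable. */
static int pick_next_prio(struct runqueue *rq)
{
    if (!rq->active->nr_active) {
        /* active array drained: swap the two arrays - this single
         * pointer swap provides fairness without any O(N)
         * recalculation loop over all tasks */
        struct prio_array *tmp = rq->active;
        rq->active = rq->expired;
        rq->expired = tmp;
    }
    for (int prio = 0; prio < MAX_PRIO; prio++)
        if (rq->active->queued[prio])
            return prio;
    return -1;                      /* nothing runnable */
}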
JA: Did you base the design on any existing scheduler implementations or research
papers?
Ingo Molnar: this might sound a bit arrogant, but i have only read (most of the)
research papers after writing the scheduler. This i found to be a good approach
in the area of Linux - knowing about too many well-researched details can often
confuse the real direction we have to take. I like writing new code, and i prefer
to approach things from the physics side: take a few elementary rules and build
up the 'one correct' solution, no compromises. This might not be as effective as
first reading all the available material and then cherry-picking a few ideas and
thinking up the remaining things, but it sure gives me lots of fun :-)
[ One thing i always try to ensure: i take a look at all existing kernel patches
that were announced on the linux-kernel mailing list in the same area, to make sure
there's no duplication of effort or NIH syndrome. Since such kernel mailing-list
postings are progress reports of active research, it can be said that i read a lot
of bleeding-edge research. ]
JA: Can you explain how your O(1) scheduler improves upon the previous Linux scheduler?
Ingo Molnar: there are three main areas of improvements.
firstly, as the name suggests, it behaves pretty well independently of how many
tasks there are in the system. A number of server workloads (eg. JVMs) actually
triggered this inefficiency in the old scheduler.
secondly, scheduling on SMP improved significantly: both performance and scalability
are much better. Also, scheduler decisions are much more robust these days, because
the core design is SMP-aware.
the third improvement is in the way interactive tasks are handled. This is actually
the change that should be the most noticeable for ordinary users. Interactive tasks
are now detected via a separate, usage-statistics-driven mechanism, which is decoupled
from other scheduler mechanisms such as timeslice management. The end result is:
interactive tasks are still snappy under heavy load, and CPU-intensive tasks are
isolated from interactive tasks much better so that they cannot monopolize CPU resources.
This part is still being tweaked - the important thing is that it's decoupled.
The old scheduler had lots of behavioral details integrated into it implicitly,
which made tweaking harder. There's even a patch that makes all scheduler-internal
constants (timeslice length, various deadlines and interactivity rules) runtime
configurable.
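The decoupling can be sketched like this; the constant and field names below are illustrative rather than the exact 2.5 code, but the principle - a sleep/run usage statistic translated into a priority bonus, independent of timeslice management - is the same:

#define HZ            100           /* timer ticks per second (illustrative) */
#define MAX_SLEEP_AVG (10 * HZ)
#define MAX_BONUS     10

struct task_stats {
    unsigned long sleep_avg;        /* rises while sleeping, decays while running */
};

/* interactive tasks (editors, shells, X) sleep most of the time and
 * earn a priority boost; CPU hogs earn a penalty. Note that nothing
 * here touches timeslice management. */
static int interactivity_bonus(const struct task_stats *t)
{
    int bonus = (int)(t->sleep_avg * MAX_BONUS / MAX_SLEEP_AVG);
    return (MAX_BONUS / 2) - bonus; /* negative = higher priority */
}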
the scheduler also enabled the implementation of a new scheduling policy: batch-scheduling
of CPU-intensive tasks. This is a correct implementation of the SCHED_IDLE patches
that are floating around - the end result is that batch-scheduled tasks do not disturb
other tasks in the system in any way, if other tasks are running then batch-scheduled
tasks take up zero CPU time. This can be used for things like SETI calculations,
or long numeric calculations in university setups.
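As a usage illustration only - assuming the SCHED_BATCH patch hooks into the standard sched_setscheduler() interface, which is how later mainline kernels came to expose the policy - marking a numeric job as batch-scheduled might look like this:

#include <sched.h>
#include <stdio.h>

#ifndef SCHED_BATCH
#define SCHED_BATCH 3               /* the value later mainline kernels use */
#endif

int main(void)
{
    struct sched_param param = { .sched_priority = 0 };

    /* pid 0 = the calling process */
    if (sched_setscheduler(0, SCHED_BATCH, &param) == -1)
        perror("sched_setscheduler");

    /* ... long-running numeric calculation goes here ... */
    return 0;
}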
JA: Is this batch-scheduling in the queue of patches waiting inclusion into the
2.5 kernel?
Ingo Molnar: well, batch scheduling is a feature that is welcomed by a number of
users, but is largely irrelevant to others. Right now it's a separate patch to the
stock scheduler. There are also some conflicting requests about SCHED_BATCH semantics:
some people would like priority levels to cause RT-like separation of execution
times, while the current SCHED_BATCH patch uses priority levels [ie. nice values]
to determine the percentage of CPU time shared between SCHED_BATCH tasks. Until
such issues are decided (by actual use), it's not good to codify them by moving
the patch into the stock kernel.
ie. in the first 'RT-alike priorities' model, if a 'nice level 15' SCHED_BATCH task
is running at the same time a 'nice level 10' task is running, the nice-10 task
will get all the CPU time - always, until it exits. In the second priority model
the nice-10 task will get more CPU time than the nice-15 task, but both of them
will get CPU time.
another property of SCHED_BATCH scheduling is the use of much longer timeslices.
Eg. right now it's 3 seconds for a default priority SCHED_BATCH task - while normal
tasks have 150 msec timeslices. For things like numeric calculations it's good to
have timeslices as long as possible, to minimize the effect of cache thrashing. Eg.
on a sufficiently powerful CPU with a 2 MB L2 cache, the 'population time' of the
cache can be as high as 10 milliseconds. So if there are two numeric calculation
tasks that both fully utilize the L2 cache (in nonobvious patterns), and which context-switch
every 150 milliseconds, then they will waste 10 milliseconds on cache-rebuilding
in the first 6% of their timeslice. This shows up as a direct 6% slowdown of the
numeric calculation jobs. Now, if SCHED_BATCH is used, and each task has a 3000
milliseconds timeslice, then the cache-rebuild overhead can be at most 0.3% - a
far more acceptable number.
this is also one of the reasons why the default timeslice length got almost doubled
over the 2.4 scheduler's timeslice length (there it was 80 msecs). We cannot do
the same in the 2.4 scheduler because its interactivity detection code is weaker,
which would make things like an X desktop appear 'sluggish' while eg. a compilation
job is running.
I think in the longer term we want to have a more abstract timeslice management
solution (something like the fairsched patch), which is more than possible with
the O(1) scheduler.
JA: How do JVMs trigger an inefficiency in the old scheduler?
Ingo Molnar: the Java programming model prefers the use of many 'threads' - which
is a valid and popular application programming model. So JVMs under Linux tend to
be amongst the applications that use the most processes/threads, which are interacting
in complex ways. Schedulers usually have the most work to do when there are many
tasks in the system, so JVMs tend to trigger scheduler inefficiencies sooner than
perhaps any other Linux application.
JA: Are you aware of any areas where your O(1) scheduler doesn't perform as well
as the 2.4 scheduler?
Ingo Molnar: not really, i tried to make sure we preserve all the good things from
the 2.4 scheduler. If anyone manages to identify such an area then please mail me
about it! :-)
well, there's one area where a difference can be felt - nice levels (priorities) are
taken far more seriously in the new scheduler. So if one wants maximum X performance
even during heavy X load which makes X a "CPU hog", the X server should be reniced
to nice levels -10 or -15:
renice -15 -u root
the above command renices all currently existing root-owned processes to -15; on
most distributions this includes admin shells and X as well. Some distributions that
added the O(1) scheduler to their kernel also set X's priority to -10 or -15 in
X's startup scripts - an obviously more robust solution.
similarly, audio playback code that uses up lots of CPU time (Ogg decoders for example)
should use nice level -20 to get the best audio latencies and no 'skipping' of soundtracks.
Most of the audio playback applications already support the use of RT priorities
for playback - nice level -20 under the O(1) scheduler is a far more secure solution.
(if a task with RT priority locks up then that can cause a lockup of the system.)
Obviously all of these operations are privileged, and can only be done as root.
JA: What are the scalability limitations of the O(1) scheduler?
Ingo Molnar: i'm afraid there are none currently - the runqueues are perfectly isolated.
The load-balancer is the only piece of code that has to look at the 'global' scheduling
picture, but even that code first tries to figure out whether it has work in a 'lock-less'
way; only then does it take the runqueue locks. And the load-balancer runs at a much
lower frequency than other pieces of the scheduler - the 'big' load balancer runs
every 250 msecs - which, in the kernel's timescale, is an eternity. The 'idle rebalancer'
runs every 1 millisecond on every idle CPU - but since it uses up idle time its
cost is essentially zero.
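A compilable toy sketch of this check-locklessly-first pattern (all names and types here are illustrative, not kernel code):

#include <pthread.h>
#include <stddef.h>

struct runqueue {
    pthread_mutex_t lock;
    int nr_running;
};

/* peek at queue lengths without taking any locks - the snapshot may
 * be stale, but a stale 'balanced' answer only delays balancing by
 * one tick */
static struct runqueue *find_busiest(struct runqueue *rqs, int n,
                                     struct runqueue *this_rq)
{
    struct runqueue *busiest = NULL;
    for (int i = 0; i < n; i++)
        if (rqs[i].nr_running > this_rq->nr_running + 1)
            busiest = &rqs[i];
    return busiest;
}

static void load_balance(struct runqueue *rqs, int n, struct runqueue *this_rq)
{
    struct runqueue *busiest = find_busiest(rqs, n, this_rq);
    if (!busiest)
        return;                     /* common case: no locking at all */

    /* lock both runqueues in a fixed (address) order to avoid
     * deadlock, then re-check, since the lock-less snapshot may
     * have gone stale in the meantime */
    pthread_mutex_lock(busiest < this_rq ? &busiest->lock : &this_rq->lock);
    pthread_mutex_lock(busiest < this_rq ? &this_rq->lock : &busiest->lock);
    if (busiest->nr_running > this_rq->nr_running + 1) {
        busiest->nr_running--;      /* 'pull' one task over */
        this_rq->nr_running++;
    }
    pthread_mutex_unlock(&this_rq->lock);
    pthread_mutex_unlock(&busiest->lock);
}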
this property of the load-balancer also enables us to add more complex things like
support for NUMA cache hierarchies easily - it's not a performance or scalability-critical
piece of code. It can support everything from '16 isolated groups of 4 CPUs' to
'32 CPUs on a single chip' to any other future cache hierarchy. This is the beauty
of the new multiprocessor scheduler, the handling of the cache hierarchy [ie. SMP
or NUMA or CCNUMA, etc.] is decoupled from the actual per-CPU scheduling, so (hopefully)
there's no radical redesign needed in the future.
JA: Are the improvements of the O(1) scheduler mainly felt on large servers with
multiple CPUs? For example, my home server is an aging PIII 650, and there's definitely
a finite limit to the number of processes it can handle at one time.
Ingo Molnar: the improvements are noticeable for basically every workload where
the CPU is 'overloaded': interactive tasks are actively detected and preferred,
and the scheduler itself does not add to the load no matter how high the load is.
While it's typically servers that are overloaded (mainly because server admins can
size their own needs better than desktop users, and mainly because desktop use is
largely dependent on the reaction speed of humans, which is quite lacking), it's
still quite common for desktop systems to get into various sorts of CPU overload
situations - so overload handling is important to both categories of users.
JA: What sort of tuning is left to be done?
Ingo Molnar: there are things left like support for non-SMP and non-UP cache hierarchies,
like NUMA or SMT (HyperThreading), but the basic design makes the scheduler well-suited
for such purposes as well. In most cases an alternative load-balancing algorithm
solves the problems. Plus the tweaking of parameters will perhaps never end.
JA: Do you intend to personally work on supporting the NUMA architecture?
Ingo Molnar: i think the patches from Erich Focht & NUMA crew are looking good,
and i'm quite sure we will merge them once things have settled down.
JA: Do you aim to have the O(1) scheduler eventually merged into the mainline 2.4
kernel?
Ingo Molnar: this largely depends on Marcelo. I'm (trying to) do periodic backports
of the scheduler to 2.4, and feedback has been positive so far. Most distributions
include the O(1) scheduler in their kernel tree, so the code gets a fair amount
of testing. (in fact only Debian does not include it; this is due to a generic policy
of shipping the default kernel as shipped by Linus.) If 2.6 is released soon enough
then it might not be worth putting the O(1) scheduler into 2.4 - with so much stuff
being backported to 2.4 i think 2.6 should have some new features by the time it's
released! :-)
JA: Whether or not it actually happens, how much more testing needs to be done before
you personally would be comfortable with the O(1) scheduler being merged into the
stable kernel?
Ingo Molnar: it depends on what rule Marcelo uses to include stuff in 2.4. If the
rule is to 'include stuff that lots of people use and which works just fine' then
the O(1) scheduler is ready. If the rule is to 'include nonintrusive or must-have
fixes only' then the O(1) scheduler should not be included, since the 2.4 scheduler
works just fine for most workloads. The 2.4 scheduler is still actively maintained
and has no major problems, so it's not like we are in a hurry.
JA: How up-to-date is the O(1) scheduler that's part of Alan Cox's 2.4-ac tree?
Ingo Molnar: it's in essence the same scheduler that we have in 2.5. It has one
minor tweak missing. In the 2.5 scheduler we have bits of code from other kernel
features as well which are not present in 2.4 (and probably never will be): Rusty
Russell's hotplug CPU framework, Robert Love's preemptable kernel code, Dipankar
Sarma's RCU code and Andrew Morton's autoremove wakeups. So the 2.5 sched.c cannot
be directly compared to 2.4's sched.c.
JA: You've also recently been experimenting with making the O(1) scheduler aware
of Hyper-Threading (aka symmetric-multithreading) capable CPU's. You explained in
an email to the Linux kernel mailing list how you implemented this by introducing
the concept of a shared runqueue. With future tuning, how much of a performance
gain do you think you can get by adding this support?
Ingo Molnar: this patch makes a measurable impact when the HT-capable system is
not fully utilized. Eg. if a 2-CPU HT system (4 logical CPUs) has 2 tasks running.
In this case the correct scheduling decision is to move the two tasks to two different
physical CPUs. This alone can increase the two tasks' performance by up
to 30% - and for HT systems that are out on the market now it could
have a bigger impact as well. It all depends on the tasks, how much cache they use
and how well the SMT hardware switches between logical CPUs.
when the HT system is fully utilized then the 'stock' scheduler gets pretty close
to the 'HT-aware' scheduler's performance, due to an existing feature of the scheduler,
the so-called "cache-decay based affinity".
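A toy sketch of that HT-aware placement decision: with 2 physical packages of 2 logical CPUs each, an idle package is preferred over the idle sibling of a busy package. All names here are illustrative:

#include <stdio.h>

#define NR_CPUS  4                  /* logical CPUs */
#define SIBLINGS 2                  /* logical CPUs per physical package */

static int nr_running[NR_CPUS / SIBLINGS];  /* one shared runqueue per package */

/* pick the logical CPU whose *package* runqueue is least loaded */
static int pick_cpu(void)
{
    int best = 0;
    for (int pkg = 1; pkg < NR_CPUS / SIBLINGS; pkg++)
        if (nr_running[pkg] < nr_running[best])
            best = pkg;
    return best * SIBLINGS;         /* first sibling in that package */
}

int main(void)
{
    nr_running[0] = 1;              /* first task lands on package 0 */
    /* the second task goes to logical CPU 2 (the other physical
     * package), not to CPU 1 (the busy package's idle sibling) */
    printf("second task goes to logical CPU %d\n", pick_cpu());
    return 0;
}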
JA: Have you had much feedback regarding this patch?
Ingo Molnar: Intel is obviously interested, and so were a number of kernel developers,
and users as well. But i do not expect the kind of feedback the O(1) scheduler itself
produced - HT systems are fresh on the market, and the stock O(1) scheduler handles
them reasonably well already.
JA: What processors currently support Hyper-Threading?
Ingo Molnar: only Intel AFAIK - Hyper-Threading is an Intel trademark iirc. I think
there are some non-x86 CPUs that have SMT concepts included, perhaps PowerPCs or
Alpha?
JA: You're also the author of the original kernel preemption patch. How did your
patch differ from the more recent work Robert Love has done in this area?
Ingo Molnar: it was a small concept-patch from early 2000 that just showed that
a preemptible kernel can indeed be done by using SMP spinlocks. The patch, while
it booted and appeared to work to a certain degree, had bugs and did not handle
the many cases that need special care, which Robert's patches and the current 2.5
kernel handle correctly.
otherwise the base approach is IMO very similar, it has things like:
+ preempt_on();
clear_highpage(page);
+ preempt_off();
and:
+ atomic_inc_local(&current->may_preempt);
which is quite similar to what we have in 2.5 today, with the difference that
Robert and the kernel developer community actually did the other 95% of the work
:-)
JA: Are you also actively working on 2.5 preemptible kernel development?
Ingo Molnar: The maintainer is Robert - i do tend to send smaller preempt related
patches (and even a larger one, the 'IRQ lock removal' patch centered around the
use of the preemption count). I'm obviously interested in the topic, and i'm happy
that seemingly conflicting concepts such as low-latency and preemption are now
properly merged into 2.5 and that we have really good kernel latencies. Other pressing
topics like the scheduler and the threading code still keep me busy most of the
time.
JA: Your IRQ rewrite and Robert's preemptible kernel work have resulted in a unified
per-task atomic count (the preempt_count) and a lot of code being cleaned up. Do
you have plans to do more work in this area?
Ingo Molnar: not at the moment - right now i think that the IRQ code could hardly
be any cleaner than it is today :-)
JA: What other kernel projects are you currently working on?
Ingo Molnar: mainly the scheduler, plus these days i'm working on enhancing the
handling of 'threads' under Linux, utilized by the NPTL project done by glibc maintainer
Ulrich Drepper. A good number of its components are in the 2.5 kernel
already.
JA: Can you further describe the components that have already been merged into the
2.5 kernel?
Ingo Molnar: TLS stands for 'Thread Local Storage'. You can find the first announcement
of the patch at:
http://lwn.net/Articles/5851/
a number of followup patches were posted, and it all got eventually merged
into 2.5.31.
Plus there were a few other things related to threading:
http://lwn.net/Articles/8131/
http://lwn.net/Articles/8034/
http://lwn.net/Articles/7618/
http://lwn.net/Articles/7617/
http://lwn.net/Articles/7603/
http://lwn.net/Articles/7411/
http://lwn.net/Articles/7408/
(note that most of the above patches got reworked significantly before they
got into the 2.5 kernel, but the concepts were all preserved.)
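What TLS buys application code: with the kernel-side support plus a TLS-aware toolchain and glibc, each thread gets its own instance of a __thread variable. A minimal illustration (build with -pthread):

#include <pthread.h>
#include <stdio.h>

static __thread int counter = 0;    /* one instance per thread */

static void *worker(void *arg)
{
    counter++;                      /* touches only this thread's copy */
    printf("thread %ld: counter=%d\n", (long)arg, counter);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, (void *)1L);
    pthread_create(&t2, NULL, worker, (void *)2L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;                       /* both threads print counter=1: no sharing */
}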
JA: You conducted a test to start hundreds of thousands of threads at one time...
Can you describe how you did this, and what were the results?
Ingo Molnar: the first test had slightly less than 100,000 threads. My goal was
to create an easy tool to trigger inefficiencies in kernel algorithms that somehow
still depend on the number of threads. I wrote some simple C code that started up
100,000 parallel threads with a small userspace stack. This simple test alone triggered
4-5 inefficiencies in the kernel which took a number of days to fix. One of the
inefficiencies was the PID allocator, which got discussed on linux-kernel quite
extensively and which triggered some emotional responses as well, but finally the
patch was merged and now we have a constant-time PID allocator. Another thing the
code triggered was the fact that procfs crashed upon creating the 65,536th thread
- so my box was definitely the first box on the planet that ran more than 64K kernel
threads :-) Another deficiency was the O(N) property of the exit()/wait4() syscalls.
And while we were at it, a number of new syscalls were introduced to reduce the
overhead of thread operations as much as possible.
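A hedged reconstruction of the kind of test described (not the actual code): spawn a huge number of threads, each with a minimal userspace stack, so that thread-count-dependent kernel paths get stressed:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NR_THREADS 100000
#define STACK_SIZE (16 * 1024)      /* tiny stack keeps address-space use low */

static void *idle_thread(void *arg)
{
    (void)arg;
    pause();                        /* just exist: the load is the thread count */
    return NULL;
}

int main(void)
{
    pthread_attr_t attr;
    pthread_t tid;

    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, STACK_SIZE);

    for (int i = 0; i < NR_THREADS; i++) {
        if (pthread_create(&tid, &attr, idle_thread, NULL) != 0) {
            fprintf(stderr, "thread creation failed at %d\n", i);
            return 1;
        }
    }
    printf("started %d threads\n", NR_THREADS);
    pause();
    return 0;
}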
The IRQ-stacks patch, written by Ben LaHaise and Dave Hansen, roughly doubles the
maximum numbers of threads possible on a 1 GB RAM x86 box, ie. slightly less than
200K threads.
JA: What other Linux kernel related projects have you worked on in the past?
Ingo Molnar: here's a probably incomplete list of the bigger pieces that made it
into the kernel: software-RAID support, 3-level paging on x86 (and highmem), the
recent IRQ handling rewrite in 2.5 (which also removed the 'big IRQ lock'), the
timer scalability patch, kernel workqueues, the CPU affinity syscalls, the initial
SMP pagecache scalability code in 2.3, and i also wrote the original 'writeback
pagecache' patch for 2.3, wrote various fixes and enhancements to the 'old' scheduler,
wrote the 'wake one' support patch for 2.4, wrote the original zoned allocator,
bootmem and mempool subsystems. Ie. all across the spectrum.
One project that is not in the 2.5 kernel is the Tux webserver (and now FTP server
as well). If you want to see a Tux/FTP server that can serve 10,000 users then do:
ftp ftp://ftp.rpmfind.net/
some smaller but interesting patches: the NMI watchdog, the ability of the 2.4 kernel
to create more than ~4000 processes on x86 (ie. the removal of per-thread TSS),
netconsole/netdump, 'big reader locks', and one older patch from 2.2 times i'm particularly
proud of: i wrote the original 'current task pointer' implementation, which uses
the stack pointer to get to the 'current task pointer' on SMP systems. I also wrote
the 'memleak' and 'ktrace' debugging helper tools, which have been picked up by
other projects.
JA: Your list of contributions is staggering!
Ingo Molnar: well, it's just that i've been around long enough, and that i'm interested
in many different areas. So a colorful mix of contributions piled up.
JA: Are you still working on the Tux webserver?
Ingo Molnar: occasionally yes, but other things take precedence currently. But life
has not stopped, eg. Anton Blanchard has ported Tux to 2.5, and Arjan van de Ven
keeps the 2.4 patch up to date.
JA: How complete an implementation is the Tux FTP server?
Ingo Molnar: it started out as a proof of concept that a fully in-kernel FTP server
was possible. Eg. it implements/accelerates some of the functionality of 'ls' within
the kernel as well. It does not handle all aspects of the FTP protocol yet (and
perhaps will never support the full range), its main feature currently is absolute
security [eg. attackers cannot upload trojans through it, because it, well, has
no upload support at all :-) ] and download-only FTP serving - that's clearly the
main area where FTP server performance is the biggest problem.
JA: In your opinion, are the Tux webserver and FTP server ready to be merged into
the 2.5 kernel?
Ingo Molnar: Anton Blanchard has done merging work in that area, but i think we
missed the 2.5 feature freeze deadline. The patch needed for the generic kernel
(ie. not the Tux code itself) is fairly small - most areas are part of the kernel
already.
JA: What still needs to be modified in the generic kernel?
Ingo Molnar: it's mainly two VFS changes, an exit()-time cleanup function and one
new TCP event callback. All the 'big' features that TUX motivated - zerocopy and
the scalability work - are in the 2.5 kernel already, so TUX for 2.5 is a really
unintrusive patch.
JA: Do these remaining patches add any overhead to the kernel for users that do
not need TUX?
Ingo Molnar: nope, not really.
JA: What future plans do you have for Tux?
Ingo Molnar: well, to get it into the stock kernel :-)
JA: Has Linus offered any opinion regarding TUX, and the possibility of merging
it into his kernel tree?
Ingo Molnar: yes, there was an attempt to merge TUX early during 2.5. There were
no big technical problems, just suggestions to do certain things differently - this
is what happens when any bigger piece of code is merged. It got delayed by the scheduler
and then by the threading work and now by the feature freeze :-)
JA: Of all these many impressive accomplishments, which are you the most proud?
Ingo Molnar: well, perhaps the scheduler, it manages to solve a few really hard
conceptual problems in a pretty critical piece of code that already got called a
couple of thousand times while eg. reading this article on a Linux box! :-)
JA: What is your background in programming prior to getting involved with Linux?
Ingo Molnar: well, like many others, i grew up on programming all possible (and
even some impossible) aspects of Commodore micro-computers, since age 11. Completely
knowing a greatly simplified but fully functional computer architecture helped a lot
in kernel development.
I think kids today have a harder time, since hardware vendors are much more tight-lipped
about computer internals, and the complexity of computer systems skyrocketed as
well. Linux perhaps helps here too, as a central 'documentation' and reference implementation
for "all computer internals that matter".
JA: Much of your work seems to be focused on improving the performance and scalability
of the 2.5 kernel. Is this the result of RedHat's product requirements, or your
own interests?
Ingo Molnar: well, i'm in the fortunate position that the two are a perfect match.
JA: Can you describe your development environment, including the hardware and software
tools you typically use?
Ingo Molnar: i use all the normal text-based kernel development tools: vim, gcc/make/etc.,
i use a serial line to a test-system to debug kernels, and that's all. I like it
simple when reading kernel code: i use text consoles (on an LCD screen) to do most
of my development work. Occasionally i drop into X for tools that make sense only
there, such as ethereal or some of the BK tools.
JA: Is it safe to assume you are working on a Pentium 4 Xeon, based on your recent
Hyper-Threading patch?
Ingo Molnar: well, the HT box is one of my test-systems. These days i'm working
on a 'boring' system, on which i almost never boot experimental kernel stuff. While
not quite mannish, i prefer this solution: having a safe system creates a certain
peace of mind.
JA: What's your impression of how the kernel has changed over the past seven years
that you've been involved?
Ingo Molnar: there's roughly one order of magnitude more code in the kernel - while
the 1.2-ish kernels were just 300 thousand lines of code, the 2.5.48 kernel is more
than 5.5 million lines of code. Even if some of this size increase is due to new
architectures and new drivers (which do not directly complicate kernel coding),
even the "core code" has increased roughly 5 times in size, which is considerable.
The 1.2.0 kernel supported 4 architectures, the 2.5.48 kernel supports 20 different
CPU architectures. The 1.2.0 kernel supported 11 filesystems, the 2.5.48 kernel
has native support for _50_ filesystems. So the kernel got considerably more complex
- but it also got more logical in many respects. More 'refined' would be the right
word i think. I really hope we are successful in keeping it simple (and well commented)
enough for new developers to understand.
compared to the situation years ago, it's roughly the same amount of work to get
a patch into the official kernel, which i think is very good - it's a tribute to
Linus. (his patch-integration and patch-steering work has increased an order of
magnitude as well. So during the years Linus not only had to care about the scalability
of Linux as a technology, but he also had to scale and form his own workload.)
at this point i think it's fair to mention BitKeeper, which, not being open-source
code, ruffled some feathers (mine included). While as an open-source purist one
can see the disadvantages of BK, i also have to note the kind of improvements it
brought. Patches are now getting into the Linux kernel in a more predictable way,
and also in a faster way (than say a year ago) - and this is clearly due to BK giving
Linus more flexibility. BK also gives a number of very useful tools when searching
for bugs or integrating code - eg. i can see which line was last modified by which
person, and i can navigate the changes in a quick and logical way. I have worked
with a number of source-control packages before (even with some of the 'big' closed-code
ones), but BK definitely tops them all. While from the 'big' source control packages
i had the impression that they are "the manager's best friend", BK is definitely
the "developer's best friend". Which, for a project like Linux, is the single most
important factor.
Eg. look at the following graphs:
http://kernelnewbies.org/status/Linux_Kernel_2_5_Progress.png
http://kernelnewbies.org/status/Linux_Kernel_2_5_Compounded_Progress.png
(the links are from: http://kt.zork.net/kernel-traffic/kt20021111_191.html#11 )
these show that Linux is a healthy software project: it has a quick and
steady merging rate, and only a low number of features are kept in limbo.
JA: You mention originally having issues with BitKeeper. Have these concerns been
addressed by Larry and the people at BitMover?
Ingo Molnar: they have been addressed mostly, yes - via existing features of BK.
Eg. there's now a commit mailing list for Linus' tree, which is important to keep
all the BK related metadata open.
JA: Are you now using BitKeeper yourself?
Ingo Molnar: actually i've been one of the first kernel developers to do a BK merge
with Linus (eg. when the scheduler was still in flux i had a scheduler BK-tree from
which Linus merged), but it's really the kind of activity that Linus does that fits
BK most. So i'm mostly using BK to look at changesets and to generate various kernel
trees automatically. Another, technical problem is that my development box(es) are
detached from the internet, so BK openlogging does not work.
JA: Do you expect that eventually BitKeeper will be replaced by an open source tool?
Ingo Molnar: it's definitely lots of work, and BK is really complex and well-refined.
Currently nothing comparable exists.
JA: Have you worked with any other open source kernels?
Ingo Molnar: not really. I occasionally take a look at FreeBSD - some things they
do right, some things they don't; in the areas i'm most interested in, the Linux kernel
is currently ahead both design-wise and implementation-wise. Finally we caught up
in the VM subsystem as well, with Andrea's big and important 2.4 rewrite, Rik's
great rmap code and Andrew's fantastic integration work. But what other answer would
one expect from a Linux kernel developer? :-)
JA: FreeBSD 5.0 is due to be released around December of this year, with some significant
changes to the kernel. Have you followed this development?
Ingo Molnar: not really. The thing i sometimes do is look at their code. Also,
when i search for past discussions regarding some specific topic, sometimes there's
a FreeBSD hit and then i read it. That's all i can tell. But i do wish their
kernel gets better just as much as the Linux kernel gets better, there needs to
be competition to drive both projects forwards. (the Windows kernel is closed up
enough so that it does not create any development stimulus for Linux (and vice versa).
Rarely do any Windows features get discussed.)
JA: What areas of the Linux kernel do you think still lag behind FreeBSD?
Ingo Molnar: there were two areas where i think we used to lag, the VM and the block
IO subsystem - both have been significantly reworked in 2.5. Whether the VM got
better than FreeBSD's remains to be seen (via actual use), but the Linux VM already
has features that FreeBSD does not have, eg. support for more than 4 GB RAM on x86
(here i guess i'm biased, i wrote much of that code). But FreeBSD's core VM logic
itself, ie. the state machine that decides what to throw out under memory pressure,
how to swap and how to do IO, is top-notch. I think with Andrew Morton's and Jens
Axboe's latest VM and IO work we are top-notch as well (with a few extras perhaps).
There's also an interesting VM project in the making, Arjan van de Ven's O(1) VM
code. [without doubt i do appear to have a soft spot for O(1) code :-) ] Rik van
Riel merged Arjan's code a couple of days ago. The code converts every important
VM algorithm (laundering, aging) to an O(1) algorithm while still keeping the fundamentals
- this is quite nontrivial for things like page aging. It's in essence the VM overhead
reduction work that Andrea Arcangeli has started in 2.4.10, brought to the extreme.
I have run Arjan's O(1) VM under high memory pressure, and it's really impressive
- kswapd (the central VM housekeeping kernel thread), which used to eat up lots
of CPU time under VM load, has almost vanished from the CPU usage chart.
I do have the impression that the Linux VM is close to a conceptual breakthrough
- with all the dots connected we now have something that is the next level of quality.
The 2.5 VM has merged all the seemingly conflicting VM branches that fought it out
in 2.4, and the many complex subsystems involved have suddenly started playing in
concert, producing something really nice.
JA: A much earlier version of the rmap code was originally in the 2.4 kernel, but
got ripped out. Do you feel it has improved enough that this won't happen again?
Ingo Molnar: this most definitely won't happen. We already rely on rmap for some
other features, so it's not just a matter of undoing one patch. Rmap is essential
to the new VM; without rmap the VM would be like a Ferrari with an old diesel motor
- it looks good but is pretty unusable.
the problem of rmap in 2.4 was simply its complexity, its relative youth as a project
and the relatively low number of people that tested it. So in 2.4 it would have been
quite a stretch to keep it in. But it was fair game for 2.5, and with Andrew's
simplification/robustization/speedup of Rik's rmap code it was very manageable.
JA: Would it be safe to say that 2.5 will outperform even a heavily performance
tuned 2.4?
Ingo Molnar: i'd expect it to - if it does not at least give comparable performance
for any given workload (with 2.5 tuned to that workload as well) then we have not
done a good job.
JA: What other major improvements have gone into 2.5, beyond the scheduler and VM
rewrites?
Ingo Molnar: the block IO rewrite, lots of VFS changes, a rework of the module code
and (plug) the new threading implementation. The block IO rewrite was long overdue
and that's the one i'm most happy about.
JA: Do you feel the changes are significant enough to call the next major kernel
3.0 instead of 2.6?
Ingo Molnar: well, i do think they are significant enough to be called 3.0 - on
the other hand it might not matter much whether it's called 2.6 or 3.0; after all,
what ordinary people know about is this new shiny Linux 9.0 release, right? ;)
JA: Looking into the future, what do you see in store for the next development kernel,
version 2.7?
Ingo Molnar: no idea, really. i don't think trying to look into the future bears
much fruit; the kernel needs to handle what is available here and today. Sometimes
we are lucky and create stuff that happens to work for years :-) Perhaps something
like OpenMosix would be nice to have in the kernel. Plus even better (native) support
for User Mode Linux. Things like this.
JA: Where do you intend to focus your attention after you're content with how the
O(1) scheduler is tuned?
Ingo Molnar: i have no idea. Threading and scalability i suspect is going to remain
an area of interest.
JA: Do you have any advice to offer those aspiring to become productive kernel developers?
Ingo Molnar: only the old mantra: to read the source and the mailing lists. And
take it easy - do what you like doing most.
JA: Thank you for taking time away from your coding to talk with me. I am awed by
all your accomplishments, and look forward to seeing where your kernel development
interests lead you in the future.

(c) 2002 KernelTrap