SMP infoletter #1

From: bri...@wintelcom.net (Alfred Perlstein)
Subject: SMP infoletter #1
Date: 1999/10/27
Message-ID: <7v6bqp$1s94$1@FreeBSD.csie.NCTU.edu.tw>
X-Deja-AN: 541114716
X-Trace: FreeBSD.csie.NCTU.edu.tw 941011609 61733 140.113.235.250 (27 Oct 1999 08:06:49 GMT)
Organization: NCTU CSIE FreeBSD Server
NNTP-Posting-Date: 27 Oct 1999 08:06:49 GMT
Newsgroups: mailing.freebsd.smp
X-Complaints-To: usenet@FreeBSD.csie.NCTU.edu.tw

Infoletter #1

This is the start of what I hope to be several informative documents
describing the current and ongoing state of SMP in FreeBSD.

The purpose is to avoid duplicate research of the current state of
FreeBSD's SMP behavior by those who haven't been following FreeBSD-SMP
since 'day one'.  It also points out some areas that are still
unclear to me.

This document was written on Tue Oct 26 1999 referencing the HEAD
branch of the code; things may have significantly changed since.

I also hope that this series helps to shed some light onto the low
level routines in the kernel such as trap and interrupt handling,
ASTs and scheduling.

Where possible direct pointers are given to source code to reduce
the amount of digging one must do to locate routines of interest.

It is also important to note that the document is the result of
the author's investigation into the code, and much appreciated help
from various members of the FreeBSD development team (Poul-Henning
Kamp (phk), Alan Cox (alc), Matt Dillon (dillon)) and Terry Lambert.
As I am not the writer of the code there may be missing or incorrect
information contained in this document.

Please email any corrections or comments to s...@freebsd.org and
please make sure I get a CC. (alf...@freebsd.org)

------------------------------------------------------------

The Big Giant Lock: (src/sys/i386/i386/mplock.s)

The current state of SMP in FreeBSD is by means of the Big Giant
Lock, (BGL).

The BGL is an exclusive counting semaphore, the lock may be
recursively acquired by a single CPU, from that point on other CPUs
will spin while waiting to acquire the lock.
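
As a rough illustration of the "exclusive counting semaphore" semantics
described above, here is a minimal user-space sketch using C11 atomics.
It is not the code in mplock.s: the names (struct biglock, biglock_try,
biglock_get, biglock_rel) and the two-field layout are hypothetical, the
CPU id is passed in by hand, and the interrupt routing done by the real
MPgetlock is left out entirely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NO_OWNER (-1)

/* Hypothetical user-space model of an exclusive counting lock. */
struct biglock {
    atomic_int owner;   /* CPU id of the holder, or NO_OWNER */
    unsigned   depth;   /* nesting count; only touched by the owner */
};

void biglock_init(struct biglock *l)
{
    atomic_init(&l->owner, NO_OWNER);
    l->depth = 0;
}

/* Try once; returns true if 'cpu' holds the lock afterwards. */
bool biglock_try(struct biglock *l, int cpu)
{
    int expected = NO_OWNER;

    if (atomic_load_explicit(&l->owner, memory_order_relaxed) == cpu) {
        l->depth++;             /* recursive acquire by the current owner */
        return true;
    }
    if (atomic_compare_exchange_strong_explicit(&l->owner, &expected, cpu,
            memory_order_acquire, memory_order_relaxed)) {
        l->depth = 1;           /* first acquire */
        return true;
    }
    return false;               /* held by another CPU */
}

/* Spin until acquired, as the other CPUs do in the description above. */
void biglock_get(struct biglock *l, int cpu)
{
    while (!biglock_try(l, cpu))
        ;                       /* busy-wait */
}

/* Drop one nesting level; the lock is freed when the count reaches zero. */
void biglock_rel(struct biglock *l, int cpu)
{
    assert(atomic_load_explicit(&l->owner, memory_order_relaxed) == cpu);
    if (--l->depth == 0)
        atomic_store_explicit(&l->owner, NO_OWNER, memory_order_release);
}
```

In this model a second biglock_try() by the owning CPU succeeds and only
bumps the count, while a try from any other CPU fails until the owner
has released every level it took.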

The implementation on i386 is contained in the file
src/sys/i386/i386/mplock.s

The function 'void MPgetlock(unsigned int *lock)' acquires the BGL.

An important side effect of MPgetlock is that it routes all interrupts
to the processor that has acquired the lock.  This is done so that
if an interrupt occurs the handler doesn't need to spin waiting for
the BGL.

The code that is responsible for routing the interrupts is the GRAB_HWI
macro within the MPgetlock code, which fiddles the local APIC's
interrupt priority level.

Other MPlock functions exist in mplock.s to initialize, test and
release the lock.

---

Usage of the BGL: (src/sys/i386/i386/mplock.s)

The BGL is pushed down (acquired) on all entry into the kernel, by
means of syscall, trap or interrupt.

The file src/sys/i386/i386/exception.s contains all the initial
entry points for syscalls, traps and interrupts.

syscalls and 'altsyscalls' acquire the lock through the macros
SYSCALL_LOCK and ALTSYSCALL_LOCK, which map to the assembler
functions _get_syscall_lock and _get_altsyscall_lock on SMP machines
(if SMP is not defined they are not called).

_get_syscall_lock and _get_altsyscall_lock are also present in
src/sys/i386/i386/mplock.s; they save the contents of the local
APIC's interrupt priority and call MPgetlock.

It would seem that the syscall lock could simply be delayed until
entry to the actual system call (write/read/...) however several
issues arise:

1) fault on copyin of user's syscall arguments

This is actually a non-issue, if a fault occurs the processor will
spin to acquire the MPlock, before potentially recursing into the
non-re-entrant vm system.  Although this leaves the processor in
a faulted state for quite some time, it is no different than when
CPU 1 has the lock and a process running on CPU 2 page faults.

Problem #1 takes care of itself because of the recursive MPlock.

2) ktrace hooks

src/sys/kern/kern_ktrace.c

The ktrace hooks in the syscalls manipulate kernel resources that
are not MP safe; ktrace touches many parts of the kernel that need
work to become MP safe.  A temporary solution would be to raise the
BGL when entering the ktrace code.

3) STOPEVENT aka void stopevent(struct proc*, unsigned int, unsigned int);

src/sys/kern/sys_process.c

stopevent will be called if the process is marked to sleep via
procfs; stopping the process requires entry into the scheduler,
which is not MP safe.

Again, a temporary hack would be to conditionally set the MPlock if
the condition exists.

---

SPL issues:    (src/sys/i386/isa/ipl_funcs.c)

There exists an inherent race condition with the spl() system in
a MP environment, consider:

  system is at splbio:

  process A          process B

  int s;             int s;
  s = splhigh();                            /* spl raised to high however, 
                                               saved spl 's' has old value
                                               of splbio */
                     s = splhigh();         /* spl still high */
  splx(s);                                  /* processor spl now at bio
                                               even though B still needs
                                               splhigh */
                     splx(s);

Process B may be interrupted in a critical section.
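
The interleaving in the diagram can be replayed deterministically.  The
sketch below is only a model: it assumes a single priority level shared
by all CPUs, uses three made-up numeric levels instead of the real
interrupt masks, and the names model_splhigh, model_splx and replay_race
are hypothetical.

```c
#include <assert.h>

enum { SPL0 = 0, SPLBIO = 1, SPLHIGH = 2 };   /* illustrative ordering only */

static int cpl = SPLBIO;    /* one priority level shared by every CPU */

/* Raise to SPLHIGH and return the previous level, like splhigh(). */
int model_splhigh(void)
{
    int old = cpl;

    cpl = SPLHIGH;
    return old;
}

/* Restore a previously saved level, like splx(). */
void model_splx(int s)
{
    cpl = s;
}

/*
 * Replay the interleaving from the diagram and return the level that is
 * in force while B is still inside its critical section.
 */
int replay_race(void)
{
    int sa, sb, level_seen_by_b;

    cpl = SPLBIO;
    sa = model_splhigh();       /* A: saves SPLBIO, cpl becomes SPLHIGH */
    sb = model_splhigh();       /* B: saves SPLHIGH, cpl stays SPLHIGH */
    model_splx(sa);             /* A: cpl drops back to SPLBIO */
    level_seen_by_b = cpl;      /* B still believes it is at SPLHIGH */
    model_splx(sb);             /* B: restores the SPLHIGH it saved */
    return level_seen_by_b;
}
```

In this model replay_race() returns SPLBIO, i.e. B's critical section is
no longer protected at splhigh.  The model also ends with cpl left at
SPLHIGH, because B restores the value it saved while A was still raised.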

Also note that the asymmetric nature of the spl system makes it
very difficult to pinpoint locations in the bottom half
of the kernel (the part that services interrupts) that may collide
with the top half (user process context).

A short-sighted solution would be to enforce spl as an MPlock, an
exclusive counting semaphore; however, since no locking protocol or
ordering of spl pushdown is required, deadlock becomes a major
problem.

The only solution that may work with spl, is adding the pushdown
of the BGL when first asserting any level of spl and releasing the
MPlock when spl0 is reached.

It may also be interesting to see what a separate lock based only
on spl would accomplish, moving to a model where the spl entry
points become our new BGL might also be something to investigate.

Since spl is used only for short time mutual exclusion it may
actually work nicely as a coarse-grained locking system for the
time being.

---

Simple locks:   (src/sys/i386/i386/simplelock.s)

cursory research into the CVS logs reveals:

on the file kern/vfs_syscalls.c:

   1.28 Thu Jul 13 8:47:42 1995 UTC by davidg 
   Diffs to 1.27 

   NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
         proc or any VM system structure will have to be rebuilt!!!

   Much needed overhaul of the VM system. Included in this first round of
   changes:
 ...
   4) simple_lock's removed. Discussion with several people reveals that the
      SMP locking primitives used in the VM system aren't likely the mechanism
      that we'll be adopting. Even if it were, the locking that was in the code
      was very inadequate and would have to be mostly re-done anyway. The
      locking in a uni-processor kernel was a no-op but went a long way toward
      making the code difficult to read and debug.

However with the Lite/2 merge they were re-introduced and the kernel
is littered with them; the ones in place seem somewhat adequate
for short term exclusion.  Essentially they are spinlocks.

What's interesting is that the simplelocks seem to provide for MP
sync with lockmgr locks, however the code is littered with calls
to unsafe functions such as MALLOC.

It looks like someone decided to do the hard stuff first.

Why are the simplelocks necessary if the kernel is still guarded
by the BGL?  (besides use in the lockmgr)

---

Scheduler:

The scheduler in cpu_switch() (src/sys/i386/i386/swtch.s) saves the
current nesting level of the process's MPlock (after masking off
the CPUid bits from it) into the PCB (process control block) (lines
317-324) before attempting to switch to another process where it
restores the next process's nesting level (lines 453-455).
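
The save and restore step described above can be sketched in C.  The bit
layout used here (CPU id in the top 8 bits of the lock word, nesting
count in the low 24) is an assumption made for illustration, as are the
names pcb_model, save_nesting and restore_nesting; the real code is
assembler in swtch.s.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: CPU id in the top 8 bits, nesting count below it. */
#define CPU_FIELD    0xff000000u
#define COUNT_FIELD  0x00ffffffu

/* Stand-in for the one PCB field this sketch cares about. */
struct pcb_model {
    uint32_t mpnest;    /* saved nesting count, CPU bits stripped */
};

/* Outgoing process: keep only the nesting count from the lock word. */
void save_nesting(struct pcb_model *out, uint32_t lockword)
{
    out->mpnest = lockword & COUNT_FIELD;
}

/* Incoming process: rebuild the lock word for the CPU it now runs on. */
uint32_t restore_nesting(const struct pcb_model *in, uint32_t cpu)
{
    assert((in->mpnest & CPU_FIELD) == 0);
    return (cpu << 24) | in->mpnest;
}
```

Stripping the CPU bits on save is what lets a process that was switched
out on one CPU resume on another with the same nesting depth.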

---

-Alfred Perlstein - [bri...@rush.net|alf...@freebsd.org]
Wintelcom systems administrator and programmer
   - http://www.wintelcom.net/ [bri...@wintelcom.net]

To Unsubscribe: send mail to majord...@FreeBSD.org
with "unsubscribe freebsd-smp" in the body of the message

From: lu...@watermarkgroup.com (Luoqi Chen)
Subject: Re:  SMP infoletter #1
Date: 1999/10/27
Message-ID: <7v767e$2p0t$1@FreeBSD.csie.NCTU.edu.tw>
X-Deja-AN: 541236506
X-Trace: FreeBSD.csie.NCTU.edu.tw 941038638 91167 140.113.235.250 (27 Oct 1999 15:37:18 GMT)
Organization: NCTU CSIE FreeBSD Server
NNTP-Posting-Date: 27 Oct 1999 15:37:18 GMT
Newsgroups: mailing.freebsd.smp
X-Complaints-To: usenet@FreeBSD.csie.NCTU.edu.tw

I would like to offer some comments here.
> 
> Usage of the BGL: (src/sys/i386/i386/mplock.s)
> 
> The BGL is pushed down (acquired) on all entry into the kernel, by
             ^^^^^^^^^^^
My understanding is push down is a completely different term.

> means of syscall, trap or interrupt.
...
> It would seem that the syscall lock could simply be delayed until
> entry to the actual system call (write/read/...) however several
> issues arise:
> 
> 1) fault on copyin of user's syscall arguments
...
> 
> Problem #1 takes care of itself because of the recursive MPlock.
> 
> 2) ktrace hooks
...
> 
> 3) STOPEVENT aka void stopevent(struct proc*, unsigned int, unsigned int);
...

You missed one very important part of the code path, which is also the most
difficult one to be dealt with to make the path MP safe: userret(),
it involves scheduling (relatively easier) and signal delivery (difficult).

> 
> ---
> 
> SPL issues:    (src/sys/i386/isa/ipl_funcs.c)
> 
> There exists an inherent race condition with the spl() system in
> a MP environment, consider:
> 
>   system is at splbio:
> 
>   process A          process B
> 
>   int s;             int s;
>   s = splhigh();                            /* spl raised to high however, 
>                                                saved spl 's' has old value
>                                                of splbio */
>                      s = splhigh();         /* spl still high */
>   splx(s);                                  /* processor spl now at bio
>                                                even though B still needs
>                                                splhigh */
>                      splx(s);
> 
> 
> Process B may be interrupted in a critical section.
> 
> Also note that the asymmetric nature of the spl system makes it
> very difficult to pinpoint locations in the bottom half
> of the kernel (the part that services interrupts) that may collide
> with the top half (user process context).
> 
> A short sighted solution would be to enforce spl as an MPlock, an
> exclusive counting semaphore, however since no locking protocol or
> ordering of spl pushdown is required deadlock becomes a major
> problem.
> 
> The only solution that may work with spl, is adding the pushdown
> of the BGL when first asserting any level of spl and releasing the
> MPlock when spl0 is reached.
> 
> It may also be interesting to see what a separate lock based only
> on spl would accomplish, moving to a model where the spl entry
> points become our new BGL might also be something to investigate.
> 
I actually have a working implementation for this (I'm willing to provide
the patch if anyone is interested to try). But I believe this leads to a
dead-end. We should use some kind of interrupt level aware mutex instead,
something like this
	s = splimp();
	simple_lock(&mbuf_lock);
	...
	simple_unlock(&mbuf_lock);
	splx(s);

If everyone agrees on this direction, there is an immediate benefit we can
reap by moving cpl to per-cpu storage and getting rid of cpl_lock, which
might reduce a significant amount of system time (5~10% unscientifically).
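
One way to picture the combination suggested above (a simple lock paired
with an interrupt level, with cpl held per CPU) is the following
user-space sketch.  Everything in it is hypothetical: struct ilmutex,
the LVL_* levels, the cpl_percpu array and the explicit cpu argument are
stand-ins, and no real interrupts are masked.

```c
#include <assert.h>
#include <stdatomic.h>

#define NCPU_MODEL 2

enum { LVL_NONE = 0, LVL_IMP = 1, LVL_HIGH = 2 };   /* illustrative levels */

static int cpl_percpu[NCPU_MODEL];  /* each CPU changes only its own level */

struct ilmutex {
    atomic_flag busy;   /* the spin lock proper */
    int level;          /* interrupt level to hold while locked */
    int saved;          /* holder's previous level, valid while locked */
};

void ilmutex_init(struct ilmutex *m, int level)
{
    atomic_flag_clear(&m->busy);
    m->level = level;
    m->saved = LVL_NONE;
}

void ilmutex_lock(struct ilmutex *m, int cpu)
{
    int old = cpl_percpu[cpu];

    /* Raise the local level first so a local interrupt cannot re-enter. */
    if (cpl_percpu[cpu] < m->level)
        cpl_percpu[cpu] = m->level;
    while (atomic_flag_test_and_set_explicit(&m->busy, memory_order_acquire))
        ;                               /* spin against other CPUs */
    m->saved = old;                     /* safe: we now own the mutex */
}

void ilmutex_unlock(struct ilmutex *m, int cpu)
{
    int old = m->saved;

    atomic_flag_clear_explicit(&m->busy, memory_order_release);
    cpl_percpu[cpu] = old;              /* restore only this CPU's level */
}
```

Because the level is raised and restored on the local CPU only, no
global cpl_lock is needed in this model; exclusion between CPUs comes
from the spin lock alone.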

> Since spl is used only for short time mutual exclusion it may
> actually work nicely as a coarse-grained locking system for the
> time being.
> 
The reason I believe this is leading us to nowhere is that it is a hack and
it could only marginally improve the performance as most of the kernel is
running under some spl protection, e.g., it's impossible to move tcp stack
outside the BGL under this scheme.
> 
> Why are the simplelocks necessary if the kernel is still guarded
> by the BGL?  (besides use in the lockmgr)
> 
Under BGL it's not even necessary in the lockmgr, in fact, the only useful
simplelocks are the fast interrupt lock and the lock on i/o apic register
window (fast interrupt handlers are not under BGL protection, IIRC, the
only instance is sio). But BGL is to go and hence simplelock is here to stay.

One interesting thing NetBSD has done is a read/write spinlock; it could be
used to protect lists like allproc. I think it would be nice to have in our
system too.
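
For readers unfamiliar with the idea, a read/write spinlock can be
sketched with a single atomic counter.  This is a generic illustration
using C11 atomics, not NetBSD's implementation; the rwspin_* names and
the encoding of the state word are made up here, and no attempt is made
to keep writers from starving.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* state > 0: number of readers; state == -1: one writer; 0: free */
struct rwspin {
    atomic_int state;
};

void rwspin_init(struct rwspin *l)
{
    atomic_init(&l->state, 0);
}

/* Join the readers unless a writer holds the lock. */
bool rwspin_try_read(struct rwspin *l)
{
    int s = atomic_load_explicit(&l->state, memory_order_relaxed);

    while (s >= 0) {
        if (atomic_compare_exchange_weak_explicit(&l->state, &s, s + 1,
                memory_order_acquire, memory_order_relaxed))
            return true;
        /* s was reloaded by the failed exchange; retry */
    }
    return false;
}

/* Become the writer only if nobody holds the lock at all. */
bool rwspin_try_write(struct rwspin *l)
{
    int expected = 0;

    return atomic_compare_exchange_strong_explicit(&l->state, &expected, -1,
        memory_order_acquire, memory_order_relaxed);
}

void rwspin_read_lock(struct rwspin *l)
{
    while (!rwspin_try_read(l))
        ;                       /* spin while a writer is active */
}

void rwspin_write_lock(struct rwspin *l)
{
    while (!rwspin_try_write(l))
        ;                       /* spin until readers and writers are gone */
}

void rwspin_read_unlock(struct rwspin *l)
{
    atomic_fetch_sub_explicit(&l->state, 1, memory_order_release);
}

void rwspin_write_unlock(struct rwspin *l)
{
    atomic_store_explicit(&l->state, 0, memory_order_release);
}
```

A list such as allproc is walked far more often than it is modified, so
letting several CPUs hold the read side at once is where the gain would
come from.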

> ---
> 
> Scheduler:
> 
> The scheduler in cpu_switch() (src/sys/i386/i386/swtch.s) saves the
> current nesting level of the process's MPlock (after masking off
> the CPUid bits from it) into the PCB (process control block) (lines
> 317-324) before attempting to switch to another process where it
> restores the next process's nesting level (lines 453-455).
> 
One thing that is relatively easy to do in this area is to allow a processor
that is spinning while waiting for the BGL to pick up another user process.
I'm currently looking into this problem myself; one thing I would like to do
is to move the nesting level field from the u area to struct proc, so that we
could easily tell if a process was (involuntarily) context switched in user
mode and is a candidate to schedule on a non-lock holding processor.

-lq


From: Charles Randall <crand...@matchlogic.com>
Subject: Big Giant Lock progress?
Date: 1999/11/17
Message-ID: <64003B21ECCAD11185C500805F31EC0304621F00@houston.matchlogic.com>#1/1
Approved: n...@news1.mpcs.com
Sender: n...@news1.mpcs.com
Content-Type: text/plain; charset="iso-8859-1"
MIME-Version: 1.0
Newsgroups: mpc.lists.freebsd.smp,muc.lists.freebsd.smp

Where can I find an update on the progress of slowly chipping away at the
BGL?

I saw Alfred's "SMP Infoletter #1". Is that the latest info?

-Charles

From: Matthew Dillon <dil...@apollo.backplane.com>
Subject: Re: Big Giant Lock progress?
Date: 1999/11/17
Message-ID: <199911171851.KAA65437@apollo.backplane.com>#1/1
Approved: n...@news1.mpcs.com
Sender: n...@news1.mpcs.com
References: <64003B21ECCAD11185C500805F31EC0304621F00@houston.matchlogic.com>
Newsgroups: mpc.lists.freebsd.smp,muc.lists.freebsd.smp

:Where can I find an update on the progress of slowly chipping away at the
:BGL?
:
:I saw Alfred's "SMP Infoletter #1". Is that the latest info?
:
:-Charles

    Alfred got caught up in real life work so it's been on hold for a while.

					-Matt
					Matthew Dillon 
					<dil...@backplane.com>

From: Charles Randall <crand...@matchlogic.com>
Subject: RE: Big Giant Lock progress?
Date: 1999/11/18
Message-ID: <64003B21ECCAD11185C500805F31EC0304621F4F@houston.matchlogic.com>#1/1
Approved: n...@news1.mpcs.com
Sender: n...@news1.mpcs.com
Content-Type: text/plain; charset="iso-8859-1"
MIME-Version: 1.0
Newsgroups: mpc.lists.freebsd.smp,muc.lists.freebsd.smp

From: Matthew Dillon [mailto:dil...@apollo.backplane.com]
> Alfred got caught up in real life work so it's been on hold for a while.

Has anyone profiled an SMP kernel in a standard role (Web server, NFS
server, development machine, etc) and compared the points of BGL contention
with the ease (or difficulty) of more fine-grained locking in those areas?

In other words, have the "bang for the buck" areas been identified?

Charles

From: Matthew Dillon <dil...@apollo.backplane.com>
Subject: Re: RE: Big Giant Lock progress?
Date: 1999/11/18
Message-ID: <199911180828.AAA79093@apollo.backplane.com>#1/1
Approved: n...@news1.mpcs.com
Sender: n...@news1.mpcs.com
References: <64003B21ECCAD11185C500805F31EC0304621F4F@houston.matchlogic.com>
Newsgroups: mpc.lists.freebsd.smp,muc.lists.freebsd.smp

:From: Matthew Dillon [mailto:dil...@apollo.backplane.com]
:> Alfred got caught up in real life work so it's been on hold for a while.
:
:Has anyone profiled an SMP kernel in a standard role (Web server, NFS
:server, development machine, etc) and compared the points of BGL contention
:with the ease (or difficulty) of more fine-grained locking in those areas?
:
:In other words, have the "bang for the buck" areas been identified?
:
:Charles

    There are three major areas of interest:

	* parallelizing within the network stack

	* parallelizing network interrupts and the 
	  network stack

	* parallelizing the cached read/write data
	  path, so the supervisor can copy data
	  to user processes on several cpu's at
	  once.

    A whole lot of groundwork needs to happen before
    we can do any of this stuff, though.  A previous
    attempt at optimizing just #3 in uiomove did not
    produce very good results, mainly owing to the
    bgl being held too long in other places.

    There are also some neat optimizations that 
    can be done, especially with the simplelocks.
    For example, when unlocking a simplelock you do
    not need to use a locked instruction or even
    a cmpxchg instruction, because you already own
    the lock and nobody else can mess with it.
    Nor do you need to use a locked assembly instruction
    when bumping the ref count on a simplelock you
    already hold.  I think I am going to commit those even 
    without Alfred's work, once I separate them out
    and have a little time, because they at least double 
    the speed of the simplelocks.

					-Matt
					Matthew Dillon 
					<dil...@backplane.com>

From: Alfred Perlstein <bri...@wintelcom.net>
Subject: Re: RE: Big Giant Lock progress?
Date: 1999/11/18
Message-ID: <Pine.BSF.4.05.9911180217540.12797-100000@fw.wintelcom.net>#1/1
Approved: n...@news1.mpcs.com
Sender: n...@news1.mpcs.com
References: <199911180828.AAA79093@apollo.backplane.com>
Content-Type: TEXT/PLAIN; charset=US-ASCII
MIME-Version: 1.0
Newsgroups: mpc.lists.freebsd.smp,muc.lists.freebsd.smp

On Thu, 18 Nov 1999, Matthew Dillon wrote:

> :From: Matthew Dillon [mailto:dil...@apollo.backplane.com]
> :> Alfred got caught up in real life work so it's been on hold for a while.
> :
> :Has anyone profiled an SMP kernel in a standard role (Web server, NFS
> :server, development machine, etc) and compared the points of BGL contention
> :with the ease (or difficulty) of more fine-grained locking in those areas?
> :
> :In other words, have the "bang for the buck" areas been identified?
> :
> :Charles
> 
>     There are three major areas of interest:
> 
> 	* parallelizing within the network stack
> 
> 	* parallelizing network interrupts and the 
> 	  network stack
> 
> 	* parallelizing the cached read/write data
> 	  path, so the supervisor can copy data
> 	  to user processes on several cpu's at
> 	  once.

A fourth area that I'm most interested in as a base for the
other work is the low level routines that require
locking that's not visible; the big example is malloc, which
splhigh()s while in use.

Although my coding hands have been busy it doesn't stop me
from thinking about this stuff in my sleep. :)

I've been thinking of something along the line of per-processor
pools of resources with high and low watermarks to determine when
to borrow/return from a global pool.  This ought to reduce malloc
contention quite a bit.  Since a CPU never has to worry about any
other CPU touching its memory pools it doesn't need to lock anything
unless the private pool is exhausted.  This is discussed in Vahalia's
book in re the Dynix allocator.
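
A minimal sketch of the per-processor pool idea follows.  It is a
single-threaded model, not the Dynix allocator and not a proposed
malloc: the watermark values, the batch size, and all names (pool_alloc,
pool_free, pool_seed, global_trips) are hypothetical.  The counter
global_trips stands in for the number of times the global pool, and
hence a lock, would have to be touched.

```c
#include <assert.h>
#include <stddef.h>

#define NCPU_MODEL  2
#define LOW_WATER   0   /* refill when the private pool is empty */
#define HIGH_WATER  8   /* return surplus above this many items */
#define BATCH       4   /* items moved per trip to the global pool */

struct item { struct item *next; };

struct pool {
    struct item *head;
    int count;
};

static struct pool global_pool;             /* would need a lock */
static struct pool cpu_pool[NCPU_MODEL];    /* private to one CPU each */
static int global_trips;                    /* times the lock would be taken */

static void push(struct pool *p, struct item *it)
{
    it->next = p->head;
    p->head = it;
    p->count++;
}

static struct item *pop(struct pool *p)
{
    struct item *it = p->head;

    if (it != NULL) {
        p->head = it->next;
        p->count--;
    }
    return it;
}

/* Move up to n items between pools; counts as one trip under the lock. */
static void transfer(struct pool *from, struct pool *to, int n)
{
    struct item *it;

    global_trips++;
    while (n-- > 0 && (it = pop(from)) != NULL)
        push(to, it);
}

struct item *pool_alloc(int cpu)
{
    struct pool *p = &cpu_pool[cpu];

    if (p->count <= LOW_WATER)
        transfer(&global_pool, p, BATCH);   /* borrow a batch */
    return pop(p);                          /* NULL if everything is empty */
}

void pool_free(int cpu, struct item *it)
{
    struct pool *p = &cpu_pool[cpu];

    push(p, it);
    if (p->count > HIGH_WATER)
        transfer(p, &global_pool, BATCH);   /* give a batch back */
}

/* Seed the global pool from caller-provided storage. */
void pool_seed(struct item *items, int n)
{
    for (int i = 0; i < n; i++)
        push(&global_pool, &items[i]);
}
```

In this model three allocations followed by three frees on one CPU cost
a single trip to the global pool; the rest are served from the private
pool with no locking at all.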

I've also spoken to Alan Cox (from Linux) and he's explained that
the way Linux deals with malloc from device drivers (*) is that
there is an 'atomic' memory pool from which drivers grab
memory.

(*) the problem with interrupts, for those just joining the discussion,
is that they may cause a recursive attempt on a lock.  With
the bgl it's ok (exclusive counting semaphore) but with plain
spinlocks it leads to deadlock if the CPU holding the lock (on
let's say the malloc pool) is interrupted and the interrupt then
tries to spin on the lock already held.

I like this idea quite a bit (back to atomic memory pools), combined
with a per-cpu pool of memory that can be grabbed in an atomic
fashion we can reduce a major contention problem as well as interrupt
allocations.

The problem is possible premature out-of-memory situations, but
I'm confident that tuning the high/low watermarks for allocations
and atomic pools can make that a rare occurrence.

Fifth:
It's also very important that the scheduler becomes MP safe.

>     A whole lot of groundwork needs to happen before
>     we can do any of this stuff, though.  A previous
>     attempt at optimizing just #3 in uiomove did not
>     produce very good results, mainly owing to the
>     bgl being held too long in other places.

I wasn't around when this was attempted, did the code only
touch the BGL when the amount to copy was greater than let's
say 2k?  Or was the bgl toggled on every uiomove?

>     There are also some neat optimizations that 
>     can be done, especially with the simplelocks.
>     For example, when unlocking a simplelock you do
>     not need to use a locked instruction or even
>     a cmpxchg instruction, because you already own
>     the lock and nobody else can mess with it.
>     Nor do you need to use a locked assembly instruction
>     when bumping the ref count on a simplelock you
>     already hold.  I think I am going to commit those even 
>     without Alfred's work, once I separate them out
>     and have a little time, because they at least double 
>     the speed of the simplelocks.

It'd be great to get that code committed asap, it's really a keen
observation and the benefits are immediate and unobtrusive.

-Alfred

> 
> 					-Matt
> 					Matthew Dillon 
> 					<dil...@backplane.com>

From: Matthew Dillon <dil...@apollo.backplane.com>
Subject: Re: RE: Big Giant Lock progress?
Date: 1999/11/18
Message-ID: <199911181643.IAA85662@apollo.backplane.com>#1/1
Approved: n...@news1.mpcs.com
Sender: n...@news1.mpcs.com
References: <Pine.BSF.4.05.9911180217540.12797-100000@fw.wintelcom.net>
Newsgroups: mpc.lists.freebsd.smp,muc.lists.freebsd.smp

:Fifth:
:It's also very important that the scheduler becomes MP safe.

    I forgot that one.  Yes, absolutely.

:>     we can do any of this stuff, though.  A previous
:>     attempt at optimizing just #3 in uiomove did not
:...
:
:I wasn't around when this was attempted, did the code only
:touch the BGL when the amount to copy was greater than let's
:say 2k?  Or was the bgl toggled on every uiomove?

    BDE tried his hand at this and spent a few minutes working
    up a simple patch that essentially turned off the bgl
    during the uiomove and then turned it back on again.  I
    messed around with it for a while and ran a bunch of tests
    and just didn't get the expected performance improvement.
    I think there was a lock inversion problem too but I'm not
    sure.

    I concluded that the problem was that there were too many
    other places in the code path that held the BGL and turning
    it off in that one place was not sufficient.
:
:It'd be great to get that code committed asap, it's really a keen
:observation and the benefits are immediate and unobtrusive.
:
:-Alfred

    Ok, I will.  I have put the adjusted patch up for a final
    review at:

	http://www.backplane.com/FreeBSD4/

	in the second section 'SMP PatchSet ....', the file is:

	http://www.backplane.com/FreeBSD4/smp-patch-02.diff

    I've been running the patch for several weeks without
    any problem.

    I will commit it tonight if nobody sees any problems.  
    Essentially what this code does is two things:  First,
    it gets rid of totally unnecessary argument pushes onto
    the stack for lock-related assembly that is only called by
    other assembly and already a NON-GPROF entry.  Second, it
    optimizes the two cases that do not require a locked
    instruction sequence or cmpxchg by making them not use a
    locked instruction or cmpxchg:

	* locking when you already own the lock (e.g. recursion)
	* unlocking a lock

    That's it!  Optimizing this path makes the MP locking
    routines even more efficient than the SMP spl*() code in
    the case where lock recursion occurs.
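
The two cheap cases listed above can be illustrated in C11 terms.  This
is a sketch of the idea only, not the patch itself: the lock word layout
(CPU id in the top 8 bits, count below), the names mp_try and mp_rel,
and the locked_ops counter are assumptions made for the example.  The
counter is bumped wherever an atomic read-modify-write (a locked cmpxchg
on i386) is actually needed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define FREE_WORD   0xffffffffu     /* assumed "unowned" value */
#define COUNT_MASK  0x00ffffffu     /* assumed: low bits hold the count */

static _Atomic uint32_t lockword = FREE_WORD;
static int locked_ops;              /* atomic read-modify-writes performed */

/* Try to take the lock for 'cpu' (0..254); true on success. */
bool mp_try(uint32_t cpu)
{
    uint32_t cur = atomic_load_explicit(&lockword, memory_order_relaxed);
    uint32_t expected = FREE_WORD;

    if (cur != FREE_WORD && (cur >> 24) == cpu) {
        /* Case 1: we already own it, so nobody else may write the word. */
        atomic_store_explicit(&lockword, cur + 1, memory_order_relaxed);
        return true;
    }
    /* Only a fresh acquisition needs the locked compare-and-exchange. */
    locked_ops++;
    return atomic_compare_exchange_strong_explicit(&lockword, &expected,
        (cpu << 24) | 1u, memory_order_acquire, memory_order_relaxed);
}

/* Case 2: drop one level; the owner can use an ordinary store. */
void mp_rel(void)
{
    uint32_t cur = atomic_load_explicit(&lockword, memory_order_relaxed);

    assert(cur != FREE_WORD);
    if ((cur & COUNT_MASK) == 1u)
        atomic_store_explicit(&lockword, FREE_WORD, memory_order_release);
    else
        atomic_store_explicit(&lockword, cur - 1, memory_order_relaxed);
}
```

On x86 a release store compiles to a plain move, so in this model
neither a recursive acquire nor any release adds to locked_ops; only a
first acquisition or a failed attempt by another CPU does.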

    Since one of the things we will have to do soon is start
    encapsulating many blocks of code with the BGL in order to be
    able to start de-encapsulating it from the top-down, these
    optimizations will allow us to retain good SMP performance 
    through the work plus have the added benefit of reducing the
    base lock/unlock overhead by a factor of 2 (the unlock becomes
    very cheap).

					-Matt
					Matthew Dillon 
					<dil...@backplane.com>