Scalable Scheduling

Path: archiver1.google.com!newsfeed.google.com!newsfeed.stanford.edu!newsfeeds.belnet.be!news.belnet.be!opentransit.net!news.tele.dk!small.news.tele.dk!129.240.148.23!uio.no!nntp.uio.no!ifi.uio.no!internet-mailinglist
Newsgroups: fa.linux.kernel
Return-Path: <linux-kernel-ow...@vger.kernel.org>
Original-Date: 	Wed, 8 Aug 2001 09:16:52 -0700
From: Mike Kravetz <mkrav...@sequent.com>
To: linux-ker...@vger.kernel.org, torva...@transmeta.com
Subject: [RFC][PATCH] Scalable Scheduling
Original-Message-ID: <20010808091652.B1088@w-mikek2.des.beaverton.ibm.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
X-Mailing-List: 	linux-kernel@vger.kernel.org
Organization: Internet mailing list
Date: Wed, 8 Aug 2001 16:23:52 GMT
Message-ID: <fa.lbpmt8v.73kc3m@ifi.uio.no>
Lines: 1551

I have been working on scheduler scalability.  Specifically,
the concern is running Linux on bigger machines (higher CPU
count, SMP only for now).  

I am aware of most of the objections to making scheduler
changes.  However, I believe the patch below addresses a
number of these objections.

This patch implements a multi-queue (one runqueue per CPU)
scheduler.  Unlike most other multi-queue schedulers that
rely on complicated load balancing schemes, this scheduler
attempts to make global scheduling decisions and emulate
the behavior of the current SMP scheduler.

Performance at the 'low end' (low CPU and thread count)
is comparable to that of the current scheduler.  As the
number of CPUs or threads is increased, performance is
much improved over the current scheduler.  For a more
detailed description as well as benchmark results, please
see: http://lse.sourceforge.net/scheduling/
(OLS paper section).

I would like to get some input as to whether this is an
appropriate direction to take in addressing scalability
limits with the current scheduler.  The general consensus
is that the default scheduler in the kernel should work
well for most cases.  In my opinion, the attached scheduler
implementation accomplishes this by scaling with the
number of CPUs in the system.

Comments/Suggestions/Flames welcome,
-- 
Mike Kravetz                                 mkrav...@sequent.com

Patch

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Path: archiver1.google.com!newsfeed.google.com!newsfeed.stanford.edu!news.tele.dk!small.news.tele.dk!129.240.148.23!uio.no!nntp.uio.no!ifi.uio.no!internet-mailinglist
Newsgroups: fa.linux.kernel
Original-Date: 	Wed, 8 Aug 2001 09:40:07 -0700 (PDT)
From: Linus Torvalds <torva...@transmeta.com>
To: Mike Kravetz <mkrav...@sequent.com>
cc: <linux-ker...@vger.kernel.org>
Subject: Re: [RFC][PATCH] Scalable Scheduling
In-Reply-To: <20010808091652.B1088@w-mikek2.des.beaverton.ibm.com>
Original-Message-ID: <Pine.LNX.4.33.0108080929170.1530-100000@penguin.transmeta.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
Organization: Internet mailing list
Date: Wed, 8 Aug 2001 16:46:28 GMT
Message-ID: <fa.o8qpd7v.h4e5rd@ifi.uio.no>
References: <fa.lbpmt8v.73kc3m@ifi.uio.no>
Lines: 69

On Wed, 8 Aug 2001, Mike Kravetz wrote:
>
> I have been working on scheduler scalability.  Specifically,
> the concern is running Linux on bigger machines (higher CPU
> count, SMP only for now).

Note that there is no way I will ever apply this particular patch for a
very simple reason: #ifdef's in code.

Why do you have things like

	#ifdef CONFIG_SMP
		.. use nr_running() ..
	#else
		.. use nr_running ..
	#endif

and

	#ifdef CONFIG_SMP
	       list_add(&p->run_list, &runqueue(task_to_runqueue(p)));
	#else
	       list_add(&p->run_list, &runqueue_head);
	#endif

when it just shows that you did NOT properly abstract your thinking to
realize that the non-SMP case should be the same as the SMP case with 1
CPU (+ optimization).

I find code like the above physically disgusting.

What's wrong with using

	nr_running()

unconditionally, and making sure that it degrades gracefully to just the
single-CPU case?

What's wrong with just using

	runqueue(task_to_runqueue(p))

and having the UP case realize that the "runqueue()" macro is a fixed
entry?
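
A minimal sketch of what this could look like, written as user-space C
rather than kernel code. The names runqueue(), task_to_runqueue() and
run_list follow the snippets quoted above; the struct layout, the NR_CPUS
values, init_runqueues() and add_to_runqueue() are assumptions made for
illustration and are not taken from the patch:

```c
#include <assert.h>
#include <stddef.h>

/* The only place the configuration is visible. In a real tree this would
 * sit in a header, outside the scheduler code itself. */
#ifdef CONFIG_SMP
#define NR_CPUS 4                           /* hypothetical SMP build */
#define task_to_runqueue(p) ((p)->cpu)
#else
#define NR_CPUS 1                           /* UP: exactly one queue */
#define task_to_runqueue(p) ((void)(p), 0)  /* constant: the fixed entry */
#endif

struct list_head { struct list_head *next, *prev; };

struct task {
    int cpu;                    /* CPU whose queue holds this task */
    struct list_head run_list;
};

static struct list_head runqueues[NR_CPUS];
#define runqueue(cpu) (runqueues[cpu])

/* Make every queue an empty circular list. */
static void init_runqueues(void)
{
    for (int i = 0; i < NR_CPUS; i++)
        runqueues[i].next = runqueues[i].prev = &runqueues[i];
}

/* Insert entry right after head, as the kernel's list_add() does. */
static void list_add(struct list_head *entry, struct list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/* The call site is identical for UP and SMP: no #ifdef here. */
static void add_to_runqueue(struct task *p)
{
    assert(task_to_runqueue(p) < NR_CPUS);
    list_add(&p->run_list, &runqueue(task_to_runqueue(p)));
}
```

Built without CONFIG_SMP, the index is the constant 0, so the compiler can
reduce the array access to a single fixed list head.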

Same thing applies to that runqueue_lock stuff. That is some of the
ugliest code I've seen in a long time. Please use inline functions, sane
defines that work both ways, and take advantage of the fact that gcc will
optimize constant loops and numbers (it's ok to reference arrays in UP
with "array[smp_processor_id()]", and it's ok to have loops that look like
"for (i = 0; i < NR_CPUS; i++)" that will do the right thing on UP _and_
SMP).
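
As a rough illustration of the loop and array idiom described above, here
is a user-space sketch of an nr_running() that is written once for both
configurations. The per-CPU counter array, the inc/dec helpers and the
current_cpu stand-in are hypothetical; only nr_running(), NR_CPUS and
smp_processor_id() are names that appear in this thread:

```c
#include <assert.h>

#ifndef NR_CPUS
#define NR_CPUS 1   /* UP default; an SMP build would define a larger value */
#endif

/* Stand-in for the kernel's smp_processor_id(): a constant 0 on UP. */
#if NR_CPUS == 1
#define smp_processor_id() 0
#else
static int current_cpu;                  /* hypothetical, for this sketch only */
#define smp_processor_id() (current_cpu)
#endif

/* One runnable-task counter per CPU. */
static int nr_running_cpu[NR_CPUS];

static inline void inc_nr_running(void)
{
    nr_running_cpu[smp_processor_id()]++;
}

static inline void dec_nr_running(void)
{
    assert(nr_running_cpu[smp_processor_id()] > 0);
    nr_running_cpu[smp_processor_id()]--;
}

/* Same source for UP and SMP. With NR_CPUS == 1 the loop has a constant
 * trip count of one, so an optimizing compiler can reduce it to a single
 * load. */
static inline int nr_running(void)
{
    int i, total = 0;

    for (i = 0; i < NR_CPUS; i++)
        total += nr_running_cpu[i];
    return total;
}
```

Callers use nr_running() unconditionally; the only preprocessor
conditionals are in the definitions at the top.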

And make your #ifdef's be _outside_ the code.

I hate code that has #ifdef's. It's a major design mistake, and shows
that the person who coded it didn't think of it as _one_ problem, but as
two.

So please spend some time cleaning it up, I can't look at it like this.

		Linus


Path: archiver1.google.com!newsfeed.google.com!newsfeed.stanford.edu!news.tele.dk!small.news.tele.dk!129.240.148.23!uio.no!nntp.uio.no!ifi.uio.no!internet-mailinglist
Newsgroups: fa.linux.kernel
Return-Path: <linux-kernel-ow...@vger.kernel.org>
Original-Date: 	Wed, 8 Aug 2001 10:05:27 -0700
From: Mike Kravetz <mkrav...@sequent.com>
To: Linus Torvalds <torva...@transmeta.com>
Cc: linux-ker...@vger.kernel.org
Subject: Re: [RFC][PATCH] Scalable Scheduling
Original-Message-ID: <20010808100527.D1088@w-mikek2.des.beaverton.ibm.com>
Original-References: <20010808091652.B1...@w-mikek2.des.beaverton.ibm.com> <Pine.LNX.4.33.0108080929170.1530-100...@penguin.transmeta.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <Pine.LNX.4.33.0108080929170.1530-100000@penguin.transmeta.com>; from torvalds@transmeta.com on Wed, Aug 08, 2001 at 09:40:07AM -0700
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
X-Mailing-List: 	linux-kernel@vger.kernel.org
Organization: Internet mailing list
Date: Wed, 8 Aug 2001 17:08:11 GMT
Message-ID: <fa.lapet0v.43ccbj@ifi.uio.no>
References: <fa.o8qpd7v.h4e5rd@ifi.uio.no>
Lines: 11

Thanks for the input.  I'll try to get the code into some form
that you can stomach.  Hopefully, after that we can discuss the
merits of the approach/design.

-
Mike

Path: archiver1.google.com!newsfeed.google.com!newsfeed.stanford.edu!news.tele.dk!small.news.tele.dk!129.240.148.23!uio.no!nntp.uio.no!ifi.uio.no!internet-mailinglist
Newsgroups: fa.linux.kernel
Importance: Normal
Subject: Re: [RFC][PATCH] Scalable Scheduling
To: Linus Torvalds <torva...@transmeta.com>
Cc: Mike Kravetz <mkrav...@beaverton.ibm.com>, <linux-ker...@vger.kernel.org>
Original-Message-ID: <OFF9CB2CBE.6FCCA7C5-ON85256AA2.005FE800@pok.ibm.com>
From: "Hubertus Franke" <fran...@us.ibm.com>
Original-Date: 	Wed, 8 Aug 2001 13:32:40 -0400
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
Organization: Internet mailing list
Date: Wed, 8 Aug 2001 17:33:37 GMT
Message-ID: <fa.kjl1r8v.v6sgp6@ifi.uio.no>
Lines: 100

Linus, great input on the FLAME side, criticism accepted :-)

More importantly, we wanted to get some input (particularly from you)
on whether our approach is actually an acceptable one,
notwithstanding the #ifdef's :-).
These are easy to fix, but we wanted to follow up
on this topic ASAP after OLS, before the thoughts on this got lost due to
time.

We will clean this code up ASAP and resubmit.

Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)

email: fran...@us.ibm.com
(w) 914-945-2003    (fax) 914-945-4425   TL: 862-2003



Path: archiver1.google.com!newsfeed.google.com!newsfeed.stanford.edu!news.tele.dk!small.news.tele.dk!129.240.148.23!uio.no!nntp.uio.no!ifi.uio.no!internet-mailinglist
Newsgroups: fa.linux.kernel
Original-Date: 	Wed, 8 Aug 2001 10:43:09 -0700 (PDT)
From: Linus Torvalds <torva...@transmeta.com>
To: Hubertus Franke <fran...@us.ibm.com>
cc: Mike Kravetz <mkrav...@beaverton.ibm.com>, <linux-ker...@vger.kernel.org>
Subject: Re: [RFC][PATCH] Scalable Scheduling
In-Reply-To: <OFF9CB2CBE.6FCCA7C5-ON85256AA2.005FE800@pok.ibm.com>
Original-Message-ID: <Pine.LNX.4.33.0108081041260.8047-100000@penguin.transmeta.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
Organization: Internet mailing list
Date: Wed, 8 Aug 2001 17:46:32 GMT
Message-ID: <fa.od9rdvv.kkk5j2@ifi.uio.no>
References: <fa.kjl1r8v.v6sgp6@ifi.uio.no>
Lines: 23


On Wed, 8 Aug 2001, Hubertus Franke wrote:
>
> Linus, great input on the FLAME side, criticism accepted :-)
>
> More importantly, we wanted to get some input (particularly from you)
> on whether our approach is actually an acceptable one,
> notwithstanding the #ifdef's :-),

I think what the code itself tried to do looked reasonable, but it was so
distracting to read the patch that I can't make any really intelligent
comments about it.

The only thing that looked really ugly was that real-time runqueue thing.
Does it _really_ have to be done that way?

		Linus


Path: archiver1.google.com!newsfeed.google.com!newsfeed.stanford.edu!news.tele.dk!small.news.tele.dk!129.240.148.23!uio.no!nntp.uio.no!ifi.uio.no!internet-mailinglist
Newsgroups: fa.linux.kernel
Original-Date: 	Wed, 8 Aug 2001 11:00:50 -0700 (PDT)
From: Linus Torvalds <torva...@transmeta.com>
To: Hubertus Franke <fran...@us.ibm.com>
cc: Mike Kravetz <mkrav...@beaverton.ibm.com>, <linux-ker...@vger.kernel.org>
Subject: Re: [RFC][PATCH] Scalable Scheduling
In-Reply-To: <Pine.LNX.4.33.0108081041260.8047-100000@penguin.transmeta.com>
Original-Message-ID: <Pine.LNX.4.33.0108081058420.8103-100000@penguin.transmeta.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
Organization: Internet mailing list
Date: Wed, 8 Aug 2001 18:04:21 GMT
Message-ID: <fa.oe9ld7v.nku4rf@ifi.uio.no>
References: <fa.od9rdvv.kkk5j2@ifi.uio.no>
Lines: 21


On Wed, 8 Aug 2001, Linus Torvalds wrote:
>
> The only thing that looked really ugly was that real-time runqueue
> thing. Does it _really_ have to be done that way?

Oh, and as I didn't actually run it, I have no idea about what performance
is really like. I assume you've done lmbench runs across a wide variety (i.e.
UP to SMP) of machines with and without this?

"Scalability" is useless if the baseline you scale from is bad. In the
end, the only thing that matters is "performance", not "scalability".
Which is why sometimes O(n) is better than O(log n).
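
A small worked example of that last point. The cost model and all three
constants below are purely hypothetical, chosen only to show the shape of
the trade-off; they are not measurements of any scheduler:

```latex
% Hypothetical cost model: a simple linear scan versus a structure with
% logarithmic lookup but a larger fixed and per-step overhead.
\begin{align*}
T_{\mathrm{lin}}(n) &= c_1 n, &
T_{\log}(n) &= c_0 + c_2 \log_2 n,
\qquad c_1 = 1,\; c_0 = 20,\; c_2 = 10 \\
n = 8:\quad   T_{\mathrm{lin}} &= 8,   & T_{\log} &= 20 + 10 \cdot 3 = 50 \\
n = 64:\quad  T_{\mathrm{lin}} &= 64,  & T_{\log} &= 20 + 10 \cdot 6 = 80 \\
n = 128:\quad T_{\mathrm{lin}} &= 128, & T_{\log} &= 20 + 10 \cdot 7 = 90
\end{align*}
```

With these assumed constants the linear scan is cheaper up to somewhere
between n = 64 and n = 128, so if the typical queue length sits below that
break-even point, the O(n) version gives the better performance.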

		Linus


Path: archiver1.google.com!newsfeed.google.com!newsfeed.stanford.edu!news.tele.dk!small.news.tele.dk!129.240.148.23!uio.no!nntp.uio.no!ifi.uio.no!internet-mailinglist
Newsgroups: fa.linux.kernel
Return-Path: <linux-kernel-ow...@vger.kernel.org>
Original-Date: 	Wed, 8 Aug 2001 11:28:00 -0700
From: Mike Kravetz <mkrav...@sequent.com>
To: Linus Torvalds <torva...@transmeta.com>
Cc: Hubertus Franke <fran...@us.ibm.com>, linux-ker...@vger.kernel.org
Subject: Re: [RFC][PATCH] Scalable Scheduling
Original-Message-ID: <20010808112800.F1088@w-mikek2.des.beaverton.ibm.com>
Original-References: <Pine.LNX.4.33.0108081041260.8047-100...@penguin.transmeta.com> <Pine.LNX.4.33.0108081058420.8103-100...@penguin.transmeta.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <Pine.LNX.4.33.0108081058420.8103-100000@penguin.transmeta.com>; from torvalds@transmeta.com on Wed, Aug 08, 2001 at 11:00:50AM -0700
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
X-Mailing-List: 	linux-kernel@vger.kernel.org
Organization: Internet mailing list
Date: Wed, 8 Aug 2001 18:30:36 GMT
Message-ID: <fa.l9p2tgv.530crs@ifi.uio.no>
References: <fa.oe9ld7v.nku4rf@ifi.uio.no>
Lines: 32

On Wed, Aug 08, 2001 at 11:00:50AM -0700, Linus Torvalds wrote:
> 
> On Wed, 8 Aug 2001, Linus Torvalds wrote:
> >
> > The only thing that looked really ugly was that real-time runqueue
> > thing. Does it _really_ have to be done that way?

The issue here is maintaining FIFO and RR semantics for real-time
tasks.  If the real-time tasks are distributed among multiple
runqueues, maintaining these semantics can be quite difficult.  We
thought the best way to handle this would be a separate real-time
runqueue.  Granted, it is not beautiful but it was the simplest
solution that we could come up with.  We'll give it some more
thought when cleaning up the code.
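
To make the idea concrete, here is a simplified user-space sketch of a
single shared real-time queue that is consulted before the per-CPU queues.
It only shows FIFO ordering among real-time tasks; priorities, RR time
slices and locking are left out, and every name in it (rt_runqueue,
cpu_runqueue, pick_next, the ring-buffer queue) is an assumption made for
illustration rather than the code in the patch:

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS   2   /* hypothetical machine size */
#define QUEUE_MAX 8   /* hypothetical queue capacity */

struct task {
    int id;
    int realtime;     /* non-zero for SCHED_FIFO/SCHED_RR style tasks */
};

/* A small FIFO ring buffer standing in for a runqueue list. */
struct queue {
    struct task *slot[QUEUE_MAX];
    int head, count;
};

static struct queue rt_runqueue;            /* one queue shared by all CPUs */
static struct queue cpu_runqueue[NR_CPUS];  /* one queue per CPU */

static void enqueue(struct queue *q, struct task *p)
{
    assert(q->count < QUEUE_MAX);
    q->slot[(q->head + q->count) % QUEUE_MAX] = p;
    q->count++;
}

static struct task *dequeue(struct queue *q)
{
    struct task *p;

    if (q->count == 0)
        return NULL;
    p = q->slot[q->head];
    q->head = (q->head + 1) % QUEUE_MAX;
    q->count--;
    return p;
}

/* Real-time tasks always go to the shared queue, whatever CPU they woke on. */
static void add_to_runqueue(struct task *p, int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    enqueue(p->realtime ? &rt_runqueue : &cpu_runqueue[cpu], p);
}

/* Every CPU looks at the real-time queue first, so real-time tasks run in
 * global arrival order no matter which CPU becomes free. */
static struct task *pick_next(int cpu)
{
    struct task *p = dequeue(&rt_runqueue);

    return p ? p : dequeue(&cpu_runqueue[cpu]);
}
```

Because there is only one real-time queue, arrival order is kept in one
place instead of having to be reconstructed across several per-CPU queues.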

> Oh, and as I didn't actually run it, I have no idea about what performance
> is really like. I assume you've done lmbench runs across a wide variety (i.e.
> UP to SMP) of machines with and without this?

Yes, we have; we'll provide those numbers with the updated patch.
One challenge will be maintaining the same level of performance
for UP as in the current code.  The current code has #ifdefs to
separate some of the UP/SMP code paths and we will try to eliminate
these.

-- 
Mike Kravetz                                 mkrav...@sequent.com