Tuesday, November 17, 2009

linux brainfuck scheduler ROCKIN! con kolivas

http://ck.kolivas.org/patches/bfs/bfs-faq.txt

FAQS about BFS. v0.310

Why did I write it?

After years of using my old kernel and numerous hardware upgrades, I finally
had hardware that needed a newer kernel for drivers and to try out the newer
filesystems. Booting the mainline kernel was relatively reassuring in that
the scheduler behaviour was much better than what was in earlier kernels.
However, it didn't take long before I started being disappointed in that too.
Random stalls in mouse movements, keypresses, strange cpu distribution in
various workloads and unpredictable behaviour all around were exactly what I
was hoping had gone away. So I did what I vowed never to do, looked at the
code. After seeing it had grown into a monster of epic proportions I sat down
and thought about what was wrong. One of the key features of fairness and
interactivity that I always argued for were very simple semantics for how
cpu should be distributed, with guaranteed low latencies so that interactivity
was assured by design instead of bolted on. CFS in essence does that, but it
does something else too. It varies timeslice length to try and preserve some
deadline list and it determines cpu distribution based on a run/sleep
relationship. It also is designed to scale to monster proportion hardware
that the common man will never see. The whole sleep calculation thing is
exactly what I found was responsible for making varied behaviour under
different loads and relative starvation and unfairness. It's not a profound
effect in CFS and that's admirable. It just doesn't behave the way I feel
the scheduler should being forward looking only (not calculating sleep) and
it doesn't really make the most of a relatively lightly loaded machine without
many many cpus. So I threw it all out and wrote exactly the opposite.

What is it?

BFS is the Brain Fuck Scheduler. It was designed to be forward looking only,
make the most of lower spec machines, and not scale to massive hardware. ie
it is a desktop orientated scheduler, with extremely low latencies for
excellent interactivity by design rather than "calculated", with rigid
fairness, nice priority distribution and extreme scalability within normal
load levels.

Extreme scalability within normal load levels? Isn't that a contradiction?

For years we've been doing our workloads on linux to have more work than we
had CPUs because we thought that the "jobservers" were limited in their
ability to utilise the CPUs effectively (so we did make -j6 or more on a
quad core machine for example). This scheduler proves that the jobservers
weren't at fault at all, because make -j4 on a quad core machine with BFS
is faster than *any* choice of job numbers on CFS. See reverse scalability
graph courtesy of Serge Belyshev showing various job numbers on a kernel build
on a quad core machine. The problem has always been that the mainline
scheduler can't keep the CPUs busy enough; ie it doesn't make the most of
your hardware in the most common situations on a desktop! Note that the
reverse scalability graph is old; the scalability has improved since then.

Why "Brain Fuck"?

Because it throws out everything about what we know is good about how to
design a modern scheduler in scalability.
Because it's so ridiculously simple.
Because it performs so ridiculously well on what it's good at despite being
that simple.
Because it's designed in such a way that mainline would never be interested
in adopting it, which is how I like it.
Because it will make people sit up and take notice of where the problems are
in the current design.
Because it throws out the philosophy that one scheduler fits all and shows
that you can do a -lot- better with a scheduler designed for a particular
purpose. I don't want to use a steamroller to crack nuts.
Because it actually means that more CPUs means better latencies.
Because I must be fucked in the head to be working on this again.
I'll think of some more becauses later.

How scalable is it?

I don't own the sort of hardware that is likely to suffer from using it, so
I can't find the upper limit. Based on first principles about the overhead
of locking, and the way lookups occur, I'd guess that a machine with 16 CPUS
or more would start to have exponentially less performance (thanks Ingo for
confirming this). Note that the number of logical CPUs is what affects BFS'
scalability, not the physical ones. By that I mean that a hyperthreaded CPU
that is a quad core hyperthreaded is 8 EIGHT logical CPUs. So it is NOT the
same as a quad core without hyperthreading.

Since version 0.300, scalability improvements have been added that should
further improve performance, including NUMA support! No scalability benchmarks
on very big machines.have been performed on new versions to compare its
performance.

The O(n) lookup of BFS will cause people some concern because of the
notation. However, if the actual overhead is very small, then even with
large numbers of n, it can be lower overhead than an O(1) design. Testing
this scheduler vs CFS with the test app "forks" which forks 1000 tasks that
do simple work, shows no difference in time to completion compared to CFS.
That's a load of 1000 on a quad core machine. But note that BFS gets much
faster when the loads are lower and approximate the number of CPUs, which
is much more what you would experience on a desktop.


What about interbench numbers?

Interbench does too many jobs by default on the burn/compile tests. I've put
up interbench results from a quad core where the jobs (4) is equal to the
number of CPUs so the test is more meaningful, and added comments. It appears
I'm about the only person who understands interbench numbers since I wrote
the benchmark, so don't place too much weight on them. The 'latt' test app
recently written by Jens Axboe is a better place for simpler to understand
and useful numbers. It'll be interesting to see how this application evolves.

What features does BFS have and not have?

On top of the current scheduler design, it has a SCHED_IDLEPRIO which actually
does only schedule tasks when idle, and SCHED_ISO for unprivileged realtime
performance. BFS does NOT implement CGROUPS. A desktop user should not need
know about CGROUPS, nor should they need to use them. BFS also does not have
the feature of "lots of tunables I don't understand".

How do you recommend I use this?

It's designed so that you just patch it in and use it. You shouldn't need to
do anything at all. But since people still want to know every last thing...

THESE ARE OPTIONAL FOR LOWEST LATENCY. YOU DO NOT NEED THESE!
Configure your kernel with 1000Hz, preempt ON and disable dynamic ticks.

You shouldn't need to tune BFS virtually ever. The only tunable for the
scheduler itself is the rr_interval value (see documentation). Try 3ms if
latency is everything to you. When compiling software, do not use more jobs
than you have CPUs! So make -j2 on dual core, -j4 on quad core and so on.
Nice levels are strictly obeyed so if you nice your compiles they'll be
virtually unnoticeable. (nice -n 19 make -j2). Run your distributed computing
clients SCHED_IDLEPRIO (eg folding at home, mprime etc):
schedtool -D -e mprime
This will make your distributed computing client *never* cause slowdowns in
your other userspace applications, at the cost of slightly slower progress
of the client.
Run your audio and video apps SCHED_ISO:
schedtool -I -e amarok
This will run amarok as an unprivileged real-time task. Note that if you start
an application that tries to get real-time scheduling (eg jackd) and you are
not starting it as root, BFS will automatically elevate it to SCHED_ISO for
you to give it the next best thing.

NUMA aware?

It is NOT NUMA aware in the sense that it does any fancy shit on NUMA, but
it will work on NUMA hardware just fine. Only the really big NUMA hardware
is likely to suffer in performance, and this is theoretically only, since
no one has that sort of hardware to prove it to me, but it seems almost
certain. v0.300 onwards have NUMA enhancements.

Multicore processors?

This is where BFS shines.

Single processors?

Single processors benefit a lot from BFS too.

I found a bug!

Great! Help me debug it. A scheduler is subtle and quick to anger. If you can
code then delve away and see what you can find! I'll take help from anyone.
It's a major ordeal trying to get this thing working on all sorts of hardware.
You can't code? Give me whatever details you've got and I'll see what I can
do. As per usual this stuff comes with no guarantee, and I do not have
infinite time to spend on it. I do NOT get paid to do this and do it just for
the fun of it. I'll do whatever I can to help you but I cannot support this
like a paid project. I'd *love* to see people hacking on the code themselves.

Are you looking at getting this into mainline?

LOL.

No really, are you?

LOL.

Really really, are you?

No. They would be crazy to use this scheduler anyway since it won't scale to
their 4096 cpu machines. The only way is to rewrite it to work that way, or
to have more than one scheduler in the kernel. I don't want to do the former,
and mainline doesn't want to do the latter. Besides, apparently I'm a bad
maintainer, which makes sense since for some reason I seem to want to have
a career, a life, raise a family with kids and have hobbies, all of which
have nothing to do with linux.

Can it be made to scale to 4096 CPUs?

Sure I guess you could run one runqueue per CPU package instead of a global
one and so on, but I have no intention whatsoever at doing that because it
will compromise the performance where *I* care.

Is this stable?

It is now pretty stable. It has been a while since serious crashes have been
reported. It first booted on 25th August 2009 but the codebase has since
become a lot more robust. Of course the usual warnings apply that it might
eat up your children and spit out your data, or worse, eat up your data and
spit out children (but I doubt it).

Currently known problems?

Nil.

GIT repository?

Sorry, it's not the right tool for me so it's not worth me investing the time
in setting one up.

WINE Games?

Because of weak and sometimes non-existent locking with windows games under
WINE, they can actually perform *worse* on BFS due to BFS distributing
load to both CPUs. Your mileage may vary, but if you do get worse behaviour,
this can be improved by binding WINE and the game to one cpu like this:
schedtool -a 0 -e winegamecommandline

CPU Accounting is different?

The mainline kernel simply samples cpu accounting per tick. This can be
very inaccurate for very short lived processes. BFS uses accounting of actual
cpu time consumed within the limitations of the currently kernel and hardware
and may report very different cpu usage. Do not use the reported cpu usage
between different kernels as some marker of performance. Compare actual work
progress instead (eg when comparing distributed computing clients).

Will I be maintaining this, even though mainline won't have it?

Yes. For the foreseeable future at least. Once the bugs are ironed out, it
shouldn't be too much effort to keep in sync with mainline.

What is the relationship of BFS to SD?

BFS implements the same forward-looking rigidly fair timeslicer philosophy
that SD used, but is otherwise a completely different scheduler. It
is an attempt to fuse all the ideas I have about scheduling so far, without
worrying about trying to make it infinitely scalable.


Thanks to the guys in irc.oftc.net #ck for inspiration to work on this and
early testing! Many of them sat idle in that channel for years while nothing
happened. The xkcd comic supported_features also helped motivate this
work. Yes I know you probably still can't watch full screen videos on youtube,
but that's not entirely the scheduler's fault.


Work in progress.

Last updated: Sat Nov 7 08:51:28 2009
Con Kolivas

No comments: