Name: Hyperthread Scheduler Modifications
Author: Ingo Molnar
Status: Experimental

D: Symmetric multithreading (hyperthreading) is an interesting new concept
D: that IMO deserves full scheduler support. Physical CPUs can have multiple
D: (typically 2) logical CPUs embedded, and can run multiple tasks 'in
D: parallel' by utilizing fast hardware-based context-switching between the
D: two register sets upon things like cache misses or special instructions.
D: To the OS the logical CPUs are almost indistinguishable from physical
D: CPUs. In fact the current scheduler treats each logical CPU as a separate
D: physical CPU - which works, but does not maximize multiprocessing
D: performance on SMT/HT boxes.
D:
D: The following properties have to be provided by a scheduler that wants to
D: be 'fully HT-aware':
D:
D:  - HT-aware passive load-balancing: the IRQ-driven balancing has to be
D:    per-physical-CPU, not per-logical-CPU.
D:
D:    Otherwise it might happen that one physical CPU runs 2 tasks while
D:    another physical CPU runs none. The stock scheduler does not recognize
D:    this condition as an 'imbalance' - to it the first two logical CPUs
D:    appear to have one task each and the second two logical CPUs appear to
D:    have none, because it does not realize that the two logical CPUs in
D:    each pair belong to the same physical CPU.
D:
D:  - 'Active' load-balancing when a logical CPU goes idle and thus causes a
D:    physical-CPU imbalance.
D:
D:    This is a mechanism that simply does not exist in the stock 1:1
D:    scheduler - there the imbalance caused by an idle CPU can be solved by
D:    the normal load-balancer. In the HT case the situation is special
D:    because the source physical CPU might have just two tasks running,
D:    both runnable - a situation the stock load-balancer cannot handle,
D:    since running tasks are hard to migrate away. But doing so is
D:    essential - otherwise one physical CPU can get stuck running 2 tasks
D:    while another physical CPU stays idle.
D:
D:  - HT-aware task pickup.
D:
D:    When the scheduler picks a new task, it should first consider all
D:    tasks queued on the same physical CPU, before trying to pull in tasks
D:    from other CPUs. The stock scheduler only picks tasks that were
D:    scheduled to that particular logical CPU.
D:
D:  - HT-aware affinity.
D:
D:    Tasks should attempt to 'stick' to physical CPUs, not logical CPUs.
D:
D:  - HT-aware wakeup.
D:
D:    Again this is something completely new - the stock scheduler only
D:    knows about the 'current' CPU, it does not know about any sibling
D:    logical CPUs [i.e. logical CPUs on the same physical CPU]. On HT, if a
D:    thread is woken up on a logical CPU that is already executing a task,
D:    and a sibling logical CPU is idle, then the sibling CPU has to be
D:    woken up to execute the newly woken task immediately (a sketch of this
D:    decision order follows at the end of this description).
D:
D: The attached patch (against 2.5.31-BK-curr) implements all the above
D: HT-scheduling needs by introducing the concept of a shared runqueue:
D: multiple CPUs can share the same runqueue. A shared, per-physical-CPU
D: runqueue magically fulfills all the above HT-scheduling needs. Obviously
D: this complicates scheduling and load-balancing somewhat (see the patch
D: for details), so great care has been taken not to impact the non-HT
D: schedulers (SMP, UP). In fact the SMP scheduler is a compile-time special
D: case of the HT scheduler (and the UP scheduler is a compile-time special
D: case of the SMP scheduler).
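D:
D: As a rough illustration of that indirection - a self-contained userspace
D: toy, not code from the patch; model_map_runqueue() merely stands in for
D: the patch's sched_map_runqueue(), and the array/macro below only mimic
D: the real rq_idx()/cpu_rq() remapping:
D:
D:	/* Toy userspace model of the shared-runqueue indirection. All
D:	 * names here are illustrative only, not the patch's code. */
D:	#include <assert.h>
D:	#include <stdio.h>
D:
D:	#define NR_CPUS 4
D:
D:	struct model_rq { int nr_running; };
D:
D:	static struct model_rq runqueues[NR_CPUS];
D:	static int rq_idx[NR_CPUS];	/* which runqueue a logical CPU uses */
D:
D:	#define cpu_rq(cpu) (&runqueues[rq_idx[(cpu)]])
D:
D:	/* map cpu2 into cpu1's runqueue - the moral equivalent of
D:	 * sched_map_runqueue(cpu1, cpu2) in the patch: */
D:	static void model_map_runqueue(int cpu1, int cpu2)
D:	{
D:		rq_idx[cpu2] = rq_idx[cpu1];
D:	}
D:
D:	int main(void)
D:	{
D:		int cpu;
D:
D:		/* boot default: 1:1 mapping, one runqueue per logical CPU */
D:		for (cpu = 0; cpu < NR_CPUS; cpu++)
D:			rq_idx[cpu] = cpu;
D:
D:		/* merge HT siblings: CPUs 0/1 and 2/3 share a package */
D:		model_map_runqueue(0, 1);
D:		model_map_runqueue(2, 3);
D:
D:		assert(cpu_rq(0) == cpu_rq(1));	/* siblings share one runqueue */
D:		assert(cpu_rq(0) != cpu_rq(2));	/* different packages do not */
D:
D:		cpu_rq(1)->nr_running++;	/* a task queued via CPU#1 ... */
D:		printf("CPU#0 sees nr_running=%d\n",
D:			cpu_rq(0)->nr_running);	/* ... is visible to CPU#0 */
D:		return 0;
D:	}
D:
D: In the 1:1 (SMP/UP) case the mapping stays the identity, which is how the
D: non-HT schedulers remain a compile-time special case of the HT scheduler.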
D:
D: The patch is based on Jun Nakajima's prototyping work - the lowlevel
D: x86/Intel bits are still those from Jun, the sched.c bits are newly
D: implemented and generalized.
D:
D: There's a single flexible interface for lowlevel boot code to set up
D: physical CPUs: sched_map_runqueue(cpu1, cpu2) maps cpu2 into cpu1's
D: runqueue. The patch also implements the lowlevel bits for P4 HT boxes for
D: the 2-siblings-per-package case.
D:
D: (NUMA systems whose tightly coupled CPUs have small per-CPU caches
D: protected by a large shared L3 cache might benefit from sharing a
D: runqueue as well - but the target of this concept is SMT.)
D:
D: Some numbers:
D:
D: Compiling a standalone floppy.c in an infinite loop takes 2.55 seconds
D: per iteration. Starting two such loops in parallel on a 2-physical,
D: 2-logical (4 logical CPUs total) P4 HT box gives the following numbers:
D:
D:    2.5.31-BK-curr:       fluctuates between 2.60 and 4.6 seconds
D:
D:    BK-curr + sched-F3:   stable 2.60 second results
D:
D: The results under the stock scheduler depend on pure luck: on which CPUs
D: the tasks get scheduled. In the HT-aware case each task gets scheduled on
D: a separate physical CPU, all the time.
D:
D: Compiling the kernel source via "make -j2" [which under-utilizes the
D: CPUs]:
D:
D:    2.5.31-BK-curr:       45.3 sec
D:
D:    BK-curr + sched-F3:   41.3 sec
D:
D: i.e. a ~10% improvement. The numbers are the best results picked from
D: lots of (>10) runs. The no-HT numbers fluctuate much more (again the
D: randomness effect), so the average compilation time in the no-HT case is
D: higher.
D:
D: Saturated-compilation "make -j5" results are roughly equivalent, as
D: expected - the one-runqueue-per-logical-CPU concept works adequately when
D: the number of tasks is larger than the number of logical CPUs. The stock
D: scheduler works well on HT boxes in the boundary conditions: when there
D: is 1 task running, and when there are more than nr_cpus tasks running.
D:
D: The patch also unifies some of the other code and removes a few more
D: #ifdef CONFIG_SMP branches from the scheduler proper.
D:
D: (The patch compiles/boots/works just fine on UP and SMP as well, tested
D: on the P4 box and on a PIII SMP box.)
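D:
D: To make the 'HT-aware wakeup' rule above concrete, here is a minimal,
D: self-contained userspace sketch of the decision order (cpus_allowed
D: checks and locking are left out; ht_pick_wakeup_cpu() and struct
D: model_cpu are illustrative names only - the patch's real implementation
D: is wake_up_cpu() in kernel/sched.c):
D:
D:	/* Toy model of the HT-aware wakeup preference order, not the
D:	 * patch's code. Lower prio value == higher priority, as in the
D:	 * kernel. */
D:	#include <stdio.h>
D:
D:	#define SIBLINGS 2
D:
D:	struct model_cpu {
D:		int idle;	/* 1 if this logical CPU runs its idle thread */
D:		int curr_prio;	/* priority of the currently running task */
D:	};
D:
D:	static int ht_pick_wakeup_cpu(struct model_cpu cpu[SIBLINGS],
D:				      int target, int woken_prio)
D:	{
D:		int i;
D:
D:		if (cpu[target].idle)			/* 1) target is idle */
D:			return target;
D:		for (i = 0; i < SIBLINGS; i++)		/* 2) any idle sibling */
D:			if (cpu[i].idle)
D:				return i;
D:		if (woken_prio < cpu[target].curr_prio)	/* 3) preempt target */
D:			return target;
D:		for (i = 0; i < SIBLINGS; i++)		/* 4) preempt a sibling */
D:			if (woken_prio < cpu[i].curr_prio)
D:				return i;
D:		return -1;				/* 5) nobody to resched */
D:	}
D:
D:	int main(void)
D:	{
D:		/* CPU#0 busy with a prio-120 task, sibling CPU#1 idle: */
D:		struct model_cpu cpu[SIBLINGS] = { { 0, 120 }, { 1, 140 } };
D:
D:		/* a prio-125 task woken "on" CPU#0 goes to the idle sibling: */
D:		printf("wake on CPU#%d\n", ht_pick_wakeup_cpu(cpu, 0, 125));
D:		return 0;
D:	}
D:
D: Running it prints "wake on CPU#1": the woken task is pushed to the idle
D: sibling instead of waiting behind the higher-priority task on CPU#0.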
D: D: Testreports, comments, suggestions welcome, D: D: Ingo diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .8164-linux-2.5.33/arch/i386/config.in .8164-linux-2.5.33.updated/arch/i386/config.in --- .8164-linux-2.5.33/arch/i386/config.in 2002-08-28 09:29:39.000000000 +1000 +++ .8164-linux-2.5.33.updated/arch/i386/config.in 2002-09-09 17:26:06.000000000 +1000 @@ -155,6 +155,20 @@ if [ "$CONFIG_MWINCHIP3D" = "y" ]; then fi bool 'Symmetric multi-processing support' CONFIG_SMP + +if [ "$CONFIG_SMP" = "y" ]; then + if [ "$CONFIG_MPENTIUM4" = "y" ]; then + choice 'Hyperthreading Support' \ + "off CONFIG_NR_SIBLINGS_0 \ + 2-siblings CONFIG_NR_SIBLINGS_2" off + fi +fi + +if [ "$CONFIG_NR_SIBLINGS_2" = "y" ]; then + define_bool CONFIG_SHARE_RUNQUEUE y + define_int CONFIG_MAX_NR_SIBLINGS 2 +fi + bool 'Preemptible Kernel' CONFIG_PREEMPT if [ "$CONFIG_SMP" != "y" ]; then bool 'Local APIC support on uniprocessors' CONFIG_X86_UP_APIC diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .8164-linux-2.5.33/arch/i386/kernel/smpboot.c .8164-linux-2.5.33.updated/arch/i386/kernel/smpboot.c --- .8164-linux-2.5.33/arch/i386/kernel/smpboot.c 2002-09-01 12:22:57.000000000 +1000 +++ .8164-linux-2.5.33.updated/arch/i386/kernel/smpboot.c 2002-09-09 17:26:06.000000000 +1000 @@ -1149,16 +1149,32 @@ static void __init smp_boot_cpus(unsigne Dprintk("Boot done.\n"); /* - * If Hyper-Threading is avaialble, construct cpu_sibling_map[], so + * Here we can be sure that there is an IO-APIC in the system. Let's + * go and set it up: + */ + if (!skip_ioapic_setup && nr_ioapics) + setup_IO_APIC(); + + setup_boot_APIC_clock(); + + /* + * Synchronize the TSC with the AP + */ + if (cpu_has_tsc && cpucount) + synchronize_tsc_bp(); + /* + * If Hyper-Threading is available, construct cpu_sibling_map[], so * that we can tell the sibling CPU efficiently. */ +printk("cpu_has_ht: %d, smp_num_siblings: %d, num_online_cpus(): %d.\n", cpu_has_ht, smp_num_siblings, num_online_cpus()); if (cpu_has_ht && smp_num_siblings > 1) { for (cpu = 0; cpu < NR_CPUS; cpu++) cpu_sibling_map[cpu] = NO_PROC_ID; for (cpu = 0; cpu < NR_CPUS; cpu++) { int i; - if (!test_bit(cpu, &cpu_callout_map)) continue; + if (!test_bit(cpu, &cpu_callout_map)) + continue; for (i = 0; i < NR_CPUS; i++) { if (i == cpu || !test_bit(i, &cpu_callout_map)) @@ -1174,24 +1190,34 @@ static void __init smp_boot_cpus(unsigne printk(KERN_WARNING "WARNING: No sibling found for CPU %d.\n", cpu); } } - } - -#ifndef CONFIG_VISWS - /* - * Here we can be sure that there is an IO-APIC in the system. Let's - * go and set it up: +#if CONFIG_SHARE_RUNQUEUE + /* At this point APs would have synchronised TSC and waiting for + * smp_commenced, with their APIC timer disabled. So BP can go ahead + * do some initializations for Hyper-Threading(if needed). */ - if (!skip_ioapic_setup && nr_ioapics) - setup_IO_APIC(); -#endif - - setup_boot_APIC_clock(); + for (cpu = 0; cpu < NR_CPUS; cpu++) { + int i; + if (!test_bit(cpu, &cpu_callout_map)) + continue; + for (i = 0; i < NR_CPUS; i++) { + if (i <= cpu) + continue; + if (!test_bit(i, &cpu_callout_map)) + continue; - /* - * Synchronize the TSC with the AP - */ - if (cpu_has_tsc && cpucount) - synchronize_tsc_bp(); + if (phys_proc_id[cpu] != phys_proc_id[i]) + continue; + /* + * merge runqueues, resulting in one + * runqueue per package: + */ + sched_map_runqueue(cpu, i); + break; + } + } +#endif + } + } /* These are wrappers to interface to the new boot process. 
Someone diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .8164-linux-2.5.33/include/asm-i386/apic.h .8164-linux-2.5.33.updated/include/asm-i386/apic.h --- .8164-linux-2.5.33/include/asm-i386/apic.h 2002-07-27 15:24:39.000000000 +1000 +++ .8164-linux-2.5.33.updated/include/asm-i386/apic.h 2002-09-09 17:26:06.000000000 +1000 @@ -98,4 +98,6 @@ extern unsigned int nmi_watchdog; #endif /* CONFIG_X86_LOCAL_APIC */ +extern int phys_proc_id[NR_CPUS]; + #endif /* __ASM_APIC_H */ diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .8164-linux-2.5.33/include/linux/sched.h .8164-linux-2.5.33.updated/include/linux/sched.h --- .8164-linux-2.5.33/include/linux/sched.h 2002-09-01 12:23:07.000000000 +1000 +++ .8164-linux-2.5.33.updated/include/linux/sched.h 2002-09-09 17:26:06.000000000 +1000 @@ -147,6 +147,7 @@ typedef struct task_struct task_t; extern void sched_init(void); extern void init_idle(task_t *idle, int cpu); +extern void sched_map_runqueue(int cpu1, int cpu2); extern void show_state(void); extern void cpu_init (void); extern void trap_init(void); diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .8164-linux-2.5.33/kernel/sched.c .8164-linux-2.5.33.updated/kernel/sched.c --- .8164-linux-2.5.33/kernel/sched.c 2002-09-01 12:23:08.000000000 +1000 +++ .8164-linux-2.5.33.updated/kernel/sched.c 2002-09-09 17:34:11.000000000 +1000 @@ -20,15 +20,15 @@ #include #include #include +#include #include +#include +#include #include #include #include #include #include -#include -#include -#include /* * Convert user-nice values [ -20 ... 0 ... 19 ] @@ -137,6 +137,48 @@ struct prio_array { }; /* + * It's possible for two CPUs to share the same runqueue. + * This makes sense if they eg. share caches. + * + * We take the common 1:1 (SMP, UP) case and optimize it, + * the rest goes via remapping: rq_idx(cpu) gives the + * runqueue on which a particular cpu is on, cpu_idx(cpu) + * gives the rq-specific index of the cpu. + * + * (Note that the generic scheduler code does not impose any + * restrictions on the mappings - there can be 4 CPUs per + * runqueue or even assymetric mappings.) + */ +#if CONFIG_SHARE_RUNQUEUE +# define MAX_NR_SIBLINGS CONFIG_MAX_NR_SIBLINGS + static long __rq_idx[NR_CPUS] __cacheline_aligned; + static long __cpu_idx[NR_CPUS] __cacheline_aligned; +# define rq_idx(cpu) (__rq_idx[(cpu)]) +# define cpu_idx(cpu) (__cpu_idx[(cpu)]) +# define for_each_sibling(idx, rq) \ + for ((idx) = 0; (idx) < (rq)->nr_cpus; (idx)++) +# define rq_nr_cpus(rq) ((rq)->nr_cpus) +# define cpu_active_balance(c) (cpu_rq(c)->cpu[0].active_balance) +#else +# define MAX_NR_SIBLINGS 1 +# define rq_idx(cpu) (cpu) +# define cpu_idx(cpu) 0 +# define for_each_sibling(idx, rq) while (0) +# define cpu_active_balance(c) 0 +# define do_active_balance(rq, cpu) do { } while (0) +# define rq_nr_cpus(rq) 1 + static inline void active_load_balance(runqueue_t *rq, int this_cpu) { } +#endif + +typedef struct cpu_s { + task_t *curr, *idle; + task_t *migration_thread; + list_t migration_queue; + int active_balance; + int cpu; +} cpu_t; + +/* * This is the main, per-CPU runqueue data structure. 
* * Locking rule: those places that want to lock multiple runqueues @@ -147,30 +189,42 @@ struct runqueue { spinlock_t lock; unsigned long nr_running, nr_switches, expired_timestamp, nr_uninterruptible; - task_t *curr, *idle; prio_array_t *active, *expired, arrays[2]; int prev_nr_running[NR_CPUS]; - task_t *migration_thread; - list_t migration_queue; + int nr_cpus; + cpu_t cpu[MAX_NR_SIBLINGS]; } ____cacheline_aligned; static struct runqueue runqueues[NR_CPUS] __cacheline_aligned; -#define cpu_rq(cpu) (runqueues + (cpu)) +#define cpu_rq(cpu) (runqueues + (rq_idx(cpu))) +#define cpu_int(c) ((cpu_rq(c))->cpu + cpu_idx(c)) +#define cpu_curr_ptr(cpu) (cpu_int(cpu)->curr) +#define cpu_idle_ptr(cpu) (cpu_int(cpu)->idle) + #define this_rq() cpu_rq(smp_processor_id()) #define task_rq(p) cpu_rq(task_cpu(p)) -#define cpu_curr(cpu) (cpu_rq(cpu)->curr) #define rt_task(p) ((p)->prio < MAX_RT_PRIO) +#define migration_thread(cpu) (cpu_int(cpu)->migration_thread) +#define migration_queue(cpu) (&cpu_int(cpu)->migration_queue) + +#if NR_CPUS > 1 +# define task_allowed(p, cpu) ((p)->cpus_allowed & (1UL << (cpu))) +#else +# define task_allowed(p, cpu) 1 +#endif + /* * Default context-switch locking: */ #ifndef prepare_arch_switch # define prepare_arch_switch(rq, next) do { } while(0) # define finish_arch_switch(rq, next) spin_unlock_irq(&(rq)->lock) -# define task_running(rq, p) ((rq)->curr == (p)) +# define task_running(p) \ + (cpu_curr_ptr(task_cpu(p)) == (p)) #endif /* @@ -217,6 +271,15 @@ static inline void rq_unlock(runqueue_t spin_unlock_irq(&rq->lock); } +/** + * idle_cpu - is a given cpu idle currently? + * @cpu: the processor in question. + */ +inline int idle_cpu(int cpu) +{ + return cpu_curr_ptr(cpu) == cpu_idle_ptr(cpu); +} + /* * Adding/removing a task to/from a priority array: */ @@ -305,32 +368,6 @@ static inline void deactivate_task(struc p->array = NULL; } -/* - * resched_task - mark a task 'to be rescheduled now'. - * - * On UP this means the setting of the need_resched flag, on SMP it - * might also involve a cross-CPU call to trigger the scheduler on - * the target CPU. - */ -static inline void resched_task(task_t *p) -{ -#ifdef CONFIG_SMP - int need_resched, nrpolling; - - preempt_disable(); - /* minimise the chance of sending an interrupt to poll_idle() */ - nrpolling = test_tsk_thread_flag(p,TIF_POLLING_NRFLAG); - need_resched = test_and_set_tsk_thread_flag(p,TIF_NEED_RESCHED); - nrpolling |= test_tsk_thread_flag(p,TIF_POLLING_NRFLAG); - - if (!need_resched && !nrpolling && (task_cpu(p) != smp_processor_id())) - smp_send_reschedule(task_cpu(p)); - preempt_enable(); -#else - set_tsk_need_resched(p); -#endif -} - #ifdef CONFIG_SMP /* @@ -347,7 +384,7 @@ void wait_task_inactive(task_t * p) repeat: preempt_disable(); rq = task_rq(p); - if (unlikely(task_running(rq, p))) { + if (unlikely(task_running(p))) { cpu_relax(); /* * enable/disable preemption just to make this @@ -358,7 +395,7 @@ repeat: goto repeat; } rq = task_rq_lock(p, &flags); - if (unlikely(task_running(rq, p))) { + if (unlikely(task_running(p))) { task_rq_unlock(rq, &flags); preempt_enable(); goto repeat; @@ -366,8 +403,43 @@ repeat: task_rq_unlock(rq, &flags); preempt_enable(); } + +/* + * resched_task - mark a task 'to be rescheduled now'. + * + * On UP this means the setting of the need_resched flag, on SMP it + * might also involve a cross-CPU call to trigger the scheduler on + * the target CPU. 
+ */ +static inline void resched_task(task_t *p) +{ + int need_resched, nrpolling; + + preempt_disable(); + /* minimise the chance of sending an interrupt to poll_idle() */ + nrpolling = test_tsk_thread_flag(p,TIF_POLLING_NRFLAG); + need_resched = test_and_set_tsk_thread_flag(p,TIF_NEED_RESCHED); + nrpolling |= test_tsk_thread_flag(p,TIF_POLLING_NRFLAG); + + if (!need_resched && !nrpolling && (task_cpu(p) != smp_processor_id())) + smp_send_reschedule(task_cpu(p)); + preempt_enable(); +} + +#else + +static inline void resched_task(task_t *p) +{ + set_tsk_need_resched(p); +} + #endif +static inline void resched_cpu(int cpu) +{ + resched_task(cpu_curr_ptr(cpu)); +} + /* * kick_if_running - kick the remote CPU if the task is running currently. * @@ -380,10 +452,40 @@ repeat: */ void kick_if_running(task_t * p) { - if ((task_running(task_rq(p), p)) && (task_cpu(p) != smp_processor_id())) + if ((task_running(p)) && (task_cpu(p) != smp_processor_id())) resched_task(p); } +static void wake_up_cpu(runqueue_t *rq, int cpu, task_t *p) +{ + cpu_t *curr_cpu; + task_t *curr; + int idx; + + if (idle_cpu(cpu)) + return resched_cpu(cpu); + + for_each_sibling(idx, rq) { + curr_cpu = rq->cpu + idx; + if (!task_allowed(p, curr_cpu->cpu)) + continue; + if (curr_cpu->idle == curr_cpu->curr) + return resched_cpu(curr_cpu->cpu); + } + + if (p->prio < cpu_curr_ptr(cpu)->prio) + return resched_task(cpu_curr_ptr(cpu)); + + for_each_sibling(idx, rq) { + curr_cpu = rq->cpu + idx; + if (!task_allowed(p, curr_cpu->cpu)) + continue; + curr = curr_cpu->curr; + if (p->prio < curr->prio) + return resched_task(curr); + } +} + /*** * try_to_wake_up - wake up a thread * @p: the to-be-woken-up thread @@ -412,7 +514,7 @@ repeat_lock_task: * Fast-migrate the task if it's not running or runnable * currently. Do not violate hard affinity. */ - if (unlikely(sync && !task_running(rq, p) && + if (unlikely(sync && !task_running(p) && (task_cpu(p) != smp_processor_id()) && (p->cpus_allowed & (1UL << smp_processor_id())))) { @@ -424,8 +526,8 @@ repeat_lock_task: rq->nr_uninterruptible--; activate_task(p, rq); - if (p->prio < rq->curr->prio) - resched_task(rq->curr); + wake_up_cpu(rq, task_cpu(p), p); + success = 1; } p->state = TASK_RUNNING; @@ -543,7 +645,9 @@ unsigned long nr_running(void) unsigned long i, sum = 0; for (i = 0; i < NR_CPUS; i++) - sum += cpu_rq(i)->nr_running; + /* Shared runqueues are counted only once. */ + if (!cpu_idx(i)) + sum += cpu_rq(i)->nr_running; return sum; } @@ -553,7 +657,9 @@ unsigned long nr_uninterruptible(void) unsigned long i, sum = 0; for (i = 0; i < NR_CPUS; i++) - sum += cpu_rq(i)->nr_uninterruptible; + /* Shared runqueues are counted only once. */ + if (!cpu_idx(i)) + sum += cpu_rq(i)->nr_uninterruptible; return sum; } @@ -602,7 +708,23 @@ static inline void double_rq_unlock(runq spin_unlock(&rq2->lock); } -#if CONFIG_SMP +/* + * One of the idle_cpu_tick() and busy_cpu_tick() functions will + * get called every timer tick, on every CPU. Our balancing action + * frequency and balancing agressivity depends on whether the CPU is + * idle or not. + * + * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on + * systems with HZ=100, every 10 msecs.) 
+ */ +#define BUSY_REBALANCE_TICK (HZ/4 ?: 1) +#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1) + +#if !CONFIG_SMP + +static inline void load_balance(runqueue_t *rq, int this_cpu, int idle) { } + +#else /* * double_lock_balance - lock the busiest runqueue @@ -718,12 +840,7 @@ static inline void pull_task(runqueue_t set_task_cpu(p, this_cpu); this_rq->nr_running++; enqueue_task(p, this_rq->active); - /* - * Note that idle threads have a prio of MAX_PRIO, for this test - * to be always true for them. - */ - if (p->prio < this_rq->curr->prio) - set_need_resched(); + wake_up_cpu(this_rq, this_cpu, p); } /* @@ -734,9 +851,9 @@ static inline void pull_task(runqueue_t * We call this with the current runqueue locked, * irqs disabled. */ -static void load_balance(runqueue_t *this_rq, int idle) +static void load_balance(runqueue_t *this_rq, int this_cpu, int idle) { - int imbalance, idx, this_cpu = smp_processor_id(); + int imbalance, idx; runqueue_t *busiest; prio_array_t *array; list_t *head, *curr; @@ -783,12 +900,14 @@ skip_queue: * 1) running (obviously), or * 2) cannot be migrated to this CPU due to cpus_allowed, or * 3) are cache-hot on their current CPU. + * + * (except if we are in idle mode which is a more agressive + * form of rebalancing.) */ -#define CAN_MIGRATE_TASK(p,rq,this_cpu) \ - ((jiffies - (p)->sleep_timestamp > cache_decay_ticks) && \ - !task_running(rq, p) && \ - ((p)->cpus_allowed & (1UL << (this_cpu)))) +#define CAN_MIGRATE_TASK(p,rq,cpu) \ + ((idle || (jiffies - (p)->sleep_timestamp > cache_decay_ticks)) && \ + !task_running(p) && task_allowed(p, cpu)) curr = curr->prev; @@ -811,27 +930,134 @@ out: ; } -/* - * One of the idle_cpu_tick() and busy_cpu_tick() functions will - * get called every timer tick, on every CPU. Our balancing action - * frequency and balancing agressivity depends on whether the CPU is - * idle or not. - * - * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on - * systems with HZ=100, every 10 msecs.) - */ -#define BUSY_REBALANCE_TICK (HZ/4 ?: 1) -#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1) +#endif -static inline void idle_tick(runqueue_t *rq) +static inline void idle_tick(runqueue_t *rq, unsigned long j) { - if (jiffies % IDLE_REBALANCE_TICK) + if (j % IDLE_REBALANCE_TICK) return; spin_lock(&rq->lock); - load_balance(rq, 1); + load_balance(rq, smp_processor_id(), 1); spin_unlock(&rq->lock); } +#if CONFIG_SHARE_RUNQUEUE +static void active_load_balance(runqueue_t *this_rq, int this_cpu) +{ + runqueue_t *rq; + int i, idx; + + for (i = 0; i < NR_CPUS; i++) { + if (!cpu_online(i)) + continue; + rq = cpu_rq(i); + if (rq == this_rq) + continue; + /* + * Any SMT-specific imbalance? + */ + for_each_sibling(idx, rq) + if (rq->cpu[idx].idle == rq->cpu[idx].curr) + goto next_cpu; + + /* + * At this point it's sure that we have a SMT + * imbalance: this (physical) CPU is idle but + * another CPU has two tasks running. + * + * We wake up one of the migration threads (it + * doesnt matter which one) and let it fix things up: + */ + if (!cpu_active_balance(this_cpu)) { + cpu_active_balance(this_cpu) = 1; + spin_unlock(&this_rq->lock); + wake_up_process(rq->cpu[0].migration_thread); + spin_lock(&this_rq->lock); + } +next_cpu: + ; + } +} + +static void do_active_balance(runqueue_t *this_rq, int this_cpu) +{ + runqueue_t *rq; + int i, idx; + + spin_unlock(&this_rq->lock); + + cpu_active_balance(this_cpu) = 0; + + /* + * Is the imbalance still present? 
+ */ + for_each_sibling(idx, this_rq) + if (this_rq->cpu[idx].idle == this_rq->cpu[idx].curr) + goto out; + + for (i = 0; i < NR_CPUS; i++) { + if (!cpu_online(i)) + continue; + rq = cpu_rq(i); + if (rq == this_rq) + continue; + + /* completely idle CPU? */ + if (rq->nr_running) + continue; + + /* + * At this point it's reasonably sure that we have an + * imbalance. Since we are the migration thread, try to + * balance a thread over to the target queue. + */ + spin_lock(&rq->lock); + load_balance(rq, i, 1); + spin_unlock(&rq->lock); + goto out; + } +out: + spin_lock(&this_rq->lock); +} + +/* + * This routine is called to map a CPU into another CPU's runqueue. + * + * This must be called during bootup with the merged runqueue having + * no tasks. + */ +void sched_map_runqueue(int cpu1, int cpu2) +{ + runqueue_t *rq1 = cpu_rq(cpu1); + runqueue_t *rq2 = cpu_rq(cpu2); + int cpu2_idx_orig = cpu_idx(cpu2), cpu2_idx; + + printk("sched_merge_runqueues: CPU#%d <=> CPU#%d, on CPU#%d.\n", cpu1, cpu2, smp_processor_id()); + if (rq1 == rq2) + BUG(); + if (rq2->nr_running) + BUG(); + /* + * At this point, we dont have anything in the runqueue yet. So, + * there is no need to move processes between the runqueues. + * Only, the idle processes should be combined and accessed + * properly. + */ + cpu2_idx = rq1->nr_cpus++; + + if (rq_idx(cpu1) != cpu1) + BUG(); + rq_idx(cpu2) = cpu1; + cpu_idx(cpu2) = cpu2_idx; + rq1->cpu[cpu2_idx].cpu = cpu2; + rq1->cpu[cpu2_idx].idle = rq2->cpu[cpu2_idx_orig].idle; + rq1->cpu[cpu2_idx].curr = rq2->cpu[cpu2_idx_orig].curr; + INIT_LIST_HEAD(&rq1->cpu[cpu2_idx].migration_queue); + + /* just to be safe: */ + rq2->cpu[cpu2_idx_orig].idle = NULL; + rq2->cpu[cpu2_idx_orig].curr = NULL; +} #endif /* @@ -856,15 +1082,14 @@ void scheduler_tick(int user_ticks, int { int cpu = smp_processor_id(); runqueue_t *rq = this_rq(); + unsigned long j = jiffies; task_t *p = current; - if (p == rq->idle) { + if (p == cpu_idle_ptr(cpu)) { /* note: this timer irq context must be accounted for as well */ if (irq_count() - HARDIRQ_OFFSET >= SOFTIRQ_OFFSET) kstat.per_cpu_system[cpu] += sys_ticks; -#if CONFIG_SMP - idle_tick(rq); -#endif + idle_tick(rq, j); return; } if (TASK_NICE(p) > 0) @@ -873,12 +1098,13 @@ void scheduler_tick(int user_ticks, int kstat.per_cpu_user[cpu] += user_ticks; kstat.per_cpu_system[cpu] += sys_ticks; + spin_lock(&rq->lock); /* Task might have expired already, but not scheduled off yet */ if (p->array != rq->active) { set_tsk_need_resched(p); + spin_unlock(&rq->lock); return; } - spin_lock(&rq->lock); if (unlikely(rt_task(p))) { /* * RR tasks need a special form of timeslice management. 
@@ -914,16 +1140,14 @@ void scheduler_tick(int user_ticks, int if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) { if (!rq->expired_timestamp) - rq->expired_timestamp = jiffies; + rq->expired_timestamp = j; enqueue_task(p, rq->expired); } else enqueue_task(p, rq->active); } out: -#if CONFIG_SMP - if (!(jiffies % BUSY_REBALANCE_TICK)) - load_balance(rq, 0); -#endif + if (!(j % BUSY_REBALANCE_TICK)) + load_balance(rq, smp_processor_id(), 0); spin_unlock(&rq->lock); } @@ -935,18 +1159,20 @@ void scheduling_functions_start_here(voi asmlinkage void schedule(void) { task_t *prev, *next; - runqueue_t *rq; prio_array_t *array; + int idx, this_cpu; + runqueue_t *rq; list_t *queue; - int idx; + int retry = 0; if (unlikely(in_interrupt())) BUG(); +need_resched: #if CONFIG_DEBUG_HIGHMEM check_highmem_ptes(); #endif -need_resched: + this_cpu = smp_processor_id(); preempt_disable(); prev = current; rq = this_rq(); @@ -975,12 +1201,14 @@ need_resched: } pick_next_task: if (unlikely(!rq->nr_running)) { -#if CONFIG_SMP - load_balance(rq, 1); + load_balance(rq, this_cpu, 1); if (rq->nr_running) goto pick_next_task; -#endif - next = rq->idle; + active_load_balance(rq, this_cpu); + if (rq->nr_running) + goto pick_next_task; +pick_idle: + next = cpu_idle_ptr(this_cpu); rq->expired_timestamp = 0; goto switch_tasks; } @@ -996,9 +1224,38 @@ pick_next_task: rq->expired_timestamp = 0; } +new_array: idx = sched_find_first_bit(array->bitmap); queue = array->queue + idx; next = list_entry(queue->next, task_t, run_list); + if ((next != prev) && (rq_nr_cpus(rq) > 1)) { + list_t *tmp = queue->next; + + while (task_running(next) || !task_allowed(next, this_cpu)) { + tmp = tmp->next; + if (tmp != queue) { + next = list_entry(tmp, task_t, run_list); + continue; + } + idx = find_next_bit(array->bitmap, MAX_PRIO, ++idx); + if (idx == MAX_PRIO) { + if (retry || !rq->expired->nr_active) + goto pick_idle; + /* + * To avoid infinite changing of arrays, + * when we have only tasks runnable by + * sibling. + */ + retry = 1; + + array = rq->expired; + goto new_array; + } + queue = array->queue + idx; + tmp = queue->next; + next = list_entry(tmp, task_t, run_list); + } + } switch_tasks: prefetch(next); @@ -1006,7 +1263,8 @@ switch_tasks: if (likely(prev != next)) { rq->nr_switches++; - rq->curr = next; + cpu_curr_ptr(this_cpu) = next; + set_task_cpu(next, this_cpu); prepare_arch_switch(rq, next); prev = context_switch(prev, next); @@ -1264,9 +1522,8 @@ void set_user_nice(task_t *p, long nice) * If the task is running and lowered its priority, * or increased its priority then reschedule its CPU: */ - if ((NICE_TO_PRIO(nice) < p->static_prio) || - task_running(rq, p)) - resched_task(rq->curr); + if ((NICE_TO_PRIO(nice) < p->static_prio) || task_running(p)) + resched_task(cpu_curr_ptr(task_cpu(p))); } out_unlock: task_rq_unlock(rq, &flags); @@ -1339,15 +1596,6 @@ int task_nice(task_t *p) } /** - * idle_cpu - is a given cpu idle currently? - * @cpu: the processor in question. - */ -int idle_cpu(int cpu) -{ - return cpu_curr(cpu) == cpu_rq(cpu)->idle; -} - -/** * find_process_by_pid - find a process with a matching PID value. * @pid: the pid in question. 
*/ @@ -1863,7 +2111,7 @@ void __init init_idle(task_t *idle, int local_irq_save(flags); double_rq_lock(idle_rq, rq); - idle_rq->curr = idle_rq->idle = idle; + cpu_curr_ptr(cpu) = cpu_idle_ptr(cpu) = idle; deactivate_task(idle, rq); idle->array = NULL; idle->prio = MAX_PRIO; @@ -1918,6 +2166,7 @@ void set_cpus_allowed(task_t *p, unsigne unsigned long flags; migration_req_t req; runqueue_t *rq; + int cpu; #if 0 /* FIXME: Grab cpu_lock, return error on this case. --RR */ new_mask &= cpu_online_map; @@ -1940,16 +2189,17 @@ void set_cpus_allowed(task_t *p, unsigne * If the task is not on a runqueue (and not running), then * it is sufficient to simply update the task's cpu field. */ - if (!p->array && !task_running(rq, p)) { + if (!p->array && !task_running(p)) { set_task_cpu(p, __ffs(p->cpus_allowed)); task_rq_unlock(rq, &flags); goto out; } init_completion(&req.done); req.task = p; - list_add(&req.list, &rq->migration_queue); + cpu = task_cpu(p); + list_add(&req.list, migration_queue(cpu)); task_rq_unlock(rq, &flags); - wake_up_process(rq->migration_thread); + wake_up_process(migration_thread(cpu)); wait_for_completion(&req.done); out: @@ -1957,15 +2207,15 @@ out: } /* - * migration_thread - this is a highprio system thread that performs + * migration_task - this is a highprio system thread that performs * thread migration by 'pulling' threads into the target runqueue. */ -static int migration_thread(void * data) +static int migration_task(void * data) { struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 }; int cpu = (long) data; runqueue_t *rq; - int ret; + int ret, idx; daemonize(); sigfillset(¤t->blocked); @@ -1980,11 +2230,12 @@ static int migration_thread(void * data) * to the target CPU - we'll wake up there. */ if (smp_processor_id() != cpu) - printk("migration_task %d on cpu=%d\n", cpu, smp_processor_id()); + printk("migration_thread %d on cpu=%d\n", cpu, smp_processor_id()); ret = setscheduler(0, SCHED_FIFO, ¶m); rq = this_rq(); - rq->migration_thread = current; + migration_thread(cpu) = current; + idx = cpu_idx(cpu); sprintf(current->comm, "migration_CPU%d", smp_processor_id()); @@ -1997,7 +2248,9 @@ static int migration_thread(void * data) task_t *p; spin_lock_irqsave(&rq->lock, flags); - head = &rq->migration_queue; + if (cpu_active_balance(cpu)) + do_active_balance(rq, cpu); + head = migration_queue(cpu); current->state = TASK_INTERRUPTIBLE; if (list_empty(head)) { spin_unlock_irqrestore(&rq->lock, flags); @@ -2044,13 +2297,14 @@ static int migration_call(struct notifie unsigned long action, void *hcpu) { + long cpu = (long) hcpu; + switch (action) { case CPU_ONLINE: - printk("Starting migration thread for cpu %li\n", - (long)hcpu); - kernel_thread(migration_thread, hcpu, + printk("Starting migration thread for cpu %li\n", cpu); + kernel_thread(migration_task, hcpu, CLONE_FS | CLONE_FILES | CLONE_SIGNAL); - while (!cpu_rq((long)hcpu)->migration_thread) + while (!migration_thread(cpu)) yield(); break; } @@ -2097,11 +2351,20 @@ void __init sched_init(void) for (i = 0; i < NR_CPUS; i++) { prio_array_t *array; + /* + * Start with a 1:1 mapping between CPUs and runqueues: + */ +#if CONFIG_SHARE_RUNQUEUE + rq_idx(i) = i; + cpu_idx(i) = 0; +#endif rq = cpu_rq(i); rq->active = rq->arrays; rq->expired = rq->arrays + 1; spin_lock_init(&rq->lock); - INIT_LIST_HEAD(&rq->migration_queue); + INIT_LIST_HEAD(migration_queue(i)); + rq->nr_cpus = 1; + rq->cpu[cpu_idx(i)].cpu = i; for (j = 0; j < 2; j++) { array = rq->arrays + j; @@ -2117,9 +2380,8 @@ void __init sched_init(void) * We have 
to do a little magic to get the first * thread right in SMP mode. */ - rq = this_rq(); - rq->curr = current; - rq->idle = current; + cpu_curr_ptr(smp_processor_id()) = current; + cpu_idle_ptr(smp_processor_id()) = current; set_task_cpu(current, smp_processor_id()); wake_up_process(current);