Name: Hyperthread Scheduler Modifications
Author: Ingo Molnar
Status: Experimental

D: Symmetric multithreading (hyperthreading) is an interesting new concept
D: that IMO deserves full scheduler support. Physical CPUs can have multiple
D: (typically 2) logical CPUs embedded, and can run multiple tasks 'in
D: parallel' by utilizing fast hardware-based context-switching between the
D: two register sets upon things like cache misses or special instructions.
D: To the OS the logical CPUs are almost indistinguishable from physical
D: CPUs. In fact the current scheduler treats each logical CPU as a separate
D: physical CPU - which works, but does not maximize multiprocessing
D: performance on SMT/HT boxes.
D:
D: The following properties have to be provided by a scheduler that wants to
D: be 'fully HT-aware':
D:
D:  - HT-aware passive load-balancing: the IRQ-driven balancing has to be
D:    per-physical-CPU, not per-logical-CPU.
D:
D:    Otherwise it might happen that one physical CPU runs 2 tasks while
D:    another physical CPU runs none. The stock scheduler does not recognize
D:    this condition as an 'imbalance' - to it the first two logical CPUs
D:    appear to have one task each and the second two logical CPUs appear to
D:    have none, because it does not realize that the two logical CPUs in
D:    each pair belong to the same physical CPU.
D:
D:  - 'Active' load-balancing when a logical CPU goes idle and thus causes a
D:    physical-CPU imbalance.
D:
D:    This is a mechanism that simply does not exist in the stock 1:1
D:    scheduler - there the imbalance caused by an idle CPU can be solved by
D:    the normal load-balancer. In the HT case the situation is special
D:    because the source physical CPU might have just two tasks running,
D:    both runnable - a situation the stock load-balancer cannot handle,
D:    since running tasks are hard to migrate away. But doing so is
D:    essential - otherwise one physical CPU can get stuck running 2 tasks
D:    while another physical CPU stays idle.
D:
D:  - HT-aware task pickup.
D:
D:    When the scheduler picks a new task, it should first consider all
D:    tasks queued on the same physical CPU, before trying to pull in tasks
D:    from other CPUs. The stock scheduler only picks tasks that were
D:    scheduled to that particular logical CPU.
D:
D:  - HT-aware affinity.
D:
D:    Tasks should attempt to 'stick' to physical CPUs, not logical CPUs.
D:
D:  - HT-aware wakeup.
D:
D:    Again this is something completely new - the stock scheduler only
D:    knows about the 'current' CPU, it does not know about any sibling
D:    logical CPUs [i.e. logical CPUs on the same physical CPU]. On HT, if a
D:    thread is woken up on a logical CPU that is already executing a task,
D:    and a sibling logical CPU is idle, then the sibling CPU has to be
D:    woken up to execute the newly woken task immediately (a sketch of this
D:    decision order follows at the end of this description).
D:
D: The attached patch (against 2.5.31-BK-curr) implements all the above
D: HT-scheduling needs by introducing the concept of a shared runqueue:
D: multiple CPUs can share the same runqueue. A shared, per-physical-CPU
D: runqueue magically fulfills all the above HT-scheduling needs. Obviously
D: this complicates scheduling and load-balancing somewhat (see the patch
D: for details), so great care has been taken not to impact the non-HT
D: schedulers (SMP, UP). In fact the SMP scheduler is a compile-time special
D: case of the HT scheduler (and the UP scheduler is a compile-time special
D: case of the SMP scheduler).
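D:
D: As a rough illustration of that indirection - a self-contained userspace
D: toy, not code from the patch; model_map_runqueue() merely stands in for
D: the patch's sched_map_runqueue(), and the array/macro below only mimic
D: the real rq_idx()/cpu_rq() remapping:
D:
D:	/* Toy userspace model of the shared-runqueue indirection. All
D:	 * names here are illustrative only, not the patch's code. */
D:	#include <assert.h>
D:	#include <stdio.h>
D:
D:	#define NR_CPUS 4
D:
D:	struct model_rq { int nr_running; };
D:
D:	static struct model_rq runqueues[NR_CPUS];
D:	static int rq_idx[NR_CPUS];	/* which runqueue a logical CPU uses */
D:
D:	#define cpu_rq(cpu) (&runqueues[rq_idx[(cpu)]])
D:
D:	/* map cpu2 into cpu1's runqueue - the moral equivalent of
D:	 * sched_map_runqueue(cpu1, cpu2) in the patch: */
D:	static void model_map_runqueue(int cpu1, int cpu2)
D:	{
D:		rq_idx[cpu2] = rq_idx[cpu1];
D:	}
D:
D:	int main(void)
D:	{
D:		int cpu;
D:
D:		/* boot default: 1:1 mapping, one runqueue per logical CPU */
D:		for (cpu = 0; cpu < NR_CPUS; cpu++)
D:			rq_idx[cpu] = cpu;
D:
D:		/* merge HT siblings: CPUs 0/1 and 2/3 share a package */
D:		model_map_runqueue(0, 1);
D:		model_map_runqueue(2, 3);
D:
D:		assert(cpu_rq(0) == cpu_rq(1));	/* siblings share one runqueue */
D:		assert(cpu_rq(0) != cpu_rq(2));	/* different packages do not */
D:
D:		cpu_rq(1)->nr_running++;	/* a task queued via CPU#1 ... */
D:		printf("CPU#0 sees nr_running=%d\n",
D:			cpu_rq(0)->nr_running);	/* ... is visible to CPU#0 */
D:		return 0;
D:	}
D:
D: In the 1:1 (SMP/UP) case the mapping stays the identity, which is how the
D: non-HT schedulers remain a compile-time special case of the HT scheduler.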
D:
D: The patch is based on Jun Nakajima's prototyping work - the lowlevel
D: x86/Intel bits are still those from Jun, the sched.c bits are newly
D: implemented and generalized.
D:
D: There's a single flexible interface for lowlevel boot code to set up
D: physical CPUs: sched_map_runqueue(cpu1, cpu2) maps cpu2 into cpu1's
D: runqueue. The patch also implements the lowlevel bits for P4 HT boxes for
D: the 2-siblings-per-package case.
D:
D: (NUMA systems whose tightly coupled CPUs have small per-CPU caches
D: protected by a large shared L3 cache might benefit from sharing a
D: runqueue as well - but the target of this concept is SMT.)
D:
D: Some numbers:
D:
D: Compiling a standalone floppy.c in an infinite loop takes 2.55 seconds
D: per iteration. Starting two such loops in parallel on a 2-physical,
D: 2-logical (4 logical CPUs total) P4 HT box gives the following numbers:
D:
D:    2.5.31-BK-curr:       fluctuates between 2.60 and 4.6 seconds
D:
D:    BK-curr + sched-F3:   stable 2.60 second results
D:
D: The results under the stock scheduler depend on pure luck: on which CPUs
D: the tasks get scheduled. In the HT-aware case each task gets scheduled on
D: a separate physical CPU, all the time.
D:
D: Compiling the kernel source via "make -j2" [which under-utilizes the
D: CPUs]:
D:
D:    2.5.31-BK-curr:       45.3 sec
D:
D:    BK-curr + sched-F3:   41.3 sec
D:
D: i.e. a ~10% improvement. The numbers are the best results picked from
D: lots of (>10) runs. The no-HT numbers fluctuate much more (again the
D: randomness effect), so the average compilation time in the no-HT case is
D: higher.
D:
D: Saturated-compilation "make -j5" results are roughly equivalent, as
D: expected - the one-runqueue-per-logical-CPU concept works adequately when
D: the number of tasks is larger than the number of logical CPUs. The stock
D: scheduler works well on HT boxes in the boundary conditions: when there
D: is 1 task running, and when there are more than nr_cpus tasks running.
D:
D: The patch also unifies some of the other code and removes a few more
D: #ifdef CONFIG_SMP branches from the scheduler proper.
D:
D: (The patch compiles/boots/works just fine on UP and SMP as well, tested
D: on the P4 box and on a PIII SMP box.)
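D:
D: To make the 'HT-aware wakeup' rule above concrete, here is a minimal,
D: self-contained userspace sketch of the decision order (cpus_allowed
D: checks and locking are left out; ht_pick_wakeup_cpu() and struct
D: model_cpu are illustrative names only - the patch's real implementation
D: is wake_up_cpu() in kernel/sched.c):
D:
D:	/* Toy model of the HT-aware wakeup preference order, not the
D:	 * patch's code. Lower prio value == higher priority, as in the
D:	 * kernel. */
D:	#include <stdio.h>
D:
D:	#define SIBLINGS 2
D:
D:	struct model_cpu {
D:		int idle;	/* 1 if this logical CPU runs its idle thread */
D:		int curr_prio;	/* priority of the currently running task */
D:	};
D:
D:	static int ht_pick_wakeup_cpu(struct model_cpu cpu[SIBLINGS],
D:				      int target, int woken_prio)
D:	{
D:		int i;
D:
D:		if (cpu[target].idle)			/* 1) target is idle */
D:			return target;
D:		for (i = 0; i < SIBLINGS; i++)		/* 2) any idle sibling */
D:			if (cpu[i].idle)
D:				return i;
D:		if (woken_prio < cpu[target].curr_prio)	/* 3) preempt target */
D:			return target;
D:		for (i = 0; i < SIBLINGS; i++)		/* 4) preempt a sibling */
D:			if (woken_prio < cpu[i].curr_prio)
D:				return i;
D:		return -1;				/* 5) nobody to resched */
D:	}
D:
D:	int main(void)
D:	{
D:		/* CPU#0 busy with a prio-120 task, sibling CPU#1 idle: */
D:		struct model_cpu cpu[SIBLINGS] = { { 0, 120 }, { 1, 140 } };
D:
D:		/* a prio-125 task woken "on" CPU#0 goes to the idle sibling: */
D:		printf("wake on CPU#%d\n", ht_pick_wakeup_cpu(cpu, 0, 125));
D:		return 0;
D:	}
D:
D: Running it prints "wake on CPU#1": the woken task is pushed to the idle
D: sibling instead of waiting behind the higher-priority task on CPU#0.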
D: D: Testreports, comments, suggestions welcome, D: D: Ingo diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .8164-linux-2.5.33/arch/i386/config.in .8164-linux-2.5.33.updated/arch/i386/config.in --- .8164-linux-2.5.33/arch/i386/config.in 2002-08-28 09:29:39.000000000 +1000 +++ .8164-linux-2.5.33.updated/arch/i386/config.in 2002-09-09 17:26:06.000000000 +1000 @@ -155,6 +155,20 @@ if [ "$CONFIG_MWINCHIP3D" = "y" ]; then fi bool 'Symmetric multi-processing support' CONFIG_SMP + +if [ "$CONFIG_SMP" = "y" ]; then + if [ "$CONFIG_MPENTIUM4" = "y" ]; then + choice 'Hyperthreading Support' \ + "off CONFIG_NR_SIBLINGS_0 \ + 2-siblings CONFIG_NR_SIBLINGS_2" off + fi +fi + +if [ "$CONFIG_NR_SIBLINGS_2" = "y" ]; then + define_bool CONFIG_SHARE_RUNQUEUE y + define_int CONFIG_MAX_NR_SIBLINGS 2 +fi + bool 'Preemptible Kernel' CONFIG_PREEMPT if [ "$CONFIG_SMP" != "y" ]; then bool 'Local APIC support on uniprocessors' CONFIG_X86_UP_APIC diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .8164-linux-2.5.33/arch/i386/kernel/smpboot.c .8164-linux-2.5.33.updated/arch/i386/kernel/smpboot.c --- .8164-linux-2.5.33/arch/i386/kernel/smpboot.c 2002-09-01 12:22:57.000000000 +1000 +++ .8164-linux-2.5.33.updated/arch/i386/kernel/smpboot.c 2002-09-09 17:26:06.000000000 +1000 @@ -1149,16 +1149,32 @@ static void __init smp_boot_cpus(unsigne Dprintk("Boot done.\n"); /* - * If Hyper-Threading is avaialble, construct cpu_sibling_map[], so + * Here we can be sure that there is an IO-APIC in the system. Let's + * go and set it up: + */ + if (!skip_ioapic_setup && nr_ioapics) + setup_IO_APIC(); + + setup_boot_APIC_clock(); + + /* + * Synchronize the TSC with the AP + */ + if (cpu_has_tsc && cpucount) + synchronize_tsc_bp(); + /* + * If Hyper-Threading is available, construct cpu_sibling_map[], so * that we can tell the sibling CPU efficiently. */ +printk("cpu_has_ht: %d, smp_num_siblings: %d, num_online_cpus(): %d.\n", cpu_has_ht, smp_num_siblings, num_online_cpus()); if (cpu_has_ht && smp_num_siblings > 1) { for (cpu = 0; cpu < NR_CPUS; cpu++) cpu_sibling_map[cpu] = NO_PROC_ID; for (cpu = 0; cpu < NR_CPUS; cpu++) { int i; - if (!test_bit(cpu, &cpu_callout_map)) continue; + if (!test_bit(cpu, &cpu_callout_map)) + continue; for (i = 0; i < NR_CPUS; i++) { if (i == cpu || !test_bit(i, &cpu_callout_map)) @@ -1174,24 +1190,34 @@ static void __init smp_boot_cpus(unsigne printk(KERN_WARNING "WARNING: No sibling found for CPU %d.\n", cpu); } } - } - -#ifndef CONFIG_VISWS - /* - * Here we can be sure that there is an IO-APIC in the system. Let's - * go and set it up: +#if CONFIG_SHARE_RUNQUEUE + /* At this point APs would have synchronised TSC and waiting for + * smp_commenced, with their APIC timer disabled. So BP can go ahead + * do some initializations for Hyper-Threading(if needed). */ - if (!skip_ioapic_setup && nr_ioapics) - setup_IO_APIC(); -#endif - - setup_boot_APIC_clock(); + for (cpu = 0; cpu < NR_CPUS; cpu++) { + int i; + if (!test_bit(cpu, &cpu_callout_map)) + continue; + for (i = 0; i < NR_CPUS; i++) { + if (i <= cpu) + continue; + if (!test_bit(i, &cpu_callout_map)) + continue; - /* - * Synchronize the TSC with the AP - */ - if (cpu_has_tsc && cpucount) - synchronize_tsc_bp(); + if (phys_proc_id[cpu] != phys_proc_id[i]) + continue; + /* + * merge runqueues, resulting in one + * runqueue per package: + */ + sched_map_runqueue(cpu, i); + break; + } + } +#endif + } + } /* These are wrappers to interface to the new boot process. 
Someone diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .8164-linux-2.5.33/include/asm-i386/apic.h .8164-linux-2.5.33.updated/include/asm-i386/apic.h --- .8164-linux-2.5.33/include/asm-i386/apic.h 2002-07-27 15:24:39.000000000 +1000 +++ .8164-linux-2.5.33.updated/include/asm-i386/apic.h 2002-09-09 17:26:06.000000000 +1000 @@ -98,4 +98,6 @@ extern unsigned int nmi_watchdog; #endif /* CONFIG_X86_LOCAL_APIC */ +extern int phys_proc_id[NR_CPUS]; + #endif /* __ASM_APIC_H */ diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .8164-linux-2.5.33/include/linux/sched.h .8164-linux-2.5.33.updated/include/linux/sched.h --- .8164-linux-2.5.33/include/linux/sched.h 2002-09-01 12:23:07.000000000 +1000 +++ .8164-linux-2.5.33.updated/include/linux/sched.h 2002-09-09 17:26:06.000000000 +1000 @@ -147,6 +147,7 @@ typedef struct task_struct task_t; extern void sched_init(void); extern void init_idle(task_t *idle, int cpu); +extern void sched_map_runqueue(int cpu1, int cpu2); extern void show_state(void); extern void cpu_init (void); extern void trap_init(void); diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .8164-linux-2.5.33/kernel/sched.c .8164-linux-2.5.33.updated/kernel/sched.c --- .8164-linux-2.5.33/kernel/sched.c 2002-09-01 12:23:08.000000000 +1000 +++ .8164-linux-2.5.33.updated/kernel/sched.c 2002-09-09 17:34:11.000000000 +1000 @@ -20,15 +20,15 @@ #include #include #include +#include #include +#include +#include #include #include #include #include #include -#include -#include -#include /* * Convert user-nice values [ -20 ... 0 ... 19 ] @@ -137,6 +137,48 @@ struct prio_array { }; /* + * It's possible for two CPUs to share the same runqueue. + * This makes sense if they eg. share caches. + * + * We take the common 1:1 (SMP, UP) case and optimize it, + * the rest goes via remapping: rq_idx(cpu) gives the + * runqueue on which a particular cpu is on, cpu_idx(cpu) + * gives the rq-specific index of the cpu. + * + * (Note that the generic scheduler code does not impose any + * restrictions on the mappings - there can be 4 CPUs per + * runqueue or even assymetric mappings.) + */ +#if CONFIG_SHARE_RUNQUEUE +# define MAX_NR_SIBLINGS CONFIG_MAX_NR_SIBLINGS + static long __rq_idx[NR_CPUS] __cacheline_aligned; + static long __cpu_idx[NR_CPUS] __cacheline_aligned; +# define rq_idx(cpu) (__rq_idx[(cpu)]) +# define cpu_idx(cpu) (__cpu_idx[(cpu)]) +# define for_each_sibling(idx, rq) \ + for ((idx) = 0; (idx) < (rq)->nr_cpus; (idx)++) +# define rq_nr_cpus(rq) ((rq)->nr_cpus) +# define cpu_active_balance(c) (cpu_rq(c)->cpu[0].active_balance) +#else +# define MAX_NR_SIBLINGS 1 +# define rq_idx(cpu) (cpu) +# define cpu_idx(cpu) 0 +# define for_each_sibling(idx, rq) while (0) +# define cpu_active_balance(c) 0 +# define do_active_balance(rq, cpu) do { } while (0) +# define rq_nr_cpus(rq) 1 + static inline void active_load_balance(runqueue_t *rq, int this_cpu) { } +#endif + +typedef struct cpu_s { + task_t *curr, *idle; + task_t *migration_thread; + list_t migration_queue; + int active_balance; + int cpu; +} cpu_t; + +/* * This is the main, per-CPU runqueue data structure. 
* * Locking rule: those places that want to lock multiple runqueues @@ -147,30 +189,42 @@ struct runqueue { spinlock_t lock; unsigned long nr_running, nr_switches, expired_timestamp, nr_uninterruptible; - task_t *curr, *idle; prio_array_t *active, *expired, arrays[2]; int prev_nr_running[NR_CPUS]; - task_t *migration_thread; - list_t migration_queue; + int nr_cpus; + cpu_t cpu[MAX_NR_SIBLINGS]; } ____cacheline_aligned; static struct runqueue runqueues[NR_CPUS] __cacheline_aligned; -#define cpu_rq(cpu) (runqueues + (cpu)) +#define cpu_rq(cpu) (runqueues + (rq_idx(cpu))) +#define cpu_int(c) ((cpu_rq(c))->cpu + cpu_idx(c)) +#define cpu_curr_ptr(cpu) (cpu_int(cpu)->curr) +#define cpu_idle_ptr(cpu) (cpu_int(cpu)->idle) + #define this_rq() cpu_rq(smp_processor_id()) #define task_rq(p) cpu_rq(task_cpu(p)) -#define cpu_curr(cpu) (cpu_rq(cpu)->curr) #define rt_task(p) ((p)->prio < MAX_RT_PRIO) +#define migration_thread(cpu) (cpu_int(cpu)->migration_thread) +#define migration_queue(cpu) (&cpu_int(cpu)->migration_queue) + +#if NR_CPUS > 1 +# define task_allowed(p, cpu) ((p)->cpus_allowed & (1UL << (cpu))) +#else +# define task_allowed(p, cpu) 1 +#endif + /* * Default context-switch locking: */ #ifndef prepare_arch_switch # define prepare_arch_switch(rq, next) do { } while(0) # define finish_arch_switch(rq, next) spin_unlock_irq(&(rq)->lock) -# define task_running(rq, p) ((rq)->curr == (p)) +# define task_running(p) \ + (cpu_curr_ptr(task_cpu(p)) == (p)) #endif /* @@ -217,6 +271,15 @@ static inline void rq_unlock(runqueue_t spin_unlock_irq(&rq->lock); } +/** + * idle_cpu - is a given cpu idle currently? + * @cpu: the processor in question. + */ +inline int idle_cpu(int cpu) +{ + return cpu_curr_ptr(cpu) == cpu_idle_ptr(cpu); +} + /* * Adding/removing a task to/from a priority array: */ @@ -305,32 +368,6 @@ static inline void deactivate_task(struc p->array = NULL; } -/* - * resched_task - mark a task 'to be rescheduled now'. - * - * On UP this means the setting of the need_resched flag, on SMP it - * might also involve a cross-CPU call to trigger the scheduler on - * the target CPU. - */ -static inline void resched_task(task_t *p) -{ -#ifdef CONFIG_SMP - int need_resched, nrpolling; - - preempt_disable(); - /* minimise the chance of sending an interrupt to poll_idle() */ - nrpolling = test_tsk_thread_flag(p,TIF_POLLING_NRFLAG); - need_resched = test_and_set_tsk_thread_flag(p,TIF_NEED_RESCHED); - nrpolling |= test_tsk_thread_flag(p,TIF_POLLING_NRFLAG); - - if (!need_resched && !nrpolling && (task_cpu(p) != smp_processor_id())) - smp_send_reschedule(task_cpu(p)); - preempt_enable(); -#else - set_tsk_need_resched(p); -#endif -} - #ifdef CONFIG_SMP /* @@ -347,7 +384,7 @@ void wait_task_inactive(task_t * p) repeat: preempt_disable(); rq = task_rq(p); - if (unlikely(task_running(rq, p))) { + if (unlikely(task_running(p))) { cpu_relax(); /* * enable/disable preemption just to make this @@ -358,7 +395,7 @@ repeat: goto repeat; } rq = task_rq_lock(p, &flags); - if (unlikely(task_running(rq, p))) { + if (unlikely(task_running(p))) { task_rq_unlock(rq, &flags); preempt_enable(); goto repeat; @@ -366,8 +403,43 @@ repeat: task_rq_unlock(rq, &flags); preempt_enable(); } + +/* + * resched_task - mark a task 'to be rescheduled now'. + * + * On UP this means the setting of the need_resched flag, on SMP it + * might also involve a cross-CPU call to trigger the scheduler on + * the target CPU. 
+ */ +static inline void resched_task(task_t *p) +{ + int need_resched, nrpolling; + + preempt_disable(); + /* minimise the chance of sending an interrupt to poll_idle() */ + nrpolling = test_tsk_thread_flag(p,TIF_POLLING_NRFLAG); + need_resched = test_and_set_tsk_thread_flag(p,TIF_NEED_RESCHED); + nrpolling |= test_tsk_thread_flag(p,TIF_POLLING_NRFLAG); + + if (!need_resched && !nrpolling && (task_cpu(p) != smp_processor_id())) + smp_send_reschedule(task_cpu(p)); + preempt_enable(); +} + +#else + +static inline void resched_task(task_t *p) +{ + set_tsk_need_resched(p); +} + #endif +static inline void resched_cpu(int cpu) +{ + resched_task(cpu_curr_ptr(cpu)); +} + /* * kick_if_running - kick the remote CPU if the task is running currently. * @@ -380,10 +452,40 @@ repeat: */ void kick_if_running(task_t * p) { - if ((task_running(task_rq(p), p)) && (task_cpu(p) != smp_processor_id())) + if ((task_running(p)) && (task_cpu(p) != smp_processor_id())) resched_task(p); } +static void wake_up_cpu(runqueue_t *rq, int cpu, task_t *p) +{ + cpu_t *curr_cpu; + task_t *curr; + int idx; + + if (idle_cpu(cpu)) + return resched_cpu(cpu); + + for_each_sibling(idx, rq) { + curr_cpu = rq->cpu + idx; + if (!task_allowed(p, curr_cpu->cpu)) + continue; + if (curr_cpu->idle == curr_cpu->curr) + return resched_cpu(curr_cpu->cpu); + } + + if (p->prio < cpu_curr_ptr(cpu)->prio) + return resched_task(cpu_curr_ptr(cpu)); + + for_each_sibling(idx, rq) { + curr_cpu = rq->cpu + idx; + if (!task_allowed(p, curr_cpu->cpu)) + continue; + curr = curr_cpu->curr; + if (p->prio < curr->prio) + return resched_task(curr); + } +} + /*** * try_to_wake_up - wake up a thread * @p: the to-be-woken-up thread @@ -412,7 +514,7 @@ repeat_lock_task: * Fast-migrate the task if it's not running or runnable * currently. Do not violate hard affinity. */ - if (unlikely(sync && !task_running(rq, p) && + if (unlikely(sync && !task_running(p) && (task_cpu(p) != smp_processor_id()) && (p->cpus_allowed & (1UL << smp_processor_id())))) { @@ -424,8 +526,8 @@ repeat_lock_task: rq->nr_uninterruptible--; activate_task(p, rq); - if (p->prio < rq->curr->prio) - resched_task(rq->curr); + wake_up_cpu(rq, task_cpu(p), p); + success = 1; } p->state = TASK_RUNNING; @@ -543,7 +645,9 @@ unsigned long nr_running(void) unsigned long i, sum = 0; for (i = 0; i < NR_CPUS; i++) - sum += cpu_rq(i)->nr_running; + /* Shared runqueues are counted only once. */ + if (!cpu_idx(i)) + sum += cpu_rq(i)->nr_running; return sum; } @@ -553,7 +657,9 @@ unsigned long nr_uninterruptible(void) unsigned long i, sum = 0; for (i = 0; i < NR_CPUS; i++) - sum += cpu_rq(i)->nr_uninterruptible; + /* Shared runqueues are counted only once. */ + if (!cpu_idx(i)) + sum += cpu_rq(i)->nr_uninterruptible; return sum; } @@ -602,7 +708,23 @@ static inline void double_rq_unlock(runq spin_unlock(&rq2->lock); } -#if CONFIG_SMP +/* + * One of the idle_cpu_tick() and busy_cpu_tick() functions will + * get called every timer tick, on every CPU. Our balancing action + * frequency and balancing agressivity depends on whether the CPU is + * idle or not. + * + * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on + * systems with HZ=100, every 10 msecs.) 
+ */ +#define BUSY_REBALANCE_TICK (HZ/4 ?: 1) +#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1) + +#if !CONFIG_SMP + +static inline void load_balance(runqueue_t *rq, int this_cpu, int idle) { } + +#else /* * double_lock_balance - lock the busiest runqueue @@ -718,12 +840,7 @@ static inline void pull_task(runqueue_t set_task_cpu(p, this_cpu); this_rq->nr_running++; enqueue_task(p, this_rq->active); - /* - * Note that idle threads have a prio of MAX_PRIO, for this test - * to be always true for them. - */ - if (p->prio < this_rq->curr->prio) - set_need_resched(); + wake_up_cpu(this_rq, this_cpu, p); } /* @@ -734,9 +851,9 @@ static inline void pull_task(runqueue_t * We call this with the current runqueue locked, * irqs disabled. */ -static void load_balance(runqueue_t *this_rq, int idle) +static void load_balance(runqueue_t *this_rq, int this_cpu, int idle) { - int imbalance, idx, this_cpu = smp_processor_id(); + int imbalance, idx; runqueue_t *busiest; prio_array_t *array; list_t *head, *curr; @@ -783,12 +900,14 @@ skip_queue: * 1) running (obviously), or * 2) cannot be migrated to this CPU due to cpus_allowed, or * 3) are cache-hot on their current CPU. + * + * (except if we are in idle mode which is a more agressive + * form of rebalancing.) */ -#define CAN_MIGRATE_TASK(p,rq,this_cpu) \ - ((jiffies - (p)->sleep_timestamp > cache_decay_ticks) && \ - !task_running(rq, p) && \ - ((p)->cpus_allowed & (1UL << (this_cpu)))) +#define CAN_MIGRATE_TASK(p,rq,cpu) \ + ((idle || (jiffies - (p)->sleep_timestamp > cache_decay_ticks)) && \ + !task_running(p) && task_allowed(p, cpu)) curr = curr->prev; @@ -811,27 +930,134 @@ out: ; } -/* - * One of the idle_cpu_tick() and busy_cpu_tick() functions will - * get called every timer tick, on every CPU. Our balancing action - * frequency and balancing agressivity depends on whether the CPU is - * idle or not. - * - * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on - * systems with HZ=100, every 10 msecs.) - */ -#define BUSY_REBALANCE_TICK (HZ/4 ?: 1) -#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1) +#endif -static inline void idle_tick(runqueue_t *rq) +static inline void idle_tick(runqueue_t *rq, unsigned long j) { - if (jiffies % IDLE_REBALANCE_TICK) + if (j % IDLE_REBALANCE_TICK) return; spin_lock(&rq->lock); - load_balance(rq, 1); + load_balance(rq, smp_processor_id(), 1); spin_unlock(&rq->lock); } +#if CONFIG_SHARE_RUNQUEUE +static void active_load_balance(runqueue_t *this_rq, int this_cpu) +{ + runqueue_t *rq; + int i, idx; + + for (i = 0; i < NR_CPUS; i++) { + if (!cpu_online(i)) + continue; + rq = cpu_rq(i); + if (rq == this_rq) + continue; + /* + * Any SMT-specific imbalance? + */ + for_each_sibling(idx, rq) + if (rq->cpu[idx].idle == rq->cpu[idx].curr) + goto next_cpu; + + /* + * At this point it's sure that we have a SMT + * imbalance: this (physical) CPU is idle but + * another CPU has two tasks running. + * + * We wake up one of the migration threads (it + * doesnt matter which one) and let it fix things up: + */ + if (!cpu_active_balance(this_cpu)) { + cpu_active_balance(this_cpu) = 1; + spin_unlock(&this_rq->lock); + wake_up_process(rq->cpu[0].migration_thread); + spin_lock(&this_rq->lock); + } +next_cpu: + ; + } +} + +static void do_active_balance(runqueue_t *this_rq, int this_cpu) +{ + runqueue_t *rq; + int i, idx; + + spin_unlock(&this_rq->lock); + + cpu_active_balance(this_cpu) = 0; + + /* + * Is the imbalance still present? 
+ */ + for_each_sibling(idx, this_rq) + if (this_rq->cpu[idx].idle == this_rq->cpu[idx].curr) + goto out; + + for (i = 0; i < NR_CPUS; i++) { + if (!cpu_online(i)) + continue; + rq = cpu_rq(i); + if (rq == this_rq) + continue; + + /* completely idle CPU? */ + if (rq->nr_running) + continue; + + /* + * At this point it's reasonably sure that we have an + * imbalance. Since we are the migration thread, try to + * balance a thread over to the target queue. + */ + spin_lock(&rq->lock); + load_balance(rq, i, 1); + spin_unlock(&rq->lock); + goto out; + } +out: + spin_lock(&this_rq->lock); +} + +/* + * This routine is called to map a CPU into another CPU's runqueue. + * + * This must be called during bootup with the merged runqueue having + * no tasks. + */ +void sched_map_runqueue(int cpu1, int cpu2) +{ + runqueue_t *rq1 = cpu_rq(cpu1); + runqueue_t *rq2 = cpu_rq(cpu2); + int cpu2_idx_orig = cpu_idx(cpu2), cpu2_idx; + + printk("sched_merge_runqueues: CPU#%d <=> CPU#%d, on CPU#%d.\n", cpu1, cpu2, smp_processor_id()); + if (rq1 == rq2) + BUG(); + if (rq2->nr_running) + BUG(); + /* + * At this point, we dont have anything in the runqueue yet. So, + * there is no need to move processes between the runqueues. + * Only, the idle processes should be combined and accessed + * properly. + */ + cpu2_idx = rq1->nr_cpus++; + + if (rq_idx(cpu1) != cpu1) + BUG(); + rq_idx(cpu2) = cpu1; + cpu_idx(cpu2) = cpu2_idx; + rq1->cpu[cpu2_idx].cpu = cpu2; + rq1->cpu[cpu2_idx].idle = rq2->cpu[cpu2_idx_orig].idle; + rq1->cpu[cpu2_idx].curr = rq2->cpu[cpu2_idx_orig].curr; + INIT_LIST_HEAD(&rq1->cpu[cpu2_idx].migration_queue); + + /* just to be safe: */ + rq2->cpu[cpu2_idx_orig].idle = NULL; + rq2->cpu[cpu2_idx_orig].curr = NULL; +} #endif /* @@ -856,15 +1082,14 @@ void scheduler_tick(int user_ticks, int { int cpu = smp_processor_id(); runqueue_t *rq = this_rq(); + unsigned long j = jiffies; task_t *p = current; - if (p == rq->idle) { + if (p == cpu_idle_ptr(cpu)) { /* note: this timer irq context must be accounted for as well */ if (irq_count() - HARDIRQ_OFFSET >= SOFTIRQ_OFFSET) kstat.per_cpu_system[cpu] += sys_ticks; -#if CONFIG_SMP - idle_tick(rq); -#endif + idle_tick(rq, j); return; } if (TASK_NICE(p) > 0) @@ -873,12 +1098,13 @@ void scheduler_tick(int user_ticks, int kstat.per_cpu_user[cpu] += user_ticks; kstat.per_cpu_system[cpu] += sys_ticks; + spin_lock(&rq->lock); /* Task might have expired already, but not scheduled off yet */ if (p->array != rq->active) { set_tsk_need_resched(p); + spin_unlock(&rq->lock); return; } - spin_lock(&rq->lock); if (unlikely(rt_task(p))) { /* * RR tasks need a special form of timeslice management. 
@@ -914,16 +1140,14 @@ void scheduler_tick(int user_ticks, int if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) { if (!rq->expired_timestamp) - rq->expired_timestamp = jiffies; + rq->expired_timestamp = j; enqueue_task(p, rq->expired); } else enqueue_task(p, rq->active); } out: -#if CONFIG_SMP - if (!(jiffies % BUSY_REBALANCE_TICK)) - load_balance(rq, 0); -#endif + if (!(j % BUSY_REBALANCE_TICK)) + load_balance(rq, smp_processor_id(), 0); spin_unlock(&rq->lock); } @@ -935,18 +1159,20 @@ void scheduling_functions_start_here(voi asmlinkage void schedule(void) { task_t *prev, *next; - runqueue_t *rq; prio_array_t *array; + int idx, this_cpu; + runqueue_t *rq; list_t *queue; - int idx; + int retry = 0; if (unlikely(in_interrupt())) BUG(); +need_resched: #if CONFIG_DEBUG_HIGHMEM check_highmem_ptes(); #endif -need_resched: + this_cpu = smp_processor_id(); preempt_disable(); prev = current; rq = this_rq(); @@ -975,12 +1201,14 @@ need_resched: } pick_next_task: if (unlikely(!rq->nr_running)) { -#if CONFIG_SMP - load_balance(rq, 1); + load_balance(rq, this_cpu, 1); if (rq->nr_running) goto pick_next_task; -#endif - next = rq->idle; + active_load_balance(rq, this_cpu); + if (rq->nr_running) + goto pick_next_task; +pick_idle: + next = cpu_idle_ptr(this_cpu); rq->expired_timestamp = 0; goto switch_tasks; } @@ -996,9 +1224,38 @@ pick_next_task: rq->expired_timestamp = 0; } +new_array: idx = sched_find_first_bit(array->bitmap); queue = array->queue + idx; next = list_entry(queue->next, task_t, run_list); + if ((next != prev) && (rq_nr_cpus(rq) > 1)) { + list_t *tmp = queue->next; + + while (task_running(next) || !task_allowed(next, this_cpu)) { + tmp = tmp->next; + if (tmp != queue) { + next = list_entry(tmp, task_t, run_list); + continue; + } + idx = find_next_bit(array->bitmap, MAX_PRIO, ++idx); + if (idx == MAX_PRIO) { + if (retry || !rq->expired->nr_active) + goto pick_idle; + /* + * To avoid infinite changing of arrays, + * when we have only tasks runnable by + * sibling. + */ + retry = 1; + + array = rq->expired; + goto new_array; + } + queue = array->queue + idx; + tmp = queue->next; + next = list_entry(tmp, task_t, run_list); + } + } switch_tasks: prefetch(next); @@ -1006,7 +1263,8 @@ switch_tasks: if (likely(prev != next)) { rq->nr_switches++; - rq->curr = next; + cpu_curr_ptr(this_cpu) = next; + set_task_cpu(next, this_cpu); prepare_arch_switch(rq, next); prev = context_switch(prev, next); @@ -1264,9 +1522,8 @@ void set_user_nice(task_t *p, long nice) * If the task is running and lowered its priority, * or increased its priority then reschedule its CPU: */ - if ((NICE_TO_PRIO(nice) < p->static_prio) || - task_running(rq, p)) - resched_task(rq->curr); + if ((NICE_TO_PRIO(nice) < p->static_prio) || task_running(p)) + resched_task(cpu_curr_ptr(task_cpu(p))); } out_unlock: task_rq_unlock(rq, &flags); @@ -1339,15 +1596,6 @@ int task_nice(task_t *p) } /** - * idle_cpu - is a given cpu idle currently? - * @cpu: the processor in question. - */ -int idle_cpu(int cpu) -{ - return cpu_curr(cpu) == cpu_rq(cpu)->idle; -} - -/** * find_process_by_pid - find a process with a matching PID value. * @pid: the pid in question. 
*/ @@ -1863,7 +2111,7 @@ void __init init_idle(task_t *idle, int local_irq_save(flags); double_rq_lock(idle_rq, rq); - idle_rq->curr = idle_rq->idle = idle; + cpu_curr_ptr(cpu) = cpu_idle_ptr(cpu) = idle; deactivate_task(idle, rq); idle->array = NULL; idle->prio = MAX_PRIO; @@ -1918,6 +2166,7 @@ void set_cpus_allowed(task_t *p, unsigne unsigned long flags; migration_req_t req; runqueue_t *rq; + int cpu; #if 0 /* FIXME: Grab cpu_lock, return error on this case. --RR */ new_mask &= cpu_online_map; @@ -1940,16 +2189,17 @@ void set_cpus_allowed(task_t *p, unsigne * If the task is not on a runqueue (and not running), then * it is sufficient to simply update the task's cpu field. */ - if (!p->array && !task_running(rq, p)) { + if (!p->array && !task_running(p)) { set_task_cpu(p, __ffs(p->cpus_allowed)); task_rq_unlock(rq, &flags); goto out; } init_completion(&req.done); req.task = p; - list_add(&req.list, &rq->migration_queue); + cpu = task_cpu(p); + list_add(&req.list, migration_queue(cpu)); task_rq_unlock(rq, &flags); - wake_up_process(rq->migration_thread); + wake_up_process(migration_thread(cpu)); wait_for_completion(&req.done); out: @@ -1957,15 +2207,15 @@ out: } /* - * migration_thread - this is a highprio system thread that performs + * migration_task - this is a highprio system thread that performs * thread migration by 'pulling' threads into the target runqueue. */ -static int migration_thread(void * data) +static int migration_task(void * data) { struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 }; int cpu = (long) data; runqueue_t *rq; - int ret; + int ret, idx; daemonize(); sigfillset(¤t->blocked); @@ -1980,11 +2230,12 @@ static int migration_thread(void * data) * to the target CPU - we'll wake up there. */ if (smp_processor_id() != cpu) - printk("migration_task %d on cpu=%d\n", cpu, smp_processor_id()); + printk("migration_thread %d on cpu=%d\n", cpu, smp_processor_id()); ret = setscheduler(0, SCHED_FIFO, ¶m); rq = this_rq(); - rq->migration_thread = current; + migration_thread(cpu) = current; + idx = cpu_idx(cpu); sprintf(current->comm, "migration_CPU%d", smp_processor_id()); @@ -1997,7 +2248,9 @@ static int migration_thread(void * data) task_t *p; spin_lock_irqsave(&rq->lock, flags); - head = &rq->migration_queue; + if (cpu_active_balance(cpu)) + do_active_balance(rq, cpu); + head = migration_queue(cpu); current->state = TASK_INTERRUPTIBLE; if (list_empty(head)) { spin_unlock_irqrestore(&rq->lock, flags); @@ -2044,13 +2297,14 @@ static int migration_call(struct notifie unsigned long action, void *hcpu) { + long cpu = (long) hcpu; + switch (action) { case CPU_ONLINE: - printk("Starting migration thread for cpu %li\n", - (long)hcpu); - kernel_thread(migration_thread, hcpu, + printk("Starting migration thread for cpu %li\n", cpu); + kernel_thread(migration_task, hcpu, CLONE_FS | CLONE_FILES | CLONE_SIGNAL); - while (!cpu_rq((long)hcpu)->migration_thread) + while (!migration_thread(cpu)) yield(); break; } @@ -2097,11 +2351,20 @@ void __init sched_init(void) for (i = 0; i < NR_CPUS; i++) { prio_array_t *array; + /* + * Start with a 1:1 mapping between CPUs and runqueues: + */ +#if CONFIG_SHARE_RUNQUEUE + rq_idx(i) = i; + cpu_idx(i) = 0; +#endif rq = cpu_rq(i); rq->active = rq->arrays; rq->expired = rq->arrays + 1; spin_lock_init(&rq->lock); - INIT_LIST_HEAD(&rq->migration_queue); + INIT_LIST_HEAD(migration_queue(i)); + rq->nr_cpus = 1; + rq->cpu[cpu_idx(i)].cpu = i; for (j = 0; j < 2; j++) { array = rq->arrays + j; @@ -2117,9 +2380,8 @@ void __init sched_init(void) * We have 
to do a little magic to get the first * thread right in SMP mode. */ - rq = this_rq(); - rq->curr = current; - rq->idle = current; + cpu_curr_ptr(smp_processor_id()) = current; + cpu_idle_ptr(smp_processor_id()) = current; set_task_cpu(current, smp_processor_id()); wake_up_process(current);