
Custom perf with custom kernel

There is no doubt that perf is an awesome tool. And eBPF lets us attach arbitrary code to any tracepoint. But both have one limitation: they cannot execute a kernel function. If I want to call a kernel function whenever a tracepoint is hit, that is not possible today. It is not a big caveat, but it would be very useful while debugging the kernel.

Let's say I want to trace all IPIs and record the source CPU and the target CPU. An Inter-Processor Interrupt (IPI) is an interrupt sent from one CPU to another, for example to wake it up. I also want to know whether the idle task was running on the target CPU when the IPI was sent.

NOTE: All the code referred here is from Linux-5.4.0

The function that sends the reschedule IPI on the x86 platform is native_smp_send_reschedule. Below is the code of that function.

void native_smp_send_reschedule(int cpu)
{
        if (unlikely(cpu_is_offline(cpu))) {
                WARN(1, "sched: Unexpected reschedule of offline CPU#%d!\n", cpu);
                return;
        }
        apic->send_IPI(cpu, RESCHEDULE_VECTOR);
}
The only argument of this function, cpu, names the target CPU - the CPU to which the IPI is being sent. I also want the source CPU - the one from which the IPI originates.

To get the current CPU, the kernel provides a macro - smp_processor_id().

We can get the run-queue of a CPU using the macro cpu_rq(int cpu), and cpu_rq(int cpu)->curr points to the task currently running on that CPU. If that task's pid is 0, it is the idle task.

So my deduced requirement is: set a tracepoint on native_smp_send_reschedule and call these kernel functions whenever that tracepoint is hit. There is no way to do that today. Thus I'm left with only one option - build my own kernel with the necessary modifications.

Building custom kernel

I cloned the kernel repository from kernel.org and checked out the desired branch. I'm running Ubuntu-20.04 inside VirtualBox, so I'm using kernel 5.4.0, the version already installed in my distribution.

bala@ubuntu-vm-1:~/source/$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
bala@ubuntu-vm-1:~/source/$ cd linux
bala@ubuntu-vm-1:~/source/linux/$ git checkout -b v5.4.0 v5.4
bala@ubuntu-vm-1:~/source/linux/$ head -n6 Makefile
 # SPDX-License-Identifier: GPL-2.0
 VERSION = 5
 PATCHLEVEL = 4
 SUBLEVEL = 0
 EXTRAVERSION =
 NAME = Kleptomaniac Octopus
bala@ubuntu-vm-1:~/source/linux/$

Copied the config from the running OS and ran make oldconfig.

bala@ubuntu-vm-1:~/source/linux/$ cp /boot/config-5.4.0-52-generic ./.config
bala@ubuntu-vm-1:~/source/linux/$ make oldconfig
If any new config options appear, select them appropriately if you know them. Otherwise simply accept the defaults.

Applied the following patch.

bala@ubuntu-vm-1:~/source/linux$ git diff
diff --git a/arch/x86/kernel/apic/ipi.c b/arch/x86/kernel/apic/ipi.c
index 6ca0f91372fd..31df1e1f4922 100644
--- a/arch/x86/kernel/apic/ipi.c
+++ b/arch/x86/kernel/apic/ipi.c
@@ -2,6 +2,7 @@

 #include <linux/cpumask.h>
 #include <linux/smp.h>
+#include "../../../../kernel/sched/sched.h"

 #include "local.h"

@@ -61,14 +62,22 @@ void apic_send_IPI_allbutself(unsigned int vector)
  * wastes no time serializing anything. Worst case is that we lose a
  * reschedule ...
  */
+#pragma GCC push_options
+#pragma GCC optimize ("O0")
 void native_smp_send_reschedule(int cpu)
 {
+       int this_cpu = smp_processor_id();
+       int that_cpu = cpu;
+       struct rq *rq = cpu_rq(that_cpu);
+       int is_idle_task = ((rcu_dereference(rq->curr))->pid == 0);
+
        if (unlikely(cpu_is_offline(cpu))) {
                WARN(1, "sched: Unexpected reschedule of offline CPU#%d!\n", cpu);
                return;
        }
        apic->send_IPI(cpu, RESCHEDULE_VECTOR);
 }
+#pragma GCC pop_options

 void native_send_call_func_single_ipi(int cpu)
 {
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 93d97f9b0157..706fdfc715ab 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -359,6 +359,7 @@ config SECTION_MISMATCH_WARN_ONLY
 #
 config ARCH_WANT_FRAME_POINTERS
        bool
+       default y

 config FRAME_POINTER
        bool "Compile the kernel with frame pointers"
bala@ubuntu-vm-1:~/source/linux$
This patch contains two changes:

* lib/Kconfig.debug - enables frame pointers. Frame pointers are very useful for stack traces.
* arch/x86/kernel/apic/ipi.c
  * The pragma lines are compiler directives. They tell GCC not to optimize the code below them (-O0 optimization). Otherwise, GCC may optimize the new variables away, as they are unused.
  * this_cpu is obtained from smp_processor_id().
  * is_idle_task is set to 1 if the target CPU is executing the idle task.
  * The last pragma resets the GCC options back to their defaults.

Enabled kernel debug info by setting CONFIG_DEBUG_INFO in the .config file and built the kernel. My VM has 4 CPUs, so I started 8 parallel build jobs.

bala@ubuntu-vm-1:~/source/linux$ make bzImage -j8
...
...
...
  OBJCOPY arch/x86/boot/setup.bin
  BUILD   arch/x86/boot/bzImage
Setup is 16412 bytes (padded to 16896 bytes).
System is 8097 kB
CRC 65acc728
Kernel: arch/x86/boot/bzImage is ready  (#1)
bala@ubuntu-vm-1:~/source/linux$

Now copied the bzImage to the /boot/ directory and updated the boot loader.

bala@ubuntu-vm-1:~/source/linux$ sudo cp arch/x86/boot/bzImage /boot/vmlinuz-dbg-custom
bala@ubuntu-vm-1:~/source/linux$ sudo update-grub
Sourcing file `/etc/default/grub'
Sourcing file `/etc/default/grub.d/init-select.cfg'
Generating grub configuration file ...
dpkg: warning: version 'dbg-custom' has bad syntax: version number does not start with digit
Found linux image: /boot/vmlinuz-dbg-custom
Found linux image: /boot/vmlinuz-5.4.0-52-generic
Found initrd image: /boot/initrd.img-5.4.0-52-generic
done
bala@ubuntu-vm-1:~/source/linux$

Rebooted the VM into newly built kernel.

Custom-built perf

As the kernel is custom built, the perf package that comes with Ubuntu-20.04 refused to run. It threw the following error.

bala@ubuntu-vm-1:~/source/linux$ perf --help
WARNING: perf not found for kernel 5.4.0+

  You may need to install the following packages for this specific kernel:
    linux-tools-5.4.0+-5.4.0+
    linux-cloud-tools-5.4.0+-5.4.0+

  You may also want to install one of the following packages to keep up to date:
    linux-tools-5.4.0+
    linux-cloud-tools-5.4.0+
bala@ubuntu-vm-1:~/source/linux$

So I went ahead and built perf from the kernel source itself. Perf requires the following packages for adding probes and source-line probing:

* libelf-dev
* libdw-dev

I installed both of them.

bala@ubuntu-vm-1:~/source/linux/$ sudo apt install libelf-dev libdw-dev

Now built perf inside the kernel source tree.

bala@ubuntu-vm-1:~/source/linux/$ cd tools/perf
bala@ubuntu-vm-1:~/source/linux/tools/perf/$ make

Running the probe on the custom kernel with the custom perf

bala@ubuntu-vm-1:~/source/linux/tools/perf$ sudo ./perf probe -s /home/bala/source/linux/ -k ../../vmlinux native_smp_send_reschedule="native_smp_send_reschedule:7 this_cpu that_cpu is_idle_task"
Added new events:
  probe:native_smp_send_reschedule (on native_smp_send_reschedule:7 with this_cpu that_cpu is_idle_task)
  probe:native_smp_send_reschedule_1 (on native_smp_send_reschedule:7 with this_cpu that_cpu is_idle_task)

You can now use it in all perf tools, such as:

        perf record -e probe:native_smp_send_reschedule_1 -aR sleep 1

bala@ubuntu-vm-1:~/source/linux/tools/perf$ sudo ./perf record -e probe:native_smp_send_reschedule_1 -aR sleep 1
Couldn't synthesize bpf events.
[ perf record: Woken up 1 times to write data ]
way too many cpu caches..[ perf record: Captured and wrote 0.101 MB perf.data (10 samples) ]
bala@ubuntu-vm-1:~/source/linux/tools/perf$ sudo ./perf report --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 10  of event 'probe:native_smp_send_reschedule_1'
# Event count (approx.): 10
#
# Overhead  Trace output
# ........  .......................................................
#
    20.00%  (ffffffffa5e46bc9) this_cpu=0 that_cpu=1 is_idle_task=1
    20.00%  (ffffffffa5e46bc9) this_cpu=3 that_cpu=0 is_idle_task=1
    10.00%  (ffffffffa5e46bc9) this_cpu=0 that_cpu=2 is_idle_task=1
    10.00%  (ffffffffa5e46bc9) this_cpu=0 that_cpu=3 is_idle_task=1
    10.00%  (ffffffffa5e46bc9) this_cpu=1 that_cpu=0 is_idle_task=1
    10.00%  (ffffffffa5e46bc9) this_cpu=2 that_cpu=3 is_idle_task=1
    10.00%  (ffffffffa5e46bc9) this_cpu=3 that_cpu=1 is_idle_task=1
    10.00%  (ffffffffa5e46bc9) this_cpu=3 that_cpu=2 is_idle_task=1


#
# (Tip: Skip collecting build-id when recording: perf record -B)
#
bala@ubuntu-vm-1:~/source/linux/tools/perf$

Why is is_idle_task always 1 during an IPI?! More on that in a later post ;).

References

  • https://github.com/iovisor/bpftrace/issues/792
  • https://gcc.gnu.org/onlinedocs/gcc/Function-Specific-Option-Pragmas.html
  • https://news.ycombinator.com/item?id=4711571
  • https://www.cyberciti.biz/tips/compiling-linux-kernel-26.html
  • https://tldp.org/LDP/lame/LAME/linux-admin-made-easy/kernel-custom.html
  • https://stackoverflow.com/questions/28136815/linux-kernel-how-to-obtain-a-particular-version-right-upto-sublevel
  • https://git-scm.com/docs/git-fetch
  • https://www.quora.com/How-do-I-compile-a-Linux-perf-tool-with-all-features-For-Linux-4-0-on-Ubuntu
  • https://serverfault.com/questions/251134/how-to-compile-the-kernel-with-debug-symbols

perf setup

Linux-perf, aka perf, is versatile, like a batmobile: it has all the tools and functionality you need, and you'll feel like a superhero once you master it.

By default, perf comes with many tools that rely on debug and trace symbols exported via procfs. But to add custom probes, and probes on line numbers, kernel debug symbols and the kernel source are necessary. In this post I'll walk you through the setup process. I'm using an Ubuntu-20.04 VM running on VirtualBox. I'm not going to rebuild and install the kernel. The steps are:

  • Install Linux-kernel debug symbols
  • Fetch Linux-kernel source
  • Install perf
  • First run

Enable non-common repositories

Enable debug repositories in apt source list.

bala@ubuntu-vm-1:~$ echo "deb http://ddebs.ubuntu.com $(lsb_release -cs) main restricted universe multiverse
deb http://ddebs.ubuntu.com $(lsb_release -cs)-updates main restricted universe multiverse
deb http://ddebs.ubuntu.com $(lsb_release -cs)-proposed main restricted universe multiverse" | sudo tee -a /etc/apt/sources.list.d/ddebs.list

Install debug keyring

bala@ubuntu-vm-1:~$ sudo apt install ubuntu-dbgsym-keyring

Enable source repositories in apt source list.

bala@ubuntu-vm-1:~$ grep deb-src /etc/apt/sources.list
deb-src http://in.archive.ubuntu.com/ubuntu/ focal main restricted

Do apt update

bala@ubuntu-vm-1:~$ sudo apt update

1. Install Linux-kernel debug symbols

Install the Linux debug symbols corresponding to the kernel installed on your machine.

bala@ubuntu-vm-1:~$ sudo apt install -y linux-image-`uname -r`-dbgsym

Linux image with debug symbols will be installed in the directory /usr/lib/debug/boot/

bala@ubuntu-vm-1:~$ ls -lh /usr/lib/debug/boot/vmlinux-5.4.0-52-generic
-rw-r--r-- 2 root root 742M Oct 15 15:58 /usr/lib/debug/boot/vmlinux-5.4.0-52-generic
bala@ubuntu-vm-1:~$

2. Fetch Linux kernel source

Fetch the source package corresponding to the installed kernel.

bala@ubuntu-vm-1:~$ sudo apt install linux-source
The kernel source, with Debian packaging files, will be installed under /usr/src/linux-source-5.4.0. The source itself is a tarball inside this directory. Copy it to your desired location and extract it.
bala@ubuntu-vm-1:~$ ls -lh /usr/src/linux-source-5.4.0/linux-source-5.4.0.tar.bz2
-rw-r--r-- 1 root root 129M Oct 15 15:58 /usr/src/linux-source-5.4.0/linux-source-5.4.0.tar.bz2
bala@ubuntu-vm-1:~$ cp -f /usr/src/linux-source-5.4.0/linux-source-5.4.0.tar.bz2 ~/source/
bala@ubuntu-vm-1:~$ cd ~/source/
bala@ubuntu-vm-1:~/source$ tar -xvf linux-source-5.4.0.tar.bz2
bala@ubuntu-vm-1:~/source$ ls ~/source/linux-source-5.4.0/
arch   certs    CREDITS  Documentation  dropped.txt  include  ipc     Kconfig  lib       MAINTAINERS  mm   README   scripts   snapcraft.yaml  tools   update-version-dkms  virt
block  COPYING  crypto   drivers        fs           init     Kbuild  kernel   LICENSES  Makefile     net  samples  security  sound           ubuntu  usr
bala@ubuntu-vm-1:~$

3. Install Linux perf

It comes with the linux-tools-generic package on Ubuntu-20.04.

bala@ubuntu-vm-1:~$ sudo apt install linux-tools-generic

4. Run your first perf command

I want to count the number of IPIs (Inter-Processor Interrupts) sent by resched_curr. It sends an IPI when the target CPU is not the current CPU (the one executing the function). Here is the source code of that function.

void resched_curr(struct rq *rq)
{
    struct task_struct *curr = rq->curr;
    int cpu;

    lockdep_assert_held(&rq->lock);

    if (test_tsk_need_resched(curr))
        return;

    cpu = cpu_of(rq);

    if (cpu == smp_processor_id()) {
        set_tsk_need_resched(curr);
        set_preempt_need_resched();
        return;
    }

    if (set_nr_and_not_polling(curr))
        smp_send_reschedule(cpu);
    else
        trace_sched_wake_idle_without_ipi(cpu);
}

So if the target CPU is the current CPU, line number 14 gets executed. Otherwise execution continues from line number 18. In both cases I also want to record the target CPU.

Get the line numbers where you can insert probes from perf itself.

bala@ubuntu-vm-1:~/source$ sudo perf probe -k /usr/lib/debug/boot/vmlinux-5.4.0-52-generic -s ~/source/linux-source-5.4.0 -L resched_curr
<resched_curr@/home/bala/source/linux-source-5.4.0//kernel/sched/core.c:0>
      0  void resched_curr(struct rq *rq)
      1  {
      2         struct task_struct *curr = rq->curr;
                int cpu;

                lockdep_assert_held(&rq->lock);

      7         if (test_tsk_need_resched(curr))
                        return;

     10         cpu = cpu_of(rq);

     12         if (cpu == smp_processor_id()) {
     13                 set_tsk_need_resched(curr);
     14                 set_preempt_need_resched();
     15                 return;
                }

     18         if (set_nr_and_not_polling(curr))
     19                 smp_send_reschedule(cpu);
                else
     21                 trace_sched_wake_idle_without_ipi(cpu);
         }

         void resched_cpu(int cpu)

bala@ubuntu-vm-1:~/source$

Here is the probe for the non-IPI case. I name it resched_curr_same_cpu.

bala@ubuntu-vm-1:~$ sudo perf probe -k /usr/lib/debug/boot/vmlinux-5.4.0-52-generic -s source/linux-source-5.4.0 resched_curr_same_cpu='resched_curr:14 rq->cpu'

And here is the probe for the IPI case, named resched_curr_send_ipi.

bala@ubuntu-vm-1:~$ sudo perf probe -k /usr/lib/debug/boot/vmlinux-5.4.0-52-generic -s source/linux-source-5.4.0 resched_curr_send_ipi='resched_curr:19 rq->cpu'

Note: To probe the function resched_curr and its argument rq, we need the Linux debug symbols. And to probe on line numbers we need the Linux source. That is why we installed both of them earlier.

Now let's capture the execution of a stress-ng test.

bala@ubuntu-vm-1:~$ sudo perf record -e probe:resched_curr_same_cpu,probe:resched_curr_send_ipi stress-ng --mq 8 -t 5 --metrics-brief
stress-ng: info:  [22439] dispatching hogs: 8 mq
stress-ng: info:  [22439] successful run completed in 5.01s
stress-ng: info:  [22439] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [22439]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [22439] mq              2225397      5.00      3.57     16.14    445062.30    112907.00
[ perf record: Woken up 421 times to write data ]
[ perf record: Captured and wrote 105.404 MB perf.data (1380709 samples) ]
bala@ubuntu-vm-1:~$

And the report is,

bala@ubuntu-vm-1:~$ sudo perf report --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 1M of event 'probe:resched_curr_same_cpu'
# Event count (approx.): 1380698
#
# Overhead  Trace output
# ........  ........................
#
    29.13%  (ffffffff83ad740d) cpu=1
    27.77%  (ffffffff83ad740d) cpu=2
    24.74%  (ffffffff83ad740d) cpu=0
    18.36%  (ffffffff83ad740d) cpu=3


# Samples: 11  of event 'probe:resched_curr_send_ipi'
# Event count (approx.): 11
#
# Overhead  Trace output
# ........  ........................
#
    45.45%  (ffffffff83ad73af) cpu=1
    36.36%  (ffffffff83ad73af) cpu=3
     9.09%  (ffffffff83ad73af) cpu=0
     9.09%  (ffffffff83ad73af) cpu=2


#
# (Cannot load tips.txt file, please install perf!)
#
bala@ubuntu-vm-1:~$
As you can see, an IPI was sent only 11 times out of over a million calls. More on this in later posts. Until then... "Perhaps you should read the instructions first?".

References

  • http://www.brendangregg.com/perf.html
  • https://wiki.ubuntu.com/Kernel/Reference/stress-ng
  • https://man7.org/linux/man-pages/man1/perf-probe.1.html
  • https://wiki.ubuntu.com/Debug%20Symbol%20Packages
  • https://askubuntu.com/questions/50145/how-to-install-perf-monitoring-tool

Quick kernel upgrade with kexec

One of the major issues we face is keeping up to date with security patches. Keeping the kernel up to date is especially hard, because it requires a reboot. As a reboot takes minutes to complete, there will be significant service downtime. And doing a service migration to avoid the downtime comes with its own complexity.

kexec helps in these situations. It can switch to a new kernel without the complete reboot process. Though not zero, the downtime is far less than that of a full reboot. In this post, I'll demo upgrading the kernel of a virtual machine running Debian-9.

This VM is running Debian Linux-4.9.0-12. Let me update to the latest kernel available now - Linux-4.9.0-13.

Install kexec-tools

root@debian:~# apt install kexec-tools -qq

Install the latest Linux-image package. This will not overwrite the existing kernel or initrd images in your /boot/ directory, so you can safely roll back if required.

root@debian:~# ls -lh /boot/
total 25M
-rw-r--r-- 1 root root 3.1M Jan 21  2020 System.map-4.9.0-12-amd64
-rw-r--r-- 1 root root 183K Jan 21  2020 config-4.9.0-12-amd64
drwxr-xr-x 5 root root 4.0K Apr 24 12:40 grub
-rw-r--r-- 1 root root  18M Apr 24 12:23 initrd.img-4.9.0-12-amd64
-rw-r--r-- 1 root root 4.1M Jan 21  2020 vmlinuz-4.9.0-12-amd64

root@debian:~# sudo apt update -qq
43 packages can be upgraded. Run 'apt list --upgradable' to see them.

root@debian:~# sudo apt install linux-image-amd64 -qq
The following additional packages will be installed:
  linux-image-4.9.0-13-amd64
Suggested packages:
  linux-doc-4.9 debian-kernel-handbook
The following NEW packages will be installed:
  linux-image-4.9.0-13-amd64
The following packages will be upgraded:
  linux-image-amd64
1 upgraded, 1 newly installed, 0 to remove and 42 not upgraded.
Need to get 39.3 MB of archives.
After this operation, 193 MB of additional disk space will be used.
Do you want to continue? [Y/n] y
Selecting previously unselected package linux-image-4.9.0-13-amd64.
(Reading database ... 26429 files and directories currently installed.)
Preparing to unpack .../linux-image-4.9.0-13-amd64_4.9.228-1_amd64.deb ...
Unpacking linux-image-4.9.0-13-amd64 (4.9.228-1) ...........................]
Preparing to unpack .../linux-image-amd64_4.9+80+deb9u11_amd64.deb .........]
Unpacking linux-image-amd64 (4.9+80+deb9u11) over (4.9+80+deb9u10) .........]
Setting up linux-image-4.9.0-13-amd64 (4.9.228-1) ..........................]
I: /vmlinuz is now a symlink to boot/vmlinuz-4.9.0-13-amd64.................]
I: /initrd.img is now a symlink to boot/initrd.img-4.9.0-13-amd64
/etc/kernel/postinst.d/initramfs-tools:
update-initramfs: Generating /boot/initrd.img-4.9.0-13-amd64
/etc/kernel/postinst.d/zz-update-grub:
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-4.9.0-13-amd64
Found initrd image: /boot/initrd.img-4.9.0-13-amd64
Found linux image: /boot/vmlinuz-4.9.0-12-amd64
Found initrd image: /boot/initrd.img-4.9.0-12-amd64
done
Setting up linux-image-amd64 (4.9+80+deb9u11) ...###########................]

root@debian:~# ls -lh /boot/
total 50M
-rw-r--r-- 1 root root 3.1M Jan 21  2020 System.map-4.9.0-12-amd64
-rw-r--r-- 1 root root 3.1M Jul  6 02:59 System.map-4.9.0-13-amd64   <---
-rw-r--r-- 1 root root 183K Jan 21  2020 config-4.9.0-12-amd64
-rw-r--r-- 1 root root 183K Jul  6 02:59 config-4.9.0-13-amd64       <---
drwxr-xr-x 5 root root 4.0K Oct  1 17:25 grub
-rw-r--r-- 1 root root  18M Apr 24 12:23 initrd.img-4.9.0-12-amd64
-rw-r--r-- 1 root root  18M Oct  1 17:25 initrd.img-4.9.0-13-amd64   <---
-rw-r--r-- 1 root root 4.1M Jan 21  2020 vmlinuz-4.9.0-12-amd64
-rw-r--r-- 1 root root 4.1M Jul  6 02:59 vmlinuz-4.9.0-13-amd64      <---

Now copy the kernel command line from /proc/cmdline. We should pass this to kexec.

root@debian:~# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-4.9.0-12-amd64 root=UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx ro net.ifnames=0 biosdevname=0 cgroup_enable=memory console=tty0 console=ttyS0,115200 notsc scsi_mod.use_blk_mq=Y quiet
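
The load step can be sketched as a small shell fragment that builds the kexec -l invocation from the command line. KVER and the CMDLINE value below are stand-ins; on a live system you would use CMDLINE="$(cat /proc/cmdline)" and fix up BOOT_IMAGE= to point at the new kernel. Newer kexec-tools also offer a --reuse-cmdline option that reuses the running kernel's command line for you.

```shell
# Stand-ins for this sketch; replace with real values on a live system.
KVER=4.9.0-13-amd64
CMDLINE="root=UUID=xxxxxxxx-xxxx ro net.ifnames=0 quiet"

# Build the kexec -l invocation for the new kernel and its initrd.
KEXEC_CMD="kexec -l /boot/vmlinuz-${KVER} --initrd=/boot/initrd.img-${KVER} --command-line=\"${CMDLINE}\""
echo "$KEXEC_CMD"
```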

Load the new kernel using kexec -l.

root@debian:~# kexec -l /boot/vmlinuz-4.9.0-13-amd64 --initrd=/boot/initrd.img-4.9.0-13-amd64 --command-line="BOOT_IMAGE=/boot/vmlinuz-4.9.0-13-amd64 root=UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx ro net.ifnames=0 biosdevname=0 cgroup_enable=memory console=tty0 console=ttyS0,115200 notsc scsi_mod.use_blk_mq=Y quiet"
root@debian:~#

Now upgrade to the new kernel.

root@debian:~# uname -a
Linux debian 4.9.0-12-amd64 #1 SMP Debian 4.9.210-1 (2020-01-20) x86_64 GNU/Linux

root@debian:~# systemctl start kexec.target
[268181.341191] kexec_core: Starting new kernel
/dev/sda1: clean, 35704/655360 files, 366185/2621179 blocks
GROWROOT: NOCHANGE: partition 1 is size 20969439. it cannot be grown

Debian GNU/Linux 9 debian ttyS0

debian login: root
Password:
Last login: Mon Sep 28 14:59:30 IST 2020 on ttyS0
Linux debian 4.9.0-13-amd64 #1 SMP Debian 4.9.228-1 (2020-07-05) x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.

root@debian:~# uname -a
Linux debian 4.9.0-13-amd64 #1 SMP Debian 4.9.228-1 (2020-07-05) x86_64 GNU/Linux
root@debian:~#

Time to upgrade

The switch itself took almost no time. I was pinging this VM from its host, and there was a slight increase in latency while the upgrade was in progress - less than a second. I didn't run any service and test its status after the reboot, though, because that may vary from service to service.

64 bytes from 192.168.122.91: icmp_seq=176 ttl=64 time=0.465 ms
64 bytes from 192.168.122.91: icmp_seq=177 ttl=64 time=0.408 ms
64 bytes from 192.168.122.91: icmp_seq=181 ttl=64 time=8.32 ms   <---
64 bytes from 192.168.122.91: icmp_seq=182 ttl=64 time=0.452 ms
64 bytes from 192.168.122.91: icmp_seq=183 ttl=64 time=0.198 ms

perf kvm to profile vm_exit

Optimizing VM_EXITs significantly improves the performance of VMs. Most major improvements in the VM world focus on reducing the number of VM_EXITs. To optimize it, first we should be able to measure it. The tool kvm_stat was initially designed for this purpose; later it was added to perf itself.

To profile VM_EXITs while running sysbench:

* Get the pid of the VM task - 127894
* Get the IP of that machine - 192.168.122.194. Make sure you can ssh to that machine without a password.
* Install sysbench inside the VM

$ sudo perf kvm stat record -p 127894 ssh 192.168.122.194 -l test_user "sysbench --test=cpu --cpu-max-prime=20000 run"
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Doing CPU performance benchmark

Threads started!
Done.

Maximum prime number checked in CPU test: 20000


Test execution summary:
    total time:                          22.6607s
    total number of events:              10000
    total time taken by event execution: 22.6598
    per-request statistics:
         min:                                  2.13ms
         avg:                                  2.27ms
         max:                                 12.10ms
         approx.  95 percentile:               2.88ms

Threads fairness:
    events (avg/stddev):           10000.0000/0.00
    execution time (avg/stddev):   22.6598/0.00

[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 4.779 MB perf.data.guest (52461 samples) ]
$

Perf has recorded the data in perf.data.guest in the current directory. Now to view VM_EXITs,

$ sudo perf kvm stat report --event=vmexit


Analyze events for all VMs, all VCPUs:

             VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time

           MSR_WRITE       9167    35.40%     0.04%      0.45us   9554.94us      3.00us ( +-  41.94% )
  EXTERNAL_INTERRUPT       5877    22.69%     0.02%      0.37us   1175.48us      2.43us ( +-  17.90% )
    PREEMPTION_TIMER       5728    22.12%     0.01%      0.51us     21.14us      0.62us ( +-   0.87% )
                 HLT       2232     8.62%    99.92%      0.56us 1001118.99us  30567.94us ( +-   9.88% )
               CPUID       2160     8.34%     0.00%      0.40us     12.82us      0.65us ( +-   1.29% )
   PAUSE_INSTRUCTION        390     1.51%     0.00%      0.38us   1490.19us      8.27us ( +-  62.22% )
       EPT_MISCONFIG        303     1.17%     0.01%      1.04us    167.13us     13.33us ( +-   8.61% )
         EOI_INDUCED         37     0.14%     0.00%      0.62us      3.00us      1.24us ( +-   6.58% )
       EXCEPTION_NMI          4     0.02%     0.00%      0.42us      0.56us      0.47us ( +-   6.81% )

Total Samples:25898, Total events handled time:68281638.61us.

$

kexec - A travel to the purgatory

This is one of the most unforgettable experiences of my engineering life. Last year, when we brought up an ARM64-based platform, we faced lots of hurdles. But this one was very interesting and full of surprises. Though the triaging effort cost multiple frustrating days, the learning was very good. I had to understand the kexec_load system call, the kernel reboot path and the kernel early boot path to root-cause the issue. We were using the 4.14.19 kernel for the bring-up, but I'll give code samples from 5.4.14 as it is the latest. There is not much code difference.

The boot flow of our new platform was uboot --> service Linux --> main Linux. We had the service Linux to offload major hardware initialization work from the bootloader, and also to keep that code platform agnostic [the same service Linux ran on x86 platforms too].

To make the jump from service Linux to main Linux we used kexec. The problem was that it took almost 2 minutes for the main Linux to start booting. This was a significant difference compared with x86 platforms, and this kind of delay would definitely annoy our customers.

After hundreds of rebuilds and thousands of print statements, I found the cause and fixed the issue. The 2-minute delay was reduced to 4 seconds. Let me just walk you through the code flow rather than dump my debugging methods on you. I kept deferring this article for almost a year, because I was afraid of collecting the code references and connecting them to show the flow.

Code reference

package      version                                                    repo
kexec-tools  latest master (2c9f26ed20a791a7df0182ba82e93abb52f5a615)   https://github.com/horms/kexec-tools
linux        5.4.14                                                     https://elixir.bootlin.com/linux/v5.4.14/source
NOTE: I copied code of entire functions for reference. But explained only necessary lines for better understanding.

User story

Login to the service Linux. Load main Linux's kernel and initrd. The current device-tree will be taken for main Linux too. Boot to main Linux.

# kexec -l /main/vmlinuz --initrd=/main/initrd.img
# kexec -e

As I mentioned earlier, there was a 2-minute delay between the last "Bye" from service Linux and the first message from main Linux. So I started looking for time-consuming operations between those two prints.

kexec-tools kexec -e

kexec -e calls the reboot() system call with LINUX_REBOOT_CMD_KEXEC as the argument.

kexec/kexec.c my_exec +900:
---
reboot(LINUX_REBOOT_CMD_KEXEC);

Kernel reboot path

reboot system call

The reboot system call calls kernel_kexec() for the argument LINUX_REBOOT_CMD_KEXEC. You can refer to my earlier article, Anatomy of a Linux system call in ARM64, to understand how arguments are passed from user space to kernel space.

kernel/reboot.c SYSCALL_DEFINE4(reboot, ...) +380:
---
#ifdef CONFIG_KEXEC_CORE
    case LINUX_REBOOT_CMD_KEXEC:
        ret = kernel_kexec();
        break;
#endif

kernel_kexec()

kernel_kexec() at kernel/kexec_core.c +1119 does following things.

kernel/kexec_core.c +1119
---
/*
 * Move into place and start executing a preloaded standalone
 * executable.  If nothing was preloaded return an error.
 */
int kernel_kexec(void)
{
    int error = 0;

    if (!mutex_trylock(&kexec_mutex))
        return -EBUSY;
    if (!kexec_image) {
        error = -EINVAL;
        goto Unlock;
    }

#ifdef CONFIG_KEXEC_JUMP
    if (kexec_image->preserve_context) {
        lock_system_sleep();
        pm_prepare_console();
        error = freeze_processes();
        if (error) {
            error = -EBUSY;
            goto Restore_console;
        }
        suspend_console();
        error = dpm_suspend_start(PMSG_FREEZE);
        if (error)
            goto Resume_console;
        /* At this point, dpm_suspend_start() has been called,
         * but *not* dpm_suspend_end(). We *must* call
         * dpm_suspend_end() now.  Otherwise, drivers for
         * some devices (e.g. interrupt controllers) become
         * desynchronized with the actual state of the
         * hardware at resume time, and evil weirdness ensues.
         */
        error = dpm_suspend_end(PMSG_FREEZE);
        if (error)
            goto Resume_devices;
        error = suspend_disable_secondary_cpus();
        if (error)
            goto Enable_cpus;
        local_irq_disable();
        error = syscore_suspend();
        if (error)
            goto Enable_irqs;
    } else
#endif
    {
        kexec_in_progress = true;
        kernel_restart_prepare(NULL);
        migrate_to_reboot_cpu();

        /*
         * migrate_to_reboot_cpu() disables CPU hotplug assuming that
         * no further code needs to use CPU hotplug (which is true in
         * the reboot case). However, the kexec path depends on using
         * CPU hotplug again; so re-enable it here.
         */
        cpu_hotplug_enable();
        pr_emerg("Starting new kernel\n");
        machine_shutdown();
    }

    machine_kexec(kexec_image);

#ifdef CONFIG_KEXEC_JUMP
    if (kexec_image->preserve_context) {
        syscore_resume();
 Enable_irqs:
        local_irq_enable();
 Enable_cpus:
        suspend_enable_secondary_cpus();
        dpm_resume_start(PMSG_RESTORE);
 Resume_devices:
        dpm_resume_end(PMSG_RESTORE);
 Resume_console:
        resume_console();
        thaw_processes();
 Restore_console:
        pm_restore_console();
        unlock_system_sleep();
    }
#endif

 Unlock:
    mutex_unlock(&kexec_mutex);
    return error;
}

Line-by-line notes:

* +1123 mutex_trylock(&kexec_mutex) - this lock is held to avoid multiple entries into kexec.
* +1131 if (kexec_image->preserve_context) - related to keeping device state undisturbed across kexec. We were not using it, so execution moves to the else part.
* +1165 migrate_to_reboot_cpu() - continue execution from logical CPU 0. Only one control path is valid during reboot and startup, and it is executed on CPU-0.
* +1175 machine_shutdown() - don't be confused by the name of this function. On arm64 it is just a wrapper around disable_nonboot_cpus(); it does nothing but disable the other CPUs.
* +1178 machine_kexec(kexec_image) - execution continues, passing kexec_image as the argument. This call should not return.

machine_kexec()

arch/arm64/kernel/machine_kexec.c +144
---
/**
 * machine_kexec - Do the kexec reboot.
 *
 * Called from the core kexec code for a sys_reboot with LINUX_REBOOT_CMD_KEXEC.
 */
void machine_kexec(struct kimage *kimage)
{
    phys_addr_t reboot_code_buffer_phys;
    void *reboot_code_buffer;
    bool in_kexec_crash = (kimage == kexec_crash_image);
    bool stuck_cpus = cpus_are_stuck_in_kernel();

    /*
     * New cpus may have become stuck_in_kernel after we loaded the image.
     */
    BUG_ON(!in_kexec_crash && (stuck_cpus || (num_online_cpus() > 1)));
    WARN(in_kexec_crash && (stuck_cpus || smp_crash_stop_failed()),
        "Some CPUs may be stale, kdump will be unreliable.\n");

    reboot_code_buffer_phys = page_to_phys(kimage->control_code_page);
    reboot_code_buffer = phys_to_virt(reboot_code_buffer_phys);

    kexec_image_info(kimage);

    pr_debug("%s:%d: control_code_page:        %p\n", __func__, __LINE__,
        kimage->control_code_page);
    pr_debug("%s:%d: reboot_code_buffer_phys:  %pa\n", __func__, __LINE__,
        &reboot_code_buffer_phys);
    pr_debug("%s:%d: reboot_code_buffer:       %p\n", __func__, __LINE__,
        reboot_code_buffer);
    pr_debug("%s:%d: relocate_new_kernel:      %p\n", __func__, __LINE__,
        arm64_relocate_new_kernel);
    pr_debug("%s:%d: relocate_new_kernel_size: 0x%lx(%lu) bytes\n",
        __func__, __LINE__, arm64_relocate_new_kernel_size,
        arm64_relocate_new_kernel_size);

    /*
     * Copy arm64_relocate_new_kernel to the reboot_code_buffer for use
     * after the kernel is shut down.
     */
    memcpy(reboot_code_buffer, arm64_relocate_new_kernel,
        arm64_relocate_new_kernel_size);

    /* Flush the reboot_code_buffer in preparation for its execution. */
    __flush_dcache_area(reboot_code_buffer, arm64_relocate_new_kernel_size);

    /*
     * Although we've killed off the secondary CPUs, we don't update
     * the online mask if we're handling a crash kernel and consequently
     * need to avoid flush_icache_range(), which will attempt to IPI
     * the offline CPUs. Therefore, we must use the __* variant here.
     */
    __flush_icache_range((uintptr_t)reboot_code_buffer,
                 arm64_relocate_new_kernel_size);

    /* Flush the kimage list and its buffers. */
    kexec_list_flush(kimage);

    /* Flush the new image if already in place. */
    if ((kimage != kexec_crash_image) && (kimage->head & IND_DONE))
        kexec_segment_flush(kimage);

    pr_info("Bye!\n");

    local_daif_mask();

    /*
     * cpu_soft_restart will shutdown the MMU, disable data caches, then
     * transfer control to the reboot_code_buffer which contains a copy of
     * the arm64_relocate_new_kernel routine.  arm64_relocate_new_kernel
     * uses physical addressing to relocate the new image to its final
     * position and transfers control to the image entry point when the
     * relocation is complete.
     * In kexec case, kimage->start points to purgatory assuming that
     * kernel entry and dtb address are embedded in purgatory by
     * userspace (kexec-tools).
     * In kexec_file case, the kernel starts directly without purgatory.
     */
    cpu_soft_restart(reboot_code_buffer_phys, kimage->head, kimage->start,
#ifdef CONFIG_KEXEC_FILE
                        kimage->arch.dtb_mem);
#else
                        0);
#endif

    BUG(); /* Should never get here. */
}
line code explanation
148 bool in_kexec_crash = (kimage == kexec_crash_image) Not true in our case; kimage is kexec_image
149 bool stuck_cpus = cpus_are_stuck_in_kernel(); Checks whether any CPUs are stuck in the kernel. Ideally none are
154 BUG_ON(!in_kexec_crash && (stuck_cpus || (num_online_cpus() > 1))) In non-crash situations, no CPU should be stuck and no CPU other than the reboot CPU should be online.
158 reboot_code_buffer_phys = page_to_phys(kimage->control_code_page) Gets the physical address of the page behind kimage->control_code_page. The arm64_relocate_new_kernel routine will be copied to this special location.
159 reboot_code_buffer = phys_to_virt(reboot_code_buffer_phys) Stores the corresponding virtual address so we can keep working on that area.
179 memcpy(reboot_code_buffer, arm64_relocate_new_kernel, arm64_relocate_new_kernel_size) Copies the arm64_relocate_new_kernel routine (arm64_relocate_new_kernel_size bytes) to the jump location. It is implemented in assembly, and it is the routine that copies the new kernel into its final place. To make sure the memory where this routine resides doesn't get overwritten during that copy, the routine itself is first copied into the control code page and executed from there. Never expected, right? Me too.
183-199 __flush_dcache_area(reboot_code_buffer, arm64_relocate_new_kernel_size);
__flush_icache_range((uintptr_t)reboot_code_buffer, arm64_relocate_new_kernel_size);
kexec_list_flush(kimage);
if ((kimage != kexec_crash_image) && (kimage->head & IND_DONE)) kexec_segment_flush(kimage);
Flushes the necessary memory areas
201 pr_info("Bye!\n") The last print message from the current kernel
203 local_daif_mask() Disables all exceptions, including interrupts. As we are entering the reboot path, no normal operation is expected.
217 cpu_soft_restart(reboot_code_buffer_phys, kimage->head, kimage->start, 0) This call won't return. CONFIG_KEXEC_FILE is not relevant in our case. The comment block above this call explains the purgatory code, but unfortunately it had not been written when I was actually debugging this issue.

cpu_soft_restart()

It's just a wrapper around __cpu_soft_restart, which is an assembly routine. In our case el2_switch would be 0.

arch/arm64/kernel/cpu-reset.h +16
---
void __cpu_soft_restart(unsigned long el2_switch, unsigned long entry,
    unsigned long arg0, unsigned long arg1, unsigned long arg2);

static inline void __noreturn cpu_soft_restart(unsigned long entry,
                           unsigned long arg0,
                           unsigned long arg1,
                           unsigned long arg2)
{
    typeof(__cpu_soft_restart) *restart;

    unsigned long el2_switch = !is_kernel_in_hyp_mode() &&
        is_hyp_mode_available();
    restart = (void *)__pa_symbol(__cpu_soft_restart);

    cpu_install_idmap();
    restart(el2_switch, entry, arg0, arg1, arg2);
    unreachable();
}

An important call to note here is cpu_install_idmap(). It comes in handy whenever the MMU is turned on or off. The Linux kernel has a text section called idmap.text, and cpu_install_idmap() installs page tables in which that section is identity-mapped: its virtual addresses are equal to its physical addresses. For example, if code in this section is loaded at physical address 0x80000000, it is also reachable at virtual address 0x80000000. This ensures continuous execution when the MMU goes off, because the CPU fetches from the same addresses either way. I'll write a separate article explaining this in detail.

__cpu_soft_restart()

arch/arm64/kernel/cpu-reset.S +15
---
.text
.pushsection    .idmap.text, "awx"

/*
 * __cpu_soft_restart(el2_switch, entry, arg0, arg1, arg2) - Helper for
 * cpu_soft_restart.
 *
 * @el2_switch: Flag to indicate a switch to EL2 is needed.
 * @entry: Location to jump to for soft reset.
 * arg0: First argument passed to @entry. (relocation list)
 * arg1: Second argument passed to @entry.(physical kernel entry)
 * arg2: Third argument passed to @entry. (physical dtb address)
 *
 * Put the CPU into the same state as it would be if it had been reset, and
 * branch to what would be the reset vector. It must be executed with the
 * flat identity mapping.
 */
ENTRY(__cpu_soft_restart)
    /* Clear sctlr_el1 flags. */
    mrs x12, sctlr_el1
    ldr x13, =SCTLR_ELx_FLAGS
    bic x12, x12, x13
    pre_disable_mmu_workaround
    msr sctlr_el1, x12
    isb

    cbz x0, 1f              // el2_switch?
    mov x0, #HVC_SOFT_RESTART
    hvc #0              // no return

1:  mov x18, x1             // entry
    mov x0, x2              // arg0
    mov x1, x3              // arg1
    mov x2, x4              // arg2
    br  x18
ENDPROC(__cpu_soft_restart)

.popsection
line instruction explanation
16 .pushsection .idmap.text, "awx" As mentioned earlier, this routine has to be placed in the idmap.text section, because it is going to disable the MMU.
34-38 mrs x12, sctlr_el1
ldr x13, =SCTLR_ELx_FLAGS
bic x12, x12, x13
pre_disable_mmu_workaround
msr sctlr_el1, x12
Disables the I-cache, D-cache and MMU by clearing the relevant sctlr_el1 flags
45-49 mov x18, x1
mov x0, x2
mov x1, x3
mov x2, x4
br x18
Sets up the arguments and jumps to the arm64_relocate_new_kernel routine. In the arm64 calling convention, registers x0-x7 carry the first eight arguments

arm64_relocate_new_kernel()

The assembly routine can be found at arch/arm64/kernel/relocate_kernel.S +29. It walks the kimage list to copy the new kernel into its final place, sets the dtb address in x0, and jumps to kexec_image->start, which is [supposed to be] pointing to the new kernel.


kexec load kernel path

If kexec_image->start pointed to the new kernel as I expected, I shouldn't have seen that delay. So, to verify that, I went through the kexec_load() system call's path.

kexec_load()

kernel/kexec.c +232
---
SYSCALL_DEFINE4(kexec_load, unsigned long, entry, unsigned long, nr_segments,
        struct kexec_segment __user *, segments, unsigned long, flags)
{
    int result;

    result = kexec_load_check(nr_segments, flags);
    if (result)
        return result;

    /* Verify we are on the appropriate architecture */
    if (((flags & KEXEC_ARCH_MASK) != KEXEC_ARCH) &&
        ((flags & KEXEC_ARCH_MASK) != KEXEC_ARCH_DEFAULT))
        return -EINVAL;

    /* Because we write directly to the reserved memory
     * region when loading crash kernels we need a mutex here to
     * prevent multiple crash  kernels from attempting to load
     * simultaneously, and to prevent a crash kernel from loading
     * over the top of a in use crash kernel.
     *
     * KISS: always take the mutex.
     */
    if (!mutex_trylock(&kexec_mutex))
        return -EBUSY;

    result = do_kexec_load(entry, nr_segments, segments, flags);

    mutex_unlock(&kexec_mutex);

    return result;
}
It first checks capabilities and flags with the kexec_load_check() function, then takes the kexec_mutex and goes into do_kexec_load().

do_kexec_load()

kernel/kexec.c +106
---
static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
        struct kexec_segment __user *segments, unsigned long flags)
{
    struct kimage **dest_image, *image;
    unsigned long i;
    int ret;

    if (flags & KEXEC_ON_CRASH) {
        dest_image = &kexec_crash_image;
        if (kexec_crash_image)
            arch_kexec_unprotect_crashkres();
    } else {
        dest_image = &kexec_image;
    }

    if (nr_segments == 0) {
        /* Uninstall image */
        kimage_free(xchg(dest_image, NULL));
        return 0;
    }
    if (flags & KEXEC_ON_CRASH) {
        /*
         * Loading another kernel to switch to if this one
         * crashes.  Free any current crash dump kernel before
         * we corrupt it.
         */
        kimage_free(xchg(&kexec_crash_image, NULL));
    }

    ret = kimage_alloc_init(&image, entry, nr_segments, segments, flags);
    if (ret)
        return ret;

    if (flags & KEXEC_PRESERVE_CONTEXT)
        image->preserve_context = 1;

    ret = machine_kexec_prepare(image);
    if (ret)
        goto out;

    /*
     * Some architecture(like S390) may touch the crash memory before
     * machine_kexec_prepare(), we must copy vmcoreinfo data after it.
     */
    ret = kimage_crash_copy_vmcoreinfo(image);
    if (ret)
        goto out;

    for (i = 0; i < nr_segments; i++) {
        ret = kimage_load_segment(image, &image->segment[i]);
        if (ret)
            goto out;
    }

    kimage_terminate(image);

    /* Install the new kernel and uninstall the old */
    image = xchg(dest_image, image);

out:
    if ((flags & KEXEC_ON_CRASH) && kexec_crash_image)
        arch_kexec_protect_crashkres();

    kimage_free(image);
    return ret;
}
line code explanation
113-120 if (flags & KEXEC_ON_CRASH) { ... } else { ... } It's a simple if block choosing the destination image based on the flags
135 ret = kimage_alloc_init(&image, entry, nr_segments, segments, flags); Creates a new image. The second argument is important: this is the jump address.
142 ret = machine_kexec_prepare(image); Checks whether any other CPUs are stuck in the kernel
154-158 for (i = 0; i < nr_segments; i++) { ret = kimage_load_segment(image, &image->segment[i]); ... } Copies the segments passed from user space into kernel space.
163 image = xchg(dest_image, image); Replaces the old image with the new one


The entry argument passed to the kimage_alloc_init() function is assigned to kexec_image->start.

kernel/kexec.c kimage_alloc_init +60
---
image->start = entry;
This kexec_image->start is passed to cpu_soft_restart() as the third argument. If you follow the argument flow in the kernel reboot path [as explained above], arm64_relocate_new_kernel() jumps to this address. So this must be the address of the new kernel.

kexec-tools

Next I went into the kexec-tools source. The call to kexec_load() is as below.

 kexec/kexec.c my_load +821
---
    result = kexec_load(info.entry,
                info.nr_segments, info.segment,
                info.kexec_flags);
info.entry is passed as the first argument. As we've seen in the previous section, this is the jump address. Let's see what goes into info.entry. The code flow is as follows:

kexec/kexec.c main +1551
---
    result = my_load(type, fileind, argc, argv,
                kexec_flags, skip_checks, entry);
kexec/kexec.c my_load +772
---
    result = file_type[i].load(argc, argv, kernel_buf, kernel_size, &info);
kexec/arch/arm/kexec-arm.c +82
---
struct file_type file_type[] = {
    /* uImage is probed before zImage because the latter also accepts
       uncompressed images. */
    {"uImage", uImage_arm_probe, uImage_arm_load, zImage_arm_usage},
    {"zImage", zImage_arm_probe, zImage_arm_load, zImage_arm_usage},
};
kexec/arch/arm64/kexec-zImage-arm64.c zImage_arm64_load +212
---
    result = arm64_load_other_segments(info, kernel_segment
        + arm64_mem.text_offset);
kexec/arch/arm64/kexec-arm64.c arm64_load_other_segments +743
---
    info->entry = (void *)elf_rel_get_addr(&info->rhdr, "purgatory_start");

The address of a symbol called purgatory_start is assigned to info->entry. So the old kernel jumps to this purgatory_start, not to the new kernel. This was the big surprise: the last thing I expected was that kexec-tools inserts a piece of code between the two kernels. I felt I was closing in on the culprit.

The purgatory

purgatory/arch/arm64/entry.S +9
---
.text

.globl purgatory_start
purgatory_start:

    adr x19, .Lstack
    mov sp, x19

    bl  purgatory

    /* Start new image. */
    ldr x17, arm64_kernel_entry
    ldr x0, arm64_dtb_addr
    mov x1, xzr
    mov x2, xzr
    mov x3, xzr
    br  x17

size purgatory_start
purgatory_start is an assembly subroutine. It first calls a function named purgatory, then loads arm64_dtb_addr into register x0 and jumps to arm64_kernel_entry. As the arm64 kernel expects the dtb address in register x0, this jump is likely the jump to the new kernel. Let's see where these symbols are set.

kexec/arch/arm64/kexec-arm64.c arm64_load_other_segments +
---
    elf_rel_build_load(info, &info->rhdr, purgatory, purgatory_size,
        hole_min, hole_max, 1, 0);

    info->entry = (void *)elf_rel_get_addr(&info->rhdr, "purgatory_start");

    elf_rel_set_symbol(&info->rhdr, "arm64_kernel_entry", &image_base,
        sizeof(image_base));

    elf_rel_set_symbol(&info->rhdr, "arm64_dtb_addr", &dtb_base,
        sizeof(dtb_base));
As you can see in the snippet, image_base becomes arm64_kernel_entry and dtb_base becomes arm64_dtb_addr. If you go a little above in the function, you'll see that image_base and dtb_base are the kernel address and the device-tree address respectively. The purgatory code itself is also loaded into the segments. The last function left to check between the two kernels is purgatory.

purgatory/purgatory.c
---
void purgatory(void)
{
    printf("I'm in purgatory\n");
    setup_arch();
    if (!skip_checks && verify_sha256_digest()) {
        for(;;) {
            /* loop forever */
        }
    }
    post_verification_setup_arch();
}
setup_arch() and post_verification_setup_arch() are no-ops for arm64. The actual delay is caused by verify_sha256_digest(). There is nothing wrong with this function, but it is executed with the I and D caches off. The skip_checks option had not been introduced when I was debugging this issue, so we solved it by enabling the I and D caches in setup_arch() and disabling them again in post_verification_setup_arch().

At last I felt like my purgatory sentence was over.