kexec - A travel to the purgatory

This is one of the unforgettable experiences of my engineering life. Last year, when we brought up an ARM64-based platform, we faced lots of hurdles. But this one was very interesting and full of surprises. Though triaging it cost multiple frustrating days, I learned a lot. I had to understand the kexec_load system call, the kernel reboot path and the kernel early boot path to root-cause the issue. We were using the 4.14.19 kernel for the bring-up, but I'll give code samples from 5.4.14 as it is the latest. There is not much difference in the code.

The boot flow of our new platform was uboot --> service Linux --> main Linux. We had the service Linux to offload major hardware initialization work from the bootloader, and also to keep that code platform agnostic [the same service Linux ran on x86 platforms too].

To make the jump from service Linux to main Linux we used kexec. The problem was that it took almost 2 minutes for the main Linux to start booting. This was a significant difference compared with the x86 platforms, and this kind of delay would definitely annoy our customers.

After hundreds of rebuilds and thousands of print statements, I found the cause and fixed the issue. The 2-minute delay was reduced to 4 seconds. Let me just walk you through the code flow rather than dump my debugging methods on you. I kept putting off writing this article for almost a year, because I was afraid of collecting the code references and connecting them to show the flow.

Code reference

package version repo
kexec-tools today's latest master 2c9f26ed20a791a7df0182ba82e93abb52f5a615 https://github.com/horms/kexec-tools
linux 5.4.14 https://elixir.bootlin.com/linux/v5.4.14/source
NOTE: I copied the code of entire functions for reference, but explained only the necessary lines for better understanding.

User story

Log in to the service Linux. Load the main Linux's kernel and initrd. The current device tree is reused for the main Linux too. Boot into the main Linux.

# kexec -l /main/vmlinuz --initrd=/main/initrd.img
# kexec -e

As I mentioned earlier, there was a 2-minute delay between the last message (Bye!) from service Linux and the first message from main Linux. So I started looking for time-consuming operations between those two prints.

kexec-tools kexec -e

kexec -e calls the reboot() system call with LINUX_REBOOT_CMD_KEXEC as the argument.

kexec/kexec.c my_exec +900:
---
reboot(LINUX_REBOOT_CMD_KEXEC);
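
For illustration, the call above boils down to something like the following. This is a minimal sketch, not the kexec-tools source: the helper name exec_loaded_kernel and its dry_run parameter are assumptions made here so that the example can be tried without actually rebooting the machine.

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/reboot.h>     /* reboot() wrapper from the C library */
#include <linux/reboot.h>   /* LINUX_REBOOT_CMD_KEXEC */

/*
 * Hypothetical helper (not part of kexec-tools): jump into the kernel
 * that was previously loaded with "kexec -l".
 * With dry_run != 0 it only reports what it would do.
 */
int exec_loaded_kernel(int dry_run)
{
    if (dry_run) {
        printf("would call reboot(0x%x)\n",
               (unsigned int)LINUX_REBOOT_CMD_KEXEC);
        return 0;
    }

    sync();                                 /* flush dirty buffers first */
    return reboot(LINUX_REBOOT_CMD_KEXEC);  /* returns only on failure */
}
```

Calling exec_loaded_kernel(0) needs root privileges and a previously loaded image. Without an image, kernel_kexec() (shown later) bails out with -EINVAL and the call returns -1.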

Kernel reboot path

reboot system call

The reboot system call calls kernel_kexec() for the argument LINUX_REBOOT_CMD_KEXEC. You can refer to my earlier article, Anatomy of Linux system call in ARM64, to understand how arguments are passed from user space to kernel space.

kernel/reboot.c SYSCALL_DEFINE4(reboot, ...) +380:
---
#ifdef CONFIG_KEXEC_CORE
    case LINUX_REBOOT_CMD_KEXEC:
        ret = kernel_kexec();
        break;
#endif

kernel_kexec()

kernel_kexec() at kernel/kexec_core.c +1119 does the following things.

kernel/kexec_core.c +1119
---
/*
 * Move into place and start executing a preloaded standalone
 * executable.  If nothing was preloaded return an error.
 */
int kernel_kexec(void)
{
    int error = 0;

    if (!mutex_trylock(&kexec_mutex))
        return -EBUSY;
    if (!kexec_image) {
        error = -EINVAL;
        goto Unlock;
    }

#ifdef CONFIG_KEXEC_JUMP
    if (kexec_image->preserve_context) {
        lock_system_sleep();
        pm_prepare_console();
        error = freeze_processes();
        if (error) {
            error = -EBUSY;
            goto Restore_console;
        }
        suspend_console();
        error = dpm_suspend_start(PMSG_FREEZE);
        if (error)
            goto Resume_console;
        /* At this point, dpm_suspend_start() has been called,
         * but *not* dpm_suspend_end(). We *must* call
         * dpm_suspend_end() now.  Otherwise, drivers for
         * some devices (e.g. interrupt controllers) become
         * desynchronized with the actual state of the
         * hardware at resume time, and evil weirdness ensues.
         */
        error = dpm_suspend_end(PMSG_FREEZE);
        if (error)
            goto Resume_devices;
        error = suspend_disable_secondary_cpus();
        if (error)
            goto Enable_cpus;
        local_irq_disable();
        error = syscore_suspend();
        if (error)
            goto Enable_irqs;
    } else
#endif
    {
        kexec_in_progress = true;
        kernel_restart_prepare(NULL);
        migrate_to_reboot_cpu();

        /*
         * migrate_to_reboot_cpu() disables CPU hotplug assuming that
         * no further code needs to use CPU hotplug (which is true in
         * the reboot case). However, the kexec path depends on using
         * CPU hotplug again; so re-enable it here.
         */
        cpu_hotplug_enable();
        pr_emerg("Starting new kernel\n");
        machine_shutdown();
    }

    machine_kexec(kexec_image);

#ifdef CONFIG_KEXEC_JUMP
    if (kexec_image->preserve_context) {
        syscore_resume();
 Enable_irqs:
        local_irq_enable();
 Enable_cpus:
        suspend_enable_secondary_cpus();
        dpm_resume_start(PMSG_RESTORE);
 Resume_devices:
        dpm_resume_end(PMSG_RESTORE);
 Resume_console:
        resume_console();
        thaw_processes();
 Restore_console:
        pm_restore_console();
        unlock_system_sleep();
    }
#endif

 Unlock:
    mutex_unlock(&kexec_mutex);
    return error;
}

line code explanation
1123 mutex_trylock(&kexec_mutex) This lock is held to prevent concurrent entry into kexec.
1131 if (kexec_image->preserve_context) { Taken only for kexec jump, which preserves the context of the current kernel so that it can be resumed later. We were not using it, so we move to the else part.
1165 migrate_to_reboot_cpu() Continue execution on the reboot CPU, which is logical CPU 0 by default. Only one control path is valid during reboot and startup, and it runs on that CPU.
1175 machine_shutdown() Don't get confused by the name of this function. On arm64 it's just a wrapper around disable_nonboot_cpus(). It does nothing but disable the other CPUs.
1178 machine_kexec(kexec_image) Execution continues with kexec_image passed as the argument. This call should not return.

machine_kexec()

arch/arm64/kernel/machine_kexec.c +144
---
/**
 * machine_kexec - Do the kexec reboot.
 *
 * Called from the core kexec code for a sys_reboot with LINUX_REBOOT_CMD_KEXEC.
 */
void machine_kexec(struct kimage *kimage)
{
    phys_addr_t reboot_code_buffer_phys;
    void *reboot_code_buffer;
    bool in_kexec_crash = (kimage == kexec_crash_image);
    bool stuck_cpus = cpus_are_stuck_in_kernel();

    /*
     * New cpus may have become stuck_in_kernel after we loaded the image.
     */
    BUG_ON(!in_kexec_crash && (stuck_cpus || (num_online_cpus() > 1)));
    WARN(in_kexec_crash && (stuck_cpus || smp_crash_stop_failed()),
        "Some CPUs may be stale, kdump will be unreliable.\n");

    reboot_code_buffer_phys = page_to_phys(kimage->control_code_page);
    reboot_code_buffer = phys_to_virt(reboot_code_buffer_phys);

    kexec_image_info(kimage);

    pr_debug("%s:%d: control_code_page:        %p\n", __func__, __LINE__,
        kimage->control_code_page);
    pr_debug("%s:%d: reboot_code_buffer_phys:  %pa\n", __func__, __LINE__,
        &reboot_code_buffer_phys);
    pr_debug("%s:%d: reboot_code_buffer:       %p\n", __func__, __LINE__,
        reboot_code_buffer);
    pr_debug("%s:%d: relocate_new_kernel:      %p\n", __func__, __LINE__,
        arm64_relocate_new_kernel);
    pr_debug("%s:%d: relocate_new_kernel_size: 0x%lx(%lu) bytes\n",
        __func__, __LINE__, arm64_relocate_new_kernel_size,
        arm64_relocate_new_kernel_size);

    /*
     * Copy arm64_relocate_new_kernel to the reboot_code_buffer for use
     * after the kernel is shut down.
     */
    memcpy(reboot_code_buffer, arm64_relocate_new_kernel,
        arm64_relocate_new_kernel_size);

    /* Flush the reboot_code_buffer in preparation for its execution. */
    __flush_dcache_area(reboot_code_buffer, arm64_relocate_new_kernel_size);

    /*
     * Although we've killed off the secondary CPUs, we don't update
     * the online mask if we're handling a crash kernel and consequently
     * need to avoid flush_icache_range(), which will attempt to IPI
     * the offline CPUs. Therefore, we must use the __* variant here.
     */
    __flush_icache_range((uintptr_t)reboot_code_buffer,
                 arm64_relocate_new_kernel_size);

    /* Flush the kimage list and its buffers. */
    kexec_list_flush(kimage);

    /* Flush the new image if already in place. */
    if ((kimage != kexec_crash_image) && (kimage->head & IND_DONE))
        kexec_segment_flush(kimage);

    pr_info("Bye!\n");

    local_daif_mask();

    /*
     * cpu_soft_restart will shutdown the MMU, disable data caches, then
     * transfer control to the reboot_code_buffer which contains a copy of
     * the arm64_relocate_new_kernel routine.  arm64_relocate_new_kernel
     * uses physical addressing to relocate the new image to its final
     * position and transfers control to the image entry point when the
     * relocation is complete.
     * In kexec case, kimage->start points to purgatory assuming that
     * kernel entry and dtb address are embedded in purgatory by
     * userspace (kexec-tools).
     * In kexec_file case, the kernel starts directly without purgatory.
     */
    cpu_soft_restart(reboot_code_buffer_phys, kimage->head, kimage->start,
#ifdef CONFIG_KEXEC_FILE
                        kimage->arch.dtb_mem);
#else
                        0);
#endif

    BUG(); /* Should never get here. */
}
line code explanation
148 bool in_kexec_crash = (kimage == kexec_crash_image) Not true in our case. kimage is kexec_image.
149 bool stuck_cpus = cpus_are_stuck_in_kernel(); True if any CPU is stuck in the kernel. Ideally it should be false.
154 BUG_ON(!in_kexec_crash && (stuck_cpus || (num_online_cpus() > 1))) In non-crash situations, no CPU should be stuck and no CPU other than the reboot CPU should be online.
158 reboot_code_buffer_phys = page_to_phys(kimage->control_code_page) Get the physical address of the page kimage->control_code_page. The arm64_relocate_new_kernel function will be copied to this special location.
159 reboot_code_buffer = phys_to_virt(reboot_code_buffer_phys) Store the virtual address of the same page to continue working on that area.
179 memcpy(reboot_code_buffer, arm64_relocate_new_kernel, arm64_relocate_new_kernel_size) Copies the arm64_relocate_new_kernel routine (arm64_relocate_new_kernel_size bytes of code) to the jump location. The routine is implemented in assembly, and it is the one that copies the new kernel to its correct place. To make sure the memory where this routine resides doesn't get overwritten during the copy, it is copied into the control code page of kexec_image and executed from there. Never expected that, right? Me neither.
183 - 199 __flush_dcache_area(reboot_code_buffer, arm64_relocate_new_kernel_size);
__flush_icache_range((uintptr_t)reboot_code_buffer, arm64_relocate_new_kernel_size);
kexec_list_flush(kimage);
if ((kimage != kexec_crash_image) && (kimage->head & IND_DONE)) kexec_segment_flush(kimage);
Flush the necessary memory areas.
201 pr_info("Bye!\n") The last print message from the current kernel.
203 local_daif_mask() Mask all maskable exceptions: debug, SError, IRQ and FIQ. As we are entering the reboot path, don't expect any normal operations.
217 cpu_soft_restart(reboot_code_buffer_phys, kimage->head, kimage->start, 0) This call won't return. CONFIG_KEXEC_FILE is not necessary in our case. The comment block above this call explains the purgatory code, but unfortunately it had not been written when I was actually debugging this issue.

cpu_soft_restart()

It's just a wrapper around __cpu_soft_restart, which is an assembly routine. el2_switch would be set to 0 in our case.

arch/arm64/kernel/cpu-reset.h +16
---
void __cpu_soft_restart(unsigned long el2_switch, unsigned long entry,
    unsigned long arg0, unsigned long arg1, unsigned long arg2);

static inline void __noreturn cpu_soft_restart(unsigned long entry,
                           unsigned long arg0,
                           unsigned long arg1,
                           unsigned long arg2)
{
    typeof(__cpu_soft_restart) *restart;

    unsigned long el2_switch = !is_kernel_in_hyp_mode() &&
        is_hyp_mode_available();
    restart = (void *)__pa_symbol(__cpu_soft_restart);

    cpu_install_idmap();
    restart(el2_switch, entry, arg0, arg1, arg2);
    unreachable();
}

An important call to note here is cpu_install_idmap(). It comes in handy when the MMU is turned on or off. The Linux kernel has a text section called .idmap.text. cpu_install_idmap() does not copy this section anywhere. It installs the identity-map page tables, in which the code of this section is mapped at virtual addresses equal to its physical addresses. So the same code is reachable through two mappings: the kernel's normal virtual mapping and the identity mapping. For example, if code in this section sits at physical address 0x80000000, the identity map makes it visible at virtual address 0x80000000 too. That ensures execution continues at the same address after the MMU goes off. I'll write a separate article explaining this in detail.

__cpu_soft_restart()

arch/arm64/kernel/cpu-reset.S +15
---
.text
.pushsection    .idmap.text, "awx"

/*
 * __cpu_soft_restart(el2_switch, entry, arg0, arg1, arg2) - Helper for
 * cpu_soft_restart.
 *
 * @el2_switch: Flag to indicate a switch to EL2 is needed.
 * @entry: Location to jump to for soft reset.
 * arg0: First argument passed to @entry. (relocation list)
 * arg1: Second argument passed to @entry.(physical kernel entry)
 * arg2: Third argument passed to @entry. (physical dtb address)
 *
 * Put the CPU into the same state as it would be if it had been reset, and
 * branch to what would be the reset vector. It must be executed with the
 * flat identity mapping.
 */
ENTRY(__cpu_soft_restart)
    /* Clear sctlr_el1 flags. */
    mrs x12, sctlr_el1
    ldr x13, =SCTLR_ELx_FLAGS
    bic x12, x12, x13
    pre_disable_mmu_workaround
    msr sctlr_el1, x12
    isb

    cbz x0, 1f              // el2_switch?
    mov x0, #HVC_SOFT_RESTART
    hvc #0              // no return

1:  mov x18, x1             // entry
    mov x0, x2              // arg0
    mov x1, x3              // arg1
    mov x2, x4              // arg2
    br  x18
ENDPROC(__cpu_soft_restart)

.popsection
line instruction explanation
16 .pushsection .idmap.text, "awx" As I mentioned earlier, this routine has to be placed in the .idmap.text section, because it is going to disable the MMU.
34-38 mrs x12, sctlr_el1
ldr x13, =SCTLR_ELx_FLAGS
bic x12, x12, x13
pre_disable_mmu_workaround
msr sctlr_el1, x12
Disable the I and D caches and the MMU.
45-49 mov x18, x1
mov x0, x2
mov x1, x3
mov x2, x4
br x18
Set the arguments and jump to the arm64_relocate_new_kernel routine. On arm64, registers x0-x7 carry the first 8 arguments.
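
The mrs/bic/msr sequence is a plain read-modify-write that clears enable bits. Here is a small C sketch of the same bit manipulation. The bit positions of M, C and I come from the Arm architecture, but DEMO_FLAGS and clear_sctlr_flags are names made up for this illustration, and the kernel's real SCTLR_ELx_FLAGS mask contains more bits than these three.

```c
#include <stdint.h>

/* SCTLR_EL1 bit positions as defined by the Arm architecture. */
#define SCTLR_M (UINT64_C(1) << 0)   /* MMU enable */
#define SCTLR_C (UINT64_C(1) << 2)   /* data cache enable */
#define SCTLR_I (UINT64_C(1) << 12)  /* instruction cache enable */

/*
 * Simplified, hypothetical mask: only the three bits discussed in the
 * text. The kernel's SCTLR_ELx_FLAGS covers additional bits.
 */
#define DEMO_FLAGS (SCTLR_M | SCTLR_C | SCTLR_I)

/* Same operation as "bic x12, x12, x13": clear every bit set in the mask. */
uint64_t clear_sctlr_flags(uint64_t sctlr)
{
    return sctlr & ~DEMO_FLAGS;
}
```

All other bits of the register value pass through unchanged, which is why the assembly reads the register first instead of writing a constant.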

arm64_relocate_new_kernel()

The assembly routine can be found at arch/arm64/kernel/relocate_kernel.S +29. It does nothing surprising. After walking the relocation list, it puts the dtb address in x0 and jumps to kexec_image->start, which is [supposed to be] pointing to the new kernel.


kexec load kernel path

If kexec_image->start pointed to the new kernel as I expected, I shouldn't have seen that delay. So to verify that, I went through the kexec_load() system call's path.

kexec_load()

kernel/kexec.c +232
---
SYSCALL_DEFINE4(kexec_load, unsigned long, entry, unsigned long, nr_segments,
        struct kexec_segment __user *, segments, unsigned long, flags)
{
    int result;

    result = kexec_load_check(nr_segments, flags);
    if (result)
        return result;

    /* Verify we are on the appropriate architecture */
    if (((flags & KEXEC_ARCH_MASK) != KEXEC_ARCH) &&
        ((flags & KEXEC_ARCH_MASK) != KEXEC_ARCH_DEFAULT))
        return -EINVAL;

    /* Because we write directly to the reserved memory
     * region when loading crash kernels we need a mutex here to
     * prevent multiple crash  kernels from attempting to load
     * simultaneously, and to prevent a crash kernel from loading
     * over the top of a in use crash kernel.
     *
     * KISS: always take the mutex.
     */
    if (!mutex_trylock(&kexec_mutex))
        return -EBUSY;

    result = do_kexec_load(entry, nr_segments, segments, flags);

    mutex_unlock(&kexec_mutex);

    return result;
}
It first checks permissions with the kexec_load_check() function. Then it takes kexec_mutex and goes into do_kexec_load().
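
The "try the lock, otherwise return -EBUSY" shape is used by both kexec_load() and kernel_kexec(). Below is a user-space sketch of that pattern under some assumptions: a C11 atomic_flag stands in for kexec_mutex, and guarded_load, load_ok and load_nested are hypothetical names that exist only in this example.

```c
#include <errno.h>
#include <stdatomic.h>

/* Stand-in for kexec_mutex; an atomic flag is enough to show trylock. */
static atomic_flag load_lock = ATOMIC_FLAG_INIT;

/*
 * Hypothetical wrapper mirroring the shape of kexec_load(): take the
 * lock without sleeping, run the real work, release the lock.
 */
int guarded_load(int (*do_load)(void))
{
    int ret;

    if (atomic_flag_test_and_set(&load_lock))
        return -EBUSY;          /* somebody else is loading right now */

    ret = do_load();

    atomic_flag_clear(&load_lock);
    return ret;
}

/* Two toy loaders used to exercise the wrapper. */
int load_ok(void)     { return 0; }
int load_nested(void) { return guarded_load(load_ok); }
```

guarded_load(load_ok) succeeds, while guarded_load(load_nested) reports -EBUSY because the inner attempt finds the lock already taken. A second loader never sleeps waiting for the first one.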

do_kexec_load()

kernel/kexec.c +106
---
static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
        struct kexec_segment __user *segments, unsigned long flags)
{
    struct kimage **dest_image, *image;
    unsigned long i;
    int ret;

    if (flags & KEXEC_ON_CRASH) {
        dest_image = &kexec_crash_image;
        if (kexec_crash_image)
            arch_kexec_unprotect_crashkres();
    } else {
        dest_image = &kexec_image;
    }

    if (nr_segments == 0) {
        /* Uninstall image */
        kimage_free(xchg(dest_image, NULL));
        return 0;
    }
    if (flags & KEXEC_ON_CRASH) {
        /*
         * Loading another kernel to switch to if this one
         * crashes.  Free any current crash dump kernel before
         * we corrupt it.
         */
        kimage_free(xchg(&kexec_crash_image, NULL));
    }

    ret = kimage_alloc_init(&image, entry, nr_segments, segments, flags);
    if (ret)
        return ret;

    if (flags & KEXEC_PRESERVE_CONTEXT)
        image->preserve_context = 1;

    ret = machine_kexec_prepare(image);
    if (ret)
        goto out;

    /*
     * Some architecture(like S390) may touch the crash memory before
     * machine_kexec_prepare(), we must copy vmcoreinfo data after it.
     */
    ret = kimage_crash_copy_vmcoreinfo(image);
    if (ret)
        goto out;

    for (i = 0; i < nr_segments; i++) {
        ret = kimage_load_segment(image, &image->segment[i]);
        if (ret)
            goto out;
    }

    kimage_terminate(image);

    /* Install the new kernel and uninstall the old */
    image = xchg(dest_image, image);

out:
    if ((flags & KEXEC_ON_CRASH) && kexec_crash_image)
        arch_kexec_protect_crashkres();

    kimage_free(image);
    return ret;
}
line code explanation
113-120 if (flags & KEXEC_ON_CRASH) { ... } else { ... } A simple if block that chooses the destination image based on the flags.
135 ret = kimage_alloc_init(&image, entry, nr_segments, segments, flags); Create a new image. The second argument is important: it is the jump address.
142 ret = machine_kexec_prepare(image); Check whether any other CPUs are stuck.
154-158 for (i = 0; i < nr_segments; i++) { ... } Copies the segments passed from user space to kernel space.
163 image = xchg(dest_image, image); Replaces the older image with the new one.


The entry argument passed to the kimage_alloc_init() function is assigned to kexec_image->start.

kernel/kexec.c kimage_alloc_init +60
---
image->start = entry;
This kexec_image->start is passed to cpu_soft_restart() as the third argument. If you follow the argument flow in the kernel reboot path [as explained above], arm64_relocate_new_kernel() jumps to this address. So this must be the address of the new kernel.

kexec-tools

Now I went into the kexec-tools source. The call to kexec_load() is as below.

 kexec/kexec.c my_load +821
---
    result = kexec_load(info.entry,
                info.nr_segments, info.segment,
                info.kexec_flags);
info.entry is passed as the first argument. As we've seen in the previous section, this is the jump address. Let's see what is in info.entry. The code flow goes like this:

kexec/kexec.c main +1551
---
    result = my_load(type, fileind, argc, argv,
                kexec_flags, skip_checks, entry);
kexec/kexec.c my_load +772
---
    result = file_type[i].load(argc, argv, kernel_buf, kernel_size, &info);
kexec/arch/arm/kexec-arm.c +82
---
struct file_type file_type[] = {
    /* uImage is probed before zImage because the latter also accepts
       uncompressed images. */
    {"uImage", uImage_arm_probe, uImage_arm_load, zImage_arm_usage},
    {"zImage", zImage_arm_probe, zImage_arm_load, zImage_arm_usage},
};
kexec/arch/arm64/kexec-zImage-arm64.c zImage_arm64_load +212
---
    result = arm64_load_other_segments(info, kernel_segment
        + arm64_mem.text_offset);
kexec/arch/arm64/kexec-arm64.c arm64_load_other_segments +743
---
    info->entry = (void *)elf_rel_get_addr(&info->rhdr, "purgatory_start");

The address of a symbol called purgatory_start is assigned to info->entry. So the old kernel jumps to purgatory_start, not to the new kernel. This was the big surprise. The last thing I expected was kexec-tools inserting a piece of code between the kernels. I felt that I was closing in on the culprit.

The purgatory

purgatory/arch/arm64/entry.S +9
---
.text

.globl purgatory_start
purgatory_start:

    adr x19, .Lstack
    mov sp, x19

    bl  purgatory

    /* Start new image. */
    ldr x17, arm64_kernel_entry
    ldr x0, arm64_dtb_addr
    mov x1, xzr
    mov x2, xzr
    mov x3, xzr
    br  x17

size purgatory_start
purgatory_start is an assembly subroutine. It first calls a function named purgatory. Then it loads arm64_dtb_addr into register x0 and jumps to arm64_kernel_entry. As the arm64 kernel expects the dtb address in register x0, this jump is likely to be the jump to the new kernel. Let's see where these symbols are set.

kexec/arch/arm64/kexec-arm64.c arm64_load_other_segments +
---
    elf_rel_build_load(info, &info->rhdr, purgatory, purgatory_size,
        hole_min, hole_max, 1, 0);

    info->entry = (void *)elf_rel_get_addr(&info->rhdr, "purgatory_start");

    elf_rel_set_symbol(&info->rhdr, "arm64_kernel_entry", &image_base,
        sizeof(image_base));

    elf_rel_set_symbol(&info->rhdr, "arm64_dtb_addr", &dtb_base,
        sizeof(dtb_base));
As you see in the code snippet, arm64_kernel_entry is set to image_base and arm64_dtb_addr is set to dtb_base. If you go a little further up in the function, you'll see that image_base and dtb_base are the kernel address and the device-tree address respectively. The function purgatory is also loaded into the segments. So the last function to check between the two kernels is purgatory.
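
To make the hand-off order concrete, here is a toy user-space model of it. It is only a sketch: real purgatory runs bare-metal with the symbols patched in by elf_rel_set_symbol(), whereas here fake_kernel, run_handoff, handoff_dtb and the trace buffer are hypothetical names invented for the illustration.

```c
#include <stdint.h>
#include <string.h>

/* Records the order in which the stages run. */
static char trace[64];

static void note(const char *stage)
{
    strncat(trace, stage, sizeof(trace) - strlen(trace) - 1);
}

/* Stand-ins for the two symbols that kexec-tools patches into purgatory. */
static void (*arm64_kernel_entry)(uint64_t dtb);
static uint64_t arm64_dtb_addr;

static uint64_t dtb_seen_by_kernel;

static void fake_kernel(uint64_t dtb)
{
    dtb_seen_by_kernel = dtb;   /* the real kernel receives this in x0 */
    note("kernel;");
}

static void purgatory(void)
{
    note("purgatory;");         /* the real one verifies a SHA-256 digest */
}

/* C rendering of purgatory_start: call purgatory(), then enter the kernel. */
static void purgatory_start(void)
{
    purgatory();
    arm64_kernel_entry(arm64_dtb_addr);
}

/* Hypothetical driver: plays the role of the old kernel's final jump. */
const char *run_handoff(uint64_t dtb_base)
{
    trace[0] = '\0';
    arm64_kernel_entry = fake_kernel;   /* like setting "arm64_kernel_entry" */
    arm64_dtb_addr = dtb_base;          /* like setting "arm64_dtb_addr" */
    purgatory_start();                  /* info->entry points here */
    return trace;
}

uint64_t handoff_dtb(void)
{
    return dtb_seen_by_kernel;
}
```

run_handoff(0x4000) returns the string "purgatory;kernel;": whatever purgatory() does is paid for before the new kernel executes a single instruction.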

purgatory/purgatory.c
---
void purgatory(void)
{
    printf("I'm in purgatory\n");
    setup_arch();
    if (!skip_checks && verify_sha256_digest()) {
        for(;;) {
            /* loop forever */
        }
    }
    post_verification_setup_arch();
}
setup_arch() and post_verification_setup_arch() are no-ops on arm64. The actual delay is caused by verify_sha256_digest(). There is nothing wrong with this function, but it is executed with the I and D caches off. The skip_checks option had not been introduced when I was debugging this issue, so we solved it by enabling the I and D caches in setup_arch() and disabling them again in post_verification_setup_arch().
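
The reason the check is so sensitive to caching is that it reads every byte of every loaded segment. The sketch below shows that region-by-region shape under loud assumptions: FNV-1a is swapped in for SHA-256 to keep the example short, and struct region, digest_regions and verify_digest are hypothetical names, not the kexec-tools ones.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for purgatory's table of regions to verify. */
struct region {
    const unsigned char *start;
    size_t len;
};

/* FNV-1a, 64-bit: a toy hash used here in place of SHA-256. */
static uint64_t fnv1a_update(uint64_t h, const unsigned char *p, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];                  /* one memory read per byte */
        h *= 1099511628211ULL;      /* FNV prime */
    }
    return h;
}

/* Fold all regions into a single digest, in order. */
uint64_t digest_regions(const struct region *r, size_t n)
{
    uint64_t h = 14695981039346656037ULL;   /* FNV offset basis */

    for (size_t i = 0; i < n; i++)
        h = fnv1a_update(h, r[i].start, r[i].len);
    return h;
}

/*
 * Returns 0 when the digest matches and non-zero otherwise, mirroring
 * how purgatory() treats a non-zero result as a verification failure.
 */
int verify_digest(const struct region *r, size_t n, uint64_t expected)
{
    return digest_regions(r, n) != expected;
}
```

With a kernel and an initrd of many megabytes, the inner loop runs once per byte. With the caches off each of those reads goes to memory, which matches the delay we saw.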

At last I felt like my purgatory sentence was over.

Variadic functions with unknown argument count

One of my colleagues came across a peculiar problem. She had to write an API that accepts a variable number of arguments, but the number of arguments would not be passed in the argument list. She cracked it cleverly with the following hack.

The Hack

The heart of this hack is a macro that can count the number of arguments passed to it. It has a limitation: the maximum number of arguments that can be passed to the macro must be known in advance. For example, if at most 5 arguments can be passed, the macro will look like this:

#define COUNT5(...) _COUNT5(__VA_ARGS__, 5, 4, 3, 2, 1)
#define _COUNT5(a, b, c, d, e, count, ...) count

If you want your macro to count 10 or fewer arguments,

#define COUNT10(...) _COUNT10(__VA_ARGS__, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1)
#define _COUNT10(a, b, c, d, e, f, g, h, i, j, count, ...) count

Let me explain it. Consider the macro call below. It expands like this.

COUNT5(99, 98, 97);
  |
  |
  V
_COUNT5(99, 98, 97, 5, 4, 3, 2, 1)
  |
  |
  V
  3

The three arguments passed to COUNT5 occupy a, b and c of _COUNT5. 5 and 4 occupy d and e. The next argument, 3, lands in the place of count, and that is what the macro expands to.
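
The expansion can be checked with a few thin wrappers. This is a variant sketch rather than her exact macro: the helper is renamed COUNT5_IMPL and a trailing 0 is appended so that the "..." of the helper always receives at least one argument, even for a one-argument call. The count_of_* functions are hypothetical and exist only to make the results observable.

```c
/* Counting macro for up to 5 arguments, with a trailing 0 as padding. */
#define COUNT5(...) COUNT5_IMPL(__VA_ARGS__, 5, 4, 3, 2, 1, 0)
#define COUNT5_IMPL(a, b, c, d, e, count, ...) count

/* Thin wrappers so the expansions can be inspected at run time. */
int count_of_one(void)   { return COUNT5(99); }
int count_of_three(void) { return COUNT5(99, 98, 97); }
int count_of_five(void)  { return COUNT5(1, 2, 3, 4, 5); }

/* Limitation: an empty argument list still counts as one (empty) argument. */
int count_of_none(void)  { return COUNT5(); }
```

The first three wrappers return 1, 3 and 5. count_of_none() returns 1 rather than 0, so this trick cannot distinguish "no arguments" from "one argument".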

Final solution

So she exposed a macro that accepts a variable number of arguments, as the API required. This macro internally used the COUNTX macro to get the number of arguments passed, and then passed the count and the variable arguments to the actual C function.

Example

A small C program using this hack.

#include <stdio.h>
#include <stdarg.h>
#include <stdlib.h>

int _sum(int count, ...);

#define COUNT(...) _COUNT(__VA_ARGS__, 5, 4, 3, 2, 1)
#define _COUNT(a, b, c, d, e, count, ...) count

#define sum(...) _sum(COUNT(__VA_ARGS__), __VA_ARGS__)

int _sum(int count, ...) {
    va_list arg_ptr;
    int     sum = 0;
    int     i = 0;

    va_start(arg_ptr, count);

    for (i = 0; i < count; i++) {
        sum += va_arg(arg_ptr, int);
    }

    return sum;
}

int main() {
    printf("%d\n", sum(1, 2, 3, 4, 5));
    printf("%d\n", sum(1, 2, 3));
    printf("%d\n", sum(1));
    printf("%d\n", sum(2, 2, 2, 2, 2));

    return 0;
}

And its output.

kaba@kaba-Vostro-1550:~/variadic
$ gcc variadic.c
kaba@kaba-Vostro-1550:~/variadic
$ ./a.out
15
6
1
10
kaba@kaba-Vostro-1550:~/variadic
$

Custom build kernel for Raspberry Pi

I've already written a post about how to cross-compile the mainline kernel for Raspberry Pi. In this post I'm covering how to cross-compile the Raspberry Pi Linux kernel. This will be simple and straightforward. I may write a series of posts on kernel debugging and optimization based on the Raspberry Pi kernel, so this post will be the starting point for them.

Directory structure,

balakumaran@balakumaran-pc:~/Desktop/RPi$ ls -lh
total 32K
drwxrwxr-x  3 balakumaran balakumaran 4.0K Mar  9 19:33 firmware
drwxr-xr-x  8 balakumaran balakumaran 4.0K Jan 23 01:52 gcc-linaro-7.4.1-2019.02-x86_64_aarch64-linux-gnu
drwxrwxr-x 22 balakumaran balakumaran 4.0K Mar 30 18:38 kernel_out
drwxrwxr-x 26 balakumaran balakumaran 4.0K Mar 30 18:13 linux-rpi-4.14.y
drwxrwxr-x 18 balakumaran balakumaran 4.0K Mar  9 19:34 rootfs
balakumaran@balakumaran-pc:~/Desktop/RPi$

| Directory | Purpose |
|----------------|-----------------------------------------------------------|
| gcc-li... | GCC cross compiler from Linaro. Extracted |
| firmware/boot | boot directory of Raspberry Pi firmware repo |
| kernel_out | Output directory for Raspberry kernel |
| rootfs | rootfs from Linaro. Extracted |
| linux-rpi... | Raspberry Pi kernel repo |

I used the Ubuntu image rootfs from Linaro.

Prepare SD card

Make two partitions as follows,

balakumaran@balakumaran-pc:~/Desktop/RPi$ sudo fdisk -l /dev/sdc
[sudo] password for balakumaran:
Disk /dev/sdc: 29.7 GiB, 31914983424 bytes, 62333952 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xcde890ba

Device     Boot  Start      End  Sectors  Size Id Type
/dev/sdc1  *      2048   133119   131072   64M  b W95 FAT32
/dev/sdc2       133120 62333951 62200832 29.7G 83 Linux
balakumaran@balakumaran-pc:~/Desktop/RPi$
The complete steps for doing this are available in the Appendix.

Copy necessary files

balakumaran@balakumaran-pc:/media/balakumaran$ sudo mount /dev/sdc1 /mnt/boot/
balakumaran@balakumaran-pc:/media/balakumaran$ sudo mount /dev/sdc2 /mnt/rootfs/
balakumaran@balakumaran-pc:~/Desktop$ sudo cp -rf ~/Desktop/RPi/firmware/boot/* /mnt/boot/
[sudo] password for balakumaran:
balakumaran@balakumaran-pc:~/Desktop$
balakumaran@balakumaran-pc:~/Desktop$ sudo cp -rf ~/Desktop/RPi/rootfs/* /mnt/rootfs/
balakumaran@balakumaran-pc:~/Desktop$

Build and Install kernel

Unless you are ready for the pain, use a stable kernel release.

Set up the following environment variables,

balakumaran@balakumaran-pc:~/Desktop/RPi$ source ~/setup_arm64_build.sh
balakumaran@balakumaran-pc:~/Desktop/RPi$ echo $CROSS_COMPILE
aarch64-linux-gnu-
balakumaran@balakumaran-pc:~/Desktop/RPi$ echo $ARCH
arm64
balakumaran@balakumaran-pc:~/Desktop/RPi$ echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/balakumaran/Desktop/RPi/gcc-linaro-7.4.1-2019.02-x86_64_aarch64-linux-gnu/bin/
balakumaran@balakumaran-pc:~/Desktop/RPi$

Cross-compile the kernel, device tree and modules.

balakumaran@balakumaran-pc:~/Desktop/RPi/linux-rpi-4.14.y$ time make ARCH=arm64 O=../kernel_out/ bcmrpi3_defconfig
make[1]: Entering directory '/home/balakumaran/Desktop/RPi/kernel_out'
.
.
.

balakumaran@balakumaran-pc:~/Desktop/RPi/linux-rpi-4.14.y$ time make -j8  ARCH=arm64 O=../kernel_out/
make[1]: Entering directory '/home/balakumaran/Desktop/RPi/kernel_out'
.
.
.

balakumaran@balakumaran-pc:~/Desktop/RPi/linux-rpi-4.14.y$ make ARCH=arm64 O=../kernel_out/ dtbs
make[1]: Entering directory '/home/balakumaran/Desktop/RPi/kernel_out'
.
.
.

balakumaran@balakumaran-pc:~/Desktop/RPi/linux-rpi-4.14.y$ sudo cp ../kernel_out/arch/arm64/boot/Image /mnt/boot/kernel8.img
[sudo] password for balakumaran:
balakumaran@balakumaran-pc:~/Desktop/RPi/linux-rpi-4.14.y$ sudo make  ARCH=arm64 O=../kernel_out/ INSTALL_PATH=/mnt/boot/ dtbs_install
make[1]: Entering directory '/home/balakumaran/Desktop/RPi/kernel_out'
arch/arm64/Makefile:27: ld does not support --fix-cortex-a53-843419; kernel may be susceptible to erratum
arch/arm64/Makefile:40: LSE atomics not supported by binutils
arch/arm64/Makefile:48: Detected assembler with broken .inst; disassembly will be unreliable
make[3]: Nothing to be done for '__dtbs_install'.
  INSTALL arch/arm64/boot/dts/al/alpine-v2-evp.dtb
.
.
.

Create cmdline.txt and config.txt.

balakumaran@balakumaran-pc:~/Desktop/RPi/linux-rpi-4.14.y$ cat /mnt/boot/cmdline.txt
dwc_otg.lpm_enable=0 console=serial0,115200 root=/dev/mmcblk0p2 rootfstype=ext4 rootwait
balakumaran@balakumaran-pc:~/Desktop/RPi/linux-rpi-4.14.y$ cat /mnt/boot/config.txt
dtoverlay=pi3-disable-bt
disable_overscan=1
dtparam=audio=on
device_tree=dtbs/4.14.98-v8+/broadcom/bcm2710-rpi-3-b.dtb
overlay_prefix=dtbs/4.14.98-v8+/overlays/
enable_uart=1
balakumaran@balakumaran-pc:~/Desktop/RPi/linux-rpi-4.14.y$

Prepare rootfs

I'm going to use ubuntu-base images with some additional modifications as the rootfs here. Find ubuntu-base releases at http://cdimage.ubuntu.com/ubuntu-base/releases/. The latest stable release is always better. Download and extract the ubuntu-base rootfs. Install the kernel modules into the extracted rootfs.

balakumaran@balakumaran-pc:~/Desktop/RPi/linux-rpi-4.14.y$ sudo make  ARCH=arm64 O=../kernel_out/ INSTALL_MOD_PATH=$HOME/ubuntu-base/ modules_install
[sudo] password for balakumaran:
make[1]: Entering directory '/home/balakumaran/Desktop/RPi/kernel_out'
arch/arm64/Makefile:27: ld does not support --fix-cortex-a53-843419; kernel may be susceptible to erratum
arch/arm64/Makefile:40: LSE atomics not supported by binutils
arch/arm64/Makefile:48: Detected assembler with broken .inst; disassembly will be unreliable
  INSTALL arch/arm64/crypto/aes-neon-blk.ko
.
.
.

Copy your resolv.conf for network access.

$ sudo cp -av /run/systemd/resolve/stub-resolv.conf $HOME/rootfs/etc/resolv.conf

Let's chroot into the new rootfs and install the necessary packages. But it's an arm64 rootfs, so you need qemu user-mode emulation. Install qemu-user-static on your host Ubuntu and copy it to the new rootfs. Then chroot will work.

$ sudo apt install qemu-user-static
.
.
.

$ sudo cp /usr/bin/qemu-aarch64-static $HOME/rootfs/usr/bin/
$ sudo chroot $HOME/rootfs/

Change the root user's password and install the necessary packages. As these binaries run under an emulator, they will be a bit slower. It's just a one-time step.

$ passwd root
$ apt-get update
$ apt-get upgrade
$ apt-get install sudo ifupdown net-tools ethtool udev wireless-tools iputils-ping resolvconf wget apt-utils wpasupplicant kmod systemd vim

NOTE: If you face an error like cannot create key file at /tmp/, change the permissions of /tmp.

$ chmod 1777 /tmp

Download the Raspberry Pi firmware-nonfree package from the Raspberry Pi repository, extract the wireless firmware and copy it to the rootfs. Refer to this answer for more details. As I have an RPi 3B board, I copied brcmfmac43430-sdio.bin and brcmfmac43430-sdio.txt to lib/firmware/brcm.

$ mkdir -p $HOME/rootfs/lib/firmware/brcm/
$ cp brcmfmac43430-sdio.txt brcmfmac43430-sdio.bin $HOME/rootfs/lib/firmware/brcm/

Edit etc/fstab, or the rootfs will be mounted read-only.

echo "/dev/mmcblk0p2    /   ext4    defaults,noatime    0   1" >> $HOME/rootfs/etc/fstab
I referred this link for rootfs preparation. Though I'm not using, there are steps to remove unwanted files explained.

Reference

  • https://a-delacruz.github.io/ubuntu/rpi3-setup-64bit-kernel
  • https://a-delacruz.github.io/ubuntu/rpi3-setup-filesystem.html
  • https://www.linuxquestions.org/questions/slackware-arm-108/raspberry-pi-3-b-wifi-nic-not-found-4175627137/#post5840054
  • http://cdimage.ubuntu.com/ubuntu-base/releases/
  • https://raspberrypi.stackexchange.com/questions/61319/how-to-add-wifi-drivers-in-custom-kernel

Appendix

Command (m for help): p
Disk /dev/sdc: 29.7 GiB, 31914983424 bytes, 62333952 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xcde890ba

Device     Boot  Start      End  Sectors  Size Id Type
/dev/sdc1  *      2048   133119   131072   64M  b W95 FAT32
/dev/sdc2       133120 62333951 62200832 29.7G 83 Linux

Command (m for help): d
Partition number (1,2, default 2): 2

Partition 2 has been deleted.

Command (m for help): d
Selected partition 1
Partition 1 has been deleted.

Command (m for help): p
Disk /dev/sdc: 29.7 GiB, 31914983424 bytes, 62333952 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xcde890ba

Command (m for help): n
Partition type
   p   primary (0 primary, 0 extended, 4 free)
   e   extended (container for logical partitions)
Select (default p): p
Partition number (1-4, default 1):
First sector (2048-62333951, default 2048):
Last sector, +sectors or +size{K,M,G,T,P} (2048-62333951, default 62333951): +64M

Created a new partition 1 of type 'Linux' and of size 64 MiB.
Partition #1 contains a vfat signature.

Do you want to remove the signature? [Y]es/[N]o: Y

The signature will be removed by a write command.

Command (m for help): n
Partition type
   p   primary (1 primary, 0 extended, 3 free)
   e   extended (container for logical partitions)
Select (default p): p
Partition number (2-4, default 2):
First sector (133120-62333951, default 133120):
Last sector, +sectors or +size{K,M,G,T,P} (133120-62333951, default 62333951):

Created a new partition 2 of type 'Linux' and of size 29.7 GiB.
Partition #2 contains a ext4 signature.

Do you want to remove the signature? [Y]es/[N]o: Y

The signature will be removed by a write command.

Command (m for help): t
Partition number (1,2, default 2): 1
Hex code (type L to list all codes): L

 0  Empty           24  NEC DOS         81  Minix / old Lin bf  Solaris
 1  FAT12           27  Hidden NTFS Win 82  Linux swap / So c1  DRDOS/sec (FAT-
 2  XENIX root      39  Plan 9          83  Linux           c4  DRDOS/sec (FAT-
 3  XENIX usr       3c  PartitionMagic  84  OS/2 hidden or  c6  DRDOS/sec (FAT-
 4  FAT16 <32M      40  Venix 80286     85  Linux extended  c7  Syrinx
 5  Extended        41  PPC PReP Boot   86  NTFS volume set da  Non-FS data
 6  FAT16           42  SFS             87  NTFS volume set db  CP/M / CTOS / .
 7  HPFS/NTFS/exFAT 4d  QNX4.x          88  Linux plaintext de  Dell Utility
 8  AIX             4e  QNX4.x 2nd part 8e  Linux LVM       df  BootIt
 9  AIX bootable    4f  QNX4.x 3rd part 93  Amoeba          e1  DOS access
 a  OS/2 Boot Manag 50  OnTrack DM      94  Amoeba BBT      e3  DOS R/O
 b  W95 FAT32       51  OnTrack DM6 Aux 9f  BSD/OS          e4  SpeedStor
 c  W95 FAT32 (LBA) 52  CP/M            a0  IBM Thinkpad hi ea  Rufus alignment
 e  W95 FAT16 (LBA) 53  OnTrack DM6 Aux a5  FreeBSD         eb  BeOS fs
 f  W95 Ext'd (LBA) 54  OnTrackDM6      a6  OpenBSD         ee  GPT
10  OPUS            55  EZ-Drive        a7  NeXTSTEP        ef  EFI (FAT-12/16/
11  Hidden FAT12    56  Golden Bow      a8  Darwin UFS      f0  Linux/PA-RISC b
12  Compaq diagnost 5c  Priam Edisk     a9  NetBSD          f1  SpeedStor
14  Hidden FAT16 <3 61  SpeedStor       ab  Darwin boot     f4  SpeedStor
16  Hidden FAT16    63  GNU HURD or Sys af  HFS / HFS+      f2  DOS secondary
17  Hidden HPFS/NTF 64  Novell Netware  b7  BSDI fs         fb  VMware VMFS
18  AST SmartSleep  65  Novell Netware  b8  BSDI swap       fc  VMware VMKCORE
1b  Hidden W95 FAT3 70  DiskSecure Mult bb  Boot Wizard hid fd  Linux raid auto
1c  Hidden W95 FAT3 75  PC/IX           bc  Acronis FAT32 L fe  LANstep
1e  Hidden W95 FAT1 80  Old Minix       be  Solaris boot    ff  BBT
Hex code (type L to list all codes): b

Changed type of partition 'Linux' to 'W95 FAT32'.

Command (m for help): t
Partition number (1,2, default 2): 2
Hex code (type L to list all codes): 83

Changed type of partition 'Linux' to 'Linux'.

Command (m for help): w
The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.

balakumaran@balakumaran-pc:/media/balakumaran$
balakumaran@balakumaran-pc:/media/balakumaran$ sudo fdisk -l /dev/sdc
Disk /dev/sdc: 29.7 GiB, 31914983424 bytes, 62333952 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xcde890ba

Device     Boot  Start      End  Sectors  Size Id Type
/dev/sdc1         2048   133119   131072   64M  b W95 FAT32
/dev/sdc2       133120 62333951 62200832 29.7G 83 Linux
balakumaran@balakumaran-pc:/media/balakumaran$
balakumaran@balakumaran-pc:/media/balakumaran$ sudo mkfs.fat /dev/sdc1
mkfs.fat 4.1 (2017-01-24)
balakumaran@balakumaran-pc:/media/balakumaran$ sudo mkfs.ext4 /dev/sdc2
mke2fs 1.44.4 (18-Aug-2018)
Creating filesystem with 7775104 4k blocks and 1945888 inodes
Filesystem UUID: 5815d093-6381-4db7-b692-32192b24cf9c
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

balakumaran@balakumaran-pc:/media/balakumaran$
balakumaran@balakumaran-pc:/media/balakumaran$ sudo fdisk /dev/sdc

Welcome to fdisk (util-linux 2.32).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.


Command (m for help): a
Partition number (1,2, default 2): 1

The bootable flag on partition 1 is enabled now.

Command (m for help): p
Disk /dev/sdc: 29.7 GiB, 31914983424 bytes, 62333952 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xcde890ba

Device     Boot  Start      End  Sectors  Size Id Type
/dev/sdc1  *      2048   133119   131072   64M  b W95 FAT32
/dev/sdc2       133120 62333951 62200832 29.7G 83 Linux

Command (m for help): w
The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.

balakumaran@balakumaran-pc:/media/balakumaran$

64-bit Mainline kernel on Raspberry Pi 3

I struggled a little recently to run a vanilla kernel on the Raspberry Pi 3. I still don't completely understand the internals. Anyway, sharing the steps may be useful for someone like me.

Download toolchain and rootfs from Linaro.

And clone the following repos:

  • Vanilla kernel
  • Raspberry Pi kernel - Check out the same version as the vanilla kernel you are going to use
  • Raspberry Pi firmware - Or download only the files under the boot directory of this repo

I've created a directory structure as below. You can have similar one based on your convenience.

$ ls ~/Desktop/kernel/
total 44K
drwxr-xr-x  2 kaba kaba 4.0K Sep 23 20:04 downloads
drwxrwxr-x  2 kaba kaba 4.0K Oct  4 10:22 firmware
drwxr-xr-x 22 kaba kaba 4.0K Oct  6 11:55 kernel_out
drwxr-xr-x 18 kaba kaba 4.0K Sep 12  2013 rootfs
drwxr-xr-x 26 kaba kaba 4.0K Oct  3 21:30 rpi_kernel
drwxr-xr-x  2 kaba kaba 4.0K Oct  7 12:13 rpi_out
drwxr-xr-x  3 kaba kaba 4.0K Sep 23 19:43 toolchain
drwxr-xr-x 26 kaba kaba 4.0K Oct  3 22:04 vanila_kernel
kaba@kaba-Vostro-1550:~/Desktop/kernel
$

| Directory     | Purpose                                      |
|---------------|----------------------------------------------|
| downloads     | Tarballs of rootfs and toolchain             |
| firmware      | boot directory of Raspberry Pi firmware repo |
| kernel_out    | Output directory for Mainline kernel         |
| rootfs        | rootfs tarball extracted                     |
| rpi_kernel    | Raspberry Pi kernel repo                     |
| rpi_out       | Output directory for Raspberry Pi kernel     |
| toolchain     | toolchain tarball extracted                  |
| vanila_kernel | Mainline kernel repo                         |

Export PATH variable to include toolchain directory.

$ echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/kaba/Desktop/kernel/toolchain/gcc-linaro-7.3.1-2018.05-x86_64_aarch64-linux-gnu/bin/
kaba@kaba-Vostro-1550:~/Desktop/kernel/vanila_kernel
$

Configure and build 64-bit Vanilla kernel.

kaba@kaba-Vostro-1550:~/Desktop/kernel/vanila_kernel
$ make O=../kernel_out ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- defconfig
.
.
.
kaba@kaba-Vostro-1550:~/Desktop/kernel/vanila_kernel
$ make -j4 O=../kernel_out ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu-

Change the number after -j to match the number of CPU cores on your machine, and wait for the build to complete.

Now build device-tree in Raspberry Pi kernel repo.

kaba@kaba-Vostro-1550:~/Desktop/kernel/rpi_kernel
$ make O=../rpi_out ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- defconfig
make[1]: Entering directory '/home/kaba/Desktop/kernel/rpi_out'
  HOSTCC  scripts/basic/fixdep
  GEN     ./Makefile
  HOSTCC  scripts/kconfig/conf.o
  YACC    scripts/kconfig/zconf.tab.c
  LEX     scripts/kconfig/zconf.lex.c
  HOSTCC  scripts/kconfig/zconf.tab.o
  HOSTLD  scripts/kconfig/conf
*** Default configuration is based on 'defconfig'
#
# configuration written to .config
#
make[1]: Leaving directory '/home/kaba/Desktop/kernel/rpi_out'
kaba@kaba-Vostro-1550:~/Desktop/kernel/rpi_kernel
$ make O=../rpi_out ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- dtbs
.
.
.

Partition your memory card into two. The first partition should be FAT32 and the second one should be ext4. The first partition should also be flagged as the boot partition.

balakumaran@balakumaran-USB:~/Desktop/RPi/linux_build$ sudo parted /dev/sdd
[sudo] password for balakumaran: 
GNU Parted 3.2
Using /dev/sdd
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) print                                                            
Model: MXT-USB Storage Device (scsi)
Disk /dev/sdd: 31.9GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags: 

Number  Start   End     Size    Type     File system  Flags
 1      1049kB  106MB   105MB   primary  fat32        boot, lba
 2      106MB   31.9GB  31.8GB  primary

(parted) help rm                                                          
  rm NUMBER                                delete partition NUMBER

        NUMBER is the partition number used by Linux.  On MS-DOS disk labels, the primary partitions number from 1 to 4, logical partitions from 5 onwards.
(parted) rm 1                                                             
(parted) rm 2                                                             
(parted) print
Model: MXT-USB Storage Device (scsi)
Disk /dev/sdd: 31.9GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags: 

Number  Start  End  Size  Type  File system  Flags

(parted) help mkpart
  mkpart PART-TYPE [FS-TYPE] START END     make a partition

        PART-TYPE is one of: primary, logical, extended
        FS-TYPE is one of: zfs, btrfs, nilfs2, ext4, ext3, ext2, fat32, fat16, hfsx, hfs+, hfs, jfs, swsusp, linux-swap(v1), linux-swap(v0), ntfs, reiserfs, freebsd-ufs, hp-ufs, sun-ufs,
        xfs, apfs2, apfs1, asfs, amufs5, amufs4, amufs3, amufs2, amufs1, amufs0, amufs, affs7, affs6, affs5, affs4, affs3, affs2, affs1, affs0, linux-swap, linux-swap(new), linux-swap(old)
        START and END are disk locations, such as 4GB or 10%.  Negative values count from the end of the disk.  For example, -1s specifies exactly the last sector.

        'mkpart' makes a partition without creating a new file system on the partition.  FS-TYPE may be specified to set an appropriate partition ID.
(parted) mkpart primary fat32 2048s 206848s
(parted) mkpart primary ext4 208896s -1s
(parted) print                                                            
Model: MXT-USB Storage Device (scsi)
Disk /dev/sdd: 31.9GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags: 

Number  Start   End     Size    Type     File system  Flags
 1      1049kB  106MB   105MB   primary  fat32        lba
 2      107MB   31.9GB  31.8GB  primary  ext4         lba

(parted) set 1 boot on                                                    
(parted) print                                                            
Model: MXT-USB Storage Device (scsi)
Disk /dev/sdd: 31.9GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags: 

Number  Start   End     Size    Type     File system  Flags
 1      1049kB  106MB   105MB   primary  fat32        boot, lba
 2      107MB   31.9GB  31.8GB  primary  ext4         lba

(parted) quit                                                             
Information: You may need to update /etc/fstab.

balakumaran@balakumaran-USB:~/Desktop/RPi/linux_build$ sudo fdisk -l /dev/sdd
Disk /dev/sdd: 29.7 GiB, 31914983424 bytes, 62333952 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xcde890ba

Device     Boot  Start      End  Sectors  Size Id Type
/dev/sdd1  *      2048   206848   204801  100M  c W95 FAT32 (LBA)
/dev/sdd2       208896 62333951 62125056 29.6G 83 Linux
balakumaran@balakumaran-USB:~/Desktop/RPi/linux_build$

Copy firmware and kernel to boot partition.

kaba@kaba-Vostro-1550:~/Desktop/kernel/vanila_kernel
$ sudo mount /dev/sdb1 /mnt/boot/
kaba@kaba-Vostro-1550:~/Desktop/kernel/vanila_kernel
$ sudo cp ../kernel_out/arch/arm64/boot/Image /mnt/boot/kernel8.img
[sudo] password for kaba: 
kaba@kaba-Vostro-1550:~/Desktop/kernel/vanila_kernel
$ sudo cp ../firmware/* /mnt/boot/
kaba@kaba-Vostro-1550:~/Desktop/kernel/vanila_kernel
$

Install the device-tree blobs from the Raspberry Pi repo into the boot partition. The device tree in the upstream kernel did not work for me, and I couldn't find out why.

kaba@kaba-Vostro-1550:~/Desktop/kernel/rpi_kernel
$ make O=../rpi_out ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- bcmrpi3_defconfig
kaba@kaba-Vostro-1550:~/Desktop/kernel/rpi_kernel
$ sudo make O=../rpi_out ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- INSTALL_PATH=/mnt/boot/ dtbs_install

Copy rootfs into second partition. Also install kernel modules into that.

kaba@kaba-Vostro-1550:~/Desktop/kernel/vanila_kernel
$ sudo mount /dev/sdb2 /mnt/rootfs/
kaba@kaba-Vostro-1550:~/Desktop/kernel/vanila_kernel
$ sudo cp -rf ../rootfs/* /mnt/rootfs/
kaba@kaba-Vostro-1550:~/Desktop/kernel/vanila_kernel
$ sudo make O=../kernel_out ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- INSTALL_MOD_PATH=/mnt/rootfs/ modules_install
kaba@kaba-Vostro-1550:~/Desktop/kernel/vanila_kernel
$ sync

Create config.txt and cmdline.txt as follows. Make sure you update device_tree and overlay_prefix based on your configuration.

kaba@kaba-Vostro-1550:/mnt/boot
$ cat cmdline.txt 
dwc_otg.lpm_enable=0 console=serial0,115200 root=/dev/mmcblk0p2 rootfstype=ext4 rootwait    
kaba@kaba-Vostro-1550:/mnt/boot
$ cat config.txt 
dtoverlay=vc4-fkms-v3d,cma-256
disable_overscan=1
dtparam=audio=on
device_tree=dtbs/4.19.0-rc5-v8+/broadcom/bcm2710-rpi-3-b.dtb
overlay_prefix=dtbs/4.19.0-rc5-v8+/overlays/
enable_uart=1
kaba@kaba-Vostro-1550:/mnt/boot
$

Put the SD card in Raspberry Pi 3 and boot.

The Volatile keyword

Recently I interviewed some candidates for entry and intermediate level positions. One of the questions most of them struggled with was about the volatile keyword. Some conversations went like this:

  • Q: Why do we use the volatile keyword?
  • A: It tells the compiler not to use any registers for the volatile variable.
  • Q: Then how will it work on an ARM processor? In ARM, no instruction other than load and store can access a memory location.
  • A: ??!!

  • Q: What is the purpose of the volatile keyword?
  • A: We use it for IO memory.
  • Q: Why do we need it for IO memory?
  • A: So that every time the processor accesses the memory, it goes to the IO device.
  • Q: So volatile tells the processor not to cache data?
  • A: Yes
  • Q: Then volatile is a processor directive, not a compiler directive?
  • And the confusion starts

In this post, let's see how volatile works with two simple C programs. In complex programs with multiple variables and loops, the volatile keyword can make a significant difference in speed and memory usage.

GCC provides many compiler optimization flags. Enabling them makes GCC optimize the code aggressively and gives better performance in terms of speed and memory footprint. As these optimizations make debugging harder, they are not suitable for development. All available GCC optimization flags can be listed with the following command.

$ $CC --help=optimizers
The following options control optimizations:
  -O<number>                  Set optimization level to <number>.
  -Ofast                      Optimize for speed disregarding exact standards compliance.
  -Og                         Optimize for debugging experience rather than speed or size.
  -Os                         Optimize for space rather than speed.
  -faggressive-loop-optimizations Aggressively optimize loops using language constraints.
.
.
.

For simplicity, I used only the -Ofast optimization level for the examples. It tells GCC to do its best to make the program run faster. We'll see what the compiler generates with and without volatile. GCC emits assembly code with the -S option.

Take the following C program.

#include <stdio.h>

int main() {
    int *x = (int *)0xc000;
    int b = *x;
    int c = *x;
    *x = b + c;
    return 0;
}

Don't worry about dereferencing a random virtual address. We are not going to run this program; we just generate the assembly code and examine it manually. I use pointers in these programs because using an immediate value makes no sense with volatile. We have an integer pointer x that points to address 0xc000. We initialize two variables b and c with the value at address 0xc000. Then the sum of b and c is stored at location 0xc000. So the program reads the value at location 0xc000 twice. Let's see how GCC compiles it for ARMv8.

$ echo $CC
aarch64-poky-linux-gcc --sysroot=/opt/poky/2.4.2/sysroots/aarch64-poky-linux
kaba@kaba-Vostro-1550:~/Desktop/volatile/single_varuable_two_reads
$ $CC -S -Ofast ./code.c -o code.S
kaba@kaba-Vostro-1550:~/Desktop/volatile/single_varuable_two_reads
$

    .arch armv8-a
    .file   "code.c"
    .text
    .section    .text.startup,"ax",@progbits
    .align  2
    .p2align 3,,7
    .global main
    .type   main, %function
main:
    mov x2, 49152
    mov w0, 0
    ldr w1, [x2]
    lsl w1, w1, 1
    str w1, [x2]
    ret
    .size   main, .-main
    .ident  "GCC: (GNU) 7.3.0"
    .section    .note.GNU-stack,"",@progbits

The compiler intelligently finds that variables b and c hold the same value from address 0xc000, and that adding them is equivalent to multiplying the value at 0xc000 by two. So it loads the value into register W1, left shifts it by 1 (equivalent to multiplying by two) and then stores the new value at location 0xc000.

Now let's change the code so that x points to a volatile int, and see how the assembly code looks.

#include <stdio.h>

int main() {
    volatile int *x = (int *)0xc000;
    int b = *x;
    int c = *x;
    *x = b + c;
    return 0;
}

    .arch armv8-a
    .file   "code.c"
    .text
    .section    .text.startup,"ax",@progbits
    .align  2
    .p2align 3,,7
    .global main
    .type   main, %function
main:
    mov x1, 49152
    mov w0, 0
    ldr w2, [x1]
    ldr w3, [x1]
    add w2, w2, w3
    str w2, [x1]
    ret
    .size   main, .-main
    .ident  "GCC: (GNU) 7.3.0"
    .section    .note.GNU-stack,"",@progbits

This time the compiler assumes that the value at location 0xc000 may be different each time it is read, so the variables b and c could be initialized with different values. It therefore reads location 0xc000 twice and adds the two values.
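
To repeat this experiment on a host machine without hard-coding an address, the same two reads can be wrapped in functions that take the pointer as a parameter. This is only a minimal sketch: the function names double_plain and double_volatile are hypothetical and do not appear in the original programs.

```c
#include <assert.h>
#include <stddef.h>

/* Non-volatile: the compiler is free to read *x once and reuse the value. */
int double_plain(int *x)
{
    assert(x != NULL);
    int b = *x;
    int c = *x;
    int sum = b + c;
    *x = sum;
    return sum;
}

/* Volatile: both reads and the write must appear in the generated code. */
int double_volatile(volatile int *x)
{
    assert(x != NULL);
    int b = *x;
    int c = *x;
    int sum = b + c;
    *x = sum;
    return sum;
}
```

Both functions return the same result when nothing else changes the memory in between. Compiling the file with -S -Ofast -DNDEBUG (NDEBUG strips the assert checks) should show the difference in the number of loads, although the exact registers and instructions depend on the compiler version.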

Let's look at a simple loop case.

#include <stdio.h>

int main() {
    int *x = (int *)0xc000;
    int *y = (int *)0xd000;
    int sum = 0;
    for (int i = 0; i < *y; i++) {
        sum = sum + *x;
    }
    *x = sum;
    return 0;
}

This program initializes two pointers x and y to locations 0xc000 and 0xd000 respectively. It adds up the value at x as many times as the value at y says. Let's see how GCC sees it.

    .arch armv8-a
    .file   "code.c"
    .text
    .section    .text.startup,"ax",@progbits
    .align  2
    .p2align 3,,7
    .global main
    .type   main, %function
main:
    mov x0, 53248
    ldr w0, [x0]
    cmp w0, 0
    ble .L3
    mov x1, 49152
    ldr w1, [x1]
    mul w1, w0, w1
.L2:
    mov x2, 49152
    mov w0, 0
    str w1, [x2]
    ret
.L3:
    mov w1, 0
    b   .L2
    .size   main, .-main
    .ident  "GCC: (GNU) 7.3.0"
    .section    .note.GNU-stack,"",@progbits

The compiler first puts the address held in y (0xd000, decimal 53248) in X0 and loads the value stored there into W0. It compares that value with zero. If it is less than or equal to zero, the program branches to .L3, which sets W1 to zero and jumps to .L2. Otherwise it puts the address held in x (0xc000) in X1, loads the value stored there into W1 and multiplies it by W0. .L2 stores W1 at the address in X2 (0xc000 again) and returns. The compiler intelligently identifies that adding the value at 0xc000 to itself as many times as the value at 0xd000 is equivalent to multiplying the two.

With volatile,

#include <stdio.h>

int main() {
    volatile int *x = (int *)0xc000;
    int *y = (int *)0xd000;
    int sum = 0;
    for (int i = 0; i < *y; i++) {
        sum = sum + *x;
    }
    *x = sum;
    return 0;
}

the corresponding assembly code is

    .arch armv8-a
    .file   "code.c"
    .text
    .section    .text.startup,"ax",@progbits
    .align  2
    .p2align 3,,7
    .global main
    .type   main, %function
main:
    mov x0, 53248
    ldr w3, [x0]
    cmp w3, 0
    ble .L4
    mov w0, 0
    mov w1, 0
    mov x4, 49152
    .p2align 2
.L3:
    ldr w2, [x4]
    add w0, w0, 1
    cmp w0, w3
    add w1, w1, w2
    bne .L3
.L2:
    mov x2, 49152
    mov w0, 0
    str w1, [x2]
    ret
.L4:
    mov w1, 0
    b   .L2
    .size   main, .-main
    .ident  "GCC: (GNU) 7.3.0"
    .section    .note.GNU-stack,"",@progbits

This time GCC uses X4 for the address 0xc000, but that is not significant for our problem. Look at the loop at .L3. It loads the value at the address in X4 every time the loop runs, which is different from the non-volatile behaviour. This time the compiler assumes the value at that address may be different each time it is read. So, without making any assumption, it adds the freshly loaded value to sum on every iteration.
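
The loop case can be tried on a host machine in the same way. The sketch below is a simplification under stated assumptions: the loop bound is passed as a plain count parameter instead of being read through y, and the names sum_plain and sum_volatile are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Non-volatile source: the compiler may load *x once and
 * replace the whole loop with a multiplication. */
int sum_plain(const int *x, int count)
{
    assert(x != NULL);
    int sum = 0;
    for (int i = 0; i < count; i++)
        sum += *x;
    return sum;
}

/* Volatile source: one load per iteration has to be emitted,
 * because each read may observe a different value. */
int sum_volatile(const volatile int *x, int count)
{
    assert(x != NULL);
    int sum = 0;
    for (int i = 0; i < count; i++)
        sum += *x;
    return sum;
}
```

As long as the memory does not change during the loop, both functions return count times the stored value. Only the generated code differs.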

In both programs the value at location 0xc000 can be cached by the processor. A subsequent read of 0xc000 may be served from the processor's cache rather than from main memory. Keeping the cache coherent with memory is the job of the hardware, and whether an address range is cacheable at all is decided by the memory attributes the operating system sets up for it. The volatile keyword has nothing to do with either.

I believe these simple programs explain the concept clearly. The volatile keyword

IS
  • To tell compiler not to make any assumption about the value stored in the variable
IS NOT
  • To tell the compiler not to use any registers to hold the value
  • To tell the processor not to cache the value
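
A place outside IO memory where the "no assumption about the value" rule matters is a flag that is set from a signal handler. The sketch below is an illustration only, not something taken from the programs above: the names got_signal, on_signal, watch_signal and signal_seen are hypothetical, while signal() and sig_atomic_t come from the standard <signal.h> header.

```c
#include <assert.h>
#include <signal.h>

/* Written by the signal handler, read by normal code. */
static volatile sig_atomic_t got_signal = 0;

static void on_signal(int signo)
{
    (void)signo;
    got_signal = 1;
}

/* Clears the flag and installs the handler.
 * Returns 0 on success, -1 on failure. */
int watch_signal(int signo)
{
    assert(signo > 0);
    got_signal = 0;
    return signal(signo, on_signal) == SIG_ERR ? -1 : 0;
}

/* Because got_signal is volatile, this read cannot be replaced by a
 * value the compiler remembered from an earlier read. */
int signal_seen(void)
{
    return got_signal != 0;
}
```

A caller would run watch_signal(SIGINT) once and then check signal_seen() in its main loop. The flag may still live in a register for a moment and may still be cached by the processor; volatile only stops the compiler from assuming that it has not changed.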