March 2018

The Linux COW

In our last post we saw how to get a physical memory address from a virtual memory address. In this post let's look at copy-on-write (COW from here on) as implemented by Linux. As a reminder: on fork(), Linux does not immediately replicate the address space of the parent process. The child process is most likely to load a new binary, so copying the current process's address space would be wasted work in many cases. Until a write occurs (by either the parent or the child), the virtual pages of both processes map to the same physical pages. Please go through Wikipedia if you still have no clue.
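Before looking at physical addresses, here is a minimal sketch of what COW guarantees as seen from user space: a write in the child never shows up in the parent. The helper name parent_value_after_child_write() is made up for this illustration, and the sketch only shows the private-copy behaviour, not the moment the page is actually copied.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child that overwrites `value`, then report
 * what the parent still sees. Returns the parent's value, or -1 on error.
 */
int parent_value_after_child_write(void)
{
    /* volatile so that the compiler really performs the store */
    volatile int value = 1;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* A write to a still-shared page makes the kernel copy that page
         * for the writer, so this store lands in the child's own copy. */
        value = 42;
        _exit(value == 42 ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    /* The parent's copy is untouched by the child's write. */
    return value;
}
```

Calling parent_value_after_child_write() from main() and printing the result should show that the parent keeps its original value even though the child changed its own copy.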

So let's write a C program that forks itself and check whether both processes share the same physical memory. We are going to reuse the get_paddr program we wrote in our previous post to get the physical memory address.

#include <stdio.h>
#include <unistd.h> /* fork() */

int main()
{
    int a;

    fork();

    printf("my pid is %d, and the address to work is 0x%lx\n", getpid(), (unsigned long)&a);
    scanf("%d\n", &a);

    return 0;
}

Execute this program in one console and, while it is waiting for user input, go to another console and get the physical address of the variable a using our get_paddr binary.

$ gcc specimen.c -o specimen
$ ./specimen
my pid is 5912, and the address to work is 0x7ffd055923e4
my pid is 5913, and the address to work is 0x7ffd055923e4
$ sudo ./get_paddr 5912 0x7ffd055923e4
getting page number of virtual address 140724693181412 of process 5912
opening pagemap /proc/5912/pagemap
moving to 274852916368
physical frame address is 0x509c9
physical address is 0x509c93e4
$
$ sudo ./get_paddr 5913 0x7ffd055923e4
getting page number of virtual address 140724693181412 of process 5913
opening pagemap /proc/5913/pagemap
moving to 274852916368
physical frame address is 0x64d2a
physical address is 0x64d2a3e4
$

OOPS! The physical addresses are not the same! How? Why? What's wrong? As per my beloved Robert Love's Linux System Programming book, the physical memory copy should occur only when a write occurs.

"The MMU can intercept the write operation and raise an exception; the kernel, in response, will transparently create a new copy of the page for the writing process, and allow the write to continue against the new page. We call this approach copy-on-write (COW). Effectively, processes are allowed read access to shared data, which saves space. But when a process wants to write to a shared page, it receives a unique copy of that page on the fly, thereby allowing the kernel to act as if the process always had its own private copy. As copy-on-write occurs on a page-by-page basis, with the technique a huge file may be efficiently shared among many processes, and the individual processes will receive unique physical pages only for those pages to which they themselves write." (Robert Love, Linux System Programming)

It is clear that our program specimen.c never writes to its only variable a until scanf completes, and we tested COW before giving scanf any input. So at that point there should have been no write, and by the design of COW the physical memory should still be shared by parent and child. Even though we have written the program carefully, sometimes the libraries, the compiler and even the OS act intelligently (stupid! They themselves say it) and fill us with surprises. So let's run the handy strace on our specimen binary to see what it actually does.

$ strace ./specimen
execve("./specimen", ["./specimen"], [/* 47 vars */]) = 0
brk(NULL)                               = 0x55de17b57000
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=94696, ...}) = 0
mmap(NULL, 94696, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fdbad5b8000
close(3)                                = 0
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\340\22\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1960656, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fdbad5b6000
mmap(NULL, 4061792, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fdbacfc9000
mprotect(0x7fdbad19f000, 2097152, PROT_NONE) = 0
mmap(0x7fdbad39f000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE,
        3, 0x1d6000) = 0x7fdbad39f000
mmap(0x7fdbad3a5000, 14944, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS,
        -1, 0) = 0x7fdbad3a5000
close(3)                                = 0
arch_prctl(ARCH_SET_FS, 0x7fdbad5b7500) = 0
mprotect(0x7fdbad39f000, 16384, PROT_READ) = 0
mprotect(0x55de15e94000, 4096, PROT_READ) = 0
mprotect(0x7fdbad5d0000, 4096, PROT_READ) = 0
munmap(0x7fdbad5b8000, 94696)           = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
        child_tidptr=0x7fdbad5b77d0) = 5932
getpid()                                = 5931
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
my pid is 5932, and the address to work is 0x7fff9823b814
brk(NULL)                               = 0x55de17b57000
brk(0x55de17b78000)                     = 0x55de17b78000
write(1, "my pid is 5931, and the address "..., 58my pid is 5931, and the address
        to work is 0x7fff9823b814
) = 58
fstat(0, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
read(0,

The program is waiting in scanf, that is, in the read system call on fd 0 (stdin), for user input. But the surprising part here is the clone system call. If you don't remember, read the specimen.c program once again. We never used the clone system call. And where is the fork? It seems glibc does something in a so-called intelligent way. Let's read the fork() man page once again for pointers.

C library/kernel differences
       Since  version  2.3.3, rather than invoking the kernel's fork() system call,
       the glibc fork() wrapper that is provided as part of the NPTL threading
       implementation invokes clone(2) with flags that provide the same effect as
       the traditional system call.  (A call to fork() is equivalent to a call to
       clone(2) specifying flags as just  SIGCHLD.) The glibc wrapper invokes any
       fork handlers that have been established using pthread_atfork(3).

That doesn't look like the whole truth. Though the man page says the call is equivalent to clone() with flags as just SIGCHLD, strace uncovers the dirty truth: flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD. Nobody escapes from an armed system programmer. Anyway, this explains the clone-but-no-fork surprise. But where does the write that causes COW come from? See the last argument of clone() in the strace output: child_tidptr=0x7fdbad5b77d0, a placeholder location passed in to be filled. And what about the two flags other than SIGCHLD? Let's go to the man pages once again, this time for clone().

    CLONE_CHILD_CLEARTID (since Linux 2.5.49)
        Clear (zero) the child thread ID at the location ctid in child memory when
        the child exits, and do a wakeup on the futex at that address.  The address
        involved  may be changed by the set_tid_address(2) system call.  This is
        used by threading libraries.

    CLONE_CHILD_SETTID (since Linux 2.5.49)
        Store the child thread ID at the location ctid in the child's memory.
        The store operation completes before clone() returns control to user space.

Ahh! The culprit is CLONE_CHILD_SETTID. It makes the kernel write the thread ID of the child into the child's memory. This write triggers COW and gets the child a fresh physical copy of the page it touches. Sorry Linux, I doubted you.

Okay. Let's modify our specimen to use clone() without the CLONE_CHILD_SETTID flag.

/* clone() */
#define _GNU_SOURCE
#include <sched.h>

#include <stdio.h>

/* getpid() */
#include <sys/types.h>
#include <unistd.h>

#include <signal.h> /* SIGCHLD */
#include <stdlib.h> /* malloc() */

int run(void *arg)
{
    printf("my pid is %d and address to check is 0x%lx\n", getpid(), (unsigned long) arg);
    scanf("%d\n", (int *)arg);

    return 0;
}

int main()
{
    int a;
    void *stack_ptr = malloc(1024*1024);
    if (stack_ptr == NULL) {
        printf("No virtual memory available\n");
        return -1;
    }
    /*fork();*/

    /* clone() returns the child's pid, or -1 on failure */
    if (clone(run, ((char *)stack_ptr + 1024*1024), CLONE_CHILD_CLEARTID|SIGCHLD, &a) < 0)
        perror("clone failed. ");

    run(&a);

    return 0;
}

Unlike fork(), clone() doesn't start the execution of the child from the point where it returns. Instead it takes a function as an argument and runs that function in the child process. Here we pass the run() function, which prints the pid and the virtual address of the variable a. After clone() the parent also calls run(), so that we get the pid of the parent process too.

Have you noticed the second argument of clone()? It is the stack on which the child is going to execute. Again unlike fork(), clone() allows parent and child to share resources such as memory, signal handlers, open file descriptors, etc. As both cannot run on the same stack, the parent process must allocate some space and give it to the child to use as its stack. Stacks grow downwards on all processors that run Linux (except the HP PA processors), so the child stack pointer should point to the topmost address of the memory region. That's why we pass the end of the malloced memory. Let's execute the program and see whether both processes share the same physical memory.
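The stack handling described above can be sketched in isolation. This is a minimal illustration, not the specimen itself: run_in_clone(), child_main() and CHILD_STACK_SIZE are names invented here, only SIGCHLD is passed as a flag, and the parent waits for the child so that the stack can be freed safely.

```c
#define _GNU_SOURCE
#include <sched.h>      /* clone() */
#include <signal.h>     /* SIGCHLD */
#include <stdlib.h>     /* malloc(), free() */
#include <sys/types.h>
#include <sys/wait.h>   /* waitpid() */

#define CHILD_STACK_SIZE (1024 * 1024)

/* Child entry point: its return value becomes the child's exit status. */
static int child_main(void *arg)
{
    return *(int *)arg;
}

/*
 * Hypothetical helper: run child_main() in a clone()d process and return
 * the child's exit status, or -1 on error.
 */
int run_in_clone(int code)
{
    char *stack = malloc(CHILD_STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* The stack grows downwards, so clone() gets the highest address. */
    pid_t pid = clone(child_main, stack + CHILD_STACK_SIZE, SIGCHLD, &code);
    if (pid < 0) {
        free(stack);
        return -1;
    }

    int status = 0;
    pid_t waited = waitpid(pid, &status, 0);
    free(stack);

    if (waited < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Because CLONE_VM is not set, the child works on a copy-on-write view of the parent's memory, so &code is valid in the child even though it points into the parent's stack.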

$ ./specimen
my pid is 3129 and address to check is 0x7fff396fdd6c
my pid is 3130 and address to check is 0x7fff396fdd6c
$ sudo ./get_paddr 3129 0x7fff396fdd6c
getting page number of virtual address 140734157020524 of process 3129
opening pagemap /proc/3129/pagemap
moving to 274871400424
physical frame address is 0x5e35a
physical address is 0x5e35ad6c
$
$ sudo ./get_paddr 3130 0x7fff396fdd6c
getting page number of virtual address 140734157020524 of process 3130
opening pagemap /proc/3130/pagemap
moving to 274871400424
physical frame address is 0x6a7cd
physical address is 0x6a7cdd6c
$

Different again! Linux can't be wrong. We should make sure we didn't mess up anything. Are we sure there will be no write to the stack after the clone() call?

  • The child's execution starts at the function run().
  • No variables are written until scanf() returns. Okay, we'll complete our examination before providing any input.
  • But the functions? OOPS! After clone(), the parent calls run() and then printf(), while the child calls printf() directly. All these function calls write to the stack, which triggers COW.

Let's come up with a different C program with no function calls after clone() in both parent and child.

/* clone() */
#define _GNU_SOURCE
#include <sched.h>

#include <stdio.h>

/* getpid() */
#include <sys/types.h>
#include <unistd.h>

#include <signal.h> /* SIGCHLD */
#include <stdlib.h> /* malloc() */

#define STACK_LENGTH (1024*1024)

int run(void *arg)
{
    while(*(int*)arg);

    return 0;
}

int main()
{
    int a = 1;
    void *stack_ptr = malloc(STACK_LENGTH);
    if (stack_ptr == NULL) {
        printf("No virtual memory available\n");
        return -1;
    }
    /*fork();*/

    printf("my pid is %d and address to check is 0x%lx\n", getpid(), (unsigned long) &a);
    /* clone() returns the child's pid, or -1 on failure */
    if (clone(run, ((char *)stack_ptr + STACK_LENGTH), CLONE_CHILD_CLEARTID|SIGCHLD, &a) < 0)
        perror("clone failed. ");

    while (a);

    return 0;
}

We are not printing anything after clone(), to avoid writing to the stack. So we have to assume that the child's pid is the parent's pid + 1. This assumption holds most of the time. We use an infinite while loop to pause both processes. Loops use jump instructions, so they will not cause any stack write. So this time we should see both processes using the same physical address. Let's see.

$ ./specimen
my pid is 3436 and address to check is 0x7fffa920c1ec
$ sudo ./get_paddr 3436 0x7fffa920c1ec
getting page number of virtual address 140736030884332 of process 3436
opening pagemap /proc/3436/pagemap
moving to 274875060320
physical frame address is 0x67337
physical address is 0x673371ec
$
$ sudo ./get_paddr 3437 0x7fffa920c1ec
getting page number of virtual address 140736030884332 of process 3437
opening pagemap /proc/3437/pagemap
moving to 274875060320
physical frame address is 0x67337
physical address is 0x673371ec

GREAT! At last we have conquered. We saw the evidence of our beloved COW. Now I can sleep peacefully.

For the pure engineer

Still, I feel the urge to make a stack write in the child process and see the physical address change. For those curious people out there who want to see things break, let's run it one more time. Change the run() function as follows and execute specimen.c again.

int run(void *arg)
{
    *(int*)arg = 10; //stack write
    while(*(int*)arg);

    return 0;
}

$ ./specimen
my pid is 3549 and address to check is 0x7ffdd8e1e9ac
$ sudo ./get_paddr 3549 0x7ffdd8e1e9ac
getting page number of virtual address 140728242137516 of process 3549
opening pagemap /proc/3549/pagemap
moving to 274859847920
physical frame address is 0x634e6
physical address is 0x634e69ac
$
$ sudo ./get_paddr 3550 0x7ffdd8e1e9ac
getting page number of virtual address 140728242137516 of process 3550
opening pagemap /proc/3550/pagemap
moving to 274859847920
physical frame address is 0x55976
physical address is 0x559769ac
$

Good Nightzzz...

Virtual memory to Physical memory

We all know that processes running in Linux act only in a virtual address space. So whenever a process wants to access some data (okay, a datum), it asks the CPU for a virtual address. The CPU in turn converts it into a physical address and fetches the data. It would be nice to have a program that converts a virtual address to a physical address, wouldn't it?

Linux has provided a proc interface called pagemap since 2.6.25, and it contains the information we want. Each process has its pagemap at /proc/p_id/pagemap. According to the Documentation, it is a binary file containing a sequence of 64-bit words, one for each virtual page of the process's address space. When the page is present in RAM, bits 0-54 (55 bits) of its word hold the physical page frame number (PFN). I think that's all we need. Shifting the PFN left by the page shift and adding the offset of a variable within its virtual page gives us the physical memory address.
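As a rough sketch of that word layout, the helpers below pick apart one 64-bit pagemap entry. The bit positions (PFN in bits 0-54, swapped in bit 62, present in bit 63) come from the pagemap documentation; the macro and function names are invented for this illustration.

```c
#include <stdint.h>

#define PM_PFN_MASK ((UINT64_C(1) << 55) - 1)  /* bits 0-54: PFN */
#define PM_SWAPPED  (UINT64_C(1) << 62)        /* bit 62: page is in swap */
#define PM_PRESENT  (UINT64_C(1) << 63)        /* bit 63: page is in RAM */

/* Returns 1 if the entry describes a page that is present in RAM. */
int pagemap_is_present(uint64_t entry)
{
    return (entry & PM_PRESENT) != 0;
}

/* Returns 1 if the entry describes a page that has been swapped out. */
int pagemap_is_swapped(uint64_t entry)
{
    return (entry & PM_SWAPPED) != 0;
}

/* Returns the PFN, or 0 when the page is not present (no PFN to report). */
uint64_t pagemap_pfn(uint64_t entry)
{
    if (!pagemap_is_present(entry))
        return 0;
    return entry & PM_PFN_MASK;
}
```

Checking the present bit first avoids treating swap information, which occupies the same low bits when the page is swapped out, as if it were a frame number.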

WARNING: Don't try to read the pagemap file directly. cat /proc/self/pagemap or vim /proc/p_id/pagemap is not going to return anytime soon.

We'll write a small C program and then try to get the physical address of a variable used in that program. As the PFN will be present only if the data has not been moved to swap, let's use mlock() to lock the memory in RAM.

#include <stdio.h>
#include <sys/mman.h>   /* for mlock() */
#include <stdlib.h>     /* for malloc() */
#include <string.h>     /* for memset() */

/* for getpid() */
#include <sys/types.h>
#include <unistd.h>

#define MEM_LENGTH 1024

int main()
{
    /* Allocate 1024 bytes in heap */
    char *ptr = NULL;
    ptr = malloc(MEM_LENGTH);
    if (!ptr) {
        perror("malloc fails. ");
        return -1;
    }

    /* obtain physical memory */
    memset(ptr, 1, MEM_LENGTH);

    /* lock the allocated memory in RAM */
    if (mlock(ptr, MEM_LENGTH) != 0) {
        perror("mlock fails. ");
        return -1;
    }

    /* print the pid and vaddr so that we can work on this process */
    printf("my pid: %d\n\n", getpid());
    printf("virtual address to work: 0x%lx\n", (unsigned long)ptr);

    /* make the program to wait for user input */
    scanf("%c", &ptr[16]);

    return 0;
}

Run the specimen.c program, get its p_id and start the dissection.

$ gcc specimen.c -o specimen
$ ./specimen
my pid: 11953

virtual address to work: 0x55cd75821260

On a 64-bit machine, the virtual address space runs from 0x00 to 2^64 - 1. First we have to calculate the virtual page number for the given virtual address (that is, find the virtual memory page on which the address resides). Then we multiply that by 8, as each virtual page has an 8-byte information word in the pagemap file.

#define PAGEMAP_LENGTH 8
page_size = getpagesize();
offset = (vaddr / page_size) * PAGEMAP_LENGTH;
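To make the arithmetic concrete, here is the same calculation as a small function, checked against the numbers from this post's own run. The function name pagemap_offset() and the macro PAGEMAP_ENTRY_BYTES are made up for this sketch, and a 4096-byte page size is assumed for the worked numbers.

```c
#include <stdint.h>

#define PAGEMAP_ENTRY_BYTES 8   /* one 64-bit word per virtual page */

/* Byte offset of the pagemap entry that describes the page holding vaddr. */
uint64_t pagemap_offset(uint64_t vaddr, uint64_t page_size)
{
    uint64_t virtual_page_number = vaddr / page_size;
    return virtual_page_number * PAGEMAP_ENTRY_BYTES;
}
```

With 4096-byte pages, the virtual address 0x55cd75821260 lies on virtual page 0x55cd75821, so the entry sits at byte 0x55cd75821 * 8 = 184259625224 of the pagemap file.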

Open the pagemap file and seek to that offset location.

pagemap = fopen(filename, "rb");
fseek(pagemap, (unsigned long)offset, SEEK_SET);

Now the cursor is on the first byte of the 64-bit word containing the information we need. According to the Documentation, bits 0-54 hold the physical page frame number (PFN). So read the lowest 7 bytes (56 bits; this works because x86 is little-endian) and mask off the most significant of those bits.

fread(&paddr, 1, (PAGEMAP_LENGTH-1), pagemap);
paddr = paddr & 0x7fffffffffffff;

This is the PFN. Shift it left by the page shift and add the offset of the virtual address within its virtual page to get the physical address of the memory.

offset = vaddr % page_size;
/* PAGE_SIZE = 1U << PAGE_SHIFT */
while (!((1UL << ++page_shift) & page_size));
paddr = (unsigned long)((unsigned long)paddr << page_shift) + offset;
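The same composition can be written without the shift loop, since shifting left by the page shift is just multiplying by the page size. This is a small sketch with an invented function name, physical_address(), and the numbers in the note below assume 4096-byte pages.

```c
#include <stdint.h>

/*
 * Combine a page frame number with the in-page offset of vaddr.
 * page_size must be the size of one page in bytes (a power of two).
 */
uint64_t physical_address(uint64_t pfn, uint64_t vaddr, uint64_t page_size)
{
    uint64_t page_base = pfn * page_size;       /* same as pfn << page_shift */
    uint64_t in_page_offset = vaddr % page_size;
    return page_base + in_page_offset;
}
```

For the run in this post, PFN 0x20508 and virtual address 0x55cd75821260 give 0x20508000 + 0x260 = 0x20508260.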

Here is the complete program.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>

#define PAGE_SHIFT 12
#define PAGEMAP_LENGTH 8

int main(int argc, char **argv)
{
    unsigned long vaddr, pid, paddr = 0, offset;
    char *endptr;
    FILE *pagemap;
    char filename[1024] = {0};
    int ret = -1;
    int page_size, page_shift = -1;

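    if (argc != 3) {
        fprintf(stderr, "usage: %s <pid> <virtual address in hex>\n", argv[0]);
        return ret;
    }

    page_size = getpagesize();
    pid = strtoul(argv[1], &endptr, 10);
    /* strtoul: a virtual address is unsigned and may not fit in a long */
    vaddr = strtoul(argv[2], &endptr, 16);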
    printf("getting page number of virtual address %lu of process %ld\n",vaddr, pid);

    sprintf(filename, "/proc/%ld/pagemap", pid);

    printf("opening pagemap %s\n", filename);
    pagemap = fopen(filename, "rb");
    if (!pagemap) {
        perror("can't open file. ");
        goto err;
    }

    offset = (vaddr / page_size) * PAGEMAP_LENGTH;
    printf("moving to %lu\n", offset);
    if (fseek(pagemap, (unsigned long)offset, SEEK_SET) != 0) {
        perror("fseek failed. ");
        goto err;
    }

    if (fread(&paddr, 1, (PAGEMAP_LENGTH-1), pagemap) < (PAGEMAP_LENGTH-1)) {
        perror("fread fails. ");
        goto err;
    }
    paddr = paddr & 0x7fffffffffffff;
    printf("physical frame address is 0x%lx\n", paddr);

    offset = vaddr % page_size;

    /* PAGE_SIZE = 1U << PAGE_SHIFT */
    while (!((1UL << ++page_shift) & page_size));

    paddr = (unsigned long)((unsigned long)paddr << page_shift) + offset;
    printf("physical address is 0x%lx\n", paddr);

    ret = 0;
err:
    /* pagemap is NULL when fopen() failed; fclose(NULL) is undefined */
    if (pagemap)
        fclose(pagemap);
    return ret;
}

And the output

$ sudo ./a.out 11953 0x55cd75821260
getting page number of virtual address 94340928115296 of process 11953
opening pagemap /proc/11953/pagemap
moving to 184259625224
physical frame address is 0x20508
physical address is 0x20508260

References

  1. https://www.kernel.org/doc/Documentation/vm/pagemap.txt
  2. https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/page_types.h
  3. man pages

Github from Facebook

There are multiple J.A.R.V.I.S projects available on GitHub. All are based on chat frameworks like api.ai, Alexa, etc. This is yet another rather vaguely intelligent system, which I developed for the Facebook developer circle challenge hosted at Devpost.

I developed it with J.A.R.V.I.S in mind. First of all, J.A.R.V.I.S is not a simple t-shirt-throwing bot like the one Zuckerberg developed. Tony Stark actively uses him during his project development. I mean he is not just a personal assistant but an intelligent co-developer. So I developed one to help you with your GitHub-related activities. You can create repos, open and close issues, and comment on others' issues right from a Facebook chat window. It saves you the time and effort of navigating multiple windows and clicking dozens of buttons. With the voice interface enabled for Facebook chat, he will be your personal J.A.R.V.I.S.

Here is a quick demo.


Code is available at github.com/kaba-official/MO.

VNC setup in Digitalocean

Install xfce desktop environment

$ sudo apt-get install xfce4 xfce4-goodies

Install TightVNC for VNC server

$ sudo apt-get install tightvncserver

Initialize VNC server

  • Run VNC server once
  • Enter and verify password
  • Ignore view-only password

$ vncserver
Password:
Verify:
View-only password:

To configure VNC, first kill the running VNC server

$ vncserver -kill :1
Killing Xtightvnc process ID 13456

Backup xstartup configuration file

$ mv ~/.vnc/xstartup ~/.vnc/xstartup.orig

Edit xstartup file

$ vim ~/.vnc/xstartup

And add the following content to ~/.vnc/xstartup

#!/bin/bash
xrdb $HOME/.Xresources
startxfce4 &

  • Look at the ~/.Xresources file for VNC's GUI framework information
  • Start XFCE whenever the VNC server is started

Provide executable permission for ~/.vnc/xstartup file

$ sudo chmod +x ~/.vnc/xstartup

Start VNC server

$ vncserver -geometry 1280x800

Auto deploy from Bitbucket or any other git repository

Update the apt package index first

$ sudo apt-get update

Install expect

$ sudo apt-get install expect

Create a directory named trigger in your home directory

$ cd
$ mkdir trigger
$ cd trigger

Create the following files inside the trigger directory.

trigger.js

$ cat trigger.js
var server_port = <new_port_for_trigger>;

var sys = require('sys');
var exec = require('child_process').exec;
var child;

var http = require('http');
var express = require('express');
var app = express();
var server_get = require('http').Server(app);

app.post('/update', function(req, res) {
    child = exec("./update_repo.sh", function(error, stdout, stderr) {
                console.log('stdout: ' + stdout);
                console.log('stderr: ' + stderr);
                if (error !== null) {
                    console.log('exec error: ' + error);
                }
    });

    res.send("SUCCESS");
});

app.listen(server_port, function() {
    console.log('Example app listening on port ' + server_port);
});

package.json

$ cat package.json
{
    "name": "Trigger",
    "version": "0.0.1",
    "scripts": {
        "start": "node server"
    },
    "dependencies": {
        "express": "^4.14.0",
        "http": "0.0.0",
        "https": "^1.0.0"
    }
}

update_repo.sh

$ cat update_repo.sh
#!/bin/bash
cd <your_git_repo>
~/trigger/git_pull_helper.sh
pm2 restart server

git_pull_helper.sh

$ cat git_pull_helper.sh
#!/usr/bin/expect -f
spawn git pull
expect "ass"
send "<your_ssh_key_pass_phrase>\r"
interact

Update permission for update_repo.sh and git_pull_helper.sh

$ chmod +x update_repo.sh
$ chmod +x git_pull_helper.sh

Start your trigger NodeJS server

$ pm2 start trigger.js

  • Configure your nginx if you are using one
  • Don't forget to open the new port in your firewall

Update in Bitbucket

  • Go to Settings
  • Select Webhooks in Workflow section
  • Click Add webhook button
  • Give name
  • Enter your url as *<your_domain_name>.<ext>:<your_port_for_trigger>/update*
  • Press save button

Or follow the steps provided by your git server.