
The Linux COW
In our last post we have seen how to get the physical memory address using a virtual memory address. In this post lets try to check Copy on Write(COW from here) implemented by Linux. For those who try to recollect what is COW, on fork()
, Linux will not immediately replicate the address space of parent process. As the child process is most-likely to load a new binary, copying the current process’s address space will be of no use in many cases. So unless a write occurs (either by the parent process or by the child process), the physical address space mapped to both processes’ virtual address space will be same. Please go through Wikipedia if you still have no clue.
So lets write a C program that forks itself and check whether both are sharing the physical address space. We are going to reuse the get_paddr.c
function we wrote in our previous post to get the physical memory address.1
2
3
4
5
6
7
8
9
10
11
12
13
14
int main()
{
int a;
fork();
printf("my pid is %d, and the address to work is 0x%lx\n", getpid(), (unsigned long)&a);
scanf("%d\n", &a);
return 0;
}
Execute this program in one console and while it is waiting for the user input, go to the next console and get the physical address of variable a
using our get_paddr
binary.1
2
3
4$ gcc specimen.c -o specimen
$ ./specimen
my pid is 5912, and the address to work is 0x7ffd055923e4
my pid is 5913, and the address to work is 0x7ffd055923e4
1 | $ sudo ./get_paddr 5912 0x7ffd055923e4 |
OOPS! The physical address is not same! How? Why? What’s wrong? As per my beloved Robert Love’s Linux System Programming book, the physical memory copy will occur only when a write occurs.
The MMU can intercept the write operation and raise an exception; the kernel, in response, will transparently create a new copy of the page for the writing process, and allow the write to continue against the new page. We call this approach *copy-on-write(COW). Effectively, processes are allowed read access to shared data, which saves space. But when a process wants to write to a shared page, it receives a unique copy of that page on fly, thereby allowing the kernel to act as if the process always had its own private copy. As copy-on write occurs on a page-by-page basis, with the technique a huge file may be efficiently shared among many processes, and the individual processes will receive unique physical pages only for those pages to which thy themselves write.
It is so clear that in our program specimen.c
we never write to the only variable a
unless the scanf
completes. But we tested the COW before input-ing to the scanf
. So by the time there would be no write and as per the design of COW physical memory space should be shared across both parent and child. Even though we have written the program carefully, sometimes the libraries, compiler and even the OS will act intelligently (stupid! They themselves says it) and fills us with surprise. So lets run the handy strace
on our specimen
binary to see what it actually does.1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38$ strace ./specimen
execve("./specimen", ["./specimen"], [/* 47 vars */]) = 0
brk(NULL) = 0x55de17b57000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=94696, ...}) = 0
mmap(NULL, 94696, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fdbad5b8000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\340\22\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1960656, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fdbad5b6000
mmap(NULL, 4061792, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fdbacfc9000
mprotect(0x7fdbad19f000, 2097152, PROT_NONE) = 0
mmap(0x7fdbad39f000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE,
3, 0x1d6000) = 0x7fdbad39f000
mmap(0x7fdbad3a5000, 14944, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS,
-1, 0) = 0x7fdbad3a5000
close(3) = 0
arch_prctl(ARCH_SET_FS, 0x7fdbad5b7500) = 0
mprotect(0x7fdbad39f000, 16384, PROT_READ) = 0
mprotect(0x55de15e94000, 4096, PROT_READ) = 0
mprotect(0x7fdbad5d0000, 4096, PROT_READ) = 0
munmap(0x7fdbad5b8000, 94696) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0x7fdbad5b77d0) = 5932
getpid() = 5931
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
my pid is 5932, and the address to work is 0x7fff9823b814
brk(NULL) = 0x55de17b57000
brk(0x55de17b78000) = 0x55de17b78000
write(1, "my pid is 5931, and the address "..., 58my pid is 5931, and the address
to work is 0x7fff9823b814
) = 58
fstat(0, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
read(0,
As the program is waiting on scanf
, it is waiting for user input in read
system-call from fd 0
, stdin
. But the surprise part here is the clone
system call. If you don’t remember, read the specimen.c
program once again. We never used the clone
system call. And where is the fork
? It seems glibc
does something in a so called intelligent way. Lets read the man pages
once again for pointers.1
2
3
4
5
6
7C library/kernel differences
Since version 2.3.3, rather than invoking the kernel's fork() system call,
the glibc fork() wrapper that is provided as part of the NPTL threading
implementation invokes clone(2) with flags that provide the same effect as
the traditional system call. (A call to fork() is equivalent to a call to
clone(2) specifying flags as just SIGCHLD.) The glibc wrapper invokes any
fork handlers that have been established using pthread_atfork(3).
Doesn’t look completely true. Though the man page
says it calls clone()
with flags as just SIGCHLD
, strace
uncovers the dirty truth - flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD
. Nobody escapes from an armed system programmer. Anyway this explains clone
but no-fork
surprise. But where comes the write that causes COW
. See the last argument of clone()
in strace
, child_tidptr=0x7fdbad5b77d0
some place holder stack variable passed to be filled. And what about the two flags other than SIGCHLD
? Lets go to the man pages
once again, but for clone()
this time.1
2
3
4
5
6
7
8
9 CLONE_CHILD_CLEARTID (since Linux 2.5.49)
Clear (zero) the child thread ID at the location ctid in child memory when
the child exits, and do a wakeup on the futex at that address. The address
involved may be changed by the set_tid_address(2) system call. This is
used by threading libraries.
CLONE_CHILD_SETTID (since Linux 2.5.49)
Store the child thread ID at the location ctid in the child's memory.
The store operation completes before clone() returns control to user space.
Ahh! The culprit is CLONE_CHILD_SETTID
. It makes the OS to write thread ID of the child into its memory region. This write triggers the COW
and gets the child a fresh copy of physical memory. Sorry Linux, I doubted you.
Okay. Lets modify our specimen
with clone()
and no CLONE_CHILD_SETTID
flag.1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39/* clone() */
/* getpid() */
int run(void *arg)
{
printf("my pid is %d and address to check is 0x%lx\n", getpid(), (unsigned long) arg);
scanf("%d\n", (int *)arg);
return 0;
}
int main()
{
int a;
void *stack_ptr = malloc(1024*1024);
if (stack_ptr == NULL) {
printf("No virtual memory available\n");
return -1;
}
/*fork();*/
clone(run, (stack_ptr + 1024*1024), CLONE_CHILD_CLEARTID|SIGCHLD, &a);
if (a < 0)
perror("clone failed. ");
run(&a);
return 0;
}
Unlike fork()
, clone()
doesn’t start the execution of child from the point where it returns.Instead it takes a function as an argument and runs the function in child process. Here we are passing run()
function which prints the pid
and virtual address
of the variable a
. After clone()
the parent also calls the function run()
so that we’ll get the pid
of parent process.
Have you noted the second argument of clone()
? It is the stack
where child is going to execute. Again unlike fork()
, clone()
allows parent and child to share their resources like memory, signal handlers, open file descriptors, etc. As both cannot run in same stack
, the parent process must allocate some space and give it to the child to use it as stack. stack
grows downwards on all processors that run Linux, so the child-stack should point out to the topmost address of the memory region. That’s why we are passing the end of malloc
ed memory. Lets execute the program and see whether both process share the same physical memory space.1
2
3$ ./specimen
my pid is 3129 and address to check is 0x7fff396fdd6c
my pid is 3130 and address to check is 0x7fff396fdd6c
1 | $ sudo ./get_paddr 3129 0x7fff396fdd6c |
Different again! Linux can’t be wrong. We should make sure we didn’t mess-up anything. Are we sure there will be no write in stack after clone()
call?
- The child’s execution starts at function
run()
- No variables are written until
scanf()
returns. Okay we’ll complete our examination before providing any input. - But the functions? OOPS! After clone, parent calls
run()
and thenprintf()
but child callsprintf()
directly. All these function calls will make write in stack which triggers COW.
Lets come-up with a different C-program with no function calls after clone()
in both parent and child.
1 | /* clone() */ |
We are not printing anything after clone()
to avoid stack overwrite. So we have to assume the child process’s pid
is parent pid
+ 1. This assumption is true most of the times. And we use infinite while
loop to pause both processes. Loops use jump
statements. So they will not cause any stack write. So this time we should see same physical address has been used by both processes. Lets see.1
2$ ./specimen
my pid is 3436 and address to check is 0x7fffa920c1ec
1 | $ sudo ./get_paddr 3436 0x7fffa920c1ec |
GREAT! At last we conquered. We saw the evidence of our beloved COW. Now I can sleep peacefully.
For pure Engineer
Still I feel the urge to make a stack write in child process and see the physical address change. For those curious people out there who want to see the things break, lets run it one more time. Change the run()
function as follows and execute specimen.c
.1
2
3
4
5
6
7int run(void *arg)
{
*(int*)arg = 10; //stack write
while(*(int*)arg);
return 0;
}
1 | $ ./specimen |
1 | $ sudo ./get_paddr 3549 0x7ffdd8e1e9ac |
Good Nightzzz…