Disclaimer: this post isn’t really about C++ but I hope you’ll find it useful in any case ;)
When I first heard that dynamically loading a library built in Go takes much longer than loading one written in C or C++, my first reaction was one of surprise.
The second was: of course! And although it made sense, I wanted to find out the reasons behind it.
So I went to compare what happens when loading a shared library object built in Go and another built in C++, compiled with gcc. For clarity, both cases were built for and run on 64-bit Linux using Go version 1.11.
Go works with the concept of packages, which (for the most part) are also written in Go. The default build mode generates archive (statically linked) libraries. These libraries are then, later in the build process, linked against the main package(s).
The same thing happens with the Go runtime code: all of it is statically included in the final executable or library. And that is a good thing.
Running ldd on an executable would probably look like this:
$ ldd go-executable
linux-vdso.so.1 (0x00007fffc7fe9000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f84d3dbb000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f84d3bd1000)
/lib64/ld-linux-x86-64.so.2 (0x00007f84d3df4000)
Naturally there are exceptions, but we’ll save that discussion for another day.
The idea is very simple but sufficient (or so I hope): write two functionally identical libraries, one in Go and the other in C++, then a third program to load and use these libraries. This is the program we’ll be profiling and looking into.
I say functionally because obviously the code is different but the exported functions do the same thing.
This is the source code for testgo.go
package main

import (
	"C"
)

//export GetSomeInfo
func GetSomeInfo() *C.char {
	return C.CString("this is a go library")
}

func main() {}
and here its C++ counterpart, testcpp.cpp
#include <string>

static std::string somethingToBeReturned("this is a C++ library");

extern "C"{
	__attribute__ ((visibility ("default")))
	const char* GetSomeInfo(){
		return somethingToBeReturned.c_str();
	}
}
Very simple.
The Go version of the library was compiled with
go build -o libtestgo.so -buildmode=c-shared testgo.go
And the C++ version was compiled with gcc, with some extra arguments to match the default Go compilation.
g++ testcpp.cpp -g -shared -o libtestcpp.so -fpic -m64 -fmessage-length=0
To see what the Go build defaults are, run go env and check the value of the GOGCCFLAGS variable.
Finally, the program to load the libraries, loader.cpp
#include <iostream>
#include <string>
#include <dlfcn.h>

#define CPPLIBNAME "./libtestcpp.so"
#define GOLIBNAME "./libtestgo.so"

char* (*getSomeInfo)();

void handleError(){
	char* error = dlerror();
	if(error != NULL){
		std::cerr << "Error loading lib " << error << std::endl;
	}
}

// Load the shared object, resolve the exported symbol, call it and unload.
void loadLib(const char* libName){
	void* handle = NULL;
	handle = dlopen(libName, RTLD_NOW);
	handleError();
	getSomeInfo = (char*(*)())dlsym(handle, "GetSomeInfo");
	handleError();
	std::cout << "some info is '" << (*getSomeInfo)() << "'" << std::endl;
	dlclose(handle);
}

void cppLoad(){
	loadLib(CPPLIBNAME);
}

void goLoad(){
	loadLib(GOLIBNAME);
}

int main(int argc, char** argv){
	// 'c' loads the C++ library; anything else (we use 'g') loads the Go one.
	if(argc > 1 && argv[1][0] == 'c'){
		cppLoad();
	}else{
		goLoad();
	}
}
A bit of hacky code, but it will suffice.
Note the argument c, which tells our loader to, well, load the C++ library; otherwise it loads the Go library. We’ll use this below.
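To build the loader itself, something along the lines of g++ loader.cpp -o loader -ldl should do; the explicit -ldl is only needed on older glibc versions, where the dlopen family lives in libdl rather than in libc, and it is harmless elsewhere.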
My first attempt to benchmark the load time was to use one of my favourite frameworks, Google Benchmark, but that would require the library-loading code to be executed multiple times. However, my instincts were telling me that whatever happens behind the scenes during the library initialization would happen only once, and that it would be hard to work around that.
It turns out those instincts were right, for once. So I turned to another favourite, linux perf.
First, measuring the loading time of each library. We’ll do that with perf stat.
Ideally we’d repeat this many times so that anything interfering with the results gets diluted, but since we’re comparing two result sets, a few runs should be sufficient to give us confidence.
Measuring the load time of the C++ library
$ perf stat -- ./loader c
some info is 'this is a C++ library'
Performance counter stats for './loader c':
1.460310 task-clock (msec) # 0.824 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
142 page-faults # 0.097 M/sec
3,229,283 cycles # 2.211 GHz
3,558,250 instructions # 1.10 insn per cycle
633,848 branches # 434.050 M/sec
19,978 branch-misses # 3.15% of all branches
0.001771322 seconds time elapsed
0.001778000 seconds user
0.000000000 seconds sys
Now, for the load time of the Go library:
$ perf stat -- ./loader g
some info is 'this is a go library'
Performance counter stats for './loader g':
3.568294 task-clock (msec) # 1.006 CPUs utilized
32 context-switches # 0.009 M/sec
3 cpu-migrations # 0.841 K/sec
387 page-faults # 0.108 M/sec
6,418,891 cycles # 1.799 GHz
5,798,816 instructions # 0.90 insn per cycle
1,078,772 branches # 302.322 M/sec
31,764 branch-misses # 2.94% of all branches
0.003547798 seconds time elapsed
0.000000000 seconds user
0.004151000 seconds sys
See the documentation page for more details on perf.
There are two things immediately visible in the results: task-clock and instructions. The task clock tells us that loading the Go library takes more than twice as long as loading the C++ library.
The number of instructions executed indicates there’s a lot more going on when the library built in Go is loaded.
Also, note the number of context-switches and CPU migrations. This suggests that the initialization of the Go library is probably creating (and running) threads.
Now, let’s try to find an explanation for these differences.
To find out what the loader application is doing, profiling the application is a good, time-tested approach. And for that, I resort to perf once more.
We’ll use a combination of perf record and perf report.
Remember, we’re analyzing our library loader with the Go version of our test shared object.
$ perf record -Fmax --strict-freq -g --running-time --call-graph dwarf -- ./loader g
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 2.320 MB perf.data (305 samples) ]
$ perf report
- 8.26% 0.00% loader [.] runtime.rt0_go
- runtime.rt0_go
- 5.08% runtime.schedinit
+ 2.98% runtime.mcommoninit
+ 1.27% runtime.mallocinit
+ 0.83% runtime.modulesinit
+ 0.92% runtime.mstart
+ 0.91% runtime.newproc
+ 0.80% x_cgo_init
+ 0.55% runtime.check
+ 6.01% 0.00% loader [.] runtime.malg
+ 5.08% 0.00% loader [.] runtime.schedinit
+ 4.83% 0.00% loader [.] threadentry
+ 4.78% 0.00% loader [.] runtime.mstart
+ 4.78% 0.00% loader [.] runtime.mstart1
+ 4.20% 0.00% loader [.] runtime.mcommoninit
+ 3.86% 0.00% loader [.] crosscall_amd64
+ 3.58% 0.00% loader [.] runtime.newm
+ 3.56% 0.00% loader [.] runtime.newobject
+ 3.56% 0.00% loader [.] runtime.mallocgc
+ 3.56% 0.00% loader [.] runtime.(*mcache).nextFree
+ 3.56% 0.00% loader [.] runtime.(*mcache).nextFree.func1
+ 3.56% 0.00% loader [.] runtime.(*mcache).refill
+ 3.56% 0.00% loader [.] runtime.(*mcentral).cacheSpan
+ 3.56% 0.00% loader [.] runtime.(*mcentral).grow
+ 3.36% 0.31% loader [.] runtime.schedule
+ 3.29% 0.00% loader [.] runtime.asmcgocall
+ 3.27% 0.00% loader [.] runtime.mpreinit
+ 3.23% 0.00% loader [.] runtime.main
+ 3.04% 0.26% loader [.] runtime.(*mheap).allocSpanLocked
+ 3.03% 0.00% loader [.] runtime.malg.func1
+ 2.91% 0.00% loader [.] runtime.startm
+ 2.77% 0.00% loader [.] runtime.newproc.func1
+ 2.77% 0.00% loader [.] runtime.newproc1
+ 2.76% 0.00% loader [.] runtime.(*mheap).alloc
+ 2.76% 0.00% loader [.] runtime.(*mheap).alloc.func1
+ 2.76% 0.00% loader [.] runtime.(*mheap).alloc_m
+ 2.66% 0.00% loader [.] _cgo_sys_thread_start
+ 2.66% 0.00% loader [.] _cgo_try_pthread_create
+ 2.56% 0.00% loader [.] runtime.gcenable
+ 2.56% 0.00% loader [.] runtime.chanrecv1
+ 2.56% 0.00% loader [.] runtime.chanrecv
+ 2.37% 0.00% loader [.] _rt0_amd64_lib
Above is the call graph captured while the code executed under perf’s instrumentation, ordered by percentage of execution time.
If you’ve used perf, you are probably well aware that it is awesome but not perfect, so the best way to interpret these results is by comparing function calls relative to one another. I cleaned up the output a little so as to focus on the shared object (library) we’re loading. The top entry on the list is expanded and we’ll talk a little more about those calls. The remaining lines I left in for scientific purposes.
All the calls in the list can aid further investigation steps.
Let’s start with runtime.rt0_go, simply because it’s the first entry in the list :). Let’s see what was happening before that call was executed.
For that, we’ll use gdb and put a breakpoint on runtime.rt0_go:
$ gdb --args ./loader g
(gdb) break runtime.rt0_go
Breakpoint 1 (runtime.rt0_go) pending.
(gdb) r
[New Thread 0x7ffff793c700 (LWP 27545)]
[Switching to Thread 0x7ffff793c700 (LWP 27545)]
Thread 2 "loader" hit Breakpoint 1, runtime.rt0_go () at /usr/lib/go-1.11/src/runtime/asm_amd64.s:89
89 MOVQ DI, AX // argc
(gdb) bt
#0 runtime.rt0_go () at /usr/lib/go-1.11/src/runtime/asm_amd64.s:89
#1 0x00007ffff7e0c164 in start_thread (arg=<optimized out>) at pthread_create.c:486
#2 0x00007ffff7d34def in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Not quite what I was expecting: none of the code we wrote creates threads, so everything indicates the Go runtime is doing it. Which means runtime.rt0_go is not actually the code entry point when our library is loaded.
What’s the Go code entry point for this library then? Looking at the output from perf, another interesting candidate is _rt0_amd64_lib. Putting a breakpoint on it will help clarify things.
(gdb) break _rt0_amd64_lib
Breakpoint 1 at 0x7ffff79c6d10: file /usr/lib/go-1.11/src/runtime/asm_amd64.s, line 31.
(gdb) r
Breakpoint 1, _rt0_amd64_lib () at /usr/lib/go-1.11/src/runtime/asm_amd64.s:31
31 TEXT _rt0_amd64_lib(SB),NOSPLIT,$0x50
(gdb) bt
#0 _rt0_amd64_lib () at /usr/lib/go-1.11/src/runtime/asm_amd64.s:31
#1 0x00007ffff7fe398a in call_init (l=<optimized out>, argc=argc@entry=2, argv=argv@entry=0x7fffffffde48, env=env@entry=0x7fffffffde60) at dl-init.c:72
#2 0x00007ffff7fe3a89 in call_init (env=0x7fffffffde60, argv=0x7fffffffde48, argc=2, l=<optimized out>) at dl-init.c:30
#3 _dl_init (main_map=main_map@entry=0x55555556aec0, argc=2, argv=0x7fffffffde48, env=0x7fffffffde60) at dl-init.c:119
#4 0x00007ffff7fe7cdc in dl_open_worker (a=a@entry=0x7fffffffdae0) at dl-open.c:517
#5 0x00007ffff7d7648f in __GI__dl_catch_exception (exception=<optimized out>, operate=<optimized out>, args=<optimized out>) at dl-error-skeleton.c:196
#6 0x00007ffff7fe72c6 in _dl_open (file=0x555555556045 "./libtestgo.so", mode=-2147483646, caller_dlopen=0x555555555330 <loadLib(char const*)+16>, nsid=<optimized out>, argc=2, argv=0x7fffffffde48,
env=0x7fffffffde60) at dl-open.c:599
#7 0x00007ffff7fb1256 in dlopen_doit (a=a@entry=0x7fffffffdd00) at dlopen.c:66
#8 0x00007ffff7d7648f in __GI__dl_catch_exception (exception=exception@entry=0x7fffffffdca0, operate=<optimized out>, args=<optimized out>) at dl-error-skeleton.c:196
#9 0x00007ffff7d7651f in __GI__dl_catch_error (objname=0x55555556ae80, errstring=0x55555556ae88, mallocedp=0x55555556ae78, operate=<optimized out>, args=<optimized out>) at dl-error-skeleton.c:215
#10 0x00007ffff7fb1a25 in _dlerror_run (operate=operate@entry=0x7ffff7fb1200 <dlopen_doit>, args=args@entry=0x7fffffffdd00) at dlerror.c:163
#11 0x00007ffff7fb12e6 in __dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:87
#12 0x0000555555555330 in loadLib (libName=<optimized out>) at loader.cpp:19
#13 0x0000555555555129 in goLoad () at loader.cpp:43
#14 main (argc=<optimized out>, argv=<optimized out>) at loader.cpp:43
That’s more like it! As we go down the stack trace we see the familiar function calls we wrote in our loader application. We found it: _rt0_amd64_lib is the first piece of code invoked when a shared library built in Go is loaded.
For completeness, the entry point of a Go executable on Linux is _rt0_amd64_linux. It will likely be different for each platform.
Alright, all this reverse engineering got us on the right track. Let’s look at some source code.
We saw that the library entry point _rt0_amd64_lib is in the file asm_amd64.s and the source code for it can be found here.
Here are a few interesting bits:
TEXT _rt0_amd64_lib(SB),NOSPLIT,$0x50
// some code here
...
MOVQ DI, _rt0_amd64_lib_argc<>(SB)
MOVQ SI, _rt0_amd64_lib_argv<>(SB)
// Synchronous initialization.
CALL runtime·libpreinit(SB)
// Create a new thread to finish Go runtime initialization.
MOVQ _cgo_sys_thread_create(SB), AX
TESTQ AX, AX
JZ nocgo
MOVQ $_rt0_amd64_lib_go(SB), DI
MOVQ $0, SI
CALL AX
// more code here
...
First, the fact that there is a pre-initialization step specific to libraries (the archive and shared library build modes) that is called very early on. In the version of Go we’re using in this post, all that happens at this stage is the initialization of signal handlers.
Simple, but not necessarily cheap; handlers are installed for every signal the underlying OS supports, and each installation happens via a system call, which makes it a fairly expensive operation. In this example, 66 system calls were made to install Go’s signal handlers.
So that’s one of the things adding to the initialization burden, but not the worst of them.
The second interesting thing is that a good chunk of the initialization happens in the background, on a new thread that executes _rt0_amd64_lib_go, which in turn simply invokes runtime·rt0_go, where the rest of the initialization happens.
This means that, at some point, a thread has to wait for the initialization to complete before the code we wrote can actually be executed. This waiting happens in _cgo_wait_runtime_init_done and this behaviour is nicely documented in the code.
A lot happens in runtime·rt0_go, but the two most interesting calls (in our case, that is) are the calls to runtime·osinit and runtime·schedinit. The runtime bootstrap order is documented in proc.go. This is where the fun begins. Take a look at the list of functions invoked in schedinit; from it we can get a good indication of what is happening and which calls are the more expensive ones.
func schedinit() {
	// some code here
	...
	tracebackinit()
	moduledataverify()
	stackinit()
	mallocinit()
	mcommoninit(_g_.m)
	cpuinit()
	alginit()
	modulesinit()
	typelinksinit()
	itabsinit()
	msigsave(_g_.m)
	initSigmask = _g_.m.sigmask
	goargs()
	goenvs()
	parsedebugvars()
	gcinit()
	// and more code here
	...
}
We could talk about the things that happen in schedinit for days, so let’s focus on memory management: allocation and garbage collection.
If we look back at the perf results above, a great number of the functions that top the list are either doing something related to memory allocation or are “slow” because they are invoking memory allocation.
Allocating memory is expensive, and managing it is no trivial matter either. The Go runtime has a garbage collector - to the joy of us developers - and if you have ever used a language that relies on GC (most likely Java), you know there is a lot of setup, and pain, throughout the lifetime of an executing program.
On top of that, Go has a mechanism to grow goroutine stacks dynamically, in order to make the creation of goroutines very efficient. Stack growth happens by acquiring memory from a pre-allocated stack pool.
The creation of the pool happens in the function stackinit() we saw above in schedinit. Stack management is a very interesting topic and important to know about if you’re trying to get every bit of performance from your Go programs. Since we’re not going to explore it further in this post, I suggest reading through the comments and code in stack.go.
Keep in mind that the number of processor cores affects the stack initialization and management, since the runtime creates a thread pool for the execution of goroutines and this thread pool is sensitive to the number of processor cores.
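If you are curious about the numbers the runtime works with on your machine, a tiny sketch like this prints them (nothing beyond the public runtime API):
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// NumCPU reports the logical cores detected at startup; GOMAXPROCS(0)
	// reports how many of them the runtime will use to run goroutines in
	// parallel (by default the two are the same).
	fmt.Println("logical CPUs:", runtime.NumCPU())
	fmt.Println("GOMAXPROCS:  ", runtime.GOMAXPROCS(0))
}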
Continuing on, we have mallocinit(), which does a lot of complicated and cool stuff, including the creation of the memory arenas later used for allocation. This is also where the address spaces used by the runtime are defined, and where the heap (important for GC) and the caches are initialized.
Being a garbage-collected environment, Go programs do not allocate memory directly from the OS; instead, memory is “handed over” to the executing program by Go’s memory manager. Requested blocks of memory are reserved in the heap according to their size; each size is defined by its span class, and these range from small to large. By pre-allocating memory in a way that best serves these classes, the Go runtime can make memory accessible to programs much more quickly than if it always tried to match the requested size precisely.
This allocation (and more) happens in (*mheap).init(). Take a look at the code and you’ll see other memory blocks being reserved for other functions.
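One way to glimpse the gap between what a program asks for and what the memory manager has actually set aside is to read the runtime’s own statistics. A small sketch (again, my own illustration rather than part of the original measurements):
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Allocate a bunch of small objects of assorted sizes; each one is
	// served from a span carved out for the matching size class.
	var keep [][]byte
	for i := 0; i < 1000; i++ {
		keep = append(keep, make([]byte, i%512+1))
	}

	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	// HeapAlloc is what the program is actually using; HeapSys is what the
	// memory manager has obtained from the OS to serve those requests.
	fmt.Printf("objects kept: %d, HeapAlloc: %d KiB, HeapSys: %d KiB\n",
		len(keep), ms.HeapAlloc/1024, ms.HeapSys/1024)
}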
Finally on our list is the initialization of the garbage collector, in gcinit(). Here the GC workers (sweep, mark, cleanup) are initialized and the GC thresholds are defined.
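The one threshold that remains in our hands afterwards is the target heap-growth percentage, better known as the GOGC environment variable; runtime/debug exposes it programmatically. A tiny sketch of reading it without changing anything:
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// SetGCPercent installs a new target heap-growth percentage (the GOGC
	// knob) and returns the previous one; setting the old value back leaves
	// the runtime exactly as we found it.
	prev := debug.SetGCPercent(100)
	debug.SetGCPercent(prev)
	fmt.Println("GOGC percent was:", prev)
}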
We’ve covered a lot of ground already and I will stop here, for now.
Go doesn’t give us many options to tweak the runtime initialization - perhaps for the best - so there’s not much we can do in that sense.
We can, however, be mindful of this fact, of how different hardware configurations will perform, and, especially, of the packages we import in our program.
For example, look at the hot spots (output from perf again) in the runtime initialization when gzip is imported in our test library.
- 7.82% 0.00% loader [.] runtime.main
- runtime.main
- 6.84% main.init
- 6.19% compress/gzip.init
- 4.35% bufio.init
bytes.init
+ unicode.init
+ 1.20% compress/flate.init
+ 0.64% hash/crc32.init
+ 0.65% syscall.init
+ 0.99% runtime.gcenable
See how the compress and bufio packages add substantially to the overall initialization time.
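The modified test library can be as simple as a blank import added to testgo.go - a sketch, since the exact source used for the measurement above isn’t shown, built with the same c-shared command as before:
package main

import (
	"C"
	_ "compress/gzip" // pulled in only for the init work it (transitively) brings
)

//export GetSomeInfo
func GetSomeInfo() *C.char {
	return C.CString("this is a go library")
}

func main() {}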
The initialization of the Go runtime may not be as fast as one would like, but it does what it has to do only once, so that the user code - the code we write - can perform well.
Bira Ignacio