Terminology

Let’s start by fixing some terminology.

Shared Library. We use the cross-platform term shared library to refer to what is otherwise known as a shared object, dynamic object, Dynamic Shared Object (DSO) or, on Windows, a Dynamic-Link Library (DLL).
Binary. An executable or a shared library.
Symbol. A function or a global variable.
Linux. I will loosely use the term Linux for all UNIX descendants.

Nano-introduction to Linking

We start with source files. In the first phase, the compiler translates them into object files. An object file is nothing but a container for sections. Two canonical examples of sections are .text, which contains machine code, and .data, which contains program data.

In the second phase, the linker does two things. First, it pulls together the identically named sections from all the object files and concatenates them into one larger section. Second, it reorders the sections so that sections with similar required run-time permissions are adjacent on disk.

In the final stage, the loader takes these adjacent chunks called segments and maps them to memory on page-aligned boundaries. After this mapping, the loader adjusts the page permissions accordingly.
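To make the page-aligned mapping concrete, here is a minimal sketch of the arithmetic a loader has to do for one segment. The helper names and the 4 KiB default page size are assumptions for illustration only; a real loader asks the operating system for the page size.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers, not part of any loader API.

// Round an address down to the start of its page.
constexpr std::uint64_t page_floor(std::uint64_t addr, std::uint64_t page_size) {
    return addr & ~(page_size - 1);
}

// Round an address up to the next page boundary (no-op if already aligned).
constexpr std::uint64_t page_ceil(std::uint64_t addr, std::uint64_t page_size) {
    return (addr + page_size - 1) & ~(page_size - 1);
}

struct MappedRange {
    std::uint64_t begin;  // first mapped byte, page aligned
    std::uint64_t end;    // one past the last mapped byte, page aligned
};

// The range of whole pages that covers the segment [vaddr, vaddr + size).
inline MappedRange pages_for_segment(std::uint64_t vaddr, std::uint64_t size,
                                     std::uint64_t page_size = 4096) {
    // The bit tricks above only work for power-of-two page sizes.
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return { page_floor(vaddr, page_size), page_ceil(vaddr + size, page_size) };
}
```

Because permissions are applied per page, two segments with different permissions cannot share a page, which is one reason the linker groups sections by permission first.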

Now, code typically makes lots of function calls. The simplest case is a call from the binary into itself. But that is not the general case. In general, the process can map shared libraries into its address space and call functions that are implemented in a shared library. The shared library itself can call functions implemented in yet other shared libraries. Arguably, the most important job of both the linker and the loader is to properly wire these calls. What might such wiring look like?

Say we have a .code section that calls the function foo(), which is implemented in another binary. The .code section itself does not contain the string foo; it contains a call 0x0000 instruction into an address that is still unknown at link time. So, the .code section contains a placeholder, a string of 0s. In addition, the linker generates another section called .reloc, for relocation. On Windows, these are sometimes called fixups. An entry in the .reloc section is essentially a small TODO item for the loader. The linker says, “Dear loader, please find the function foo, and when you do, overwrite this placeholder with the address you found.” This form of relocation does happen in practice, but it is generally frowned upon, for two reasons. First, modifying the .code section makes it unshareable between processes. Second, this form of relocation needs to be done once per call site, not once per function, which can amount to a large difference if you have tens of thousands of calls into the same function, e.g. a game engine with calls to the math function sqrt().

The more typical scheme is to route all calls to the same function through an indirect call via a single placeholder. In this scheme, when the loader reads the .reloc section, it knows it needs to overwrite just a single slot with the address of foo(), not all the call sites. This design trades a little bit of run-time performance for a lot of load-time savings. There is an entire section of such placeholders. On Windows, it is called the IAT (Import Address Table). On Linux, its rough counterpart is a section known as the GOT (Global Offset Table). Note already that this design means that cross-binary calls are indirect. They carry the same overhead as virtual function calls do. And this is not the full truth; it is just the first step in our journey towards it.
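The single-placeholder scheme can be modelled in plain C++ with a table of function pointers. This is only a sketch of the idea, not how a real loader is written: import_slots, bind_import and the two call sites are made-up names, and foo_impl stands in for a function that would really live in another binary.

```cpp
#include <cassert>
#include <cstddef>

using Fn = int (*)(int);

// Stand-in for a function implemented in another binary.
int foo_impl(int x) { return x + 1; }

// One slot per imported function, playing the role of an IAT/GOT entry.
constexpr std::size_t kFooSlot = 0;
Fn import_slots[1] = { nullptr };

// Toy "loader" step: a single write binds every call site of foo at once.
void bind_import(std::size_t slot, Fn target) { import_slots[slot] = target; }

// Two independent call sites. Neither embeds foo's address;
// both read it from the same slot at call time.
int call_site_a(int x) {
    assert(import_slots[kFooSlot] != nullptr);  // must be bound before use
    return import_slots[kFooSlot](x);
}

int call_site_b(int x) {
    assert(import_slots[kFooSlot] != nullptr);
    return import_slots[kFooSlot](x) * 2;
}
```

However many call sites exist, binding costs one store into the slot, and the code of the call sites is never modified.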

Windows Binary

Let us take a closer look at Windows first. A Windows binary describes all the imports it consumes in a section called .idata (the i stands for import). The most important data structure in the .idata section is the directory table. The directory table is a table of entries, one per shared library. Each entry contains the DLL name, an offset to an import lookup table (ILT), which tells the loader which symbols to locate within this library, and an offset to an import address table (IAT), which tells the loader where to write the addresses of these symbols once they are found.

Schematically, what a Windows binary tells the loader is, “Dear loader, please load the library lib1, and from it, please locate the symbols f1, f2 and f3. Then, load the library lib2, and from it, please locate the symbols g1, g2 and g3.”

Windows Binary (.exe / .dll)
+------------------------------------------------------------------------------+
|  .idata section                                                              |
|  +--------------------------+                                                |
|  |     Directory table      |                                                |
|  |                          |                                                |
|  |  +--------------------+  |   ILT (symbol names)    IAT (addresses)        |
|  |  |  Entry: lib1.dll   |--+--> +--------------+    +--------------+        |
|  |  |  name / ILT / IAT  |  |    |  f1          |    |  <- addr(f1) |        |
|  |  +--------------------+  | +->|  f2          |    |  <- addr(f2) |        |
|  |           |              | |  |  f3          |    |  <- addr(f3) |        |
|  |           +--------------+-+  +--------------+    +--------------+        |
|  |                          |              ^                   ^             |
|  |  +--------------------+  |   ILT        |        IAT        |             |
|  |  |  Entry: lib2.dll   |--+--> +--------------+    +--------------+        |
|  |  |  name / ILT / IAT  |  |    |  g1          |    |  <- addr(g1) |        |
|  |  +--------------------+  | +->|  g2          |    |  <- addr(g2) |        |
|  |           |              | |  |  g3          |    |  <- addr(g3) |        |
|  |           +--------------+-+  +--------------+    +--------------+        |
|  +--------------------------+              |                   |             |
|                                            +---------+---------+             |
+------------------------------------------------------------------------------+
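The per-library binding in the diagram can be sketched as a small data model. Everything here is a simplification with hypothetical names (ImportEntry, Exports, bind_imports); the real on-disk structures are offsets and packed records, not strings and vectors. The point to notice is that each symbol is looked up only inside the DLL named by its own entry.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Toy model: a loaded DLL is just a name -> address map of its exports.
using Exports = std::map<std::string, std::uintptr_t>;

// One directory-table entry: which DLL, which names (ILT), where to write (IAT).
struct ImportEntry {
    std::string dll_name;
    std::vector<std::string> lookup_table;      // ILT: symbol names to locate
    std::vector<std::uintptr_t> address_table;  // IAT: filled in by the loader
};

// Resolve each name only inside the DLL that its entry names.
// Returns false if a DLL or one of its symbols is missing.
bool bind_imports(std::vector<ImportEntry>& directory,
                  const std::map<std::string, Exports>& loaded_dlls) {
    for (ImportEntry& entry : directory) {
        assert(!entry.dll_name.empty());  // every entry must name its DLL
        auto dll = loaded_dlls.find(entry.dll_name);
        if (dll == loaded_dlls.end()) return false;
        entry.address_table.assign(entry.lookup_table.size(), 0);
        for (std::size_t i = 0; i < entry.lookup_table.size(); ++i) {
            auto sym = dll->second.find(entry.lookup_table[i]);
            if (sym == dll->second.end()) return false;
            entry.address_table[i] = sym->second;
        }
    }
    return true;
}
```

In this model, if two DLLs both export a symbol with the same name, there is no ambiguity: the entry decides which one is used.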

Linux import sections

As we shall see, this is not what happens on Linux. A Linux binary contains not one but two sections that together encode almost the same information as .idata does on Windows. The .dynamic section contains a raw list of library names, and the .dynsym section contains the more famous symbol table. It contains not just the symbols to be imported, but all the dynamic symbols of this binary, including the ones it exports. The ones to be imported are marked as undefined (UND).

`.dynamic` section

readelf -d $(which ls) | grep NEEDED
 0x0000000000000001 (NEEDED)             Shared library: [libcap.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]

`.dynsym` section

readelf -W --syms $(which ls)

Symbol table .dynsym contains 132 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND 
     1: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND __ctype_toupper_loc@GLIBC_2.3 (2)
     2: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND getenv@GLIBC_2.2.5 (3)
     3: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND cap_to_text
     4: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND sigprocmask@GLIBC_2.2.5 (3)
     5: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND __snprintf_chk@GLIBC_2.3.4 (4)
     6: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND raise@GLIBC_2.2.5 (3)
     7: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND free@GLIBC_2.2.5 (3)
     8: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND __vfprintf_chk@GLIBC_2.3.4 (4)
     9: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND __libc_start_main@GLIBC_2.34 (5)
    10: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND __mempcpy_chk@GLIBC_2.3.4 (4)
    11: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND abort@GLIBC_2.2.5 (3)
    12: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND __errno_location@GLIBC_2.2.5 (3)
    13: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND strncmp@GLIBC_2.2.5 (3)
    14: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND _ITM_deregisterTMCloneTable
    15: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND localtime_r@GLIBC_2.2.5 (3)
    16: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND _exit@GLIBC_2.2.5 (3)
    17: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND __fpending@GLIBC_2.2.5 (3)
    18: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND flistxattr@GLIBC_2.3 (2)
    19: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND isatty@GLIBC_2.2.5 (3)
    20: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND sigaction@GLIBC_2.2.5 (3)
    21: 0000000000000000     0 FUNC    GLOBAL DEFAU

To summarize, .dynamic and .dynsym are separate buckets of lib names and symbol names.

Schematically, a Linux binary speaks to the loader in very different terms. It tells the loader, “Here are the libraries I want you to map into the process. And here is the bucket of symbols I ask you to locate anywhere within these libraries.”

Linux Binary (ELF)
+------------------------------------------------------------------------------+
|                                                                              |
|  .code section                                                               |
|  +------------------------------------------------------------------------+  |
|  |  .text   (executable instructions)                                     |  |
|  |  .rodata (read-only data, string literals, ...)                        |  |
|  |  PLT     (per-symbol call stubs -> resolved via GOT at runtime)        |  |
|  +------------------------------------------------------------------------+  |
|                                                                              |
|  .data section                                                               |
|  +------------------------------------------------------------------------+  |
|  |  .data   (initialised global/static variables)                         |  |
|  |  .bss    (uninitialised global/static variables)                       |  |
|  |  GOT     (per-symbol address slots, patched by loader at runtime)      |  |
|  +------------------------------------------------------------------------+  |
|                                                                              |
|  .dynamic section                                                            |
|  +------------------------------------------------------------------------+  |
|  |  NEEDED  libcap.so.2                                                   |  |
|  |  NEEDED  libc.so.6                                                     |  |
|  |  (library names only -- no symbol information)                         |  |
|  +------------------------------------------------------------------------+  |
|                                                                              |
|  .dynsym section                                                             |
|  +------------------------------------------------------------------------+  |
|  |  Ndx  Name                         Bind    Source                      |  |
|  |  ---  ---------------------------  ------  --------------------------  |  |
|  |  UND  __ctype_toupper_loc          GLOBAL  GLIBC_2.3                   |  |
|  |  UND  getenv                       GLOBAL  GLIBC_2.2.5                 |  |
|  |  UND  cap_to_text                  GLOBAL  libcap.so.2                 |  |
|  |  UND  sigprocmask                  GLOBAL  GLIBC_2.2.5                 |  |
|  |  UND  free                         GLOBAL  GLIBC_2.2.5                 |  |
|  |  UND  abort                        GLOBAL  GLIBC_2.2.5                 |  |
|  |   :   ...                                                              |  |
|  |   9   main                         GLOBAL  (defined in this binary)    |  |
|  |  10   some_exported_fn             GLOBAL  (defined in this binary)    |  |
|  |   :   ...                                                              |  |
|  +------------------------------------------------------------------------+  |
|                                                                              |
+------------------------------------------------------------------------------+

These seemingly benign differences in the architecture have far-reaching consequences. A symbol on Linux can be resolved from any of these libraries: the one you intended, or another. On the plus side, this might be intentional. You might have imported some third-party library and want to override its implementation of some function with your own. This design enables it. This is just the tip of the iceberg of much deeper design decisions.

Interposition

Interposition is the ability to override a symbol in one binary from another. It is a cornerstone of the Linux execution model and a fundamental pillar of its ABI design.

You might come across an alleged motivation for this. There are some claims online that the ELF pioneers designed it this way so that dynamic shared libraries would mimic the behavior of earlier static libraries. Supposedly, with static libraries, if several libraries implement the same symbol, then the first definition encountered is used and it shadows the later ones. However, that is not entirely true. It might be the case, but you might also get a linker error complaining about multiple definitions. The exact behavior depends upon which members are extracted from the static archives.

A different conjectured motivation is that the canonical model in the minds of the ELF architects was the library libc. libc is beyond ubiquitous, it is universal. It is loaded by practically all applications and it is huge. Therefore, it is reasonable to anticipate that users might want to override some of its implementations. Therefore, this capability was baked into the architecture. But this is guesswork. We have no internal knowledge about the motivation.

So, we have discussed one facility that enables interposition: the separation of library names from the list of symbols to consume. Let’s discuss another: the library search order.

Library Search Order

The library search order on Linux is breadth-first. What this means is the following. Suppose we have an executable which loads the dynamic libraries lib1 and lib2, and these in turn load the libraries lib3, lib4 and lib5. Furthermore, say that lib5 wants to use the symbol foo. The loader searches the binaries for the symbol foo in this order: first the exe (the executable), then lib1 and lib2, then lib3 and lib4, and only finally lib5, even if lib5 implements foo. All the other binaries get an opportunity to interpose foo before it is sought in lib5 itself.

                    ┌───────────────┐
                    │      exe      │
                    └───────┬───────┘
            ┌───────────────┴───────────────┐
            │                               │
    ┌───────┴───────┐               ┌───────┴───────┐
    │     lib1      │               │     lib2      │
    └───────┬───────┘               └───────┬───────┘
    ┌───────┴───────┐                       │
    │               │                       │
┌───┴───────┐ ┌─────┴─────┐         ┌───────┴───────┐
│   lib3    │ │   lib4    │         │     lib5      │
└───────────┘ └───────────┘         │  defines foo  │
                                    └───────────────┘
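The search order in the diagram can be reproduced with an ordinary breadth-first walk. The sketch below is a toy model under my own naming (Binary, Process, search_order, resolve); it ignores symbol versioning, weak symbols and everything else a real loader has to care about, and only shows the "first definition in breadth-first order wins" rule.

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <set>
#include <string>
#include <vector>

// Toy model of a loaded binary: its direct dependencies and defined symbols.
struct Binary {
    std::vector<std::string> needed;
    std::set<std::string> defines;
};
using Process = std::map<std::string, Binary>;

// Breadth-first walk starting at the executable; each binary appears once.
std::vector<std::string> search_order(const Process& p, const std::string& exe) {
    assert(p.count(exe) == 1);  // the executable must be part of the model
    std::vector<std::string> order;
    std::set<std::string> seen{exe};
    std::queue<std::string> pending;
    pending.push(exe);
    while (!pending.empty()) {
        std::string current = pending.front();
        pending.pop();
        order.push_back(current);
        auto it = p.find(current);
        if (it == p.end()) continue;  // unknown dependency: nothing to expand
        for (const std::string& dep : it->second.needed)
            if (seen.insert(dep).second) pending.push(dep);
    }
    return order;
}

// The first binary in the search order that defines the symbol wins.
// Returns an empty string if nobody defines it.
std::string resolve(const Process& p, const std::string& exe,
                    const std::string& symbol) {
    for (const std::string& name : search_order(p, exe)) {
        auto it = p.find(name);
        if (it != p.end() && it->second.defines.count(symbol)) return name;
    }
    return "";
}
```

With the dependency tree from the diagram, a lookup of foo lands in lib5 only as long as no binary earlier in the order defines foo as well.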

In particular, the executable is consulted before the current library. This is the default behavior, and it can be adjusted. The most direct way is to link lib5 with the linker switch -Bsymbolic. -Bsymbolic tells the linker that, when resolving symbols referenced in lib5, it should search lib5 itself before the usual breadth-first order.

`LD_PRELOAD`

LD_PRELOAD is an environment variable. If it is set and contains library names, those libraries are loaded after the executable, but before any dependent libraries.

                    ┌───────────────┐
                    │      exe      │ 
                    └───────┬───────┘
                    ┌───────┴───────┐
                    │   LD_PRELOAD  │  
                    └───────┬───────┘
            ┌───────────────┴───────────────┐
            │                               │
    ┌───────┴───────┐               ┌───────┴───────┐
    │     lib1      │               │     lib2      │
    └───────┬───────┘               └───────┬───────┘
    ┌───────┴───────┐                       │
    │               │                       │
┌───┴───────┐ ┌─────┴─────┐         ┌───────┴───────┐
│   lib3    │ │   lib4    │         │     lib5      │
└───────────┘ └───────────┘         │  defines foo  │
                                    └───────────────┘

Can a shared-library symbol be overridden from an executable?

Can a shared-library symbol be overridden from an executable? On Windows, the answer is no. On Linux, the answer is yes, thanks to interposition. This is exactly what interposition does. On macOS, the answer is yes, but not by default.

C++ `new` operator

Here is a quote from the C++ standard.

operator new(std::size_t)

operator new(std::size_t, std::align_val_t)

…

The program’s definitions are used instead of the default versions supplied by the C++ standard library.

That is what happens on Linux, thanks to interposition. That is not what happens on Windows.

                    ┌─────────────────────┐
                    │          exe        │
                    | operator new(size_t)|
                    └───────────┬─────────┘
            ┌───────────────────┴────────────────────┐
            │                                        │
    ┌───────┴───────┐                   ┌────────────┴─────────┐
    │     glibc     │                   │       libc++         │
    |               |                   | operator new(size_t) |
    +---------------+                   +----------------------+

In fact, Windows can’t do that. Strictly speaking, Windows does not conform to this clause of the standard.
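For reference, this is roughly what a program-supplied replacement looks like. It is a minimal sketch, not a production allocator: the call counter and the count_replaced_calls helper are illustrative additions of mine, and the aligned and nothrow overloads are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Illustrative counter, not part of any standard interface.
static std::size_t g_new_calls = 0;

// Program-supplied replacement for the global allocation function.
void* operator new(std::size_t size) {
    ++g_new_calls;
    if (size == 0) size = 1;  // malloc(0) is allowed to return a null pointer
    if (void* p = std::malloc(size)) return p;
    throw std::bad_alloc{};
}

// Matching replacements, so memory obtained from malloc is released with free.
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

// Allocate and release through the global functions, and report how many
// times the replacement above was entered in between.
std::size_t count_replaced_calls() {
    std::size_t before = g_new_calls;
    void* p = ::operator new(32);
    assert(p != nullptr);
    ::operator delete(p);
    return g_new_calls - before;
}
```

Within the defining binary this works on every platform. The difference discussed here is whether allocations made inside other shared libraries also end up in this definition.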

Symbol Resolution Time

Let’s discuss another mechanism that supports interposition. The default behavior is governed by the default linker switch --allow-shlib-undefined. Suppose we have the tree of dependencies below, where lib5 wishes to import the symbol foo and the exe wishes to import the symbol bar. The resolution of the exe’s symbols is checked at link time. The linker would refuse to link the executable if it cannot resolve the symbol bar.

That is not true of the libraries. The linker would very happily agree to link lib5, even if it cannot see foo. As far as the linker is concerned, the implementation of foo might lie in the executable, where it has no chance of discovering it.

                    ┌───────────────┐
                    │      exe      │---(bar    Resolution checked 
                    └───────┬───────┘           at link time
                    ┌───────┴───────┐
                    │   LD_PRELOAD  │  
                    └───────┬───────┘
            ┌───────────────┴───────────────┐
            │                               │
    ┌───────┴───────┐               ┌───────┴───────┐
    │     lib1      │               │     lib2      │
    └───────┬───────┘               └───────┬───────┘
    ┌───────┴───────┐                       │
    │               │                       │
┌───┴───────┐ ┌─────┴─────┐         ┌───────┴───────┐
│   lib3    │ │   lib4    │         │     lib5      │---(foo  Resolution NOT
└───────────┘ └───────────┘         └───────────────┘        checked at link time

This behavior can be controlled with some switches. The easiest is to link the executable with --no-allow-shlib-undefined.

C++ Implication #1 : How to form a process-wide singleton?

Let’s discuss a developer-facing implication of this. Can we have a process-wide singleton? By singleton, I mean the Meyers singleton design pattern: a class with a single instance which is usable by all the code in the process, from all binaries.

On Windows, the usual singleton design pattern would create a per-binary singleton. To form a process-wide singleton that is visible to all binaries in the process, you would need to export the singleton variable from a shared library, and have all the binaries in the process link against that single shared library and consume the singleton from there.

In Linux, you just need to code your singleton and put it in the executable. It’ll be picked up from there by whichever binary is trying to use it.
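As a reminder, here is a minimal sketch of the Meyers singleton being discussed. The class name Config and its single field are placeholders of mine. The code only guarantees one instance per copy of instance(); whether all binaries in a process end up sharing one copy is exactly the platform-dependent part, and on Linux it assumes the symbol keeps its default visibility.

```cpp
#include <cassert>
#include <string>
#include <utility>

class Config {
public:
    // Function-local static: constructed on first use,
    // with thread-safe initialization since C++11.
    static Config& instance() {
        static Config the_instance;
        return the_instance;
    }

    Config(const Config&) = delete;
    Config& operator=(const Config&) = delete;

    void set_name(std::string n) { name_ = std::move(n); }
    const std::string& name() const { return name_; }

private:
    Config() = default;
    std::string name_;
};

// Every caller that reaches this copy of instance() sees the same object.
bool same_instance_everywhere() {
    Config& a = Config::instance();
    Config& b = Config::instance();
    assert(&a == &b);
    return &a == &b;
}
```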

C++ Implication #2 : Can you have circular library dependencies?

Can we have circular binary dependencies? That is to say, can we have lib1 that uses symbols from lib2 and at the same time lib2 that uses symbols from lib1? In Linux, the answer is yes. You don’t have to do anything special for this. Both libraries would be very happy to link with the symbols being undefined at link-time, and resolved only by the loader.

On Windows, the short answer is no, not by design. You’d have to hack pretty hard to get there.

However, note that one should put considerable thought in architecting the libraries well and not be sloppy. Here is a quote from Fangrui Song aka maskray, the main lld maintainer today:

This [allowed-shlib-undefined] is an unfortunate default for shared libraries. Changing it may be disruptive today. Mach-O and PE/COFF have many problems, but this may be a place where they got it right.

Position Independent Code : GOT(Global Offset Table)

The best way to explain position independent code is to understand what isn’t position independent code. Consider the binary below. This call into a fixed, hardcoded address is not position independent. Had the entire binary been loaded at a different address, this baked-in call target would have been invalid.

binary loaded at
0x401000
     +----------------------------+
     |                            |
     |  .code                     |
     |  +----------------------+  |
     |  |  ...                 |  |
     |  |  call 0x401550       |  | absolute
     |  |  ...                 |  | address
     |  |                      |  |  
     |  |  foo @ 0x401550      |  | 
     |  |  ...                 |  |
     |  +----------------------+  |
     |                            |
     +----------------------------+

Here is another form of a call. A PC-relative call. In this form of the call, the target is relative to the current location in the program. This is position independent, but this does not allow for interposition.

binary loaded at
0x401000
     +----------------------------+
     |                            |
     |  .code                     |
     |  +----------------------+  |
     |  |  ...                 |  |
     |  |                  <---+--+---+
     |  |  ...                 |  |   |
     |  |                      |  |   |
     |  |  call rip-12     ----+--+---+
     |  |  ...                 |  |
     |  +----------------------+  |
     |                            |
     +----------------------------+

In this form of a call, the loader has no place to intervene and hijack the implementation of the function. So, this form is used, but only for what are called hidden symbols, which hopefully we’ll have time to talk about later.

The default mechanism used to implement position independent code is another level of indirection: the call is made indirectly, through a fixed offset into a table of addresses (pointers). We already mentioned this table: it is the global offset table.
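The difference between the absolute and the PC-relative call comes down to simple address arithmetic, which the sketch below works through. The struct and function names are mine, and the numbers in the usage note are chosen to match the 0x401000 and 0x401550 of the first diagram; the call-site offset 0x100 is an arbitrary assumption.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of a call instruction inside a binary image.
struct CallSite {
    std::uint64_t next_instruction_offset;  // from the image base
    std::int64_t displacement;              // signed distance to the callee
};

// PC-relative: the target is computed from wherever the code currently lives.
std::uint64_t relative_target(std::uint64_t load_base, const CallSite& c) {
    std::uint64_t next_instruction = load_base + c.next_instruction_offset;
    return next_instruction + static_cast<std::uint64_t>(c.displacement);
}

// Absolute: the target was baked in for one particular load address.
std::uint64_t absolute_target(std::uint64_t baked_in_address) {
    return baked_in_address;
}

// True if the relative call still reaches the callee after the image moves.
bool relative_call_survives_relocation(const CallSite& c,
                                       std::uint64_t callee_offset,
                                       std::uint64_t old_base,
                                       std::uint64_t new_base) {
    assert(relative_target(old_base, c) == old_base + callee_offset);
    return relative_target(new_base, c) == new_base + callee_offset;
}
```

With a callee at offset 0x550 and a call whose next instruction sits at offset 0x100, the displacement is 0x450. Loaded at 0x401000, both forms reach 0x401550. Loaded anywhere else, the relative form still reaches base + 0x550, while the absolute form still points at 0x401550.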

binary loaded at
0x401000
     +----------------------------+
     |                            |
     |  .code                     |
     |  +----------------------+  |
     |  |  ...                 |  |
     |  |                      |  |
     |  |                      |  |
     |  |  call [.got+42]      |  |          
     |  |  ...                 |  |
     |  |                      |  |
     |  |  call [.got+42]      |  |
     |  |  ...                 |  |
     |  +----------------------+  |
     |                            |
     |  .got                      |
     |  +----------------------+  |
     |  | 0x123                |  |
     |  | 0x789                |  |
     |  +----------------------+  |     
     +----------------------------+

If this entire binary is relocated, then the addresses in the .got do need to change. So the binary as a whole is not position independent, but the .code section is. At run time, the call effectively becomes:

binary loaded at
0x401000
     +----------------------------+
     |                            |
     |  .code                     |
     |  +----------------------+  |
     |  |  ...                 |  |
     |  |                      |  |
     |  |  ...                 |  |
     |  |                      |  |
     |  |  call 0x789          |  |
     |  |  ...                 |  |
     |  +----------------------+  |
     |                            |
     |  .got                      |
     |  +----------------------+  |
     |  | 0x123                |  |
     |  | 0x789                |  |
     |  +----------------------+  |     
     +----------------------------+

This is the default implementation in all of the shared libraries unless you employed some sort of exotic switches.

Performance implications

Consider the below toy code where the function f() is called from the function g().

void f(){ /*...*/ }
void g(){ f(); }

If you place the above code into an executable, f() would most probably be inlined into g(). That is not the case, if you put the same code in a shared library.

If you put this code in a shared library, the compiler is no longer free to assume that the f() implementation it sees right there is the f() implementation that would be used at run time. Interposition might kick in and preempt this implementation. In particular, inlining is off the table. But even more than that, all the attributes that the compiler could have deduced about f() are thrown away.
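One way to give the compiler its freedom back is to declare f() hidden, so that it is not exported from the shared library and therefore cannot be interposed. The sketch below assumes GCC or Clang, since the visibility attribute is a compiler extension; the LOCAL_ONLY macro name and the function bodies are placeholders.

```cpp
#include <cassert>
#include <climits>

// The visibility attribute is a GCC/Clang extension; elsewhere expand to nothing.
#if defined(__GNUC__) || defined(__clang__)
#define LOCAL_ONLY __attribute__((visibility("hidden")))
#else
#define LOCAL_ONLY
#endif

// Hidden: not exported from the shared library, so no other binary can
// preempt this definition.
LOCAL_ONLY int f(int x) { return x * 2; }

// Exported entry point. Since f cannot be replaced at run time,
// the compiler is free to inline it here.
int g(int x) {
    assert(x > INT_MIN / 2 && x < INT_MAX / 2);  // keep x * 2 + 1 in range
    return f(x) + 1;
}
```

Whether f() actually gets inlined is still up to the optimizer; the attribute only removes the obstacle.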

Position Independent Code - switches

To build a shared library that is position independent, you have no choice but to use the -fPIC flag. To build an executable as position independent, you use a different switch with slightly different semantics.

Recall that executable symbols are not interposable: nothing comes before the executable in the search order. But in recent decades, since ASLR (Address Space Layout Randomization) hit the scene, executables are relocatable and can be loaded at varying addresses. To support that, the executable must be built with the -fPIE (Position Independent Executable) flag.

Lazy Binding a.k.a. delayed loading

Let’s discuss lazy binding. This is probably the most technically demanding topic in this post. Lazy binding is the practice of postponing the resolution of a symbol not just to load time, but all the way to the first call. The motivation is that, if you have a large binary with tons of symbols but your particular invocation uses only a few of them, then a lot of resolution work potentially goes to waste. This can amount to a tangible and unnecessary delay in load time. So, the ELF designers devised a scheme to resolve only what you actually use, the first time you use it. If you type gcc --version, the response is instantaneous, even though gcc is a large binary with a very large number of symbols. Lazy binding is how this is achieved.

Lazy bind by default

By default, Linux uses lazy binding. If you do not intervene, symbols are lazily bound. You can intervene with the compiler switch -fno-plt, the linker switch -z now, the function attribute noplt at function granularity, or the environment variable LD_BIND_NOW. On Windows, lazy binding is off by default. In Windows lingo, this is called delayed load, and to enjoy it, you need to list the DLLs you want to bind lazily using the /DELAYLOAD linker switch.

/DELAYLOAD:<your_dll.dll>

The true flow of `call` in Linux

Recall that this is where we left off in our pursuit of a true call on Linux. To understand lazy binding, let’s take a closer look at it. To have lazy binding, the address of the function in the .got slot obviously has to change at run time.

+----------------------------+        +---------------+
| .code                      |        |      lib      |
| +----------------------+   |        +---------------+
| | f();                 |   |        |               |
| |   call [got_slot_42] |   |        | f:            |
| +----------------------+   |        |   //Defn      |
|                            |        |               |
|                            |        +---------------+
|                            |
| .got                       |
| +----------------------+   |
| | got_slot_42:         |   |
| | ...                  |   |
| +----------------------+   |
|                            |
+----------------------------+

Initially, the address in got_slot_42 has to route execution to someone that performs the actual resolution. And this someone has to overwrite the slot with the resolved address of f(){ /*...*/ }. Here is how it is achieved. The call is in fact not an indirect call through the .got slot. It is a direct call into a procedure lookup stub. See the diagram below.

+-------------------------------------------+        +---------------------+
|   .code                                   |        |         lib         |
|   +---------------------------------+     |        +---------------------+
|   | f();                            |     |        |                     |
|   |   call procedure_lookup_stub_42 |     |        | f:                  |
|   +--------|------------------------+     |        |      //Defn         |   
|            |                              |        |                     |
|            |   +--------------------+     |        +---------------------+
|            +---|->jmp [got_slot_42] |-+   |          
|            +---|->push 42           | |   |      
|            |   |  jmp <ldr resolver>|-|---|---+    +---------------------+
|            |   +--------------------+ |   |   |    |       loader        |
|            |                          |   |   |    +---------------------+
|   .got     +---------+                |   |   |    |                     |
|   +------------------|--------------+ |   |   +----|-> <ldr resolver>:{  |
|   | got_slot_42: &"resolve42"    <--|-+   |        |     //Definition    |
|   | ...                             |     |        |   }                 |
|   +---------------------------------+     |        |                     |
|                                           |        +---------------------+
+-------------------------------------------+

This stub starts with an indirect jump through the got slot. As we mentioned, initially this slot contains the address of somewhere that performs the resolution. And incidentally, this somewhere is exactly the next instruction in the stub. So, on the first call, jmp [got_slot_42] causes control to jump exactly to the next instruction. The next instructions set up the argument for the resolver and jump to the resolver. The resolver is part of the loader, which is a different binary that is also loaded into the process address space.

The resolver performs the resolution: it does its magic, locates the function and overwrites the got slot with the address it found. Don’t forget that, after all this, it must hand execution over to f() itself, since all this was part of a call to f(). This is what the first call to the lazily bound function f looks like.

+-------------------------------------------+        +---------------------+
|   .code                                   |        |         lib         |
|   +---------------------------------+     |        +---------------------+
|   | f();                            |     |        |                     |
|   |   call procedure_lookup_stub_42 |     |   0x789| f:                  |
|   +--------|------------------------+     |        |      //Defn         |   
|            |                              |        |                     |
|            |   +--------------------+     |        +---------------------+
|            +---|->jmp [got_slot_42] |-+   |          
|            +---|->push 42           | |   |      
|            |   |  jmp <ldr resolver>|-|---|---+    +---------------------+
|            |   +--------------------+ |   |   |    |       loader        |
|            |                          |   |   |    +---------------------+
|   .got     +---------+                |   |   |    |                     |
|   +------------------|--------------+ |   |   +----|-> <ldr resolver>:{  |
|   | got_slot_42: 0x789           <--|-+   |        |     //Definition    |
|   | ...                             |     |        |   }                 |
|   +---------------------------------+     |        |                     |
|                                           |        +---------------------+
+-------------------------------------------+

After the first call, things are much simpler. The indirect jump in the first instruction of the stub goes directly to f.
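To make the two diagrams concrete, here is a minimal C++ sketch of a single lazily bound slot. It is a simulation, not what the loader actually executes: a plain function pointer stands in for the GOT slot, and the counters resolver_runs and f_calls are made up for illustration. The names got_slot_42 and procedure_lookup_stub_42 are borrowed from the diagrams, and resolve42 stands in for the "push 42; jmp <ldr resolver>" part of the stub.

```cpp
// Illustrative counters, so we can observe which path was taken.
int resolver_runs = 0;
int f_calls = 0;

// Stand-in for f, the function implemented in the shared library.
void f() { ++f_calls; }

using Fn = void (*)();

void resolve42();  // the slow path, defined below

// The GOT slot. Initially it points at the slow path, as in the first diagram.
Fn got_slot_42 = &resolve42;

// Slow path: stands in for "push 42; jmp <ldr resolver>".
void resolve42() {
    ++resolver_runs;
    got_slot_42 = &f;  // the resolver overwrites the GOT slot
    f();               // and then finishes the original call to f
}

// The stub itself: one indirect jump through the GOT slot.
void procedure_lookup_stub_42() { got_slot_42(); }
```

Calling procedure_lookup_stub_42() twice from main runs the resolver only once: the first call goes through resolve42 and patches the slot, the second call lands in f directly.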

Now, that was an individual procedure lookup stub. The full table of all such stubs is called the Procedure Linkage Table (PLT).

.plt
+----------------------+
| ...                  |
| jmp [got_slot_41]    |
|   push 41            |
|   jmp <ldr_resolver> |
|                      |
| jmp [got_slot_42]    |
|   push 42            |
|   jmp <ldr_resolver> |
|                      |
| jmp [got_slot_43]    |
|   push 43            |
|   jmp <ldr_resolver> |
| ...                  |
+----------------------+

That is as close as we get to the full truth.
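The table form can be sketched the same way. The point of the pushed number is that every entry shares one resolver, and the index is the only thing telling the resolver which slot to patch. Again, this is a simulation under assumptions: the functions twice, square and increment are hypothetical stand-ins for library functions, real_address stands in for the loader's symbol lookup, and the slots are numbered 0 to 2 rather than 41 to 43.

```cpp
#include <array>
#include <cstddef>

using Fn = int (*)(int);

// Hypothetical library functions that the three entries refer to.
int twice(int x)     { return 2 * x; }
int square(int x)    { return x * x; }
int increment(int x) { return x + 1; }

// Stand-in for the loader's symbol lookup: slot index -> real address.
constexpr std::array<Fn, 3> real_address = {&twice, &square, &increment};

int resolver_runs = 0;  // illustrative counter

int ldr_resolver(std::size_t index, int arg);  // shared by every entry

// One slow path per entry: "push <index>; jmp <ldr_resolver>".
template <std::size_t Index>
int push_index_and_resolve(int arg) { return ldr_resolver(Index, arg); }

// The GOT: every slot starts out pointing at its own slow path.
std::array<Fn, 3> got = {&push_index_and_resolve<0>,
                         &push_index_and_resolve<1>,
                         &push_index_and_resolve<2>};

// The shared resolver uses the index to pick the slot to overwrite.
int ldr_resolver(std::size_t index, int arg) {
    ++resolver_runs;
    got[index] = real_address[index];  // patch the GOT slot
    return got[index](arg);            // finish the original call
}

// The fast part of a PLT entry: "jmp [got_slot_N]".
template <std::size_t Index>
int plt_entry(int arg) { return got[Index](arg); }
```

With this in place, plt_entry<1>(7) resolves slot 1 on first use and returns 49. A later plt_entry<1>(3) returns 9 without touching the resolver, while slots 0 and 2 stay unresolved until something calls them.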

C++ Implication #3: Comparing function pointers

An interesting wrinkle comes up when we consider function pointers. Here is another quote from the standard:

C++ standard. If the pointers are both null, both point to the same function, or both represent the same address, they compare equal.

When you first come across this, you might wonder why this even needs to be said.

However, after learning about the PLT and GOT design, it seems like magic that this is even possible. The actual calls are made into PLT entries, and the .plt is per binary; calls to the same function from different binaries are routed through different PLTs.

+------------------------+   +-------------------------+
| exe                    |   | lib1                    | 
|                        |   |                         |
| +-----------------+    |   |  +-----------------+    |
| | call            |    |   |  | call            |    |
| |   .plt_exe+8    |    |   |  |    .plt_lib1+8  |    |
| +-----------------+    |   |  +-----------------+    |
|                        |   |                         |
| +-----------------+    |   |  +-----------------+    |
| | .plt_exe        |    |   |  | .plt_lib1       |    |
| +-----------------+    |   |  +-----------------+    |
|                        |   |                         | 
| +-----------------+    |   |  +-----------------+    |
| | .got_exe        |    |   |  | .got_lib1       |    |
| |     0x1234      |    |   |  |     0x1234      |    |
| +-----------------+    |   |  +-----------------+    |
+------------------------+   +-------------------------+

How can function pointers in different binaries compare equal? Well, Linux sweats quite a bit to get this done. Consider a state where the executable and the library both reference f, and say the library takes its address. What happens is that the address of the executable's PLT entry for f is stored in the dynamic symbol table (.dynsym) of the executable as the value of the symbol f.

      exe                          lib1
      +----------------------+     +----------------------+
      |                      |     |                      |
0x1234| void f() { ... }     |     |   auto* pf = &f;   --+----.
      |                      |     |                      |    |
      +----------------------+     |                      |    |
      |                      |     +----------------------+    |
.plt+8| .plt_exe             |     |                      |    |
      |   jmp [.got+4]       |     |                      |    |
      |                      |     +----------------------+    |
      +----------------------+                                 |
      |                      |                                 |
.got+4| .got_exe             |                                 |
      |   0x1234             |                                 |
      |                      |                                 |
      +----------------------+                                 |
      |                      |                                 |
      | .dynsym              |                                 |
      |   f: .plt_exe+8  <--++---------------------------------'
      |                      |
      +----------------------+

When the loader needs to resolve the address of f here, it takes it from the symbol table of the executable. This way, both the executable and lib1 get the same address for f, and it is in fact a valid function pointer: calling it goes through the executable's PLT entry and lands in f.
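The lookup rule can be sketched in a few lines of C++. This is a toy model, not loader code: Binary, Process, resolve_address and make_process are hypothetical names, each binary is reduced to a name-to-address map, and we assume f is defined in a third binary (libf here) that is searched after the executable. The only thing the sketch tries to show is that everyone who asks for the address of f gets the value published by the executable.

```cpp
#include <map>
#include <string>
#include <vector>

using Fn = void (*)();

int f_calls = 0;         // illustrative counter
void f() { ++f_calls; }  // the real definition of f

// Per-binary PLT entries for f. Calls made inside lib1 go through
// plt_lib1_f, calls made inside the executable go through plt_exe_f.
void plt_exe_f()  { f(); }
void plt_lib1_f() { f(); }

// A binary, reduced to the one table that matters here.
struct Binary {
    std::map<std::string, Fn> dynsym;  // symbol name -> symbol value
};

// Stand-in for the loader: walk the binaries in search order and
// return the first symbol value found.
Fn resolve_address(const std::vector<const Binary*>& search_order,
                   const std::string& name) {
    for (const Binary* binary : search_order) {
        auto it = binary->dynsym.find(name);
        if (it != binary->dynsym.end()) return it->second;
    }
    return nullptr;
}

struct Process {
    Binary exe;   // references f, publishes its PLT entry as the value of f
    Binary lib1;  // references f, publishes nothing
    Binary libf;  // hypothetical third binary that defines f
    std::vector<const Binary*> search_order() const {
        return {&exe, &lib1, &libf};  // the executable is searched first
    }
};

Process make_process() {
    Process p;
    p.exe.dynsym["f"] = &plt_exe_f;
    p.libf.dynsym["f"] = &f;
    return p;
}
```

In this model, taking &f from the executable and from lib1 both boil down to resolve_address(p.search_order(), "f"), which yields &plt_exe_f in both cases. The two pointers compare equal, and calling through either one still ends up in f.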