In this article, we explain the Apache Log4Shell vulnerability in plain English, and give you some simple educational code that you can use safely and easily at home (or even directly on your own servers) in order to learn more.

Just to be clear up front: we’re not going to show you how to build a working exploit, or how to set up the services you need in the cloud to deliver active payloads.

Instead, you will learn:

  • How vulnerabilities like this end up in software.
  • How the Log4Shell vulnerability works.
  • The various ways it can be abused.
  • How to use Apache’s suggested mitigations.
  • How to test your mitigations for effectiveness.
  • Where to go from here.

1. Improper input validation

The primary cause of Log4Shell, formally known as CVE-2021-44228, is what NIST calls improper input validation.

Loosely speaking, this means that you place too much trust in untrusted data that arrives from outsiders, and open up your software to sneaky tricks based on booby-trapped data.

If you’ve ever programmed in C, you’ll almost certainly have bumped into this sort of problem when using the printf() function (format string and print).

Normally, you use it something like this:


  int  printf(const char *format, ...);

  int  count; 
  char *name;

  /* print them out somewhat safely */

  printf("The name %.20s appeared %d times\n",name,count);

You provide a hard-coded format string as the first argument, where %.20s means “print the next argument as a text string, but give up after 20 bytes just in case”, and %d means “take an integer and print it in decimal”.
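
If you want to watch the %.20s limit in action, here is a minimal sketch of our own. The function name format_report() is an illustrative assumption, not part of any library; snprintf() is the standard C function that formats into a memory buffer instead of printing directly, so you can inspect the result:

```c
#include <stdio.h>

/* Hypothetical helper for illustration: formats the same message as   */
/* above into out (at most outsize bytes, including the trailing NUL). */
/* Returns what snprintf() returns: the length the full message needs. */
int format_report(char *out, size_t outsize, const char *name, int count) {
   /* hard-coded format string; name and count are only ever data */
   return snprintf(out, outsize, "The name %.20s appeared %d times\n", name, count);
}
```

Called with a 26-character name such as abcdefghijklmnopqrstuvwxyz and a count of 3, this should produce the text “The name abcdefghijklmnopqrst appeared 3 times”, with the name cut off after 20 bytes.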

It’s tempting also to use printf() when you want to print just a single string, like this, and you often see people making this mistake in code, especially if it’s written in a hurry:


   int  printf(const char *format, ...);

   /* printfhack.c */

   int main(int argc, char **argv) {
      /* print out first command-line argument */
      printf(argv[1]);    /* <-- use puts() or similar instead */
      return 0;
   }

In this code, the user gets not only to choose the string to be printed out, but also to control the very formatting string that decides what to print.

So if you ask this program to print hello, it will do exactly that, but if you ask it to print %X %X %X %X %X then you won’t see those characters in the output, because %X is actually a magic “format code” that tells printf() how to behave.

The special text %X means “get the next value off the program stack and print out its raw value in hexadecimal”.

So a malcontented user who can trick your little program into printing an apparently harmless string of %Xs will actually see something like this:


   C:\Users\duck> printfhack.exe "%X %X %X %X %X"

   155FA30 1565940 B4E090 B4FCB0 4D110A

As it happens, the fifth and last value in the output above, sneakily sucked in from the program stack, is the return address to which the program jumps after doing the printf()

…so the value 0x00000000004D110A gives away where the program code is loaded into memory, and thus breaks the security provided by ASLR (address space layout randomisation).

Software should never permit untrusted users to use untrusted data to manipulate how that very data gets handled.

Otherwise, data misuse of this sort could result.
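
As a rough illustration of that rule applied to printfhack.c, here is a sketch that treats the untrusted text purely as data. The function names show_untrusted() and render_untrusted() are our own inventions for this example; fputs() and snprintf() are standard C library functions:

```c
#include <stdio.h>

/* Hypothetical helper: writes text to fp exactly as given.          */
/* fputs() does no format processing, so %X has no special meaning.  */
int show_untrusted(FILE *fp, const char *text) {
   return fputs(text, fp);
}

/* Hypothetical helper: the same idea when you do need printf-style  */
/* formatting. The format string is hard-coded, and the untrusted    */
/* text is passed only as an argument to %s.                         */
int render_untrusted(char *out, size_t outsize, const char *text) {
   return snprintf(out, outsize, "%s", text);
}
```

With either helper, the input %X %X %X %X %X should come out as those literal characters, rather than as values pulled from the stack.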