Code Obfuscation For The CPP Language
Author: Dominik Picheta
Date of submission: Wednesday, 02 May 2018
Contents
1 Acknowledgments
2 Abstract
5 Design
5.1 Parser
5.2 Transformer
5.3 Renderer
6.4 Renderer
6.5 Testing
6.6 Demonstration
8 Conclusion
References
1 Acknowledgments
Thanks to Dr. Stuart Ferguson for supervising this fascinating project and for offering
guidance and help when required.
2 Abstract
Obfuscation is the action of making something unintelligible. In software development,
this action can be applied to source code or binary applications. The aim of this
dissertation was to implement a tool for the obfuscation of C and C++ source code.
The motivation was to allow proprietary code to be distributed to third parties without
risking a recreation of the intellectual property within it. While many obfuscators
exist, they seldom focus on software that is distributed in source code form. This
dissertation presents the challenges and successes that arose during the development
of a C and C++ source code obfuscator using the Nim programming language [1].
3 Introduction and Problem Specification
and tampering. Collberg [2] defines these attack types and suggests tools to defend
against each of them. For the purposes of this dissertation, the focus will be on
reverse engineering protection.
The main defense against reverse engineering is obfuscation [2]. Obfuscation can be
said to protect the intellectual property (IP) of software from reverse-engineering
attacks. It is important to protect the IP as it can include sensitive data, algorithms,
or the design of the software which the developer may not wish to be copied [3, Sec.
2]. Obfuscation is defined as the transformation of code into something unintelligible
which preserves the semantics of the original code.
Software can be distributed in two forms, as source code or compiled machine code.
Machine code instructions can be executed directly by a computer’s central processing
unit (CPU). For a programmer, reading these instructions is not an easy task and
writing them is even more difficult. Because of this, machine code is usually generated
from source code by a special program called a compiler. An executable or binary file
stores machine code instructions in a format that is specific to a particular operating
system.
Special applications called decompilers can be used to reverse the process of compilation.
That is, they take an executable file as input and output high-level source code
which matches the functionality of the executable. A disassembler is a special kind of
decompiler which translates machine language into assembly language. Decompilers
targeting languages such as C++ will often use a disassembler as the first stage of
the decompilation process.
Decompilers are an important and often used tool for the reverse engineering of
machine code. This is why applications designed for obfuscating executables focus
on the obstruction of disassembly and decompilation [4, Sec. 1]. The aptly named
obfuscator application developed in this dissertation does not put any effort into
obstructing decompilation. Instead it focuses on making the source code as difficult
to reverse engineer as possible, by decreasing its overall comprehensibility.
Reverse engineering source code is all about understanding it, figuring out the control
flow and how it interacts with its input data. With enough time, every piece of code
can be reverse engineered, so it is impossible to guarantee complete safety [5, Sec. 1].
But making the reverse engineering economically impractical is often enough.
A tool that obfuscates source code is incredibly valuable for preventing these attacks.
It can be a cheap way to protect IP from third parties who may have disassembled
an executable, or those who need to have access to the source code in order to build
it on a niche platform [6].
3.1 Challenges
The actual act of obfuscating source code is no easy task. There are many distinct
challenges that need to be explored.
3.1.1 Parsing
Because of the complexity involved in parsing C and C++ code, developing a full-featured
parser would take a lot of time. It is possible to take some shortcuts: a minimal
parser that picks out the desired syntactic elements of a source code file could
be written quickly, but it would always be missing vital features for some users.
The foundations of such a parser would always be rough; without taking into account
the full grammar of the C++ language from the start, it would eventually reach a dead
end.
Thankfully, the most popular C and C++ compilers, gcc and clang, are open source.
Even though their codebases are large, the code responsible for parsing is logically
separated from other functionality, making it reusable from other applications.
The compilers' parsers are incredibly robust and support every current C/C++
feature.
Reusing a parser comes with its own challenges, but solving them is much easier
than writing a custom parser from scratch. Section 6 discusses the implementation
challenges in detail.
3.1.2 Transformations
Once the code can be parsed, it needs to be represented in the program’s memory.
The representation needs to be flexible enough to be mutable, in order to facilitate
various code transformations. The transformations are necessary to obfuscate the
code.
The objective is to apply transformations that deliberately obfuscate the source code
of a program, so that its purpose or logic is concealed without any alterations being
made to its functionality.
Transformations can be separated into three main classes [7, Sec. 2]:
such as if statements or for loops to make them more difficult to understand [5,
Sec. 1].
Researchers [5], [7], [10], [11] are always investigating new and more complex ways to
obfuscate code by coming up with novel transformation techniques. Implementing all
such techniques is outside the scope of this dissertation, but the obfuscator project
does offer a great modern test bed for them.
3.1.3 Rendering
The data structure that contains the obfuscated code doesn’t reflect the code itself,
so it cannot be easily written to a file. This data structure needs to be converted
into a valid C or C++ source code representation.
The generated C or C++ source code needs to be free of errors. It cannot omit any
syntactical constructs as that would prevent the code from being compiled for testing.
3.1.4 Correctness
Collberg [2] explains that an obfuscator should maximize the obscurity of the obfus-
cated code. The obscurity of code refers to how time-consuming understanding and
reverse engineering it is.
Unfortunately, measuring obscurity empirically is difficult. It would require a controlled
experiment involving professional developers and a measurement of the time it takes
them to understand an obfuscated versus an original piece of code. As an example,
Regano et al. [12] performed such an experiment and found that their VarMerge
obfuscator “reduces by six times the number of successful attacks per unit of time.”
Performing a similar experiment would require a significant amount of resources and
time, which are not available for this dissertation. As an alternative, there are ways
to measure obscurity indirectly.
The alternative way to measure obscurity is by calculating the complexity of code.
There are multiple metrics which calculate this:
Naeem et al. [11] investigate these metrics in the context of decompilers and
obfuscators. They offer some alternatives to McCabe’s and Halstead’s metrics,
including a measure of the program size, the conditional complexity and the identifier
complexity.
The advantages of the McCabe and Halstead metrics are that they are widely supported
and tools exist for measuring them. László et al. [5] use McCabe’s complexity
together with program size for evaluating the obscurity of their obfuscator. Unfortunately,
these metrics are not a good way to evaluate the obscurity of the obfuscator
developed in this dissertation. They measure a very specific complexity feature which
is not affected by the obfuscator application.
3.2 Related work
4 System Requirements Specification
Tigress is another obfuscator. Unlike CShroud, its latest release is recent and it appears
to be maintained at the time of writing. The Tigress website describes it as “a
diversifying virtualizer/obfuscator for the C language that supports many novel
defenses against both static and dynamic reverse engineering and de-virtualization
attacks.”
Obfuscators that work on binary files also exist. One example of such an application
is obfuscator-llvm, which can output an obfuscated binary code file [9, Sec.
2.1.2]. This is different to the obfuscators mentioned above, which output an
obfuscated source code file.
Obfuscation is also a healthy subject of study. Research papers often investigate the
ideal transformations that can be applied to achieve the best defense against reverse
engineering.
4.2 Assumptions
The obfuscator application will be given syntactically and semantically valid C or
C++ code. Validity will be defined in terms of the standards supported, which will
include C99 [13] and C++98 [14]. Some features of the C11 and C++11 standards
may also be supported, but code containing those features may be viewed as invalid
by the obfuscator application.
A relatively modern computer system with access to a terminal will be required
to run the application. Due to the CLI nature of the application, the end user will
need to be comfortable with a terminal to use the obfuscator.
4.3 Constraints
The initial version of the obfuscator application will require macOS to run. This
constraint exists mainly because development took place on that operating system;
there is no reason the application cannot be compiled on other operating systems
and platforms with little to no changes to the code.
4.4 Requirements
At a high level, the obfuscator application is expected to parse a single C or C++
source code file, obfuscate it and save the obfuscated code into a new file. The
requirements including the specific obfuscation transformations that should be applied
to the code are outlined below.
Most developers take great care to format their code in a logical and readable manner.
They also add useful comments to code which describes its semantics. Removing
int main () {
    helloWorld(42); // Say hello.
}
Listing 2: Original code

int main(){helloWorld(42);}
Listing 3: Transformed code
Literals are present in almost all source code. Integer and string literals can be
obfuscated in a relatively simple manner through data transformation.
Understanding integer literals is easiest when they are represented in the commonly
used decimal numeral system. In C and C++ an integer literal can also be represented
in hexadecimal or octal. For the purposes of obfuscation all integer literals should
be represented using hexadecimal. This transformation is shown in Listing 4 and
Listing 5.
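As a rough illustration of the hexadecimal requirement, the following Python sketch renders an integer as a C hexadecimal literal. The function name to_hex_literal is hypothetical and is not taken from the obfuscator, which is written in Nim. One caveat worth keeping in mind: in C an unsuffixed hexadecimal literal may take an unsigned type where the equivalent decimal literal would not, so a real implementation may need to treat large values separately.

```python
def to_hex_literal(value: int) -> str:
    """Render a non-negative integer as a C hexadecimal literal.

    Hypothetical helper for illustration only. C integer literals are
    never negative; a leading minus sign is a separate unary operator.
    """
    if value < 0:
        raise ValueError("literals are non-negative; negate with unary minus")
    return f"0x{value:X}"


print(to_hex_literal(100))
```

Because the helper only changes the spelling of the literal, parsing its output back in base 16 gives the original value.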
The obfuscation of integer literals should go a step further. Each integer literal should
be transformed into a constant expression that evaluates to the original literal. A single
literal should be split into at least 4 different random integers, which are then
added and subtracted to give the original literal. This transformation is
shown in Listing 6 and Listing 7.
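The splitting step can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the obfuscator's actual Nim code: the name split_literal, the range of the random terms and the parenthesised output format are all hypothetical. All but the last term are random, and the last term is chosen so that the expression evaluates to the original value. The sketch does not guarantee that the terms are distinct, and it ignores C integer overflow and literal typing for values near the limits of int.

```python
import random
from typing import Optional


def split_literal(value: int, terms: int = 4,
                  rng: Optional[random.Random] = None) -> str:
    """Return a C constant expression with `terms` operands equal to `value`."""
    if terms < 2:
        raise ValueError("need at least two terms")
    rng = rng or random.Random()
    pieces = []      # (sign, magnitude) pairs, in output order
    running = 0      # value of the expression built so far
    for index in range(terms - 1):
        # The first operand is always positive to keep the output simple.
        sign = "+" if index == 0 else rng.choice("+-")
        magnitude = rng.randint(1, 0xFFFF)
        running += magnitude if sign == "+" else -magnitude
        pieces.append((sign, magnitude))
    # The final operand balances the random ones.
    remainder = value - running
    pieces.append(("+" if remainder >= 0 else "-", abs(remainder)))
    text = f"0x{pieces[0][1]:X}"
    for sign, magnitude in pieces[1:]:
        text += f"{sign}0x{magnitude:X}"
    return f"({text})"


print(split_literal(42, rng=random.Random(0)))
```

The generated expression uses only hexadecimal literals, addition and subtraction, so a C compiler can fold it back into a single constant at compile time.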
String literals in source code are usually displayed in a human-readable format. Every
string should be obfuscated by randomizing its representation: each character should
be represented using a hexadecimal escape sequence, an octal escape sequence (such as
\154 in Listing 9), or as-is. The mixing of different escape sequence formats should
confuse the reader and make it harder to write a simple tool that deobfuscates it. This
transformation is shown in Listing 8 and Listing 9.
"Hello"
Listing 8: Original code

"\x48""e\154l\x6F"
Listing 9: Transformed code
To ensure syntax compatibility with C and C++, the transformation must start a new
string literal after each hexadecimal escape sequence. Otherwise code like "\x48e"
would be incorrectly interpreted as a single 48E hexadecimal value instead of two \x48
and e characters.
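The string transformation, including the rule about closing the literal after a hexadecimal escape, can be sketched in Python as below. The names obfuscate_string and decode_for_check are hypothetical, as is the choice to leave only letters, digits and spaces unescaped. Octal escapes are always written with three digits, on the assumption that this avoids a following digit being absorbed into the escape. The decoder exists only to check the round trip; it relies on Python byte-string escapes behaving like C escapes for this restricted subset.

```python
import ast
import random
import string
from typing import Optional

# Characters assumed safe to emit unescaped inside a C string literal.
PLAIN = set(string.ascii_letters + string.digits + " ")


def obfuscate_string(text: str, rng: Optional[random.Random] = None) -> str:
    """Return a C string literal (possibly several adjacent ones) for `text`."""
    rng = rng or random.Random()
    data = text.encode("utf-8")
    out = ['"']
    for index, byte in enumerate(data):
        kinds = ["hex", "octal"]
        if chr(byte) in PLAIN:
            kinds.append("plain")
        kind = rng.choice(kinds)
        if kind == "hex":
            out.append(f"\\x{byte:02X}")
            # A hex escape has no length limit in C, so start a new literal.
            if index < len(data) - 1:
                out.append('""')
        elif kind == "octal":
            out.append(f"\\{byte:03o}")
        else:
            out.append(chr(byte))
    out.append('"')
    return "".join(out)


def decode_for_check(literal: str) -> str:
    """Decode the subset of C literals produced above (test helper only)."""
    segments = literal[1:-1].split('""')
    raw = b"".join(ast.literal_eval('b"' + seg + '"') for seg in segments)
    return raw.decode("utf-8")


print(obfuscate_string("Hello", random.Random(7)))
```

Adjacent string literals are concatenated by the C compiler, so the split pieces still form a single string at run time.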
The control paths in source code are essential to its execution, so they are an
important way to learn about the code. Many constructs used to control the flow of
execution contain boolean expressions in their predicates, including if, for and while
statements. Transforming boolean expressions is therefore an important method of
obfuscating code.
The boolean expressions should be transformed in such a way as to yield the same
results. Doing so is a form of control transformation. A simple transformation that
This adds extra noise which, when applied to every boolean expression, significantly
increases the obscurity of the obfuscated code.
4.4.4 Identifiers
It is often said in jest that there are only two hard things in Computer Science:
cache invalidation and naming things [15]. This has a ring of truth to it. Software
developers often take great care to name variables, functions and other identifiers in
a way that makes the code easy to understand. Obfuscating these names will have a
great effect on the comprehensibility of the source code.
Ideally all identifier types should be obfuscated, but variable and function names are
a good start. The relevant transformation should be aware of the visibility of the
identifier. If an identifier is exported via a header file, it should not be transformed
as doing so may lead to errors due to cross-source dependencies. Identifiers that are
not exported must be transformed. The transformation is described in Listing 12 and
Listing 13.
Note how in Listing 13 the main function defined at ❶ isn’t obfuscated. This is
because the linker looks for this function name when building the executable, to use as
the entry point of the program. Renaming it would cause an error.
The rest of the identifiers have all been obfuscated. How these identifiers are obfuscated
is up to the obfuscator implementation. In the case of Listing 13 the first 4 characters
of an MD5 hash function’s output have been used.
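A hash-based naming scheme of this kind might look like the Python sketch below. The function name obfuscated_name and the letter prefix are assumptions made for illustration: a hexadecimal digest can begin with a digit, which is not a valid start for a C identifier, and the source does not say how the obfuscator deals with this. With only four hexadecimal characters, collisions between different identifiers are possible, so a real implementation would presumably need to detect them.

```python
import hashlib


def obfuscated_name(key: str, length: int = 4, prefix: str = "o") -> str:
    """Derive a replacement identifier from the MD5 hash of `key`.

    `prefix` is a hypothetical addition that keeps the result a valid
    C identifier even when the digest starts with a digit.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return prefix + digest[:length]


print(obfuscated_name("helloWorld"))
```

Because the name is derived deterministically from the key, every reference to the same identifier receives the same replacement.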
4.4.5 Testing
An obfuscator lends itself well to testing because, while it’s a complicated piece of
software, it can be easily executed using simple scripts. A tester program or script
should be written to execute the obfuscator on sample source code and verify that
both the original and the obfuscated source code compile successfully.
The first testing category will be unit tests where sample source code containing a
single category of specific C and C++ language features will be tested. For example,
two unit tests can be created for array declaration syntax and if statements. A good
range of unit tests should be created to ensure the obfuscator works as expected, and
to make it easier to find issues during development.
The tester should also perform integration tests. The difference between these and
unit tests is that they will test the ability to obfuscate a full project. For every project,
every source code file should be obfuscated, compiled and tested. Real open source
projects should be used for this purpose to ensure the obfuscator works effectively.
5 Design
The architecture of the obfuscator application will be modelled after a typical compiler.
A compiler begins its work in the same manner as the obfuscator: by parsing source
code. The output of these tools is the differentiator: an obfuscator will output
obfuscated source code, while a compiler will output machine code or an executable
instead.
A high-level view of the obfuscator’s architecture is shown in Figure 1.
5.3 Renderer
The renderer component is the final component in the obfuscator architecture. It is
responsible for converting an AST into its C or C++ source code representation.
The sheer number of different kinds of AST nodes used to represent C and C++
source code means that rendering them all is a difficult process. Having to implement
a fully featured renderer would mean that the obfuscator could not be tested quickly.
A good way to work around this issue is to store the original code in each AST node.
That way, if the renderer cannot handle a specific AST node kind, the original code
can be used instead. This has the advantage that the renderer can be developed
incrementally, but means that while the renderer is unfinished certain code will remain
unobfuscated.
To demonstrate what this means, let’s look at a larger AST example:
Call (origCode: "printf(\"Hello World!\");")
  Identifier (origCode: "printf", name: "printf")
  StringLiteral (origCode: "\"Hello World!\"", value: "Hello World!")
IfStmt (origCode: "if (true) {\n printf(\"true\"); \n}")
  Branch
    Identifier (origCode: "true", name: "true")
    StmtList (origCode: "printf(\"true\");")
      Call (origCode: "printf(\"true\");")
        Identifier (origCode: "printf", name: "printf")
        StringLiteral (origCode: "\"true\"", value: "true")
Listing 14: A textual representation of a larger AST
Listing 14 shows the AST of Listing 15. The children are indented to show the
parent node they belong to. Note the data fields contained in each node.
Assuming that a renderer is used which does not support if statements, Listing 16
shows what the obfuscated code will look like.
printf(100);
if (true) {
    printf(200);
}
Listing 15: Source code for the AST in Listing 14

printf(0x64);if (true) {
    printf(200); ❷
}
Listing 16: Code rendered without if statement support
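The fallback behaviour behind Listing 16 can be sketched in a few lines of Python. This is an illustrative model only: the Node class, its field names and the set of supported node kinds are assumptions, and the real renderer is written in Nim. For simplicity the sketch treats every call as a statement and appends the semicolon itself.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Node:
    kind: str                      # e.g. "Call", "Identifier", "IfStmt"
    orig_code: str                 # source text the node was parsed from
    value: Any = None              # identifier name or literal value, if any
    children: List["Node"] = field(default_factory=list)


def render(node: Node) -> str:
    """Render a node, falling back to its original code if unsupported."""
    if node.kind == "StmtList":
        return "".join(render(child) for child in node.children)
    if node.kind == "Call":
        callee, *args = node.children
        rendered_args = ",".join(render(arg) for arg in args)
        return f"{render(callee)}({rendered_args});"
    if node.kind == "Identifier":
        return node.value
    if node.kind == "IntLiteral":
        return f"0x{node.value:X}"
    # Any other kind (here: IfStmt) is emitted exactly as it was written.
    return node.orig_code


IF_SOURCE = "if (true) {\n    printf(200);\n}"
program = Node("StmtList", "", children=[
    Node("Call", "printf(100);", children=[
        Node("Identifier", "printf", "printf"),
        Node("IntLiteral", "100", 100),
    ]),
    Node("IfStmt", IF_SOURCE),
])

print(render(program))
```

The first statement is rendered from its AST nodes and therefore loses its whitespace and gains a hexadecimal literal, while the if statement is copied verbatim, which matches the shape of Listing 16.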
6 Implementation and Testing
6.1 Programming language
6.2 Parser
The libclang library is used to parse C and C++ source code. In order to make use
of it, a thin Nim wrapper had to be written. This was made trivial thanks to the c2nim
tool, which generated most of it automatically.
The way in which the libclang library exposes information about the parsed source
code is a bit unusual. It seems to have been designed for the purposes of Integrated
Development Environment (IDE) introspection tools, which only need to understand
a small subset of the source code and not modify it. The information is exposed via a
CXCursor object, which represents a single position in the source code. This object can
be queried for information about the syntactical construct that is at the underlying
cursor position.
This is a problematic API for two reasons:
Research into these issues revealed more specific limitations of libclang, for example
the inability to retrieve the value of an integer literal [16]. This was a showstopper
for a while and it seemed like libclang would have to be abandoned in favour of
something else.
Further investigation revealed that libclang is a C wrapper on top of the original
clang parser written in C++. Using the clang parser directly, although much more
difficult, was always a possibility. But looking at the libclang source code closely
revealed certain abstract pointers to data exposed through the CXCursor object. A
further look at how the different libclang query functions work revealed that these
pointers are actually pointing to the original clang parser objects. By wrapping
the underlying C++ classes in Nim, it is possible to access these objects and the
information they store [17]. This approach allows the continued use of libclang and
access to all the necessary information by falling back to C++ when necessary.
6.2.1 Mutability
6.2.2 Preprocessor
#define PI 3.14159
printf("%f", PI);
Listing 17: Macro expansion

#ifdef __unix__
# include <unistd.h>
#elif defined _WIN32
# include <windows.h>
#endif
Listing 18: Conditional compilation
The libclang parser provides information about inclusion directives and macro expansions.
This allows the obfuscator to render these constructs in the obfuscated
code.
Unfortunately no information is provided about conditional compilation directives.
When parsed, the AST of Listing 18 contains only a single inclusion directive, with
the included file depending on the operating system that the parser is executed on.
This is a limitation which causes the produced AST to always be dependent on the
platform that it is produced on. For some use cases it is a serious limitation, but for
others it may in fact be a feature. Solving this limitation is beyond the scope of this
dissertation, but there are multiple approaches that can be considered in the future
for solving it:
6.3 Transformer
The transformer component is implemented fully in Nim in just over 100 lines of code.
This makes it the simplest component in the obfuscator.
This component’s job is to transform the AST in such a way as to obfuscate it. The
transformations performed include:
Most of these transformations are relatively trivial. Identifier renaming, which is
the most complicated, is described in more detail in the following section.
The transformation itself is very simple, but deciding whether it should be performed
for a specific identifier is non-trivial.
There are 3 different pieces of information collected by the ast module for the purposes
of this transformation:
• isGlobal flag: this determines whether an identifier is defined inside any of the
included header files.
• referencedLocally flag: this determines whether a referenced identifier is defined
inside the file being obfuscated.
Global identifiers, as determined by the isGlobal flag, are not renamed because they
may be used by external software. Similarly, only identifiers referenced locally are
renamed.
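The renaming decision described above can be written as a small predicate. The Python sketch below is an assumption-laden illustration: the IdentifierInfo record and the should_rename function are hypothetical names, the two flags mirror isGlobal and referencedLocally, and the special case for main follows the requirement that the program's entry point keeps its name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdentifierInfo:
    name: str
    is_global: bool            # defined inside an included header file
    referenced_locally: bool   # defined inside the file being obfuscated


def should_rename(info: IdentifierInfo) -> bool:
    """Decide whether an identifier may be renamed safely."""
    if info.name == "main":
        return False           # the entry point must keep its name
    return info.referenced_locally and not info.is_global


print(should_rename(IdentifierInfo("helper", False, True)))
```

Keeping the decision separate from the renaming itself means the name generation can be changed without revisiting the visibility rules.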
The new name for identifiers is generated using the USR (Unified Symbol Resolution)
string that libclang provides for each declaration. This string is hashed
using MD5 and the hash is used as the new name. This produces consistent and reliable identifier
obfuscations.
6.4 Renderer
Sometimes parser libraries implement their own rendering functionality. This is
the case with clang, but this functionality cannot be reused because the obfuscator
implements a custom AST.
A custom renderer is difficult to write, but it does give far more control over the
code that is created. For an obfuscator this is really important.
The obfuscator application’s renderer is written completely in Nim. As of writing,
it supports the vast majority of syntactical constructs, with unsupported constructs
being rendered using the original code as described in subsection 5.3.
The renderer intentionally omits whitespace as much as possible. This gets rid of all
formatting and has the effect that there are no newline characters in the obfuscated
code. The renderer is also responsible for rendering literals, which it does in accordance with
the system specification described in section 4.
6.5 Testing
The test suite is implemented in the tester.nim file. Running nimble test in the
obfuscator’s directory will compile the tester module and then execute it. Upon
execution tester runs a suite of unit and integration tests. Figure 5 shows the
successful execution of nimble test.
These tests ensure that the code semantics remain the same after obfuscation and
that the obfuscated code is valid C or C++ code.
Integration tests feature a similar set of actions, but for larger source code files,
including a large open source C project called wrk (see footnote 10). The wrk program is a high
performance HTTP benchmark tool. The test suite verifies that it can be obfuscated
and that the obfuscated code can be compiled.
6.6 Demonstration
For demonstration purposes, a simple web application was put together ahead of the
demo day. This web app wasn’t a part of the original system requirements; it was
created simply to show off the obfuscator in a more user-friendly manner. Figure 6
shows what the interface of this web application looks like. It can be accessed at the
following URL: https://picheta.me/obfuscator.
Footnote 10: https://github.com/wg/wrk
7 System Evaluation and Experimental Results
7.1 Tools and resources
for performing this evaluation will consist of the following steps for each source code
file:
A “diff” percentage will then be calculated by comparing the number of added and
removed characters in the diff, to the number of total characters in the diff.
This methodology is codified inside the evaluate.sh script to ensure the results can
be replicated easily for each source code file.
The difference metrics should give a good indication of how resilient the obfuscator is.
They also give an indication of the obscurity. The higher the difference between the
original code and the prettified code, the stronger the obscurity and resilience.
Character counts are collected using the wc tool. Diffing of files is performed using
git diff.
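To make the “diff” percentage concrete, the Python sketch below computes one plausible reading of it. It is not the evaluate.sh script: it substitutes Python's difflib for git diff and wc, and it assumes that the total counts the characters of every context, added and removed line in a unified diff while file and hunk headers are ignored. The function name diff_percentage is hypothetical.

```python
import difflib


def diff_percentage(original: str, prettified: str) -> float:
    """Share of diff characters that sit on added or removed lines."""
    lines = list(difflib.unified_diff(
        original.splitlines(), prettified.splitlines(), lineterm=""))
    changed = total = 0
    for line in lines[2:]:                # skip the "---" and "+++" file headers
        if line.startswith("@@"):         # hunk header, not file content
            continue
        body = line[1:]                   # drop the leading " ", "+" or "-"
        total += len(body)
        if line[0] in "+-":
            changed += len(body)
    return 100.0 * changed / total if total else 0.0


print(diff_percentage("x\ny\n", "x\nz\n"))
```

Under this reading, identical inputs score 0% and inputs with no lines in common score 100%. The exact figures depend on how much context the diff tool includes, so values from this sketch are not directly comparable with those reported in the results.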
7.2 Results
File                   Original size  Obfuscator diff  Stunnix diff  Tigress diff
revcomp.gcc            6549           66.2%            54.8%         72.3%
fannkuchredux.gcc      1605           71.5%            57.9%         68.8%
regexredux.gcc-4.gcc   7106           58.8%            54.9%         Error
pidigits.gcc           1219           66.4%            55.3%         80.0%
mandelbrot.gcc         2465           67.8%            56.7%         Error
8 Conclusion
C/C++ code. Several simple and advanced transformations have been implemented
successfully as set out in the system requirements, and at least one large software
project was successfully obfuscated using the developed tool. Furthermore, a novel
approach to the evaluation of obfuscators was devised and used to compare the system
developed to a state of the art commercial C/C++ obfuscator and to the Tigress
obfuscator, yielding great results in favour of the developed system.
During development and testing several opportunities for future work have been identified.
First of all, it was found that the preprocessor absorbs some vital information
such as conditional compilation constructs. These are required for the appropriate
obfuscation of platform-independent code. Multiple approaches to resolving this have
been proposed in the relevant sections.
In addition to the above, the obfuscator currently only implements one advanced form
of obfuscation. There is a lot of room for different transformations to be implemented,
including ones described in detail in various research papers. Indeed, most of the time
spent on this system went into researching and developing the foundations for an obfuscator,
so the system is a good base for further obfuscation research.
Taking inspiration from a paper by Regano et al. [12], in order to properly evaluate
the obfuscation quality, it would be good to create an experiment where humans
attempt to reverse engineer source code under lab conditions.
Some of the design choices made in this dissertation were particularly good, including
the decision to store the original code of each AST node and use it to bootstrap the
renderer. This allowed the obfuscator to be rapidly developed. Others were not so
good: the decision to use libclang means that some information about the AST is not
easily accessible. It provided a quick way to get started, but a rewrite of this project
would likely benefit from using the clang parser directly.
References
[1] D. Picheta, “Git repository.” [Online]. Available: https://gitlab.eeecs.qub.ac.uk/40122251/CSC3002_DominikPicheta.
[2] C. S. Collberg and C. Thomborson, “Watermarking, tamper-proofing, and obfuscation - tools for software protection,” IEEE Transactions on Software Engineering,
vol. 28, no. 8, pp. 735–746, Aug. 2002.
[3] S. Cho, H. Chang, and Y. Cho, “Implementation of an obfuscation tool for