
Introduction and approach

Fuzz-testing, also called fuzzing, is an essential tool in the tool-box of software developers and testers. The idea is simple yet effective: throw a huge amount of random test inputs at the target program until you manage to get it to crash or otherwise misbehave. These crashes often reveal defects in the code, overlooked corner-cases that are at best annoying for the end-user who stumbles upon them, or at worst dangerous if the holes have security implications. As part of our efforts to refactor the main components of Kdenlive, this is one of the tools we wanted to use to ensure as much stability as possible.

One of the most commonly used fuzzing libraries is LibFuzzer, which is built upon LLVM. It has already helped find thousands of issues in a wide range of projects, including well-tested ones. LibFuzzer is a coverage-based fuzzer, which means that it attempts to generate inputs that create new execution paths. That way, it tries to cover the full scope of the target software and is more likely to uncover corner-cases.
Building a library (in this case Kdenlive’s core library) with the correct instrumentation to support fuzzing is straightforward: with Clang, you simply need to pass the flag -fsanitize=fuzzer-no-link. And while we’re at it, we can also add Clang’s extremely useful AddressSanitizer with -fsanitize=fuzzer-no-link,address. This way, we are going to detect any kind of memory misuse as soon as it occurs.
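
In terms of raw compiler invocations, this boils down to something like the following (a sketch with made-up file names; the real build goes through CMake):

# Instrument the library under test for coverage and ASan, without linking
# libFuzzer's main() into it.
clang++ -g -O1 -fsanitize=fuzzer-no-link,address -c corelib.cpp -o corelib.o

# The fuzz target (shown below) is built with -fsanitize=fuzzer, which links
# in libFuzzer and provides the main() that drives the fuzzing loop.
clang++ -g -O1 -fsanitize=fuzzer,address fuzz_target.cc corelib.o -o fuzz_kdenlive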

Now that the library is ready for fuzzing, we need to create a fuzz target. That corresponds to the entry point of our program, to which the fuzzer is going to pass the random inputs. In general, it looks like this:

// fuzz_target.cc
#include <cstddef>
#include <cstdint>

// DoSomethingWithData stands in for whatever code consumes the fuzzer input.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) {
    DoSomethingWithData(Data, Size);
    return 0;
}

Now, the challenge is to come up with a good fuzzing function. Utilities that read from stdin or from an input file, like ffmpeg, any compression tool, any conversion tool, etc., are easy to fuzz, since we just need to fuzz the data we feed them. In the case of Kdenlive, we also read project files, but this represents only a tiny amount of what a full editing application is supposed to do, and furthermore our project loading logic is mostly deferred to third-party libraries. So, how do we fuzz the interesting parts of Kdenlive? Well, if you look at it from a distance, Kdenlive can more or less be summed up as “just” a (rich) GUI sitting on top of existing video manipulation libraries. That means that our most prominent source of inputs is the user: at the core, what Kdenlive must excel at is handling any kind of action the user may want to perform.

During the rewrite of our core modules, we changed the architecture a bit so that there is a clear separation between the model, which handles all the actual logic, and the view (written in QML), which is designed to be as thin as possible. Essentially, this means that any action executed by the user corresponds exactly to one or several calls to the model’s API. This makes our life easier when fuzzing: in order to effectively push Kdenlive to its limits, we simply need to call random model functions in a random order with random parameters.

The plan is getting clear now, but one crucial piece is missing: how do we turn the input provided by LibFuzzer (a random string) into a random sequence of model actions?

Generating a maintainable script language

One obvious idea would be to define a scripting language that maps text to actions; for example, move 0 3 4 could move clip 0 to track 3 at position 4. However, writing such a scripting language from scratch, and then an interpreter for it, is a daunting task and hard to maintain: each time a new function is added to the API, it must be added to the script language as well, and any change in the API is likely to break the interpreter.

Basically, we want to generate the script language and its interpreter programmatically, in a semi-automated way. One way to do this is to use reflection: by enumerating all the methods in the API, we can figure out what constitutes a legal operation, and interpret it correctly if it is indeed an existing operation. As of today, C++ still lacks native reflection capabilities, but there are some great libraries out there that help fill this gap. We used RTTR, a runtime reflection library. It requires you to register the functions you want to make available: in the following snippet we register a method called “requestClipsUngroup” from our timeline model:

RTTR_REGISTRATION
{
    using namespace rttr;
    registration::class_<TimelineModel>("TimelineModel")
        .method("requestClipsUngroup", &TimelineModel::requestClipsUngroup)
               (parameter_names("itemIds", "logUndo"));
}

Note that specifying the names of the parameters is technically not required by RTTR, but it is useful for our purposes.

Once we have that, our script interpreter is much easier to write: when we obtain a string like “requestClipDoSomething”, we check the registered methods for anything similar, and if we find a match, we also know which arguments to expect (their names as well as their types), so we can parse them easily as well (arguments are typically numbers, booleans or strings, so they don’t require complicated parsing).
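
To give a flavour of how this lookup works with RTTR, here is a minimal sketch (describeCall is a made-up helper, not Kdenlive’s actual interpreter):

#include <rttr/type>

#include <iostream>
#include <string>

// Sketch of the lookup step: resolve the method name against the registered
// API and list the parameters we will have to parse next. The real
// interpreter goes on to convert the argument tokens and invoke the method.
void describeCall(const std::string &methodName)
{
    rttr::method method = rttr::type::get_by_name("TimelineModel").get_method(methodName);
    if (!method.is_valid()) {
        std::cout << "unknown operation: " << methodName << std::endl;
        return;
    }
    for (const auto &param : method.get_parameter_infos()) {
        // The registered parameter name tells us what the argument means,
        // and its type tells us how to parse the corresponding token.
        std::cout << param.get_name().to_string() << " : "
                  << param.get_type().get_name().to_string() << std::endl;
    }
}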

For Kdenlive, there is one caveat though: the model is, by design, very finicky about the inputs it receives. In our example function, the first parameter, itemIds, is a list of ids of items on the timeline (clips, compositions,…). If one of the elements of the input list is NOT a known item id, the model is going to abort, because everything is checked through an assert. This behavior was designed to make sure that the view cannot sneak in an invalid model call without us knowing about it (by getting an immediate and irrevocable crash). The problem is that this is not going to play well within a fuzzing framework: if we let the fuzzer come up with random ids, there is little chance that they are going to be valid ids, and the model is going to be crashing all the time, which is not what we want.
To work around this, we added a small twist to our interpreter: whenever an argument is some kind of object id, for example an item id, we compute a list of currently valid ids (in the example, allValidItemIds). That way, if we parse an int with value i for this argument, we send allValidItemIds[i % allValidItemIds.size()] to the model instead. This ensures that the ids it receives are always going to be valid.
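
In code, the remapping itself is a one-liner. A sketch, with pickValidId as a made-up helper and the valid-id list assumed to be non-empty and computed just before the call:

#include <cstddef>
#include <vector>

// Sketch of the id remapping: fold the fuzzer-provided integer i onto the
// list of ids that are currently valid in the model, so the model never
// receives an id it does not know about.
int pickValidId(std::size_t i, const std::vector<int> &allValidItemIds)
{
    return allValidItemIds[i % allValidItemIds.size()];
}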

The final step for this interpreter to be perfect is to automatically create a small translation table between the long API names and shorter aliases. The idea behind this is that the fuzzer is less likely to randomly stumble upon a complicated name like “requestClipsUngroup” than on a one-letter name like “u”. In practice, LibFuzzer supports dictionaries, so it could in theory deal with these complicated names, but maintaining a dictionary is one extra hassle, so if we can avoid it, it’s probably for the best (a sketch of how such a table could be generated follows the script sample below). All in all, here is a sample of a valid script:

a 
c red  20 1
c blue  20 1
c green  20 1
b 0 -1 -1 $$ 0
b 0 -1 -1 $$ 0
b 0 -1 -1 $$ 0
e 0 294 295 0 1 1 0
e 0 298 295 23 1 1 0
e 0 299 295 45 1 1 0
e 0 300 296 4 1 1 0
e 0 299 295 43 1 1 0
e 0 300 296 9 1 1 0
l 0 2 299 294 1 0
l 0 2 300 294 1 0
e 0 299 295 43 1 1 0
e 0 300 296 9 1 1 0
e 0 299 295 48 1 1 0
e 0 294 296 8 1 1 0
e 0 294 295 3 1 1 0
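
As mentioned above, the translation table itself can be derived from the registered methods rather than maintained by hand. A minimal sketch, assuming aliases are assigned in registration order (buildAliasTable is a made-up helper, not Kdenlive’s actual code):

#include <rttr/type>

#include <cstddef>
#include <map>
#include <string>

// Sketch: assign short aliases ("a", "b", ..., "z", "ba", ...) to the long
// registered API names. The script generator and the interpreter share this
// table, so it never has to be maintained by hand.
std::map<std::string, std::string> buildAliasTable()
{
    std::map<std::string, std::string> aliases;
    std::size_t count = 0;
    for (const auto &method : rttr::type::get_by_name("TimelineModel").get_methods()) {
        std::string alias;
        std::size_t n = count++;
        do {
            alias.insert(alias.begin(), char('a' + n % 26));
            n /= 26;
        } while (n > 0);
        aliases[method.get_name().to_string()] = alias;
    }
    return aliases;
}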

Generating a corpus

To work optimally, LibFuzzer needs an initial corpus, i.e. a starting set of inputs that trigger diverse behaviors.
One could write some scripts by hand, but once again that would not scale very well and would not be maintainable. Luckily, we already have a trove of small snippets that call a lot of model functions: our unit-tests. So the question becomes: how do we (automatically) convert our unit-tests into scripts with the syntax described above?

The answer is, once again, reflection. We have a singleton class Logger that keeps track of all the operations that have been requested. We then instrument our API functions so that we can log the fact that they have been called:

bool TimelineModel::requestClipsUngroup(const std::unordered_set<int> &itemIds, bool logUndo)
{
    TRACE(itemIds, logUndo);
    bool result = false;
    // ... do the actual work here and update result ...
    return result;
}

Here TRACE is a convenience macro that looks like this:

#define TRACE(...)                                          \
    LogGuard __guard;                                       \
    if (__guard.hasGuard()) {                               \
        Logger::log(this, __FUNCTION__, {__VA_ARGS__});     \
    }

Note that it passes the pointer (this), the function name (__FUNCTION__) and the arguments to the logger.
The LogGuard is a small RAII utility that prevents duplicate logging in the case of nested calls. Suppose our code looks like this:

int TimelineModel::foo(int foobaz) {
    TRACE(foobaz);
    return foobaz * 5;
}

int TimelineModel::bar(int barbaz) {
    TRACE(barbaz);
    return foo(barbaz - 2);
}

If bar is called, we want to have only one logging entry, and discard the one that would result from the inner foo call. To this end, the LogGuard prevents further logging until it is destroyed, which happens when it goes out of scope, i.e. when bar returns. Sample implementation:

class LogGuard{
public:
    LogGuard()
        : m_hasGuard(Logger::start_logging()) {}
    ~LogGuard()
    {
        if (m_hasGuard) Logger::stop_logging();
    }
    // @brief Returns true if we are the top-level caller.
    bool hasGuard() const { return m_hasGuard; }
protected:
    bool m_hasGuard = false;
};

Once we have a list of the function calls, we can generate the script by simply dumping them in a format that is consistent with what the interpreter expects.
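
A possible sketch of that dumping step, assuming each logged call is stored as a (method name, already-serialized arguments) pair and reusing the alias table from earlier (dumpScript is a made-up helper):

#include <fstream>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Sketch: write one script line per logged call, using the short alias for
// the method name followed by its arguments rendered as text.
void dumpScript(const std::vector<std::pair<std::string, std::string>> &loggedCalls,
                const std::map<std::string, std::string> &aliases,
                const std::string &path)
{
    std::ofstream out(path);
    for (const auto &call : loggedCalls) {
        out << aliases.at(call.first) << " " << call.second << "\n";
    }
}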

This kind of corpus is very useful in practice. Here is the output of LibFuzzer after a few iterations on an empty corpus:

#1944	NEW    cov: 6521 ft: 10397 corp: 46/108b lim: 4 exec/s: 60 rss: 555Mb L: 4/4 MS: 1 ChangeBit-

The important metric is “cov”, which indicates how well we cover the full source code. Note that at this point, not a single valid API call has been made.

With a corpus generated from our unit-tests, it looks like this:

#40	REDUCE cov: 13272 ft: 65474 corp: 1148/1077Kb lim: 6 exec/s: 2 rss: 1340Mb L: 1882/8652 MS: 2 CMP-EraseBytes- DE: "movit.convert"-

The coverage is more than twice as large! And at this point, a lot of valid calls are being made all the time.

Summary

In a nutshell, here are the steps we went through to be able to efficiently fuzz a complex application like Kdenlive:

  • Structure the code so that model and view are well separated
  • Generate a scripting language using reflection, to be able to query the model
  • Trace the API calls of the unit-tests to generate an initial script corpus
  • Fuzz the model through the script interface
  • Profit!

For us at Kdenlive, this approach has already proved useful in uncovering bugs that were not caught by our test-cases. See this commit for example: https://invent.kde.org/kde/kdenlive/commit/fcd1ccd6250aea6a977a0856a284a9ac1f5341ee. Note that our logger is able to emit either a script or a unit-test after an execution: this means that when we find a script that triggers a bug, we can automatically convert it back to a unit-test to be added to our test library!