Codebase Understanding

80% of auditing is codebase understanding.
How can you make someone faster at understanding a codebase?
Let's learn some learning science!

Memory Challenge

But first, a little memory challenge.
I'm going to give you 3 facts about each of 2 people.
Later, I'll ask you to recall each without looking back.

Ready?

Person 1: Vitalik Buterin. Facts:
1. After his WoW character got nerfed, he became convinced of the importance of digital ownership.
2. After high school, he couchsurfed around the world, staying with people working on interesting Bitcoin projects.
3. Upon creating Ethereum, he won the Thiel Fellowship without even pitching.

Person 2: Ralph Jones. Facts:
1. He won the Elijah Watts Sells award.
2. He dislikes bitter food.
3. He enjoys online trivia.

Reading Code

So say you're learning a codebase. And you get to some code like this:

What happens mentally when you try to understand?

To understand it, you must answer a ton of questions. Just for the very first line:

So you might have a number of strategies:

  • Read on and try to figure out what you can.

  • Look for documentation, if there is any.

  • Run the program and try to see if you can figure out what an overlay is.

  • Search the codebase

  • Etc

So say you eventually figure out the answer.

You can look back and ask: which moments contributed meaningfully to me understanding the code? Very few: just the ones where you learned the answer. For example, every moment spent reading documentation that doesn't answer the question is wasted time!

Which is to say: ALL OF THESE ARE INEFFICIENT.

The lesson here: The learning comes not from the code you read but the questions you ask.

Understanding Design

So when you learn a codebase, what are you actually learning?

The answer goes deep into the definition of a design, and how dreams become programs.

In the 1970s, Dijkstra proposed an idea: instead of writing programs line by line, we can start with a high-level description and slowly add information and detail until the program becomes what we want.

Almost no-one builds programs the way Dijkstra suggested.

But, in some ways, we all do, in our heads:

We start with high level ideas.

Make decisions.

Make plans.

Refine the plans.

Debug them.

Until we have a working program.

That sequence of steps to form the final code, together with the rationale, is the program's design.

Now here's the rub:

When you try to understand a program, what you are really doing is running that process BACKWARDS.
You look at the final code, and try to piecemeal reconstruct the higher-level intentions and assumptions that gave rise to that code.
Only then does it become an object you can truly manipulate in your head, whether to write new code or to debug it at the deepest level.

This means that effectively teaching a codebase is all about quickly imparting the understanding of these high-level concepts and how they map to the code.

Concrete to Abstract

That code I showed you earlier? It's from the map editor of Heroes of Might and Magic II, the game I've been modding for half my life. An overlay is an object you can place on the map, like a castle. The bitmasks are the colored squares below, showing which squares are obstructed vs. not, which are part of its shadow, and which one is the square where your army can enter the castle.

So when you learn the codebase, you are understanding ideas such as overlays and overlay masks, both as abstract concepts, and as their reflection in the code.

Amazing how much faster it is to understand now that I've told you this and shown you the picture.

Memory Challenge: Answers

Now back to the memory challenge:

Can you remember the three facts about Vitalik Buterin?

Don't look back at them.

.

.

.

Now try Ralph Jones!

When I've done this in person, usually someone in the room can rattle off all three facts about Vitalik.
At best, I get a couple people who slowly piece together the facts about Ralph Jones. And no-one remembers the name of his award.

That's not because people become dumber after I say the name Ralph Jones. It's because memory formation is actually kinda predictable.

Memory Formation

Each fact about Vitalik naturally forms multiple associations in your memory: It attaches to Vitalik, a familiar name. The facts also attach to each other, forming a story. Plus, they naturally give rise to images.

With Ralph Jones, I intentionally chose the facts so that none of those hold.

But if I told you some more things, like that Ralph "Jones" is actually my father Ralph Koppel, and that the Elijah Watts Sells award is given for being a national top performer on the CPA exam, you'd be more likely to remember.

Multiple cognitive theories predict that people are more likely to remember a chunk, and more quickly, when it attaches to more existing chunks. The ACT-R architecture even gives a formula.
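The post doesn't say which formula it has in mind. As a rough reference, these are the standard ACT-R declarative-memory equations as they are usually stated; the symbol names follow the common textbook presentation and are not taken from this post:

```latex
% Activation of chunk i: its base-level activation B_i, plus activation
% spreading from every chunk j currently in context (W_j: attention weight
% on j, S_{ji}: strength of the association from j to i).
A_i = B_i + \sum_{j} W_j \, S_{ji}

% Probability of successfully recalling chunk i
% (\tau: retrieval threshold, s: noise parameter).
P_i = \frac{1}{1 + e^{-(A_i - \tau)/s}}

% Time needed to recall chunk i (F: latency scale factor).
T_i = F \, e^{-A_i}
```

More associated chunks in context means more terms in the sum, a higher activation, and therefore both a higher recall probability and a shorter recall time.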

Solving the Puzzle

Every line of code is a chunk.

If you read a line, and can't attach it to other things you know, then you forget it immediately. It is structurally similar to solving a jigsaw puzzle. When you pick up a piece, if it doesn't attach to an already-solved part, then all the effort spent looking at it produced no value.

This suggests a way to make codebase-learning faster: It's all about sequencing.

With a really good course or textbook, you're always ready for the next lesson. Within a few weeks, you go from learning the basics to learning magic. No flipping back and forth needed. What if we can make reading a codebase more like that?

That's what our product, AuditLabs, does.

Say you want to understand a function that relies on understanding some data structure.
It teaches you the data structure before the function.
It chunks related fields together, making you faster at going from the lowest level ("it has these fields") to the higher-level ("it's trying to represent this idea"):

It shows you examples of the fields being used, so you build more associations and learn it better.
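How AuditLabs actually sequences material isn't described here. As a minimal sketch of the "data structure before the function" idea, assuming you already know which pieces of code depend on which, a topological sort gives a reading order in which you are always ready for the next item. The item names below are hypothetical:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def reading_order(depends_on):
    """Order items so each one comes after everything it depends on.

    depends_on maps an item to the set of items you must understand first.
    """
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical codebase: a struct, plus two functions that use it.
deps = {
    "Overlay struct": set(),
    "OverlayMask struct": {"Overlay struct"},
    "draw_overlay()": {"Overlay struct", "OverlayMask struct"},
    "place_overlay()": {"Overlay struct"},
}

order = reading_order(deps)
for step, item in enumerate(order, start=1):
    print(step, item)
```

A real tool would also need to extract the dependency graph and break cycles; this only shows the ordering step.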

It can even "read ahead" to figure out important things about the representation that would otherwise be non-obvious.

Why is `modules` a `mapping(address => address)`? It's a linked list. I was impressed when our tool spat out this explanation:

(Hallucination? In this instruction modality, it's actually not a big deal. If it's wrong, then you'll carry around an incorrect idea up until you read the code that uses it a few minutes later. But it's pretty close to the code, so it's seldom wrong.)
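If the mapping-as-linked-list idiom is unfamiliar, here is a minimal Python sketch of it, with a dict standing in for the `mapping(address => address)`. The sentinel value, class name, and method names are assumptions made for illustration; they are not taken from the contract discussed above.

```python
SENTINEL = "0x1"  # hypothetical marker used as both the head and the end of the list


class ModuleList:
    def __init__(self):
        # modules[a] is the entry that follows a.
        # In an empty list the sentinel points at itself.
        self.modules = {SENTINEL: SENTINEL}

    def enable(self, module):
        if module == SENTINEL or module in self.modules:
            raise ValueError("invalid or already enabled")
        # Insert at the head: new entry points at the old first entry.
        self.modules[module] = self.modules[SENTINEL]
        self.modules[SENTINEL] = module

    def disable(self, prev, module):
        # The caller supplies the predecessor, so removal needs no traversal.
        if module == SENTINEL or self.modules.get(prev) != module:
            raise ValueError("prev does not point at module")
        self.modules[prev] = self.modules.pop(module)

    def is_enabled(self, module):
        return module != SENTINEL and module in self.modules

    def enabled(self):
        # Walk from the sentinel until we loop back to it.
        out, cur = [], self.modules[SENTINEL]
        while cur != SENTINEL:
            out.append(cur)
            cur = self.modules[cur]
        return out
```

The point of the representation is that one mapping gives both a constant-time membership check and a way to enumerate every entry, which a plain set-like mapping cannot do on-chain.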

Early Days

We're just getting started. There's a lot more we plan to do. Custom visualizers to help you step through assembly. An interactive diagrammer to test your mental map of the code and compare it against the actual map. Things to help you remember just enough context for fix reviews.

But already this version works wonders. One of our first users said it made him 4x faster at an audit. A couple days ago, @mngyuknn1, a highly-ranked auditor who's found multiple solo vulnerabilities, told us it saved him a full day on the Solidity portion of the Story protocol.

It helps in more ways too.

Some contracts have lots of functions that just show information to the web UI but have no effect on-chain.

AuditLabs helps you avoid wasting time on these functions:
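How AuditLabs identifies these functions isn't described here. As a rough first-pass illustration of the idea, assuming you have a contract's ABI in the standard Solidity JSON format, you can separate read-only functions (`view` or `pure`) from state-changing ones. The ABI entries below are made up:

```python
READ_ONLY = {"view", "pure"}


def split_by_mutability(abi):
    """Return (read_only_names, state_changing_names) for the functions in an ABI."""
    read_only, state_changing = [], []
    for entry in abi:
        if entry.get("type") != "function":
            continue  # skip events, errors, constructors, etc.
        if entry.get("stateMutability") in READ_ONLY:
            read_only.append(entry["name"])
        else:
            state_changing.append(entry["name"])
    return read_only, state_changing


# Hypothetical ABI fragment, for illustration only.
sample_abi = [
    {"type": "function", "name": "getPositionSummary", "stateMutability": "view"},
    {"type": "function", "name": "deposit", "stateMutability": "payable"},
    {"type": "function", "name": "withdraw", "stateMutability": "nonpayable"},
    {"type": "event", "name": "Deposited"},
]

read_only, state_changing = split_by_mutability(sample_abi)
```

This is only a starting filter: a read-only function can still be called by other contracts on-chain, so deciding that one is truly UI-only takes a check of its callers.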

There's a lot of cognitive overhead going from some long formula in the code to an actual understanding of the formula; only then can you start to poke holes in it.

AuditLabs renders them in beautiful LaTeX, using the same terminology as the whitepaper:

(Lots of people have requested the ability to play around with those formulas. That's coming up :) )

AuditLabs is still an early-stage project.
It's not something you can download and use in all your private audits.
We QA every code tour it generates. We reject plenty before showing them to customers.
When one part is too long, we press a button, and it rewrites itself.

But lately we've been rejecting and editing less.

We've raised our standards, but now it impresses me regularly.

We're partnering with a select few auditors and firms who can't wait to get an edge in their audit speed and quality.
But in the meantime, for a select few contests, everyone can get the benefits of accelerated codebase learning.

Right now, that's the Story contest.

If you're auditing Story, give yourself a Christmas, Hanukkah, or New Year's gift!

And get a taste of the future of auditing, and of codebase learning and code review in general.

This blog post is adapted and condensed from a talk I've given a couple times:
