When will I be able to understand large, complex codebases like the Linux repository?
### Background
Two years post-graduation, I’ve been in a junior role with ongoing learning efforts both at work and in my free time. Recently, while browsing the Linux repository, I felt utterly lost in deciphering its complexity.
### Challenges
– Limited knowledge of C and low-level programming
– Surface-level understanding of Operating Systems
– No grasp of how a massive system like Linux is sustained and coordinated
### Frustrations
Comparing my current BI analyst role and basic side projects to the depth of Linux’s codebase leaves me feeling inadequate and uncertain about my technical growth.
### Curiosity
When and how can I progress to comprehend repositories like Linux and all its intricacies? What steps can I take to embark on this technical journey and overcome my confusion?
I think some people are honestly just gifted for it. Like yeah they obviously need to do the work to get there, but it’s like their brains can handle MASSIVE amounts of complex code with ease.
It’s kinda like how there are guys that play football their whole life trying to get to the NFL, play all the way up through D1 college and just never make it no matter how hard they try.
-Also a 2nd year dev with a CS degree working in a large code base and wondering how the seniors and principals I work with do it.
Give it time, mate. Understanding big codebases is like leveling up in a game – takes practice and patience. Keep diving in, asking questions, and experimenting. You’ll get there sooner than you think!
Nobody really does this, it is not humanly possible to understand every line of a multi-million line codebase. A well-organized codebase will have functionality separated into distinct modules and components so that you can focus on a specific area of the codebase without needing to know too much about the rest. You would typically have a specific task at hand, find the files that relate to the task, and spend some time reading and understanding that specific portion of the codebase.
Also to add: Just because you see a large amount of code in a single github repository, doesn’t mean you have to mentally think of it as one codebase. Oftentimes it makes more sense to think of it as lots of little codebases: the files in folder A are for feature X, the files in folder B are for feature Y, etc. Sometimes it’s even literally the case that the code in one folder can run completely independently of the code in some other folder.
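A rough sketch of that "find the files that relate to the task" step, in Python. The function name `find_mentions` and its parameters are made up for illustration; in practice `grep -r` or an IDE search does the same job. It only shows how a keyword search naturally splits a big repo into the few folders you actually need to read.

```python
import os
from collections import defaultdict


def find_mentions(root, keyword, extensions=(".c", ".h")):
    """Group the files under `root` that mention `keyword` by top-level folder."""
    hits = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(extensions):
                continue  # only look at source files we care about
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    text = handle.read()
            except OSError:
                continue  # unreadable file: skip it rather than abort the search
            if keyword in text:
                rel = os.path.relpath(path, root)
                # Files directly in `root` are grouped under "."
                top = rel.split(os.sep)[0] if os.sep in rel else "."
                hits[top].append(rel)
    return {folder: sorted(paths) for folder, paths in hits.items()}
```

Calling something like `find_mentions("path/to/linux", "register_filesystem")` (the path is a placeholder) would give you a short list of files per folder, which is a far smaller reading assignment than the whole repository.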
Very few people “understand” the Linux kernel. A lot of people understand *parts* of the Linux kernel. Maintainers have spent years or even decades working on their section of the kernel. You’re not going to understand large complex codebases like this overnight.
The reality is that beyond a certain level of complexity, *no one* understands the whole system. Stroustrup famously admitted to no longer knowing the entirety of the C++ specification (despite being its creator).
When you get to that point, you typically divide and conquer via delegation: different people own different parts of the system. There are of course a variety of strategies and tools to cope with complexity at an individual level. Normally you learn them gradually by taking on increasingly complex systems over the course of your career.
> I thought to myself after about 30 seconds of staring at this thing… “wow, I have absolutely no idea what any of this means”
Let’s be real, 30 seconds is not a long time, give yourself a bit more credit here.
> I certainly don’t know how a giant system like this is sustained and coordinated by thousands of its contributors.
It’s a giant system *because* it has thousands of contributors.
> Like… how and when will I get to the point of understanding a repo like Linux, and everything that’s involved with it?
How much value do you expect this to give you? This is like saying “I would like to be able to read an encyclopedia from cover to cover and understand all of it, quickly”. That sounds crazy to me.
Realistically, when you start contributing to a large project, you’ll only be working on specific parts, improving sections of the code base at a time. When you suddenly get airdropped into the middle of the ocean, knowing how the entire ocean works probably isn’t too relevant.
After you start getting more experienced with programming, you tend to recognize more patterns and learn where to look to get the specific information you need to complete a task. Everything else is unimportant, because if it’s unrelated to the task at hand and designed in an unsurprising way, it probably works as expected (and if it doesn’t, then that’s a problem for someone else, or future you).
Open up the whole project, start the find all references goose chase. It helps if you know the boot up sequence. I use Visual Assist to make reference searching faster.
You probably picked the craziest example though. I don’t know if it’s even possible to understand Linux at first glance within 30 seconds.
It’s like opening a history book at a random page then being overwhelmed at not knowing the entirety of human history!
Hopefully the code base comes with technical documentation and system design diagrams that explain the system at a high level, and hopefully documentation in the code itself explains what’s going on code-wise.
You may never get there, or it may take years.
It’s a long slog to get to the point where you can really *get* large projects; plenty of programmers never really get there.
The key is to understand the layers and then follow the references.
For instance: views – controllers – business layer – data access layer.
Now you know the “typical” flow of the data, so now you just follow the functionality and references your IDE shows. Maybe use your IDE to find specific keywords.
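Here is a toy sketch of that flow in Python, just to show what "following the references" through the layers looks like. All class and function names (`UserRepository`, `UserService`, `UserController`, `render_user`) are invented for illustration, and the in-memory dict stands in for a real database.

```python
class UserRepository:
    """Data access layer: the only place that knows how data is stored."""

    def __init__(self):
        # Stand-in for a database table.
        self._rows = {1: {"name": "ada", "active": True}}

    def get(self, user_id):
        return self._rows.get(user_id)


class UserService:
    """Business layer: the rules live here, not in the controller or view."""

    def __init__(self, repository):
        self._repository = repository

    def display_name(self, user_id):
        row = self._repository.get(user_id)
        if row is None or not row["active"]:
            return None
        return row["name"].title()


class UserController:
    """Controller: turns a request into a service call and picks a view."""

    def __init__(self, service):
        self._service = service

    def show(self, user_id):
        return render_user(self._service.display_name(user_id))


def render_user(name):
    """View: formatting only, no business rules."""
    return "User not found" if name is None else f"Hello, {name}!"


controller = UserController(UserService(UserRepository()))
print(controller.show(1))  # Hello, Ada!
```

If a bug report says the greeting is wrong, you start at `render_user`, jump to `show`, then to `display_name`, and only reach `UserRepository` if the data itself looks off. Real codebases are messier, but the navigation habit is the same.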
Read more code. Start at the beginning of the app. Then keep reading and debugging until you see the patterns.
Linus would eventually fail if you grilled him closed book on niche enough stuff in that code base. Good software is about organizing the code so you don’t need to think about the rest of it when working on a piece of it, not about holding every single thing in your head.
Basically, a really good engineer will have a *general* understanding of the whole stack all of the way down, not a specific understanding of every part. Then they use that general understanding to help them navigate to the specific part they need to be looking at.
For lower level stuff, the best resource I’m familiar with for building that kind of general understanding is nand2tetris.
Like, computers are a set of basic operations implemented as circuits, where the circuits are just structures of simple logic gates, which are made of patterns of transistors. Those run on a CPU and can interact with various peripherals, like the collections of persistent gates we call memory. Machine code is a list of those operations, or “instructions”, nothing else. Assembly provides basic hooks for organizing calls to those instructions. The next level of language, like C, provides much more convenient ways to manage assembly: a compiler turns the code into the right assembly for that platform, and the language adds basic patterns for data, like the concept of a “character”. Then a high-level language provides more powerful abstractions for organizing code, like a “class”.
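To make the bottom of that stack a bit more concrete, here is a toy sketch in the spirit of nand2tetris: every gate is built from a single NAND, and the gates are then wired into an adder. The function names are my own and this is not code from the course, just an illustration of how each layer only needs to know about the layer directly beneath it.

```python
def nand(a, b):
    """The one primitive gate; everything below is built from it."""
    return 0 if (a and b) else 1


def not_(a):
    return nand(a, a)


def and_(a, b):
    return not_(nand(a, b))


def or_(a, b):
    return nand(not_(a), not_(b))


def xor(a, b):
    return and_(or_(a, b), nand(a, b))


def full_adder(a, b, carry_in):
    """Add three bits; return (sum_bit, carry_out)."""
    partial = xor(a, b)
    total = xor(partial, carry_in)
    carry_out = or_(and_(a, b), and_(partial, carry_in))
    return total, carry_out


def add(x, y, width=8):
    """Ripple-carry addition of two unsigned integers, one bit at a time."""
    result, carry = 0, 0
    for i in range(width):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result  # any carry out of the top bit is dropped (overflow)


print(add(19, 23))  # 42
```

Nobody calling `add` has to think about NAND gates, and that is the same trick that lets people work on one corner of a kernel without holding the rest in their head.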
An operating system provides a set of basic tools for making the computer usable, like allowing it to run specific code when it starts, to have a concept of a “file” and relevant organization, to render things on a screen, or to accept the input from a keyboard. Each of those peripherals needs a driver, which is code that defines how the CPU should send and receive data from the peripheral, etc.
A well-organized code base will basically follow the same pattern as the high-level description of the way things fit together. So if you were asked to implement support in Linux for a new kind of text encoding because a Lovecraftian nightmare of an alien language reads in 2 dimensions at the same time, you would have a vague but not fully formed sense of all of the main pieces of the system that rely on the assumption that text is a one-dimensional array, and you’d be able to start working/grepping through everything that needs to change for the rest of your mortal life, because there’s only so much you can really do in a huge code base.
Stick to the scope of your ticket and learn new sections of code as you go. Eventually, you’ll have worked enough in each part of the code to know how to navigate and understand each piece.
Also practice the basics. They will get you far; don’t take them for granted.
New repo… rinse and repeat.
You have to spend time and effort with it, there’s no shortcut.
I mean.. of course you aren’t going to understand a language that you don’t know.
Maybe start with a codebase that is written in a language you actually know.
There’s a massive gap between Large + Complex and the Linux Kernel. There are probably only a handful of people in the world that are close to fully understanding it.
You never understand it completely. You learn how to search effectively and quickly catch up on how a given part works at a higher level.
The human brain simply can’t remember everything, unless you have one of those rare brain conditions that gives you a super memory.
So with time and experience you get better at making it look like you know it all, when you are actually catching up quickly by reading the code.
You are not meant to. Unless it is some sort of microservice orchestration with an architecture graph right there for you, or some medium-sized but well-segmented code base, you cannot expect yourself to just look and understand.
When you start a job, it takes days, sometimes a week or two, just to get accustomed to the one part of the codebase you will be responsible for.
Like, at my company I worked for 5 years on a single product and I do know it, but it is because I got lucky and got to work start to maturity on a single product and because I got to do it for 5 years.
Like, maybe after 6-10 years you will have seen enough different projects to gain intuition about their architecture and micro and macro structure, but most people really don’t. And sometimes, even when you have solution architects, they know the overall ins and outs but not the details, because for big enough projects it is just not efficient to have one person know it all when they can get better at the aspect they are actually responsible for.
On a previous project, even after being with it for more than 5 years, I still didn’t understand the whole thing. It was old legacy code, by the way. A mishmash of C/C++/Java.
It’s all about abstractions. I have been working with the same codebase for 2 years now, and there are things I have no idea about basically (because I have never touched the code).
You can look at the code base in different ways: the functionality it provides, the architecture and patterns it has accumulated over time, and not to mention what people thought the codebase would do over time.
I had the (naive) thought as a junior that things were well thought out and defined, but you will see that a large code base, or a code base over time with different people, will have lots of weird patterns and setups due to …
– the intention of the code base might have changed: sometimes they think they can sell some of the code as modules, so all the code is “modular”, but implemented badly because it never panned out
– older code styles and versions, e.g. Objective-C programmers doing things in a not-so-Swifty way, and Swift 2.0 APIs and style in lots of code
But after a while with a codebase, and with programming overall, you’ll see the abstractions behind the functionality, and that patterns are mixed together in some places.
Codebases and teams should have at least some *guided documentation* about what is going on and why. You don’t need (or shouldn’t need) the whole history of the codebase, just what the codebase is for, how it does things, and why it does them that way. That doesn’t always happen, though, so sometimes you don’t get a feel for it until you actually run the code and do stuff yourself. That’s a slow process, unfortunately, but it gets better with experience.
Hang in there, and don’t demand too much of yourself. You’ll learn that you can’t and don’t need to know everything there is in programming, or even in the codebase you work on *on a regular basis*. Learn as needed; that’s the most important skill.
You have to spend a lot of time learning one piece. You naturally start learning the pieces next to it, then make a jump to a whole other area and learn that really well. After doing this, you’ll start to understand how it all works together.
You don’t. It’s a function of time spent in said codebase: reading, re-reading, working within it.
Same reason why it’s hard to build a big platform or re-roll a game engine or an ecommerce site solo now.
It just reached a level of complexity that is untenable for a single person.
And also, kinda unrelated, don’t fall into the “genius/ego” trap of thinking software is a solo game. It usually isn’t over any multi-year timespan.