Defining the GNU/Linux distribution

If you are here, we can safely assume that you already know what a GNU/Linux software distribution is, but for completeness' sake, let's define it so that we all have the same context.

A GNU/Linux distribution is a collection of system and application software, packaged together by the distribution's developers so that it can be distributed as a nicely integrated bundle, ready to be used by users and developers alike. The software typically included in such a distribution ranges from a compiler toolchain and the C library to filesystem utilities and text editors.

As you can imagine from the existence of so many different GNU/Linux distributions, there are multiple ways in which you could combine all these different applications and their respective configurations, not to mention that you could include even more specialised software depending on the target audience of the distribution (such as multimedia software for a distribution like Ubuntu Studio, or penetration testing tools for a distribution such as Kali Linux).

The “f” word

But even with such a great number of different software collections and their respective configurations, there still may not be one that appeals to your specific needs. That's OK though, as you can still customize each and every distribution to your liking. Extensive customization creates a point of differentiation, and with it a potential forking point.

Forking is a term that has been known to carry negative connotations. As Wikipedia puts it,

the term often implies not merely a development branch, but a split in the developer community, a form of schism.

Historically, it has also been used as leverage to coerce a project's developers into merging code into their master branches that they didn't originally want to, or otherwise into making a decision that they wouldn't have made if not under the pressure of a "fork". But why is that so?

You see, traditionally, forking a project meant a couple of things. For starters, there were now two nearly identical projects competing in the same solution space. The two projects had different development hours, features, and bug fixes going into them, and eventually one of them ended up obsolete. Apart from that, forking also created an atmosphere of intense competition between the two projects.

However, in 2014, with the advent of distributed version control systems such as git and mercurial, and of social coding websites such as GitHub and Bitbucket, the term is finally taking on a more relaxed meaning: just another code repository, one that may or may not enjoy major (or even minor, for that matter) development.

Forking a GNU/Linux distribution

So, up until now we have discussed what a GNU/Linux distribution is and what a fork is. However, we haven't yet discussed what it means to fork a GNU/Linux distribution.

You see, what differentiates each distro from the others, apart from the software collection it contains, is the way in which it provides (and deploys) that software. Yes, we are talking about software packages and their respective package managers. Distributions from the Debian (.deb) family use dpkg along with apt, synaptic, aptitude, or some other higher-level tool. RPM-based (.rpm) distributions may use rpm with yum, dnf, zypper, or another higher-level tool. Distributions not based on the aforementioned may choose to roll their own combination of package format and package manager: Arch Linux uses its own pacman, Sabayon uses its Entropy package manager, and so on.

Now, naturally, if you want to customize an application to your liking, there are many ways you could do that. One of them is downloading the tarball from the upstream website or FTP server, running ./configure, and then make and make install. But if you start customizing lots of applications this way, it becomes tedious and unwieldy all too soon. After all, what exactly did that make install install? Will the next update replace those files? What were your configuration options? Did your build overwrite files the package manager had installed?

In this case, it really pays off to learn how to package software for your distribution of choice. That means learning the package format your distribution's package manager accepts, as well as how to produce such packages yourself. This way, instead of the ./configure && make && make install cycle, you end up with proper software packages that you can control more tightly, update more easily, and even distribute to your friends if you so desire. As an added bonus, the package manager now knows about those files too, so you can install, remove, or update them much more easily. What's not to like?
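
As a quick illustration of that last point, here is roughly how you would ask the two big package manager families about installed files (the package and file names below are just examples):

dpkg -S /bin/ls          # Debian family: find the package that owns a file
dpkg -L coreutils        # Debian family: list the files a package installed

rpm -qf /bin/ls          # RPM family: find the package that owns a file
rpm -ql coreutils        # RPM family: list the files a package installed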

After you have created some custom packages, you may also wish to create a repository to contain them and update straight from it. Congratulations, you have created your custom distribution, and a potential fork. While you are at it, if you really want to fork the distribution, you could just as well take the distribution's base packages, customize them, rebuild them, and then distribute them. Congratulations again: now you have a true GNU/Linux distribution fork.

That seems easy. More specifically?

Yes, of course. Let's take a look at how you might go about forking some well-known GNU/Linux distributions.

Debian

In Debian, the usual procedure if you wish to customize a package is the following:

  • First, make sure you have the essential build software installed: apt-get install build-essential devscripts debhelper
  • Then download the package's build dependencies: apt-get build-dep $package_name
  • Now it's time to download its sources, via apt-get source $package_name
  • Proceed with customizing it to your liking (update it, patch the sources, etc.)
  • Finally, rebuild it: debuild -us -uc

Assuming all went fine, you should now have a $package_name.deb file in the parent directory, ready to be installed with dpkg -i $package_name.deb.
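
To make this concrete, here is roughly what the whole session looks like, using "hello" as a stand-in package name (the exact version and file names will differ on your system):

sudo apt-get install build-essential devscripts debhelper
sudo apt-get build-dep hello
apt-get source hello
cd hello-*/
# ...patch the sources or tweak debian/rules to your liking...
dch -l +custom "Rebuilt with my local changes"   # optional: bump the version (dch comes with devscripts)
debuild -us -uc
sudo dpkg -i ../hello_*.deb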

Please note that the above is by no means an exhaustive treatment of Debian packaging. If you want to build custom Debian packages, the official Debian packaging documentation is the place to go.

Now that you have your custom packages, it's time to build a repository to contain them. There are many tools you can use to do that, including dak, the official Debian package archiving tool, but if you want a personal repository without too much hassle, you are better off with reprepro. I won't go into full detail on that here; there are very good guides online if you want to set one up.
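
Very roughly, a minimal reprepro setup looks like the following (the codename and paths are just examples):

mkdir -p ~/myrepo/conf
cat > ~/myrepo/conf/distributions << EOF
Codename: wheezy
Components: main
Architectures: amd64 source
EOF

# add a package, then serve ~/myrepo over HTTP or FTP
reprepro -b ~/myrepo includedeb wheezy ../hello_*.deb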

Fedora

Building packages for Fedora is a procedure similar to the Debian one. Fedora, however, is more convenient in one respect: it lets you download a DVD image with all the sources in source RPM form, ready for you to customize and rebuild to your taste.

Apart from that, the usual procedure is the following:

  • Download the SRPM (source RPM) by any means. You could do that with the yumdownloader utility, like so: yumdownloader --source $package_name. To use yumdownloader, you need to have yum-utils installed.
  • After you have downloaded the SRPM, unpack it: rpm -i $package_name.src.rpm
  • Next up, customize the package to your liking (patch the sources, etc.)
  • Finally, cd into the SPECS folder and run rpmbuild -ba $package_name.spec
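
Putting those steps together, and again using "hello" as a stand-in package name, the cycle looks roughly like this:

sudo yum install yum-utils rpm-build
yumdownloader --source hello
rpm -i hello-*.src.rpm                          # unpacks into ~/rpmbuild/{SOURCES,SPECS}
sudo yum-builddep ~/rpmbuild/SPECS/hello.spec   # pull in the build dependencies
cd ~/rpmbuild/SPECS
# ...edit hello.spec, drop any extra patches into ../SOURCES...
rpmbuild -ba hello.spec                         # results end up under ~/rpmbuild/RPMS and SRPMS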

Again, the above steps may not be 100% complete. If you want to go down this route, the Fedora packaging documentation has the details.

Next up is the repository creation step.

  • To create a yum repository, you first need to yum install createrepo. After that, create a directory tree to use as the repository, like so: mkdir -p /var/ftp/repo/Fedora/19/{SRPMS,i386,x86_64}.
  • After that, move your i386 packages to /var/ftp/repo/Fedora/19/i386 and the rest of the packages to their respective folders, then run createrepo on each directory.
  • The next step is adding a configuration file to /etc/yum.repos.d/ that describes your repository to yum.
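
A rough sketch of those three steps (the hostname and paths are just examples):

sudo yum install createrepo
mkdir -p /var/ftp/repo/Fedora/19/{SRPMS,i386,x86_64}
mv ~/rpmbuild/RPMS/x86_64/*.rpm /var/ftp/repo/Fedora/19/x86_64/
createrepo /var/ftp/repo/Fedora/19/x86_64     # generates the repodata/ metadata

# /etc/yum.repos.d/custom.repo
[custom]
name=My custom packages
baseurl=ftp://myserver.example.com/repo/Fedora/19/$basearch
enabled=1
gpgcheck=0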

Again, this is not definitive; the Fedora documentation covers repository creation in more detail.

Arch Linux

Arch Linux, at least in comparison to the .deb and .rpm distribution families, is very easy to customize to your liking. That's to be expected, though, as Arch Linux is a distribution that sells itself on the customization capabilities it offers to its users.

In essence, if you want to customize a package, the general process is this:

  • Download the Arch tarball that contains the PKGBUILD file
  • Untar the tarball
  • (Optional) download the upstream tarball referenced in the PKGBUILD and modify it to your liking
  • Run makepkg in the folder containing the PKGBUILD
  • Install (using pacman -U) the .pkg.tar.xz file produced once the whole process is finished.
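
In command form, and with a made-up package name, the whole thing looks roughly like this:

tar xf somepackage.tar.gz             # the tarball containing the PKGBUILD
cd somepackage
# ...edit the PKGBUILD (and, optionally, the sources it references)...
makepkg -s                            # -s pulls any missing build dependencies via pacman
sudo pacman -U somepackage-1.0-1-x86_64.pkg.tar.xz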

In order to download the build files for the official {core | extra | community} packages, you need to run abs as root. This creates a directory tree under /var/abs that contains the files required for building any package in the official repositories.

Next up, you can create a custom local repository with the repo-add tool, and then edit /etc/pacman.conf to add an entry for your repository there. The Arch Wiki covers both steps in detail.
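
A minimal sketch, assuming your packages live under /srv/customrepo:

repo-add /srv/customrepo/custom.db.tar.gz /srv/customrepo/*.pkg.tar.xz

# then append an entry like this to /etc/pacman.conf
[custom]
SigLevel = Optional TrustAll
Server = file:///srv/customrepo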

To fork or not to fork?

Well, that's not an easy question to answer. My opinion is that it's extremely educational to do a soft fork: clone the distribution's core repositories and, for some time, maintain your own distribution based on them, that is, update and customize all the packages yourself. Do that for a few months, then go back to using your distribution of choice, now that you are enlightened about how it works under the hood. The reason this is so educational is that it teaches you the ins and outs of your distribution: all the software in it, how it integrates, and what each piece's role is. It teaches you packaging, a tremendously undervalued skill that lets you tailor your experience to your liking, and it makes you appreciate the effort that goes into maintaining the distribution.

As for doing a hard fork, that is, creating your own distribution that you commit to maintaining for a long time, my opinion is that it's simply not worth it. Maintaining a distribution, whether by yourself or with friends, is a tremendous amount of work that isn't worth it unless you have other goals you want to achieve by doing it. If all you want is to customize your distribution of choice to your liking, then go ahead: learn packaging for it, package the customized applications you want, and create your own repo - but always track the upstream. Diverging too much from the upstream is not worth the hassle, as you will end up spending more time maintaining the distribution than using it.

tl;dr:

If you want to do a small-scale, private fork in order to see what's under the hood of your Linux distro, by all means go ahead.

If you want to do a large-scale, public fork, then take your time to calculate the effort, decide whether it's worth it, and consider whether you could simply help the upstream distribution implement the features you want.

In the last part of this series, we talked about the compiler's composition, including the assembler and the linker. We showed what happens when the compiler runs, and what the output of translation software such as cc1 or as looks like. In this final part of the series, we are going to talk about the C library, how our programs interface with it, and how it interfaces with the kernel.

The C Standard Library

The C Standard Library is pretty much a part of every UNIX-like operating system. It's basically a collection of code, including functions, macros, type definitions, etc., that provides facilities such as string handling (string.h), mathematical computations (math.h), input and output (stdio.h), and so on.

GNU/Linux operating systems generally use the GNU C Library implementation (glibc), but it's common to find other C libraries in use (especially in embedded systems), such as uClibc, newlib, or, in the case of Android/Linux systems, Bionic. BSD-style operating systems usually have their own implementation of a C library.

So, how does one “use” the C Standard Library?

So, now that we are acquainted with the C library, how do you make use of it, you ask? The answer is: automagically :). Hold on right there; that's not much of an exaggeration. You see, when you write a basic C program, you usually #include <some_header.h> and then go on to use the code declared in that header. We explained in the previous part of this series that when we use a function, say printf(), it's really the linker that does the hard work and allows us to use it, by linking our program against libc's shared object (.so). So in essence, when you need to use the C Standard Library, you just #include the headers that belong to it, and the linker resolves the references to the code they declare.
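
As a tiny illustration, nothing in the program below says where printf() or strlen() actually live; the headers only declare them, and the linker resolves the calls against libc's shared object:

#include <stdio.h>    /* declares printf() */
#include <string.h>   /* declares strlen() */

int main(void)
{
    const char *msg = "hello, libc";
    /* The definitions of strlen() and printf() are not in this file;
       the linker resolves them against libc.so when the program is linked. */
    printf("%zu\n", strlen(msg));
    return 0;
}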

Apart from the functions defined in the standards, however, a C library might also implement further functionality. For example, the standards say nothing about networking. As a matter of fact, most libraries today implement not only what's in the C standards, but also choose to comply with the requirements of the POSIX C library, which is a superset of the C standard library.

Ok, and how does the C Library manage to provide these services?

The answer to this question is simple: some of the services the library provides require no special privileges at all, being normal userspace C code, while for others the library needs to ask the operating system's kernel to provide the facility on its behalf.

How does it do so? By calling special entry points exported by the kernel for exactly this purpose, named system calls. System calls are the fundamental interface between a userspace application and the operating system kernel. For example, consider this:

You might have a program that, at some point, contains code like fd = open("log.txt", O_RDWR | O_CREAT, 0644);. That open function is provided by the C library, but the C library itself cannot carry out everything required to open a file, so it issues the open system call (sys_open on Linux), asking the kernel to do what's required to open the file. In this case we say that the library's open acts as a wrapper function around the system call.
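
Put into a complete (if contrived) program, the idea looks like this; every call below is just a thin C library wrapper around the corresponding system call:

#include <fcntl.h>     /* open() and the O_* flags */
#include <unistd.h>    /* write(), close() */

int main(void)
{
    /* open() here is the C library wrapper; under the hood it performs
       the open system call and hands back whatever the kernel returns. */
    int fd = open("log.txt", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return 1;
    write(fd, "hello\n", 6);   /* write() wraps the write system call */
    close(fd);                 /* and close() wraps the close system call */
    return 0;
}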

Epilogue

In this final part of our series, we saw how our applications interface with the C Standard Library available on our system, and how the library itself interfaces with the operating system kernel to provide the services that userspace applications need.

Further Reading:

If you want to take a look at the system call interface of the Linux operating system, you can always read the man page for the Linux system calls (man 2 syscalls).

xv6: An introduction

If you are, like me, a low-level PC programmer, it's hard not to have heard of xv6. xv6, for those who haven't heard of it, is a UNIX Version 6 clone, designed at MIT to help teach operating systems.

The reasoning behind it was fairly simple. Up until that point, MIT had used John Lions' famous commentary on the Sixth Edition of UNIX. But V6 was challenging for a number of reasons. To begin with, it was written in a near-ancient version of C (pre-K&R), and apart from that, it contained PDP-11 assembly (a legendary machine for us UNIX lovers, but ancient nonetheless), which didn't really help the students, who had to study both the PDP-11 and the (more common) x86 architecture on which they developed another (exokernel) operating system.

So, to make things much simpler, the professors there decided to go with a clone of UNIX Version 6 that was x86-specific, written in ANSI C, and supported multiprocessor machines.

For a student (or a programmer interested in operating systems), xv6 is a unique opportunity to get an introduction to kernel hacking and to the architecture of UNIX-like systems. At about 15k lines of code (if I recall correctly), including the (primitive) libraries, the userland, and the kernel, it's very easy to grok (or, well, at least easier than production-scale UNIX-like systems), and it's also very easy to extend. It also helps tremendously that xv6 as a whole has magnificent documentation, not only from MIT, but from other universities that have adopted xv6 in their operating systems syllabus.

An introduction to Ensidia: my very personal xv6 fork

When I first discovered xv6 I was ecstatic. For the reasons mentioned above, I couldn't pass up the opportunity to fork xv6 and use it as a personal testbed for anything I felt like exploring or trying out.

As a matter of fact, when I first discovered xv6, I had just finished implementing (the base of) my own UNIX-like operating system, named fotix, so the timing of my discovery was great. xv6 had done what I had done, and had also implemented most of what I was planning to work on in fotix (for example, ELF file loading), and it was a solid base for further development. It also had a userland, which fotix at the time didn't have.

After I forked xv6, I spent some time familiarizing myself with the code. I also cleaned up the source quite a bit, organizing it into a BSD-like directory structure instead of having all of the code in the same folder, and made various small-scale changes.

After that, I left Ensidia alone for quite some time and didn't touch it much. However, I always felt I wanted to develop it a bit more and play with its code in interesting ways. I was trying to think of a good, simple way to get started with kernel hacking on it and become more acquainted with the kernel, and I found a PDF with interesting project ideas for it. One of them was to add a system call. I figured that would be an interesting and quick hack, so hey, why not?

Getting started with kernel hacking on xv6: adding the system call

The system call I decided to introduce was the suggested one, and it sounded fairly simple too: introduce a new system call that returns the total number of system calls that have taken place so far. So let's see how I went about implementing it:

An introduction to system calls in xv6

First of all, we should provide some context about what system calls are, how they are used, and how they are implemented in xv6.

A system call is a function that a userspace application uses to ask the operating system to provide a specific service. For instance, with the sbrk(n) system call, a process can ask the kernel to grow its heap space by n bytes. Another example is the well-known fork() system call in the UNIX world, which is used to create a new process by cloning the calling process.

The way an application signals the kernel that it needs such a service is by issuing a software interrupt. An interrupt is a signal that notifies the processor that it needs to stop what it's currently doing and handle the interrupt. The same mechanism is used to notify the processor that information it was waiting for from the disks is in some buffer, ready to be extracted and processed, or that a key was pressed on the keyboard; those are hardware interrupts.

Before the processor stops to handle the generated interrupt, it needs to save its current state, so that it can resume execution in that context after the interrupt has been handled.

The code that calls a system call in xv6 looks like this:

# exec(init, argv)
 .globl start
 start:
   pushl $argv
   pushl $init
   pushl $0  // where caller pc would be
   movl $SYS_exec, %eax
   int $T_SYSCALL

In essence, it pushes the arguments of the call onto the stack and puts the system call number (in the code above, $SYS_exec) into %eax. That number is used to find the matching entry in an array that holds pointers to all the system calls. After that, it generates a software interrupt with a code (in this case $T_SYSCALL) that's used to index the interrupt descriptor table and find the appropriate interrupt handler.

The code responsible for finding the appropriate interrupt handler is trap(), found in trap.c. If trap() finds that the trap number in the generated trapframe (a structure that represents the processor's state at the time the trap happened) equals T_SYSCALL, it calls syscall() (the system call handler), which lives in syscall.c:

// This is the part of trap that
// calls syscall()
void
trap(struct trapframe *tf)
{
  if(tf->trapno == T_SYSCALL){
    if(proc->killed)
      exit();
    proc->tf = tf;
    syscall();
    if(proc->killed)
      exit();
    return;
  }

syscall(), finally, is the function that reads %eax to get the number of the system call (used to index the array of system call pointers) and executes the code corresponding to that call.

The implementations of the system calls themselves live in two files: sysproc.c, which contains the system calls related to processes, and sysfile.c, which contains the system calls related to the file system.

The specific implementation of the numcalls() system call

Implementing the system call itself is simple. I did it with a global variable in syscall.c called syscallnum, which is incremented every time syscall() dispatches to a system call function, that is, every time the system call is valid.

Next, we just need the system call implementation itself: a function that returns that number to the userspace program asking for it. Below are the function and syscall() after our change.

// return the number of system calls that have taken place in
// the system
int
sys_numcalls(void)
{
    return syscallnum;
}
// The syscall() implementation after
// our change
void
syscall(void)
{
  int num;

  num = proc->tf->eax;
  if(num > 0 && num < NELEM(syscalls) && syscalls[num]) {
    syscallnum++; // increment the syscall counter
    proc->tf->eax = syscalls[num]();
  } else {
    cprintf("%d %s: unknown sys call %d\n",
            proc->pid, proc->name, num);
    proc->tf->eax = -1;
  }
}

After that was done, the next few things that needed to happen were fairly straightforward. We had to add an index number for the new system call in syscall.h, expose it to user processes via user.h, add a new macro invocation to usys.S that defines the assembly stub which performs the call, and change the Makefile to include our changes. After doing so, we had to write a userspace test program to exercise them.
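
For reference, the userspace test program is only a few lines long; mine looks roughly like this (it assumes numcalls() has been declared in user.h and hooked up in usys.S as described above):

#include "types.h"
#include "stat.h"
#include "user.h"

int
main(void)
{
  // numcalls() traps into the kernel through the stub generated in usys.S
  printf(1, "The total number of syscalls so far is %d\n", numcalls());
  exit();
}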

The result after doing all this is below :)

cpu1: starting
cpu0: starting
init: starting sh
$ ls
.              1 1 512
..             1 1 512
README         2 2 2209
cat            2 3 9725
echo           2 4 9254
forktest       2 5 5986
grep           2 6 10873
init           2 7 9579
kill           2 8 9246
ln             2 9 9240
ls             2 10 10832
mkdir          2 11 9315
rm             2 12 9308
sh             2 13 16600
stressfs       2 14 9790
usertests      2 15 37633
wc             2 16 10207
zombie         2 17 9028
syscallnum     2 18 9144
console        3 19 0
$ syscallnum
The total number of syscalls so far is 643
$ syscallnum
The total number of syscalls so far is 705
$ syscallnum
The total number of syscalls so far is 767
$ syscallnum
The total number of syscalls so far is 829

Epilogue

I usually end my blog posts with an epilogue. Although this is a post that doesn't necessarily need one, I wanted to write one just to tell you that you should try kernel hacking, that is, programming jargon for working on an operating system kernel, because it's an experience that will undoubtedly teach you a great deal about how your computer actually works.

Last but not least, take a look at the ongoing work on Ensidia, my fork of xv6. To see this particular work, take a look at the syscall branch.

In the previous part of this little series, we talked about the compiler and what it does with header files, in our attempt to demystify their usage. In this part, I want to show you what the compiler's output looks like, and how our executable file gets created.

The compiler’s composition

Generally speaking, a compiler belongs to a family of software called translators. A translator’s job is to read some source code in a source language, and generate (translate it to) some source code in a target language.

Now, you might think that most compilers you know don't do that: you feed in a (source code) file and you get a binary file, ready to run whenever you want. Yes, that's what happens, but it's not the compiler alone that does all of it. If you remember from the last installment of this series, when you invoke the compiler as gcc some_file.c or clang some_file.c, you are in essence calling the compilation driver with the file as a parameter. The compilation driver then calls 1) the preprocessor, 2) the (actual) compiler, 3) the assembler, and last but not least, the linker. At least in gcc's case, these pieces of software are called cpp, cc1, gas (executable name as), and collect2 (a wrapper that in turn runs the real linker, ld), respectively.

From that little software collection, which we collectively call "the compiler", we can spot at least three (yeah, that's right) translators that act as described earlier, that is, they take input in a source language and produce output in a target language.

The first is the preprocessor. The preprocessor accepts C source code as its source language and produces C source code as its target language, but with various elements of the input resolved, such as header file inclusion, macro expansion, and so on.

The second is the compiler proper. It accepts (in our case) C source code as its source language and translates it into some architecture's assembly language. When I talk about the compiler here, I'm going to assume that it produces x86 assembly.

The last one is the assembler, which accepts some machine architecture's assembly language as input and produces what's called the binary, or object, representation of it; that is, it translates the assembly mnemonics directly into the bytes they correspond to on the target architecture.

At this point, one could also argue that the linker is a translator, accepting binary and translating it into an executable file, that is, resolving references and fitting the binary code into the segments of the file to be produced. For example, on a typical GNU/Linux system, this phase produces the executable ELF file.

The (actual) compiler’s output: x86 assembly.

Before we go any further, I would like to show you what the compiler really creates:

For the typical hello world program we demonstrated in our first installment, the compiler will output the following assembly code:

.file	"hello.c"
	.section	.rodata
.LC0:
	.string	"Hello world!"
	.text
	.globl	main
	.type	main, @function
main:
	pushq	%rbp
	movq	%rsp, %rbp
	subq	$16, %rsp
	movl	%edi, -4(%rbp)
	movq	%rsi, -16(%rbp)
	movl	$.LC0, %edi
	call	puts
	movl	$0, %eax
	leave
	ret
	.size	main, .-main
	.ident	"GCC: (GNU) 4.8.2 20131212 (Red Hat 4.8.2-7)"
	.section	.note.GNU-stack,"",@progbits

To produce the above file, we used the following gcc invocation: gcc -S -fno-asynchronous-unwind-tables -o hello.S hello.c. We used -fno-asynchronous-unwind-tables to suppress the .cfi directives, which tell gas (the GNU assembler) to emit DWARF Call Frame Information tags, used to reconstruct a stack backtrace when a frame pointer is missing.

For more useful compilation flags that control the intermediate stages of compilation, try these:

  • -E: stop after preprocessing, and produce a *.i file
  • -S: the one we used; stop after the compiler proper, and produce a *.s file
  • -c: stop after the assembler, and produce a *.o file.

The default behaviour is to use none, and stop after the linker has run. If you want to run a full compilation and keep all the intermediate files, use the -save-temps flag.
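
For instance, you can drive the whole pipeline by hand, one stage at a time:

gcc -E hello.c -o hello.i    # preprocess only
gcc -S hello.i -o hello.s    # compile the preprocessed source to assembly
gcc -c hello.s -o hello.o    # assemble into an object file
gcc hello.o -o hello         # link into the final executable
./hello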

From source to binary: the assembler.

The next part of the compilation process is the assembler. We have already discussed what the assembler does, so here we are going to see it in practice. If you have followed along so far, you should have two files: hello.c, the hello world C source file, and hello.S, which we created earlier, the compiler's (x86) assembly output.

The assembler operates on that last file, as you can imagine. To see it run and emit binary, we invoke it like this: as -o hello.bin hello.S, which produces this:

ELF\00\00\00\00\00\00\00\00\00\00>\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\F0\00\00\00\00\00\00\00\00\00\00\00@\00\00\00\00\00@\00\00\00UH\89\E5H\83\EC\89}\FCH\89u\F0\BF\00\00\00\00\E8\00\00\00\00\B8\00\00\00\00\C9\C3Hello world!\00\00GCC: (GNU) 4.8.2 20131212 (Red Hat 4.8.2-7)\00\00.symtab\00.strtab\00.shstrtab\00.rela.text\00.data\00.bss\00.rodata\00.comment\00.note.GNU-stack\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00 \00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00@\00\00\00\00\00\00\00 \00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\B8\00\00\00\00\00\000\00\00\00\00\00\00\00	\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00&\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00`\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00,\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00`\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\001\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00`\00\00\00\00\00\00\00
\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\009\00\00\00\00\00\000\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00m\00\00\00\00\00\00\00-\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00B\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\9A\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\9A\00\00\00\00\00\00\00R\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\B0\00\00\00\00\00\00\F0\00\00\00\00\00\00\00
\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00	\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\A0\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\F1\FF\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00	\00\00\00\00\00\00\00\00\00\00\00\00\00 \00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00hello.c\00main\00puts\00\00\00\00\00\00\00\00\00\00\00\00\00
\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00	\00\00\00\FC\FF\FF\FF\FF\FF\FF\FF
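
The raw bytes are not terribly readable, of course. If you want to poke at the object file in a friendlier way, binutils has you covered:

objdump -d hello.bin    # disassemble the machine code back into mnemonics
readelf -h hello.bin    # dump the ELF header of the object file
nm hello.bin            # list the symbols it defines and references (note the undefined puts)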

Last but not least: the linker

We saw what the assembler emits, which is to say, binary code. However, that binary code still needs further processing. To explain that, we need to go back a little.

In the first installment of the series, we said that when you call a function like printf(), the compiler only needs its prototype to do type checking and ensure that you use it legally. For that you include the header file stdio.h. But since that contains only the function prototype, where is the code for the function itself? Surely it must be somewhere, since it executes successfully, yet we haven't come across printf's code so far, so where is it?

The function's compiled code is located in the shared object (.so) of the standard C library, which on my system (Fedora 19, x86_64) is libc-2.17.so. I don't want to expand on that further, as I plan to do so in the next installment of the series; however, what we have said so far is enough to understand the linker's job:

The linker resolves the so-far-undefined reference to printf by finding the printf symbol and (in layman's terms) making a pointer point to it, so that execution can jump to printf's code whenever our program needs to call it.

To invoke the linker on our file, at least according to its documentation, we should do the following: ld -o hello.out /lib/crt0.o hello.bin -lc. Then we should be able to run the file like this: ./hello.out.
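
On a modern glibc system that exact crt0.o path probably doesn't exist any more (the startup code has been split into crt1.o, crti.o and friends), so if the manual ld invocation fights back, it's easier to let the compilation driver figure out the right startup files and library paths and hand everything to ld for us:

gcc -o hello.out hello.bin    # gcc passes files it doesn't recognize straight to the linker
./hello.out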

Epilogue

That's the end of part 2 of my series explaining how your code turns into binary and how your computer (at least on the software side) runs it. In part 3, I am going to discuss the C library and the kernel at greater length.

For the past two to three days, I have been busy creating my very own Linux distribution using the well-known Linux From Scratch. This post is an account of my experience with the process: what I liked, what I learned, what surprised me, and more.

Linux from Scratch: An introduction

If you are here, then you most likely already know what Linux From Scratch is, but for the sake of completeness (or in case you don't know what it is, but are keen on learning) I will provide an introduction here.

Linux From Scratch (from now on, lfs) is a book providing a series of steps that guide you through the creation of a fully functional GNU/Linux distribution. Although the original book creates a "barebones" distribution with only fundamental tools in it, the resulting system provides a fine environment for further experimentation or customization.

Apart from the basic book, the lfs project also has three or four more books to read if you want to extend the basic system (such as blfs, Beyond Linux From Scratch), automate the process, create a distribution that is more secure, or cross-compile an lfs system for different machines.

My experience with building LFS

A small introduction about my background

I have been a (full-time) user of UNIX-like systems for about 2.5 years now. In that time I have gone from being what you would call a Linux newbie, not knowing how to use a system without a GUI installed (have I mentioned that Ubuntu was my favourite distribution?), to being an arguably experienced UNIX programmer, trying to learn more about the various aspects of UNIX systems and delving deeper into them every day (while also feeling pain whenever I have to use something other than a UNIX-like system).

During that time, I learned about the Unix way of working with the system, using the shell and the system's toolchain to write software and otherwise manipulate the system. I ditched my old knowledge of IDEs and GUIs and set out to master the command line and the associated tools. (Anecdote: I remember, when I first came to Unix from Windows, searching the net for a C/C++ IDE to do development in.) I remember reading about how people worked differently in Unix land, using an editor and the shell, and deciding to force myself to learn to work that way. I still remember trying out vim and gcc, and ending up liking this way better because it felt like a more natural way to interact with the software development process than using an IDE and pressing the equivalent of a "play" button, so that magic ensues for the next few seconds until I have a result.

Time has passed since then, and through hours and hours of reading and working with the system, I have learned quite a lot about it. My Google Summer of Code experience in 2013 expanded my knowledge of the system even further (that's what you get when you have to work with the system's kernel, the C library, and a compiler).

But in all that time of using UNIX-like systems, I never had the chance to create one myself. And although my background allowed me to know quite a few things about the inner workings of such a system, I never actually saw all these pieces of software come together in front of my very eyes to create that beauty we know as a GNU/Linux distribution. That left a bad taste in my mouth, because I knew what was happening, but I wanted to see it happen right in front of me.

Knowing about the existence of lfs and not actually going through it made matters worse, as I knew I could "patch" that gap in my knowledge, but I never really tried. I felt I was missing out on a lot, and that lfs would be instrumental to my understanding of a Linux system. Having attempted it some years ago and getting stuck at the very beginning had also created a nagging fear in me that it was something beyond my abilities.

Until two days ago, when I said to myself: "You know what? I have seen and done a lot of things on a UNIX system. I am much more experienced than I was when I last tried this. And I know I want to at least try it, even if it gives me nothing but infinite confusion. Because if I do manage to get through it, I will learn so many more things, or at least be reassured that my existing knowledge was correct." That thought was the greatest motivation I'd had in a fairly long time.

So, I sat at my desk, grabbed a cup of coffee and off I went!

The process

Preparation and the temporary toolchain

The book itself is several chapters long, each of which performs another "big step" in the creation of the distribution.

The first few chapters are preparatory: you ensure the integrity of the build environment, download any build dependencies you may be lacking, create a new partition that will host the lfs system, and create the user account that will build the temporary toolchain.

Building the temporary toolchain is a more exciting process. In essence, you compile and collect several pieces of software that will later be used to compile the distribution's own toolchain and the rest of its software.

You start off by building binutils, to get a working assembler and linker. With those in place, you proceed to compile gcc. Next comes unpacking the Linux headers, so that glibc can be compiled (and linked) against them.

Having compiled the basic parts of the toolchain, you then proceed to install the other software needed in the temporary toolchain, like gawk, file, patch, perl, and so on.
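
Most of those packages follow the same build pattern, something along these lines; the exact versions and configure flags come from the book and change between releases, so treat this only as the general shape:

tar xf binutils-2.24.tar.bz2          # example version
cd binutils-2.24
mkdir ../binutils-build && cd ../binutils-build
../binutils-2.24/configure --prefix=/tools --disable-nls --disable-werror
make
make install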

Building the main system

After getting done with the temporary toolchain, you chroot into the lfs partition. You start off by creating the needed directories (like /bin, /boot, /etc, /home, etc.) and then continue with building the distribution's software using the temporary toolchain. For instance, you construct a new gcc, and you compile sed, grep, bzip2, the shadow package that handles passwords, and so on, all while making sure things don't break, and running countless tests (which sometimes take longer than the package took to compile) to ensure that what you build is functional and reliable.

Final configuration

Next on the list are the various configuration files that reside in /etc, and the setup of sysvinit, the distribution's init system.

Last but not least, you compile the Linux kernel and set up grub so that the system is bootable.

At this point, if all has gone well, you should be able to reboot into your new lfs system.

What did I gain from that?

Building lfs was a very time-consuming process for me: it must have taken 7-8 hours at the very least. Not so much because of the compilation and testing (I was compiling with MAKEFLAGS='-j 4' on a Core i5), but because I didn't complete some steps correctly and later had to go back and redo them, along with everything that followed, plus the time it took to research issues, programs, and various other things before I issued a command at the shell.

Now if I were to answer the question “What did I gain from that”, my answer would be along the lines of “Infinite confusion, and some great insight at some points”.

To elaborate on that,

  • lfs mostly served as reassurance that what I knew about the system was indeed mostly correct.
  • I got the chance to see a distribution get built right before my eyes, which was something I had longed for for a great amount of time.
  • It made me somewhat more familiar with the configure && make && make install cycle.
  • It made me realise that the directories in the system are the simple result of a mkdir command, and that the configuration files in the /etc folder are handwritten plain-text files. (Yeah, I feel stupid about that one - I don't know what I was expecting. It was probably the result of the "magic" that the distro-making process entailed for me.)
  • I got to see the specific software that is needed to create a distribution, and how I can build it, customize that build, or even change that software to my liking.
  • And last but not least, something that nearly every lfs user says after a successful attempt: I knew that package managers did a great many things to maintain the system, and that much of the work I would otherwise have to do by hand was done nearly automatically, but boy, was I underestimating them. After lfs, I developed a new appreciation for a good package manager.

Epilogue

Lfs was, for the most part, a great experience. As a knowledge expander, it works great. As a system that you keep and continue to maintain? I don't know. I know people have done that in the past, but I decided against maintaining my build, as I figured it would be very time-consuming, and that if I ever wanted the experience of maintaining a distro, I would probably fork something like Crux instead.

In the end, if you ask me whether I can recommend it to you, I will say that I'm not so sure. It will provide you with some insight into the internals of a GNU/Linux distribution, but it won't make you a better programmer, as some people claim (most of the process revolves around the configure && make && make install cycle and hand-writing some configuration files).

In the end, it is yourself you should ask. Do you want that knowledge? Is it worth the hassle for you? Do you want the bragging rights? Are you crazy enough to want to maintain it? These are all questions to which you will get as many answers as the people you ask.