C Language Programming Tips and Advice from a Google Expert


Compiled by: Bole Online / PJing; original English text by Rob Pike


[Bole Online Guide]: Rob Pike is one of the most renowned software engineers at Google, a former member of the Bell Labs Unix development team, and a key leader in the development of the Plan 9 and Inferno operating systems. He is a core figure in the creation of the Go and Limbo programming languages. He co-authored two books with Brian Kernighan: “The Unix Programming Environment” and “The Practice of Programming”.


The advice and tips presented in this article were written by Rob Pike in February 1989 and are compiled as follows.

Introduction

The book “The Elements of Programming Style” by Kernighan and Plauger is an important and rightly influential work. However, I sometimes feel that its concise rules have been taken as a cookbook of recipes for good style rather than as the succinct expression of a philosophy they were meant to be. If the book says that variable names should be chosen meaningfully, does it follow that names which are small essays on their own use are even better? Is MaximumValueUntilOverflow a better name than maxval? I don’t think so.

What follows is a set of short notes that collectively encourage a philosophy of clarity in programming rather than laying down rigid rules. I do not expect you to agree with all of them; they are opinions, and opinions change with the times. But these opinions have been accumulating in my head for a long time, even if they were not written down until now, and they are based on a lot of experience, so I hope they help you understand how to plan the details of a program. (I have yet to see a good essay on how to plan the whole thing, though that is partly what a course should teach.) If you find them idiosyncratic, fine; if you disagree, that is fine too; but if the disagreement makes you think about why you disagree, that is better still. Under no circumstances should you program the way I say simply because I say so; program in the way you think best expresses what you are trying to accomplish. And do so consistently and ruthlessly.

Your comments are welcome.

Formatting Issues

A program is a kind of publication. It is meant to be read by programmers first (perhaps another programmer, perhaps yourself days, weeks, or years later), and only then does the machine get its turn. The machine is happy if the program merely compiles; it does not care how pretty the program is. But people do care, and they should. Sometimes they care too much: pretty-printers mechanically produce pretty output that accentuates irrelevant detail, which is about as sensible as setting every preposition in an English text in bold type. Although many believe that programs should look like the Algol-68 report (and some systems even require programs to be edited in that style), a clear program is not made any clearer by such presentation, and a bad program is only made to look more ridiculous.

Typographic conventions, applied consistently, are of course essential to clear presentation; indentation is the best known and most useful example. But when the ink obscures the intent, typography has taken over. So even with plain old typewriter-like output, be conscious of typographic silliness. Avoid decoration; for instance, keep comments brief and banner-free. Say what you want to say in the program, neatly and consistently. Then move on.

Variable Naming

For variable names, length is not a virtue; clarity of expression is. A rarely used global variable may deserve a long name, maxphysaddr say. An array index used on every line of a loop need not be named any more elaborately than i. Writing index or elementnumber means more typing (or more calls on the text editor) and obscures the details of the computation. When variable names are huge, it is harder to see what is going on. This is partly a formatting issue; compare:

for (i = 0; i < 100; i++)
    array[i] = 0;

vs.

for (elementnumber = 0; elementnumber < 100; elementnumber++)
    array[elementnumber] = 0;

With real examples the problem gets worse fast. Indices are just notation, so treat them as such.

Pointers also require sensible notation. np is just as mnemonic as nodepointer if you consistently use a naming convention from which “np means node pointer” is easily derived. More on this in the next section.

As with every other aspect of readable programming, consistency in naming is important. If you call one variable maxphysaddr, do not call its sibling lowestaddress.

Finally, I prefer names of minimum length but maximum information, and let the context fill in the rest. Global variables, for instance, have little context when they are used, so their names need to be relatively evocative. Thus I write maxphysaddr (not MaximumPhysicalAddress) for a global variable, but np rather than NodePointer for a pointer defined and used locally. This is largely a matter of taste, but taste is relevant to clarity.
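To make the point concrete, here is a minimal sketch of that convention; apart from maxphysaddr, the names and the Node structure are invented for illustration:

#include <stddef.h>

typedef struct Node Node;
struct Node {
    int   value;
    Node *next;
};

/* Global: used far from its definition, so the name carries its own context. */
unsigned long maxphysaddr;

/* Local: np lives for only a few lines, so the short name is clearer, not lazier. */
void clearvalues(Node *list)
{
    Node *np;

    for (np = list; np != NULL; np = np->next)
        np->value = 0;
}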

I avoid embedded capital letters in names; to my prose-oriented eyes they are too awkward to read comfortably, jangling like bad typography.

Use of Pointers

The C language is unusual in that it allows pointers to point to anything. Pointers are sharp tools, and like any such tool, used well they can be delightfully productive, but used badly they can do great damage (I sank a chisel into my thumb just a few days before writing this). Pointers have a bad reputation in academia because they are considered too dangerous, dirty somehow. But I believe they are powerful notation, which means they can help us express ourselves clearly.

Consider this: when you have a pointer to an object, it is a name for exactly that object and no other. That sounds trivial, but look at the two expressions below:

np

node[i]

The first points to a node; the second evaluates to (say) the same node. But the second form is an expression, and it is not so simple. To interpret it, we must know what node is, what i is, and that i and node are related by the (probably unspecified) rules of the surrounding program. Nothing about the expression in isolation shows that i is a valid index of node, let alone the index of the element we want. If i, j, and k are all indices into the node array, it is very easy to slip up, and the compiler cannot help find the mistake. It is particularly easy to err when passing things to subroutines: a pointer is a single thing, but an array and an index must be passed as two objects, and in the receiving subroutine they must be thought of as a unit.
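A small sketch of that last point, using hypothetical routines: the pointer version passes one thing, while the indexed version passes two things that caller and callee must keep mentally tied together.

typedef struct Node Node;
struct Node {
    int type;
};

/* One thing to pass, one name for the object. */
void settype(Node *np, int type)
{
    np->type = type;
}

/* Two things to pass; nothing in the call ties the index to the array,
 * and the compiler cannot check that i is a valid index of node. */
void settype_indexed(Node node[], int i, int type)
{
    node[i].type = type;
}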

An expression that evaluates to an object is inherently subtler and more error-prone than the address of that object. Correct use of pointers can simplify the code:

parent->link[i].type

vs.

lp->type.

If you want to get the type of the next element, it could be:

parent->link[++i].type

or

(++lp)->type.

i must be incremented, but the rest of the expression must stay exactly the same; with a pointer, there is only one thing to do: advance the pointer.

Also consider the typography. Indirection through a pointer is usually easier to read than the expression for the object, at least when the structures are nested: it takes less ink, and the overhead for compiler and computer is negligible. A related point is that the type of the pointer constrains how it can be used correctly, which allows some useful compile-time error checking that array indices cannot share. Moreover, if the objects are structures, their tag fields are reminders of their type. Therefore:

np->left

is enough to tell the reader what is going on. If an index is used instead, the array must be named as well, and the expression gets longer:

node[i].left.

Furthermore, as the examples grow larger, the extra characters become more annoying.

In general, if you find that the code contains many similar and complex expressions that compute elements of a data structure, judicious use of pointers can clear things up. Consider:

if (goleft)
    p->left = p->right->left;
else
    p->right = p->left->right;

Consider what this would look like if a compound expression stood in place of p. Sometimes it is worth a temporary variable (here p) or a macro to distill the calculation.
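To make the comparison concrete, here is a hedged sketch, assuming the node lives in a hypothetical tree->node array, of the same fragment written first with the compound expression spelled out and then with the temporary pointer:

typedef struct Node Node;
struct Node {
    Node *left;
    Node *right;
};

struct Tree {
    Node node[128];
};

/* Without the temporary: the compound expression for the node is repeated
 * on every line, and each repetition is a chance to get it subtly wrong. */
void rotate(struct Tree *tree, int current, int goleft)
{
    if (goleft)
        tree->node[current].left = tree->node[current].right->left;
    else
        tree->node[current].right = tree->node[current].left->right;
}

/* With a temporary pointer, as in the fragment above. */
void rotate2(struct Tree *tree, int current, int goleft)
{
    Node *p = &tree->node[current];

    if (goleft)
        p->left = p->right->left;
    else
        p->right = p->left->right;
}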

Procedure Names

Procedure names should say what they do; function names should say what they return. Functions are used in expressions, often inside an if, so they need to read appropriately.

if(checksize(x))

is unhelpful because we cannot deduce whether checksize returns true on error or on non-error. Instead:

if(validsize(x))

makes the point clear and makes a future mistake in using the routine less likely.
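As an illustration only (validsize, MAXOBJSIZE, and the limits below are invented, not from the original), a predicate named for what it returns reads naturally inside the if:

#include <stddef.h>

#define MAXOBJSIZE 4096    /* hypothetical upper bound, for illustration only */

/* Returns nonzero when x is an acceptable size; the name says exactly that. */
int validsize(size_t x)
{
    return x > 0 && x <= MAXOBJSIZE;
}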

Comments

This is a delicate matter that requires experience and judgment. I tend to err on the side of eliminating comments, for several reasons. First, if the code is clear and uses good type names and variable names, it should explain itself. Second, comments are not checked by the compiler, so there is no guarantee they are right, especially after the code is modified; a misleading comment can be very confusing. Third, there is the issue of typography: comments clutter code.

But I do write comments sometimes, and almost exclusively as introductions to what follows. For example: a comment on the use and type of each global variable (I always comment globals in large programs); an introduction to an unusual or critical procedure; or a marker for sections of a large computation.

A typical example of poor commenting style is:

i=i+1; /* Add one to i */

And an even worse practice:

/**********************************
 *                                *
 *          Add one to i          *
 *                                *
 **********************************/

          i=i+1;

Don’t laugh yet; wait until you see it in reality.

Avoid cute typography in comments, and avoid big blocks of comments, except perhaps before vital sections such as the declaration of the central data structure (comments on data are usually much more helpful than comments on algorithms); basically, comments should be kept to a minimum. If the code needs a comment to be understood, the better approach is to rewrite the code so it is easier to understand.
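For instance, here is a sketch of the one kind of comment that earns its keep under this view, a brief note on a vital data declaration (the Symbol structure and its fields are hypothetical, not from the original):

/* Symbol: one entry in a (hypothetical) compiler symbol table.
 * name  is owned by the table and freed with it;
 * type  is one of the T* constants defined elsewhere;
 * value holds a constant's value, or a variable's frame offset. */
typedef struct Symbol {
    char *name;
    int   type;
    long  value;
} Symbol;

Which brings us to complexity.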

Complexity

Many programs are too complicated, more complicated than they need to be to solve their problem efficiently. Why? Mostly it is bad design, but I will skip that issue because it is too big. Programs are often complicated at the micro level as well, and that can be addressed here.

Rule 1: Do not assume where the program will spend its running time.

Bottlenecks occur in surprising places, so do not try to guess and put in a speed hack until you have proven that is where the bottleneck is.

Rule 2: Measure

Do not tune for speed until you have measured, and even then do not do it unless one part of the code overwhelms the rest.

Rule 3: Fancy algorithms are slow when n is small, and n is usually small.

Fancy algorithms have big constants. Until you know that n will frequently be big, do not get fancy. (Even if n does get big, apply Rule 2 first.) For workaday problems, plain binary trees are always faster than splay trees.

Rule 4: Fancy algorithms are buggier than simple ones, and much harder to implement.

Try to use simple algorithms with simple data structures.

The following are almost all the data structures used in practical programs:

  • Arrays

  • Linked lists

  • Hash tables

  • Binary trees

Of course, one must also be prepared to combine these into compound data structures; a symbol table, for instance, might be implemented as a hash table containing linked lists of arrays of characters, as in the sketch below.
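Here is a hedged sketch of one such combination, assuming a symbol-table-like use (all names are illustrative): a fixed-size hash table whose buckets are linked lists of entries holding character arrays.

#include <stdlib.h>
#include <string.h>

#define NHASH 128

/* One bucket entry: a list node carrying an array of characters. */
struct Entry {
    struct Entry *next;
    char          name[64];
};

static struct Entry *table[NHASH];          /* the hash table itself */

static unsigned hash(const char *s)
{
    unsigned h = 0;

    while (*s != '\0')
        h = h * 31 + (unsigned char)*s++;
    return h % NHASH;
}

/* Insert a name at the head of its bucket's list. */
void insert(const char *name)
{
    unsigned h = hash(name);
    struct Entry *e = malloc(sizeof *e);

    if (e == NULL)
        return;                             /* sketch only: drop on allocation failure */
    strncpy(e->name, name, sizeof e->name - 1);
    e->name[sizeof e->name - 1] = '\0';
    e->next = table[h];
    table[h] = e;
}

Three of the four structures from the list above appear here, which is typical: the simple pieces compose without any of them getting fancier.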

Rule 5: Data dominates

If you have chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming. (See Brooks, p. 102.)

Rule 6: There is no Rule 6

Programming with Data

Algorithms, or the details of algorithms, can often be encoded compactly, efficiently, and expressively as data, rather than as lots of if statements. This works because the complexity of the job at hand, when it arises from a combination of independent details, can be encoded. A classic example is parsing tables, which encode the grammar of a programming language in a form interpretable by a fixed, fairly simple piece of code. Finite state machines are particularly amenable to this kind of treatment, but almost any program that involves “parsing” some abstract sort of input into a sequence of independent “actions” can profitably be built as a data-driven algorithm.
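As a small illustration, not taken from the original essay, here is a finite state machine that recognizes C block comments; the transitions live in a table, and a fixed, simple loop interprets it:

/* States for recognizing C block comments in a stream of characters. */
enum state { CODE, SLASH, COMMENT, STAR, NSTATE };
enum charclass { CSLASH, CSTAR, COTHER, NCLASS };

/* The "algorithm" lives in this table; the loop below never changes. */
static const enum state next[NSTATE][NCLASS] = {
    /*            '/'      '*'      other   */
    [CODE]    = { SLASH,   CODE,    CODE    },
    [SLASH]   = { SLASH,   COMMENT, CODE    },
    [COMMENT] = { COMMENT, STAR,    COMMENT },
    [STAR]    = { CODE,    STAR,    COMMENT },
};

static enum charclass classify(int c)
{
    if (c == '/') return CSLASH;
    if (c == '*') return CSTAR;
    return COTHER;
}

/* Returns 1 if the text ends inside a block comment, 0 otherwise. */
int endsincomment(const char *s)
{
    enum state st = CODE;

    for (; *s != '\0'; s++)
        st = next[st][classify(*s)];
    return st == COMMENT || st == STAR;
}

Handling a new kind of input then means adding a row or column to the table; the interpreting loop stays untouched.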

Perhaps the most intriguing aspect of this kind of design is that the tables can sometimes be generated by another program (a parser generator, in the classic case). A more down-to-earth example: if an operating system is driven by a set of tables that connect I/O requests to the appropriate device drivers, the system may be “configured” by a program that reads a description of the particular devices attached to the machine in question and prints the corresponding tables.

One reason data-driven programs are not common, at least among beginners, is the tyranny of Pascal. Pascal, like its creator, believes firmly in the separation of code and data. It therefore (at least in its original form) has no ability to create initialized data. This flies in the face of the theories of Turing and von Neumann, which define the basic principles of the stored-program computer: code and data are the same, or at least they can be. How else can you explain how a compiler works? (Functional languages have a similar problem with I/O.)

Function Pointers

Another result of the tyranny of Pascal is that beginners do not use function pointers. (In Pascal, you cannot have function-valued variables.) Using function pointers to encode complexity has some interesting properties.

Some of the complexity is passed to the routine pointed to. The routine must obey a standard protocol, as one of a set of routines invoked identically, but beyond that, what it does is its business alone. The complexity is distributed.

There is this idea of a protocol: all functions used in a similar way must behave in a similar way. This makes for easy documentation, testing, and growth, and it even makes it possible to run the program distributed over a network, since the protocol can be encoded as remote procedure calls.

I argue that the clear use of function pointers is the heart of object-oriented programming. Given a set of operations you want to perform on data, and a set of data types that should respond to those operations, the easiest way to put the program together is with a group of function pointers for each type. This, in a nutshell, defines class and method. The object-oriented languages give you more, of course: prettier syntax, derived types, and so on, but conceptually they provide little extra.
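A minimal sketch of this idea with invented names: a table of function pointers plays the role of the class, each entry the role of a method, and each data type supplies its own table for the same set of operations.

#include <stdio.h>

/* The set of operations every shape type must provide: the "class". */
struct shapeops {
    double (*area)(const void *shape);
    void   (*print)(const void *shape);
};

struct circle { double r; };
struct square { double side; };

static double circlearea(const void *s)  { const struct circle *c = s; return 3.141592653589793 * c->r * c->r; }
static void   circleprint(const void *s) { const struct circle *c = s; printf("circle, r = %g\n", c->r); }
static double squarearea(const void *s)  { const struct square *q = s; return q->side * q->side; }
static void   squareprint(const void *s) { const struct square *q = s; printf("square, side = %g\n", q->side); }

static const struct shapeops circleops = { circlearea, circleprint };
static const struct shapeops squareops = { squarearea, squareprint };

/* An "object": the data plus a pointer to its type's operation table. */
struct shape {
    const void            *data;
    const struct shapeops *ops;
};

int main(void)
{
    struct circle c = { 2.0 };
    struct square q = { 3.0 };
    struct shape shapes[2] = { { &c, &circleops }, { &q, &squareops } };
    int i;

    for (i = 0; i < 2; i++) {
        shapes[i].ops->print(shapes[i].data);               /* "method" dispatch */
        printf("area = %g\n", shapes[i].ops->area(shapes[i].data));
    }
    return 0;
}

Adding a new shape means writing its routines and one more operation table; the dispatching loop never changes.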

Combining data-driven programs with function pointers leads to an astonishingly expressive way of working, one that, in my experience, has often produced pleasant surprises. Even without a special object-oriented language, you can get 90% of the benefit for no extra work and stay more in control of the result. I cannot recommend an implementation style more highly. All the programs I have organized this way have survived comfortably through much development, far better than those built with less disciplined approaches. Maybe that is the point: the discipline it forces pays off handsomely in the long run.

Include Files

Simple rule: include files should never include other include files.

If, instead, include files state (in comments or implicitly) which files they need to have included first, the problem of deciding what to include is pushed onto the user (the programmer), but in a form that is easy to handle and that, by construction, avoids multiple inclusions. Multiple inclusions are a bane of systems programming: it is not rare for a single C source file to pull in the same header five or more times during compilation. The Unix /usr/include/sys is terrible this way.
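As a hedged illustration of stating what must come first (screen.h, point.h, drawline, and Point are all invented names), the header announces its prerequisites in a comment rather than including them itself, and the C source file does the including, once, in the right order:

/* screen.h: requires <stdio.h> and "point.h" to have been included first. */
void drawline(FILE *out, Point a, Point b);

/* draw.c: the includer decides the order once, and nothing is read twice. */
#include <stdio.h>
#include "point.h"
#include "screen.h"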

Speaking of #ifdef, there is a little dance involving it that can prevent a file from being read twice, but in practice it is usually done wrong: the #ifdefs are placed in the included file itself, not in the file that includes it. The result is often thousands of needless lines of code passing through the lexical analyzer, which (in good compilers) is the most expensive phase.
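A small sketch of the difference, using foo.h and FOO_H as placeholder names:

/* The usual arrangement: the guard lives inside foo.h itself. Every
 * #include "foo.h" still pushes the whole file through the lexical
 * analyzer, only for the preprocessor to throw the text away. */
#ifndef FOO_H
#define FOO_H
/* ... declarations ... */
#endif

/* The arrangement described above: the test moves to the file doing
 * the including, so an already-seen header is never even opened. */
#ifndef FOO_H
#include "foo.h"
#endif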

Just follow the simple rule above.
