The Relationship Between Pointers and Arrays in C Language – Part Two

2. Analysis from the Implementation Perspective (Compiled Code)

The previous part looked briefly at the relationship between arrays and pointers from the perspective of writing programs. This part analyzes the same relationship in more depth from the perspective of the compiled code.

int *int_pointer = int_array;

lea -0x40(%rbp),%rax ; (%rbp) - 0x40 is the value of int_array

mov %rax,-0x8(%rbp) ; -0x8(%rbp) is the address of int_pointer; store the value of int_array into int_pointer

In computer terminology, a pointer is another name for an address, as in rip (instruction pointer), rbp (stack base pointer), and rsp (stack pointer). Look at the comment on the lea instruction above: (%rbp) - 0x40 is indeed the value of int_array, that is, the address of the array's storage, and an address is a pointer. Whether to say "pointer" or "address" usually depends on the situation; sometimes "address" fits the context and wording better, while at other times "pointer" is more appropriate.

Let’s continue to look at the following statements:

printf("address is %lx\n", address);

// Output the stack base pointer of the main function (main program)

printf("address - &int1 is %lx\n", address - (long)&int1);

// Output the stack offset of local variable int1

printf("&int1 is %lx\n", (long)&int1);

// Output the absolute address of local variable int1 in the process address space

Below is the output of the three statements above:

address is 7ffc5848d120 // We analyze the output here and find:

address - &int1 is 2c // The offset of variable int1 in the stack frame is -0x2c.

&int1 is 7ffc5848d0f4 // The stack base pointer plus this offset is exactly the address of variable int1

From the previous analysis, the offset of variable int1 within the function's stack frame is 0x2c. Next, consider the statement int *int_pointer = &int1 and the disassembly of the compiled program, shown below. The last two assembly instructions in the listing are the compiled result of the earlier statement int_pointer = int_array, repeated for comparison:

int *int_pointer = &int1;

lea -0x2c(%rbp),%rax // address of int1

mov %rax,-0x8(%rbp) // address of int_pointer

int_pointer = int_array;

lea -0x40(%rbp),%rax ; (%rbp) - 0x40 is the value of int_array

mov %rax,-0x8(%rbp) ; -0x8(%rbp) is the address of int_pointer; store the value of int_array into int_pointer

It can be seen that the assembly code for these two statements is almost identical; only the offset given to lea differs (-0x2c(%rbp) for int1, -0x40(%rbp) for int_array). lea stands for load effective address: it computes an address and places it in a register without reading the memory at that address. Next, let's analyze the following two statements and their disassembly:

int1 = 100;

movl $0x64,-0x2c(%rbp) // Variable int1; -0x2c(%rbp) matches the offset 0x2c calculated earlier.

printf("int1 is %d\n", int1);

mov -0x2c(%rbp),%eax

mov %eax,%esi

mov $0x402040,%edi

mov $0x0,%eax

callq 401030

From the above statements, it can be seen that for a variable of integer type such as int1, the default operation (referring to the variable by name alone, without any operator such as &) is to access the data held in the storage the variable represents:

int1 = 100; corresponds to movl $0x64,-0x2c(%rbp)

Here, -0x2c(%rbp) is the address of the storage that variable int1 refers to. movl $0x64,-0x2c(%rbp) places the value 0x64 into the 4 bytes of storage starting at address (%rbp) - 0x2c. In other words, the compiler interprets variable int1 as (%rbp) - 0x2c, written -0x2c(%rbp), which is the address of the storage unit the variable represents. Only when the variable takes part in ordinary operations such as addition, subtraction, multiplication, and division is the data held in that storage fetched for processing. This idea becomes clearer in a discussion of compilation principles, and we will elaborate on it later if there is an opportunity. Correspondingly, when the compiler handles an array variable, the default is to use the address of its storage rather than the content stored there, as the following statement shows:

int_pointer = int_array;

lea -0x40(%rbp),%rax

mov %rax,-0x8(%rbp)

Here, it can be seen that the default access for the array variable int_array operates on the address of int_array: lea -0x40(%rbp),%rax. As mentioned earlier, the assembly instruction lea obtains the effective address of a storage unit, whereas mov copies content between memory and registers. Let's look at how the three statements correspond to their instructions:

Statement: int_pointer = int_array; corresponds to lea -0x40(%rbp),%rax

mov %rax,-0x8(%rbp)

Statement: int_pointer = &int1; corresponds to lea -0x2c(%rbp),%rax

mov %rax,-0x8(%rbp)

Statement: int1 = 100; corresponds to movl $0x64,-0x2c(%rbp)

Comparing the first two statements with the third shows the difference: for int1, the compiler emits an operation on the data held in the variable's storage, while for int_array and &int1 it only computes an address. This means that the compiler semantically distinguishes between ordinary variables and array variables. This is why I have always said that, in most cases (with exceptions), the variable names defined in a program are actually pointers, and the same goes for array variables, although some operations on array variables are more complex. For array variables, the default semantics is to take the address. For variables of basic data types, the default semantics is to operate on the data in the storage that the pointer points to.

The following statements further validate this situation:

printf("int_array is %p\n", (void *)int_array);

printf("* int_array is %d\n", *int_array);

Output results are as follows:

int_array is 0x7ffc5848d0c0

* int_array is 262144

Given the previous analysis, this output is easier to understand. Using an array variable by name yields the address of the storage it represents. Dereferencing it retrieves the data at that location, interpreted as the array's element type, which shows that the compiler treats the name as a pointer to the element type declared for the array.

The output of the following statements further confirms this:

int_array[0] = 255;

printf("* int_array is %d\n", *int_array);

*int_array = 0;

printf("int_array[0] is %d\n", int_array[0]);

int_array[0] = 100;

printf("* int_array is %d\n", *int_array);

*int_array = 255;

printf("int_array[0] is %d\n", int_array[0]);

Output results are as follows:

* int_array is 255

int_array[0] is 0

* int_array is 100

int_array[0] is 255

This further validates the relationship between int_array and the array elements, at least showing their correspondence.

Of course, some readers may feel at this point that a type conversion is taking place. That view reflects a misunderstanding of type conversion in C, and the rest of this section explains why no conversion of data occurs here.

To understand this concept, we first need to clarify what the goal of type conversion is. Why do we need to perform type conversion?

Because variables of different data types support different operations and follow different rules. For example, incrementing an int pointer adds 4 to the address (where int occupies 4 bytes), while incrementing a char pointer adds 1. The purpose of type conversion is to tell the compiler what the program intends. When an operation belonging to one data type is applied to a variable of another data type, a type conversion tells the compiler to switch to the rules of the new data type. The address an array variable points to is the address of the array's first element, and the tests later on show that adding 1 to the array name advances the result by the size of one array element rather than by the size of the whole array, which implicitly shows how the compiler understands array variables semantically. This further illustrates that array variables are treated as pointers to the data type of their elements. Since the types are the same, what need is there for type conversion?

To take a step back: even if an implicit type conversion did take place, it would, like an explicit one, be a conversion between addresses that point to different kinds of data, not a conversion from some other data type to a pointer type. Moreover, implicit conversion has a property that refutes the opposing view: it only converts a value of one data type to another data type. It cannot first take some attribute of a variable, such as its address, and then convert that attribute into another data type. To turn the address of a variable of one data type into a pointer to another data type, the program must first apply the address operator & to the variable and then apply an explicit type conversion to the resulting address. The syntax and semantics differ: the address operation & comes first, then the conversion. The & operation cannot happen implicitly, because it is an operation, not a type conversion. In other words, even if int_pointer = int_array did involve an implicit type conversion, it would still be a conversion between pointer types that point to different data types. The address stays the same; only the pointed-to data type changes. This also indicates that array variables are still pointer types.

The next part explains, from another and more intuitive angle, why an array is a direct pointer to the data type of its elements. If you are interested, please keep following.
