ARM Server Optimization: A Cautionary Tale

Recently, another friend who does integration work shared a funny story with me. During a server optimization, a parameter was changed by mistake and it almost caused a disaster.

As we all know, whether it’s migration or optimization, it sounds simple but can be quite complicated. This article is something I got from a friend, so let’s enjoy it together.

ARM Server Optimization: A Cautionary Tale

I’m not bragging here; consider this a disclaimer to protect myself.

I am a system integration engineer at a small company, with a bit of experience. Sometimes I lead a team of 3-5 younger colleagues, and at times as many as 10 or so. The content of this article might raise some eyebrows about a certain domestic tech giant, so I will try to omit the names of all companies and projects. Just treat this as a fun read and a chance to learn something new. If you have dealt with this big company, I believe you will recognize it right away.

Last year, there was a small internal project with fewer than 10 machines, which was essentially a migration. During hardware selection, the client initially said they would use x86 machines, but later, for some unknown reason, suddenly wanted to switch to ARM.

I usually get along well with the client’s staff, so I carefully asked one of them what the client was thinking. My colleague at the client looked quite desperate and didn’t have the full picture, but had pieced together some rumors that a leader had been parachuted in, wanting to please a certain leader from a state-owned enterprise. This leader wanted to investigate the differences and adaptability between x86 and ARM. So this leader confidently insisted that our project use ARM servers.

So my colleague asked me what I thought. I said, since the leader said so, let’s switch. But don’t forget, how do you exchange one x86 machine for an ARM machine? When the quantity is large, you need to have a good justification. He said he understood, and thanked me for the reminder.

Sure enough, when the project started, problems arose. The binaries were unusable, and everything had to be compiled from source, which was time-consuming and labor-intensive; many software versions were incompatible, which was expected. The key issue was that after a tremendous effort to deploy everything, the client started demanding performance…

Our configuration files were uniform, and the only thing that needed adjusting was the sysctl settings. Whether you work in operations, system integration, development, or optimization, no one can know everything, and development itself is divided into application development and operations development. Most people are jacks-of-all-trades; very few have mastered several fields, and true specialists are rarer still. I passed this on to my colleague at the client, who said it was no problem; they would take care of the tuning.

I thought I wouldn’t need to worry about this anymore and left a younger colleague in charge, as I had other projects to attend to. Although I knew there would be pitfalls in this project, it was unavoidable, so I had to push forward.

It’s not that I’m irresponsible; it’s just that I had another project far away, and it required travel. Moreover, it was a new project with significant changes, so the travel time was quite long.

After a while, I thought this matter was basically settled.

One morning, while I was still in bed at the hotel, my younger colleague suddenly called me: “Brother, there’s a problem with the project. The system reports that the disk is full, but 40% of the space is still free.”

I was startled. That shouldn’t happen. I had been worried that the disk would be too small, so I had specifically requested double the space. I quickly asked my colleague to check the inode usage, and sure enough, the inodes were exhausted.
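For readers who haven’t run into this before, here is a minimal Python sketch of that check. It is not what we ran on site (on the command line, `df -i` shows the same information); the function names, the 95% threshold, and the `/` mount point are all my own assumptions for illustration.

```python
import os


def used_fraction(total, free):
    """Return the used share (0.0 to 1.0) given total and free counts."""
    if total == 0:
        return 0.0  # some filesystems report no fixed inode limit
    return (total - free) / total


def fs_usage(path):
    """Read block and inode usage for the filesystem that holds `path`."""
    st = os.statvfs(path)
    return {
        "block_used": used_fraction(st.f_blocks, st.f_bfree),
        "inode_used": used_fraction(st.f_files, st.f_ffree),
    }


def diagnose(usage, threshold=0.95):
    """Name the resource that is running out; the threshold is arbitrary."""
    if usage["inode_used"] >= threshold:
        return "inodes exhausted"
    if usage["block_used"] >= threshold:
        return "blocks exhausted"
    return "ok"


if __name__ == "__main__":
    usage = fs_usage("/")  # "/" is only an example mount point
    print(f"blocks used: {usage['block_used']:.0%}")
    print(f"inodes used: {usage['inode_used']:.0%}")
    print(diagnose(usage))
```

A disk that is 60% full by blocks but 100% full by inodes will refuse to create new files, which is exactly the “full but not full” symptom my colleague described.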

This project had been running for several months, and the number of files shouldn’t have changed dramatically. Let’s see what can be deleted. I contacted my colleague at the client, and we deleted what needed to be deleted and moved what needed to be moved.

I have to commend my younger colleague for anticipating that this might be a problem and promptly issuing a temporary maintenance notice. Fortunately, this was an internal project; otherwise, both users and partners would have been in big trouble!

For those of us doing the troubleshooting, the worst part is the anxiety. We also want the system back to normal immediately, but a pit can only be filled once you know it has been dug; if the person who dug it doesn’t realize it and doesn’t warn us in advance, we can only fill it in slowly. It wouldn’t take much: if this happened once a week, I’d be losing hair over it.

Sometimes, there really are no good solutions, and you are stuck deciding whether to touch the system at all. Take deleting files as an example. If the space is genuinely full, you delete a couple of big files and that is about it. But when it is the inodes that are full, you have to hunt down small files and empty folders to delete. I spent almost a whole day doing just that.
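If I had to do that hunt again, I would script it. The following is only a sketch of the idea, not the procedure we actually followed: it walks a directory tree, counts the entries in each directory, and lists the empty ones. The function names and the `/data` path are hypothetical.

```python
import os


def survey(root):
    """Walk `root`; return (entries per directory, list of empty directories)."""
    counts = {}
    empty = []
    for dirpath, dirnames, filenames in os.walk(root):
        n = len(dirnames) + len(filenames)
        counts[dirpath] = n
        if n == 0:
            empty.append(dirpath)
    return counts, empty


def top_dirs(counts, limit=10):
    """Return the `limit` directories holding the most direct entries."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


if __name__ == "__main__":
    counts, empty = survey("/data")  # hypothetical data partition
    for path, n in top_dirs(counts):
        print(f"{n:>8}  {path}")
    print(f"{len(empty)} empty directories")
```

Every file and every directory costs one inode regardless of its size, so the directories at the top of this list are the first places to look.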

Strictly speaking, this project was “old wine in a new bottle,” so my colleagues at the client also knew the problem shouldn’t have come from our existing project. But they couldn’t push back against the newly parachuted leader, who insisted on finding a cause. My colleague had no choice but to ask me: “Brother, could you check it out?”

Sure. I asked my younger colleague first, but he also had other projects to attend to, so he wasn’t on this project every day. That was fine, since we still had a reference: the old bottle was still there.

I compared the new and old setups and didn’t find anything different. I then asked my colleague at the client what optimizations they had made. He said they had almost completely reinstalled everything.

Wait, reinstall? If it were optimization, it shouldn’t require a complete reinstall. Did something go wrong at the start?

I asked my colleague for more details: two waves of people came. The first wave said the optimization was done well and there was nothing more to optimize, but the leader didn’t approve. Later, another wave came, saying the previous optimization wasn’t thorough enough and needed to be reinstalled.

Got it, I noted this down, and my younger colleague and I could clear ourselves of suspicion at least; we wouldn’t have to take the blame.

Better yet, we finally found the cause. I took a look at the block size of the file system, and wow, it was 8K, whereas the default is 1024 bytes. That’s an 8-fold difference…
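Comparing an old and a new machine by eye is slow, so here is a hedged sketch of how that comparison could be automated. It assumes you run `fs_profile` on each machine and compare the results; the function names are my own, the bytes-per-inode figure is only approximate, and the two values in the demo are simply the block sizes from this story.

```python
import os


def fs_profile(path):
    """Collect a few format-time parameters of the filesystem holding `path`."""
    st = os.statvfs(path)
    profile = {"block_size": st.f_frsize, "total_inodes": st.f_files}
    if st.f_files:
        # Approximate: total bytes divided by the number of inodes.
        profile["bytes_per_inode"] = st.f_frsize * st.f_blocks // st.f_files
    return profile


def diff_profiles(reference, candidate):
    """Return {field: (reference value, candidate value)} where they differ."""
    keys = sorted(set(reference) | set(candidate))
    return {
        key: (reference.get(key), candidate.get(key))
        for key in keys
        if reference.get(key) != candidate.get(key)
    }


if __name__ == "__main__":
    old = {"block_size": 1024}  # the value I expected
    new = {"block_size": 8192}  # the value I found after the "optimization"
    for field, (before, after) in diff_profiles(old, new).items():
        print(f"{field}: {before} -> {after}")
```

The point of the design is that the old environment acts as the baseline: anything that differs from it needs an explanation from whoever made the change.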

Nothing more to say; the reason was right there! I reported this to my colleague at the client, who said this pit was dug by them, and he was extremely frustrated. He said, “Brother, you really have a high level; you found the reason so quickly.”

I told him that I was just lucky. First, we could compare the new and old setups, second, I had a good relationship with my colleague at the client, which allowed for honest communication, and third, my colleague at the client understood a bit; at least he followed the entire project and asked questions when he saw something unreasonable.

If any detail hadn’t matched, or if there had been the slightest lack of cooperation, how could you uncover the truth? No way! And even if you did find out the truth, who would take the blame? Wouldn’t it still be me? It’s one thing for operations to take the blame, but it’s outrageous for system integration to take it.

So we found a new hard disk, freed up some folders, and reformatted the data partition with the default block size. After that, we found that performance was basically back to normal…

Well, we basically know how optimization works. Some tuning parameters are a double-edged sword. Optimization is about turning a system’s characteristics into advantages, not mindlessly turning every setting up. If a characteristic is actually a weakness and nobody recognizes it, that is a real pitfall. Trapping yourself can still be a learning experience, but trapping others is a forced sale: it makes other people pay to learn the lesson.

Later, my colleague asked me how to avoid similar situations in the future.

I said, install the operating system yourself, and any changes made by Company A should be verified with Company B to see the impact and interoperability. Don’t reinvent the wheel. If you really do reinvent the wheel, it’s okay if it’s compatible. What’s scary is if it’s not compatible; then it’s a disaster.
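One cheap way to put that verification into practice is to save the settings before and after each vendor visit and diff them. The sketch below assumes two text dumps in the `key = value` form printed by `sysctl -a`; the function names and the sample values are made up for illustration.

```python
def parse_sysctl(text):
    """Parse `key = value` lines into a dict, skipping blanks and comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def changed_settings(before, after):
    """Return {key: (old value, new value)} for added, removed, or changed keys."""
    keys = sorted(set(before) | set(after))
    return {
        key: (before.get(key), after.get(key))
        for key in keys
        if before.get(key) != after.get(key)
    }


if __name__ == "__main__":
    # Sample values are invented; real dumps would come from `sysctl -a`.
    baseline = parse_sysctl("vm.swappiness = 60\nnet.core.somaxconn = 128\n")
    tuned = parse_sysctl("vm.swappiness = 10\nnet.core.somaxconn = 128\n")
    for key, (old, new) in changed_settings(baseline, tuned).items():
        print(f"{key}: {old} -> {new}")
```

A diff like this only covers sysctl; something like a file system block size has to be recorded separately, which is exactly why it slipped past everyone in this story.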

My colleague then asked, “But the CPUs are different; how do we unify the operating systems? If the operating systems aren’t unified, then the comparisons and changes mentioned earlier become meaningless.”

That’s right; there’s only one way—try not to use various CPUs.

Why do I say this? The path of Xinchuang (XC) is not easy to walk. One approach is a fully closed environment. Once fully closed, it’s completely cut off from the outside. Those open-source and self-controllable systems are not much different, and almost don’t require any changes. If you cut off all contact, isn’t that just being obstinate?

The risk of creating your own standards is that once it backfires, it leads to complete destruction. Back in the day, Intel developed its own IA-64 and insisted on it for 20 years. Many major software and hardware companies supported it fully, but in the end, it still failed.

Now, domestically, ARM is pushing this the hardest, but there are two issues: one is the ecosystem foundation, and the other is a common human ailment. They always say it’s incompatible and not yet adapted, but reporting it up doesn’t help. Who can ask the open-source software gurus to help adapt software for unfamiliar fields?

The other approach is semi-closed, only checking whether it’s self-controllable, and trying to create one’s own system as much as possible without significant changes. If there are truly “connections,” those enterprises rely on securing upper-level leaders and stirring up so-called “patriotic sentiments” to pull some small tricks.

One thought can lead to disaster, and another can lead to enlightenment. Our level and vision are too limited. Perhaps these small tricks could become high technology? Or perhaps in fields we don’t understand, others are indeed high technology? We don’t understand, so why not avoid creating more trouble for ourselves and save some hair? After all, hair transplants are quite expensive; one month’s salary wouldn’t even cover a few strands, and there are mortgages and car loans to pay.

Professional matters should be left to professionals. Whoever does the work should be responsible. A few days ago, I asked my colleague whether the newly deployed servers were working well. He replied, “Forget it.”

People in Xinchuang