Editor | Xiao Zhi
At the Open Source Summit North America on August 31, Linus Torvalds, the creator of Linux, shared his views on the future of Linux in a conversation with Dirk Hohndel, Chief Open Source Officer at VMware. He said that if he were hit by a bus, he would not worry about the kernel being affected. The hypothetical is grim, but his reasoning is sound. Why?
Workflow is More Important than Code
“What I really worry about is the patch process; the workflow is more important than the code,” Torvalds said. “If you have the right workflow, the code will take care of itself, and if there are errors, we know how to handle them.”
He admitted that he no longer knows every line of code in the kernel, but that is not a bad thing. Torvalds argued that the kernel’s sheer scale has made it ever more complex, and that the open-source model is at the core of its success: in a complex world, the only way to manage complexity is through the open exchange of ideas. You cannot manage complexity in a closed environment.
Linus has been accepting patches from other developers since 1992, and today he is backed by a strong team of kernel maintainers. In the Linux collaboration model, Linus handles overall coordination and communication, interfacing with over a dozen core maintainers, each responsible for a specific area of the kernel. When a new development task comes up, Linus hands it to the appropriate person; each of these core maintainers in turn has their own small circle of trusted experts. Linus only needs to know which member of his team to hand the task to.
Dirk Hohndel asked Linus whether such a development model is sustainable. Linus replied with a smile that if the programmers on the current team grow old and fat (seemingly poking fun at himself) and no longer want to continue, that is fine, because new programmers will come in. Dirk then asked whether Linus holds absolute decision-making power over the kernel’s continuous iteration. Linus answered “No”: he genuinely encourages people to fork the kernel to suit their own needs, and if an idea proves its worth, its essence is absorbed back into the Linux kernel project. Dirk summed up that this model of forking out and absorbing back ultimately still reflects the judgment of Linus and his team.
Clearly, technical giants like Linus place great importance on the software development process. A well-designed, smoothly running process greatly improves engineering efficiency and cushions the impact of unexpected problems. So how do you build such a process? Let’s look at Facebook’s case.
How Does Facebook Handle Development and Deployment?
Facebook is the largest social networking site in the world, with over 2 billion monthly active users as of 2017, more than double that of WeChat. How do Facebook engineers manage to run such a site while continuously releasing new features?
Facebook engineers do not use the waterfall model typical of the traditional software industry for development; they continuously develop new features and quickly deploy them, allowing users to access these new features. This is what is often referred to as continuous deployment. In their view, Facebook’s development never truly ends; the codebase is constantly growing, and the code exhibits a super-linear growth trend over time.
At Facebook, all front-end engineers work on a single stable branch, which speeds up development by eliminating cumbersome branch merging. Day to day, everyone develops locally with git and pushes code to SVN once it is ready (SVN is used for historical reasons), which naturally separates work-in-progress code from code that is ready to deploy.
However, to keep the site running stably, it is not enough for engineers to push code to SVN and assume it can be deployed. Facebook balances speed and stability by combining daily releases with weekly releases. By default, all code changes go into the weekly release, which therefore contains relatively more changes. Every Sunday afternoon, release engineers take the code that has been pushed to SVN and put it through extensive automated testing, including many regression tests for correctness and performance. This build becomes the default version used internally by Facebook employees, with the formal release typically shipping on Tuesday afternoon.
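Facebook’s internal release tooling is not public, but the shape of this gate is easy to picture: the Sunday cut only becomes the internal default build if every automated suite passes. Here is a minimal, hypothetical sketch in Python (all names are invented):

```python
# Hypothetical sketch of the weekly release gate: the Sunday cut is
# promoted to the default internal build only if every suite passes.

def correctness_suite(revision: str) -> bool:
    """Stand-in for the automated correctness regression tests."""
    return True  # pretend the suite passed

def performance_suite(revision: str) -> bool:
    """Stand-in for the automated performance regression tests."""
    return True

REGRESSION_SUITES = [correctness_suite, performance_suite]

def cut_weekly_release(revision: str) -> bool:
    """Promote a revision to the internal build only if all suites pass."""
    for suite in REGRESSION_SUITES:
        if not suite(revision):
            print(f"{suite.__name__} failed; {revision} stays out of this release.")
            return False
    print(f"{revision} becomes the default build for internal use.")
    return True

cut_weekly_release("r12345")  # placeholder SVN revision number
```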
Release engineers score each engineer’s historical performance, internally referred to as “Push Karma.” For instance, those whose code frequently encounters issues receive lower scores, and their code naturally receives more “attention.” The purpose of this is to control release risks rather than to judge individuals, so these scores are kept confidential. Additionally, larger changes or code that has been discussed extensively during Code Review are also considered higher risk and receive more scrutiny.
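The scoring itself is confidential, but the idea can be sketched: combine the author’s track record with signals about the change itself. The weights and inputs below are purely hypothetical:

```python
from dataclasses import dataclass

# Purely hypothetical sketch of a "Push Karma"-style risk heuristic;
# the real scoring is internal, and the weights below are invented.

@dataclass
class Change:
    author_failure_rate: float  # fraction of the author's past pushes that broke
    lines_changed: int          # size of the diff
    review_comments: int        # how much debate the change drew in Code Review

def risk_score(change: Change) -> float:
    """Riskier history, bigger diffs, and contentious reviews all raise the score."""
    size_factor = min(change.lines_changed / 1000, 1.0)
    debate_factor = min(change.review_comments / 50, 1.0)
    return (0.5 * change.author_failure_rate
            + 0.3 * size_factor
            + 0.2 * debate_factor)

# A high score draws extra human attention during the release,
# not automatic rejection: the point is controlling release risk.
print(risk_score(Change(author_failure_rate=0.2, lines_changed=800, review_comments=30)))
```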
Before code is included in a release, it has already gone through the developers’ unit tests and Code Review. Code Review is taken very seriously at Facebook, and it is done in Phabricator, a tool integrated with their version control.
In addition to the extensive automated testing, every employee effectively serves as a high-density tester: while using Facebook internally, they can report any issue they discover. As more developers write more code, the codebase grows rapidly, but correspondingly there are also more people on hand to test it.
In terms of performance, Facebook uses Perflab to compare the performance of new and old code. If new code performs poorly and the developer cannot fix it promptly, the related code will be excluded from the current release and will be released once the issue is resolved. Every small performance issue is significant because small problems can quickly accumulate and become major issues affecting capacity and performance. Perflab can visually present system performance through charts.
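At its core this is a side-by-side comparison of the same workload on old and new code, with anything beyond a tolerance flagged. A hypothetical sketch (the metric names and the 2% threshold are invented):

```python
# Hypothetical sketch of a Perflab-style comparison: run the same workload
# on old and new code, then flag metrics where new code is meaningfully worse.

def find_regressions(old_metrics: dict, new_metrics: dict,
                     tolerance: float = 0.02) -> list:
    """Return metrics where new code is more than `tolerance` worse than old.
    Lower is assumed better for every metric (latency, CPU, memory)."""
    regressions = []
    for name, old_value in old_metrics.items():
        new_value = new_metrics[name]
        if new_value > old_value * (1 + tolerance):
            regressions.append((name, old_value, new_value))
    return regressions

old = {"p50_latency_ms": 110.0, "cpu_instructions": 2.1e9}
new = {"p50_latency_ms": 128.0, "cpu_instructions": 2.1e9}

for name, before, after in find_regressions(old, new):
    # The offending change would be pulled from this release until fixed.
    print(f"{name}: {before} -> {after} (excluded from the release)")
```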
Facebook’s weekly releases are staged. First is H1, which is deployed to servers with internal access only for final testing, often referred to as “pre-release” by many companies; then H2, which is deployed to thousands of servers and opened to a small number of users; if no issues are found in H2, it proceeds to H3, which is deployed to all servers.
If issues are discovered during this process, engineers will immediately fix them and restart the staged deployment. They can also choose to roll back the code, using one of two methods: typically, reverting a specific change together with its dependent files, or rolling back the entire binary package.
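Put together, the staged rollout reads as a loop over tiers with a health check between steps; any failure halts the promotion until the change is fixed or rolled back. A simplified, hypothetical sketch:

```python
# Hypothetical sketch of the H1 -> H2 -> H3 promotion. Tier descriptions
# come from the text above; the health check is a stub.

STAGES = [
    ("H1", "servers with internal access only"),      # final pre-release testing
    ("H2", "thousands of servers, a few real users"),
    ("H3", "all servers"),                            # full deployment
]

def healthy(stage: str) -> bool:
    """Stand-in for monitoring: error rates, latency, employee bug reports."""
    return True

def staged_deploy(build: str) -> bool:
    for name, scope in STAGES:
        print(f"Deploying {build} to {name} ({scope})")
        if not healthy(name):
            # Fix forward, or roll back a single change or the whole binary,
            # then restart the staged deployment from the beginning.
            print(f"Problem found at {name}; rolling back and restarting.")
            return False
    return True

staged_deploy("weekly-build")
```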
As Facebook’s release engineers put it, the greatest advantage of this “quasi-continuous” release cycle is that it forces them to develop the next generation of tools, automation, and processes needed for the company to scale.
During a release, the developers whose changes are included must be online; release engineers confirm this via an IRC bot, and if a developer is not present, their changes are rolled back. This ensures that issues can be identified and fixed quickly at the very start of a deployment. Even so, promptly spotting problems in such a large system can be difficult, so Facebook continuously monitors system health using internal tools such as Claspin as well as external signals (such as Twitter).
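The presence check amounts to a simple gate in front of the push. A toy sketch, with the IRC interface and all names invented for illustration:

```python
from dataclasses import dataclass

# Toy sketch of the "author must be on hand" gate; the real IRC bot and
# its interface are internal, so everything here is invented.

@dataclass
class PendingChange:
    revision: str
    author: str

class IrcBot:
    """Stand-in for the real IRC bot: presence is just a set lookup here."""
    def __init__(self, online_users):
        self.online = set(online_users)

    def is_online(self, user: str) -> bool:
        return user in self.online

def gate_push(changes, bot: IrcBot):
    """Keep changes whose authors confirmed presence; revert the rest."""
    kept = [c for c in changes if bot.is_online(c.author)]
    reverted = [c for c in changes if not bot.is_online(c.author)]
    return kept, reverted

bot = IrcBot(online_users={"alice"})
kept, reverted = gate_push(
    [PendingChange("r101", "alice"), PendingChange("r102", "bob")], bot)
print("shipped:", [c.revision for c in kept],
      "reverted:", [c.revision for c in reverted])
```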
Through the Gatekeeper system, engineers can easily control how many users can access specific new features, with filtering criteria based on region or age. In case of issues, they can quickly disable access to a feature. With the help of Gatekeeper, engineers can easily conduct A/B testing to rapidly gather real user experiences and adjust the product accordingly. Don’t forget, at Facebook, it is the engineers who choose what to work on, so they are likely to choose to create something and see how users respond rather than sitting in a conference room guessing what users want.
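Gatekeeper’s actual API is not public, but a feature gate of this kind boils down to per-feature user predicates, a percentage bucket, and a kill switch. A hypothetical miniature:

```python
from dataclasses import dataclass

# Hypothetical miniature of a Gatekeeper-style feature gate; the real
# system and its API are internal, so every name here is invented.

@dataclass
class User:
    id: int
    country: str
    age: int

class Gate:
    def __init__(self, rollout_percent=0, countries=None, min_age=None):
        self.rollout_percent = rollout_percent  # % of matching users let in
        self.countries = countries              # None means "any country"
        self.min_age = min_age
        self.killed = False                     # emergency off switch

    def allows(self, user: User) -> bool:
        if self.killed:
            return False
        if self.countries is not None and user.country not in self.countries:
            return False
        if self.min_age is not None and user.age < self.min_age:
            return False
        # Stable bucketing: the same user always lands in the same group,
        # which is what makes clean A/B comparisons possible.
        return user.id % 100 < self.rollout_percent

new_feed = Gate(rollout_percent=5, countries={"US"}, min_age=18)
user = User(id=4, country="US", age=25)
print(new_feed.allows(user))   # True: this user falls in the 5% bucket
# In an emergency: new_feed.killed = True  # instantly disables the feature
```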
Currently, Facebook has thousands of development engineers but no dedicated testing engineers. Every engineer can see all the code and submit patches or detailed issue descriptions. Engineers are required to write comprehensive unit tests for their code, which must pass all regression tests and support various operational tasks.
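To make “comprehensive unit tests” concrete at the scale of a single change, here is a minimal, generic example using Python’s built-in unittest (the function under test is invented):

```python
import unittest

# A minimal example of the kind of unit test every engineer is expected
# to ship alongside their code; the function under test is invented.

def truncate(text: str, limit: int) -> str:
    """Shorten text to at most `limit` characters, ending with an ellipsis."""
    if len(text) <= limit:
        return text
    return text[: limit - 1] + "…"

class TruncateTest(unittest.TestCase):
    def test_short_text_is_untouched(self):
        self.assertEqual(truncate("hi", 10), "hi")

    def test_long_text_is_truncated_with_ellipsis(self):
        self.assertEqual(truncate("hello world", 6), "hello…")
        self.assertEqual(len(truncate("hello world", 6)), 6)

if __name__ == "__main__":
    unittest.main()
```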
In addition to being responsible for their own code, they face various significant challenges, often needing to experiment with multiple solutions. For example, to address PHP performance issues, three different solutions were developed simultaneously. When the lead of one solution discovered that another was better, they would stop their work. Ultimately, HipHop emerged victorious, but the efforts of the other two teams were not in vain, as they provided important backup capabilities.
After reading about Facebook’s experience, what are your thoughts?
Can you share how your company’s software development and management processes work?