
Why did the “blue screen incident” not have an impact on China’s civil aviation industry?

2024-07-21


[Text/Observer Network columnist Zhang Zhonglin]

On July 19, local time, countless workers around the world suddenly found that their computers either showed a blue screen or could not connect to their system servers. The usually reliable fix of restarting did not help: after a reboot, the same huge blue screen came back.

The system paralysis caused by the Windows blue screen spread worldwide but was particularly severe in North America, where it seriously disrupted daily life: flights were grounded, 911 hotlines were unavailable, hotels could not check guests in, hospitals canceled surgeries, and stores could not operate. All of this originated from CrowdStrike, a cybersecurity company little known to the general public, although it has of course now become a household name.

The cause of this global "blue screen incident" is not surprising. CrowdStrike is one of the world's top companies in network security and cloud endpoint protection, and a large number of companies and cloud servers run its Falcon platform on Windows.

The incident was caused by a serious compatibility problem between CrowdStrike's latest software update and Windows, which led to a large-scale "blue screen of death" and endless reboot loops. Had it been limited to personal computers, the damage would have been manageable. But the faulty update was also applied to cloud servers (such as Microsoft's own Azure cloud service) and caused serious problems there too, which gave the "blue screen incident" a wide impact on public services. The aviation industry was the first to bear the brunt.

U.S. airlines hit by the “blue screen”

Because airlines in different countries use different information systems, the impact of the "blue screen incident" varied. At some airlines, self-service check-in was unusable and passengers could only be processed at the counter. At others, boarding passes could not be printed and had to be written by hand. And at some, everything from check-in to load control was down, leaving the airline completely unable to operate.

Airlines whose information systems rely on Microsoft Azure cloud services and Windows-based terminals were hit hardest, and the worst affected of all were information system servers running on cloud services.


That day, people finally remembered the fear of being dominated by the blue screen and the humiliation of being powerless in the face of the Windows system.

Being based in the United States, where the outage was most severe, American carriers were hit hardest by this "blue screen incident". All three major US airlines (Delta, American, and United) were affected and issued ground stops for all flights. The FAA required air traffic controllers to inform pilots that airlines were experiencing communication problems. Small and medium-sized airlines such as JetBlue, Frontier Airlines, and Spirit Airlines were also seriously affected, with key systems unavailable and a large number of flights canceled.


It can be seen that due to the system crash, the number of flights in the United States on July 19 was significantly lower than the day before.

As the main victims of the incident, Delta, American Airlines, and United Airlines canceled a large number of flights. The most affected airport was Atlanta, which has the largest passenger flow in the United States and is both the country's largest hub and Delta's base. More than 500 flights were canceled there, most of them Delta flights. Chicago O'Hare followed with nearly 200 cancellations, and New York LaGuardia canceled one-third of its flights. European airports were also badly affected: 40% of flights in and out of Amsterdam were delayed, and one-third of flights at Berlin were canceled.

Interestingly, this large-scale system failure did not affect Southwest Airlines or Alaska Airlines, nor the two air cargo companies UPS and FedEx. The reason behind this can only be described as "black humor."

Southwest Airlines' current flight control system is reportedly based on Windows 3.1, which was released in 1992, and its crew dispatch system relies on telephone calls. For Southwest, then, this wave of Windows and cloud service outages caused by a faulty update package really was a case of "the system is too outdated to be affected."

UPS and FedEx were in a similar situation. They were reportedly still running critical operational systems on Windows 95 or Windows 3.1, so they too escaped this disaster.

Most of the other unaffected US carriers were regional feeder airlines. Their information and operations systems are relatively primitive, and they cannot afford expensive cloud services, so they also escaped the disaster and operated normally. During the North American blizzard around Christmas 2022, it was Southwest's outdated system that left the airline unable to resume flight operations amid large-scale delays. Set against that episode, this incident is a reversal of fortune that proves the "high stability" advantage of "mature systems".


32-year-old Windows system prevents Southwest from running (Source: Yahoo News)

Lack of emergency response

What surprised me most about this update-triggered "blue screen incident" was that the three major US airlines simply raised the white flag and grounded all flights once their systems crashed. To me this is hard to believe, because operations control systems are critical: they govern the airlines' own daily operations, and they are also part of the country's key transportation infrastructure.

Aviation control systems of this kind are normally held to extremely high standards of reliability and resilience, so that a crash does not seriously disrupt aviation operations. The International Civil Aviation Organization (ICAO) has set out specific backup and redundancy requirements for aviation control systems in a series of documents, to avoid serious consequences from the failure of a single system. These include:

Regular backup of critical operational data is required.

Redundancy must be implemented in hardware and software, including backup servers, storage devices, etc.

A detailed disaster recovery plan must be developed to cover various catastrophic scenarios.

Critical systems (such as air traffic control systems) need to have automatic failover capabilities and synchronized operating data, so that once the main system fails, it can immediately switch to backup mode.


Looking at this "blue screen incident", it appears that those US airlines either had no disaster recovery plans or did not carry them out, and that they had not implemented automatic switching to backups after a critical system failure. Of course, it is possible that they did have backups but that the backups also blue-screened, for example because they too ran on Windows and received the same faulty update. That would be like someone who, to avoid putting all their eggs in one basket, spreads their savings across several P2P lending platforms, only to watch them all blow up together.
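The point about backups that share a platform with the primary can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the `Node` class, the platform labels, and the function names are invented for illustration and do not describe any airline's real architecture. It only shows that priority-order failover helps when the backup does not share the failure cause.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A hypothetical server; 'platform' is an illustrative label only."""
    name: str
    platform: str
    healthy: bool = True


def apply_bad_update(nodes, platform):
    """Mark every node on the given platform as failed, mimicking a faulty update."""
    for node in nodes:
        if node.platform == platform:
            node.healthy = False


def pick_active(nodes):
    """Return the first healthy node in priority order, or None if all are down."""
    for node in nodes:
        if node.healthy:
            return node
    return None


def demo():
    """Compare a same-platform backup with a backup on a different platform."""
    same_platform = [Node("primary", "windows"), Node("backup", "windows")]
    mixed_platform = [Node("primary", "windows"), Node("backup", "linux")]
    for nodes in (same_platform, mixed_platform):
        apply_bad_update(nodes, "windows")
    return pick_active(same_platform), pick_active(mixed_platform)
```

In this toy setup, `demo()` finds no usable node in the same-platform pair, while the mixed pair fails over to its backup. Real failover also involves health checks and data synchronization, which are left out here.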

As someone with extensive front-line experience, I am quite puzzled by the performance of my American colleagues this time, because airlines must have emergency plans for exactly this situation, to keep a minimum level of operation going when systems are degraded or completely unavailable. In my own front-line work, although aircraft load control is now done through information systems, every load controller keeps up the craft of drawing up load sheets by hand. If the load control system fails, we find the PDF load sheet for the aircraft type that matches the aircraft's registration number, print it out, and then do the loading and calculation manually to obtain the aircraft's takeoff data. This manual procedure is an extremely basic professional skill, and it is drilled every year, every month, and every week so that nobody fumbles at the critical moment when manual calculation is needed.
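As a rough illustration of the arithmetic behind a manual load calculation, here is a minimal Python sketch. It is a simplification under stated assumptions: real load sheets are specific to each aircraft type and use their own index units and limits, and every weight, arm, and limit below is an invented number, not data for any real aircraft. The function name `load_sheet` is also hypothetical.

```python
def load_sheet(items, max_takeoff_weight_kg):
    """Sum weights and moments for a simplified, hypothetical load sheet.

    items: list of (label, weight_kg, arm_m) tuples, where arm_m is the
    distance of that load from a reference point.
    Returns (total_weight_kg, center_of_gravity_m, within_weight_limit).
    """
    total_weight = sum(weight for _, weight, _ in items)
    total_moment = sum(weight * arm for _, weight, arm in items)
    center_of_gravity = total_moment / total_weight
    return total_weight, center_of_gravity, total_weight <= max_takeoff_weight_kg


# Invented example figures, for illustration only.
EXAMPLE_ITEMS = [
    ("empty aircraft", 40000, 15.0),
    ("passengers", 12000, 16.0),
    ("cargo", 3000, 20.0),
    ("fuel", 10000, 15.5),
]

total, cg, ok = load_sheet(EXAMPLE_ITEMS, max_takeoff_weight_kg=70000)
print(total, round(cg, 2), ok)
```

The center of gravity is simply the total moment divided by the total weight. A paper load sheet walks the load controller through the same sums by hand and then checks the result against the limits for that aircraft type.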


Manual operation is the basic skill of this industry

Other related departments are almost paranoid about emergency drills as well. Because our department works closely with check-in, we get a call from the check-in department almost every month asking us to set up a virtual flight for an emergency drill. The drill covers checking passengers in and issuing boarding passes in local mode when the TravelSky system (the civil aviation operating system used in China) is down, and even handwriting boarding passes so that passengers can board when printing is not possible.

So when I saw my American counterparts' flight operations completely paralyzed by the “blue screen incident”, with check-in and load control systems down, I was puzzled: Don’t you practice manual procedures? Don’t you have an emergency plan? Don’t you rehearse your emergency plan? Don’t you have a backup system?

Why China was not affected

This "blue screen incident" that shook the world had almost no impact on China. China's civil aviation operated completely normally. Only some flights of foreign carriers (such as American Airlines and United Airlines) were delayed because of problems abroad. The reasons are not complicated.

First, the endless "blue screen and restart" loop only occurs on terminal computers that run Windows, have CrowdStrike's security software installed, and received the faulty update. Domestic airlines' terminals generally do not use CrowdStrike's software. They are also more cautious about system updates and do not update unless necessary, and the Windows versions they run are mostly older, more mature, and more stable.

Second, most domestic airlines use the TravelSky system, which is based on Linux and does not use Microsoft's Azure cloud service or Amazon's AWS. To some extent, this spared China's key civil aviation infrastructure from a complete collapse caused by a faulty update.

The computer systems and networks operated by TravelSky are vital to the operation of China's civil aviation. They are classed as a "key basic information system" and are listed among the eight key systems supervised by the State Council. Apart from a few airlines such as Spring Airlines, all carriers use the TravelSky system. The state attaches great importance to its security and stability and supervises it strictly, which ensures that the system is stable and reliable.

Of course, this does not mean the TravelSky system never has problems. On August 25, 2020, an abnormality in the TravelSky departure system left some airports unable to check passengers in. According to the report, the abnormality began at 10:32 am and everything returned to normal at 11:07. It had some impact, but since it lasted only about 35 minutes, the impact was not major and overall operations remained stable.

The command and operation interface of the CAAC system has not changed for decades and has been criticized for it, but for key basic information systems, stable operation is paramount. Because we rely on a completely independent information system and operating environment, we avoided the "blue screen incident" and avoided becoming a laughingstock like our American counterparts.

This incident has made us more aware that, now that critical information systems have become important infrastructure, it is extremely important for them to be fully autonomous and controllable. This applies not only to information systems but also to operating systems. With the cybersecurity situation growing more severe, the need is beyond doubt. It is not only a technical choice but also a strategic requirement for national security and industrial development.


This article is an exclusive article of Guancha.com. The content of the article is purely the author's personal opinion and does not represent the platform's opinion. It cannot be reproduced without authorization, otherwise legal liability will be pursued.