
NetEase Cloud Music, WPS, and DingTalk "crash" one after another: how important is platform disaster recovery?

2024-08-24


Just as the topic "NetEase Cloud Music crashed" topped the hot search list and sparked widespread discussion among netizens, WPS and DingTalk documents also suffered application "crashes" and "downtime" one after another. In the past few days, platform applications have "crashed" repeatedly. Fortunately, they were restored to normal use within a short time, and users received some "membership compensation" along with public apologies. But once service is back to normal and the apologies and compensation have been delivered, will "crashes" and "downtime" happen again? This deserves serious reflection.
An announcement released on NetEase Cloud Music’s official Weibo account after the “crash” occurred.
Which comes first, “downtime” or “tomorrow”?
On the afternoon of August 19, many netizens posted that the NetEase Cloud Music webpage had a "502 Bad Gateway" error, and the App was unusable. It was not until two hours later that it returned to normal. NetEase Cloud Music officials said it was due to "infrastructure failure."
On the morning of August 21, netizens reported that Kingsoft Documents was also unusable and WPS shared documents could not be opened. WPS officials said in a statement that, after emergency repairs by engineers, WPS services had been restored.
Coincidentally, some netizens said that DingTalk documents also malfunctioned that afternoon. DingTalk’s official response was: “A sudden increase in usage traffic caused abnormal access to DingTalk documents for some users.”
Who would have thought that the crash of an app would become a new way to “become a hot topic” and “compete for exposure”? Some netizens joked: “I don’t know which will come first, tomorrow or ‘downtime’.” This also indirectly shows that Internet applications have been integrated into people’s daily lives, and netizens’ digital lives are deeply dependent on them.
"In recent years, large-scale app crashes have occurred frequently, including on major platforms such as Alibaba, Tencent, Baidu, Didi, Douyin, and Bilibili." Liu Juan, general manager of the Network and Data Security Research Center of CCID Consulting, said that once a failure occurs on such a large platform, it will cause the entire system to crash, and the repair work will require coordination across multiple stages and systems.
In the view of Zhang Yi, founder of Security 419, the cybersecurity incident of NetEase Cloud Music has once again exposed the existing difficulties and threats of data protection. Similar failures have become a common phenomenon in technology-driven online service platforms. Any service interruption caused by infrastructure failure will affect the user experience.
In addition, at the critical infrastructure level, software failures have often caused "crashes" in recent years. Yang Guang, chief analyst at Omdia, a global communications and IT industry research organization, said that not long ago, an update from the cybersecurity company CrowdStrike caused Windows machines around the world to blue-screen and go "down" on a large scale, causing chaos in the aviation, railway, medical and financial systems of many countries. These crashes, past and ongoing, each add a cautionary "footnote" for cybersecurity.
WPS official Weibo response
Behind the code are more "human problems"
A review of the causes of past large-scale app crashes shows that any link in an Internet business system can cause system or app problems, whether through equipment operating status, software code, or personnel handling procedures.
"Most of the time, the failures are caused by the underlying hardware, software systems and other infrastructure." Liu Juan gave examples, such as a failure in the data center or a server; programming or logic errors or unhandled exceptions during system updates and upgrades; and insufficient overall processing capacity that exhausts resources such as CPU, memory, and disk space and causes a crash.
Therefore, in her opinion, for large platforms like this, it is crucial to ensure the stability of the infrastructure. This involves issues related to the construction of internal software and hardware infrastructure, the standardization of daily operation and maintenance, as well as network protection and emergency response capabilities.
Yang Guang also believes that the frequent software crashes in recent years are closely related to the "increasing complexity of the current system." "There may be a variety of specific reasons for the frequent crashes of mobile software, but there must be some common problems, that is, the internal quality control is not done well and there are certain problems with the internal process."
"For Internet companies, the occurrence of these incidents is ultimately mainly a human problem. If companies can do a good job of process control, create a good corporate atmosphere for engineers, and balance development and security, it is expected that similar incidents can be avoided to a large extent," said Yang Guang.
Zhang Yi also mentioned that, beyond the service interruption itself, the server migration strategy and long-term stability issues behind it have prompted reflection across the industry. The incident is also a warning to other platforms to prepare adequately for technical maintenance and emergency response, continuously optimize their technical architecture, and improve operation and maintenance management, so as to reduce the risk of service interruptions and ensure a continuous, stable user experience.
On July 19, many flights were delayed or canceled at Benito Juarez International Airport in Mexico City, the capital of Mexico, and a large number of passengers were waiting at the airport. Xinhua News Agency (Photo by Francisco Canedo)
Disaster recovery services should become an important standard
The repeated occurrence of "system downtime" incidents warns us that network security and stability cannot be compromised. How to make up for the security shortcomings has become a difficult problem facing us.
"In terms of infrastructure construction, Internet companies should plan their service capabilities in advance, ensure the high availability of software and hardware equipment through design, and increase investment in system stability to ensure the continuity of system services." Liu Juan suggested that Internet companies should take a more comprehensive view of security for such products, not only to meet compliance requirements and manage legal risks, but also to start from actual business needs, taking into account data security, business security, basic security, personnel security and other aspects, and to strengthen multi-level, full-scenario network security.
DingTalk official Weibo response
She also mentioned that it is necessary to minimize security incidents such as sensitive data leakage, business interruption, and loss of system stability and availability. Security operations should become a routine, practical task, with better monitoring, early warning and emergency response capabilities, so that companies can quickly respond to, contain and recover from sudden network security incidents and ensure business continuity and data security.
Zhang Yi suggested that, based on security compliance and real threats, disaster recovery services should be made a standard configuration for enterprises to ensure business continuity and guarantee the ability to recover critical data when faced with uncontrollable risks. "As a key measure, disaster recovery construction will effectively reduce the impact of security incidents on enterprise operations and build the last line of defense for data security."
Judging from the recent "crash" and "downtime" incidents, the relevant companies have provided short-term membership compensation to users, but it is obvious that this is not a "long-term solution."
"For users, relevant compensation is very necessary, but we cannot just stay in the cycle of 'apologies and compensation after a failure, and then the failure continues'." Yang Guang said that large-scale software bearing on the national economy and people's livelihood should balance development and security. Companies should not only focus on prevention and take on their primary responsibility, but also use technology to fully guarantee the stability and security of services. In addition, industry organizations should take positive action to promote the healthy development of the industry. (Reporters Li Zhengwei, Lei Miaoxin, Li Fei; intern Liu Xinkun)
Source: Guangming.com