Are blunders like this far away from us? Not really. A while ago, the LUG server malfunctioned and sent out 70,000 text messages by mistake, draining the balance of the school’s text message platform. I only found out when a teacher from the network center called me.

The trouble started with the service monitoring script. It reads site information from the database, periodically visits the monitored sites, and sends a text message alert to the site owner when it detects a problem. When the script fails to connect to the database, it also sends an alert text message to me. Originally it did not try to reconnect to the database, so it sent that alert only once, but the monitoring service could not resume on its own after the database recovered. That bug surfaced during a blog outage, so I changed the script to reconnect automatically. I did not change the alert logic, however, so the script kept sending alerts for as long as the database stayed unreachable.
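Something like the following sketch would have avoided the repeated alerts: it remembers the last known database status and sends a message only when that status changes. It is written in Python for illustration and is not the actual monitoring script; the class `OutageAlerter`, its `report` method, and the `send_alert` callable are all hypothetical names.

```python
class OutageAlerter:
    """Sends an alert only when the database status changes."""

    def __init__(self, send_alert):
        self.send_alert = send_alert  # hypothetical callable, e.g. an SMS sender
        self.db_was_up = True         # assumption: the database is healthy at startup

    def report(self, db_is_up):
        """Record the latest connection result; alert only on a transition."""
        if db_is_up != self.db_was_up:
            if db_is_up:
                self.send_alert("database connection restored")
            else:
                self.send_alert("database connection lost")
            self.db_was_up = db_is_up


sent = []
alerter = OutageAlerter(sent.append)

# Simulate: healthy, then five failed reconnection attempts, then recovery.
for status in [True, False, False, False, False, False, True]:
    alerter.report(status)

print(sent)  # two alerts in total, not one per failed attempt
```

Five failed reconnection attempts produce a single "lost" alert, and the recovery produces a single "restored" alert.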

To prevent text message bombing, every outgoing message has to pass my “risk control”, which limits the number of text messages sent to each phone number per 24 hours. It queries the text message log table in the database for the number of messages sent to that number in the last 24 hours. When the database is down, the query yields NULL, which PHP implicitly converts to 0, so the check concludes that the limit has not been exceeded and lets the message through. The school’s text message gateway has no “risk control” of its own, so a large number of text messages flooded into the operator’s network.
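A safer check fails closed: if the count cannot be read, the request is refused. Below is a minimal sketch in Python using the standard `sqlite3` module, not the PHP code that actually runs on the server. The table name `sms_log`, its columns, the limit `DAILY_LIMIT`, and the function names are assumptions made up for this example.

```python
import sqlite3

DAILY_LIMIT = 5  # hypothetical per-number limit for a 24-hour window


def count_recent(conn, phone):
    """Return how many messages were logged for `phone` in the last 24 hours.

    Raises sqlite3.Error if the log table cannot be queried.
    """
    row = conn.execute(
        "SELECT COUNT(*) FROM sms_log "
        "WHERE phone = ? AND sent_at > datetime('now', '-1 day')",
        (phone,),
    ).fetchone()
    return row[0]


def may_send(conn, phone):
    """Fail closed: allow sending only when the count is known and under the limit."""
    try:
        count = count_recent(conn, phone)
    except sqlite3.Error:
        return False  # log unavailable: assume the worst case and refuse
    return count < DAILY_LIMIT


healthy = sqlite3.connect(":memory:")
healthy.execute("CREATE TABLE sms_log (phone TEXT, sent_at TEXT)")
healthy.execute("INSERT INTO sms_log VALUES ('555-0100', datetime('now'))")

broken = sqlite3.connect(":memory:")
broken.close()  # stands in for an unreachable database

print(may_send(healthy, "555-0100"))  # one message logged, under the limit
print(may_send(broken, "555-0100"))   # query fails, so the request is refused
```

The important part is that a failed query never turns into a count of 0. The error is handled explicitly and leads to a refusal.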

Perhaps because my phone or the operator automatically blocked the duplicate text messages, I did not receive any of them, so I did not learn about the problem right away. The result was a “blunder” of 70,000 text messages.

What are the pitfalls of this blunder?

  • The alert for a failed database connection should be sent only once, when the status changes, and not continuously for the duration of the database failure.

  • When the text message log query fails, the risk control module should not assume that 0 text messages have been sent. It should assume the worst case and reject the send request.

  • Automatic reconnection to the database was a quick addition made after the blog outage, but the modified code was never tested, and the first database failure it met after being deployed to the LUG server caused this blunder.

  • Alerts should not go out by text message alone, but preferably by text message and email together, so that if text messaging has a problem, email still gets through and a blunder like this is discovered as soon as possible. I did not receive those duplicate text messages, but I should still have been able to receive duplicate emails.
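The last point can be sketched as a small fan-out, again in Python and only as an illustration: every channel is tried, and a failure in one does not stop the others. The function `send_alert`, the channel names, and the stand-in senders are hypothetical.

```python
def send_alert(message, channels):
    """Try every channel; one failing channel must not block the others.

    Returns the names of the channels that accepted the message.
    """
    delivered = []
    for name, send in channels.items():
        try:
            send(message)
            delivered.append(name)
        except Exception:
            # A broken channel is skipped so the remaining ones still run.
            continue
    return delivered


def broken_sms(message):
    """Stand-in for a text message channel that is currently failing."""
    raise RuntimeError("SMS gateway unavailable")


inbox = []  # stand-in for an email inbox
channels = {"sms": broken_sms, "email": inbox.append}

delivered = send_alert("database connection lost", channels)
print(delivered)
print(inbox)
```

Here the text message channel fails, but the alert still arrives by email.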

Now let’s review the pitfalls of the Everbright “blunder” incident of August 16:

  • The strategic investment department was not included in risk management.

  • The “re-order” function (used to resubmit orders for stocks that were not filled) of the ETF arbitrage module in the order generation system was designed incorrectly: the “buy individual stock” function was mistakenly written as the “buy ETF basket of stocks” function.

  • The order execution system mistakenly defaulted the purchase price of market orders to “0”, so the system could not correctly check whether a market order exceeded the account’s credit limit.

  • The “re-order” function had never been used in real trading, so the serious program error went undiscovered.

  • After receiving the notice from the Shanghai Stock Exchange, Everbright took a long time to pin down where the problem was.

One is an automated monitoring system and the other is an automated trading system, yet the causes of their failures are strikingly similar. Automated systems bring us convenience, but they also hide huge risks. This malfunction was a wake-up call for me, and I hope everyone will take it as a warning.

2013-10-11