Better Embedded System SW: 2019

Friday, January 4, 2019

Counter Rollover Brings Down Rail Service

In October 2018 Hong Kong had "six hours of turmoil" in its rail service due to a signalling outage. The culprit has now been identified as counter rollover.

South China Morning Post
https://www.scmp.com/news/hong-kong/transport/article/2178723/unknown-signalling-system-incompatibility-caused-october

Summary version: a system synchronization counter had been counting up since 1996 and required a system reset when it saturated. (At least it didn't just roll over without anyone noticing.) But over the years, two different systems with slightly different counter rollover procedures were installed. When rollover time came, they disagreed with each other on the count value, paralyzing the system during the window until the second system shut down due to counter saturation. Details below are quoted from the official report (https://www.mtr.com.hk/archive/corporate/en/press_release/PR-18-108-E.pdf).

Detailed version:

"5.1.3. Data transmission between sector computers is always synchronized through an internal software counter in each sector computer. If any individual sector computer is individually rebooted, its counter will be re-initialized and will immediately synchronize to the higher counter figure for the whole synchronized network. Therefore, when the Siemens sector computers were commissioned and put into service in 2001/2002, the relevant counters were synchronized to those of the Alstom sector computers which were installed in 1996. If the counter reaches its ceiling figure, the associated sector computer will halt and need to be re-initialized. However the counter re-initialization arrangements for the two suppliers’ sector computers are different. The Alstom sector computers will be re-initialized automatically once their counters reach an inbuilt re-initialization triggering point approximately 5 hours before reaching the ceiling figure. However, this internal software function was not made known to the operators and maintainers. The Siemens sector computers do not have an automatic reinitialization function and therefore need to be manually reinitialized through rebooting in SER by maintenance staff.

5.1.4 At around 05:26 hours on the incident day, the Alstom software counters reached the triggering point for automatic re-initialization while the Siemens sector computers continued counting up, creating an inconsistent re-initialization situation between the two interconnected sector computers at KWT (Alstom) and LAT (Siemens). This resulted in repeated execution of re-initialization followed by re-synchronization with the higher counter figure from LAT, in the KWT sector computer in an endless loop causing corresponding instability in all 25 Alstom sector computers in the system.

5.1.5 When all the Siemens software counters reached the ceiling figure at around 10:22 hours, some 5 hours after the Alstom sector computers had passed their automatic re-initialization triggering point, the 8 Siemens sector computers halted as designed. Moreover, trains on the TKL had already encountered trainborne signalling failure earlier at 10:02 hours due to the around 20 minutes counter look ahead validity requirements.

5.1.6 After the interconnections between the signalling systems of the relevant lines and the Alstom and Siemens sector computers between KWT and LAT were isolated, all sector computers were effectively rebooted to complete the entire re-initialization process and the signalling system for the four incident lines resumed normal."
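
The failure mode in 5.1.3 and 5.1.4 can be reproduced in a toy model. The sketch below is only an illustration under stated assumptions, not the vendors' actual logic: the CEILING and TRIGGER values, the SectorComputer class, and the tick and simulate functions are all hypothetical, and the report does not give the real counter figures. One computer re-initializes itself at a trigger point below the ceiling, the other keeps counting until it halts at the ceiling, and any re-initialized computer immediately adopts the highest counter still on the network.

```python
from dataclasses import dataclass

# Hypothetical values: the report does not give the real ceiling or trigger figures.
CEILING = 100   # counter value at which a computer halts
TRIGGER = 80    # value at which an auto-reinitializing computer resets itself


@dataclass
class SectorComputer:
    name: str
    auto_reinit: bool        # True: resets itself at TRIGGER; False: halts at CEILING
    counter: int = 0
    halted: bool = False
    reinit_count: int = 0


def tick(network):
    """Advance every running counter by one and apply each computer's rollover rule."""
    running = [c for c in network if not c.halted]
    for c in running:
        c.counter += 1

    rebooted = []
    for c in running:
        if c.auto_reinit and c.counter >= TRIGGER:
            c.counter = 0                # automatic re-initialization
            c.reinit_count += 1
            rebooted.append(c)
        elif c.counter >= CEILING:
            c.halted = True              # needs a manual reboot to recover

    # A re-initialized computer synchronizes to the highest counter still on the network.
    live = [c.counter for c in network if not c.halted]
    for c in rebooted:
        c.counter = max(live)


def simulate(network, start, ticks):
    """Start every computer at the same counter value and run for a number of ticks."""
    for c in network:
        c.counter = start
    for _ in range(ticks):
        tick(network)
    return network


# Same rollover rule everywhere, then a mixed pair named after the KWT and LAT sites.
uniform = simulate([SectorComputer("A1", True), SectorComputer("A2", True)], 78, 5)
mixed = simulate([SectorComputer("KWT", True), SectorComputer("LAT", False)], 78, 22)
for c in uniform + mixed:
    print(c.name, c.counter, c.reinit_count, c.halted)
```

In this model a network where every computer uses the same rule re-initializes once and carries on counting from zero. In the mixed pair, the computer that resets itself is dragged back up to its peer's higher count on every tick, so it re-initializes again and again until the peer reaches the ceiling and halts. The model stops there; it does not attempt to capture the isolation and reboot steps described in 5.1.6.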

With credit for calling my attention to the report to:
Date: Sun, 30 Dec 2018 15:39:37 +0800

From: Richard Stein 
Subject: Re: MTR East Rail disruption caused by failure of both primary and backup (Stein, RISKS-30.89)

Thursday, January 3, 2019

Sometimes Bug Severity Isn't the Most Important Thing

Generally, when doing bug triage you need to take into account both the consequence of a software defect and how often it occurs. (See: Using a Risk Analysis Table to Categorize Bug Priority)

But an important special case is one in which the consequence is a business consequence such as brand tarnish rather than a spectacular software crash. I used to use a hypothetical example of the audience's company name being misspelled on the system display to illustrate the point. Well, it's not hypothetical any more!

Lamborghini sells a quarter-million-dollar SUV with numerous software defects, including spelling the company name as "Lanborghini." Guess which defect gets the press?
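
One way to fold this into triage is to score each defect by the worse of its technical and business consequences before combining that with how often it occurs. The sketch below is a hypothetical illustration only: the three-step LEVELS scale, the additive score, and the two example defects are assumptions made for this example, and the risk analysis table referenced above may use different categories.

```python
LEVELS = ("low", "medium", "high")  # hypothetical three-step scale


def risk_score(consequence: str, frequency: str) -> int:
    """Combine consequence and frequency into a score from 0 (lowest) to 4 (fix first)."""
    return LEVELS.index(consequence) + LEVELS.index(frequency)


def triage(technical: str, business: str, frequency: str) -> int:
    """Score a defect by the worse of its technical and business consequences."""
    worst = max(technical, business, key=LEVELS.index)
    return risk_score(worst, frequency)


# Hypothetical ratings: a misspelled brand name that is harmless to the software
# but assumed to be seen often by customers.
misspelling = triage(technical="low", business="high", frequency="high")

# Hypothetical ratings: a crash that is assumed to occur only rarely.
rare_crash = triage(technical="high", business="medium", frequency="low")

print(misspelling, rare_crash)
```

With these assumed ratings the misspelling outranks the rare crash, which is the point: a defect with no technical severity at all can still deserve the top of the list.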