Saturday, August 1, 2020

LINT does not do peer reviews

Once in a while I run into developers who think that peer review can be completely automated by using a good static analysis (generically "lint" or compiler warnings).  In other words, run PC-LINT (or whatever), and when you have no warnings peer review is done.


But the reality has some nuance, so here's how I see it.

There are two critical aspects to style:
  (1) coding style for compilers  (will the compiler generate the code you're expecting)
  (2) coding style for humans   (can a human read the code)

A good static analysis tool is good at #1.  Should you run a static analysis tool?  Absolutely.  Pick a good tool.  Or at least do better than -Wall for Gcc (hint, "all" doesn't mean what you think it means (*see note below)).  When your code compiles clean with all relevant warnings turned on, only then is it time for a human peer review.

For #2, capabilities vary widely, and no automated tool can evaluate many aspects of good human-centric coding style.  (Can they use heuristics to help with #1?  Sure.  Can they replace a human?  Not anytime soon.)

My peer review checklist template has a number of items that fall into the #1 bin. The reason is that it is common for embedded software teams to not use static analysis at all, or to use inadequate settings. So the basics are there.  As they become more sophisticated at static analysis, they should delete the automated checks (subsuming them into item #0 -- has static analysis been done?).  Then they should add additional items they've found from experience are relevant to them to re-fill the list to a couple dozen total items.

Summarizing: static analysis tools don't automate peer reviews. They automate a useful piece of them if you are warning-free, but they are no substitute for human judgement about whether your code is understandable and likely to meet its requirements.

* Note: in teaching I require these gcc flags for student projects:
-Werror -Wextra -Wall -Wfloat-equal -Wconversion -Wparentheses -pedantic -Wunused-parameter -Wunused-variable -Wreturn-type -Wunused-function -Wredundant-decls -Wreturn-type -Wunused-value -Wswitch-default -Wuninitialized -Winit-self

Friday, January 4, 2019

Counter Rollover Brings Down Rail Service

In October 2018 Hong Kong had "six hours of turmoil" in their rail service due to as signalling outage. The culprit has now been identified as counter roll-over.

South China Morning Post

Summary version: a system synchronization counter had been counting away since 1996 and required a system reset when it saturated.  (At least it didn't just roll over without anything noticing.)  But over the years two different systems with slightly different counter roll-over procedures were installed.  When rollover time came, they disagreed with each other on count value, paralyzing the system during the window until the second system shut down due to counter saturation.  Details below quoted from the official report. (

The Detailed version:
"5.1.3. Data transmission between sector computers is always synchronized through an internal software counter in each sector computer. If any individual sector computer is individually rebooted, its counter will be re-initialized and will immediately synchronize to the higher counter figure for the whole synchronized network. Therefore, when the Siemens sector computers were commissioned and put into service in 2001/2002, the relevant counters were synchronized to those of the Alstom sector computers which were installed in 1996. If the counter reaches its ceiling figure, the associated sector computer will halt and need to be re-initialized. However the counter re-initialization arrangements for the two suppliers’ sector computers are different. The Alstom sector computers will be re-initialized automatically once their counters reach an inbuilt re-initialization triggering point approximately 5 hours before reaching the ceiling figure. However, this internal software function was not made known to the operators and maintainers. The Siemens sector computers do not have an automatic reinitialization function and therefore need to be manually reinitialized through rebooting in SER by maintenance staff.  
5.1.4 At around 05:26 hours on the incident day, the Alstom software counters reached the triggering point for automatic re- initialization while the Siemens sector computers continued counting up, creating an inconsistent re-initialization situation between the two interconnected sector computers at KWT (Alstom) and LAT (Siemens). This resulted in repeated execution of re-initialization followed by re-synchronization with the higher counter figure from LAT, in the KWT sector computer in an endless loop causing corresponding instability in all 25 Alstom sector computers in the system.  
5.1.5 When all the Siemens software counters reached the ceiling figure at around 10:22 hours, some 5 hours after the Alstom sector computers had passed their automatic re-initialization triggering point, the 8 Siemens sector computers halted as designed. Moreover, trains on the TKL had already encountered trainborne signalling failure earlier at 10:02 hours due to the around 20 minutes counter look ahead validity requirements. 
5.1.6 After the interconnections between the signalling systems of the relevant lines and the Alstom and Siemens sector computers between KWT and LAT were isolated, all sector computers were effectively rebooted to complete the entire re-initialization process and the signalling system for the four incident lines resumed normal. "
With credit for calling my attention to the report to:
Date: Sun, 30 Dec 2018 15:39:37 +0800
From: Richard Stein 
Subject: Re: MTR East Rail disruption caused by failure of both primary 
 and backup (Stein, RISKS-30.89)

Thursday, January 3, 2019

Sometimes Bug Severity Isn't the Most Important Thing

Generally you need to take into account both the consequence of a software defect as well as how often it occurs when doing bug triage.  (See: Using a Risk Analysis Table to Categorize Bug Priority)

But an important special case is one in which the consequence is a business consequence such as brand tarnish rather than a spectacular software crash.   I used to use a hypothetical example of the audience's company name being misspelled on the system display to illustrate the point.  Well, it's not hypothetical any more!

Lamborghini sells a quarter-million dollar SUV with numerous software defects, including spelling the company name as "Lanborghini"   Guess which defect gets the press?

And it turns out that a software update not only didn't solve the typo, but also broke a bunch more functionality.  

Tuesday, October 2, 2018

Cost of highly safety critical software

It's always interesting to see data on industry software costs. I recently came across a report on software costs for the aviation industry. The context was flight-critical radio communications, but the safety standards discussed were DO-178B and DO-254, which apply to flight controls as well.

Here's the most interesting picture from the report for my purposes:

(Source: Page 28 )

Translating from DO-178B terminology, this means:

  • DAL A  (failure would be "catastrophic"):  3 - 12 SLOC/day
  • DAL B  (failure would be "hazardous"): 8 - 20 SLOC/day
  • DAL C (failure would be "major"): 15 - 40 SLOC/day
  • DAL D (failure would be "minor"): 25 - 64 SLOC/day
Worth noting is that, in my experience, really solid mission critical but NOT life-critical embedded software can be done at up to 16 SLOC per day for well-run experienced teams, so it tends to line up with DAL B costs.

For interpretation, "DAL" expresses a criticality level (a "Development Assurance Level"), with more critical software requiring more rigorous processes.  The document has quite a lot to say about how the engineering process works, and is worth a read if you want to see how the aviation folks do business.  (I'm aware that DO-178C is out, but this paper talks about the older "B" version.)    Note that there are other cost models in the paper that are less pessimistic in that report, but this is the one that says "industry experience."

Have you found other cost of software data for embedded or mission critical systems?

Monday, October 1, 2018

A Million Page Views

Blogger analytics say that I just hit 1 million page views.

For fun, the top 10 countries reading this blog in rank order are:
  • USA (about half the total views)
  • Germany (almost tied with India at more than 10%)
  • India (almost tied with Germany at more than 10%)
  • Russia
  • United Kingdom
  • Canada
  • France
  • Ukraine
  • Poland
  • Brazil
I look forward to continuing to post articles about embedded system software, as well as on my Safe Autonomy Blog

Thanks to all the readers who made this happen!