Saturday, January 9, 2021

The Y2038 Problem. Sooner than you think.

In the coming years, there will be other time rollovers beyond Y2K. The next big one isn't all that far away.

Contrary to what you might have heard, the reason more computers didn't break on Jan 1st 2000 wasn't because it was a false alarm. It was because massive resources were poured into avoiding many of the problems.  And many things did in fact break, but backup plans were in place.  (I recall not getting financial reports for most of 2000 for my spending accounts at work.  So I had to keep my own books and hope I didn't overspend -- because the old accounting system expired at the end of 1999 and the new one wasn't on-line until Fall 2000.)

In January 2021 we saw some aftershocks when a two-digit year windowing hack used in some Y2K patches ran out of steam.  But the world didn't come to an end.

The next potentially huge time problem will be January 19, 2038, when the 32-bit signed Unix time in seconds rolls over.

Plenty of embedded systems last 20+ years (already we are closer than that to 2038).  Plenty of embedded systems are using 32-bit Unix, since 64-bit CPUs just cost too much for the proverbial toaster oven.  An increasing number of systems are updatable, but many require manual intervention.  Updating your DVD player (if we still have them in 2038) won't be so bad.  Updating a natural gas pipeline valve in the middle of nowhere -- not as much fun.  Updating all your smart light bulbs will range from tedious to buying all new light bulbs.  And so on.

This is a good time for embedded system designers to decide what their game plan is for Y2038.  As your expected product life starts overlapping with that (as I write this, it's only 17 years away), you're accumulating technical debt that will come due in a big chunk that year.  Better to have a plan now than a panic later.  Later has a way of sneaking up on you when you're not looking.

For a more detailed list of timer rollover issues, see:

Tuesday, January 5, 2021

62 Software Experience Lessons by Karl Wiegers

Karl Wiegers has an essay about lessons he's learned from a long career in software development. You should benefit from his experience. The essay covers requirements, project management, quality, process improvement, and other insights.

A good example from the article is:

"You don’t have time to make every mistake that every software practitioner before you has already made. Read and respect the literature. Learn from your colleagues. Share your knowledge freely with others." 

Saturday, August 1, 2020

LINT does not do peer reviews

Once in a while I run into developers who think that peer review can be completely automated by using a good static analysis tool (generically "lint," or even just compiler warnings).  In other words, run PC-LINT (or whatever), and when you have no warnings peer review is done.


But the reality has some nuance, so here's how I see it.

There are two critical aspects to style:
  (1) coding style for compilers  (will the compiler generate the code you're expecting)
  (2) coding style for humans   (can a human read the code)

A good static analysis tool is good at #1.  Should you run a static analysis tool?  Absolutely.  Pick a good tool.  Or at least do better than -Wall for gcc (hint: "all" doesn't mean what you think it means; *see note below).  When your code compiles clean with all relevant warnings turned on, only then is it time for a human peer review.

For #2, capabilities vary widely, and no automated tool can evaluate many aspects of good human-centric coding style.  (Can they use heuristics to help with #1?  Sure.  Can they replace a human?  Not anytime soon.)

My peer review checklist template has a number of items that fall into the #1 bin. The reason is that it is common for embedded software teams to not use static analysis at all, or to use inadequate settings. So the basics are there.  As teams become more sophisticated at static analysis, they should delete the automated checks (subsuming them into item #0 -- has static analysis been done?).  Then they should re-fill the list with additional items that experience has shown are relevant to them, up to a couple dozen total items.

Summarizing: static analysis tools don't automate peer reviews. They automate a useful piece of them if you are warning-free, but they are no substitute for human judgement about whether your code is understandable and likely to meet its requirements.

* Note: in teaching I require these gcc flags for student projects:
-Werror -Wextra -Wall -Wfloat-equal -Wconversion -Wparentheses -pedantic -Wunused-parameter -Wunused-variable -Wreturn-type -Wunused-function -Wredundant-decls -Wunused-value -Wswitch-default -Wuninitialized -Winit-self

Friday, January 4, 2019

Counter Rollover Brings Down Rail Service

In October 2018 Hong Kong had "six hours of turmoil" in its rail service due to a signalling outage. The culprit has now been identified as counter roll-over.

South China Morning Post

Summary version: a system synchronization counter had been counting away since 1996 and required a system reset when it saturated.  (At least it didn't just roll over without anyone noticing.)  But over the years two different systems with slightly different counter roll-over procedures were installed.  When rollover time came, they disagreed with each other on the count value, paralyzing the system during the window until the second system shut down due to counter saturation.  Details below, quoted from the official report.

The Detailed version:
"5.1.3. Data transmission between sector computers is always synchronized through an internal software counter in each sector computer. If any individual sector computer is individually rebooted, its counter will be re-initialized and will immediately synchronize to the higher counter figure for the whole synchronized network. Therefore, when the Siemens sector computers were commissioned and put into service in 2001/2002, the relevant counters were synchronized to those of the Alstom sector computers which were installed in 1996. If the counter reaches its ceiling figure, the associated sector computer will halt and need to be re-initialized. However the counter re-initialization arrangements for the two suppliers’ sector computers are different. The Alstom sector computers will be re-initialized automatically once their counters reach an inbuilt re-initialization triggering point approximately 5 hours before reaching the ceiling figure. However, this internal software function was not made known to the operators and maintainers. The Siemens sector computers do not have an automatic reinitialization function and therefore need to be manually reinitialized through rebooting in SER by maintenance staff.  
5.1.4 At around 05:26 hours on the incident day, the Alstom software counters reached the triggering point for automatic re- initialization while the Siemens sector computers continued counting up, creating an inconsistent re-initialization situation between the two interconnected sector computers at KWT (Alstom) and LAT (Siemens). This resulted in repeated execution of re-initialization followed by re-synchronization with the higher counter figure from LAT, in the KWT sector computer in an endless loop causing corresponding instability in all 25 Alstom sector computers in the system.  
5.1.5 When all the Siemens software counters reached the ceiling figure at around 10:22 hours, some 5 hours after the Alstom sector computers had passed their automatic re-initialization triggering point, the 8 Siemens sector computers halted as designed. Moreover, trains on the TKL had already encountered trainborne signalling failure earlier at 10:02 hours due to the around 20 minutes counter look ahead validity requirements. 
5.1.6 After the interconnections between the signalling systems of the relevant lines and the Alstom and Siemens sector computers between KWT and LAT were isolated, all sector computers were effectively rebooted to complete the entire re-initialization process and the signalling system for the four incident lines resumed normal. "
Credit for calling my attention to the report:
Date: Sun, 30 Dec 2018 15:39:37 +0800
From: Richard Stein 
Subject: Re: MTR East Rail disruption caused by failure of both primary 
 and backup (Stein, RISKS-30.89)

Thursday, January 3, 2019

Sometimes Bug Severity Isn't the Most Important Thing

Generally you need to take into account both the consequence of a software defect as well as how often it occurs when doing bug triage.  (See: Using a Risk Analysis Table to Categorize Bug Priority)

But an important special case is one in which the consequence is a business consequence such as brand tarnish rather than a spectacular software crash.   I used to use a hypothetical example of the audience's company name being misspelled on the system display to illustrate the point.  Well, it's not hypothetical any more!

Lamborghini sells a quarter-million-dollar SUV with numerous software defects, including spelling the company name as "Lanborghini."  Guess which defect gets the press?

And it turns out that a software update not only failed to fix the typo, but also broke a bunch more functionality.