Tuesday, February 2, 2021

Better Embedded System Software e-Book & Paperback

There are only a handful of hardcover books left of the first edition, so I spent some time converting things over to an eBook & Paperback edition.


This is not a 2nd edition, but more like version 1.1.  The changes are:

  • Some minor rewording and cleanup.
  • A few small sections rewritten to reflect lessons I've learned about how to better explain things from teaching courses.  However, scope remains the same and the hardcover book is still serviceable if you already have that.
  • A new summary list of high-level takeaways in the conclusions chapter.
  • Publication support for everywhere KDP reaches, with local distribution in all supported markets.
The paperback is probably what you'd expect given the above, and is Print-On-Demand with production handled directly by Amazon.  There is no index due to publication platform issues. However, the table of contents is pretty well structured and in most cases that will get you where you need to go.  The price is significantly lower than the hardcover, and non-US readers can get it printed and shipped from someplace much closer to home since KDP has local POD for markets in Europe and Asia.

Amazon indicates they will print-on-demand for the following markets: US, DE, ES, FR, IT, UK, JP and CA.  Some readers report availability in other markets as well (for example, AU). So try your usual Amazon marketplace first, then one of the others close to you to minimize shipping cost.

The eBook is reflowable text for the body text (authored in EPUB format, but Amazon changes formats I believe).  Bitmaps are used for figures and equations, so it should look fine on most viewing devices without symbol font issues.  The price is significantly lower than the paperback since production and distribution costs are much lower.

Amazon has worldwide distribution rights for the eBook, but availability varies based on which country your device is set up for.  If your Kindle is set up as being in the US market then you should have no trouble purchasing.  Many other markets (especially in Europe) should be fine as well.  Amazon promises eBook availability in these specific markets: US (.com), IN, UK, DE, FR, ES, IT, NL, JP, BR, CA, MX, AU.

The book is published via KDP, but Digital Rights Management (DRM) is OFF.  That should help folks with non-Amazon viewers (but I'm not able to provide support for how to side-load onto whatever platform).  If you have Kindle or a machine that runs the Kindle App then it should be seamless as with any other Kindle book.

I really appreciate the support of the thousands of readers of the hardcover edition over the past years. I hope that this makes the material more broadly available!

Saturday, January 9, 2021

The Y2038 Problem. Sooner than you think.

In the coming years, there will be other time rollovers beyond Y2K. The next big one isn't all that far away.

Contrary to what you might have heard, the reason more computers didn't break on Jan 1st 2000 wasn't because it was a false alarm. It was because massive resources were poured into avoiding many of the problems.  And many things did in fact break, but backup plans were in place.  (I recall not getting financial reports for most of 2000 for my spending accounts at work.  So I had to keep my own books and hope I didn't overspend -- because the old accounting system expired at the end of 1999 and the new one wasn't on-line until Fall 2000.)

In January 2021 we saw some aftershocks when a two-digit-year windowing hack from Y2K patches ran out of steam.  But the world didn't come to an end.

The next potentially huge time problem will be January 2038 when the 32-bit signed Unix time in seconds rolls over.  
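To make the rollover concrete, here is a minimal C sketch (my own illustration, not taken from any particular system). It assumes the host's time_t can hold 2^31 - 1 and that the compiler wraps on the unsigned-to-signed conversion, as gcc and clang do. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Last second a signed 32-bit Unix timestamp can represent, broken out in UTC.
   gmtime() returns a pointer to static storage, so the result is copied. */
struct tm last_32bit_second_utc(void)
{
    time_t t = (time_t)INT32_MAX;   /* 2147483647 seconds after 1970-01-01 */
    struct tm *utc = gmtime(&t);
    assert(utc != NULL);
    return *utc;
}

/* One more tick on a 32-bit signed seconds counter. The add is done in
   unsigned arithmetic to avoid signed-overflow undefined behavior. The
   conversion back is implementation-defined; gcc and clang wrap to
   INT32_MIN, which reads as a date in December 1901. */
int32_t next_second_32(int32_t now)
{
    return (int32_t)((uint32_t)now + 1u);
}
```

Calling last_32bit_second_utc() should give 2038-01-19 03:14:07 UTC, and next_second_32(INT32_MAX) comes back negative on those compilers.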

Plenty of embedded systems last 20+ years (already we are closer than that to 2038).  Plenty of embedded systems are using 32-bit Unix, since 64-bit CPUs just cost too much for the proverbial toaster oven.  An increasing number of systems are updatable, but many require manual intervention.   Updating your DVD player (if we still have them in 2038) won't be so bad.  Updating a natural gas pipeline valve in the middle of nowhere -- not as fun.   Updating all your smart light bulbs will range from tedious to buying all new lightbulbs. And so on.

This is a good time for embedded system designers to decide what their game plan is for Y2038.  As your expected product life starts overlapping with that (as I write this, it's only 17 years away), you're accumulating technical debt that will come due in a big chunk that year.  Better to have a plan now than a panic later.  Later has a way of sneaking up on you when you're not looking.
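One possible first step in such a plan is simply auditing how wide your time values are. The sketch below is an assumption-laden starting point, not a complete audit: it assumes 8-bit bytes and an integer time_t, and the helper names are made up for illustration. It only covers the platform's time_t; timestamps stored in files, messages, or your own 32-bit variables need the same question asked of them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* True if an integer of this size can count seconds past 2^31 - 1.
   Assumes 8-bit bytes. A signed type loses one bit to the sign. */
bool width_survives_2038(size_t bytes, bool is_signed)
{
    assert(bytes > 0u);
    size_t value_bits = bytes * 8u - (is_signed ? 1u : 0u);
    return value_bits >= 32u;
}

/* Same check applied to this platform's time_t (assumed to be an integer type). */
bool time_t_survives_2038(void)
{
    bool is_signed = ((time_t)-1 < (time_t)0);
    return width_survives_2038(sizeof(time_t), is_signed);
}
```

Note that an unsigned 32-bit count passes this check but only buys time until 2106, so it is a deferral rather than a fix.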

For a more detailed list of timer rollover issues, see:


Tuesday, January 5, 2021

62 Software Experience Lessons by Karl Wiegers

Karl Wiegers has an essay about lessons he's learned from a long career in software development. You should benefit from his experience. The essay covers requirements, project management, quality, process improvement, and other insights.


A good example from the article is:

"You don’t have time to make every mistake that every software practitioner before you has already made. Read and respect the literature. Learn from your colleagues. Share your knowledge freely with others." 

Saturday, August 1, 2020

LINT does not do peer reviews


Once in a while I run into developers who think that peer review can be completely automated by using a good static analysis tool (generically "lint" or compiler warnings).  In other words, run PC-LINT (or whatever), and when you have no warnings peer review is done.


But the reality has some nuance, so here's how I see it.

There are two critical aspects to style:
  (1) coding style for compilers  (will the compiler generate the code you're expecting)
  (2) coding style for humans   (can a human read the code)

A good static analysis tool is good at #1.  Should you run a static analysis tool?  Absolutely.  Pick a good tool.  Or at least do better than -Wall for gcc (hint, "all" doesn't mean what you think it means (*see note below)).  When your code compiles clean with all relevant warnings turned on, only then is it time for a human peer review.

For #2, capabilities vary widely, and no automated tool can evaluate many aspects of good human-centric coding style.  (Can they use heuristics to help with #2?  Sure.  Can they replace a human?  Not anytime soon.)

My peer review checklist template has a number of items that fall into the #1 bin. The reason is that it is common for embedded software teams to not use static analysis at all, or to use inadequate settings. So the basics are there.  As they become more sophisticated at static analysis, they should delete the automated checks (subsuming them into item #0 -- has static analysis been done?).  Then they should add additional items they've found from experience are relevant to them to re-fill the list to a couple dozen total items.

Summarizing: static analysis tools don't automate peer reviews. They automate a useful piece of them if you are warning-free, but they are no substitute for human judgement about whether your code is understandable and likely to meet its requirements.

* Note: in teaching I require these gcc flags for student projects:
-Werror -Wextra -Wall -Wfloat-equal -Wconversion -Wparentheses -pedantic -Wunused-parameter -Wunused-variable -Wreturn-type -Wunused-function -Wredundant-decls -Wunused-value -Wswitch-default -Wuninitialized -Winit-self
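
As a small illustration of why -Wall alone is not enough: to the best of my knowledge, neither -Wall nor -Wextra turns on -Wconversion or -Wfloat-equal in gcc, so the two "risky" functions below compile silently under -Wall while the longer flag list above rejects them. The function names are invented for this sketch, and it is only an example of the kind of thing the extra flags catch.

```c
#include <assert.h>
#include <stdint.h>

/* Flagged by -Wconversion: the int result is silently narrowed to 8 bits. */
uint8_t double_risky(int raw)
{
    uint8_t out = raw * 2;
    return out;
}

/* Clean under -Wconversion: the range is handled and the narrowing is explicit. */
uint8_t double_clamped(int raw)
{
    if (raw <= 0) { return 0; }
    if (raw > UINT8_MAX / 2) { return UINT8_MAX; }
    return (uint8_t)(raw * 2);
}

/* Flagged by -Wfloat-equal: exact comparison of floating point values. */
int equal_risky(double a, double b)
{
    return a == b;
}

/* Clean: compare within a tolerance chosen by the caller. */
int equal_within(double a, double b, double tol)
{
    assert(tol >= 0.0);
    double diff = a - b;
    if (diff < 0.0) { diff = -diff; }
    return diff <= tol;
}
```

The tool can tell you the narrowing and the exact comparison are there. Whether clamping is the right behavior, or what tolerance makes sense, is still a question for a human reviewer.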

Friday, January 4, 2019

Counter Rollover Brings Down Rail Service

In October 2018 Hong Kong had "six hours of turmoil" in their rail service due to a signalling outage. The culprit has now been identified as counter roll-over.

South China Morning Post

Summary version: a system synchronization counter had been counting away since 1996 and required a system reset when it saturated.  (At least it didn't just roll over without anything noticing.)  But over the years two different systems with slightly different counter roll-over procedures were installed.  When rollover time came, they disagreed with each other on count value, paralyzing the system during the window until the second system shut down due to counter saturation.  Details below quoted from the official report. (https://www.mtr.com.hk/archive/corporate/en/press_release/PR-18-108-E.pdf)

Detailed version:
"5.1.3. Data transmission between sector computers is always synchronized through an internal software counter in each sector computer. If any individual sector computer is individually rebooted, its counter will be re-initialized and will immediately synchronize to the higher counter figure for the whole synchronized network. Therefore, when the Siemens sector computers were commissioned and put into service in 2001/2002, the relevant counters were synchronized to those of the Alstom sector computers which were installed in 1996. If the counter reaches its ceiling figure, the associated sector computer will halt and need to be re-initialized. However the counter re-initialization arrangements for the two suppliers’ sector computers are different. The Alstom sector computers will be re-initialized automatically once their counters reach an inbuilt re-initialization triggering point approximately 5 hours before reaching the ceiling figure. However, this internal software function was not made known to the operators and maintainers. The Siemens sector computers do not have an automatic reinitialization function and therefore need to be manually reinitialized through rebooting in SER by maintenance staff.  
5.1.4 At around 05:26 hours on the incident day, the Alstom software counters reached the triggering point for automatic re-initialization while the Siemens sector computers continued counting up, creating an inconsistent re-initialization situation between the two interconnected sector computers at KWT (Alstom) and LAT (Siemens). This resulted in repeated execution of re-initialization followed by re-synchronization with the higher counter figure from LAT, in the KWT sector computer in an endless loop causing corresponding instability in all 25 Alstom sector computers in the system.
5.1.5 When all the Siemens software counters reached the ceiling figure at around 10:22 hours, some 5 hours after the Alstom sector computers had passed their automatic re-initialization triggering point, the 8 Siemens sector computers halted as designed. Moreover, trains on the TKL had already encountered trainborne signalling failure earlier at 10:02 hours due to the around 20 minutes counter look ahead validity requirements. 
5.1.6 After the interconnections between the signalling systems of the relevant lines and the Alstom and Siemens sector computers between KWT and LAT were isolated, all sector computers were effectively rebooted to complete the entire re-initialization process and the signalling system for the four incident lines resumed normal. "
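
The loop described in 5.1.3 and 5.1.4 can be reproduced with a toy two-node model. This is only a sketch of the mechanism as I read the report, not the actual signalling software: the ceiling and trigger values, the one-count-per-tick timing, and all names are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CEILING 1000u   /* hypothetical: counter value at which a node halts */
#define TRIGGER  900u   /* hypothetical: auto re-initialization point        */

typedef struct {
    uint32_t count;
    bool     halted;
    uint32_t reinits;
} node_t;

/* One tick. The "auto" node re-initializes itself at TRIGGER and then
   synchronizes to the higher counter on the network. The "manual" node
   just counts up and halts at CEILING. */
static void tick(node_t *auto_node, node_t *manual_node)
{
    if (!manual_node->halted) {
        manual_node->count++;
        if (manual_node->count >= CEILING) { manual_node->halted = true; }
    }
    auto_node->count++;
    if (auto_node->count >= TRIGGER) {
        auto_node->count = 0u;              /* re-initialize */
        auto_node->reinits++;
        if (!manual_node->halted && manual_node->count > auto_node->count) {
            auto_node->count = manual_node->count;   /* sync to higher value */
        }
    }
}

/* Run both nodes from the same starting count until the manual node halts.
   Returns how many times the auto node re-initialized along the way. */
uint32_t simulate_mismatch(uint32_t start)
{
    assert(start < TRIGGER);
    node_t auto_node   = { start, false, 0u };
    node_t manual_node = { start, false, 0u };
    while (!manual_node.halted) { tick(&auto_node, &manual_node); }
    return auto_node.reinits;
}
```

With these made-up constants, simulate_mismatch(898) reports 101 re-initializations, one per tick, until the manual node halts and stops dragging the count back up. Each node's procedure is reasonable on its own; the trouble comes from combining them.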
With credit for calling my attention to the report to:
Date: Sun, 30 Dec 2018 15:39:37 +0800
From: Richard Stein 
Subject: Re: MTR East Rail disruption caused by failure of both primary 
 and backup (Stein, RISKS-30.89)