TSF Rationale

Context

Typical arguments for critical systems treat software differently from hardware. Whereas hardware is expected to suffer both random and systematic failures, only systematic failures are considered for software.

In practice this has led safety engineers to focus on processes and practices which have not kept pace with industry norms, and are arguably no longer fit for purpose in our rapidly evolving environment. Whereas most software organisations claim to be "Agile" and rely heavily on open source, critical systems practitioners are still advocating "Waterfall" and relying on techniques (and in some cases even tools and software) that fell out of mainstream use decades ago.

The world has changed significantly, but the standards (and indeed the techniques used to devise and maintain the standards) are not keeping up.

Where are we now?

Any promises we hope to make about software must be made with the knowledge that:

  • most modern target hardware is complex and non-deterministic, and comes bundled with a huge amount of firmware (which is just hidden, non-certified, binary software)
  • complex software contains bugs, is non-deterministic and evolves rapidly
  • the external environment for software is also evolving
  • it is not feasible to specify fully the behaviour of complex systems
  • developing new software at scale is obviously risky, usually much more risky than reusing existing code that is widely used and actively maintained
  • modern systems are connected to networks, and thus subject to evolving security threats which must be mitigated throughout their product lifetime

As a result, we must consider that:

  • we cannot hope to achieve 100% confidence in most software, particularly complex software running on multicore processors
  • software cannot be considered to be 100% "safe" or "secure" or "reliable" or "bug free"

So how do we do better?

In our view, the best we can (and should) realistically aim for is to:

  • analyse the specific Behaviours we require from software in a specific context, i.e. running on specific hardware, for a specific set of use-cases
  • analyse what could go wrong in that context
  • devise fixes and/or appropriate monitoring and mitigations
  • demonstrate that the software provides the Expected Behaviours
  • demonstrate that things typically don't go wrong
  • demonstrate that mitigations work as expected when things do go wrong
  • measure our confidence in the above
  • be ready to repeat the above every time we need to change the software, the hardware, or both

We consider that delivery of software for critical systems must involve identification and management of the risks associated with the development, integration, release and maintenance of the software.

Further, we consider that delivery is not complete without appropriate documentation and systems in place to review and mitigate those risks.

The Eclipse Trustable Software Framework provides a basis to help us, and our customers, to manage these risks as we understand them. Broadly, the approach is to consider supply chain and tooling risks as well as the risks inherent in pre-existing or newly developed software, and to apply statistical methods to measure confidence in the whole solution. We believe that this is most usefully applied at the integration level, which is where the problems will usually be noticed.

Our approach is to:

  • make specific promises, in a specific context
  • devise methods to show that the promises are usually met
  • verify that these methods are reporting truthfully
  • analyse the ways that our promises may be broken, and either fix or mitigate them
  • measure how often the promises are broken, in engineering and in production
  • calculate confidence values based on the measurements, for each software release (see the sketch after this list)
  • provide all of the evidence and tooling for the above, along with the source code, to our customers, so that they can incorporate our work into their overall solution and make their own promises
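
As an illustration of how a per-release confidence value might be calculated from measurements (the choice of the Clopper-Pearson bound, the function name and the figures below are assumptions of this sketch, not something TSF mandates), one option is a one-sided lower bound on the probability that a given promise is met, derived from repeated test runs:

    # Illustrative sketch only: one way to turn test measurements into a
    # per-release confidence value. The Clopper-Pearson bound and the example
    # figures are assumptions of this sketch, not requirements of TSF.
    from scipy.stats import beta

    def promise_confidence(passes: int, runs: int, confidence: float = 0.95) -> float:
        """One-sided Clopper-Pearson lower bound on the probability that a
        promise is met, given `passes` successful runs out of `runs` total."""
        if passes == 0:
            return 0.0
        alpha = 1.0 - confidence
        return beta.ppf(alpha, passes, runs - passes + 1)

    # Example: a promise was exercised 2000 times in CI and met 1998 times.
    print(f"{promise_confidence(1998, 2000):.4f}")  # ~0.997

A bound of this kind is deliberately conservative: it penalises small sample sizes, so confidence in a release can only be claimed on the strength of the measurements actually made.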

So for critical software to be considered 'Trustable', we suggest it must be delivered subject to the following constraints:

  • risks/hazards associated with the planned use of the software are analysed
  • Expected Behaviours are explicitly documented
  • prohibited Misbehaviours are explicitly documented
  • Expected Behaviours are shown to be provided by tests (see the sketch after this list)
  • test procedures and results are verified
  • prohibited Misbehaviours are shown to be absent, mitigated or fixed
  • process artifacts and test results are captured as evidence
  • evidence is analysed, distilled and presented with confidence values for each release
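
As a minimal, hypothetical illustration of documenting an Expected Behaviour and showing that it is provided by a test (the identifier scheme EB-0007 and the toy function under test are inventions for this sketch, not part of TSF):

    # Hypothetical sketch: an Expected Behaviour recorded against a stable
    # identifier and demonstrated by an automated test whose result can be
    # captured as evidence. The ID scheme and the toy function under test
    # are assumptions of this sketch, not part of TSF.

    # EB-0007: "Speed readings above the configured limit raise the over-speed alarm."
    OVER_SPEED_LIMIT_KMH = 120.0

    def over_speed_alarm(speed_kmh: float) -> bool:
        """Toy stand-in for real system behaviour."""
        return speed_kmh > OVER_SPEED_LIMIT_KMH

    def test_eb_0007_over_speed_raises_alarm():
        # Evidence that the Expected Behaviour is provided ...
        assert over_speed_alarm(121.0) is True
        # ... and that the corresponding Misbehaviour (a spurious alarm at or
        # below the limit) is absent.
        assert over_speed_alarm(120.0) is False

Run under a test framework such as pytest, the pass/fail result for EB-0007 becomes one of the artifacts captured as evidence for the release.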

Key insights

We can and should:

  • accept that complex software cannot in practice be 100% risk-free
  • expect and intend to provide timely updates to mitigate problems as they arise
  • expect complex software to exhibit random/stochastic behaviour
  • apply statistical methods to establish confidence in software (see the sketch after this list)
  • use soak testing to explore software behaviour over extended time periods
  • use stress testing to identify and analyse rare events
  • use CI/CD to lock down the target code and the whole supply chain
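
As one example of combining soak testing with statistical methods (the Poisson model and the figures below are assumptions of this sketch, not requirements of TSF), an upper bound on the rate of a rare failure can be derived from the number of failures observed over a known number of device-hours:

    # Illustrative sketch: a one-sided upper confidence bound on a rare
    # failure rate observed during soak testing, assuming failures arrive as
    # a Poisson process. The figures are invented for the example.
    from scipy.stats import chi2

    def failure_rate_upper_bound(failures: int, hours: float, confidence: float = 0.95) -> float:
        """Upper bound on failures per hour for `failures` events observed
        over `hours` of soak testing (exact Poisson bound via chi-squared)."""
        return chi2.ppf(confidence, 2 * failures + 2) / (2.0 * hours)

    # Example: no failures observed over 10,000 device-hours of soak testing.
    bound = failure_rate_upper_bound(0, 10_000)
    print(f"<= {bound:.1e} failures/hour at 95% confidence")  # ~3.0e-04

Even with zero observed failures the bound never reaches zero, which is consistent with the first insight above: complex software cannot in practice be shown to be 100% risk-free.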