
Is That Really Why Control Systems Go Wrong? - Video Presentation

Presented by Greg Hardin - Senior Principal Specialist, aeSolutions

The British HSE publication “Out of control - Why control systems go wrong and how to prevent failure” (HSE238) reports the primary cause by phase (specification, design and implementation, installation and commissioning, operation and maintenance, changes after commissioning) of failures of 34 safety systems in different industries. This document is frequently referred to in functional safety activities in the process industries. This presentation will consider just how applicable the quantitative results presented in HSE238 are to the process industries.

Keywords: automation, systems integration, upgrade, process safety, process control network, pcn, safety instrumented systems, SIS, systematic failure

Auto Generated Transcription:

Is that really why control systems go wrong? OK.

Out of Control: Why Control Systems Go Wrong and How to Prevent Failure. That's a publication of the United Kingdom's Health and Safety Executive.

It is, you know, very well known in the functional safety business, and I have used this chart in multiple presentations over the years. What it represents is the percentage of incidents traced to each phase of the lifecycle, the phase where things went wrong that eventually resulted in a serious incident: 6% installation and commissioning, 20% modifications after commissioning, 15% operation and maintenance, 15% design and implementation, and this is the biggie, specification, at 44%. That was a surprise to a lot of people when it was published: the idea that where we were going wrong with process safety, as related to instrumented protective functions, was in specification.

So if you take that pie chart, you can do the same thing against the safety lifecycle: specification, design and implementation. And then you can break it down a little bit more to show the areas that we're interested in.

You know, hazard and risk analysis, then SIF selection: safety instrumented function and safety integrity level determination. That's SIS FEL, which is what we used to, and still do, call the group that I am in. And then device selection and safety integrity level calculation; again, that's a SIS FEL function. And then, not to cut off half the lifecycle: installation and commissioning, operation and maintenance, modification. I think if you had taken a survey of people before the HSE publication came out, this is where people would have said most of the incidents were caused, and I'll have a little bit more to say about that later in the talk.

What really prompted me to want to do some of the research that led to this talk was what is known as the streetlight effect. You may be familiar with this little story of someone on their hands and knees, obviously looking for something. Along comes a police officer, who asks the person what they are doing, and they say they're looking for the car keys that they dropped. Well, the officer wants to help, so he asks, "Where were you standing exactly when you dropped them?" And the person replies, "Back up the street." "Then why are you looking here?" "Because the light is better."

Are we looking just at the specification, not to the exclusion of everything else, but giving it more effort than we should, because that's, you know, our business? At least in my part of the business, what's right in front of me on my desk or on my computer screen is the SIS FEL portions of the safety lifecycle that I just identified. So that's the streetlight effect.

When I was putting this talk together, I said, well, I'm familiar with the identification of the different incidents in the report, where they identified that specification is where they went wrong. But I said, well, I probably ought to go ahead and really read the report. And when I read the report, the report is not as exclusive in its analysis of the various phases as I was assuming. They do say: poor hazard analysis of the equipment under control, inadequate assessment, systematic approach not used. These are all portions of the specification phase. When you lump everything together as specification, that's where I started to get worried that we were looking where the light was better. So this is the table from that report.

They found that 44% of the incidents that they reviewed were due to inadequate specification, and they said that, of those, 12% were inadequate functional requirements specification and 32% were inadequate safety integrity requirements specification. Well, if you look at all of the incidents in the report, only one third of the total number are related to the chemical or refinery industries. The specification incidents numbered 15 in this case, and only one third of those, or approximately 5, are related to incidents in the chemical or refinery industries.
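
To make that arithmetic concrete, here is a minimal sketch that converts the rounded percentages quoted in the talk into approximate incident counts for the 34 systems in the report, then takes one third of the specification count. The counts are back-calculated from rounded percentages, and applying the overall one-third share to the specification incidents is the talk's own approximation, so treat the result as a rough figure. The names are illustrative only.

```python
# Phase shares as quoted in the talk (rounded percentages).
PHASE_SHARE_PERCENT = {
    "specification": 44,
    "design and implementation": 15,
    "installation and commissioning": 6,
    "operation and maintenance": 15,
    "changes after commissioning": 20,
}
TOTAL_INCIDENTS = 34  # number of safety systems reviewed in the report


def approximate_counts(shares, total):
    """Back-calculate incident counts from rounded percentage shares."""
    return {phase: round(total * pct / 100) for phase, pct in shares.items()}


counts = approximate_counts(PHASE_SHARE_PERCENT, TOTAL_INCIDENTS)
spec_count = counts["specification"]

# Approximation from the talk: about one third are chemical or refinery incidents.
process_industry_spec = spec_count / 3

for phase, n in counts.items():
    print(f"{phase}: about {n} of {TOTAL_INCIDENTS}")
print(f"specification incidents in the process industries: about {process_industry_spec:.0f}")
```

With a sample that small, a shift of one or two incidents would change the specification percentage noticeably.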

So do the causes of incidents in the process industries follow the distribution given in Out of Control? That was my premise in starting this.

There are lots of compilations of incidents in the process industries, and I will give a list of references at the end of this presentation.

But the granularity of the causes in these compilations is somewhat limited. In other words, even when a significant incident happened, in very few cases do the reports, particularly the summary reports that you can find on multiple incidents, give you the detail that you would need to say whether this was specification related or not. Most of the major incidents involve a sequence of events. They have multiple causes related to organizations and other things.

You know, they generate these large reports, and again you may find something that says, well, specification of this control or safety function was inadequate. That's almost never the entire story. So I did go through several of the reference lists, and I did look at 50 incidents in a particular period of time out of Loss Prevention in the Process Industries, which is a commonly cited book, and I was only able to identify five that were in the least bit instrument related, based on the description that was given. And again, can you say from this that these were specification related? Possibly. There's just not enough detail to tell. So my initial premise, that I could review the incidents and compare the results to the distribution of causes of incidents in the HSE publication, turned out to not be very practical. But what can we talk about? Well, here are some of the better known major incidents.

I think everybody's probably heard about most of these. Of course, Pasadena in 1989 was the explosion at the Phillips 66 facility here in the Houston area that resulted in the death of Mary Kay O'Connor and eventually the founding of the Mary Kay O'Connor Process Safety Center.

The one incident of all of these where you could say that specification sure sounds like a good portion of the problem was Buncefield. Essentially, a tank overflowed in a fuel depot outside of London and generated a vapor cloud that eventually exploded.

And reading the reports on the incident, if you look at it, it's like, boy, if this was the consequence, well, did they not recognize the potential consequences of overfilling a tank? Would it not have made sense to have multiple, independent, diverse-technology level instruments and communication to the remote control center?

You know, so I would have to say, of the major incidents that most people are familiar with, Buncefield comes the closest to being specification related. So where can things go wrong in specifications?

Well, you know, obviously in the hazards assessment. If you don't identify a hazard, if you don't identify an initiating event, if you don't accurately predict the potential consequences, if you give too much credit for your existing safeguards, well, then you have missed something that will not be addressed in the rest of your project.

Risk assessment. People tend to overestimate or underestimate initiating event frequency. We happen to be involved in a project right now, I did some checking on it just the other day, where we're actually doing some failure mode and effect analysis, trying to apply some Bayesian statistics to help a client identify an initiating event frequency closer to the true value than what you can get just out of looking at the reference books. Obviously you can over- or underestimate the consequence. Conditional modifiers and enabling conditions are inappropriately applied to reduce the potential frequency.
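
The talk does not say which Bayesian method that project uses. One common approach, shown here purely as an assumption, is a gamma-Poisson conjugate update: a generic frequency from a reference book serves as the prior, and the site's own operating experience pulls the estimate toward the plant's record. All numbers below are hypothetical.

```python
def gamma_poisson_update(prior_shape, prior_rate, observed_events, observed_years):
    """Update a Gamma(shape, rate) prior on an event frequency (per year)
    with Poisson-distributed plant experience. Returns posterior (shape, rate)."""
    return prior_shape + observed_events, prior_rate + observed_years


def gamma_mean(shape, rate):
    """Mean of a Gamma(shape, rate) distribution, in events per year."""
    return shape / rate


# Hypothetical prior: reference-book frequency of 0.1 per year, held weakly
# (shape 0.5 and rate 5.0 give a mean of 0.1 per year).
prior_shape, prior_rate = 0.5, 5.0

# Hypothetical site record: 1 initiating event in 30 unit-years of operation.
post_shape, post_rate = gamma_poisson_update(prior_shape, prior_rate, 1, 30.0)

print(f"prior mean:     {gamma_mean(prior_shape, prior_rate):.3f} per year")
print(f"posterior mean: {gamma_mean(post_shape, post_rate):.3f} per year")
```

A weak prior like this one lets a long site record dominate; a stronger prior (larger shape and rate with the same ratio) would keep the estimate closer to the reference-book value.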

I've borrowed this chart from the presentation I did a while back on functional safety assessments, and this is just, you know, I put this together; it's not really a serious analysis. But one thing we run into frequently when people ask us to help them do safety integrity level determination of safety functions related to fired equipment is that they start out assuming that if the slightest bit of uncombusted fuel makes its way into the firebox, then you have a violent explosion that results in a fatality. And if you look at it, yeah, that makes an awful lot of assumptions. So this just happens to be a specific instance that I've seen several times, and people come up with outrageous, what seem unnecessarily high, safety integrity level requirements, beyond that required by the standards, to address this, when, if they took a more hardheaded look at it, it would not necessarily occur with the frequency or the consequence that they assume.
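
To see how much those starting assumptions drive the result, here is a minimal sketch of the arithmetic. The unmitigated event frequency divided by a tolerable frequency gives the required risk reduction factor, which maps to a safety integrity level band for low-demand functions (SIL 1 covers a risk reduction above 10 up to 100, SIL 2 up to 1,000, and so on). The initiating frequency, tolerable frequency, and probabilities are hypothetical and are not taken from the chart in the presentation; any probability used this way needs its own justification.

```python
def required_rrf(initiating_freq_per_year, tolerable_freq_per_year, probabilities=()):
    """Required risk reduction factor for a safety function."""
    event_freq = initiating_freq_per_year
    for p in probabilities:
        event_freq *= p
    return event_freq / tolerable_freq_per_year


def sil_band(rrf):
    """Low-demand SIL band for a required risk reduction factor (0 = no SIL needed)."""
    if rrf > 100_000:
        raise ValueError("risk reduction beyond SIL 4; rethink the design")
    for sil, upper in ((0, 10), (1, 100), (2, 1_000), (3, 10_000), (4, 100_000)):
        if rrf <= upper:
            return sil


INITIATING_FREQ = 0.05  # hypothetical: unburned fuel admitted to the firebox, per year
TOLERABLE_FREQ = 1e-5   # hypothetical tolerable frequency for a fatality, per year

# Worst case: every event is assumed to be a violent, fatal explosion.
worst_case = required_rrf(INITIATING_FREQ, TOLERABLE_FREQ)

# Harder look: hypothetical probability that enough fuel accumulates and ignites
# (0.1) and that someone is close enough to be killed (0.5).
examined = required_rrf(INITIATING_FREQ, TOLERABLE_FREQ, probabilities=(0.1, 0.5))

print(f"worst case: RRF {worst_case:.0f}, SIL {sil_band(worst_case)}")
print(f"examined:   RRF {examined:.0f}, SIL {sil_band(examined)}")
```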

In the safety requirements specification: systematic errors. Are your field devices going to be certified to IEC 61508 or based on prior use? The standard is more forgiving of you if you base them on prior use. However, this is also some place where you can go wrong, or go astray, I should say, because the latest version of ANSI/ISA 61511 allows you to have a safety integrity level 2 function with a hardware fault tolerance of zero, in other words, no redundancy. Well, that is based on the fact that the failure rates you're using to calculate the safety integrity level are based on prior use, but the standard doesn't state that very clearly, and, you know, that's an unfortunate weakness, I think, in the standard, but it's a place where you have to be careful. It makes a difference in how you evaluate hardware fault tolerance and architectural constraints whether you're basing your failure rates on certified devices or on prior use.

Is your failure rate data reasonable? That's something that we deal with very frequently. Clients will come to us sometimes with a manufacturer certificate that has a dangerous undetected failure rate for a device that's one or two orders of magnitude lower than what we're used to seeing, even for certified devices. And sometimes it can be difficult to get the client to recognize the risk that they're taking. In the past, sometimes I have performed the calculation with their data and then with more reasonable data and showed them the difference. And like I say, you're trying to identify the risk that the client is assuming by using this potentially unreasonable failure rate that the device can't really maintain in the field.
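
That kind of side-by-side comparison can be sketched with the common simplified approximation for a single (1oo1) device that is fully proof tested, PFDavg ≈ λ_DU × TI / 2. The failure rates and test interval below are hypothetical, chosen only to show what a difference of two orders of magnitude does to the result.

```python
HOURS_PER_YEAR = 8760


def pfd_avg_1oo1(lambda_du_per_hour, test_interval_years):
    """Simplified average probability of failure on demand for one device."""
    return lambda_du_per_hour * test_interval_years * HOURS_PER_YEAR / 2


TEST_INTERVAL_YEARS = 1.0
certificate_rate = 1e-8  # hypothetical dangerous undetected rate from a certificate, per hour
typical_rate = 1e-6      # hypothetical rate closer to field experience, per hour

pfd_certificate = pfd_avg_1oo1(certificate_rate, TEST_INTERVAL_YEARS)
pfd_typical = pfd_avg_1oo1(typical_rate, TEST_INTERVAL_YEARS)

print(f"certificate data: PFDavg {pfd_certificate:.2e}, RRF {1 / pfd_certificate:,.0f}")
print(f"typical data:     PFDavg {pfd_typical:.2e}, RRF {1 / pfd_typical:,.0f}")
```

Because the approximation is linear in the failure rate, the claimed risk reduction shrinks by the same factor as the failure rate grows.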

Test intervals. Have people really thought them through? You know, that's one of the knobs that they want us to change, test intervals. Well, yeah, I can see, you know, let's make the test interval shorter and we'll get the safety integrity level down. Well, yeah, that's true. Or, I mean, up; it increases, excuse me. But is that really realistic? You know, if you've got a five-year turnaround frequency and that's the only time that you can test some of your safety functions, well, just changing the number doesn't really do you anything if you can't actually operate that way.
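
One way to check whether a test interval is more than a number on paper is to invert the simplified 1oo1 approximation (PFDavg ≈ λ_DU × TI / 2) and compare the longest interval that meets the target with the interval the plant can actually achieve. This sketch uses hypothetical values for the target and the failure rate; only the five-year turnaround comes from the talk's example.

```python
HOURS_PER_YEAR = 8760


def max_test_interval_years(target_pfd_avg, lambda_du_per_hour):
    """Longest proof test interval that still meets the target PFDavg (1oo1)."""
    return 2 * target_pfd_avg / (lambda_du_per_hour * HOURS_PER_YEAR)


target_pfd = 5e-3        # hypothetical target inside the SIL 2 band
lambda_du = 2e-6         # hypothetical dangerous undetected rate, per hour
turnaround_years = 5.0   # the only opportunity to test, as in the talk's example

needed = max_test_interval_years(target_pfd, lambda_du)
achievable = needed >= turnaround_years

print(f"interval needed: {needed:.2f} years; turnaround: {turnaround_years} years")
if achievable:
    print("the assumed interval can be met in operation")
else:
    print("the assumed interval cannot be met in operation")
```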

Test coverage. That's another place where people want to say, oh, our test coverage, we're 99%, we cover 99% of the potential failures. Well, if you look at, hopefully, the manufacturer's safety manual, that's possibly not so. We had a good presentation some time ago from a vendor talking about the work that has been done in the nuclear industry about what it takes to obtain, you know, proof test coverage for shutoff valves, and even to get to the highest proof test coverage takes an awful lot of work and an awful lot of resources, hardware resources.
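
The effect of imperfect proof test coverage can be sketched with a widely used simplified approximation, in which failures the proof test finds are reset every test interval and the rest stay hidden until the device is overhauled or replaced: PFDavg ≈ PTC × λ_DU × TI / 2 + (1 − PTC) × λ_DU × MT / 2. The failure rate, intervals, and coverage values here are hypothetical.

```python
HOURS_PER_YEAR = 8760


def pfd_avg_with_coverage(lambda_du_per_hour, coverage, test_interval_years, mission_time_years):
    """Simplified 1oo1 PFDavg with imperfect proof test coverage."""
    ti_hours = test_interval_years * HOURS_PER_YEAR
    mt_hours = mission_time_years * HOURS_PER_YEAR
    found_by_test = coverage * lambda_du_per_hour * ti_hours / 2
    hidden_until_overhaul = (1 - coverage) * lambda_du_per_hour * mt_hours / 2
    return found_by_test + hidden_until_overhaul


lambda_du = 1e-6  # hypothetical dangerous undetected rate, per hour

# Claimed 99% coverage versus a more modest 70%, with a 1-year test
# interval and a 15-year mission time (both hypothetical).
results = {
    coverage: pfd_avg_with_coverage(lambda_du, coverage, 1.0, 15.0)
    for coverage in (0.99, 0.70)
}

for coverage, pfd in results.items():
    print(f"coverage {coverage:.0%}: PFDavg {pfd:.2e}")
```

With these hypothetical inputs the 99% claim keeps the function inside the SIL 2 band, while 70% coverage pushes it into SIL 1 even though nothing else changed.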

Process safety time. Is it accurate? Is it considered in the design? Do the valves really close fast enough? Are all process operating modes considered? Do you consider startup and shutdown? Are there times when one piece of equipment is out of service but not another? Will tripping this safety function at that time create a hazard you hadn't anticipated?

So in summary: are 44% of the incidents in the process industries due just to a specification error of a safety function? Doubtful.

Most have complex causes. Notice I'm saying serious incidents. Out of Control focused attention on the specification portion of the safety lifecycle, and that was a good thing, because before that I think most people would have said that operation and maintenance and problems with management of change were where the main causes of serious incidents were happening. And one reason for that is, well, you know, things don't blow up during the specification stage. They have to be operating and being maintained before you have a serious incident, and so that tends to be where the focus is. That doesn't mean that the chain that led to the incident did not start back in the specification phase.

So Out of Control, I still consider it a valuable reference.

