Presented by Greg Hardin - Senior Principal Specialist, aeSolutions
Why Do Control Systems Go Wrong?
The British HSE publication “Out of control - Why control systems go wrong and how to prevent failure” (HSG238) reports the primary cause by phase (specification, design and implementation, installation and commissioning, operation and maintenance, changes after commissioning) of failures of 34 safety systems in different industries. This document is frequently referred to in functional safety activities in the process industries. This presentation will consider just how applicable the quantitative results presented in HSG238 are to the process industries.
Keywords: automation, systems integration, upgrade, process safety, process control network, pcn, safety instrumented systems, SIS, systematic failure
Auto Generated Transcript:
Is that really why control systems go wrong? OK.
Out of Control: Why Control Systems Go Wrong and How to Prevent Failure. That's a publication of the United Kingdom's Health and Safety Executive.
It is very well known in the functional safety business, and I have used this chart in multiple presentations over the years. What it represents is the percentage of incidents attributed to each phase of the lifecycle, the phase where things went wrong and eventually resulted in a serious incident: 6% installation and commissioning, 20% modifications after commissioning, 15% operation and maintenance, 15% design and implementation, and this is the biggie: 44% specification. That was a surprise to a lot of people when it was published: the idea that where we were going wrong with process safety, as related to instrumented protective functions, was in specification.
So if you take that pie chart, you can map the same thing against the safety lifecycle: specification, design and implementation. And then you can break it down a little bit more to show the areas that we're interested in.
You know, hazard and risk analysis, then SIF selection (safety instrumented function) and safety integrity level determination. SIS FEL is what we used to, and still do, call the group that I am in. And then device selection and safety integrity level calculation; again, that's a SIS FEL function. And then, not to cut off half the lifecycle: installation and commissioning, operation and maintenance, modification. I think if you had taken a survey of people before the HSE publication came out, this is where people would have said most of the incidents were caused, and I'll have a little bit more to say about that later in the talk.
What really prompted me to want to do some of the research that led to this talk was what is known as the streetlight effect. You may be familiar with this little story of someone on their hands and knees, obviously looking for something. Along comes a police officer, who asks the person what they are doing, and they say they're looking for the car keys that they dropped. Well, the officer wants to help, so he asks, "Where were you standing exactly when you dropped them?" And the person replies, "Back up the street." "Then why are you looking here?" "Because the light is better."
Are we looking just at specification, not to the exclusion of everything else, but giving it more effort than we should, because that's our business? At least in my part of the business, what's right in front of me on my desk or on my computer screen is the SIS FEL portions of the safety lifecycle that I just identified. So that's the streetlight effect.
When I was putting this talk together, I said, well, I'm familiar with the identification of the different incidents in the report, where they identified that specification is where they went wrong. But I said, well, I probably ought to go ahead and really read the report. And when I read the report, it turned out that the report is not as exclusive in its analysis of the various phases as I was assuming. They do list things like poor hazard analysis of the equipment under control, inadequate assessment, and systematic approach not used. These are all portions of the specification phase. When you lump everything together as specification, that's where I started to get worried about whether we were looking where the light was better. So this is the table from that report.
They found that 44% of the incidents that they reviewed were due to inadequate specification, and they said that of those, 12% were inadequate functional requirements specification and 32% were inadequate safety integrity requirements specification. Well, if you look at all of the incidents in the report, only one third of the total number of incidents are related to the chemical or refinery industries. For specification, which was 15 of the incidents in this case, only one third of those, or approximately 5, are related to incidents in the chemical or refinery industries.
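The arithmetic behind those counts can be checked in a few lines. This is only a sketch: the sample size of 34 comes from the abstract, the phase percentages are the ones quoted in the talk, and the one-third process-industry share is the speaker's approximation rather than an exact count.

```python
# Phase shares quoted in the talk (percent of incidents by primary cause).
PHASE_SHARE_PERCENT = {
    "specification": 44,
    "design and implementation": 15,
    "installation and commissioning": 6,
    "operation and maintenance": 15,
    "changes after commissioning": 20,
}
TOTAL_INCIDENTS = 34       # size of the HSE sample, per the abstract
PROCESS_FRACTION = 1 / 3   # approximate chemical/refinery share, per the talk

# The five phases should account for every incident.
assert sum(PHASE_SHARE_PERCENT.values()) == 100

spec_incidents = round(TOTAL_INCIDENTS * PHASE_SHARE_PERCENT["specification"] / 100)
process_spec_incidents = round(spec_incidents * PROCESS_FRACTION)

print(f"Specification-related incidents: {spec_incidents}")
print(f"Of those, chemical/refinery:     {process_spec_incidents}")
```

A headline percentage that rests on roughly five process-industry incidents is a thin basis for generalizing to the process industries.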
So do the causes of incidents in the process industries follow the distribution given in Out of Control? That was my premise in starting this.
There are lots of compilations of incidents in the process industries, and I will give a list of references at the end of this presentation.
But the granularity of the causes in these compilations is somewhat limited. In other words, even when a significant incident happened, in very few cases do the reports, particularly the summary reports that you can find on multiple incidents, give you the detail that you would need to say whether it was specification related or not. Most of the major incidents involve a sequence of events. They have multiple causes related to organizations and other things, and they generate these large reports. Again, you may find something that says the specification of this control or safety function was inadequate, but that's almost never the entire story.
So I did go through several of the reference lists, and I did look at 50 incidents in a particular period of time out of Loss Prevention in the Process Industries, which is a commonly cited book, and I was only able to identify five that were in the least bit instrument related, based on the description that was given. And again, can you say from this whether these were specification related? Possibly. There's just not enough detail to tell. So my initial premise, that I could review the incidents and compare the results to the distribution of causes in the HSE publication, turned out to not be very practical. But what can we talk about? Well, here are some of the better-known major incidents.
I think everybody's probably heard about most of these. Of course, Pasadena in 1989 was the explosion at the Phillips 66 facility here in the Houston area that resulted in the death of Mary Kay O'Connor and eventually the founding of the Mary Kay O'Connor Process Safety Center.
The one incident of all of these where you could say that specification sure sounds like a good portion of the problem was Buncefield. Essentially, a tank overflowed in a fuel depot outside of London and generated a vapor cloud that eventually exploded.
And reading the reports on the incident, it's like, boy, if this was the consequence, did they not recognize the potential consequences of overfilling a tank? Would it not have made sense to have multiple, independent, diverse-technology level instruments and communication to the remote control center?
So I would have to say that, of the major incidents that most people are familiar with, Buncefield comes the closest to being specification related. So where can things go wrong in specifications?
Well, obviously, in the hazards assessment. If you don't identify a hazard, if you don't identify an initiating event, if you don't accurately predict the potential consequences, or if you give too much credit for your existing safeguards, then you have missed something that will not be addressed in the rest of your project.
Risk assessment. People tend to overestimate or underestimate initiating event frequency. We happen to be involved in a project right now, one I did some checking on just the other day, where we're actually doing some failure mode and effect analysis and trying to apply some Bayesian statistics to help a client identify an initiating event frequency that is closer to the true value than what you can get just out of looking at the reference books.
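The talk does not say which Bayesian method was used on that project. One common textbook approach, shown here purely as an illustration, is a conjugate gamma-Poisson update: a reference-book frequency serves as the prior, and the site's own operating history pulls the estimate toward what was actually observed. The function name and all numbers below are hypothetical.

```python
def update_event_frequency(prior_mean_per_year, prior_weight_years,
                           observed_events, observed_years):
    """Conjugate gamma-Poisson update of an event frequency (events/year).

    The prior is a gamma distribution whose mean is the reference-book value
    and whose weight is expressed as an equivalent number of years of
    experience (both are assumptions chosen by the analyst).
    """
    alpha_prior = prior_mean_per_year * prior_weight_years
    beta_prior = prior_weight_years
    alpha_post = alpha_prior + observed_events
    beta_post = beta_prior + observed_years
    return alpha_post, beta_post


# Hypothetical example: the reference book says 0.1/yr, and the site has
# 40 device-years of history with no recorded initiating events.
alpha, beta = update_event_frequency(
    prior_mean_per_year=0.1, prior_weight_years=10.0,
    observed_events=0, observed_years=40.0,
)
posterior_mean = alpha / beta
print(f"Posterior mean frequency: {posterior_mean:.3f} per year")
```

The prior weight controls how quickly site data overrides the generic value. Choosing it is a judgment call and should be documented alongside the result.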
Obviously you can overestimate or underestimate the consequence severity. Conditional modifiers and enabling conditions can be inappropriately applied to reduce the potential frequency.
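To see why these factors deserve scrutiny, here is a minimal sketch of the usual layer-of-protection style multiplication, with hypothetical numbers. Each enabling condition, conditional modifier, and protection layer is a probability that multiplies the initiating event frequency, so a few optimistic entries compound quickly.

```python
from math import prod


def mitigated_frequency(initiating_event_per_year,
                        enabling_conditions=(),
                        conditional_modifiers=(),
                        protection_layer_pfds=()):
    """Scenario frequency after applying all probability factors (per year)."""
    return (initiating_event_per_year
            * prod(enabling_conditions)
            * prod(conditional_modifiers)
            * prod(protection_layer_pfds))


# Hypothetical scenario: initiating event once in ten years.
base = mitigated_frequency(0.1)

# Same scenario with one enabling condition and two conditional modifiers.
credited = mitigated_frequency(
    0.1,
    enabling_conditions=(0.5,),        # e.g. fraction of time in the hazardous mode
    conditional_modifiers=(0.1, 0.5),  # e.g. ignition, occupancy
)

print(f"No modifiers:   {base:.4f} per year")
print(f"With modifiers: {credited:.4f} per year")
print(f"Reduction taken without any hardware: {base / credited:.0f}x")
```

In this example the modifiers alone claim a factor of 40, more than an order of magnitude of risk reduction. If they are not justified, the safety function ends up under-specified.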
I've borrowed this chart from a presentation I did a while back on functional safety assessments. I put this together myself; it's not really a serious analysis. But one thing we run into frequently, when people ask us to help them do safety integrity level determination of safety functions related to fired equipment, is that they start out assuming that if the slightest bit of uncombusted fuel makes its way into the firebox, then you have a violent explosion that results in a fatality. And if you look at it, that makes an awful lot of assumptions. This just happens to be a specific instance that I've seen several times. People come up with what seem to be unnecessarily high safety integrity level requirements, beyond those required by the standards, to address this, when, if they took a more hardheaded look at it, the event would not necessarily occur with the frequency or the consequence that they assume.
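The effect of that worst-case assumption on the SIL target can be sketched with the standard low-demand bands (SIL 1 covers a risk reduction factor of 10 to 100, SIL 2 covers 100 to 1,000, and so on). The frequencies and probabilities below are hypothetical and are not taken from the speaker's chart.

```python
def sil_band(rrf):
    """Map a required risk reduction factor to a low-demand SIL band.

    Returns 0 when the required reduction is 10 or less (no SIL-rated
    function in this simplified sketch) and 4 for anything above 10,000.
    """
    if rrf <= 10:
        return 0
    if rrf <= 100:
        return 1
    if rrf <= 1000:
        return 2
    if rrf <= 10000:
        return 3
    return 4


TOLERABLE_PER_YEAR = 1e-5    # hypothetical corporate target for a fatality
FUEL_EVENT_PER_YEAR = 0.05   # hypothetical frequency of unburned fuel in the firebox

# Worst case: every fuel accumulation is assumed to explode and kill someone.
worst_case_rrf = FUEL_EVENT_PER_YEAR / TOLERABLE_PER_YEAR

# More hardheaded look: not every accumulation explodes, and the area is
# not always occupied (both probabilities are illustrative assumptions).
P_EXPLOSION = 0.1
P_PERSON_PRESENT = 0.1
refined_rrf = FUEL_EVENT_PER_YEAR * P_EXPLOSION * P_PERSON_PRESENT / TOLERABLE_PER_YEAR

print(f"Worst case: RRF {worst_case_rrf:.0f} -> SIL {sil_band(worst_case_rrf)}")
print(f"Refined:    RRF {refined_rrf:.0f} -> SIL {sil_band(refined_rrf)}")
```

The same arithmetic cuts both ways: the refined probabilities need as much justification as the worst-case assumption they replace.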
In the safety requirements specification: systematic errors. Are your field devices going to be certified to IEC 61508 or based on prior use? The standard is more forgiving of you if you base them on prior use. However, this is also some place where you can go wrong, or go astray, I should say, because the latest version of ANSI/ISA 61511 allows you to have a safety integrity level 2 function with zero hardware fault tolerance. In other words, no redundancy. Well, that is based on the fact that the failure rates you're using to calculate the safety integrity level are based on prior use, but the standard doesn't state that very clearly. That's an unfortunate weakness, I think, in the standard, and it's a place where you have to be careful. It makes a difference in how you evaluate hardware fault tolerance and architectural constraints whether you're basing your failure rates on certified devices or on prior use.
Is your failure rate data reasonable? Boy, that's something that we deal with very frequently. Clients will sometimes come to us with a manufacturer's certificate that has a dangerous undetected failure rate for a device that's one or two orders of magnitude lower than what we're used to seeing, even for certified devices. And sometimes it can be difficult to get the client to recognize the risk that they're taking. In the past, I have sometimes performed the calculation with their data and then with more reasonable data and shown them the difference. Like I say, you're trying to identify the risk that the client is assuming by using this potentially unreasonable failure rate that the device can't really maintain in the field.
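A side-by-side calculation of that kind can be sketched with the simplified single-channel approximation PFDavg ≈ λDU × TI / 2. The failure rates below are hypothetical placeholders, not values from any certificate, and a real calculation would cover the whole function (sensor, logic solver, final element).

```python
HOURS_PER_YEAR = 8760


def pfd_avg_1oo1(lambda_du_per_hour, test_interval_years):
    """Simplified average probability of failure on demand for one device."""
    return lambda_du_per_hour * test_interval_years * HOURS_PER_YEAR / 2


TEST_INTERVAL_YEARS = 1.0
certificate_rate = 1e-8   # hypothetical optimistic dangerous undetected rate, per hour
field_rate = 1e-6         # hypothetical rate two orders of magnitude higher

for label, rate in (("certificate data", certificate_rate),
                    ("more reasonable data", field_rate)):
    pfd = pfd_avg_1oo1(rate, TEST_INTERVAL_YEARS)
    print(f"{label:>22}: PFDavg = {pfd:.2e}, RRF = {1 / pfd:,.0f}")
```

Because PFDavg is linear in the failure rate, a rate that is two orders of magnitude too optimistic overstates the claimed risk reduction by the same factor.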
Test intervals. Have people really thought them through? That's one of the knobs that they want us to change. "Well, let's make the test interval shorter and we'll get the safety integrity level up." Well yeah, that's true. But is that realistic? If you've got a five-year turnaround frequency and that's the only time that you can test some of your safety functions, well, just changing the number doesn't really do you anything if you can't actually operate that way.
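One way to make that reality check explicit is to turn the calculation around: work out the longest test interval that would meet the target, then compare it with when the plant can actually test. This sketch uses the simplified approximation PFDavg ≈ λDU × TI / 2, and all numbers are hypothetical.

```python
HOURS_PER_YEAR = 8760


def max_test_interval_years(lambda_du_per_hour, target_pfd_avg):
    """Longest proof test interval that still meets the target PFDavg."""
    return 2 * target_pfd_avg / lambda_du_per_hour / HOURS_PER_YEAR


LAMBDA_DU = 2e-6            # hypothetical dangerous undetected rate, per hour
TARGET_PFD = 0.01           # upper limit of the SIL 2 band
TURNAROUND_YEARS = 5.0      # only opportunity to test, as in the talk's example

needed = max_test_interval_years(LAMBDA_DU, TARGET_PFD)
achievable = needed >= TURNAROUND_YEARS
actual_pfd = LAMBDA_DU * TURNAROUND_YEARS * HOURS_PER_YEAR / 2

print(f"Test interval needed for target: {needed:.2f} years")
print(f"Can be met at turnaround only:   {achievable}")
print(f"PFDavg at a 5-year interval:     {actual_pfd:.4f}")
```

In this example the target needs testing roughly every 14 months. Unless the function can be tested online, the number written into the calculation does not describe how the plant will run.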
Test coverage. That's another place where people want to say, "Oh, our test coverage, we cover 99% of the potential failures." Well, if you look at the manufacturer's safety manual, hopefully, that's possibly not reasonable. We had a good presentation some time ago by a vendor, talking about the work that has been done in the nuclear industry on what it takes to obtain proof test coverage for shutoff valves. To get to the highest proof test coverage takes an awful lot of work and an awful lot of resources, hardware resources.
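The sensitivity to the claimed coverage can be sketched with a commonly used simplified formula, in which failures the proof test cannot reveal stay hidden for the whole mission time: PFDavg ≈ PTC × λDU × TI / 2 + (1 − PTC) × λDU × MT / 2. The failure rate, test interval, and mission time below are hypothetical.

```python
HOURS_PER_YEAR = 8760


def pfd_avg_with_coverage(lambda_du_per_hour, test_interval_years,
                          mission_time_years, proof_test_coverage):
    """Simplified single-device PFDavg with imperfect proof testing."""
    ti_hours = test_interval_years * HOURS_PER_YEAR
    mt_hours = mission_time_years * HOURS_PER_YEAR
    revealed = proof_test_coverage * lambda_du_per_hour * ti_hours / 2
    hidden = (1 - proof_test_coverage) * lambda_du_per_hour * mt_hours / 2
    return revealed + hidden


LAMBDA_DU = 1e-6        # hypothetical, per hour
TEST_INTERVAL = 1.0     # years
MISSION_TIME = 20.0     # years before replacement or overhaul

for coverage in (0.99, 0.90, 0.70):
    pfd = pfd_avg_with_coverage(LAMBDA_DU, TEST_INTERVAL, MISSION_TIME, coverage)
    print(f"coverage {coverage:.0%}: PFDavg = {pfd:.4f}, RRF = {1 / pfd:.0f}")
```

With these inputs, dropping from 99% to 70% coverage increases PFDavg several times over, which is why the coverage figure needs to be backed by the safety manual and the actual test procedure.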
Process safety time. Is it accurate? Is it considered in the design? Do the valves really close fast enough?
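A minimal version of that check adds up the response of each element in the loop and compares the total with the process safety time. The requirement that the function respond within half of the process safety time is a rule of thumb assumed here, not something stated in the talk, and the timings are hypothetical.

```python
def sif_response_check(sensor_s, logic_solver_s, valve_stroke_s,
                       process_safety_time_s, allowed_fraction=0.5):
    """Return (total response time, whether it fits the allowed budget).

    allowed_fraction is an assumed design margin: the share of the process
    safety time that the safety function is permitted to use.
    """
    total = sensor_s + logic_solver_s + valve_stroke_s
    return total, total <= allowed_fraction * process_safety_time_s


# Hypothetical loop: 2 s transmitter lag, 0.5 s logic solver, 30 s PST.
fast_total, fast_ok = sif_response_check(2.0, 0.5, 10.0, 30.0)
slow_total, slow_ok = sif_response_check(2.0, 0.5, 20.0, 30.0)

print(f"10 s valve: total {fast_total} s, acceptable: {fast_ok}")
print(f"20 s valve: total {slow_total} s, acceptable: {slow_ok}")
```

The stroke time used should be the one measured in the field under process conditions, since that is what decides whether the valve really closes fast enough.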
Are all process operating modes considered? Do you consider startup and shutdown? Are there times when one piece of equipment is out of service but not another? Will tripping this safety function at that time create a hazard you hadn't anticipated?
So in summary: are 44% of the incidents in the process industries due just to a specification error of a safety function? Doubtful.
Most have complex causes. Notice I'm saying serious incidents. Out of Control focused attention on the specification portion of the safety lifecycle, and that was a good thing, because before that I think most people would have said that operation and maintenance and problems with management of change were where the main causes of serious incidents were happening. One reason for that is, well, things don't blow up during the specification stage. They have to be operating and being maintained before you have a serious incident, and so that tends to be where the focus is. That doesn't mean that the chain that led to the incident did not start back in the specification phase.
So Out of Control, I still consider it a valuable reference.