Explanation of FMEA criteria and scales

In the FMEA process, we evaluate each identified failure mode against three criteria: Severity, Occurrence, and Detection. Each criterion is rated on a scale from 1 to 5. Understanding these criteria and their scales is essential for assessing the risk of each failure mode consistently.

1. Severity (S)

Severity measures the impact of a failure mode on the system or the end-user if it occurs. It helps us understand how serious the consequences would be.

Scale for Severity:

Rating  Description  Example
1       No effect    The failure has no noticeable impact.
2       Minor        The failure causes slight inconvenience or annoyance.
3       Moderate     The failure affects system performance but is manageable.
4       Major        The failure significantly impacts system performance or user satisfaction.
5       Critical     The failure causes total system failure or severe operational disruption.

2. Occurrence (O)

Occurrence measures how likely a failure mode is to happen. The confidence level in the scale below is our confidence that the failure will not occur, so lower confidence maps to a higher rating.

Scale for Occurrence:

Rating  Description             Confidence of no failure (%)  Example
1       Very unlikely to occur  Above 99%                     Almost certain that the failure will not occur.
2       Unlikely to occur       91-99%                        Rare but possible failure.
3       Possible                81-90%                        Occasional failures, needs attention.
4       Likely to occur         71-80%                        Regular failures, significant concern.
5       Very likely to occur    70% or less                   Frequent failures, immediate action required.
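
The confidence bands above can be expressed as a small lookup. The following Python sketch is an illustration only and is not part of the FMEA process itself. The function name occurrence_rating is hypothetical. It assumes the confidence level is the confidence that the failure will not occur, and that each band includes its upper bound (so exactly 90% is rated 3, and exactly 99% is rated 2).

```python
def occurrence_rating(confidence_pct: float) -> int:
    """Map the confidence (in %) that a failure will NOT occur
    to an Occurrence rating from 1 (very unlikely) to 5 (very likely)."""
    if not 0 <= confidence_pct <= 100:
        raise ValueError("confidence_pct must be between 0 and 100")

    # Assumed band edges: each band includes its upper bound.
    if confidence_pct > 99:
        return 1  # Very unlikely to occur
    if confidence_pct > 90:
        return 2  # Unlikely to occur
    if confidence_pct > 80:
        return 3  # Possible
    if confidence_pct > 70:
        return 4  # Likely to occur
    return 5      # Very likely to occur
```

If the team prefers different band edges, only the comparison thresholds need to change.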

3. Detection (D)

Detection measures our ability to identify a failure mode before it causes an impact. Note that the scale runs in the opposite direction to what the name suggests: a low rating means the failure is almost certain to be detected, and a high rating means it is likely to go unnoticed.

Scale for Detection:

Rating  Description          Example
1       Almost certain       Failure will almost certainly be detected before impact.
2       High likelihood      Failure is likely to be detected.
3       Moderate likelihood  Failure may or may not be detected.
4       Low likelihood       Failure is unlikely to be detected.
5       Very unlikely        Failure detection is almost impossible.

Calculating the Risk Priority Number (RPN)

Once we have rated each failure mode for Severity, Occurrence, and Detection, we calculate the Risk Priority Number (RPN) for each failure mode. The RPN is calculated by multiplying the three ratings together:

RPN = Severity (S) * Occurrence (O) * Detection (D)

The RPN helps us prioritize the failure modes by their overall risk level. Higher RPNs indicate higher risk, guiding us on where to focus our mitigation efforts.
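
As a minimal sketch of the calculation, the Python function below multiplies the three ratings and rejects values outside the scale. The function name rpn and the max_rating parameter are hypothetical, and the default upper bound of 5 assumes the 1-to-5 range used in the scale tables.

```python
def rpn(severity: int, occurrence: int, detection: int, max_rating: int = 5) -> int:
    """Return the Risk Priority Number: S * O * D.

    max_rating is an assumed upper bound taken from the scale tables.
    """
    ratings = {
        "severity": severity,
        "occurrence": occurrence,
        "detection": detection,
    }
    for name, value in ratings.items():
        # Reject anything that is not a whole number on the scale.
        if not isinstance(value, int) or not 1 <= value <= max_rating:
            raise ValueError(f"{name} must be an integer from 1 to {max_rating}")
    return severity * occurrence * detection
```

With the default bound, the result always lies between 1 and 125.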

Example of Applying the Criteria

Consider a failure mode where a Lambda function fails to read a Parquet file because of a permissions issue. Suppose we rate it 4 for Severity, 3 for Occurrence, and 2 for Detection:

RPN = 4 (Severity) * 3 (Occurrence) * 2 (Detection) = 24

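With ratings from 1 to 5, the RPN can range from 1 to 125. An RPN of 24 sits in the lower part of that range, but the Severity rating of 4 (Major) means this is still a failure mode we need to address. How urgently depends on how it compares with the RPNs of the other failure modes.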
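
To show how RPNs guide prioritization, the Python sketch below ranks a short list of failure modes from highest to lowest RPN. The class FailureMode and the function rank_by_rpn are hypothetical names. Only the first entry comes from the example above; the other two entries and their ratings are invented placeholders so that there is something to sort.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int
    occurrence: int
    detection: int

    @property
    def rpn(self) -> int:
        # RPN = Severity * Occurrence * Detection
        return self.severity * self.occurrence * self.detection


def rank_by_rpn(modes):
    """Return the failure modes sorted from highest to lowest RPN."""
    return sorted(modes, key=lambda mode: mode.rpn, reverse=True)


failure_modes = [
    # From the example above: 4 * 3 * 2 = 24
    FailureMode("Lambda cannot read Parquet file (permissions)", 4, 3, 2),
    # Invented placeholders, for illustration only
    FailureMode("Hypothetical failure mode A", 5, 2, 4),
    FailureMode("Hypothetical failure mode B", 2, 2, 1),
]

for mode in rank_by_rpn(failure_modes):
    print(f"{mode.rpn:>3}  {mode.name}")
```

Sorting by RPN alone is one possible convention. A team may also choose to review every failure mode with a high Severity rating regardless of its RPN.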

Conclusion

Understanding Severity, Occurrence, and Detection, and their respective scales, is essential for assessing and prioritizing the potential failure modes in our system. This structured approach helps us identify and address risks systematically, which improves the reliability and robustness of our audit logging service.

In today's session, we will focus on identifying potential failure modes for each component of our audit logging service. This will set the stage for future sessions where we will assess, prioritize, and develop mitigation strategies for these failure modes.