Explanation of FMEA criteria and scales

In the FMEA process, we use three key criteria to evaluate each identified failure mode: Severity, Occurrence, and Detection. Each criterion is rated on a scale from 1 to 5, where a higher rating always means higher risk. Understanding these criteria and scales is crucial for accurately assessing the potential risks associated with each failure mode.

1. Severity (S)

Severity measures how serious the consequences of a failure mode would be for the system or the end user if it occurs.

Scale for Severity:

Rating	Description	Example
1	No effect	The failure has no noticeable impact.
2	Minor	The failure causes slight inconvenience or annoyance.
3	Moderate	The failure affects system performance but is manageable.
4	Major	The failure significantly impacts system performance or user satisfaction.
5	Critical	The failure causes total system failure or severe operational disruption.

2. Occurrence (O)

Occurrence measures how frequently a failure mode is likely to happen. The confidence level in the scale below is our confidence that the failure will not occur, so a lower confidence level gives a higher Occurrence rating.

Scale for Occurrence:

Rating	Description	Confidence Level of No Failure (%)	Example
1	Very unlikely to occur	Above 99%	Almost certain that the failure will not occur.
2	Unlikely to occur	91-99%	Rare but possible failure.
3	Possible	81-90%	Occasional failures, needs attention.
4	Likely to occur	71-80%	Regular failures, significant concern.
5	Very likely to occur	70% or less	Frequent failures, immediate action required.
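
The mapping from confidence level to Occurrence rating can be sketched as a small Python function. This is a minimal illustration, not part of the FMEA process itself: the function name `occurrence_rating` is hypothetical, and it assumes that a value sitting exactly on a band boundary belongs to the lower-confidence band (so exactly 90% is rated 3 and exactly 80% is rated 4).

```python
def occurrence_rating(confidence_pct: float) -> int:
    """Map our confidence (in %) that a failure will NOT occur to a 1-5 Occurrence rating.

    Hypothetical helper. Assumption: boundary values fall into the
    lower-confidence band, e.g. exactly 90% is rated 3.
    """
    if not 0 <= confidence_pct <= 100:
        raise ValueError("confidence_pct must be between 0 and 100")
    if confidence_pct > 99:
        return 1  # very unlikely to occur
    if confidence_pct > 90:
        return 2  # unlikely to occur
    if confidence_pct > 80:
        return 3  # possible
    if confidence_pct > 70:
        return 4  # likely to occur
    return 5      # very likely to occur (70% or less)


print(occurrence_rating(85))  # 3
```

Writing the bands as a chain of thresholds leaves no gaps between them, so fractional values such as 90.5% still get a rating.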

3. Detection (D)

Detection measures our ability to identify a failure mode before it causes an impact. Note that the scale is inverted relative to our ability: a higher rating means the failure is harder to detect, so poor detectability raises the overall risk score.

Scale for Detection:

Rating	Description	Example
1	Almost certain	Failure will almost certainly be detected before impact.
2	High likelihood	Failure is likely to be detected.
3	Moderate likelihood	Failure may or may not be detected.
4	Low likelihood	Failure is unlikely to be detected.
5	Very unlikely	Failure detection is almost impossible.

Calculating the Risk Priority Number (RPN)

Once we have rated each failure mode for Severity, Occurrence, and Detection, we calculate its Risk Priority Number (RPN) by multiplying the three ratings together:

RPN = Severity (S) * Occurrence (O) * Detection (D)

The RPN helps us prioritize the failure modes by their overall risk level. Higher RPNs indicate higher risk, guiding us on where to focus our mitigation efforts.
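
The calculation can be sketched in a few lines of Python. This is an illustrative helper rather than an agreed tool: the names `rpn`, `SCALE_MIN`, `SCALE_MAX`, and `MAX_RPN` are hypothetical, and the range check assumes the 1 to 5 ratings shown in the tables above.

```python
SCALE_MIN, SCALE_MAX = 1, 5  # assumption: rating range taken from the tables above


def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Return the Risk Priority Number: Severity * Occurrence * Detection."""
    ratings = (
        ("severity", severity),
        ("occurrence", occurrence),
        ("detection", detection),
    )
    # Reject ratings outside the scale so a typo cannot silently inflate the score.
    for name, value in ratings:
        if not isinstance(value, int) or not SCALE_MIN <= value <= SCALE_MAX:
            raise ValueError(
                f"{name} must be an integer from {SCALE_MIN} to {SCALE_MAX}, got {value!r}"
            )
    return severity * occurrence * detection


# Highest score possible on these scales: 5 * 5 * 5 = 125
MAX_RPN = rpn(SCALE_MAX, SCALE_MAX, SCALE_MAX)
```

Knowing the maximum possible score makes it easier to judge whether a given RPN is high or low on these scales.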

Example of Applying the Criteria

Consider a failure mode where a Lambda function fails to read a parquet file due to permission issues:

Severity (S): Rated as 4 because it significantly impacts system performance.
Occurrence (O): Rated as 3 because permission issues happen occasionally (confidence level 81-90%).
Detection (D): Rated as 2 because such issues are likely to be detected quickly through monitoring and alerts.

RPN = 4 (Severity) * 3 (Occurrence) * 2 (Detection) = 24

This RPN of 24, out of a maximum of 125 (5 * 5 * 5) on these scales, marks a failure mode that we need to address; how high it ranks depends on the RPNs of the other failure modes.
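
To illustrate how this example would sit in a prioritized list, here is a minimal Python sketch. The `FailureMode` class is a hypothetical structure, and the two entries labeled "Hypothetical" carry made-up ratings purely to show the ordering; only the Lambda entry comes from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """Hypothetical record for one failure mode and its three ratings."""
    description: str
    severity: int
    occurrence: int
    detection: int

    @property
    def rpn(self) -> int:
        # RPN = Severity * Occurrence * Detection
        return self.severity * self.occurrence * self.detection


modes = [
    # Ratings from the worked example above.
    FailureMode("Lambda cannot read parquet file (permissions)", 4, 3, 2),
    # Made-up entries, included only to demonstrate ranking.
    FailureMode("Hypothetical failure mode A", 2, 2, 1),
    FailureMode("Hypothetical failure mode B", 5, 2, 4),
]

# Highest RPN first: this is the order in which to consider mitigation.
ranked = sorted(modes, key=lambda m: m.rpn, reverse=True)
for mode in ranked:
    print(f"{mode.rpn:>3}  {mode.description}")
```

With these illustrative ratings, the Lambda failure mode (RPN 24) ranks second, below hypothetical mode B (RPN 40) and above hypothetical mode A (RPN 4).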

Conclusion

Understanding the criteria of Severity, Occurrence, and Detection, and their respective scales, is essential for accurately assessing and prioritizing the potential failure modes in our system. This structured approach helps us systematically identify and address risks, ultimately improving the reliability and robustness of our audit logging service.

In today's session, we will focus on identifying potential failure modes for each component of our audit logging service. This will set the stage for future sessions where we will assess, prioritize, and develop mitigation strategies for these failure modes.