Explanation of FMEA criteria and scales
In the FMEA process, we use three key criteria to evaluate each identified failure mode: Severity, Occurrence, and Detection. Each of these criteria is rated on a scale from 1 to 5. Understanding these criteria and scales is crucial for accurately assessing the potential risks associated with each failure mode.
1. Severity (S)
Severity measures the impact of a failure mode on the system or the end-user if it occurs. It helps us understand how serious the consequences would be.
Scale for Severity:
Rating | Description | Example |
---|---|---|
1 | No effect | The failure has no noticeable impact. |
2 | Minor | The failure causes slight inconvenience or annoyance. |
3 | Moderate | The failure affects system performance but is manageable. |
4 | Major | The failure significantly impacts system performance or user satisfaction. |
5 | Critical | The failure causes total system failure or severe operational disruption. |
2. Occurrence (O)
Occurrence measures how frequently a failure mode is likely to happen. It helps us estimate the likelihood of the failure occurring.
Scale for Occurrence:
Rating | Description | Confidence it will not occur (%) | Example |
---|---|---|---|
1 | Very unlikely to occur | Above 99% | Almost certain that the failure will not occur. |
2 | Unlikely to occur | 90-99% | Rare but possible failure. |
3 | Possible | 81-90% | Occasional failures, needs attention. |
4 | Likely to occur | 71-80% | Regular failures, significant concern. |
5 | Very likely to occur | 70% or less | Frequent failures, immediate action required. |
3. Detection (D)
Detection measures our ability to identify a failure mode before it causes an impact. It helps us understand how likely it is that we can detect the failure before it affects the system.
Scale for Detection:
Rating | Description | Example |
---|---|---|
1 | Almost certain | Failure will almost certainly be detected before impact. |
2 | High likelihood | Failure is likely to be detected. |
3 | Moderate likelihood | Failure may or may not be detected. |
4 | Low likelihood | Failure is unlikely to be detected. |
5 | Very unlikely | Failure detection is almost impossible. |
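The three 1-5 scales above can be encoded as simple lookup tables, which makes ratings self-documenting when recorded in code or reports. This is an illustrative sketch; the labels are taken directly from the tables, but the `describe` helper and its format are our own convention, not part of any FMEA standard:

```python
# Short labels for each 1-5 rating, taken from the scale tables above.
SEVERITY = {1: "No effect", 2: "Minor", 3: "Moderate", 4: "Major", 5: "Critical"}
OCCURRENCE = {1: "Very unlikely to occur", 2: "Unlikely to occur",
              3: "Possible", 4: "Likely to occur", 5: "Very likely to occur"}
DETECTION = {1: "Almost certain", 2: "High likelihood",
             3: "Moderate likelihood", 4: "Low likelihood", 5: "Very unlikely"}

def describe(severity: int, occurrence: int, detection: int) -> str:
    """Human-readable summary of a failure mode's three ratings."""
    return (f"S{severity} ({SEVERITY[severity]}), "
            f"O{occurrence} ({OCCURRENCE[occurrence]}), "
            f"D{detection} ({DETECTION[detection]})")
```

For example, `describe(4, 3, 2)` returns `"S4 (Major), O3 (Possible), D2 (High likelihood)"`.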
Calculating the Risk Priority Number (RPN)
Once we have rated each failure mode for Severity, Occurrence, and Detection, we calculate the Risk Priority Number (RPN) for each failure mode. The RPN is calculated by multiplying the three ratings together:
RPN = Severity (S) * Occurrence (O) * Detection (D)
The RPN helps us prioritize the failure modes by their overall risk level. Higher RPNs indicate higher risk, guiding us on where to focus our mitigation efforts.
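The RPN calculation is a straightforward product of the three ratings. A minimal sketch, assuming the 1-5 scales used in this document (the validation logic is ours, not mandated by FMEA):

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk Priority Number: the product of the three 1-5 ratings."""
    for name, value in (("severity", severity),
                        ("occurrence", occurrence),
                        ("detection", detection)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} rating must be 1-5, got {value}")
    return severity * occurrence * detection
```

With 1-5 scales, RPN values range from 1 (minimal risk) to 125 (maximum risk).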
Example of Applying the Criteria
Consider a failure mode where a Lambda function fails to read a parquet file due to permission issues:
- Severity (S): Rated as 4 because it significantly impacts system performance.
- Occurrence (O): Rated as 3 because permission issues happen occasionally (confidence level 81-90%).
- Detection (D): Rated as 2 because such issues are likely to be detected quickly through monitoring and alerts.
RPN = 4 (Severity) * 3 (Occurrence) * 2 (Detection) = 24
This RPN of 24, out of a maximum possible 125, indicates a moderate-risk failure mode that still warrants mitigation.
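Once several failure modes have been rated, sorting them by RPN gives the mitigation priority order. A sketch using the Lambda example above alongside two hypothetical failure modes (the other entries and their ratings are invented for illustration):

```python
# (name, severity, occurrence, detection) - ratings are illustrative.
failure_modes = [
    ("Lambda cannot read parquet file (permissions)", 4, 3, 2),
    ("Log delivery delayed beyond SLA", 3, 4, 3),
    ("Duplicate audit records written", 2, 3, 3),
]

# Sort highest RPN first to decide where to focus mitigation effort.
prioritised = sorted(
    ((name, s * o * d) for name, s, o, d in failure_modes),
    key=lambda item: item[1],
    reverse=True,
)
for name, value in prioritised:
    print(f"RPN {value:3d}  {name}")
```

Here the delayed-delivery mode (RPN 36) would be addressed first, followed by the permissions issue (RPN 24).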
Conclusion
Understanding the criteria of Severity, Occurrence, and Detection, and their respective scales, is essential for accurately assessing and prioritizing the potential failure modes in our system. This structured approach helps us systematically identify and address risks, ultimately improving the reliability and robustness of our audit logging service.
In today's session, we will focus on identifying potential failure modes for each component of our audit logging service. This will set the stage for future sessions where we will assess, prioritize, and develop mitigation strategies for these failure modes.