The Fault Management TechKNOWvatioN contains the processes and activities involved with monitoring the systems and networks of a technically complex environment on a real-time basis to isolate, diagnose, log, resolve, and prevent network and systems problems.
The Fault Monitoring TechKNOWvatioN Service is the set of services that monitors the components of a technically complex environment for interruptions to service. Fault Monitoring is performed in real time, at frequent intervals, to ensure that user services remain available.
The Network Fault Monitoring Element is the service that monitors and detects network interruptions. Network Fault Monitoring is concerned with real-time (or near-real-time) monitoring for interruptions to network services and for impending (short-term) interruptions to network services.
The System Monitoring Element is the service that monitors and detects system interruptions. System Fault Monitoring is concerned with real-time (or near-real-time) monitoring for interruptions to system (or server) services and for impending (short-term) interruptions to system services.
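As an illustration of the kind of check a monitoring element might run, the following is a minimal sketch in Python, assuming that "service available" can be approximated by a successful TCP connection. The function names, the target tuple layout, and the default timeout are hypothetical and are not part of the framework described here. The sketch detects only interruptions that have already happened; detecting impending interruptions would need trend or threshold data.

```python
import socket


def service_is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out)
        # when the service cannot be reached.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def poll_once(targets):
    """Check each (name, host, port) target once; return the names that are down."""
    return [
        name
        for name, host, port in targets
        if not service_is_reachable(host, port)
    ]
```

A scheduler would call `poll_once` at whatever interval the environment requires and pass any returned names to the troubleshooting process.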
The Fault Isolation and Recovery TechKNOWvatioN Service is the set of services that isolates, troubleshoots, and corrects system and network interruptions detected by the Fault Monitoring TechKNOWvatioN Service. When a problem is found, Fault Isolation and Recovery is concerned with correcting it as quickly as possible. This TechKNOWvatioN Service is reactive in nature. The focus here is on availability, not security or performance.
The Network Troubleshooting Element is the service that isolates, diagnoses, and resolves interruptions to network services. Network Troubleshooting focuses on the activities and processes needed to find and fix network problems as quickly as possible.
The System Troubleshooting Element is the service that isolates, diagnoses, and resolves interruptions to system services. System Troubleshooting focuses on the activities and processes needed to find and fix system problems as quickly as possible. Systems are defined as primarily server-based components, not individual desktop computers.
The Fault Prevention TechKNOWvatioN Service is the set of services that identifies interruptions in the distributed computing environment, determines their causes, and prevents them from occurring again. Fault Prevention is concerned with performing causal analysis of interruptions and actively monitoring vendor bug reports and patches for fixes that could prevent problems before they occur.
The Fault Analysis Element is the service that analyzes the causes of faults (root cause analysis) and monitors vendor advisories and bug reports to ensure that a technically complex environment is safe from new faults.
The Fault Threshold Determination Element is the service that establishes fault thresholds, alert levels, and processes to be activated when a fault occurs or is about to occur.
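The relationship between fault thresholds, alert levels, and the processes they activate can be sketched as follows. This is a minimal illustration in Python under stated assumptions: the two-level scheme (warning for a fault that is about to occur, critical for a fault that has occurred), the metric name, the numeric limits, and the action text are all hypothetical and would be set by each environment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultThreshold:
    """Hypothetical threshold definition for one monitored metric."""
    metric: str
    warning: float   # value at which a fault is considered about to occur
    critical: float  # value at which a fault is considered to have occurred

    def alert_level(self, value: float) -> str:
        """Map a measured value to an alert level."""
        if value >= self.critical:
            return "critical"
        if value >= self.warning:
            return "warning"
        return "ok"


# Assumed example values; real limits depend on the environment.
disk = FaultThreshold(metric="disk_used_percent", warning=80.0, critical=95.0)

# Assumed mapping from alert level to the process that is activated.
ACTIONS = {
    "ok": "no action",
    "warning": "notify operations staff",
    "critical": "open an incident and begin troubleshooting",
}


def process_for(threshold: FaultThreshold, value: float) -> str:
    """Return the process to activate for a measured value."""
    return ACTIONS[threshold.alert_level(value)]
```

With these assumed limits, a reading of 85.0 for `disk` falls in the warning band and a reading of 95.0 or more is treated as critical.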