What is Server Monitoring?
The concept behind server monitoring is straightforward: it is the regular collection and analysis of data to ensure that servers are performing optimally and fulfilling their intended function. The data used for server monitoring encompasses key performance indicators (KPIs), network connectivity, and application availability. For example, monitoring a Windows file server would examine:
- Server operating system KPIs (CPU, memory, network, and disk performance metrics)
- Network share availability
- Log files
- Event logs
Data from each of these categories is analyzed in order to minimize, or ideally prevent, server outages or slowdowns. The selection of data points and how they are analyzed will vary based on the server and its function. However, the general data collection and evaluation methodology is consistent no matter the operating system or server function.
Server monitoring becomes more complex as IT infrastructures become denser, more complicated, and more dispersed. Collecting significantly larger quantities of server data and analyzing that data quickly is only practical with automation. Automation also allows IT personnel to spend their limited resources on advancing high-value initiatives rather than chasing down avoidable server issues.
Server Monitoring ROI
Server downtime results in costs such as lost sales opportunities, lost productivity, or penalties for not meeting SLA requirements. By reducing downtime, server monitoring minimizes these costs and, when executed properly, reduces operational costs, enhances communication, and increases productivity. When calculating the return on investment for server monitoring, weigh the company-wide costs generated by downtime against the IT resources required to deliver maximum uptime.
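The comparison described above can be framed as simple arithmetic. The following Python sketch is illustrative only: the function names are hypothetical, and every figure is a placeholder rather than a measurement. It treats the reduction in annual downtime cost as the savings and the yearly cost of running the monitoring system as the investment.

```python
def annual_downtime_cost(hours_down_per_year, cost_per_hour):
    """Company-wide cost of downtime over one year."""
    return hours_down_per_year * cost_per_hour


def monitoring_roi(cost_before, cost_after, monitoring_cost):
    """Return ROI as a fraction: (savings - investment) / investment."""
    if monitoring_cost <= 0:
        raise ValueError("monitoring_cost must be positive")
    savings = cost_before - cost_after
    return (savings - monitoring_cost) / monitoring_cost


# Hypothetical placeholder figures, not measurements.
before = annual_downtime_cost(hours_down_per_year=20, cost_per_hour=5_000)
after = annual_downtime_cost(hours_down_per_year=5, cost_per_hour=5_000)
roi = monitoring_roi(before, after, monitoring_cost=25_000)
print(f"ROI: {roi:.0%}")  # prints "ROI: 200%"
```

A fuller estimate would also fold SLA penalties and lost productivity into the hourly cost figure.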
Server Monitoring Design
The following outline lists the items to take into account when implementing a server monitoring system:
What should you monitor?
- Monitor standardized, operating-system-specific KPIs and use appropriate thresholds
- Monitor operating system availability with pings
- Monitor the availability of server-specific functions
- Monitor Event Logs on Windows and syslogs on Unix/Linux and network devices
- Monitor for server-specific known problems (e.g. service or process crashes)
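As a minimal illustration of checking whether a server-specific function is reachable, the Python sketch below attempts a TCP connection to a service port. This is a substitute technique, not an ICMP ping: it only shows that something is accepting connections on the port, not that the service behind it is healthy. The function name and default timeout are assumptions made for the example.

```python
import socket


def tcp_service_available(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and opens the connection;
        # the with-block closes the socket again immediately.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For the Windows file server example, a call such as tcp_service_available("fileserver.example.com", 445) would test the SMB port; the host name here is hypothetical.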
What constitutes a problem?
- KPIs that exceed threshold values
- Event Log errors, or syslog records with Emergency, Alert, or Critical severity
- Ping failures
- Inability to access services
- Occurrence of known problems
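Two of the criteria above, threshold violations and syslog severity, reduce to simple comparisons. The Python sketch below shows one possible shape for them. The severity codes follow the syslog standard (RFC 5424), where 0 is the most severe; the KPI names and threshold values are hypothetical and would need tuning for each operating system and server function.

```python
# Syslog severity codes as defined in RFC 5424 (0 is the most severe).
SYSLOG_SEVERITY = {
    "emergency": 0, "alert": 1, "critical": 2, "error": 3,
    "warning": 4, "notice": 5, "informational": 6, "debug": 7,
}
# Emergency, Alert, and Critical records count as problems.
PROBLEM_SEVERITY_MAX = SYSLOG_SEVERITY["critical"]

# Hypothetical threshold values, chosen only for illustration.
THRESHOLDS = {"cpu_percent": 90.0, "memory_percent": 85.0, "disk_percent": 80.0}


def kpi_violations(sample, thresholds=THRESHOLDS):
    """Return the sorted names of KPIs in sample that exceed their threshold."""
    return sorted(
        name for name, value in sample.items()
        if name in thresholds and value > thresholds[name]
    )


def is_problem_syslog(severity_code):
    """Return True for Emergency (0), Alert (1), or Critical (2) records."""
    return 0 <= severity_code <= PROBLEM_SEVERITY_MAX


sample = {"cpu_percent": 95.0, "memory_percent": 40.0, "disk_percent": 81.0}
print(kpi_violations(sample))  # prints ['cpu_percent', 'disk_percent']
```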
What should you do when a problem is identified?
- Automate an OS command or script to fix the problem if possible
- Prioritize and escalate high-severity alerts with text messages or email alerts
- For recurring problems, build detailed repair notes into the alert to speed repair
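The response steps above can be sketched as a small planning function. Everything named in this Python example is hypothetical: the Alert fields, the severity scale, the lookup tables, and the script name. The choice to send text messages for high-severity alerts and email for the rest is one assumed policy among many. The function only decides what to do; running the command and sending the notification are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    server: str
    problem: str   # hypothetical problem identifier
    severity: str  # hypothetical scale: "high", "medium", or "low"


# Hypothetical lookup tables; a real system would load these from configuration.
AUTO_FIX_COMMANDS = {"service_stopped": "restart-service.sh"}
REPAIR_NOTES = {"disk_full": "Purge rotated logs, then check backup retention."}


def plan_response(alert):
    """Return the ordered list of (action, detail) steps for an alert."""
    steps = []
    # 1. Attempt an automated fix when one is known for this problem.
    command = AUTO_FIX_COMMANDS.get(alert.problem)
    if command is not None:
        steps.append(("run_command", command))
    # 2. Escalate: assumed policy of text message for high severity, else email.
    channel = "text_message" if alert.severity == "high" else "email"
    steps.append((channel, f"{alert.server}: {alert.problem}"))
    # 3. Attach repair notes for recurring problems to speed manual repair.
    notes = REPAIR_NOTES.get(alert.problem)
    if notes is not None:
        steps.append(("attach_notes", notes))
    return steps
```

Keeping the plan separate from its execution makes the escalation policy easy to review and test without touching a live server.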
What are the benefits of analyzing long-term historical server data?
- Root cause analysis of problems that are distributed over time and across multiple devices
- Baselining of behavior by server function, and identification of servers that deviate from the norm
- Historical reporting of server availability and performance
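One simple way to flag servers that deviate from a baseline is a z-score over servers that share a function. The Python sketch below assumes that approach; it is not a method prescribed here, and the server names, CPU figures, and threshold of 2.0 are hypothetical. Each input value stands for one server's long-term average of a single KPI.

```python
from statistics import mean, pstdev


def deviating_servers(averages, z_threshold=2.0):
    """Return sorted names of servers whose average lies more than
    z_threshold population standard deviations from the group mean."""
    values = list(averages.values())
    if len(values) < 2:
        return []  # no meaningful baseline from a single server
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all servers behave identically
    return sorted(
        name for name, value in averages.items()
        if abs(value - mu) / sigma > z_threshold
    )


# Hypothetical long-term average CPU utilization (%) for six web servers.
web_cpu = {"web01": 30, "web02": 30, "web03": 30,
           "web04": 30, "web05": 30, "web06": 90}
print(deviating_servers(web_cpu))  # prints ['web06']
```

With small groups a single outlier inflates the standard deviation, so a median-based measure may be more robust in practice.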