系统日志报警汇总

本文汇总了 Linux 系统 syslog 和物理机硬件日志相关的报警说明, 以便于系统问题的发现和诊断.

syslog 日志

syslog 消息报警策略参考以下规则:

(
  (msg:xfs 
    OR msg:hang 
    OR msg:timeout 
    OR msg:error 
    OR msg:"Call Trace" 
    OR msg:"hung_task_timeout_sec" 
    OR msg:"waitingfor controller reset" 
    OR msg:"Out of memory" 
    OR msg:Kill 
    OR msg:segfault  
    OR msg:MCE 
    OR msg:threshold 
    OR msg:Uhhuh 
    OR msg:"soft lockup" 
    OR msg:"blocked for"
  ) AND msg:kernel )

  OR msg:"Too many" 
  OR msg:"SIGTERM"
  OR (pri:0 OR pri:1 OR (pri:2 AND NOT msg:"limit notification")) 
  OR (
        (msg:"kernel edac" AND msg:memory)
       OR (msg:"kernel mce" AND NOT msg:banks) 
       OR (msg:"kernel sbridge")
     ) 
  OR ((msg: "kernel: megasas"  OR  msg: "kernel: megaraid_sas")) 
  OR (pri:4 AND (msg:kernel) AND (msg:ffffffff)
)

上述关键字主要集中在以下几点:

kernel 相关:    包括异常重启信息, 文件系统卡顿信息, OOM 信息, 软锁信息以及 MCE 内存信息;
非 kernel 相关: 包含异常信号信息, 资源限制信息, raid 卡相关信息;

物理机硬件日志

硬件通常可以使用远控日志来做监控分析, 服务器厂商在远控日志方面通常有很详细的说明(比如 DELL-LifeCycle-Log), 一般都会包含日志消息的分类, 级别等. 以 DELL 服务器为例, 下述为简单的消息示例:

日志序列,信息ID, 信息ID, 分类, AgentID, 事件级别, 事件时间, 事件消息, FQDD(Fully Qualified Device Descriptor)
1562181,TMP0120,System,SEL,Warning,2024-03-20 11:22:23,The system inlet temperature is greater than the upper warning threshold.,System.Embedded.1

备注: 分类一般包含 System, System Health, Storage, Audit, Configuration 等, AgentId 等同消息组件的来源, 一般包含 SEL(传感器), iDRAC, RACLOG 等.

常见的信息ID 参考以下, 可以按不通的信息ID 来调整报警策略:

更多见: DELL-LifeCycle-Log

级别 信息ID 消息说明
Information BAT1027 The battery successfully completed a charge cycle
Information PDR10 This message is generated after a rebuild starts on a physical disk.
Information PDR54 This message is generated after a disk media error is corrected on a physical disk.
Information SYS1003 System is performing a CPU reset because of system power off, power on or a warm reset like CTRL-ALT-DEL.
Wranning BAT0000 System settings may be preserved if input power is not removed from the power supplies.
Wranning BAT1033 The controller cannot communicate with the battery. Either the battery was removed, or the contact point between the controller and the battery is degraded.
Wranning CPU0012 Correctable Machine Check Exception detected on CPU arg1.
Wranning FAN0000 The fan is not performing optimally. The fan may be installed improperly or may be failing.
Wranning HWC8607 The data communication with the device NIC in Slot 2 running on the port 1 is lost.
Wranning JCP042 Job XXXX failed because Unable to complete the job because of an error during iDRAC firmware update
Wranning MEM0701 The memory may not be operational. This an early indicator of a possible future uncorrectable error.
Wranning NIC100 The network link is down. Either the network cable is not connected or the network device is not working.
Wranning PDR16 The physical disk is predicted to fail. Many physical disks contain Self Monitoring Analysis and Reporting Technology (SMART). When enabled, SMART monitors the disk health based on indications such as the number of write operations that were performed on the disk.
Wranning PDR5 A physical disk has been removed from the disk group. This alert can also be caused by loose or defective cables or by problems with the enclosure.
Wranning PDR50 The global hot spare is not large enough to protect all virtual disks that reside on the controller.
Wranning TMP0118 Ambient air temperature is too cool.
Wranning TMP0120 Ambient air temperature is too warm.
Wranning VDR8 This message occurs when a physical disk in the disk group was removed or when a physical disk included in a redundant virtual disk fails. Because the virtual disk is redundant (uses mirrored or parity information) and only one physical disk has failed, the virtual disk can be rebuilt.
Critical BAT0021 The xxxx battery has reached the end of its usable life or has failed
Critical HWC2003 The cable may be necessary for proper operation. System functionality may be degraded.
Critical PDR1001 The controller detected a failure on the disk and has taken the disk offline.
Critical PDR1016 The controller detected that the drive was removed.
Critical FAN0001 The fan is not performing optimally. The fan may be installed improperly or may be failing.
Critical PDR3 The RAID Controller may not be able to read/write data to the physical disk drive indicated in the message. This may be due to a failure with the physical disk drive or because the physical disk drive was removed from the system.
Critical PSU0003 The power supply is installed correctly but an input source is not connected or is not functional.
Critical MEM0001 The memory has encountered a uncorrectable error. System performance may be degraded. The operating system and/or applications may fail as a result.
Critical MEM0702 The memory may not be operational. This an early indicator of a possible future uncorrectable error.
Critical UEFI0079 One or more Uncorrectable Memory errors occurred in the previous boot.
Critical VDR34 Background initialization of a virtual disk failed.
Critical VDR7 One or more physical disks included in the virtual disk have failed. If the virtual disk is non-redundant (does not use mirrored or parity data), then the failure of a single physical disk can cause the virtual disk to fail. If the virtual disk is redundant, then more physical disks have failed than you can rebuild using mirrored or parity information.
Critical VDR8 This message occurs when a physical disk in the disk group was removed or when a physical disk included in a redundant virtual disk fails. Because the virtual disk is redundant (uses mirrored or parity information) and only one physical disk has failed, the virtual disk can be rebuilt.
Critical VLT0204 System hardware detected an over voltage or under voltage condition. If multiple voltage exceptions occur consecutively the system may power down in failsafe mode.

其它说明

除了日志报警, 也可以考虑更多的辅助功能, 比如以下列出的几点:

1. 日志汇总到 ELK 方便查看;
2. 基于 ELK 做一些主机日志的检查, 尽量覆盖可能没有配置 syslog, 或者 syslog hang 等的情况;
3. 日志可能延迟(系统修改时间或日志接收端阻塞), 报警策略可以基于日志接收时间调整, 而不使用日志产生时间.

另外基于日志, 也可以做更多的异常分析, 比如系统意外重启等情况, 可以参考以下说明:

redhat-206873
troubleshoot-unexpected-server-shutdown