Building a Nagios Core + SNMP Monitoring Platform

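Our company's monitoring and alerting currently relies on home-grown Python scripts that read various performance metrics and check whether they exceed a threshold. The script is deployed on every monitored machine; a central monitoring platform polls each monitored host periodically for threshold alerts and sends an email when there is one. With only a few servers this put little pressure on operations, but the fleet has now grown to several dozen servers spread across Alibaba Cloud and our own data center, so we need a more reliable, unified monitoring and alerting system. That led us to Nagios.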

Introduction to Nagios

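Nagios Core itself is an open-source framework. You register the hosts and services to be monitored; it periodically invokes plugins to check their status, provides a web interface for viewing that status, and can send alerts by email and other channels. Communication with the monitored hosts, and what exactly gets monitored, are implemented by plugins. The official plugins are installed with the Nagios Plugins package; to use the SNMP plugins, some dependency packages must be installed beforehand.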

nagios-001

Nagios Deployment Plan

Operating system: CentOS 7
Internal host: 172.16.0.100
Public host: xxx.xxx.xxx.xxx

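The plan is to deploy Nagios, Apache and PHP on the internal host; Apache and PHP mainly support the CGI programs and the web interface. The public host already runs Nginx, which will be configured to expose the Nagios web interface. Every other machine to be monitored gets SNMP installed.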

Setting Up the Nagios Monitoring Platform

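  • Install the dependencies.
yum install gcc glibc glibc-common gd gd-devel make openssl-devel  # build toolchain and libraries; skip whatever is already installed
yum install httpd php php-cli  # install Apache and PHP
yum install net-snmp net-snmp-devel net-snmp-perl  net-snmp-python net-snmp-utils  # SNMP dependencies; skip if not needed
  • Create the user and group.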
useradd nagios
groupadd nagcmd
usermod -a -G nagcmd nagios
usermod -a -G nagcmd apache
  • Download and install Nagios Core.
cd ~
curl -L -O https://assets.nagios.com/downloads/nagioscore/releases/nagios-4.1.1.tar.gz
tar zxvf nagios-4.1.1.tar.gz
cd nagios-4.1.1/
./configure --with-command-group=nagcmd
make all
make install
make install-commandmode
make install-init
make install-config
make install-webconf
  • Download and install the Nagios plugins.
cd ~
curl -L -O http://www.nagios-plugins.org/download/nagios-plugins-2.1.1.tar.gz
tar zxvf nagios-plugins-2.1.1.tar.gz
cd nagios-plugins-2.1.1/
./configure --with-nagios-user=nagios --with-nagios-group=nagios --with-openssl
make all
make install
  • Start Apache and set up web access credentials.
systemctl start httpd
htpasswd -c /usr/local/nagios/etc/htpasswd.users nagiosadmin  # prompts you to set the nagiosadmin password
systemctl restart httpd
  • Validate the configuration and start Nagios.
/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg  # checks every cfg file and reports any problems
systemctl start nagios

nagios-002

  • Configure Nginx on the public host and access the web interface.

Add the following location block, reload Nginx, then visit http://xxx.xxx.xxx.xxx/nagios/

location /nagios/ {
    proxy_pass         http://172.16.0.100/nagios/;
    proxy_redirect     off;
    proxy_http_version 1.1;
    proxy_set_header   Host             $host;
    proxy_set_header   X-Real-IP        $remote_addr;
    proxy_set_header   X-Forwarded-For  $proxy_add_x_forwarded_for;
    proxy_set_header   Connection       "";
}

nagios-003

Installing SNMP on the Monitored Servers

yum install -y net-snmp net-snmp-utils
cat /etc/snmp/snmpd.conf  # inspect the current configuration and adjust it as needed (see the snmpd.conf example below)
systemctl start snmpd
systemctl enable snmpd

Test the connection from the Nagios server:

/usr/local/nagios/libexec/check_snmp -H 10.172.228.11 -C public -o .1.3.6.1.2.1.1.4.0

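If the connection fails, check whether the OID is covered by the view configured in /etc/snmp/snmpd.conf on the client side, and whether access has been granted to the Nagios server.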
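
As a rough aid for that troubleshooting step, the sketch below maps error text commonly printed by the net-snmp command-line tools to the most likely cause. The function name snmp_error_hint is hypothetical, the mapping is only a first guess, and the wording of the messages may differ between net-snmp versions.

```shell
# snmp_error_hint "ERROR TEXT" -> prints a one-line troubleshooting hint.
# Hypothetical helper; the matched strings are typical net-snmp messages.
snmp_error_hint() {
    case "$1" in
        *"Timeout: No Response"*)
            # Nothing came back at all: wrong community, source address
            # not allowed by com2sec, or UDP port 161 blocked.
            echo "no response: check the community, the com2sec source and the firewall (UDP 161)"
            ;;
        *"noSuchName"*|*"No Such Object"*|*"No Such Instance"*)
            # The agent answered, but the OID is missing or outside the view.
            echo "OID not visible: check the view and access lines in snmpd.conf"
            ;;
        *)
            echo "unrecognised error: rerun snmpget by hand for details"
            ;;
    esac
}

# Example (commented out so the block has no side effects):
# snmp_error_hint "$(snmpget -v2c -c public 10.172.228.11 .1.3.6.1.2.1.1.4.0 2>&1)"
```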

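Configuration Examples

The steps above provide web access to Nagios and establish SNMP communication between the Nagios platform and each machine. Next, edit the configuration files to set up the actual monitoring and alerting.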

  • /etc/snmp/snmpd.conf

The monitored host exposes a view to Nagios; setting the OID to .1 includes the whole tree from the root.

# map the community name "public" into a "security name"
#             sec.name          source          community
#com2sec      notConfigUser      default           public
com2sec      notConfigUser      10.173.38.200           public
#com2sec        mynetwork       127.0.0.1         private
# allow mynetwork2 access from 192.168.0.0 to 192.168.0.255; the community is "public2"
#com2sec        mynetwork2     192.168.0.0/24     public2
#com2sec        mynetwork   nagios.agix.com.au    myweb

# map the security name into a group name:
#             groupName      securityModel      securityName
group        notConfigGroup        v1           notConfigUser
group        notConfigGroup        v2c          notConfigUser

# create views to let the group have rights to "snmpwalk -v1 -c public localhost .1"
#              name            incl/excl       subtree         [mask](optional)
view        systemview         included          .1
# .1 is the root of the tree and contains every node.
#view       systemview         included       .1.3.6.1.4.1.2021.2.1
#view       systemview         included       .1.3.6.1.4.1.2021.10.1

# grant the group read-only access to the systemview view.
#              group       context    sec.model   sec.level   prefix    read        write      notif
access     notConfigGroup     ""         any        noauth     exact  systemview     none      none

# monitor all disks; the results appear under .1.3.6.1.4.1.2021.9.1
includeAllDisks 10%

#disk /
#disk /alidata1 80%

# Keep 100 MB free for /alidata2
#disk /alidata2 100000

# processes to monitor; the results appear under .1.3.6.1.4.1.2021.2.1
proc    mongod
proc    mongos
proc    java
proc    nginx
proc    mysqld
proc    postgres
proc    h2o

# set process sshd min 1 max 10
#proc    sshd    10    1

#proc       httpd
#procfix    httpd  systemctl restart httpd

# MIB-Specific extension commands. to make your information available.
#pass    .1.3.6.1.4.1.4413.4.1    /usr/bin/ucd5820stat

# give a user or community access to the full OID tree, either read-only or read-write:
# rouser/rwuser (for SNMPv3) or rocommunity/rwcommunity (for SNMPv1 or SNMPv2c)
#rouser         supermfp
#rwuser         supermfp2
#rocommunity    supermfp
#rwcommunity    supermfp2


# some system contact information
#syslocation        Unknown (edit /etc/snmp/snmpd.conf)
#syscontact         Root <root@localhost> (configure /etc/snmp/snmp.local.conf)

# some logging settings
dontLogTCPWrappersConnects    yes



##### set proxy for another snmpd. #####
# map the community name "ctx_m108" into a "security name"
#         [-Cn context]        sec.name          source           community
#com2sec    -Cn ctx_m108       notConfigUser      default          ctx_m108
#com2sec    -Cn ctx_hm         notConfigUser      10.173.38.200    ctx_hm

# grant the group read-only access to the systemview view with context name "ctx_m108".
#              group            context   sec.model   sec.level   prefix      read        write    notif
#access      notConfigGroup     ctx_m108     any       noauth     prefix    systemview     none     none
#access      notConfigGroup     ctx_hm       any       noauth     prefix    systemview     none     none

# define a proxy with context name "ctx_m108" for host 10.174.8.228.
# the OID .1.3 must be inside a view, and 10.174.8.228 must be reachable from this host.
#        [-Cn contextname] [snmpcmd_args]         host           oid         [remoteoid]
#proxy     -Cn ctx_m108     -v 1 -c public      10.174.8.228     .1.3
#proxy     -Cn ctx_hm       -v 2c -c ctx_hs_9   175.25.21.94     .1.3
  • /usr/local/nagios/etc/nagios.cfg

Edit the cfg_file or cfg_dir directives to register the custom cfg files.

cfg_dir=/usr/local/nagios/etc/mfp_servers
cfg_dir=/usr/local/nagios/etc/mfp_config
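
Nagios reads every file ending in .cfg under a cfg_dir, including those in subdirectories. The sketch below lists what would be picked up, which is handy before running the -v validation. The function name list_cfg_files is hypothetical, and it only approximates Nagios's own file discovery.

```shell
# list_cfg_files DIR -> prints every *.cfg file under DIR, one per line, sorted.
# Hypothetical helper that approximates how a cfg_dir is scanned.
list_cfg_files() {
    find "$1" -type f -name '*.cfg' | sort
}

# Example (commented out so the block has no side effects):
# list_cfg_files /usr/local/nagios/etc/mfp_servers
```

A file saved as mongodb-01.cfg.bak or mongodb-01.conf would not appear in the listing, which is a simple way to disable a host temporarily.
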
  • /usr/local/nagios/libexec/mfp_nagios_snmp_plugin.sh

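A custom plugin script, referenced later from mfp_command.cfg. The script must be owned by the same user that runs Nagios. The mainstream SNMP plugins are check_snmp (hard to customize) and a pile of *.pl scripts (we did not want to depend on too much Perl), so we wrote our own shell plugin on top of snmpget to fit our requirements.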

#!/bin/bash 
# -H IP -C COMMUNITY -M METHOD -o OID -w WARNING_THRESHOLD -c CRITICAL_THRESHOLD -s SUBOID
# getopts option names must be a single character.
# don't reuse shell or system variable names (e.g. PATH) for local values.
METHOD=
IP=
COMMUNITY=
OID=
SUBOID=
WARNING_THRESHOLD=
CRITICAL_THRESHOLD=

while getopts M:H:C:o:w:c:s: OPTION
do
     case $OPTION in
      M)
       METHOD=$OPTARG 
       ;;
      H)
       IP=$OPTARG 
       ;;
      C)
       COMMUNITY=$OPTARG
       ;;
      o)
       OID=$OPTARG
       ;;
      w)
       WARNING_THRESHOLD=$OPTARG
       ;;
      c)
       CRITICAL_THRESHOLD=$OPTARG
       ;;
      s)
       SUBOID=$OPTARG
       ;;
     esac
done
# echo $METHOD $IP $COMMUNITY $OID $WARNING_THRESHOLD $CRITICAL_THRESHOLD $SUBOID

# -C mycommunity -H xxx.xxx.xxx.xxx -o someoid -w 80 -c 100
check_snmp(){
    #FOCUS=$(snmpget -c $COMMUNITY -v1 $IP $OID | awk '{print $4}' )
    FOCUS=$(snmpget -c $COMMUNITY -v2c -OqvtU $IP $OID)
    DETAIL="$FOCUS"
    print_status $FOCUS $WARNING_THRESHOLD $CRITICAL_THRESHOLD "$DETAIL --warn ($WARNING_THRESHOLD)"
}

# -C mycommunity -H xxx.xxx.xxx.xxx -M load -w 400 -c 800
check_load(){
    OID=".1.3.6.1.4.1.2021.10.1.5.1 .1.3.6.1.4.1.2021.10.1.3.1 .1.3.6.1.4.1.2021.10.1.3.2 .1.3.6.1.4.1.2021.10.1.3.3"
    FOCUS=$(snmpget -c $COMMUNITY -v1 -OqvtU $IP $OID)
    FOCUSES=($FOCUS)
    
    # TODO cpu num
    LOAD_1_INT=${FOCUSES[0]}
    LOAD_1=${FOCUSES[1]}
    LOAD_5=${FOCUSES[2]}
    LOAD_15=${FOCUSES[3]}
    
    DETAIL="load average : "$LOAD_1", "$LOAD_5", "$LOAD_15

    print_status $LOAD_1_INT $WARNING_THRESHOLD $CRITICAL_THRESHOLD "$DETAIL --warn ($WARNING_THRESHOLD)"
}

# -C mycommunity -H xxx.xxx.xxx.xxx -M uptime -w 365 -c 800
check_uptime(){
    # unix only
    OID=".1.3.6.1.2.1.25.1.1.0"
    FOCUS=$(snmpget -c $COMMUNITY -v1 -OqvtU $IP $OID)

    # hrSystemUptime is reported in TimeTicks (1/100 s), so 6000 ticks = 1 minute
    DAY=$(expr $FOCUS / 6000 / 60 / 24)
    HOUR=$(expr $FOCUS / 6000 / 60 % 24)
    MINUTE=$(expr $FOCUS / 6000 % 60)
    
    DETAIL="system uptime : "$DAY" days "$HOUR" hours "$MINUTE" minutes"

    print_status $DAY $WARNING_THRESHOLD $CRITICAL_THRESHOLD "$DETAIL --warn ($WARNING_THRESHOLD)"
}

# -C mycommunity -H xxx.xxx.xxx.xxx -M disk -w 80 -c 95 -s 1
check_disk(){
    OID=".1.3.6.1.4.1.2021.9.1.2.$SUBOID .1.3.6.1.4.1.2021.9.1.3.$SUBOID .1.3.6.1.4.1.2021.9.1.6.$SUBOID .1.3.6.1.4.1.2021.9.1.7.$SUBOID .1.3.6.1.4.1.2021.9.1.8.$SUBOID .1.3.6.1.4.1.2021.9.1.9.$SUBOID"
    FOCUS=$(snmpget -c $COMMUNITY -v1 -OqvtU $IP $OID)
    FOCUSES=($FOCUS)
    
    DISK_PATH=${FOCUSES[0]}
    DEVICE=${FOCUSES[1]}
    TOTAL=${FOCUSES[2]}
    AVAIL=${FOCUSES[3]}
    USE=${FOCUSES[4]}
    PERCENTAGE_USE=${FOCUSES[5]}
    CANUSE_M=$(expr $AVAIL / 1024)
    TOTAL_M=$(expr $TOTAL / 1024)
    USE_M=$(expr $USE / 1024)
    
    DETAIL="disk  "$DISK_PATH" ("$DEVICE") : total "$TOTAL_M" MB - used "$USE_M" MB ("$PERCENTAGE_USE"%) - free "$CANUSE_M" MB"

    print_status $PERCENTAGE_USE $WARNING_THRESHOLD $CRITICAL_THRESHOLD "$DETAIL --warn ($WARNING_THRESHOLD)"
}

# -C mycommunity -H xxx.xxx.xxx.xxx -M memory -w 90 -c 96
check_memory(){
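    # memTotalReal, memAvailReal, memBuffer, memCached; .4.11.0 (memTotalFree) would also count free swap
    OID=".1.3.6.1.4.1.2021.4.5.0 .1.3.6.1.4.1.2021.4.6.0 .1.3.6.1.4.1.2021.4.14.0 .1.3.6.1.4.1.2021.4.15.0"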
    FOCUS=$(snmpget -c $COMMUNITY -v1 -OqvtU $IP $OID)
    FOCUSES=($FOCUS)
    
    TOTAL=${FOCUSES[0]}
    AVAIL=${FOCUSES[1]}
    BUFFER=${FOCUSES[2]}
    CACHE=${FOCUSES[3]}
    CANUSE=$(expr $AVAIL + $BUFFER + $CACHE)
    USE=$(expr $TOTAL - $CANUSE)
    PERCENTAGE_USE=$(expr $USE \* 100 / $TOTAL)
    CANUSE_M=$(expr $CANUSE / 1024)
    TOTAL_M=$(expr $TOTAL / 1024)
    USE_M=$(expr $USE / 1024)
    
    DETAIL="memory usage : total "$TOTAL_M" MB - used "$USE_M" MB ("$PERCENTAGE_USE"%) - free "$CANUSE_M" MB"

    print_status $PERCENTAGE_USE $WARNING_THRESHOLD $CRITICAL_THRESHOLD "$DETAIL --warn ($WARNING_THRESHOLD)"
}

# -C mycommunity -H xxx.xxx.xxx.xxx -M net -w 60000 -c 90000 -s 2
check_net(){
    # OID=".1.3.6.1.2.1.2.2.1.2.$SUBOID .1.3.6.1.2.1.2.2.1.10.$SUBOID .1.3.6.1.2.1.2.2.1.16.$SUBOID"
    # FOCUS=$(snmpget -c $COMMUNITY -v1 -OqvtU $IP $OID)
    OID=".1.3.6.1.2.1.31.1.1.1.1.$SUBOID .1.3.6.1.2.1.31.1.1.1.6.$SUBOID .1.3.6.1.2.1.31.1.1.1.10.$SUBOID"
    FOCUS=$(snmpget -c $COMMUNITY -v2c -OqvtU $IP $OID)
    FOCUSES=($FOCUS)
    
    NET_INTERFACE=${FOCUSES[0]}
    NET_IN=${FOCUSES[1]}
    NET_OUT=${FOCUSES[2]}
    DATA_FILE=/var/tmp/net_${IP}_${NET_INTERFACE}_${COMMUNITY}.data
    NET_TIMESTAMP=$(date '+%s')
    
    ###
    #  NET_TIMESTAMP_L=
    #  NET_IN_L=
    #  NET_OUT_L=
    ###
    if [ -f $DATA_FILE ]; then
        while read line
        do
            eval "$line"
        done < $DATA_FILE
    fi
    
    echo "NET_TIMESTAMP_L="$NET_TIMESTAMP > $DATA_FILE
    echo "NET_IN_L="$NET_IN >> $DATA_FILE
    echo "NET_OUT_L="$NET_OUT >> $DATA_FILE
    
    if [ "x$NET_TIMESTAMP_L" == "x" ]; then
        echo "OK: init".
        exit 0
    fi
    
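    # SECONDS is a bash special variable, so use another name; also guard against a zero interval
    ELAPSED=$(expr $NET_TIMESTAMP - $NET_TIMESTAMP_L)
    if [ "$ELAPSED" -le 0 ]; then
        echo "OK: sampling interval too short".
        exit 0
    fi
    NET_IN_USE=$(expr $NET_IN - $NET_IN_L)
    NET_OUT_USE=$(expr $NET_OUT - $NET_OUT_L)
    NET_IN_TRAFFIC=$(expr $NET_IN_USE / $ELAPSED / 1024)
    NET_OUT_TRAFFIC=$(expr $NET_OUT_USE / $ELAPSED / 1024)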
    NET_ALL_TRAFFIC=$(expr $NET_IN_TRAFFIC + $NET_OUT_TRAFFIC)
    
    DETAIL="net interface "$NET_INTERFACE" : total "$NET_ALL_TRAFFIC" KB/s - in "$NET_IN_TRAFFIC" KB/s - out "$NET_OUT_TRAFFIC" KB/s"

    print_status $NET_ALL_TRAFFIC $WARNING_THRESHOLD $CRITICAL_THRESHOLD "$DETAIL --warn ($WARNING_THRESHOLD)"
}

# -C mycommunity -H xxx.xxx.xxx.xxx -M cpu -w 80 -c 95
check_cpu(){
    OID=".1.3.6.1.4.1.2021.11.9.0 .1.3.6.1.4.1.2021.11.10.0 .1.3.6.1.4.1.2021.11.11.0"
    FOCUS=$(snmpget -c $COMMUNITY -v1 -OqvtU $IP $OID)
    FOCUSES=($FOCUS)
    
    PERCENTAGE_USER=${FOCUSES[0]}
    PERCENTAGE_SYS=${FOCUSES[1]}
    PERCENTAGE_IDLE=${FOCUSES[2]}
    PERCENTAGE_USE=$(expr $PERCENTAGE_USER + $PERCENTAGE_SYS)
    
    DETAIL="cpu usage : user "$PERCENTAGE_USER"% - system "$PERCENTAGE_SYS"% - idle "$PERCENTAGE_IDLE"%"

    print_status $PERCENTAGE_USE $WARNING_THRESHOLD $CRITICAL_THRESHOLD "$DETAIL --warn ($WARNING_THRESHOLD)"
}

# -C mycommunity -H xxx.xxx.xxx.xxx -M io -w 2000 -c 3000 -s 1
check_io(){
    #OID=".1.3.6.1.4.1.2021.13.15.1.1.2.$SUBOID .1.3.6.1.4.1.2021.13.15.1.1.3.$SUBOID .1.3.6.1.4.1.2021.13.15.1.1.4.$SUBOID"
    #FOCUS=$(snmpget -c $COMMUNITY -v1 -OqvtU $IP $OID)
    OID=".1.3.6.1.4.1.2021.13.15.1.1.2.$SUBOID .1.3.6.1.4.1.2021.13.15.1.1.12.$SUBOID .1.3.6.1.4.1.2021.13.15.1.1.13.$SUBOID"
    FOCUS=$(snmpget -c $COMMUNITY -v2c -OqvtU $IP $OID)
    FOCUSES=($FOCUS)
    
    IO_DISK=${FOCUSES[0]}
    IO_READ=${FOCUSES[1]}
    IO_WRITE=${FOCUSES[2]}
    DATA_FILE=/var/tmp/io_${IP}_${IO_DISK}_${COMMUNITY}.data
    IO_TIMESTAMP=$(date '+%s')
    
    ###
    #  IO_TIMESTAMP_L=
    #  IO_READ_L=
    #  IO_WRITE_L=
    ###
    if [ -f $DATA_FILE ]; then
        while read line
        do
            eval "$line"
        done < $DATA_FILE
    fi
    
    echo "IO_TIMESTAMP_L="$IO_TIMESTAMP > $DATA_FILE
    echo "IO_READ_L="$IO_READ >> $DATA_FILE
    echo "IO_WRITE_L="$IO_WRITE >> $DATA_FILE
    
    if [ "x$IO_TIMESTAMP_L" == "x" ]; then
        echo "OK: init".
        exit 0
    fi
    
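    # SECONDS is a bash special variable, so use another name; also guard against a zero interval
    ELAPSED=$(expr $IO_TIMESTAMP - $IO_TIMESTAMP_L)
    if [ "$ELAPSED" -le 0 ]; then
        echo "OK: sampling interval too short".
        exit 0
    fi
    IO_READ_USE=$(expr $IO_READ - $IO_READ_L)
    IO_WRITE_USE=$(expr $IO_WRITE - $IO_WRITE_L)
    IO_READ_TRAFFIC=$(expr $IO_READ_USE / $ELAPSED / 1024)
    IO_WRITE_TRAFFIC=$(expr $IO_WRITE_USE / $ELAPSED / 1024)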
    IO_ALL_TRAFFIC=$(expr $IO_READ_TRAFFIC + $IO_WRITE_TRAFFIC)
    
    DETAIL="io disk "$IO_DISK" : total "$IO_ALL_TRAFFIC" KB/s - read "$IO_READ_TRAFFIC" KB/s - write "$IO_WRITE_TRAFFIC" KB/s"

    print_status $IO_ALL_TRAFFIC $WARNING_THRESHOLD $CRITICAL_THRESHOLD "$DETAIL --warn ($WARNING_THRESHOLD)"
}

# -C mycommunity -H xxx.xxx.xxx.xxx -M proc -w 10 -c 20 -s 1
check_process(){
    OID=".1.3.6.1.4.1.2021.2.1.2.$SUBOID .1.3.6.1.4.1.2021.2.1.5.$SUBOID"
    FOCUS=$(snmpget -c $COMMUNITY -v1 -OqvtU $IP $OID)
    FOCUSES=($FOCUS)
    
    PROC_NAME=${FOCUSES[0]}
    PROC_COUNT=${FOCUSES[1]}
    
    DETAIL="process ( "$PROC_NAME" ) : count "$PROC_COUNT

    print_status $PROC_COUNT $WARNING_THRESHOLD $CRITICAL_THRESHOLD "$DETAIL --warn ($WARNING_THRESHOLD)"
}

# -C mycommunity -H xxx.xxx.xxx.xxx -M proc_count -w 120 -c 200
check_process_count(){
    OID="host.hrSystem.hrSystemProcesses.0"
    FOCUS=$(snmpget -c $COMMUNITY -v1 -OqvtU $IP $OID)
    FOCUSES=($FOCUS)
    
    PROC_COUNT=${FOCUSES[0]}
    
    DETAIL="processes : count "$PROC_COUNT

    print_status $PROC_COUNT $WARNING_THRESHOLD $CRITICAL_THRESHOLD "$DETAIL --warn ($WARNING_THRESHOLD)"
}

# -C mycommunity -H xxx.xxx.xxx.xxx -M tcp_open -w 1600 -c 2400
check_tcp_open(){
    OID="tcp.tcpCurrEstab.0"
    FOCUS=$(snmpget -c $COMMUNITY -v1 -OqvtU $IP $OID)
    FOCUSES=($FOCUS)
    
    TCP_CURRENT_ESTABLISH=${FOCUSES[0]}
    
    DETAIL="tcp : current established "$TCP_CURRENT_ESTABLISH

    print_status $TCP_CURRENT_ESTABLISH $WARNING_THRESHOLD $CRITICAL_THRESHOLD "$DETAIL --warn ($WARNING_THRESHOLD)"
}

# threshold formats: "80" or ":80" alerts when value > 80; "80:" alerts when value < 80; "70:80" alerts when value < 70 or > 80
print_status(){
    FOCUS=$1
    WARN=$2
    CRITICAL=$3
    DETAIL="$4"
    
    if [ "x$FOCUS" == "x" ] || [ "x$WARN" == "x" ] || [ "x$CRITICAL" == "x" ]; then
    	echo "UNKNOWN: $DETAIL".
    	exit 3
    fi
    
    print_status_sub "$CRITICAL" "CRITICAL" 2 "$FOCUS"
    
    print_status_sub "$WARN" "WARNING" 1 "$FOCUS"
    
    echo "OK: $DETAIL".
    exit 0
}

print_status_sub(){
	STR_ABC=$1
	HEAD_INFO=$2
	EXIT_CODE=$3
	FOCUS=$4
    
	maohao=$(expr index "$STR_ABC" ":")
	LEFT_RANGE=""
	RIGHT_RANGE=""
	
	if [ $maohao -eq 0 ]; then
		RIGHT_RANGE=$STR_ABC
	elif [ $maohao -eq 1 ]; then
		RIGHT_RANGE=${STR_ABC:$maohao}
	elif [ $maohao -eq ${#STR_ABC} ]; then
		LEFT_RANGE=${STR_ABC:0:$maohao-1}
	else	
		RIGHT_RANGE=${STR_ABC:$maohao}
		LEFT_RANGE=${STR_ABC:0:$maohao-1}
	fi
	
	if [ "x$RIGHT_RANGE" != "x" ] && [ $FOCUS -gt $RIGHT_RANGE ]; then
        echo "$HEAD_INFO: $DETAIL".
        exit $EXIT_CODE
    fi
	
	if [ "x$LEFT_RANGE" != "x" ] && [ $FOCUS -lt $LEFT_RANGE ]; then
        echo "$HEAD_INFO: $DETAIL".
        exit $EXIT_CODE
    fi
}

case "$METHOD" in
    load)          check_load
                   ;;
    uptime)        check_uptime
                   ;;
    disk)          check_disk
                   ;;
    memory)        check_memory
                   ;;
    net)           check_net
                   ;;
    cpu)           check_cpu
                   ;;
    io)            check_io
                   ;;
    proc)          check_process
                   ;;
    proc_count)    check_process_count
                   ;;
    tcp_open)      check_tcp_open
                   ;;
    *)             check_snmp
                   ;;
esac
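
The threshold syntax handled by print_status (80, :80, 80:, 70:80) can be tried out in isolation. The following is a minimal sketch rather than part of the plugin: the function name breaches_threshold is hypothetical, and like the plugin it assumes integer values.

```shell
# breaches_threshold VALUE SPEC
# Returns 0 (true) when VALUE falls outside the range described by SPEC:
#   "80" or ":80"  -> alert when VALUE > 80
#   "80:"          -> alert when VALUE < 80
#   "70:80"        -> alert when VALUE < 70 or VALUE > 80
# Hypothetical helper; integers only.
breaches_threshold() {
    local value="$1"
    local spec="$2"
    local low=""
    local high=""
    case "$spec" in
        *:*) low=${spec%%:*}; high=${spec#*:} ;;   # text before / after the colon
        *)   high=$spec ;;                         # bare number is an upper bound
    esac
    if [ -n "$high" ] && [ "$value" -gt "$high" ]; then
        return 0
    fi
    if [ -n "$low" ] && [ "$value" -lt "$low" ]; then
        return 0
    fi
    return 1
}

# Example (commented out so the block has no side effects):
# breaches_threshold 97 80:95 && echo "outside the allowed range"
```

The lower-bound form is what makes a check like proc useful: a threshold of 1: alerts when the process count drops below 1.
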
  • /usr/local/nagios/etc/mfp_config/mfp_command.cfg

Define the commands in Nagios format; they simply call the plugin script.

# mfp_load: Load Average
define command{
    command_name    mfp_load
    command_line    $USER1$/mfp_nagios_snmp_plugin.sh -C $ARG1$ -H $HOSTADDRESS$ -M load -w $ARG2$ -c $ARG3$
}

# mfp_uptime: System Uptime
define command{
    command_name    mfp_uptime
    command_line    $USER1$/mfp_nagios_snmp_plugin.sh -C $ARG1$ -H $HOSTADDRESS$ -M uptime -w $ARG2$ -c $ARG3$
}

# mfp_memory: Memory
define command{
    command_name    mfp_memory
    command_line    $USER1$/mfp_nagios_snmp_plugin.sh -C $ARG1$ -H $HOSTADDRESS$ -M memory -w $ARG2$ -c $ARG3$
}

# mfp_disk: Disk x
define command{
    command_name    mfp_disk
    command_line    $USER1$/mfp_nagios_snmp_plugin.sh -C $ARG1$ -H $HOSTADDRESS$ -M disk -w $ARG2$ -c $ARG3$ -s $ARG4$
}

# mfp_net: Net x
define command{
    command_name    mfp_net
    command_line    $USER1$/mfp_nagios_snmp_plugin.sh -C $ARG1$ -H $HOSTADDRESS$ -M net -w $ARG2$ -c $ARG3$ -s $ARG4$
}

# mfp_io: IO x
define command{
    command_name    mfp_io
    command_line    $USER1$/mfp_nagios_snmp_plugin.sh -C $ARG1$ -H $HOSTADDRESS$ -M io -w $ARG2$ -c $ARG3$ -s $ARG4$
}

# mfp_cpu: CPU
define command{
    command_name    mfp_cpu
    command_line    $USER1$/mfp_nagios_snmp_plugin.sh -C $ARG1$ -H $HOSTADDRESS$ -M cpu -w $ARG2$ -c $ARG3$
}

# mfp_proc: process x
define command{
    command_name    mfp_proc
    command_line    $USER1$/mfp_nagios_snmp_plugin.sh -C $ARG1$ -H $HOSTADDRESS$ -M proc -w $ARG2$ -c $ARG3$ -s $ARG4$
}

# mfp_proc_count: process count
define command{
    command_name    mfp_proc_count
    command_line    $USER1$/mfp_nagios_snmp_plugin.sh -C $ARG1$ -H $HOSTADDRESS$ -M proc_count -w $ARG2$ -c $ARG3$
}

# mfp_tcp_open: tcp open count
define command{
    command_name    mfp_tcp_open
    command_line    $USER1$/mfp_nagios_snmp_plugin.sh -C $ARG1$ -H $HOSTADDRESS$ -M tcp_open -w $ARG2$ -c $ARG3$
}




#####  mfp_notify  #####
# mfp_notify_host_sendmail
define command{
    command_name    mfp_notify_host_sendmail
    command_line    /usr/bin/printf "%b" "Subject: $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $HOSTSTATE$\n\n***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | /usr/sbin/sendmail -vt $CONTACTEMAIL$
}

# mfp_notify_service_sendmail
define command{
    command_name    mfp_notify_service_sendmail
    command_line    /usr/bin/printf "%b" "Subject: $NOTIFICATIONTYPE$ Service Alert: $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$\n\n***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\n\nService: $SERVICEDESC$\nHost: $HOSTALIAS$\nAddress: $HOSTADDRESS$\nState: $SERVICESTATE$\n\nDate/Time: $LONGDATETIME$\n\nAdditional Info:\n\n$SERVICEOUTPUT$\n" | /usr/sbin/sendmail -vt $CONTACTEMAIL$
}
  • /usr/local/nagios/etc/mfp_config/mfp_contact.cfg

Add the administrators' contact details, such as email addresses.

# group: all people
define contactgroup{
    contactgroup_name        all-admins
    alias                    All Administrators
    #members                 myname,oudana,taodashi
    members                  myname
}
# members
define contact{
    contact_name                    myname
    alias                           myname
    email                           myname@microfunplus.com
    host_notifications_enabled      1
    service_notifications_enabled   1
    service_notification_period     24x7
    host_notification_period        24x7
    service_notification_options    w,u,c,r
    host_notification_options       d,u,r
    service_notification_commands   mfp_notify_service_sendmail
    host_notification_commands      mfp_notify_host_sendmail
}
  • /usr/local/nagios/etc/mfp_servers/mongodb-01.cfg

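A machine that is actually monitored, together with its services. The services reference the commands in mfp_command.cfg and set the warning and critical thresholds, the check interval, the contact group to alert, and so on.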

define host{
    host_name                       mongodb-102
    alias                           Mongodb slave 102
    address                         10.11.11.110
    check_command                   check-host-alive
    check_interval                  3
    retry_interval                  1
    max_check_attempts              5
    check_period                    24x7
    process_perf_data               0
    retain_nonstatus_information    0
    contact_groups                  all-admins
    notification_interval           30
    notification_period             24x7
    notification_options            d,u,f,s
}

# mfp_load
define service{
        host_name               mongodb-102
        service_description     Load
        check_command           mfp_load!public!1600!6400
        max_check_attempts      5
        check_interval          5
        retry_interval          3
        check_period            24x7
        notification_interval   30
        notification_period     24x7
        notification_options    w,c,u,f,s
        contact_groups          all-admins
}

# mfp_memory
define service{
        host_name               mongodb-102
        service_description     Memory
        check_command           mfp_memory!public!94!98
        max_check_attempts      5
        check_interval          5
        retry_interval          3
        check_period            24x7
        notification_interval   30
        notification_period     24x7
        notification_options    w,c,u,f,s
        contact_groups          all-admins
}

# mfp_cpu
define service{
        host_name               mongodb-102
        service_description     CPU
        check_command           mfp_cpu!public!80!95
        max_check_attempts      5
        check_interval          3
        retry_interval          3
        check_period            24x7
        notification_interval   30
        notification_period     24x7
        notification_options    w,c,u,f,s
        contact_groups          all-admins
}

# mfp_disk 1
define service{
        host_name               mongodb-102
        service_description     Disk 1 space
        check_command           mfp_disk!public!80!95!1
        max_check_attempts      5
        check_interval          5
        retry_interval          3
        check_period            24x7
        notification_interval   30
        notification_period     24x7
        notification_options    w,c,u,f,s
        contact_groups          all-admins
}
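
In a check_command such as mfp_disk!public!80!95!1, the fields are separated by `!`: the first is the command name and the rest fill $ARG1$, $ARG2$, ... in the matching command_line from mfp_command.cfg. The sketch below mimics that split to make the mapping visible. The function name split_check_command is hypothetical, and this is not how Nagios itself is implemented.

```shell
# split_check_command 'name!arg1!arg2...' -> prints the command name and each $ARGn$ value.
# Hypothetical helper that only illustrates the field mapping.
split_check_command() {
    old_ifs=$IFS
    IFS='!'
    set -- $1          # split the string on "!" into positional parameters
    IFS=$old_ifs
    echo "command_name=$1"
    shift
    n=1
    for arg in "$@"; do
        echo "ARG$n=$arg"
        n=$((n + 1))
    done
}

# Example (commented out so the block has no side effects):
# split_check_command 'mfp_disk!public!80!95!1'
```

For the Disk 1 service above this yields public, 80, 95 and 1, which the mfp_disk command passes to the plugin as -C, -w, -c and -s.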

The end result looks like this:

nagios-004

nagios-005

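Common Commands and Notes

The libexec directory under the Nagios installation holds the plugins, and the etc directory holds the cfg files.

On the Nagios server you can debug SNMP or the plugins from the command line.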

snmpwalk -c public -v1 10.10.10.111 .1
snmpwalk -c public -v1 10.10.10.111 .1.3.6.1.4.1.2021.9.1
snmpget -c public -v1 -OqvtU 10.10.10.111 .1.3.6.1.2.1.25.1.1.0
snmpget -c public -v1 10.10.10.111 .1.3.6.1.4.1.2021.10.1.3.1 .1.3.6.1.4.1.2021.10.1.3.2
snmpget -c public -v1 10.10.10.111 .1.3.6.1.4.1.2021.9.1.6.1 .1.3.6.1.4.1.2021.9.1.6.6 | awk '{print $4}'
snmpget -c public -v1 10.10.10.111 .1.3.6.1.4.1.2021.9.1.6.1 | cut -d " " -f 4
snmpwalk -c public -v 1 10.10.10.111 host.hrStorage.hrStorageTable.hrStorageEntry.hrStorageUsed
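
The -s SUBOID argument used by the disk, net, io and proc checks is the row index in the corresponding SNMP table. A small filter along these lines can pull the indexes out of snmpwalk output. The function name list_table_indexes is hypothetical, and it assumes the default net-snmp output format of `NAME.index = TYPE: value`.

```shell
# list_table_indexes: reads snmpwalk output on stdin and prints "INDEX VALUE" per row.
# Hypothetical helper; expects lines like "UCD-SNMP-MIB::dskPath.1 = STRING: /".
list_table_indexes() {
    sed -n 's/^[^ ]*\.\([0-9][0-9]*\) = [A-Za-z0-9-]*: \(.*\)$/\1 \2/p'
}

# Example (commented out so the block has no side effects):
# snmpwalk -c public -v1 10.10.10.111 .1.3.6.1.4.1.2021.9.1.2 | list_table_indexes
```

Walking .1.3.6.1.4.1.2021.2.1.2 the same way lists the indexes of the monitored processes, and .1.3.6.1.2.1.31.1.1.1.1 lists the network interfaces.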

Commonly used OIDs:

# load
.1.3.6.1.4.1.2021.10.1.3
.1.3.6.1.4.1.2021.10.1.5
# unix uptime
.1.3.6.1.2.1.25.1.1.0
# memory
.1.3.6.1.4.1.2021.4
# disk
.1.3.6.1.4.1.2021.9.1
# net
.1.3.6.1.2.1.2.2.1
.1.3.6.1.2.1.31.1.1.1
# cpu ...
.1.3.6.1.4.1.2021.11

If snmpwalk gets no response, you may need to open the firewall on the remote host.

firewall-cmd --zone=public --add-port=161/udp --permanent 
firewall-cmd --reload

Specifying -v 2c makes it possible to read 64-bit counters, such as those for io and net.

snmpwalk -c ctx_hs_2 -v2c 10.171.50.219 .1.3.6.1.4.1.2021.13.15.1.1
# partial output, for comparison
UCD-DISKIO-MIB::diskIONRead.2 = Counter32: 801242112
UCD-DISKIO-MIB::diskIONReadX.2 = Counter64: 550557056000
UCD-DISKIO-MIB::diskIONWritten.3 = Counter32: 20390912
UCD-DISKIO-MIB::diskIONWrittenX.3 = Counter64: 20390912
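
The counter width matters because the net and io checks turn two successive samples into a rate, and a Counter32 starts again from zero after 2^32 units. The sketch below shows that calculation with a guard for a wrapped or reset counter. The function name counter_rate_kb is hypothetical and this is not the plugin's own code.

```shell
# counter_rate_kb PREVIOUS CURRENT ELAPSED_SECONDS
# Prints the average rate in KB/s between two byte-counter samples.
# Prints "unknown" and returns 1 if the interval is not positive or the
# counter went backwards (wrap or agent restart). Hypothetical helper.
counter_rate_kb() {
    local prev="$1"
    local curr="$2"
    local elapsed="$3"
    if [ "$elapsed" -le 0 ] || [ "$curr" -lt "$prev" ]; then
        echo "unknown"
        return 1
    fi
    echo $(( (curr - prev) / elapsed / 1024 ))
}

# Example (commented out so the block has no side effects):
# counter_rate_kb 550557056000 550567541760 10
```

With the 64-bit diskIONReadX and ifHCInOctets counters that the plugin reads over v2c, the backwards case should only occur after an agent restart.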

Nagios recognizes four plugin return states: 0 (OK), 1 (WARNING), 2 (CRITICAL) and 3 (UNKNOWN).
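
In practice a plugin prints one line of status text and ends with the matching exit code. A minimal sketch of that convention follows. The function names state_to_exit_code and report are hypothetical, and report returns the code instead of exiting so that it can be tried in an interactive shell.

```shell
# state_to_exit_code STATE -> prints the numeric code Nagios expects for STATE.
state_to_exit_code() {
    case "$1" in
        OK)       echo 0 ;;
        WARNING)  echo 1 ;;
        CRITICAL) echo 2 ;;
        *)        echo 3 ;;   # anything else is treated as UNKNOWN
    esac
}

# report STATE MESSAGE -> prints "STATE: MESSAGE" and returns the matching code.
# A real plugin would call `exit` with this code instead of returning it.
report() {
    echo "$1: $2"
    return "$(state_to_exit_code "$1")"
}

# Example (commented out so the block has no side effects):
# report WARNING "disk / at 85%"
```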

References
http://www.cnblogs.com/mchina/archive/2013/02/20/2883404.html
https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/4/en/objectdefinitions.html
http://freeloda.blog.51cto.com/2033581/1306743
http://nagios.proy.org/index_snmp.html
http://blog.itpub.net/29500582/viewspace-1610049/
https://www.monitoring-plugins.org/doc/man/check_snmp.html
http://www.net-snmp.org/docs/man/snmpd.conf.html
https://github.com/dsully/snmp-iostat-bridge/blob/master/snmp-iostat-bridge

Creative Commons License

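This article is published under the Attribution-NonCommercial-ShareAlike 4.0 license. You are welcome to repost, use and republish it, provided that you keep the author attribution wanghengbin (including the link: https://wanghengbin.com), do not use it for commercial purposes, and release any work derived from this article under the same license.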
