Troubleshooting the Zabbix agent's "not supported" items issue

Sometimes items collected by the Zabbix agent become "not supported" for one reason or another.
Possible root causes are:

  1. Incorrect item key
    Sometimes the item key needs a custom-defined macro to represent a hostname or IP.
    The macro format must be correct, e.g. {$MYAgentIP}, not ${MyAgentIP}. The dollar sign must be inside the braces.
  2. Incorrect UserParameter definition
    It is better to use the commands' absolute paths when defining a UserParameter.
    Buggy scripts used to define a UserParameter will also make the item unsupported.

Tools used
zabbix_get, zabbix_agentd
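
A minimal sketch of how these two tools are typically used (the host IP and item key below are placeholders, not from this post):
###On the agent host: test a key locally###
zabbix_agentd -t 'system.cpu.load[all,avg1]'
###From the Zabbix server: query the remote agent for the same key###
zabbix_get -s 192.168.10.3 -p 10050 -k 'system.cpu.load[all,avg1]'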

Troubleshooting Example
Local check
Log in to the problem host (agent installed)

###Test the specific key of an item###
###Assume memcached service is running on 192.168.10.3:11211#####
cat /etc/zabbix/zabbix_agentd.d/memcache.conf
UserParameter=memcache[*],/bin/echo -e "stats\nquit" | /bin/nc 127.0.0.1 11211 | /bin/grep "STAT $1 " | /usr/bin/awk '{print $$3}'
zabbix_agentd -t memcache[bytes_read]
memcache[bytes_read] [t|]
###The null value is returned####
###Test the commands####
/bin/echo -e "stats\nquit" | /bin/nc 127.0.0.1 11211
###Return nothing#####
###netstat to check the memcached status####
netstat -nlp | grep memcache
tcp 0 0 192.168.10.3:11211 0.0.0.0:* LISTEN 43186/memcached
udp 0 0 192.168.10.3:11211 0.0.0.0:* 43186/memcached
###memcached is listening on 192.168.10.3 IP###
###Test the commands with IP 192.168.10.3###
/bin/echo -e "stats\nquit" | /bin/nc 192.168.10.3 11211
...
STAT touch_misses 0
STAT auth_cmds 0
STAT auth_errors 0
STAT bytes_read 1767556176
STAT bytes_written 1907901707
STAT limit_maxbytes 128403374080
STAT accepting_conns 1
STAT listen_disabled_num 0
STAT threads 48
STAT conn_yields 0
STAT hash_power_level 16
....
###Return memcached data###
###Edit the UserParameter####
UserParameter=memcache[*],/bin/echo -e "stats\nquit" | /bin/nc 192.168.10.3 11211 | /bin/grep "STAT $1 " | /usr/bin/awk '{print $$3}'
###Restart Agent####
/etc/init.d/zabbix_agent restart
###Try again####
zabbix_agentd -t memcache[bytes_read]
memcache[bytes_read] [t|1782035412]
###The value returned correctly####

Remote Check
Log in to the Zabbix server (assuming the zabbix_get command is installed)

zabbix_get -s 192.168.10.3 -k memcache[bytes_read]
1795695623
###The value returned correctly#####

If the above two checks are okay, the item should also be okay to enable in the Zabbix web UI.

It is also useful to raise the Zabbix agent's debug level and check zabbix_agentd.log.

vim /etc/zabbix/zabbix_agentd.conf
DebugLevel=4
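
A minimal sketch of the full workflow; the log path below is a common default and may differ (check the LogFile directive in zabbix_agentd.conf):
/etc/init.d/zabbix_agent restart
grep ^LogFile /etc/zabbix/zabbix_agentd.conf
tail -f /var/log/zabbix/zabbix_agentd.log | grep -i "not supported"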


OpenStack Fuel 6.0.1 – Zabbix issues and fixes

Most of the issues are related to items and scripts; for example, the memcached-related items are not supported.
The following is a rough review with workarounds.

Issue#1 : "memcache service running" item always reports the service as down
Reason : the IP was not assigned in the item key
Fix : assign {$IP_MANAGEMENT} to the item, for example: net.tcp.service[tcp,{$IP_MANAGEMENT},11211]
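
Note that {$IP_MANAGEMENT} is resolved on the Zabbix server side, so when testing from the command line the literal IP has to be substituted by hand; a quick check could look like this (the IP is the example value used elsewhere in this post):
zabbix_get -s 192.168.10.3 -k 'net.tcp.service[tcp,192.168.10.3,11211]'
###1 means the port answers, 0 means it does not###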

Issue#2 : all memcache[*] items not supported
Reason : memcached listens on a 192.168.10.x IP, but the Zabbix agent pulls the data from 127.0.0.1, which does not work.
In addition, command paths in the UserParameter file should be absolute paths.
Fix : edit memcache.conf

root@node-1:/etc/zabbix/zabbix_agentd.d# netstat -nlp | grep mem
tcp 0 0 192.168.10.3:11211 0.0.0.0:* LISTEN 43186/memcached
udp 0 0 192.168.10.3:11211 0.0.0.0:* 43186/memcached
root@node-1:/etc/zabbix/zabbix_agentd.d# cat memcache.conf
UserParameter=memcache[*],echo -e "stats\nquit" | nc 127.0.0.1 11211 | grep "STAT $1 " | awk '{print $$3}'

change it to

root@node-1:/etc/zabbix/zabbix_agentd.d# cat memcache.conf
UserParameter=memcache[*],/bin/echo -e "stats\nquit" | /bin/nc 192.168.10.3 11211 | /bin/grep "STAT $1 " | /usr/bin/awk '{print $$3}'

restart zabbix agent

Issue#3 : item keys rabbitmq.missing.nodes, rabbitmq.missing.queues, rabbitmq.queue.items, rabbitmq.queues.without.consumers, rabbitmq.unmirror.queues not supported
Reason : /etc/zabbix/check_rabbit.conf did not include the user, and the URL IP was incorrect (it should be the node's own local IP, not a remote one)

[rabbitmq]
log_level=DEBUG
user=
password=<secret>
host=http://192.168.10.5:15672
#OpenStack queues, Y - number of service types, N - count of *this* service, max_queues=Y*(2*N+1)
max_queues=128

Fix : edit /etc/zabbix/check_rabbit.conf and fill in the correct user name/password and URL IP, for example:

[rabbitmq]
log_level=DEBUG
user=nova
password=<secret>
host=http://192.168.10.3:15672
#OpenStack queues, Y - number of service types, N - count of *this* service, max_queues=Y*(2*N+1)
max_queues=128
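
A quick way to verify the credentials and URL before touching the Zabbix side is to query the RabbitMQ management API directly (the user and IP are the example values from the config above; the password stays a placeholder):
curl -s -u nova:<secret> http://192.168.10.3:15672/api/overview
###a JSON document is returned when the user, password and URL are correct###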

Issue#4 : "Swift Proxy Server is listening on port" item not supported
Reason : in the item key net.tcp.service[http,${IP_STORAGE},8080] the macro format is wrong; it should be net.tcp.service[http,{$IP_STORAGE},8080], since the dollar sign must be inside the braces.
Fix : edit the item in the Zabbix UI

Issue#5 : "Horizon HTTPS Server is listening on port" item not supported
Reason : the item key net.tcp.service[https,{$IP_MANAGEMENT},443] checks port 443, but Fuel's default deployment does not put Horizon on 443
Fix : disable the item

Issue#6 : "Horizon HTTP Server process is running" item not supported
Reason : the item key proc.num[,apache,,/usr/sbin/httpd] is meant for RHEL/CentOS-like distros, not for Ubuntu
Fix : change the item key to one appropriate for Ubuntu & OpenStack: proc.num[,horizon,,/usr/sbin/apache2]
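
The corrected key can be tested on the node itself before enabling it (a sketch; the exact count depends on how many apache2 workers are running):
zabbix_agentd -t 'proc.num[,horizon,,/usr/sbin/apache2]'
###a non-zero count should come back while Horizon's apache2 processes are running under the horizon user###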

Issue#7 : "Neutron DHCP/L3 Agent process is running" item not supported
Reason : the item key proc.num[python,openstack-neutron,,neutron-dhcp-agent] is wrong
Fix : it should be proc.num[python,neutron,,neutron-dhcp-agent]; remove the unwanted "openstack-" prefix from the second parameter (the user)

Issue#8 : crm.node.check[p_openstack-neutron-dhcp/l3-agent] not supported
Reason : the key's parameter is incorrect; the crm resource names are p_neutron-dhcp-agent and p_neutron-l3-agent
Fix : remove "openstack" from the key, i.e. crm.node.check[p_neutron-dhcp-agent] and crm.node.check[p_neutron-l3-agent]

Issue#9 : crm.node.check[p_openstack-ceilometer-agent-central] & crm.node.check[p_openstack-ceilometer-alarm-evaluator] not supported
Reason : the key's parameter is incorrect; the crm resource names are p_ceilometer-agent-central and p_ceilometer-alarm-evaluator
Fix : remove "openstack" from the key, i.e. crm.node.check[p_ceilometer-agent-central] and crm.node.check[p_ceilometer-alarm-evaluator]

Issue#10 : "Too many processes on node X" trigger keeps firing
Reason : the trigger threshold was set too low: {Template Fuel OS Linux:proc.num[].last(0)}>300
Fix : {Template Fuel OS Linux:proc.num[].last(0)}>1500


Zabbix Ceilometer Proxy

ZCP (Zabbix Ceilometer Proxy)

ZCP is a proxy that sits between Zabbix and OpenStack Ceilometer to monitor Nova instances.
It uses Zabbix trapper items to receive the monitoring/metering data it collects over the AMQP protocol.
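
ZCP pushes the collected values into trapper items itself, but the mechanism is the same one zabbix_sender uses; a hand-made push to a trapper item would look roughly like this (host name, key and value are placeholders):
zabbix_sender -z <zabbix_server_IP> -p 10051 -s "<host name as registered in Zabbix>" -k cpu_util -o 12.5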

Installation & Configuration:
Log in to the Zabbix server

apt-get install -y git
easy_install pika
cd /opt/
git clone https://github.com/OneSourceConsult/ZabbixCeilometer-Proxy.git
cd /opt/ZabbixCeilometer-Proxy
vim proxy.conf

[zabbix_configs]
zabbix_admin_user = Admin
zabbix_admin_pass = (input zabbix password)
zabbix_host = zabbix_server_IP
zabbix_port = 10051

[os_rabbitmq]
rabbit_host = x.x.x.x (check rabbitmq proxy IP from /etc/haproxy/conf.d/)
rabbit_user = xxx(check it from /etc/rabbitmq/rabbitmq.config)
rabbit_pass = xxx(check it from /etc/rabbitmq/rabbitmq.config)

[ceilometer_configs]
ceilometer_api_host = x.x.x.x (check ceilometer proxy IP from /etc/haproxy/conf.d/)
ceilometer_api_port = 8777

[keystone_authtoken]
admin_user = admin
admin_password = xxxx (check it from controller node's openrc file)
admin_tenant = admin
keystone_host = 192.168.10.2
keystone_admin_port = 35357
keystone_public_port = 5000
# The port number which the OpenStack Compute service listens on, defined in keystone.conf file
nova_compute_listen_port = 8774

[zcp_configs]
# Interval in seconds
polling_interval = 600
# template name to be created in Zabbix
template_name = Template Nova
# proxy name to be registered in Zabbix
zabbix_proxy_name = ZCP01

Running ZCP

cd /opt/ZabbixCeilometer-Proxy
python proxy.py
Nova listener started
Project Listener started

Some Python warning messages such as 404 "NOT FOUND" may appear, but they can be ignored.

Testing by launching instances
Log in to one of the controller nodes

source ~/openrc
#Check the available images and note the image ID
glance image-list
+--------------------------------------+--------+-------------+------------------+----------+--------+
| ID | Name | Disk Format | Container Format | Size | Status |
+--------------------------------------+--------+-------------+------------------+----------+--------+
| 68739b8a-1fc5-4c49-8ceb-62d08055dfd2 | TestVM | qcow2 | bare | 13167616 | active |
+--------------------------------------+--------+-------------+------------------+----------+--------+
#Check the available flavors and note the flavor ID
nova flavor-list
+--------------------------------------+-----------+-----------+------+-----------+------+-------+-------------+-----------+
| ID | Name | Memory_MB | Disk | Ephemeral | Swap | VCPUs | RXTX_Factor | Is_Public |
+--------------------------------------+-----------+-----------+------+-----------+------+-------+-------------+-----------+
| 1 | m1.tiny | 512 | 1 | 0 | | 1 | 1.0 | True |
| 2 | m1.small | 2048 | 20 | 0 | | 1 | 1.0 | True |
| 3 | m1.medium | 4096 | 40 | 0 | | 2 | 1.0 | True |
| 4 | m1.large | 8192 | 80 | 0 | | 4 | 1.0 | True |
| 5 | m1.xlarge | 16384 | 160 | 0 | | 8 | 1.0 | True |
| e2adcd8e-c980-4cb5-9ab4-9b2715585b65 | m1.micro | 64 | 0 | 0 | | 1 | 1.0 | True |
+--------------------------------------+-----------+-----------+------+-----------+------+-------+-------------+-----------+
#Check the available networks and note the network ID
neutron net-list
+--------------------------------------+-----------+-------------------------------------------------------+
| id | name | subnets |
+--------------------------------------+-----------+-------------------------------------------------------+
| f8db8a52-bdfd-469c-bf63-b671e4621888 | net04 | 7262742c-92c4-47a5-b666-ee40c8fb2630 192.168.111.0/24 |
| b29de80e-69b5-4a76-ac05-c5a9ad185835 | net04_ext | 91fa3b33-de95-40da-a9de-1cd0b5f8682a 172.28.0.0/16 |
+--------------------------------------+-----------+-------------------------------------------------------+
#Boot instances from image
nova boot --flavor 1 instance1 --image 68739b8a-1fc5-4c49-8ceb-62d08055dfd2 --security-groups default \
--nic net-id=f8db8a52-bdfd-469c-bf63-b671e4621888
+--------------------------------------+-----------------------------------------------+
| Property | Value |
+--------------------------------------+-----------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | - |
| OS-EXT-SRV-ATTR:hypervisor_hostname | - |
| OS-EXT-SRV-ATTR:instance_name | instance-00000008 |
| OS-EXT-STS:power_state | 0 |
| OS-EXT-STS:task_state | scheduling |
| OS-EXT-STS:vm_state | building |
| OS-SRV-USG:launched_at | - |
| OS-SRV-USG:terminated_at | - |
| accessIPv4 | |
| accessIPv6 | |
| adminPass | U8RZedyHprYP |
| config_drive | |
| created | 2015-01-13T06:21:43Z |
| flavor | m1.tiny (1) |
| hostId | |
| id | 03c25b2d-0190-433c-87f2-f8750fdae741 |
| image | TestVM (68739b8a-1fc5-4c49-8ceb-62d08055dfd2) |
| key_name | - |
| metadata | {} |
| name | instance1 |
| os-extended-volumes:volumes_attached | [] |
| progress | 0 |
| security_groups | default |
| status | BUILD |
| tenant_id | d336c99f96b04962821547605d44aaf4 |
| updated | 2015-01-13T06:21:43Z |
| user_id | 3e68c340dc5d48ce8f426d2aa7ceb017 |
+--------------------------------------+-----------------------------------------------+
nova boot --flavor 2 instance2 --image 68739b8a-1fc5-4c49-8ceb-62d08055dfd2 --security-groups default \
--nic net-id=f8db8a52-bdfd-469c-bf63-b671e4621888
+--------------------------------------+-----------------------------------------------+
| Property | Value |
+--------------------------------------+-----------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | - |
| OS-EXT-SRV-ATTR:hypervisor_hostname | - |
| OS-EXT-SRV-ATTR:instance_name | instance-0000000b |
| OS-EXT-STS:power_state | 0 |
| OS-EXT-STS:task_state | scheduling |
| OS-EXT-STS:vm_state | building |
| OS-SRV-USG:launched_at | - |
| OS-SRV-USG:terminated_at | - |
| accessIPv4 | |
| accessIPv6 | |
| adminPass | a4HrVZgWqxtG |
| config_drive | |
| created | 2015-01-13T06:22:26Z |
| flavor | m1.small (2) |
| hostId | |
| id | 8087c209-158d-4f7c-9924-e91fab430649 |
| image | TestVM (68739b8a-1fc5-4c49-8ceb-62d08055dfd2) |
| key_name | - |
| metadata | {} |
| name | instance2 |
| os-extended-volumes:volumes_attached | [] |
| progress | 0 |
| security_groups | default |
| status | BUILD |
| tenant_id | d336c99f96b04962821547605d44aaf4 |
| updated | 2015-01-13T06:22:26Z |
| user_id | 3e68c340dc5d48ce8f426d2aa7ceb017 |
+--------------------------------------+-----------------------------------------------+
#Check instance status
nova list
+--------------------------------------+-----------+--------+------------+-------------+---------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+-----------+--------+------------+-------------+---------------------+
| 03c25b2d-0190-433c-87f2-f8750fdae741 | instance1 | ACTIVE | - | Running | net04=192.168.111.5 |
| 8087c209-158d-4f7c-9924-e91fab430649 | instance2 | ACTIVE | - | Running | net04=192.168.111.6 |
+--------------------------------------+-----------+--------+------------+-------------+---------------------+

Actions Recorded
Switch back to the proxy terminal; the screen shows the actions performed above.

Creating a host in Zabbix Server
Creating a host in Zabbix Server
Checking host:instance1-03c25b2d-0190-433c-87f2-f8750fdae741
- Item disk.read.bytes
- Item disk.write.bytes
- Item cpu
- Item disk.write.requests
- Item disk.read.requests
- Item cpu_util
- Item network.outgoing.packets
- Item network.outgoing.bytes
- Item network.incoming.bytes
- Item network.incoming.packets
Checking host:instance2-8087c209-158d-4f7c-9924-e91fab430649
- Item disk.read.bytes
- Item disk.write.bytes
- Item cpu
- Item disk.write.requests
- Item disk.read.requests
- Item cpu_util
- Item network.outgoing.packets
- Item network.outgoing.bytes
- Item network.incoming.bytes
- Item network.incoming.packets

Checking
Log in to the Zabbix UI.
A proxy called ZCP01 has been created and contains the 2 instances in the admin host group.

Screenshot: Administration > DM

Screenshot: Administration > Host groups

Screenshot: Monitoring > Overview

Have fun~


Rackspace Private Cloud RPC v9 Installation

Planning

  1. Controller HA
  2. Dedicated network node
  3. Dedicated block storage node
  4. HAProxy node instead of HW load balancers
  5. 4 NICs per node

Sizing & HW Selection
This is just a POC system, so sizing was not calculated rigorously, but consistent hardware should be used: each server must have 4 interfaces and a working hard disk, and the switches must support VLAN creation and tagging. The following hardware was used to build the environment.

         CPU                                  Memory   HDD        NIC       Quantity
Server   2 x Intel Xeon CPU E5640 @ 2.67GHz   6 x 8G   1 x 500G   4 x 1Gb   11
Switch   1G 48 ports (Edge-core AS4600 54T)                                 2
Cables   RJ45 cables                                                        44

Hardware Preparation:

  1. Prepare servers, switches, and power
  2. Cabling
  3. Switch configuration
    Nodes cabling

    Node to switch cabling

    Switch configuration
    port 1~12 : vlan 10 , vlan 20 tagged , default vlan untagged
    port 13~24 : vlan 30 tagged , default vlan untagged


 Software Preparation

  1. Follow chapter 3 of the Rackspace installation guide to prepare the deployment host.
  2. Follow chapter 4 of the Rackspace installation guide to prepare the target hosts.
  3. Refer to "configuring the network on a target host" to configure every host's network in the OS.

    Linux bridging and bonding of host interfaces

hostname (role)    | bond0     | bond1     | bonding mode  | bond0 ip     | br-mgmt (vlan10) | br-storage (vlan20) | bond1 ip | br-vxlan (vlan30)
node1 (ansible)    | eth0,eth2 | eth1,eth3 | active-backup | 172.16.26.11 | 172.29.236.11    | 172.29.244.11       | no need  | 172.29.240.11
node2 (infra1)     | eth0,eth2 | eth1,eth3 | active-backup | 172.16.26.12 | 172.29.236.12    | 172.29.244.12       | no need  | 172.29.240.12
node3 (infra2)     | eth0,eth2 | eth1,eth3 | active-backup | 172.16.26.13 | 172.29.236.13    | 172.29.244.13       | no need  | 172.29.240.13
node4 (infra3)     | eth0,eth2 | eth1,eth3 | active-backup | 172.16.26.14 | 172.29.236.14    | 172.29.244.14       | no need  | 172.29.240.14
node5 (compute1)   | eth0,eth2 | eth1,eth3 | active-backup | 172.16.26.15 | 172.29.236.15    | 172.29.244.15       | no need  | 172.29.240.15
node6 (compute2)   | eth0,eth2 | eth1,eth3 | active-backup | 172.16.26.16 | 172.29.236.16    | 172.29.244.16       | no need  | 172.29.240.16
node7 (compute3)   | eth0,eth2 | eth1,eth3 | active-backup | 172.16.26.17 | 172.29.236.17    | 172.29.244.17       | no need  | 172.29.240.17
node8 (cinder1)    | eth0,eth2 | eth1,eth3 | active-backup | 172.16.26.18 | 172.29.236.18    | 172.29.244.18       | no need  | 172.29.240.18
node9 (logger1)    | eth0,eth2 | eth1,eth3 | active-backup | 172.16.26.19 | 172.29.236.19    | 172.29.244.19       | no need  | 172.29.240.19
node10 (network1)  | eth0,eth2 | eth1,eth3 | active-backup | 172.16.26.20 | 172.29.236.20    | 172.29.244.20       | no need  | 172.29.240.20
node11 (haproxy1)  | eth0,eth2 | eth1,eth3 | active-backup | 172.16.26.1  | 172.29.236.1     | 172.29.244.1        | no need  | 172.29.240.1
###example network configuration file of node1###
# Physical interface 1
auto eth0
iface eth0 inet manual
bond-master bond0
bond-primary eth0

# Physical interface 2
auto eth1
iface eth1 inet manual
bond-master bond1
bond-primary eth1

# Physical interface 3
auto eth2
iface eth2 inet manual
bond-master bond0

# Physical interface 4
auto eth3
iface eth3 inet manual
bond-master bond1
# Bond interface 0 (physical interfaces 1 and 3)
auto bond0
iface bond0 inet static
bond-slaves none
bond-mode active-backup
bond-miimon 100
bond-downdelay 200
bond-updelay 200
address 172.16.26.11
netmask 255.255.0.0
gateway 172.16.1.254
dns-nameservers 168.95.1.1 8.8.8.8

# Bond interface 1 (physical interfaces 2 and 4)
auto bond1
iface bond1 inet manual
bond-slaves none
bond-mode active-backup
bond-miimon 100
bond-downdelay 250
bond-updelay 250

# Container management VLAN interface
iface bond0.10 inet manual
vlan-raw-device bond0

# OpenStack Networking VXLAN (tunnel/overlay) VLAN interface
iface bond1.30 inet manual
vlan-raw-device bond1

# Storage network VLAN interface (optional)
iface bond0.20 inet manual
vlan-raw-device bond0

# Container management bridge
auto br-mgmt
iface br-mgmt inet static
bridge_stp off
bridge_waitport 0
bridge_fd 0
# Bridge port references tagged interface
bridge_ports bond0.10
address 172.29.236.11
netmask 255.255.252.0
dns-nameservers 168.95.1.1 8.8.8.8

# OpenStack Networking VXLAN (tunnel/overlay) bridge
auto br-vxlan
iface br-vxlan inet static
bridge_stp off
bridge_waitport 0
bridge_fd 0
# Bridge port references tagged interface
bridge_ports bond1.30
address 172.29.240.11
netmask 255.255.252.0

# OpenStack Networking VLAN bridge
auto br-vlan
iface br-vlan inet manual
bridge_stp off
bridge_waitport 0
bridge_fd 0
# Bridge port references untagged interface
bridge_ports bond1

# Storage bridge (optional)
auto br-storage
iface br-storage inet static
bridge_stp off
bridge_waitport 0
bridge_fd 0
# Bridge port reference tagged interface
bridge_ports bond0.20
address 172.29.244.11
netmask 255.255.252.0
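
After bringing the interfaces up, the bonds, VLAN sub-interfaces, and bridges can be verified with a few standard commands (a sketch; it assumes bridge-utils is installed and the output varies per node):
cat /proc/net/bonding/bond0    ###active-backup mode and both slaves should be listed###
ip -d link show bond0.10       ###the VLAN 10 sub-interface on bond0###
brctl show                     ###br-mgmt, br-vxlan, br-vlan, br-storage and their bridge ports###
ip addr show br-mgmt           ###the 172.29.236.x address should sit on the bridge###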

Rackspace RPC v9 Environment Overview

Rackspace RPC v9 with HAProxy Overview


Defining OpenStack environment

  1. Follow Rackspace installation guide’s chapter 5 to define OpenStack environment
    ###my /etc/rpc_deploy/rpc_user_config.yml file####
    ####do not use tab key when editing this file.use space instead####
    environment_version: e0955a92a761d5845520a82dcca596af
    cidr_networks:
      container: 172.29.236.0/22
      tunnel: 172.29.240.0/22
      storage: 172.29.244.0/22
    used_ips:
      - 172.29.236.11,172.29.236.20
      - 172.29.244.11,172.29.244.20
    global_overrides:
      internal_lb_vip_address: 172.29.236.1
      external_lb_vip_address: 172.16.26.1
      tunnel_bridge: "br-vxlan"
      management_bridge: "br-mgmt"
      provider_networks:
        - network:
            group_binds:
              - all_containers
              - hosts
            type: "raw"
            container_bridge: "br-mgmt"
            container_interface: "eth1"
            ip_from_q: "container"
        - network:
            group_binds:
              - glance_api
              - cinder_api
              - cinder_volume
              - nova_compute
            type: "raw"
            container_bridge: "br-storage"
            container_interface: "eth2"
            ip_from_q: "storage"
        - network:
            group_binds:
              - neutron_linuxbridge_agent
            container_bridge: "br-vxlan"
            container_interface: "eth10"
            ip_from_q: "tunnel"
            type: "vxlan"
            range: "1:1000"
            net_name: "vxlan"
        - network:
            group_binds:
              - neutron_linuxbridge_agent
            container_bridge: "br-vlan"
            container_interface: "eth11"
            type: "flat"
            net_name: "vlan"
        - network:
            group_binds:
              - neutron_linuxbridge_agent
            container_bridge: "br-vlan"
            container_interface: "eth11"
            type: "vlan"
            range: "1:1000"
            net_name: "vlan"
      lb_name: lb_name_in_core
    infra_hosts:
      infra1:
        ip: 172.29.236.12
      infra2:
        ip: 172.29.236.13
      infra3:
        ip: 172.29.236.14
    compute_hosts:
      compute1:
        ip: 172.29.236.15
      compute2:
        ip: 172.29.236.16
      compute3:
        ip: 172.29.236.17
    storage_hosts:
      cinder1:
        ip: 172.29.236.18
        container_vars:
          cinder_backends:
            limit_container_types: cinder_volume
            lvm:
              volume_group: cinder-volumes
              volume_driver: cinder.volume.drivers.lvm.LVMISCSIDriver
              volume_backend_name: LVM_iSCSI
    log_hosts:
      logger1:
        ip: 172.29.236.19
    network_hosts:
      network1:
        ip: 172.29.236.20
    haproxy_hosts:
      haproxy1:
        ip: 172.29.236.1
  2. Generate user variables that will be used for service authentication.
    cd /opt/ansible-lxc-rpc
    ./scripts/pw-token-gen.py --file /etc/rpc_deploy/user_variables.yml

Deploying

  1. Log in to the ansible node as root.
  2. Make sure all nodes are reachable by hostname and via the SSH RSA key.
  3. Refer to the following steps to run the playbooks (install OpenStack):
    cd /opt/ansible-lxc-rpc/rpc_deployment/
    fping infra1 infra2 infra3 compute1 compute2 compute3 logger1 network1 cinder1 haproxy1
    infra1 is alive
    infra2 is alive
    infra3 is alive
    compute1 is alive
    compute2 is alive
    compute3 is alive
    logger1 is alive
    network1 is alive
    cinder1 is alive
    haproxy1 is alive
    ###The first playbook performs setup for the target hosts with the required software repos, creates the LXC containers, and validates network configuration: (runtime 15-20 minutes)###
    ansible-playbook -e @/etc/rpc_deploy/user_variables.yml playbooks/setup/host-setup.yml
    ###Since the v9.0 reference architecture expects a HW Load Balancer in front, we need to install a software load balancer (this playbook installs HAProxy and configures required OpenStack backends): (runtime 5 minutes)###
    ansible-playbook -e @/etc/rpc_deploy/user_variables.yml playbooks/infrastructure/haproxy-install.yml
    ###This playbook installs all support software required to run OpenStack and any additional tools used with RPC: (runtime 20-25 minutes)###
    ansible-playbook -e @/etc/rpc_deploy/user_variables.yml playbooks/infrastructure/infrastructure-setup.yml
    ###The final playbook builds and installs OpenStack components: (runtime 30-35 minutes)###
    ansible-playbook -e @/etc/rpc_deploy/user_variables.yml playbooks/openstack/openstack-setup.yml
  4. Make sure all playbooks run with 0 failure.

    Running playbook with 0 failures


Validating

  1. Log in to the OpenStack Dashboard using HAProxy's public IP: https://172.16.26.1/
  2. Find the keystone admin password in the /etc/rpc_deploy/user_variables.yml file.
  3. Create an image from the URL: http://download.cirros-cloud.net/0.3.3/cirros-0.3.3-x86_64-disk.img
  4. Create and launch an instance using the CirrOS image.

Troubleshooting

  1. Log in to the specific host first.
  2. Use lxc-attach to attach to a specific container (see the sketch after this list), e.g.:

    Attach a container to troubleshoot

  3. Restart all containers
    cd /opt/ansible-lxc-rpc/rpc_deployment/
    ansible-playbook -e @/etc/rpc_deploy/user_variables.yml playbooks/setup/restart-containers.yml
    ********************************************************************
    cinder1 : ok=3 changed=0 unreachable=0 failed=0
    compute1 : ok=3 changed=0 unreachable=0 failed=0
    compute2 : ok=3 changed=0 unreachable=0 failed=0
    compute3 : ok=3 changed=0 unreachable=0 failed=0
    infra1 : ok=3 changed=0 unreachable=0 failed=0
    infra2 : ok=3 changed=0 unreachable=0 failed=0
    infra3 : ok=3 changed=0 unreachable=0 failed=0
    logger1 : ok=3 changed=0 unreachable=0 failed=0
    network1 : ok=3 changed=0 unreachable=0 failed=0
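
    For reference, a minimal sketch of the container troubleshooting flow from step 2 (the exact container names come from the listing, so list them first):
    lxc-ls -f                        ###list the containers on this host and their state###
    lxc-attach -n <container_name>   ###get a shell inside the chosen container###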

Conclusion

  1. Rackspace RPC v9 requires preparing the base OS first and then running the playbooks step by step, which takes some time.
  2. All OpenStack services are containerized except nova_compute, which runs on bare metal; this is reasonable given bare-metal performance.
  3. Restarting all containers (i.e. all OpenStack services) takes only 1~2 seconds, which is very fast and convenient.
  4. It ships rsyslog, Logstash, and Elasticsearch with Kibana, which is good for monitoring the environment and troubleshooting at runtime.


Red Hat OpenStack 4.0 vlan mode for multi-nodes installation

Scenario One : 1 controller node + 2 compute nodes + neutron network + vlan mode

Server Spec : 4 x 1Gb NICs per server

Physical switch and port arrangement :
nic1 -> ext -> vlan10, untagged
nic2 -> mgt -> vlan20, untagged
nic3 -> vm/instance network -> vlan30 trunk enabled, untagged
nic4 -> unused -> vlan40, untagged

Network Topology and arrangement

OS layer software and network arrangement :
Controller node :
Base OS : RHEL 6.5
Disable NetworkManager : "service NetworkManager stop ; chkconfig NetworkManager off"
Disable SELinux : vi /etc/selinux/config , SELINUX=disabled

cat /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=controller.zzzzz.com
cat /etc/sysconfig/network-scripts/ifcfg-eth0

DEVICE="eth0"
BOOTPROTO="static"
IPV6INIT="yes"
MTU="1500"
IPADDR=172.16.26.102
NETMASK=255.255.0.0
GATEWAY=172.16.1.254
DNS1=8.8.8.8
ONBOOT="yes"
TYPE="Ethernet"
cat /etc/sysconfig/network-scripts/ifcfg-eth1

DEVICE=eth1
TYPE=Ethernet
ONBOOT=yes
BOOTPROTO=static
IPADDR=192.168.43.102
DEFROUTE=no
cat /etc/sysconfig/network-scripts/ifcfg-eth2

DEVICE=eth2
TYPE=Ethernet
ONBOOT=no
DEFROUTE=no
cat /etc/sysconfig/network-scripts/ifcfg-eth3

DEVICE=eth3
TYPE=Ethernet
ONBOOT=no
DEFROUTE=no

Compute node 1:
Base OS : RHEL 6.5
Disable NetworkManager : "service NetworkManager stop ; chkconfig NetworkManager off"
Disable SELinux : vi /etc/selinux/config , SELINUX=disabled

cat /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=compute1.zzzzz.com
cat /etc/sysconfig/network-scripts/ifcfg-eth0

DEVICE="eth0"
BOOTPROTO="static"
IPV6INIT="yes"
MTU="1500"
IPADDR=172.16.26.103
NETMASK=255.255.0.0
GATEWAY=172.16.1.254
DNS1=8.8.8.8
ONBOOT="yes"
TYPE="Ethernet"
cat /etc/sysconfig/network-scripts/ifcfg-eth1

DEVICE=eth1
TYPE=Ethernet
ONBOOT=yes
BOOTPROTO=static
IPADDR=192.168.43.103
DEFROUTE=no
cat /etc/sysconfig/network-scripts/ifcfg-eth2

DEVICE=eth2
TYPE=Ethernet
ONBOOT=no
DEFROUTE=no
cat /etc/sysconfig/network-scripts/ifcfg-eth3

DEVICE=eth3
TYPE=Ethernet
ONBOOT=no
DEFROUTE=no

Compute node 2:
Base OS : RHEL 6.5
Disable NetworkManager : "service NetworkManager stop ; chkconfig NetworkManager off"
Disable SELinux : vi /etc/selinux/config , SELINUX=disabled

cat /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=compute2.zzzzz.com
cat /etc/sysconfig/network-scripts/ifcfg-eth0

DEVICE="eth0"
BOOTPROTO="static"
IPV6INIT="yes"
MTU="1500"
IPADDR=172.16.26.104
NETMASK=255.255.0.0
GATEWAY=172.16.1.254
DNS1=8.8.8.8
ONBOOT="yes"
TYPE="Ethernet"
cat /etc/sysconfig/network-scripts/ifcfg-eth1

DEVICE=eth1
TYPE=Ethernet
ONBOOT=yes
BOOTPROTO=static
IPADDR=192.168.43.104
DEFROUTE=no
cat /etc/sysconfig/network-scripts/ifcfg-eth2

DEVICE=eth2
TYPE=Ethernet
ONBOOT=no
DEFROUTE=no
cat /etc/sysconfig/network-scripts/ifcfg-eth3

DEVICE=eth3
TYPE=Ethernet
ONBOOT=no
DEFROUTE=no

Reboot all machines

Use subscription-manager to register the machines to RHSM :
Please follow the SOP in section 2.1.2 of this guide : Red_Hat_Enterprise_Linux_OpenStack_Platform-4-Getting_Started_Guide-en-US

Make sure all servers' yum repositories are set up correctly.
They must be able to access the RHEL 6 and OpenStack 4.0 packages:
rhel-6-server-openstack-4.0-rpms (Red Hat OpenStack 4.0 (RPMs))
rhel-6-server-rpms (Red Hat Enterprise Linux 6 Server (RPMs))

Service placement:
Controller node(192.168.43.102) : keystone,mysqld,glance,cinder,swift,ceilometer,heat,neutron(server,l3 agent, openvswitch plugin,dhcp agent,lbaas-agent,metadata-agent),nova-compute,nova-api,nova-cert,nova-conductor,nova-scheduler
Controller node(172.16.26.102) : horizon,vncproxy,nagios
Compute node(192.168.43.103,192.168.43.104) : nova-compute,openvswitch plugin

Install the openstack deployment tool packstack:
yum install openstack-packstack -y

Run packstack using this answer file: multi-node-vlan.txt
packstack --answer-file=multi-node-vlan.txt
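
If a fresh answer file is needed to edit first, packstack can generate one (a sketch; the file name is arbitrary):
packstack --gen-answer-file=multi-node-vlan.txt
###edit the generated file, then run packstack --answer-file as above###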

Post Install

1.
Controller node
vi /etc/keystone/keystone.conf
token_format =UUID
service openstack-keystone restart

2.
Controller and compute node
vi /etc/sysconfig/network-scripts/ifcfg-eth2
ONBOOT=yes
"service network restart"

3.Configure bridge for external network: br-ex
Controller node
vi /etc/sysconfig/network-scripts/ifcfg-br-ex
DEVICE=br-ex
DEVICETYPE=ovs
TYPE=OVSBridge
BOOTPROTO=static
IPADDR=172.16.26.102
NETMASK=255.255.0.0
GATEWAY=172.16.1.254
ONBOOT=yes

vi /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
TYPE=OVSPort
DEVICETYPE=ovs
OVS_BRIDGE=br-ex
ONBOOT=yes
"service network restart"
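
After the network restart, the Open vSwitch layout can be verified; eth0 should appear as a port on br-ex, and br-ex should hold the external IP (a sketch):
ovs-vsctl show
ovs-vsctl list-ports br-ex
ip addr show br-ex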

Verify the Installation :

Get the admin login password with "cat ~/keystone_admin"

Log in to the dashboard and create networks: http://172.16.26.102/dashboard

Create external network and subnet:
Networks > Create Network > Name: pub > Project : admin > check “Admin state and External”
Networks > pub > Create Subnet > Subnet Name = pubsub , Network Address = 172.16.0.0/16, IP Version=IPv4, Gateway IP = 172.16.1.254 > Subnet Detail > Allocation Pools = 172.16.26.30,172.16.26.100 > Create

Create Router and set gateway:
Click Project > Routers > Create Router > Router name = router1 > Create router
Click Project > Routers > Click “Set Gateway” for router1

Create private network:
Click Project > Networks > Create Network > Network Name = priv > Subnet * > Subnet Name = pubsub , Network Address = 10.0.0.0/24, IP Version=IPv4, Gateway IP = leave blank > Create

Create or upload image:
Click Project > Images & Snapshots > Name = Cirros_img > Description = CirrOS > Image Source = Image Location > Image Location = http://download.cirros-cloud.net/0.3.1/cirros-0.3.1-x86_64-disk.img > Format = QCOW2 – QEMU Emulator > Tick Public > Create Image

Create Instance :Click Project > Instances > Launch Instance > Availability Zone = nova > Instance Name = test > Flavor = m1.tiny > Instance Count = 3 > Instance Boot Source = Boot from image > Select Image = Cirros_img > Networking to select priv > Launch
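
For reference, roughly the same external network, subnet, and router setup can be done from the CLI with the neutron client of that release (a hedged sketch; source the admin credentials created by packstack first):
neutron net-create pub --router:external=True
neutron subnet-create pub 172.16.0.0/16 --name pubsub --gateway 172.16.1.254 --allocation-pool start=172.16.26.30,end=172.16.26.100 --disable-dhcp
neutron router-create router1
neutron router-gateway-set router1 pub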


vSphere 5.5 Highlights: VSAN & vFRC

Although this is slightly old news (from October), today I felt like introducing the two new features VMware released: VSAN and vFRC.

VSAN, or Virtual SAN, is, as the name suggests, a virtualized SAN, which means storage.
What is so special about storage? The "v" in front of it: the storage itself is virtualized. A few years ago the word "virtualization" made MIS people cringe, but by now it is widely accepted, because everything has to move to the cloud and be automated.

Nowadays, when designing a virtualization architecture, the biggest headache besides networking is storage. Storage is effectively the heart of the whole virtualization environment,
and three problems will bother you: HA, performance, and scalability.

  1. How do you design the architecture to achieve HA or redundancy?
  2. How do you increase performance so that VMs can read and write faster?
  3. When the number of VMs and the I/O demand grow, how do you scale out to cope?

VSAN answers these three questions as follows:

  • Under VSAN, the hypervisor and the storage are one and the same; this is what is called a hyperconverged infrastructure. Each host contributes its local, partition-free HDDs and SSDs to the VSAN storage pool. VSAN requires at least three hosts, each with at least one partition-free SSD and one HDD. The SSDs do not count toward capacity; they mainly serve as read cache and write buffer. As for HA, with this three-host setup the rule is expressed as (2n+1) = number of hosts required, where n is the number of host or HDD failures that can be tolerated. This is defined in the storage policy, and n=1 is the default policy. You can apply different policies to different VMs or VMDK objects, and you can define your own; for example, n=2 requires 5 hosts, each with at least one partition-free SSD and HDD. For more important VMs, assign a better policy. This is so-called policy-driven storage, the foundation of software-defined storage. See the figures:

VSAN as scale out architecture both for storage and hypervisor

1 host or 1 HDD failure is tolerable for this VM

 

  • Besides the number of host/HDD failures to tolerate, a storage policy has two more parameters for defining performance requirements: "Number of disk stripes per object" and "Flash read cache reservation". "Number of disk stripes per object" controls how many HDDs an object is striped across; the HDDs may sit in the same host or be spread across hosts, and the goal is to increase read/write performance. "Flash read cache reservation" defines how much SSD/flash is reserved as cache for that VM/VMDK/object, which specifically boosts read performance and indirectly improves write performance as well. In other words, when what VSAN provides by default is not enough, these parameters can make up the difference, although it is rare to need them. See the figures:
Parameters can be configured for a VM storage policy

  • The third point is scalability. Under VSAN, if you want to run more VMs you add hosts, and if you want more capacity you add HDDs and SSDs; it scales out horizontally. Compute power grows with the number of hosts, capacity grows with the HDDs and SSDs, and the configurable maximums grow along with them. Note that the configurable maximums in the current beta are still limited.
    http://www.virtual-blog.com/2013/09/vmware-virtual-san-scalability-limits-vsan/

Some screenshots to get a feel for VSAN

VSAN status

VSAN's disk groups & backing devices

The other highlight is vFRC (Virtual Flash Read Cache). Forget VSAN for a moment: vFRC is an acceleration option for centralized (shared) storage. It uses the SSDs or PCIe flash devices on the hosts as a cache and assigns an appropriate amount of cache to individual VMs. The way it works is to format the hosts' SSDs or PCIe flash devices into the VFFS format through the vSphere Web Client; the VMs on the host can then consume it, improving read speed and response time. VFFS is also a cluster file system that spans multiple hosts. See the figures:

An aggregated flash pool to offer read cache

VFFS Pool

Create or add vFRC from SSD backing devices

Assign vFRC to a VM

Flash Read Cache Advanced setting

 

I personally really like the VSAN architecture; it is a great fit for VDI. The concept is simple, configuration is simple, and scaling is simple; you can start a VDI deployment with just three hosts. The biggest difference from the traditional approach is the VM storage policy: given whatever you have (a certain number of hosts, SSDs, and HDDs), you define various policies and apply them to VMs of different importance, which in a traditional setup would correspond to different storage tiers. VSAN certainly has an API, and that is the foundation of software-defined storage.


vCenter : Prepare SQL DB for vCloud and VDI installation

For the detailed steps, please refer to :
http://vmwaremine.com/2012/11/12/prepare-dbs-for-vsphere-5-1-installation-or-upgrade-part-1/

The important thing is to execute the following 4 DB scripts in SQL Server Management Studio, which will create the corresponding DBs and users. Please change the passwords in the scripts accordingly.

vcdb_db_schema  <DB schema for vCenter Server>
vum_db_schema  <DB schema for VMWare Update Manager>
RSA_db_schema  <DB schema for vCenter SSO RSA DB>
RSAUser_db_schema <DB schema for vCenter SSO RSA USER>

Then configure the privileges for vpxuser and vumuser.

Then configure the MS SQL ODBC data sources on the vCenter server.
Note that only the 32-bit ODBC administrator (C:\Windows\SysWOW64\odbcad32.exe) can be used to create the VUM DB connection.
For VCDB, you can use the built-in 64-bit ODBC administrator on Windows Server 2008 R2.
For the detailed steps, please refer to :
http://vmwaremine.com/2012/11/12/prepare-dbs-for-vsphere-5-1-installation-or-upgrade-part-1/


vCloud Director : How to renew SSL certificate?

By default, the SSL certificates generated by the following commands are only valid for 90 days:

#ssh root@vcloudip
#cd /opt/vmware/vcloud-director/jre/bin/
#./keytool -keystore certificates.ks -storetype JCEKS -storepass yourpasswd -genkey -keyalg RSA -alias http
And
#./keytool -keystore certificates.ks -storetype JCEKS -storepass yourpasswd -genkey -keyalg RSA -alias consoleproxy

We can create longer-lived certificates (for example, 360 days) by adding the -validity option, as in the following commands:

#ssh root@vcloudip
#cd /opt/vmware/vcloud-director/jre/bin/
#./keytool -keystore certificates.ks -storetype JCEKS -storepass yourpasswd -validity 360 -genkey -keyalg RSA -alias http
#./keytool -keystore certificates.ks -storetype JCEKS -storepass yourpasswd -validity 360 -genkey -keyalg RSA -alias consoleproxy
#cp certificates.ks /opt/vmware/vcloud-director/ssl/
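
To confirm the new validity period before replacing the old keystore, the entries can be listed with the same keystore options as above (a quick check, not part of the original procedure):
#./keytool -keystore certificates.ks -storetype JCEKS -storepass yourpasswd -list -v | grep -i valid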

After creating the certificates, use the following commands to replace the old ones:

#ssh root@vcloudip
#service vmware-vcd stop
#cd /opt/vmware/vcloud-director/bin/
#./configure
Specify your generated SSL certificate's path(this example: /opt/vmware/vcloud-director/ssl/certificates.ks)
Enter the keystore and certificate passwords.
Please enter the password for the keystore:
Please enter the private key password for the 'http' SSL certificate:yourpasswd
Please enter the private key password for the 'consoleproxy' SSL certificate:yourpasswd
Choose "Yes" when it asks you to start the VCD service.

On the browser side, clear the browser cookies and cache, then browse to the vCloud Director URL.


A brief introduction to Zabbix

Zabbix Logo

Zabbix is a monitoring suite
that can monitor devices on the network.

The main monitoring methods it relies on are Zabbix agent, SNMP, SNMP trap, IPMI, SSH, Telnet, web, database, JMX, and custom scripts (which Zabbix calls external checks).

The Zabbix agent can run in active or passive mode.

Passive agent : the Zabbix server periodically polls the Zabbix agent on the client device to fetch the readings
Active agent : the Zabbix agent on the client device periodically reports its own readings

The Zabbix agent has many built-in OS-level monitoring items:
https://www.zabbix.com/documentation/2.0/manual/config/items/itemtypes/zabbix_agent

You can add your own methods to the Zabbix agent to extend what you monitor; Zabbix calls this a UserParameter, e.g. getting the IPMI SDR through KCS. Anything the OS can read can be monitored.

Zabbix's main monitoring concepts: Host, Item, Trigger, Action

Host : the device to be monitored
Item : the method of monitoring, i.e. how the data is collected
Trigger : the logic that checks whether a collected value is within the defined range
Action : what to do when it is not,
             e.g. email someone, send an IM over an XMPP-compliant messenger, send an SMS, or run scripts, etc.

The main configuration objects in Zabbix are: host, item, trigger, and template

Host : which device is to be monitored
Item : what to monitor on this host, which values to collect and how to collect them
Trigger : what to do when a collected value goes outside the defined range ...
Template : a template packages the three painstakingly configured objects above; it can be exported as XML and imported into other Zabbix servers for reuse. When importing you can choose to exclude hosts, which keeps the template generic. Everything is done in the web UI, which makes it very easy and convenient.

Differences from Nagios:

1. Zabbix does not need a separate RRD installation for graphing; it has simple built-in graphing, and every collected value gets a simple trend chart.
2. The Nagios NRPE plugin (a perl script) decides on the client side whether something is a problem and then reports back,
while Zabbix does the logic on the server side, so the thresholds are very flexible to set and change (see the trigger sketch after this list).
For example:
    Problem: If cpu1 temperature over 60 °C for last 10 minutes,
        then define it as a warning event.
    Recovery: If cpu1 temperature is within 20~60 °C for last 10 minutes,
        then define it as a recovery event.
3. All monitoring configuration is done in the web UI with mouse and keyboard.
4. Built-in monitoring for IPMI devices; version 2.2 can also monitor discrete sensors.
5. The frontend is written in PHP.
6. Zabbix is tied to a database; Nagios does not need one.
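
As mentioned in point 2, the threshold logic lives in server-side trigger expressions. A hedged sketch of what the temperature example above could look like as Zabbix 2.x trigger expressions; the host name and item key (myhost, sensor.cpu1.temp) are made up for illustration:
Problem:  {myhost:sensor.cpu1.temp.avg(10m)}>60
Recovery: {myhost:sensor.cpu1.temp.max(10m)}<=60 & {myhost:sensor.cpu1.temp.min(10m)}>=20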

Some screenshots:


Using PXE to remotely install multiple servers

Required equipment

1. One PXE server (dhcp, tftp, apache)

2. 1~40 blank servers

3. Each server needs a BMC (without one, you can still power on each machine manually and adjust the BIOS one by one)

4. One 48-port switch

Tools used on the PXE server

1. dhcpd, tftpd, apache

2. ipmitool (to control the BIOS boot order); see the sketch below
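
A hedged sketch of how ipmitool is typically used to force a one-time PXE boot and power-cycle a node through its BMC (the BMC IP and credentials are placeholders):
ipmitool -I lanplus -H <bmc_ip> -U <bmc_user> -P <bmc_password> chassis bootdev pxe
ipmitool -I lanplus -H <bmc_ip> -U <bmc_user> -P <bmc_password> power cycle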

Prepare the tools

#yum install dhcp tftp-server httpd

Configure dhcpd.conf


