Where is the report from the cioctl report command?
The output of the cioctl report command is in the /var/lib/storidge directory.
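For example, to generate the report and confirm where it was written (the report is saved as /var/lib/storidge/report.txz):

cioctl report
ls -l /var/lib/storidge/report.txz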
Please forward the report to email@example.com with details of the error you are troubleshooting.
Insufficient cluster capacity available to create this vdisk
Error message: "Fail: Add vd: Insufficient cluster capacity available to create this vdisk. Use smaller size"
If you are running a Storidge cluster on virtual servers or VMs, this error comes from a data collection process that creates twenty volumes and runs fio to collect performance data for Storidge's QoS feature.
The Storidge software will normally only run the data collection on physical servers. However, the data collection can be started on virtual servers or VMs that are not on the supported list.
Please run the cioctl report command and forward the report in the /var/lib/storidge directory to firstname.lastname@example.org. The report command collects configuration information and logs, including information on the virtual server. When forwarding the report, please request that the virtual server be added to the supported list.
cio node ls shows a node in maintenance mode with a missing node name. How do I recover the node?
This situation is likely the result of a node being cordoned or shut down for maintenance, and the cluster then being rebooted or power cycled.
Once the cluster is rebooted, the node that was previously in maintenance mode will stay in maintenance mode. The output of the cio node ls command may look something like this:
root@u1:~# cio node ls
NODENAME   IP              NODE_ID    ROLE      STATUS         VERSION
           192.168.3.95    d12a81bd   sds       maintenance
u3         192.168.3.29    7517e436   backup1   normal         V1.0.0-2986
u4         192.168.3.91    91a78c14   backup2   normal         V1.0.0-2986
u1         192.168.3.165   a11314f0   storage   normal         V1.0.0-2986
u5         192.168.3.160   888a7dd3   storage   normal         V1.0.0-2986
To restore the cordoned node, you can:
- Log in to the cordoned node and run cioctl node uncordon to rejoin the node to the cluster
- Uncordon the node by running cioctl node uncordon <IP address> from any node. In the example above, run cioctl node uncordon 192.168.3.95. The node is identified by IP address because the Storidge software does not depend on identifiers that users can change, e.g. the hostname.
- Reset or power cycle the cordoned node and it will automatically rejoin the cluster after rebooting
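Whichever method you use, you can confirm the node has rejoined by running the listing command again and checking that its STATUS has returned to normal:

cio node ls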
dockerd: msg="Node 085d698b3d2e/10.0.2.235, added to failed nodes list"
Error message: dockerd: time="2019-10-19T03:18:22.862011422Z" level=info msg="Node 085d698b3d2e/10.0.2.235, added to failed nodes list"
The error message indicates that internode cluster traffic is being interrupted. This could be a result of network interface failure or network bandwidth being saturated with too much incoming data. This will impact the ability of the Storidge cluster to maintain state.
Monitor bandwidth usage on each instance to confirm whether network bandwidth is being exhausted. Entries in syslog that indicate nodes added to the failed nodes list, iSCSI connection issues, or missing heartbeats are also indicators of network congestion.
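As a minimal sketch of such a check, assuming the sysstat package is installed, you can sample per-interface throughput and scan the system log (the log path may be /var/log/messages on RHEL/CentOS) for the indicators above:

sar -n DEV 1 10
grep -iE 'failed nodes list|iscsi|heartbeat' /var/log/syslog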
If there is only one network interface per instance, that single interface carries incoming data streams, orchestrator internode traffic, and Storidge data traffic.
For use cases handling a lot of front-end data, consider splitting the storage traffic onto a separate network, e.g. use instances with two network interfaces. Assign one interface to front-end network traffic and the second interface to the storage network.
When creating the Storidge cluster, you can specify which network interface to use with the --ip flag, e.g. run cioctl create --ip 10.0.1.51. When you run the cioctl node join command on the storage nodes, it will suggest an IP address from the same subnet.
Verify whether incoming data is going to just one node. Consider approaches such as a load balancer to spread incoming data across multiple nodes.
Calculate the amount of network bandwidth that will be generated by your use case. Verify that the network interface is capable of sustaining the data throughput. For example, a 10GigE interface can sustain about 700MB/s.
In calculations for data throughput, note that every 100MB/s of incoming data generates a multiple of that throughput for data replication. For 2-copy volumes, 100MB/s is written to the local node and another 100MB/s goes through the network interface to other nodes as replicated data, i.e. a 100MB/s incoming data stream results in 200MB/s of used network bandwidth.
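Putting the two figures above together: on a 10GigE interface sustaining about 700MB/s, a 2-copy workload saturates the interface at roughly 350MB/s of incoming data (350MB/s ingest plus 350MB/s replication), so plan for sustained ingest well below that ceiling.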
dockerd: level=warning msg="failed to create proxy for port 9999: listen tcp :9999: bind: address already in use"
Error message: dockerd: time="2019-10-10T17:35:59.961861284Z" level=warning msg="failed to create proxy for port 9999: listen tcp :9999: bind: address already in use"
The error message indicates a network port conflict between services. The example above indicates that port number 9999 is being used by more than one service on the node.
Verify there are no conflicts with port numbers used by the Storidge cluster.
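To identify which processes are bound to a contested port (port 9999 from the example above), one quick check is:

ss -tlnp | grep 9999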
"iscsid: Kernel reported iSCSI connection 2:0 error"
Error message: iscsid: Kernel reported iSCSI connection 2:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed)
The error message indicates an iSCSI connectivity issue between cluster nodes. This could be a result of conflicts such as duplicate iSCSI initiator names, or of other networking issues.
For a multi-node cluster to function correctly, the iSCSI initiator name on each node must be unique. Display the iSCSI initiator name on each node by running cat /etc/iscsi/initiatorname.iscsi, and confirm they are different.
If the iSCSI initiator name is not unique, you can change it with:
echo "InitiatorName=`/sbin/iscsi-iname`" > /etc/iscsi/initiatorname.iscsi
Since the iSCSI initiator name is used to set up connections to iSCSI targets during cluster initialization, it must be made unique before running cioctl create to start a cluster.
"Fail: node is already a member of a multi-node cluster"
Error message: Fail: node is already a member of a multi-node cluster
This error message in syslog indicates an attempt to add a node to the cluster that is already a member. Check your script or playbook to verify that the cioctl join command is being issued to a storage (worker) node and not the primary (sds) node.
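Before your automation issues cioctl join, a simple safeguard is to list the current members from a node that is already in the cluster and skip any node that appears in the output:

cio node ls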
This error can result in the related message below, which indicates the Storidge CIO kernel modules were incorrectly unloaded, breaking cluster initialization.
[DFS] dfs_exit:18218:dfs module unloaded
[VD ] vdisk_exit:2916:vd module unloaded
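To check whether the Storidge kernel modules (dfs and vd, per the messages above) are still loaded on the affected node:

lsmod | grep -wE 'dfs|vd'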
Get http://172.23.8.104:8282/metrics: dial tcp 172.23.8.104:8282: connect: connection refused
Error message: connect: connection refused
Getting a "Connection refused" errors on requests to an API endpoint likely means that the API server on the node is not running.
Run ps aux | grep cio-api to confirm. If it is not listed, run cio-api & on the node to restart the API.
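Once restarted, you can verify the metrics endpoint is reachable again, using the node IP and port from the error message above:

curl http://172.23.8.104:8282/metrics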
Run cioctl report to generate a cluster report, which will be saved to the file /var/lib/storidge/report.txz. Please forward the cluster report to email@example.com with details of the error for analysis.
"cluster: Number of remote drives on host 0 (IP 10.11.14.87) 0 is not the expected 9"
Error message: Number of remote drives on host 0 (IP 10.11.14.87) 0 is not the expected 9
This error message during cluster initialization is an indicator of an unstable networking environment, or insufficient compute capacity to handle networking packets.
The error indicates that node information was not properly passed to the primary node during cluster configuration. Example:
[root@EV15-HA1 ~]# cioctl init 67a2c2e8
Warning: Permanently added '10.11.14.90' (ECDSA) to the list of known hosts.
cluster: initialization started
cluster: Copy auto-multiNode-EV15-HA1.cfg to all nodes (NODE_NUMS:4)
cluster: Initialize target
cluster: Initialize initiator
cluster: Start node initialization
node: Clear drives
node: Load module
node: Add node backup relationship
node: Check drives
Adding disk /dev/sdb SSD to storage pool
Adding disk /dev/sdc SSD to storage pool
Adding disk /dev/sdd SSD to storage pool
Adding disk /dev/sde SSD to storage pool
Adding disk /dev/sdf SSD to storage pool
Adding disk /dev/sdg SSD to storage pool
Adding disk /dev/sdh SSD to storage pool
Adding disk /dev/sdi SSD to storage pool
Adding disk /dev/sdj SSD to storage pool
Adding disk /dev/sdk SSD to storage pool
Adding disk /dev/sdl SSD to storage pool
Adding disk /dev/sdm SSD to storage pool
node: Collect drive IOPS and BW: Total IOPS:26899 Total BW:1479.9MB/s
node: Initializing metadata
cluster: Node initialization completed
cluster: Number of remote drives on host 0 (IP 10.11.14.87) 0 is not the expected 9
cluster: Number of remote drives on host 1 (IP 10.11.14.88) 0 is not the expected 9
cluster: Number of remote drives on host 2 (IP 10.11.14.89) 0 is not the expected 9
cluster: Number of remote drives on host 3 (IP 10.11.14.90) 0 is not the expected 9
cluster: Cannot initialize cluster
cluster: 'cioctl clusterdeinit default.cfg 1' started
cluster: Killing MongoDB daemons
cluster: Killing cio daemons
cluster: Uninitialize initiator
cluster: Uninitialize target
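Before re-running cluster initialization, a basic sanity check is to verify connectivity and packet loss from the primary node to each storage node (IP addresses from the example above):

for ip in 10.11.14.87 10.11.14.88 10.11.14.89 10.11.14.90; do ping -c 4 $ip; done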
"cioctl: insmod: ERROR: could not insert module "
Error message: Feb 4 16:48:02 EV15-HA1 cioctl: insmod: ERROR: could not insert module /lib/modules/3.10.0-1062.el7.x86_64/kernel/drivers/storidge/vd.ko: File exists
If you are running VMs in a vSphere environment, the error message above means that secure boot is enabled for the VM. Since the Storidge software inserts a kernel module, secure boot needs to be disabled.
To turn secure boot off, the VM must first be powered off. Then right-click the VM, and select Edit Settings. Click the VM Options tab, and expand Boot Options. Under Boot Options, ensure that firmware is set to EFI.
Deselect the Secure Boot check box to disable secure boot. Click OK.
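To confirm from inside the guest whether secure boot is currently enabled, one option (assuming the mokutil utility is installed) is:

mokutil --sb-state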
Configuration Error: Cannot determine drive count on node at 10.11.14.87. Verify data drives have no filesystem or partitions
Error message: Configuration Error: Could not determine drive count on node 10.11.14.87 at 10.11.14.87. Verify data drives available
This error message while initializing a cluster indicates that while drives are available on the nodes, they may be formatted with a filesystem or have partitions. The Storidge software will not add these drives to the storage pool since they may contain user data.
Run the file -sL <device> command to check. For example, drives sdb, sdc and sdd below can be discovered and consumed by Storidge. However, drive sda below will be skipped.
root@ubuntu-16:~# file -sL /dev/sd*
/dev/sda: DOS/MBR boot sector
/dev/sda1: Linux rev 1.0 ext2 filesystem data (mounted or unclean), UUID=f838091f-e90f-4037-8352-4d7d2775667a (large files)
/dev/sda2: DOS/MBR boot sector; partition 1 : ID=0x8e, start-CHS (0x5d,113,21), end-CHS (0x3ff,254,63), startsector 2, 40439808 sectors, extended partition table (last)
/dev/sda5: LVM2 PV (Linux Logical Volume Manager), UUID: Tx8zdm-LIyl-Am4b-0Bbu-iJxv-yKsI-IR9NtO, size: 20705181696
/dev/sdb: data
/dev/sdc: data
/dev/sdd: data
Use dd to wipe out the metadata and make a drive available for Storidge. For example, to clear drive /dev/sdb:
dd if=/dev/zero of=/dev/sdb bs=1M count=300
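After clearing the drive, re-run the check to confirm it now reports plain data and can be consumed by Storidge:

file -sL /dev/sdb

An alternative (assuming the wipefs utility from util-linux is available) is wipefs -a /dev/sdb, which removes filesystem and partition table signatures without writing 300MB of zeros.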
Cluster breaks on VMware vSphere snapshot with error "[SDS] node_mgmt:14380:WARNING: node pingable: node.node_id:ab7dc460 [188.8.131.52] last_alive_sec"
Error message: [SDS] node_mgmt:14380:WARNING: node pingable: node.node_id:ab7dc460 [184.108.40.206] last_alive_sec
When you take a snapshot of a vSphere virtual machine with memory, e.g. for Veeam backup:
- The virtual machine becomes unresponsive or inactive
- The virtual machine does not respond to any commands
- You cannot ping the virtual machine
This is expected behavior in ESXi. Before a backup, a snapshot is taken; the backup job then runs; and finally the snapshot is removed. This causes the VM to lose connectivity for a period that depends on the amount of memory and changed data.
Storidge uses heartbeats to monitor the health of cluster nodes. When a VM does not respond for an extended time, it is marked as a failed node. Losing access to multiple nodes can break the cluster, so using VMware snapshots to back up Storidge nodes is not recommended.
Storidge will be introducing a backup service for cluster workloads. This will be based on volume snapshots, i.e. backups are at the granularity of a container and do not require a node to be suspended.