Troubleshooting Guide
Introduction
This guide provides an overview of troubleshooting scenarios that may arise when installing and configuring KubeSlice.
Installation Issues
Why do I face installation issues while installing KubeSlice on kind clusters on Ubuntu OS?
On Ubuntu OS, installing KubeSlice on kind clusters can fail when too many files are open. Increase the ulimit to 2048 or unlimited and retry the installation. If you still face issues, see errors due to too many open files.
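For example, on an Ubuntu host you can raise the open-file limit for the current shell and, since kind clusters commonly exhaust inotify limits as well (which also surfaces as "too many open files"), raise those too. This is a minimal sketch; the sysctl values follow the kind project's general guidance and may need tuning for your environment.

```shell
# Raise the open-file limit for the current shell session
ulimit -n 2048

# kind clusters often exhaust inotify limits, which also surfaces as
# "too many open files"; raise them before reinstalling
sudo sysctl fs.inotify.max_user_watches=524288
sudo sysctl fs.inotify.max_user_instances=512
```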
Why do I get an error during a helm upgrade of the KubeSlice Controller?
Currently, you can only upgrade to a software patch version that does not contain schema changes. You cannot upgrade to a patch or major version that contains schema changes.
If the worker operator pod is down when you upgrade the controller using the helm upgrade command, you get the following error related to a mutating webhook:
Patch Deployment "kubeslice-controller-manager" in namespace kubeslice-controller
error updating the resource "kubeslice-controller-manager":
cannot patch "kubeslice-controller-manager" with kind Deployment: Internal error occurred: failed calling webhook "mdeploy.avesha.io": failed to call webhook: Post "https://kubeslice-webhook-service.kubeslice-system.svc:443/mutate-appsv1-deploy?timeout=10s": no endpoints available for service "kubeslice-webhook-service"
Looks like there are no changes for Deployment "kubernetes-dashboard"
Looks like there are no changes for Deployment "dashboard-metrics-scraper"
Patch Certificate "kubeslice-controller-serving-cert" in namespace kubeslice-controller
Patch Issuer "kubeslice-controller-selfsigned-issuer" in namespace kubeslice-controller
Patch MutatingWebhookConfiguration "kubeslice-controller-mutating-webhook-configuration" in namespace
Patch ValidatingWebhookConfiguration "kubeslice-controller-validating-webhook-configuration" in namespace
Error: UPGRADE FAILED: cannot patch "kubeslice-controller-manager" with kind Deployment: Internal error occurred: failed calling webhook "mdeploy.avesha.io": failed to call webhook: Post "https://kubeslice-webhook-service.kubeslice-system.svc:443/mutate-appsv1-deploy?timeout=10s": no endpoints available for service "kubeslice-webhook-service"
To resolve this error, manually delete the mutating webhook configuration as described below, and then rerun the upgrade (see the sketch after these steps):

- Get the name of the MutatingWebhookConfiguration webhook using the following command:

  kubectl get mutatingwebhookconfiguration

  Expected Output

  NAME                                                  WEBHOOKS   AGE
  cdi-api-datavolume-mutate                             1          16d
  cert-manager-webhook                                  1          31d
  istio-sidecar-injector                                4          15d
  kubeslice-controller-mutating-webhook-configuration   7          30d
  kubeslice-mutating-webhook-configuration              1          29d
  longhorn-webhook-mutator                              1          17d
  nsm-admission-webhook-cfg                             1          29d
  virt-api-mutator                                      4          18d

  Note down the name of the MutatingWebhookConfiguration webhook, which is kubeslice-mutating-webhook-configuration in the above output.

- Delete the MutatingWebhookConfiguration using the following command:

  kubectl delete mutatingwebhookconfiguration kubeslice-mutating-webhook-configuration
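Once the stale webhook configuration is deleted, rerun the upgrade. The sketch below assumes a typical Helm setup; the release name, chart reference, and values file are placeholders for whatever you used at install time.

```shell
# Re-run the controller upgrade after removing the stale webhook.
# Release name, chart, and values file are placeholders.
helm upgrade kubeslice-controller kubeslice/kubeslice-controller \
  -n kubeslice-controller \
  -f <your-controller-values>.yaml
```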
Connectivity Issues
Why is my registered cluster not connected to the KubeSlice Controller?
There could be an issue during the installation of the Slice Operator on the registered cluster. Try these steps:
- Switch the context to the registered cluster on which you are facing connectivity issues using the following command:

  kubectx <cluster name>

- Validate the installation of the Slice Operator by checking the pods belonging to the kubeslice-controller-system namespace using the following command, and check the status of the pods in the output:

  kubectl get pods -n kubeslice-controller-system

- If the connection issue still persists, check if the KubeSlice Controller endpoint and token configured in the Slice Operator YAML file applied on that registered cluster are correct. To know about the configuration, see the Slice Operator YAML file.
Registering clusters with the same name does not throw an error.
Each instance of the cluster is registered separately as a different cluster; Kubernetes ignores the duplication of cluster names rather than flagging it. It is best to avoid duplicate cluster names.
The KubeSlice Controller was successfully installed with a controller endpoint that is not reachable by a slice.
Check if the controller endpoint is correct during the installation of the Slice Operator on the worker cluster. Check if the controller cluster's secret token and ca-cert installed on the worker cluster are correct. To know more, see Getting the Secrets of the Registered Cluster.
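As a hedged example of verifying the secret contents, you can decode the token from the controller cluster and compare it with the value in the Slice Operator YAML. The secret name pattern and project namespace below are assumptions based on common KubeSlice setups; substitute the names from your own installation.

```shell
# Inspect the worker cluster's secret on the controller cluster
# (secret and namespace names are placeholders)
kubectl get secret kubeslice-rbac-worker-<cluster-name> \
  -n kubeslice-<project-name> -o yaml

# Decode the token to compare with the Slice Operator YAML values
kubectl get secret kubeslice-rbac-worker-<cluster-name> \
  -n kubeslice-<project-name> -o jsonpath='{.data.token}' | base64 --decode
```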
Node IP address on the registered cluster was changed but the KubeSlice components were not cleaned up.
When the Node IP address is changed on a registered cluster, a manual clean-up of the worker cluster configuration is required for it to use the updated IP. We therefore recommend that you neither change the Node IP manually once it is configured nor add an invalid Node IP address.
While registering a cluster, the Node IP is configured by pulling the value from the cluster.
A cluster registration failed with a correct cluster YAML file.
The registration fails when a single cluster.yaml file is applied to register more than one cluster. Ensure that a cluster.yaml file is applied to only one cluster, not multiple clusters.
Why do I experience router connectivity issues when one or more nodes are restarted in the worker clusters?
There is a connection disruption in some routers when one or more nodes are restarted in the worker clusters. You must restart the application pod to restore the router connectivity.
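For example, you can restart the affected application pods with a rollout restart, which recreates them without changing the deployment spec. The deployment and namespace names are placeholders.

```shell
# Recreate the application pods so they reattach to the slice router
kubectl rollout restart deployment <app-deployment> -n <app-namespace>

# Alternatively, delete a single affected pod; its controller recreates it
kubectl delete pod <app-pod> -n <app-namespace>
```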
Cluster Issues
The error/warning states that the CRD object is stuck.
- Patch an empty finalizer to the failing CRD object, as shown in this example, where serviceexportconfigs.hub.kubeslice.io is the failing CRD object:

  kubectl patch crd/serviceexportconfigs.hub.kubeslice.io -p '{"metadata":{"finalizers":[]}}' --type=merge

- Uninstall and reinstall the KubeSlice Controller.
The error states that the project namespace is stuck.
- Delete the stuck namespace by patching away its finalizers with the following command (to find namespaces stuck in the Terminating phase, see the sketch after these steps):

  kubectl patch ns/<stuck-namespace> -p '{"metadata":{"finalizers":[]}}' --type=merge

- Uninstall and reinstall the KubeSlice Controller.
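To find namespaces that are stuck before patching, you can list the ones in the Terminating phase using a standard kubectl field selector:

```shell
# List namespaces that are stuck terminating
kubectl get namespaces --field-selector status.phase=Terminating
```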
Slice and Cluster Health Issues
What should I do when nsmgr is down?
Identify the worker cluster on which nsmgr is down. nsmgr is a DaemonSet running in the kubeslice-system namespace. You can verify the status of nsmgr on the corresponding worker cluster to further trace the issue.
To verify the status of nsmgr:

- Get the details of nsmgr using the following command:

  kubectl get daemonset -n kubeslice-system --selector=app=nsmgr

  Expected Output

  NAME    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
  nsmgr   3         3         3       3            3           <none>          17d

- Get the status of nsmgr from all the pods by using the following command:

  kubectl get pods -n kubeslice-system --selector=app=nsmgr

  Expected Output

  NAME          READY   STATUS    RESTARTS      AGE
  nsmgr-6gfxz   2/2     Running   3 (40h ago)   17d
  nsmgr-jtxxr   2/2     Running   2 (12d ago)   17d
  nsmgr-tdmd8   2/2     Running   0             11d

  If the status is not Running, then nsmgr on that pod is down. See the debugging sketch after these steps.
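When a pod is not Running, describing it and checking its logs usually reveals the cause. This is a minimal sketch (the pod name is a placeholder), and the same approach applies to the other DaemonSets and Deployments covered in this section.

```shell
# Show events and state transitions for the failing pod
kubectl describe pod <nsmgr-pod-name> -n kubeslice-system

# Fetch recent logs; add -c <container-name> to pick a container
# in multi-container pods
kubectl logs <nsmgr-pod-name> -n kubeslice-system --tail=100
```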
What should I do when forwarder is down?
Identify the worker cluster on which forwarder is down. forwarder is a DaemonSet running in the kubeslice-system namespace. You can verify the status of forwarder on the corresponding worker cluster to further trace the issue.
To verify the status of forwarder:

- Get the details of forwarder using the following command:

  kubectl get daemonset -n kubeslice-system --selector=app=forwarder-kernel

  Expected Output

  NAME               DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
  forwarder-kernel   3         3         3       3            3           <none>          17d

  If a forwarder pod is down, then the number under READY will not match the other columns.

- Get the status of forwarder from all the pods using the following command:

  kubectl get pods -n kubeslice-system --selector=app=forwarder-kernel

  Expected Output

  NAME                     READY   STATUS    RESTARTS   AGE
  forwarder-kernel-2zb9r   1/1     Running   0          17d
  forwarder-kernel-jjzz7   1/1     Running   0          17d
  forwarder-kernel-r5kcw   1/1     Running   0          17d

  If the STATUS is not Running, then that pod is down.
What should I do when netop is down?
Identify the worker cluster on which netop is down. netop is a DaemonSet running in the kubeslice-system namespace. You can verify the status of netop on the corresponding worker cluster to further trace the issue.
To verify the status of netop:

- Get the details of netop using the following command:

  kubectl get daemonsets -n kubeslice-system | grep kubeslice-netop

  Expected Output

  kubeslice-netop   2   2   2   2   2   <none>   17d

- The netop pod is restricted to run only on gateway nodes. Get the gateway nodes that run the netop pod:

  kubectl get nodes --selector=kubeslice.io/node-type=gateway

  Expected Output

  NAME                                                  STATUS   ROLES    AGE   VERSION
  gke-demo-cluster-2-s-demo-cluster-2-s-3e484d4b-cbnl   Ready    <none>   17d   v1.23.16-gke.1400
  gke-demo-cluster-2-s-demo-cluster-2-s-3e484d4b-qnwp   Ready    <none>   17d   v1.23.16-gke.1400

- Get the netop pods in the kubeslice-system namespace using the following command:

  kubectl get pods -n kubeslice-system --selector=app=app_net_op

  Expected Output

  NAME                    READY   STATUS    RESTARTS   AGE
  kubeslice-netop-dqsg7   1/1     Running   0          17d
  kubeslice-netop-jc4c2   1/1     Running   0          11d

  If the STATUS is not Running, then that pod is down.
What should I do when spire-agent is down?
Identify the worker cluster on which spire-agent is down. spire-agent is a DaemonSet running in the spire namespace. You can verify the status of spire-agent on the corresponding worker cluster to further trace the issue.
To verify the status of spire-agent:

- Get the details of spire-agent using the following command:

  kubectl get daemonset -n spire --selector=app=spire-agent

  Expected Output

  NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
  spire-agent   3         3         3       3            3           <none>          17d

- Get the spire-agent pods from the spire namespace using the following command:

  kubectl get pods -n spire --selector=app=spire-agent

  Expected Output

  NAME                READY   STATUS    RESTARTS   AGE
  spire-agent-l692m   1/1     Running   0          11d
  spire-agent-nrfnf   1/1     Running   0          17d
  spire-agent-xp5m8   1/1     Running   0          17d

  If the STATUS is not Running, then that pod is down. A restart sketch follows these steps.
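Because spire-agent is managed by a DaemonSet, deleting a stuck pod makes Kubernetes recreate it, which often clears a transient failure. The pod name below is a placeholder.

```shell
# Delete the failing pod; the DaemonSet controller recreates it
kubectl delete pod <spire-agent-pod-name> -n spire
```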
What should I do when spire-server is down?
Identify the worker cluster on which spire-server is down. spire-server is a StatefulSet running in the spire namespace. You can verify the status of spire-server on the corresponding worker cluster to further trace the issue.
To verify the status of spire-server:

- Get the details of spire-server using the following command:

  kubectl get statefulset -n spire --selector=app=spire-server

  Expected Output

  NAME           READY   AGE
  spire-server   1/1     17d

- Get the spire-server pods from the spire namespace using the following command:

  kubectl get pods -n spire --selector=app=spire-server

  Expected Output

  NAME             READY   STATUS    RESTARTS   AGE
  spire-server-0   2/2     Running   0          17d

  If the STATUS is not Running and 2/2 pods are not READY, then that pod is down.
What should I do when admission-webhook is down?
Identify the worker cluster on which admission-webhook is down. admission-webhook is a Deployment running in the kubeslice-system namespace. You can verify the status of admission-webhook on the corresponding worker cluster to further trace the issue.
To verify the status of admission-webhook:

- Get the details of admission-webhook using the following command:

  kubectl get deployment -n kubeslice-system --selector=app=admission-webhook-k8s

  Expected Output

  NAME                        READY   UP-TO-DATE   AVAILABLE   AGE
  nsm-admission-webhook-k8s   1/1     1            1           17d

- Get the admission-webhook pods from the kubeslice-system namespace using the following command:

  kubectl get pod -n kubeslice-system --selector=app=admission-webhook-k8s

  Expected Output

  NAME                                         READY   STATUS    RESTARTS   AGE
  nsm-admission-webhook-k8s-698784967d-nmtrl   1/1     Running   0          17d

  If the STATUS is not Running and the 1/1 pod is not READY, then that pod is down.
What should I do when istiod is down?
Identify the worker cluster on which istiod is down. istiod is a Deployment running in the istio-system namespace. You can verify the status of istiod on the corresponding worker cluster to further trace the issue.
To verify the status of istiod:

- Get the details of istiod using the following command:

  kubectl get deployment -n istio-system --selector app=istiod

  Expected Output

  NAME     READY   UP-TO-DATE   AVAILABLE   AGE
  istiod   1/1     1            1           17d

- Get the istiod pods from the istio-system namespace using the following command:

  kubectl get pods -n istio-system --selector app=istiod

  Expected Output

  NAME                      READY   STATUS    RESTARTS   AGE
  istiod-6b56cffbd9-8xx5t   1/1     Running   0          17d

  If the STATUS is not Running and the 1/1 pod is not READY, then that pod is down.
What should I do when dns is down?
Identify the worker cluster on which dns is down. dns is a Deployment running in the kubeslice-system namespace. You can verify the status of dns on the corresponding worker cluster to further trace the issue.
To verify the status of dns:

- Get the details of dns using the following command:

  kubectl get deployments -n kubeslice-system --selector=app=kubeslice-dns

  Expected Output

  NAME            READY   UP-TO-DATE   AVAILABLE   AGE
  kubeslice-dns   1/1     1            1           17d

- Get the dns pods from the kubeslice-system namespace using the following command:

  kubectl get pods -n kubeslice-system --selector=app=kubeslice-dns

  Expected Output

  NAME                             READY   STATUS    RESTARTS   AGE
  kubeslice-dns-79d4fc6477-frjw6   1/1     Running   0          17d

  If the STATUS is not Running and the 1/1 pod is not READY, then that pod is down.
What should I do when slice-router is down?
slice-router is a Deployment running in the kubeslice-system namespace; there is only one per slice. You can verify the status of slice-router on the corresponding worker cluster to further trace the issue.
To verify the status of slice-router:

- Get the details of slice-router using the following command:

  kubectl get deployment -n kubeslice-system | grep vl3-slice-router-<slice-name>

  Example

  kubectl get deployment -n kubeslice-system | grep vl3-slice-router-bookinfo-slice

  Expected Output

  vl3-slice-router-bookinfo-slice   1/1   1   1   17d

- Get the slice-router pods from the kubeslice-system namespace using the following command:

  kubectl get pods -n kubeslice-system --selector=kubeslice.io/pod-type=router,kubeslice.io/slice=bookinfo-slice

  Expected Output

  NAME                                               READY   STATUS    RESTARTS   AGE
  vl3-slice-router-bookinfo-slice-795fb754cc-4dnl4   2/2     Running   0          11d

  If the STATUS is not Running and 2/2 pods are not READY, then that pod is down.
What should I do when egress is down?
egress is a Deployment running in the kubeslice-system namespace. You can verify the status of egress on the corresponding worker cluster to further trace the issue.
To verify the status of egress:

- Get the details of egress using the following command:

  kubectl get deployment -n kubeslice-system --selector=istio=egressgateway,slice=bookinfo-slice

  Expected Output

  NAME                                 READY   UP-TO-DATE   AVAILABLE   AGE
  bookinfo-slice-istio-egressgateway   1/1     1            1           17d

- Get the egressgateway pods from the kubeslice-system namespace using the following command:

  kubectl get pod -n kubeslice-system --selector=istio=egressgateway,slice=bookinfo-slice

  Expected Output

  NAME                                                  READY   STATUS    RESTARTS   AGE
  bookinfo-slice-istio-egressgateway-7548b49659-9z4c5   2/2     Running   0          17d
What should I do when ingress is down?
ingress is a Deployment running in the kubeslice-system namespace. You can verify the status of ingress on the corresponding worker cluster to further trace the issue.
To verify the status of ingress:

- Get the details of ingress using the following command:

  kubectl get deployment -n kubeslice-system --selector=istio=ingressgateway,slice=bookinfo-slice

  Expected Output

  NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
  bookinfo-slice-istio-ingressgateway   1/1     1            1           17d

- Get the ingressgateway pods from the kubeslice-system namespace using the following command:

  kubectl get pod -n kubeslice-system --selector=istio=ingressgateway,slice=bookinfo-slice

  Expected Output

  NAME                                                  READY   STATUS    RESTARTS   AGE
  bookinfo-slice-istio-ingressgateway-765fb4ddf-d52cs   2/2     Running   0          17d

  If the STATUS is not Running and 2/2 pods are not READY, then that pod is down.
What should I do when slicegateway is down?
slicegateway is a Deployment running in the kubeslice-system namespace. You can verify the status of slicegateway on the corresponding worker cluster to further trace the issue.
Slice gateways are always created in pairs.
To verify the status of slicegateway:

- Get the details of slicegateway using the following command:

  kubectl get deployment -n kubeslice-system --selector=kubeslice.io/pod-type=slicegateway,kubeslice.io/slice=bookinfo-slice

  Expected Output

  NAME                                 READY   UP-TO-DATE   AVAILABLE   AGE
  bookinfo-slice-worker-1-worker-2-0   1/1     1            1           17d
  bookinfo-slice-worker-1-worker-2-1   1/1     1            1           17d

  If a pod is down, then it is shown under READY as 0/1.

- Get the slicegateway pods from the kubeslice-system namespace using the following command:

  kubectl get pod -n kubeslice-system --selector=kubeslice.io/pod-type=slicegateway,kubeslice.io/slice=bookinfo-slice

  Expected Output

  NAME                                                  READY   STATUS    RESTARTS   AGE
  bookinfo-slice-worker-1-worker-2-0-97748d58b-sqm7s    3/3     Running   0          17d
  bookinfo-slice-worker-1-worker-2-1-8496454697-mw8cs   3/3     Running   0          17d

  If the STATUS is not Running and 3/3 pods are not READY, then that pod is down. A sketch for inspecting the gateway pair follows these steps.
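Because slice gateways are created in pairs, inspecting the gateway objects can help identify which side of the tunnel is unhealthy. This is a hedged sketch: the slicegateways resource reflects common KubeSlice worker installations, and its name or fields may differ in your version.

```shell
# List the slice gateway objects for the slice
# (resource name may vary by KubeSlice version)
kubectl get slicegateways -n kubeslice-system

# Describe one gateway to inspect its tunnel and peer status
kubectl describe slicegateway <gateway-name> -n kubeslice-system
```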
Onboarded Application Namespace Issues
NSM containers are not injected in pods during deployments in the application namespace.
If NSM containers are not injected in pods during deployments in the application namespace, then check if that application namespace contains the KubeSlice label. If the label is not there, wait for the Slice Operator to label the namespace.
For example, run the following command to check the label:

  kubectl describe ns iperf

In the command output below, kubeslice.io/slice=blue is the KubeSlice label.

  Name:         iperf
  Labels:       hnc.x-k8s.io/included-namespace=true
                iperf.tree.hnc.x-k8s.io/depth=0
                kubernetes.io/metadata.name=iperf
                kubeslice.io/slice=blue
  Annotations:  <none>
  Status:       Active

  No resource quota.
  No LimitRange resource.

The iperf namespace contains the kubeslice.io/slice=blue label, which means the namespace is already onboarded to the blue slice.
I face connectivity issues with the NSM interfaces present in an application pod.
When more than one NSM interface is present in an application pod, connectivity issues occur. This happens because the router retains an older NSM interface, which disrupts connectivity among the application pods.
You must reboot the application pod that has more than one NSM interface to restore connectivity among the application pods.
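To check how many NSM interfaces a pod has, you can list the network interfaces inside it. This is a hedged sketch; the nsm name prefix is an assumption based on common NSM deployments, and the pod and namespace names are placeholders.

```shell
# List interfaces inside the application pod and filter for
# NSM-managed ones (commonly named nsm0, nsm1, ...)
kubectl exec <app-pod> -n <app-namespace> -- ip addr | grep -i nsm

# If more than one NSM interface appears, delete the pod so its
# controller recreates it with a single interface
kubectl delete pod <app-pod> -n <app-namespace>
```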
kubeslice-cli
This guide describes troubleshooting scenarios that you could face while installing and using the kubeslice-cli tool.
Unable to Install KubeSlice using the kubeslice-cli Tool on Ubuntu
During the installation of KubeSlice using the kubeslice-cli install -p=minimal-demo command, you may get the following error message:
✓ Writing configuration 📜
• Starting control-plane 🕹️ ...
✗ Starting control-plane 🕹️
ERROR: failed to create cluster: failed to init node with kubeadm: command "docker exec --privileged ks-w-2-control-plane kubeadm init --skip-phases=preflight --config=/kind/kubeadm.conf --skip-token-print --v=6" failed with error: exit status 137
Command Output:
2022/10/04 06:12:21 Process failed exit status 1
This error usually indicates a memory or disk space issue.
To resolve:
- Remove unused clusters (other than the ones used in the demo), as shown in the sketch below.
- Increase disk space/memory resources.
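For example, on a kind-based demo setup you can remove unused kind clusters and reclaim Docker disk space. The cluster name below is a placeholder.

```shell
# List existing kind clusters
kind get clusters

# Delete any cluster you no longer need
kind delete cluster --name <unused-cluster>

# Reclaim disk space from stopped containers and dangling images
docker system prune
```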
Unable to Run the kubeslice-cli Commands
After successfully installing KubeSlice using kubeslice-cli, you may be unable to run commands, as in the following example:
kubeslice-cli get sliceConfig -n kubeslice-demo
Fetching KubeSlice sliceConfig...
🏃 Running command: /usr/local/bin/kubectl get sliceconfigs.controller.kubeslice.io -n demo
error: the server doesn't have a resource type "sliceconfigs"
2022/10/04 08:26:40 Process failed exit status 1
To resolve:
- Ensure you are on the controller cluster before running the commands; verify the current context using kubectx -c.
- Export the configuration file using this command: export KUBECONFIG=kubeslice/<path-to-the-kubeconfig-file>. Then verify your setup as shown in the sketch below.
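After switching the context and exporting the kubeconfig, you can confirm that the controller's resources are reachable before retrying, using standard kubectl commands:

```shell
# Confirm the current context points at the controller cluster
kubectl config current-context

# Confirm the KubeSlice controller CRDs are registered
kubectl api-resources | grep kubeslice
```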
Getting an Unverified Developer Error Message on macOS
When you try to install kubeslice-cli on macOS, you may get an unverified developer error message.
This error message appears when you try to install an application from a developer who is not registered with Apple.
To resolve:
Follow the instructions in Enabling the Application for macOS.