[Home] [Installation Information] [Docker Install] [Kubernetes Install] [Configuration Parameters] [Network Control via Ansible] [Operations] [Debuggging][QOS]
Debugging SiteRM Issues
If you encounter failures or issues with SiteRM, this guide provides debugging steps and potential fixes.
SiteRM Frontend
SiteRM provides a few ways to monitor your endpoints. Each SiteRM Frontend runs Frontend Web UI. You can identify Web UI URL from Git configuration, for example, here is for Caltech: FE-Config.yaml.
To access SiteRM Frontend, you need to have your Certificate DN Whitelisted or OIDC Authentication Service credentials.
- (This is currently default) In case of certificate, access permissions are controlled in Github repository for FE: Fe-Auth.yaml
- In case of OIDC, please contact your OIDC Issuer and administrator to create you an account or provide permissions.
SiteRM Frontend Web UI provides a way to monitor Site in real-time, track it’s status, connectivity, health of overall system at any given site. Key features:
- Topology - Displays the current network topology, that includes all known switches, servers, enabled pors, WAN connections.
- Frontend Configuration - Displays current runtime configuration for Frontend and any registered agents. In case of config update, or node removal, this page provides a way to reload configuration (force SiteRM to read new configuration from Github) or delete host.
- ServiceStates - Show information about each Service running at the site and it’s status.
- Models - View and show latest site Models in MRML, Turtle, Json-ld format. This is a format that represents Sites topology back to SENSE Orchestrator.
- Deltas - View and track state of all Deltas. Deltas are the submission by SENSE Orchestrator to make changes at the Site, e.g. Create vlan, remove vlan, set ip, etc…
- Active Requests - Shows SENSE Active requests in Json format.
- Debugging tools - Allow to request new debug action, like Ping, Traceroute, Arp, Ethr, FDT, Iperf3 tests.
- Change Active Start/End - Allows to change SENSE Active resource request start/end time. Useful if need to drain site and or some resources were not deleted correctly.
On the Top Right, Web-UI shows System status, by providing brief status overview of Alive,Ready,Readiness,Liveness. Additionally, it provides a link to a Swagger documentation for ALL APIs supported by SiteRM.
Autogole monitoring and Slack alerts
SENSE team maintains an Autogole monitoring, where each Site has a dedicated dashboard and automated Slack alerts for Site issues. You can find all Sites, and each site status information here: Autogole monitoring. This webpage requires either Github Authentication (send your gihtub username to SENSE team) or SENSE-O account.
Some of the errors are described below, but not all. We continue to improve the documentation and expand explanation of error and potential fixes.
Kubernetes install issues
Error Message
Error: INSTALLATION FAILED: create: failed to create: Request entity too large: limit is 3145728
Cause
This error occurs when the size of the Helm chart or the generated Kubernetes resources exceeds the default size limit set by the Kubernetes API server or ingress controllers.
Resolution
Use helm template ... | kubectl apply -f
command and not helm install
.
OR Use helm template ... &> deployment.yaml && kubectl apply -f deployment.yaml
Docker Startup Failure (Frontend)
Error Message
ERROR: Configuration file ../conf/etc/ansible-conf.yaml was not modified. SiteRM Will fail to start.
Cause
This occurs when the configuration file has not been modified. Even if your Frontend RM runs in raw switch mode, this file still needs to be modified, although it may not be used.
Resolution
Ensure the correct information is added to the ansible-conf.yaml
file. Refer to the documentation for proper configuration:
NetControl Ansible Documentation.
PluginException: Interface Not Found
Error Message
SiteRMLibs.CustomExceptions.PluginException: Interface enp1s0np0 was not found on the system. Misconfiguration
Cause
The specified interface is not present on the system, but it is configured in the sdn-sense/rm-configs configuration file.
Resolution
- Verify whether the interface exists on the system.
- If the interface is missing, check for kernel updates that may have renamed the interface.
- If a new interface name is required, update the configuration file and submit a pull request to rm-configs.
Interface Down
Error Message
Interface {interface} is not up. Check why interface is down.
Cause
The interface under SENSE control is currently down.
Resolution
Investigate the cause of the interface being down and take appropriate action to bring it back up.
Oversubscription Warning
Error Message
Interface {key} has no remaining reservable capacity! Over subscribed?
Cause
The node is oversubscribed due to too many requests.
Resolution
This is a warning. No new path requests can be provisioned on the affected node. Consider reducing load or increasing capacity.
VLAN Range Not Defined
Error Message
Interface {key} has no vlan range list defined!
Cause
A configured interface does not have a VLAN range defined.
Resolution
Ensure the VLAN range is properly set in the configuration file.
No Remaining VLANs
Error Message
No remaining vlans in vlan range list for interface {key}. All used?
Cause
All VLANs in the range are currently in use.
Resolution
This is a warning. No new requests can be processed until VLANs are freed up.
Port Not Added to Model
Error Message
SiteRMLibs.CustomExceptions.ServiceWarning: Warning. Port aristaeos_s0Ethernet23/1 not added into model. Its status not switchport. Ansible runner returned: {'bandwidth': 100000, 'duplex': 'duplexFull', 'lineprotocol': 'notPresent', 'macaddress': '44:4c:a8:55:2d:3c', 'mtu': 9214, 'operstatus': 'notconnect'}.
Cause
The port is missing the switchport: true
flag.
Resolution
- The port may have gone down. Check the status.
- The port may not have a switchport trunk defined. Refer to the device documentation for the correct configuration.
VLAN Manually Configured
Error Message
Vlan {vlan} is configured manually on {host}. It comes not from delta. Either deletion did not happen or was manually configured.
Cause
A VLAN allocated as part of the SENSE range is manually configured on the system.
Resolution
-
HOST: If manually configured, delete it using the following command:
ip link delete vlan.XXXX
-
SWITCH: If manually configured, delete vlan on the network device.
-
There might be cases, where SiteRM failed to delete or cancel the resource, check:
- For host, check the following log file inside agent container:
/var/log/siterm-agent/Ruler/api.log
- For switch, check the following log file inside agent container:
/var/log/siterm-site-fe/{LookUpService,ProvisioningService}/api.log
- For host, check the following log file inside agent container:
Errors in isAlias
or Switch Port Configuration
Error Messages
Interface {key} already has isAlias in git config (and we get it from Kube Labels. Which one is right?
Interface {key} already has switch port in git config (and we get it from Kube Labels. Which one is right?
Interface {key} already has switch in git config (and we get it from Kube Labels. Which one is right?
Interface {key} not found in Host. Either interface does not exist or NetInfo Plugin failed.
Interface {key} has no switch defined, nor was available from Kube (in case Kube install)!
Interface {key} has no switch port defined, nor was available from Kube (in case Kube install)!
Cause
A misconfiguration in Agent settings, possibly due to missing port configuration or conflicts between Kubernetes labels and the git configuration.
Resolution
Refer to the documentation for proper mapping of ports to resolve the issue.
SiteRM Ansible Debugger and Cleaner
The siterm-ansible-runner
script inside the SiteRM-FE container provides debugging and cleaning options.
Usage
# siterm-ansible-runner -h
usage: siterm-ansible-runner [-h] [--printports] [--dumpconfig] [--cleanswitch {exceptactive,onlyactive,all}] [--autoapply] [--fulldebug]
SENSE Ansible test runner
optional arguments:
-h, --help show this help message and exit
--printports Run ansible and print ports for rm-configs.
--dumpconfig Run ansible and dump configuration received from ansible.
--cleanswitch {exceptactive,onlyactive,all}
Clean switch VLANs:
- `exceptactive`: Clean all VLANs except active ones provisioned by SENSE.
- `onlyactive`: Clean only VLANs provisioned by SENSE.
- `all`: Clean all VLANs (Dangerous!).
--autoapply Auto-apply configuration to the switch (for `cleanswitch` only).
--fulldebug Run ansible with full debug output.
Notes
- The
exceptactive
option is safe. - The
onlyactive
andall
options are dangerous—use with caution. - Always validate configurations before applying changes.
Autogole Warning State
Error Messages
Autogole monitoring shows a warning state for a Service.
Cause
SiteRM executes multiple validation actions, and there might be many reasons for a warning. Please take a look at your Frontend Service states page, that will give a more detailed error message
Resolution
Detailed error message is available on Site’s Frotend Service states page.
Autogole Critical Errors
Error Messages
Autogole monitoring shows a critical error.
Cause
SiteRM executes multiple validation actions, and some errors are considered critical, like: Node down and agent not running, Disk/CPU/Memory overloaded, or many of the errors above (interface down, wrong link port, etc…)
Resolution
Look at Current active alerts/warnings graph and identify alert state. Autogole monitoring also provides ‘SiteRM Component Status (NOT OK)’ graph, that status of the Component over time.