TEA troubleshooting
Internal notes on troubleshooting lower-level issues that can happen with the managed labs.
CML Infrastructure
Q: How do I log in to the CML backend infrastructure?
A: Direct SSH access to the machines is blocked; use Tunnel & Access instead. Make sure you have cloudflared
installed locally and add this to your ~/.ssh/config:
Host *.labs.cfdata.org
User cloudflare
ProxyCommand /opt/homebrew/bin/cloudflared access ssh --hostname %h
You will still be prompted for a password when logging in via ssh; that is expected.
To log in to a bastion, use the pool name for the hostname, e.g.:
ssh apjc.ssh.labs.cfdata.org
ssh euwest.ssh.labs.cfdata.org
ssh uswest.ssh.labs.cfdata.org
ssh appsvc-origin-1.ssh.labs.cfdata.org
# ... etc
Q: How are the AppSvc origins set up?
A: Because we were having issues with load-balancer probes occasionally DDoSing the origins (especially whenever someone enabled probes from all colos), we separated the origins for the labs. There are now 4 origin servers.
Most labs actually use origin-3 and origin-4, behind a load balancer that's accessible on public IP 20.88.188.200. End users never interact with these origins individually. To avoid confusing users and to stay aligned with the lab narrative, these origins return X-Origin: origin-1 and X-Origin: origin-2.
The Load Balancer lab uses origin-1 and origin-2, accessible on public IPs:
origin-1: 40.121.134.173
origin-2: 40.76.65.119
Guacamole
Q: I need to restart the Guacamole service running on a bastion.
A: sudo systemctl restart tomcat9
Q: Guacamole sessions are timing out for all attendees in a specific pool.
A: We don't know why this is happening yet, but it has already happened a couple of times. Reboot the whole bastion VM when this happens (restarting the Guacamole service is not enough): sudo reboot
CML Authentication
The auth flow works in the following way:
- We first check whether the user already exists in the CML database. If they do, we allow them to log in (although see the Rockwell flow below for partner exceptions).
- If they don't, we check whether their email ends with @cloudflare.com. If so, we auto-provision that user in the DB and allow them to log in.
- If they are not from @cloudflare.com, we check whether that email exists in Rockwell (via Rockwell's API). If so, we auto-provision that user in the DB and allow them to log in.
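The three checks above can be sketched roughly as follows. This is only an illustration, not the actual CML code: UserStore, can_log_in and the exists_in_rockwell callback are hypothetical names, and the partner exceptions from the Rockwell flow are deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class UserStore:
    """Hypothetical in-memory stand-in for the CML Users table."""
    emails: set = field(default_factory=set)

    def exists(self, email: str) -> bool:
        return email in self.emails

    def provision(self, email: str) -> None:
        self.emails.add(email)


def can_log_in(email: str, store: UserStore,
               exists_in_rockwell: Callable[[str], bool]) -> bool:
    # 1. Existing users are allowed in (partner exceptions not modelled here).
    if store.exists(email):
        return True
    # 2. Cloudflare employees are auto-provisioned.
    if email.endswith("@cloudflare.com"):
        store.provision(email)
        return True
    # 3. Everyone else must exist in Rockwell to be auto-provisioned.
    if exists_in_rockwell(email):
        store.provision(email)
        return True
    # Anything else is denied.
    return False
```

Passing the Rockwell lookup in as a callback keeps the sketch runnable without network access; the real lookup goes through Rockwell's API.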
Rockwell authentication flow
The above flow is just the high-level picture. When partners are involved, the flow gets a bit more complicated:
- Whenever we successfully check that a user exists in Rockwell, we update the LastAuthenticatedAt timestamp in the DB.
- If an existing Rockwell user logs in, we first check the DB to see if they are a Rockwell user (AuthProvider=PartnerPortal) and check their LastAuthenticatedAt.
- If LastAuthenticatedAt is less than 2 weeks old, we just allow the user to log in without checking again with Rockwell.
- If LastAuthenticatedAt is more than 2 weeks old, we query Rockwell and update LastAuthenticatedAt if successful.
- If for whatever reason Rockwell doesn't respond (outage), we check whether LastAuthenticatedAt is less than 2 months old. If it is, we still allow the user to log in for now, but will try to query Rockwell again on any subsequent attempt.
- In all other cases we default-deny.
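As a rough sketch of the decision logic above for a user who already has AuthProvider=PartnerPortal: the function name, the RockwellUnavailable exception and the callback are hypothetical, "2 months" is approximated as 60 days, and treating a timestamp exactly 2 weeks old as needing a re-check is an assumption, since the notes don't define the boundary.

```python
from datetime import datetime, timedelta
from typing import Callable, Optional, Tuple

RECHECK_AFTER = timedelta(weeks=2)
OUTAGE_GRACE = timedelta(days=60)  # assumption: "2 months" taken as 60 days


class RockwellUnavailable(Exception):
    """Hypothetical error raised when Rockwell does not respond."""


def check_partner_login(
    last_authenticated_at: datetime,
    now: datetime,
    exists_in_rockwell: Callable[[], bool],
) -> Tuple[bool, Optional[datetime]]:
    """Return (allowed, new LastAuthenticatedAt or None if unchanged)."""
    age = now - last_authenticated_at
    if age < RECHECK_AFTER:
        # Recently verified: allow without asking Rockwell again.
        return True, None
    try:
        found = exists_in_rockwell()
    except RockwellUnavailable:
        # Outage: allow only inside the grace window and leave the
        # timestamp alone, so the next attempt queries Rockwell again.
        return age < OUTAGE_GRACE, None
    if found:
        return True, now  # successful check refreshes the timestamp
    return False, None    # default-deny
```

Returning the new timestamp instead of writing it keeps the DB update outside the sketch; a caller would persist it whenever the second element is not None.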
Q: My Rockwell (Partner University) users are not able to authenticate.
A: Find the user record in the CML_RESOURCES database; users live in a table creatively named Users.
Switch to the Console tab and use the following query:
SELECT * FROM Users WHERE Email = '[email protected]';
- Check if the user exists and verify that AuthProvider=PartnerPortal; this means we will apply the above Rockwell flow to this user.
- Check when we last verified that they exist in Rockwell (LastAuthenticatedAt).
- Check that they actually exist in Rockwell.