TEA troubleshooting

Internal notes on troubleshooting lower-level issues that can occur with the managed labs.

CML Infrastructure

Q: How do I log in to the CML backend infrastructure?

A: Direct SSH access to the machines is blocked; use Tunnel & Access instead. Make sure you have cloudflared installed locally and add this to your ~/.ssh/config (adjust the cloudflared path if yours is installed somewhere else):

Host *.labs.cfdata.org
  User cloudflare
  ProxyCommand /opt/homebrew/bin/cloudflared access ssh --hostname %h

You will still be prompted for a password when logging in via SSH. That is expected.

To log in to a bastion, use the pool name as the hostname; the AppSvc origins are reachable the same way by machine name, e.g.:

ssh apjc.ssh.labs.cfdata.org
ssh euwest.ssh.labs.cfdata.org
ssh uswest.ssh.labs.cfdata.org
ssh appsvc-origin-1.ssh.labs.cfdata.org
# ... etc

Q: How are the AppSvc origins set up?

A: Because we were having issues with load-balancer probes occasionally DDoSing the origins (especially whenever someone enabled probes from all colos), we separated the origins for the labs. There are now 4 origin servers.

Most labs actually use origin-3 and origin-4, behind a load balancer that's accessible on public IP 20.88.188.200. End users never interact with these origins individually. To avoid confusing users and to stay consistent with the lab narrative, these origins return X-Origin: origin-1 and X-Origin: origin-2.

The Load Balancer lab uses origin-1 and origin-2, accessible on public IPs:

origin-1: 40.121.134.173
origin-2: 40.76.65.119

Guacamole

Q: How do I restart the Guacamole service running on a bastion?

A: sudo systemctl restart tomcat9

Q: Guacamole sessions are timing out for all attendees in a specific pool.

A: We don't know why this is happening yet, but it has already happened a couple of times. When it does, reboot the whole bastion VM (restarting the Guacamole service is not enough): sudo reboot

CML Authentication

The auth flow works in the following way:

We first check whether the user already exists in the CML database. If they do, we allow them to log in (although see the Rockwell flow below for partner exceptions).
If they don't, we check whether their email ends with @cloudflare.com. If so, we auto-provision that user in the DB and allow them to log in.
If they are not from @cloudflare.com, we check whether that email exists in Rockwell (via Rockwell's API). If so, we auto-provision that user in the DB and allow them to log in.
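
The three checks above can be sketched roughly as follows. This is only an illustration, not the actual CML code: the authenticate function, the users dict (a stand-in for the Users table), the exists_in_rockwell callback and the "Cloudflare" provider label are all hypothetical names. Only the PartnerPortal value comes from the real schema.

```python
from typing import Callable, Dict, Tuple


def authenticate(
    email: str,
    users: Dict[str, str],
    exists_in_rockwell: Callable[[str], bool],
) -> Tuple[bool, str]:
    """Return (allowed, reason) for a login attempt.

    `users` maps email -> AuthProvider and stands in for the Users table
    (hypothetical structure). `exists_in_rockwell` stands in for the call
    to Rockwell's API (hypothetical signature).
    """
    # 1. Existing users are allowed in (partners get extra checks, not shown here).
    if email in users:
        return True, "existing user"

    # 2. Cloudflare employees are auto-provisioned.
    if email.endswith("@cloudflare.com"):
        users[email] = "Cloudflare"  # hypothetical provider label
        return True, "auto-provisioned employee"

    # 3. Anyone else must exist in Rockwell to be auto-provisioned.
    if exists_in_rockwell(email):
        users[email] = "PartnerPortal"
        return True, "auto-provisioned partner"

    # Everything else is denied.
    return False, "denied"
```

Note that the Rockwell API is only consulted for unknown, non-Cloudflare emails, so a Rockwell outage does not affect employees or users who are already provisioned at this stage.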

Rockwell authentication flow

The flow above is only a high-level overview. When partners are involved, it gets a bit more complicated:

Whenever we successfully confirm that a user exists in Rockwell, we update the LastAuthenticatedAt timestamp in the DB.
If an existing Rockwell user logs in, we first check the DB to see if they are a Rockwell user (AuthProvider=PartnerPortal) and check their LastAuthenticatedAt.
If LastAuthenticatedAt happened less than 2 weeks ago, we just allow the user to log in without checking again with Rockwell.
If LastAuthenticatedAt happened more than 2 weeks ago, we will query Rockwell and update LastAuthenticatedAt if successful.
If Rockwell doesn't respond for whatever reason (e.g. an outage), we check whether LastAuthenticatedAt is less than 2 months old. If so, we still allow the user to log in for now, but we will try to query Rockwell again on each subsequent attempt.
In all other cases we default-deny.
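
As a rough sketch of the rules above for an existing PartnerPortal user (not the real implementation): check_partner_login, RockwellUnavailable and the query_rockwell callback are hypothetical names, "2 months" is approximated as 60 days, and a timestamp that is exactly 2 weeks old is treated as stale. The notes above do not specify either detail.

```python
from datetime import datetime, timedelta
from typing import Callable, Tuple

RECHECK_AFTER = timedelta(weeks=2)
OUTAGE_GRACE = timedelta(days=60)  # assumption: "2 months" taken as 60 days


class RockwellUnavailable(Exception):
    """Hypothetical error raised when Rockwell does not respond."""


def check_partner_login(
    last_authenticated_at: datetime,
    now: datetime,
    query_rockwell: Callable[[], bool],
) -> Tuple[bool, datetime]:
    """Return (allowed, LastAuthenticatedAt value to store)."""
    age = now - last_authenticated_at

    # Recently verified: allow without contacting Rockwell.
    if age < RECHECK_AFTER:
        return True, last_authenticated_at

    try:
        exists = query_rockwell()
    except RockwellUnavailable:
        # Outage: allow only inside the grace window. The timestamp is left
        # unchanged, so the next login attempt queries Rockwell again.
        return age < OUTAGE_GRACE, last_authenticated_at

    if exists:
        # Verified again: refresh the timestamp.
        return True, now

    # All other cases: default-deny.
    return False, last_authenticated_at
```

When debugging a failed partner login, working through these branches with the user's actual LastAuthenticatedAt value usually shows whether the denial came from a stale timestamp during an outage or from Rockwell reporting that the user does not exist.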

Q: My Rockwell (Partner University) users are not able to authenticate.

A: Find the user record in the CML_RESOURCES database. Users live in a table creatively named Users.

Switch to the Console tab and use the following query:

SELECT * FROM Users WHERE Email = 'user@example.com';

Check that the user exists and verify that AuthProvider=PartnerPortal. This means we apply the Rockwell flow above to this user.
Check when we last confirmed that they exist in Rockwell (LastAuthenticatedAt).
Check that they actually exist in Rockwell.