Replication issue after converting to https

John Sartoris

I’ve been working hard on upgrading our FOG servers to HTTPS mode. As we are currently stuck working from home during the Covid-19 quarantine, we have been focusing on our security settings and trying to gain remote functionality where possible.

I’ve made it through many steps: upgrading from Ubuntu 14 with FOG 1.5.7 to Ubuntu 18 with FOG 1.5.8, adding another offsite storage node for image testing, resolving SSL certificate oversize issues, and dealing with a content filter that blocked iPXE from using OCSP to verify our public cert.

I think my last problem comes down to the replication service. When the “Fog Configuration -> Fog Settings -> Web Server -> Web Host” is set to the IP address the replication works, however this causes the pxe files to generate with an IP and not the FQDN that matches the SSL Cert. If I change the setting to the FQDN to fix the PXE boot menus, the replication log gets stuck, slowly repeating this line over and over:


[04-21-20 2:28:00 pm] Interface not ready, waiting for it to come up: fogserver.xxx.org

The moment I change the setting back to the IP address, the log floods with these “Interface Ready” messages, more lines than my PuTTY scrollback buffer can hold.

[04-21-20 2:09:03 pm] Interface Ready with IP Address: ntp.xxx.org
[04-21-20 2:09:03 pm] Interface Ready with IP Address: 10.2.xxx.yyy
[04-21-20 2:09:03 pm] Interface Ready with IP Address: 127.0.0.1
[04-21-20 2:09:03 pm] Interface Ready with IP Address: 127.0.1.1
[04-21-20 2:09:03 pm] Interface Ready with IP Address: ntp.xxx.org
[04-21-20 2:09:03 pm] Interface Ready with IP Address: 10.2.xxx.yyy
[04-21-20 2:09:03 pm] Interface Ready with IP Address: 127.0.0.1
[04-21-20 2:09:03 pm] Interface Ready with IP Address: 127.0.1.1
[04-21-20 2:09:03 pm] Interface Ready with IP Address: ntp.xxx.org

Where does the replication service look to determine the NIC, and how can I fix this?

John Sartoris

Wow, I just swapped back to the IP address to replicate a new image tweak, and this time I set up logging on my PuTTY session. I received a flood of 26,000 lines, timestamped across 8 seconds, as the replication service registered the change and came online.
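
For anyone facing a similar flood, a saved session log can be condensed into per-message counts. The following is a minimal Python sketch and not part of FOG. It assumes only the `[timestamp] message` layout seen in the log excerpts above; the function name `summarize_log` and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
# [04-21-20 2:09:03 pm] Interface Ready with IP Address: 127.0.0.1
LINE_RE = re.compile(r"^\[([^\]]+)\]\s+(.*)$")


def summarize_log(lines):
    """Count how often each message occurs, ignoring the timestamp."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[match.group(2)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample in the same layout as the excerpts above.
    sample = [
        "[04-21-20 2:09:03 pm] Interface Ready with IP Address: 127.0.0.1",
        "[04-21-20 2:09:03 pm] Interface Ready with IP Address: 127.0.1.1",
        "[04-21-20 2:09:03 pm] Interface Ready with IP Address: 127.0.0.1",
    ]
    for message, count in summarize_log(sample).most_common():
        print(count, message)
```

Feeding it the lines of a saved PuTTY log (for example via `open(path)`) shows at a glance which messages repeat and how often.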

Sebastian Roth

@John-Sartoris I have this on my list to look at when I have a bit more time. Definitely sounds like an issue we should solve in the startup code of the service!

I shall find some more time on the weekend.

John Sartoris

@Sebastian-Roth Any chance you’ve been able to look into these settings? I’m starting to work on creating snapin packages but they don’t seem to work right without being on all servers. So I’ve been manually replicating the files around until this replication interface issue can be resolved.

Sebastian Roth

@John-Sartoris Sorry, I haven’t… Thanks for bringing it up again. I will find the time this weekend I am sure.

As you mention snapins… You should be seeing this issue as well: https://github.com/FOGProject/fogproject/issues/371

Sebastian Roth

@John-Sartoris said in Replication issue after converting to https:

I think my last problem comes down to the replication service. When the “Fog Configuration -> Fog Settings -> Web Server -> Web Host” is set to the IP address the replication works, however this causes the pxe files to generate with an IP and not the FQDN that matches the SSL Cert.

The certificates we generate for FOG include both the IP address and the hostname as alternative names, so PXE boot works great with those. I understand, though, that setups with custom certificates like yours are different: usually you don’t get a certificate with an IP address in it from your CA.
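
To illustrate why the Web Host value matters with a CA-issued certificate, here is a minimal Python sketch of checking whether a host value is covered by a certificate's alternative names. It is not FOG or iPXE code. It operates on the `subjectAltName` tuple in the shape returned by Python's `ssl.SSLSocket.getpeercert()`; the helper names `san_covers` and `_dns_matches` and the `example.org` hostnames are assumptions for illustration.

```python
import ipaddress


def san_covers(san_entries, host):
    """Return True if host (an IP address or DNS name) matches a SAN entry.

    san_entries is a tuple of (type, value) pairs, e.g.
    (("DNS", "*.example.org"), ("IP Address", "10.2.0.5")).
    """
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        ip = None  # not a literal IP, treat it as a DNS name

    for kind, value in san_entries:
        if ip is not None:
            if kind == "IP Address" and ipaddress.ip_address(value) == ip:
                return True
        elif kind == "DNS" and _dns_matches(value.lower(), host.lower()):
            return True
    return False


def _dns_matches(pattern, name):
    """Match a DNS name; a leading '*.' covers exactly one leftmost label."""
    if pattern.startswith("*."):
        head, _, rest = name.partition(".")
        return bool(head) and rest == pattern[2:]
    return pattern == name


if __name__ == "__main__":
    # Hypothetical SAN lists: a public wildcard cert vs. a cert with an IP entry.
    wildcard_cert = (("DNS", "*.example.org"), ("DNS", "example.org"))
    cert_with_ip = (("DNS", "fogserver.example.org"), ("IP Address", "10.2.0.5"))
    print(san_covers(wildcard_cert, "10.2.0.5"))
    print(san_covers(cert_with_ip, "10.2.0.5"))
```

A wildcard certificate without an IP entry covers the FQDN but not the bare IP address, which is why boot files generated with the IP fail verification in that setup.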

While trying to debug this issue I remembered that we fixed an issue in the “interface ready” checks after FOG 1.5.8 was released. Still, it’s good you brought this up, as I have now also found what was causing the enormous flood of messages and pushed a fix.

However, I was still not able to reproduce the issue as described using FOG 1.5.9-RC1. I usually test with the latest version because we can’t afford the time to backport fixes to older versions anyway. So, without having tested 1.5.8 explicitly, I would think the replication issue when using the hostname instead of the IP is already fixed. May I ask you to update to FOG 1.5.9-RC1 to see if I am right? Usually I advise people to have all nodes on the same version, but I would think you’d be fine just updating your master node from 1.5.8 to 1.5.9-RC1 in this case. It would be best if you updated all nodes to the RC, as we need people to test anyway.

John Sartoris

I’ve updated to 1.5.9-RC1, and after a bit of work I think I’ve gotten everything working now.

I have my publicly signed wildcard cert working.
I’ve got iPXE configured to trust the GoDaddy root cert. This was important because our content filter was again blocking what I think were the validation attempts. This time they did not show up as OCSP, but as simple “web-browsing”.
I edited /tftpboot/default.ipxe to use the hostname, and added a parameter to change the screen resolution. Some of our newest machines have 4K monitors that make the menu tiny.
And lastly, the replication services, Image and Snapin, are both working. The final piece of magic seems to be that the replication services take “Fog Configuration -> Fog Settings -> Web Server -> Web Host” and cross-reference it against the StorageNode names. They then use the IP address and interface configuration of the matching StorageNode to determine whether the NIC is “UP”. It seems no DNS resolution is done on the IP address field here, and I had entered names there rather than the actual IP addresses. After setting “Web Host” to the FQDN, adjusting the StorageNode name to match, setting the StorageNode IP to the actual IP address, and finally restarting the replication services, everything started to work. Sorry, that one was wordy.
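
The lookup described above can be sketched roughly as follows. This is only a Python model of the behaviour as I observed it, not FOG's actual code (FOG is written in PHP). The `StorageNode` fields, the function names, the case-insensitive name comparison, and the `example.org` values are all assumptions for illustration.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class StorageNode:
    name: str       # StorageNode name as shown in the web UI
    ip: str         # StorageNode "IP Address" field
    interface: str  # StorageNode "Interface" field


def is_literal_ip(value):
    """True if value is an IP address rather than a DNS name."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def node_for_web_host(web_host, nodes):
    """Find the node whose name matches Web Host (assumed case-insensitive)."""
    for node in nodes:
        if node.name.lower() == web_host.lower():
            return node
    return None


def diagnose(web_host, nodes):
    """Report why the interface check would or would not find a usable NIC."""
    node = node_for_web_host(web_host, nodes)
    if node is None:
        return "no StorageNode name matches Web Host"
    if not is_literal_ip(node.ip):
        return "StorageNode IP field holds a name, not an address"
    return "ok"


if __name__ == "__main__":
    # Hypothetical configuration resembling the working end state.
    nodes = [StorageNode("fogserver.example.org", "10.2.0.5", "eth0")]
    print(diagnose("fogserver.example.org", nodes))
```

Under this model, both conditions have to hold at once: the StorageNode name must equal the Web Host value, and the StorageNode IP field must contain a literal address.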

@Sebastian-Roth Thanks for your help. I’ll post new topics if I find anything 1.5.9-RC1 related.
