Posts made by markbam

markbam

Nothing definitive yet. The ftp logs aren’t showing anything out of the ordinary.
I’m now thinking somehow my network backend between the two sites could be a culprit. I’ve put in a request for more bandwidth and am waiting for that to kick in.

markbam

I’ve been noticing some odd things happening:

“Test1.zip” starts an lftp command to transfer to /opt/fog/snapins on the Storage node as expected. Then, at ~100 MB transferred, the file “Test1.zip” disappears from /opt/fog/snapins on the Storage node, BUT the FTP command is still active and transferring. The vsftpd processes still show CPU and network activity.

It seems that the file is still stuck transferring to memory but, since it ceases to physically exist, lftp can’t perform a clean termination (the chmod command).

I am able to reproduce this myself using just the command I pulled from Fog. This is run from the Fog Server:
lftp -e 'set ftp:list-options -a;set net:max-retries 10;set net:timeout 30; set net:limit-rate 0:128000; mirror -c --parallel=20 -R -i "Test1.zip" --ignore-time -vvv --exclude ".srvprivate" "/opt/fog/snapins" "/opt/fog/snapins"; exit' -u fogproject,'xxxxxx' xxx.xxx.xxx.xxx

With this I can see a progress bar continuing even after the file disappears from the Storage Node at ~100 MB.
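To pin down exactly when the file vanishes, something like the following could be run on the Storage Node while the lftp command is active. This is only a rough sketch: the function names, the sample count, and the polling interval are my own choices, not anything Fog provides.

```python
import os
import time


def sample_file_size(path):
    """Return the file's size in bytes, or None if it does not exist."""
    try:
        return os.path.getsize(path)
    except FileNotFoundError:
        return None


def watch_file(path, samples=5, interval=1.0):
    """Poll `path` and record (elapsed_seconds, size_or_None) pairs."""
    start = time.monotonic()
    history = []
    for _ in range(samples):
        elapsed = round(time.monotonic() - start, 1)
        history.append((elapsed, sample_file_size(path)))
        time.sleep(interval)
    return history


def first_disappearance(history):
    """Return the last size seen before the file vanished, or None."""
    last_size = None
    for _, size in history:
        if size is None and last_size is not None:
            return last_size
        if size is not None:
            last_size = size
    return None
```

Calling watch_file("/opt/fog/snapins/Test1.zip", samples=600) and passing the result to first_disappearance would report the size the file had reached just before it was removed, which can then be lined up against the FTP log timestamps.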

markbam

@Sebastian-Roth

Yup plenty of space. Only 34 GB of 2TB used.

markbam

I think I’m on to something. Restating the problem: the Fog Snapin Replication log shows that the snapin transfers are successful, but the chmod then fails and the snapins are deleted from the Storage Node.

But, even though Fog records the transfer as successful, it looks like the snapins don’t actually finish their copy. The snapins only transfer ~100MB, then something goes wrong. Fog logs the transfer as successful anyway and tries to chmod which fails because the file isn’t completely there.

So I’m guessing I’m either dropping a connection or hitting an FTP timeout somewhere?
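One way to check whether a copy really is incomplete is to compare the size and checksum of the original against the replica. The sketch below assumes both files are reachable from the same machine (for example, the partial file copied back or the node's folder mounted); the function names and the four result labels are made up for illustration.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large snapins do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_copies(source_path, replica_path):
    """Return 'missing', 'truncated', 'different', or 'identical'."""
    if not os.path.exists(replica_path):
        return "missing"
    source_size = os.path.getsize(source_path)
    replica_size = os.path.getsize(replica_path)
    if replica_size < source_size:
        return "truncated"
    if replica_size != source_size:
        return "different"
    if sha256_of(source_path) != sha256_of(replica_path):
        return "different"
    return "identical"
```

A result of "truncated" for the failing snapins and "identical" for the working ones would support the idea that the transfer is being cut short before the chmod step.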

markbam

This may or may not be related:

To update Fog, the installer tells me I need to delete the user account fogproject. When I do so, it changes the user:group of the files in my snapins and images folder from fogproject:fogproject to fogproject:www-data.

So now I know where the www-data is coming from.

markbam

@Sebastian-Roth
I’m not sure I understand. The failing snapins do not exist on the storage node so I’d have nothing to adjust ownership on.

markbam

@Tom-Elliott

This was my initial thinking and why I started over from scratch on both the Server and Storage Node.
The failing snapins are not present on the storage node. For troubleshooting, I’ve even deleted all items in the snapins folder to try and discover a pattern to the failures. It does not seem to be consistent.

markbam

I’ve started with fresh installations of Ubuntu and fresh installs of both FogServer and Storage Node. What I’m seeing is that all snapins now upload as fogproject:fogproject.

However, when it goes to replicate, only about 70% are successful. The rest continue to experience the same error: “chmod: Access failed: 550 SITE CHMOD command failed”

Permissions and user/groups are the same for every item in the snapin folder. 777 fogproject:fogproject
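To compare the server and the node side by side, a small listing script can dump mode, owner, and group for everything in the snapin folder. This is a sketch for Linux only (it relies on the pwd and grp modules); the function names are hypothetical and it has to be run separately on each machine.

```python
import grp
import os
import pwd
import stat


def describe_entry(path):
    """Return name, symbolic mode, octal mode, owner, and group for a path."""
    info = os.stat(path)
    try:
        owner = pwd.getpwuid(info.st_uid).pw_name
    except KeyError:
        owner = str(info.st_uid)  # uid with no passwd entry
    try:
        group = grp.getgrgid(info.st_gid).gr_name
    except KeyError:
        group = str(info.st_gid)  # gid with no group entry
    return {
        "name": os.path.basename(path),
        "mode": stat.filemode(info.st_mode),
        "octal": format(stat.S_IMODE(info.st_mode), "03o"),
        "owner": owner,
        "group": group,
    }


def describe_directory(directory):
    """Describe every entry in a directory, sorted by name."""
    return [
        describe_entry(os.path.join(directory, name))
        for name in sorted(os.listdir(directory))
    ]
```

Running describe_directory("/opt/fog/snapins") on both machines and diffing the output should make any entry that is not 777 fogproject:fogproject stand out.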

markbam

That log was from the Snapin Replicator log from the Fog Log viewer.
I’m not exactly sure on which machine the chmod command is being run. Is it FogServer sending the command over the network or the StorageNode issuing the command locally?

Server side shows all snapins as rwxrwxrwx.
Node side shows some as rwxr-xr-x but the rest are rwxrwxrwx.

Correct, the chmod fails after the transfer and the file is removed from the storage node.
The only way I’ve been able to get it to work is to change the ownership group from fogproject to www-data on the server.

So my particular issue is figuring out why Fog’s FTP once uploaded snapins as fogproject:www-data but now uploads as fogproject:fogproject. Or figuring out why the chmod wants the permissions associated with www-data instead of fogproject.

markbam

The snapin log errors:

[11-04-19 7:52:01 am] | Started sync for Snapin ExampleZippedSnapin - Resource id #859274
chmod: Access failed: 550 SITE CHMOD command failed. (./ExampleZippedSnapin.zip)
[11-04-19 7:59:59 am] | Sync finished - Resource id #859274

It then deletes the file from the node and starts trying to sync again.

As I look again, I do see that the uploads on the server do have the correct permissions of rwxrwxrwx. But when they are replicated to the node they show rwxr-xr-x.

markbam

@Sebastian-Roth

Yes, the r/w permissions look the same (777) for all new and already uploaded snapins. Only difference is the group.

In a working setup, what should the user:group be? fogproject:www-data or fogproject:fogproject?

If I understand correctly, the uploaded snapin’s user:group is determined by the account the server’s FTP is run under. I’ll take a closer look at the group permissions that fogproject runs under and if anything looks off on my set up.

markbam

Recently some of my images and snapins have been failing to replicate due to group ownership errors. I’m certain my install got messed up from migrating it across various servers/VMs.

When a snapin is uploaded via the webgui, it’s assigned “fogproject:fogproject”. This will fail to replicate.
My solution: run “chown fogproject:www-data uploadedfile.exe” and it will successfully replicate.

Is there somewhere in the configs that I can change the group back to www-data for new uploads?

Thanks

markbam

Unfortunately neither I nor any of my coworkers know PHP. I mainly write in Perl and, while it’s a bit similar to PHP, my code is by no means production quality, but I can try to help however I can.

I’ve looked at the FOGPingHosts.service and quickly thought of trying to create persistence based on time by adding a field to the SQL database that records the last successful ping time.

Touching the database seems a bit overreaching but maybe a field like this already exists. I haven’t had the chance to dump it yet.

Pseudo Code: FogPingHosts.service

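foreach(Host)
{
    // Compare now against the last successful ping stored for this host
    currentTime = CurrentTime;
    timestampFromSQL = read_sql(timestampSQLField);
    timeSinceLastSuccessfulPing = currentTime - timestampFromSQL;

    // Only re-ping when the stored result is older than 180 seconds
    if(timeSinceLastSuccessfulPing > 180 Seconds)
    {
        if(PingHost == successful)
        {
            // A failed ping never overwrites the stored timestamp
            write_sql(timestampSQLField, currentTime);
        }
    }
}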
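
For anyone who wants to play with the idea, here is the same logic as a small runnable Python sketch. It is not how FOGPingHosts is implemented: the database is replaced by a plain dict, the ping is an injected function, and all names are my own.

```python
import time

RECHECK_AFTER_SECONDS = 180  # threshold taken from the pseudocode


def refresh_last_seen(hosts, last_seen, ping, now=None):
    """Update last_seen[host] for hosts whose last successful ping is stale.

    hosts:     iterable of host names
    last_seen: dict mapping host -> timestamp of the last successful ping
    ping:      callable(host) -> bool (stand-in for a real ping)
    """
    now = time.time() if now is None else now
    for host in hosts:
        age = now - last_seen.get(host, 0)
        # A failed ping never overwrites an earlier success.
        if age > RECHECK_AFTER_SECONDS and ping(host):
            last_seen[host] = now
    return last_seen


def is_up(host, last_seen, now, max_age=RECHECK_AFTER_SECONDS):
    """Treat a host as up if any server reached it recently enough."""
    return now - last_seen.get(host, 0) <= max_age
```

The point of the design is that a server which cannot reach a host simply writes nothing, so it cannot undo a success recorded by a server that can.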

markbam

I finally had a chance to go through this.

From what I can gather, it does appear to be conflicting. One server will ping and set the flag to 0, then another will ping and set the flag to 6, and then back and forth.

mysql_hosts_update.log

markbam

The FOGPingHosts.service shows as active on all the servers.
With Wireshark running on each subnet, I see that FOGPingHosts is trying to ping all of the hosts.
On each subnet, I can see the pings return successfully when it hits a host that is alive.

This is where the issue manifests: the FOG database only reflects the state of the hosts on a single subnet, usually the subnet of the server that is turned on first (but I can’t reproduce this 100% of the time).

So if a storage node is powered on first, usually (but not 100% of the time) its subnet’s hosts will show as active.
If the FOG Server is powered on first, usually (but not 100% of the time) its subnet’s hosts will show as active.

markbam

Yup, I’m familiar with nmap.

I’m definitely seeing a lot of inconsistencies in the results I’m getting. Due to some unknown circumstance, FOG will log the pings from one server and ignore the returns from the others. I haven’t been able to reproduce exactly what triggers it to switch which server it decides to listen to, but it seems related to startup order.

This effort started as merely “nice to have”. I think I’d have to re-evaluate my network topology into something a bit less complicated in order to get any definitive answers though. I’ll probably revisit this sometime down the road.

Thanks for your time!

markbam

I’ve been using the standard ICMP ping command to test if the hosts are even visible to the servers that I’m using.

markbam

I’ve installed Wireshark and I’m seeing FOGPingHosts fail at pinging the hosts on the FOGSERVER subnet. What’s odd is that I can successfully ping the hosts manually from the FOGSERVER.

For a sanity check, with Wireshark, I can see successful pings on the FOGSTORAGE subnet with FOGPingHosts.

markbam

Yup, I understand that it’s just a snapshot in time as it’s a service polled at a specific interval. But I wonder now if the results of the services are conflicting?

How I’m imagining the pseudo logic:

10:30am FOGPingHosts(FOGSERVER) Active
10:30am FOGSERVER(10.0.0.1) pings WinMachine001(10.0.0.5). FOG database changes host WinMachine001 to Green.
10:30am FOGPingHosts(FOGSERVER) Sleeps

10:32am FOGPingHosts(FOGSTORAGE) Active
10:32am FOGSTORAGE(10.20.0.1) pings WinMachine001(10.0.0.5). FOG database changes host WinMachine001 to Red.
10:32am FOGPingHosts(FOGSTORAGE) Sleeps

While the services are now asleep, this is the time when I’m viewing the host list from the GUI and only seeing the results of the subnet that was last pinged.
The cycle repeats…

10:40am FOGPingHosts(FOGSERVER) Active
10:40am FOGSERVER(10.0.0.1) pings WinMachine001(10.0.0.5). FOG database changes host WinMachine001 to Green.
10:40am FOGPingHosts(FOGSERVER) Sleeps

10:42am FOGPingHosts(FOGSTORAGE) Active
10:42am FOGSTORAGE(10.20.0.1) pings WinMachine001(10.0.0.5). FOG database changes host WinMachine001 to Red.
10:42am FOGPingHosts(FOGSTORAGE) Sleeps
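
The timeline above boils down to a last-writer-wins conflict, which can be shown with a toy simulation. This is only a model of my guess, not FOG's actual code: the reachability table assumes FOGSTORAGE cannot reach WinMachine001, and every name other than the hostnames from the timeline is invented.

```python
# Which hosts each server can reach (assumed for this illustration only).
REACHABLE = {
    "FOGSERVER": {"WinMachine001"},
    "FOGSTORAGE": set(),
}


def make_pinger(name):
    """Return (server_name, ping_function) for one ping service."""
    return (name, lambda host: host in REACHABLE[name])


def run_cycle(status, pingers, host):
    """Let each pinger write its result in order; the last write wins."""
    trace = []
    for name, can_reach in pingers:
        status[host] = "green" if can_reach(host) else "red"
        trace.append((name, status[host]))
    return trace
```

With pingers = [make_pinger("FOGSERVER"), make_pinger("FOGSTORAGE")], each call to run_cycle leaves WinMachine001 red even though FOGSERVER reached it, and reversing the order of the pingers leaves it green.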

markbam

An update:

I brought down and up my entire FOG cluster and now the Storage Node’s FOGPingHosts service is running successfully. However, a different set of hosts show as green.

I restarted a few times and varied the order in which the servers start up. This seems to play a role in which hosts show as green.

It seems like the hosts on the subnet that is first powered on will report to FOG. The rest will not.
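
To check that pattern against an exported host list, the hosts can be grouped by the subnet they sit on and compared with whichever set shows green. A rough sketch follows; the /24 prefix lengths are an assumption (I haven’t stated my masks here), and the function and label names are hypothetical.

```python
import ipaddress


def group_by_subnet(hosts, subnets):
    """Group host names by the first subnet that contains their address.

    hosts:   dict mapping host name -> IP address string
    subnets: dict mapping label -> CIDR string
    """
    networks = {label: ipaddress.ip_network(cidr) for label, cidr in subnets.items()}
    grouped = {label: [] for label in networks}
    grouped["unknown"] = []  # hosts that match none of the given subnets
    for name, address in hosts.items():
        ip = ipaddress.ip_address(address)
        for label, network in networks.items():
            if ip in network:
                grouped[label].append(name)
                break
        else:
            grouped["unknown"].append(name)
    return grouped
```

For example, group_by_subnet({"WinMachine001": "10.0.0.5"}, {"FOGSERVER": "10.0.0.0/24", "FOGSTORAGE": "10.20.0.0/24"}) puts WinMachine001 under FOGSERVER, so if only that group is green after a restart, the server that started first is the likely one being recorded.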