Hello guys!
We have noticed an issue that we are trying to investigate, but I need a bit of information and help because I am stuck at the moment. We found that some imaging (upload and download both) has become insanely slow. The deploy part seems (for now!) to be target-hardware dependent, which is good, as that part is not on our side. The problem became more interesting when even the "master imaging" turned out to be affected. The master image is created on a regular basis on a virtual machine (the Windows default one, i.e. Hyper-V, not VirtualBox or the like). Uploads are painfully slow.
We have discovered that the preparation stages are slow too, not just the actual imaging (so not only the cloning process, but the scripted steps as well).
At one point we had an issue with filesystem handling and partitioning. We had something similar once before and that turned out to be a kernel issue; this time it is not, or it is a different one, since I have tried a few kernel versions.
Officially the server does practically nothing; its sole job is FOG. Since it shows strange behaviour even with no tasks running, I started another investigation and found that it is under a HIGH load, and that load is practically permanent, even though very little is being done on the server.
The load sits at about 7.0 almost all the time, with very little oscillation. Can we somehow locate the reason FOG-wise? (The other possibility is of course a hardware failure, but at the moment I don't really think so; and since the virus has cut my physical access to the machines, I am trying to rule out software issues first.)
Can I get some suggestions? My problem is that I am not a performance-tuner type and don't have much experience with this. All I can see is that the server responds with zero delay on the web and SSH side, but the imaging process now has delays at points where it previously had none. I don't know whether the high load is the cause or just a symptom. I can collect any data needed, as I have remote access, but at the moment I am stuck.
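As a first triage step, here is a minimal sketch of what I can run (assuming standard Debian tools; iostat needs the sysstat package) to see whether the load is CPU, I/O wait, or stuck processes:

```bash
# Load averages and uptime
uptime

# Is the load CPU usage or I/O wait? Watch the %iowait column
# (needs the sysstat package)
iostat -x 1 5

# Top CPU consumers
ps aux --sort=-%cpu | head -n 15

# Processes stuck in uninterruptible sleep (D state); many of these
# would explain a high load average with low CPU usage
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'
```

If lots of processes sit in the D state, that would point at disk or NFS waits rather than CPU, which would match the imaging stalls.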
We have 3 issues:
- a permanently high load with no apparent reason, with zero tasks on the server
- during imaging, between the actual data transfers, there are steps where things get "stuck" for tens of minutes (partitioning, mounting filesystems, post-script entry points)
- and I see network traffic on a permanent basis that, if I recall correctly, was not there before (see the sketch after this list for how I plan to pin it down)
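For the permanent traffic, this is what I intend to run; a sketch only, where eth0 is just a placeholder for the actual interface, and nethogs and iftop are extra Debian packages:

```bash
# Who is this box talking to right now? (ss ships with iproute2)
ss -tupn

# Bandwidth per process (needs the nethogs package)
nethogs eth0

# Bandwidth per connection (needs the iftop package)
iftop -i eth0
```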
I made a few screenshots, in case they help.
Here it gets stuck for a long time.
Here, the partition deployment is fast, but then the start of the next partition is delayed a LOT (Win10: 3 small partitions and 1 big data partition, in the order it creates them). Normally the 3 little ones take only the blink of an eye; now we can even drink a coffee between them. The data throughput itself is fast.
Here it does something, again strangely slowly.
Here is the truly slow one. (This is the step I mentioned that was previously a kernel issue with other FOG versions, which I could solve with newer kernels.)
We currently run FOG 1.5.7 on Debian (stretch, 9.12). The machine has 8 GB of RAM, a 500 GB disk for active data, and a 1 TB disk for "pre-backup" (mounted only when needed, otherwise inactive).
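Since FOG serves images over NFS, the next time a deploy stalls I plan to watch the disks and the NFS server side with something like this (a sketch, assuming sysstat and nfs-common are installed):

```bash
# Disk utilization and wait times, refreshed every 2 seconds
iostat -xd 2

# NFS server call counters (run twice during a stall and compare)
nfsstat -s

# Recent kernel messages with human-readable timestamps
dmesg -T | tail -n 50
```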
(UPDATE: I see things in dmesg that I have not seen before:
[456483.853291] rpc-srv/tcp: nfsd: sent only 18600 when sending 32900 bytes - shutting down socket
And many more of these… I found a bug that caused this years ago, but I don't think that is it (I have version: ii nfs-kernel-server 1:1.3.4-2.1).
Any suggestions as to what this poor machine is doing in its free time that needs to be killed? At the moment SMART shows no disk errors, as far as I can tell.)
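For completeness, this is roughly how I checked SMART (smartmontools package; /dev/sda is just a placeholder for the real device):

```bash
# Overall health verdict
smartctl -H /dev/sda

# Full attribute table; reallocated/pending sector counts are the
# interesting ones for a failing disk
smartctl -A /dev/sda
```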