image file integrity?
-
@johnomaz There’s a thing called data-rot that happens to old data, or data on old drives… https://en.wikipedia.org/wiki/Data_degradation
Just an idea…
It could also be your network dropping packets… (far more likely). It could be a loose patch cable, a kinked patch cable, or interference on a copper line that was run a little too close to a high-voltage inverter for fluorescent lighting. If your copper lines run close to HVAC or largish electric motors you’ll have a lot of interference too… Don’t have a microwave by the server, do you?
-
@johnomaz If you’re interested, it’s most likely very easy to do a basic MD5 checksum type thing in the evenings and pile the results up in a file in the web root for viewing in a web browser… it would be a crontab event.
-
@Developers I couldn’t resist. I’ve been working A LOT with crontab and bash scripting lately! It’s super fun!
It took my virtualized dual core VM running on SATA1 drives on an old computer a grand total of…
8 minutes
to produce an MD5 checksum for all files in my /images directory, which is a total of…
18GB.
More powerful systems can expect much greater performance.
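As a rough sanity check on those numbers, the implied throughput can be worked out with shell arithmetic. This is only a sketch: the function name is made up for illustration, it assumes 1 GB = 1024 MB, and integer division rounds down.

```shell
# Hypothetical helper: average throughput in MB/s when hashing
# SIZE_GB gigabytes takes MINUTES minutes (integer math, rounds down).
throughput_mb_per_s() {
    size_gb=$1
    minutes=$2
    echo $(( size_gb * 1024 / (minutes * 60) ))
}

# The run described above: 18 GB in 8 minutes.
throughput_mb_per_s 18 8
```

That works out to roughly 38 MB/s (18432 MB over 480 seconds) on the old VM.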
Here’s a command that will generate a date-stamped file containing an MD5 checksum for every file in /images. It puts the file in whatever your current directory (pwd) is.
now=$(date +\%m\%d\%Y);filename=checklist_$now.chk;find /images -type f -exec md5sum "{}" + > $filename
Results should look something like this.
cat checklist_12_08_2015.chk
e8ca919a5cf891c1444bef848ba0826a  /images/postdownloadscripts/fog.postdownload
d41d8cd98f00b204e9800998ecf8427e  /images/dev/.mntcheck
b026324c6904b2a9cb4b88d6d61c81d1  /images/CentOS7Optiplex745UpdateBASE/d1.fixed_size_partitions
989163439cdf3d881c5c47ec26a3549b  /images/CentOS7Optiplex745UpdateBASE/d1.partitions
e908d53a4e858480aade0428022c2b79  /images/CentOS7Optiplex745UpdateBASE/d1.original.fstypes
d41d8cd98f00b204e9800998ecf8427e  /images/CentOS7Optiplex745UpdateBASE/d1.original.swapuuids
d41d8cd98f00b204e9800998ecf8427e  /images/CentOS7Optiplex745UpdateBASE/d1.has_grub
5b3cc4be3250658e7832435bf51cfd19  /images/CentOS7Optiplex745UpdateBASE/d1.mbr
0f52ff40ada8c1a3f585c045c046f90e  /images/CentOS7Optiplex745UpdateBASE/d1.minimum.partitions
9b7ab9244c3fbb3ed591c2b139e08289  /images/CentOS7Optiplex745UpdateBASE/d1p1.img
653b56dddbbe995df1a02fa4ce3101a6  /images/CentOS7Optiplex745UpdateBASE/d1p2.img
26ab0db90d72e28ad0ba1e22ee510510  /images/Fedora22LivingRoom/d1.fixed_size_partitions
e547d023f0de8960668c29a92c97ea64  /images/Fedora22LivingRoom/d1.partitions
b45db809f4e6ab60ae53acb4bf4814db  /images/Fedora22LivingRoom/d1.original.fstypes
ece7afe263274485bdcd95a563a8339f  /images/Fedora22LivingRoom/d1.original.swapuuids
d41d8cd98f00b204e9800998ecf8427e  /images/Fedora22LivingRoom/d1.has_grub
df83ab31a77bc51f3dceb345375c81ed  /images/Fedora22LivingRoom/d1.mbr
db56d2974bf892b86f351fc4f92b0537  /images/Fedora22LivingRoom/d1.minimum.partitions
a296046f409bc5d4dfabd051dafa507d  /images/Fedora22LivingRoom/d1p1.img
b026324c6904b2a9cb4b88d6d61c81d1  /images/Win7/d1.fixed_size_partitions
2cfd36a0e381b432a55b929f796ed3d4  /images/Win7/d1.original.fstypes
d41d8cd98f00b204e9800998ecf8427e  /images/Win7/d1.original.swapuuids
2523e3aca9e856933d1fb1e2b0d2cd6e  /images/Win7/d1.original.partitions
6240be7065ce983bdcb63f8e9c1a5096  /images/Win7/d1.minimum.partitions
3a673e2c1b7d279c1381e4a6b4303175  /images/Win7/d1.mbr
5dddb3b8c46be183b0a209e32ee912ef  /images/Win7/d1p1.img
71c96624ae151ebeaa7fef342b1436a6  /images/Win7/d1p2.img
d41d8cd98f00b204e9800998ecf8427e  /images/.mntcheck
606b8f0aed33cd36c9fb077a904c5865  /images/checklist.chk
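A checklist in this format can also be fed back into md5sum later to re-check the same files, which is one way to actually catch a file that changed on disk. The following is a minimal sketch, assuming GNU coreutils md5sum and a checklist exactly as md5sum wrote it (hash, two spaces, path). The function name verify_checklist is made up for illustration.

```shell
# Sketch: re-check files against a previously generated checklist.
# verify_checklist is a hypothetical name; -c (check) and --quiet are
# GNU coreutils md5sum options.
verify_checklist() {
    checklist=$1
    # --quiet suppresses the "OK" lines, so only problem files are printed.
    # md5sum exits non-zero if any listed file fails verification.
    md5sum -c --quiet "$checklist"
}
```

It helps to keep the checklist itself outside the directory being hashed, otherwise the checklist ends up listed in its own results and will never verify cleanly.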
You can easily prepend a path such as /var/www/html/md5 to the file name so the command writes its results there, and then view them later in a web browser.
The command can easily be made into a crontab event for root by simply finding out where md5sum is located on the machine and pathing to it directly. For Fedora 23 Server Minimal it’s here:
/usr/bin/md5sum
The command in a crontab event should therefore look like…
#!/bin/bash
if [ ! -d /var/www/html/checksum ]; then
    # Make a web directory called checksum if it's not already there...
    mkdir /var/www/html/checksum
    chown apache:apache /var/www/html/checksum
    chmod 744 /var/www/html/checksum
fi
now=$(date +\%m\%d\%Y)
filename=/var/www/html/checksum/checklist_$now.chk
find /images -type f -exec /usr/bin/md5sum "{}" + > "$filename"
chown apache:apache "$filename"
chmod 744 "$filename"
That’s a script I stuck in my root user’s home folder here:
/root/checksum.sh
You have to make it executable, obviously (after it’s created), with
chmod +x /root/checksum.sh
The crontab entry for root to do this every day at 10pm (overkill) would be…
0 22 * * * /root/checksum.sh
and for the 1st of every month (recommended) at 10pm, it would be
0 22 1 * * /root/checksum.sh
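Entries like these can also be added from a script instead of by hand. Here is a minimal sketch, assuming the existing crontab text arrives on stdin. The helper name add_cron_entry is hypothetical, and it only appends the line if that exact line is not already present, so running it twice does no harm.

```shell
# Sketch: add a cron line to crontab text read from stdin, but only if
# that exact line is not already there. add_cron_entry is a hypothetical
# helper name.
add_cron_entry() {
    entry=$1
    current=$(cat)
    if [ -n "$current" ]; then
        printf '%s\n' "$current"
    fi
    # grep -F: treat the entry as a fixed string (the * is not a regex);
    # -x: match whole lines only; -q: print nothing, just set exit status.
    if ! printf '%s\n' "$current" | grep -Fxq -- "$entry"; then
        printf '%s\n' "$entry"
    fi
}
```

As root, it could be used like this (hour 22 is 10pm): crontab -l 2>/dev/null | add_cron_entry '0 22 1 * * /root/checksum.sh' | crontab -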
(In Fedora / CentOS / RHEL, to create a crontab event for root, first switch to root with
su root
and then execute
crontab -e
and add the entry.)
Sample output in a browser after the task ran for the first time:
Tagging this for the #wiki
Thread solved.
-
@johnomaz If an old image on a different site worked fine, or an image copied to another site did, wouldn’t that raise more suspicion that the disk storing the image has a problem? Failing randomly would also point towards an escalating problem (one that isn’t fully prevalent right now) on the device the images are currently stored on. If it failed every time in the same spot, I would say the image is corrupt. But because multiple systems are pulling images from this particular device and the failures happen at different points, it still looks to me like there is a problem with the disk the images are currently stored on.
-
Another thing to test would be the memory on the FOG server.
From experience, undetected memory errors can lead to the symptoms you describe, i.e. random failures of CPU/memory-intensive operations.
If this were the case, I would also not be surprised if you were getting strange errors reported by the OS, e.g. SIGSEGV.
-
@Tom-Elliott I’m currently copying the images off the 1TB to a brand new 3TB drive as a backup. I’m going to look at the MD5 checksum info Wayne posted below after the copy goes through.
-
Bumping this thread, I feel it has real utility for comparing files across storage group members. When I write the wiki article on it, I will gear it towards that.
-
Just posting what I’ve worked on this weekend. This is not finished, the backend-script is not CRON ready yet and there is no installer yet either.
FOGFileChecksum.php
<?php
$servername = "localhost";
$username = "wayne";
$password = "";
$database = "fog";

// Create connection
$link = new mysqli($servername, $username, $password, $database);

// Check connection
if ($link->connect_error) {
    // Couldn't establish a connection with the database.
    die("Could not connect to the database: " . $link->connect_error);
}

$fileLocation = array("");
$fileSum = array("");

// Collect every distinct (checksum, location) pair, ordered by location.
// A location that appears more than once has more than one checksum on record.
$sql = "select DISTINCT(fileSum),fileLocation from fileChecksums order by fileLocation";
$result1 = $link->query($sql);
while ($row1 = $result1->fetch_assoc()) {
    $aSum = $row1['fileSum'];
    $aLocation = $row1['fileLocation'];
    array_push($fileSum, $aSum);
    array_push($fileLocation, $aLocation);
}
$arrlength = count($fileLocation);

echo "The below table only shows detected changes in a file.<br><p>";
echo "New files and files that have not been changed since creation are not listed. There is no consideration for storage groups or when images were uploaded.<br>";
echo "If a change in a file is detected, a set of relevant records from all storage nodes concerning the file are displayed.<br><p>";
echo "You should see changes in files when an image is updated, when snapin files are updated, or when replication does not occur for a updated image or snapin, or when the storage node's hardware (hdd mostly) is failing.<br><p>";
echo "<table border=\"1\" style=\"width:100%\">";
echo "<tr>";
echo "<td>Hash Checksum</td>";
echo "<td>When this record was recorded</td>";
echo "<td>Host</td>";
echo "<td>File</td>";
echo "</tr>";

for ($x = 1; $x < $arrlength; $x++) {
    // Adjacent entries with the same location mean the file's checksum changed.
    if ($fileLocation[$x-1] == $fileLocation[$x]) {
        $sql = "select distinct(fileSum),fileLocation,fileHost from fileChecksums where fileLocation = '$fileLocation[$x]'";
        $result1 = $link->query($sql);
        while ($row1 = $result1->fetch_assoc()) {
            $aSum = $row1['fileSum'];
            $aLocation = $row1['fileLocation'];
            $aHost = $row1['fileHost'];
            // Show the earliest record for each (location, host, checksum).
            $sql = "select * from fileChecksums where fileLocation = '$aLocation' and fileHost = '$aHost' and fileSum = '$aSum' order by fileTime ASC LIMIT 1";
            $result2 = $link->query($sql);
            while ($row2 = $result2->fetch_assoc()) {
                $aSum = $row2['fileSum'];
                $aLocation = $row2['fileLocation'];
                $aHost = $row2['fileHost'];
                $aTime = $row2['fileTime'];
                $aTime = gmdate("l jS \of F Y h:i:s A", $aTime);
                echo "<tr>";
                echo "<td>$aSum</td>";
                echo "<td>$aTime</td>";
                echo "<td>$aHost</td>";
                echo "<td>$aLocation</td>";
                echo "</tr>";
            }
        }
    }
}
echo "</table>";
$link->close();
?>
FOGFileChecksum.sh
#-----Variables-----#
files=/root/files.txt
fogsettings=/opt/fog/.fogsettings
ipaddress="$(grep 'ipaddress=' $fogsettings | cut -d \' -f2 )"
snmysqluser="$(grep 'snmysqluser=' $fogsettings | cut -d \' -f2 )"
snmysqlpass="$(grep 'snmysqlpass=' $fogsettings | cut -d \' -f2 )"
snmysqlhost="$(grep 'snmysqlhost=' $fogsettings | cut -d \' -f2 )"

#-----Connect to mysql and query all nodes that have the IP-----#
if [[ $snmysqlhost != "" ]]; then
    imagePaths=$(mysql -s -h$snmysqlhost -u$snmysqluser -p$snmysqlpass -D fog -e "SELECT ngmRootPath FROM nfsGroupMembers WHERE ngmHostname = '$ipaddress' ORDER BY ngmID")
    snapinPaths=$(mysql -s -h$snmysqlhost -u$snmysqluser -p$snmysqlpass -D fog -e "SELECT ngmSnapinPath FROM nfsGroupMembers WHERE ngmHostname = '$ipaddress' ORDER BY ngmID")
elif [[ $snmysqlpass != "" ]]; then
    imagePaths=$(mysql -s -u$snmysqluser -p$snmysqlpass -D fog -e "SELECT ngmRootPath FROM nfsGroupMembers WHERE ngmHostname = '$ipaddress' ORDER BY ngmID")
    snapinPaths=$(mysql -s -u$snmysqluser -p$snmysqlpass -D fog -e "SELECT ngmSnapinPath FROM nfsGroupMembers WHERE ngmHostname = '$ipaddress' ORDER BY ngmID")
else
    imagePaths=$(mysql -s -D fog -e "SELECT ngmRootPath FROM nfsGroupMembers WHERE ngmHostname = '$ipaddress' ORDER BY ngmID")
    snapinPaths=$(mysql -s -D fog -e "SELECT ngmSnapinPath FROM nfsGroupMembers WHERE ngmHostname = '$ipaddress' ORDER BY ngmID")
fi

#-----Find all files on all local storage nodes-----#
if [[ -e $files ]]; then
    rm -f $files
fi
for i in ${imagePaths[@]}; do
    find ${i} -type f >> $files
done
for i in ${snapinPaths[@]}; do
    find ${i} -type f >> $files
done
IFS=$'\n' read -d '' -r -a allFiles < $files

#-----Checksum all files, insert into database-----#
for i in ${allFiles[@]}; do
    # Note: despite the variable name, the checksum here is SHA-1 (sha1sum).
    md5sum_space_file_space_time="$(sha1sum ${i}) $(date +%s)"
    N=1
    fileSum=$(echo $md5sum_space_file_space_time | awk -v N=$N '{print $N}')
    N=2
    fileLocation=$(echo $md5sum_space_file_space_time | awk -v N=$N '{print $N}')
    N=3
    fileTime=$(echo $md5sum_space_file_space_time | awk -v N=$N '{print $N}')
    if [[ $snmysqlhost != "" ]]; then
        mysql -s -h$snmysqlhost -u$snmysqluser -p$snmysqlpass -D fog -e "INSERT INTO fileChecksums (fileHost,fileTime,fileSum,fileLocation) VALUES ('$ipaddress','$fileTime','$fileSum','$fileLocation')"
    else
        mysql -s -D fog -e "INSERT INTO fileChecksums (fileHost,fileTime,fileSum,fileLocation) VALUES ('$ipaddress','$fileTime','$fileSum','$fileLocation')"
    fi
done
FOGFileChecksum.sql
USE fog;
CREATE TABLE fileChecksums(
    fileChecksumsID int NOT NULL AUTO_INCREMENT,
    fileHost VARCHAR(255) NOT NULL,
    fileTime int NOT NULL,
    fileSum VARCHAR(40) NOT NULL,
    fileLocation VARCHAR(255) NOT NULL,
    PRIMARY KEY (fileChecksumsID)
);
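One thing to watch in the checksum loop of FOGFileChecksum.sh: it splits the sha1sum output into whitespace-separated fields, so a path containing a space would be cut short. A possible alternative, shown here only as a sketch, is to hash the file through stdin and keep the path in its own variable. The helper name checksum_record and the tab-separated output are assumptions for illustration, not part of the script above.

```shell
# Sketch: build one record (host, epoch time, SHA-1, path) for a file.
# checksum_record is a hypothetical helper, not part of FOGFileChecksum.sh.
checksum_record() {
    host=$1
    file=$2
    # Hash via stdin so sha1sum prints "<hash>  -" and the path never has
    # to be parsed back out of its output.
    sum=$(sha1sum < "$file" | cut -d ' ' -f 1)
    # Tab-separated so paths with spaces stay in a single field.
    printf '%s\t%s\t%s\t%s\n' "$host" "$(date +%s)" "$sum" "$file"
}
```

The four fields line up with the fileHost, fileTime, fileSum and fileLocation columns of the table. The values would still need proper escaping before going into an INSERT statement.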
-
I’ve updated one image at home since I started tracking the checksums, and I have interesting results…
For those that are interested, here is the complete table output:
0_1456542756352_fileChecksums.txt
-
Looking at the results, I’ve concluded that replication did not happen for the /images/Win7/d1.mbr file.
And I either have a corrupt HDD, or replication transferred the /images/Win7/d1p1.img file incorrectly.
The /images/Win7/d1.minimum.partitions file is also of concern.
I’m going to delete everything in the /images directory of the slave node, let everything re-replicate, and see what happens…
-
I’ve made this into a GPLv3 project on github. You can follow along here:
https://github.com/wayneworkman/FOGFileChecksum
-
I’ll be working to convert this project from shell script to pure PHP. I’ll be developing in PHP 5.5, but I’ll also ensure compatibility with PHP 7.0. I’ll also make a much nicer front-end. I get a little better at web GUIs each day, I think.
-
Bumping this, converting the integrity checking stuff to PHP and going to try to work out a plugin for it.
-
Got the PHP version functioning. Just need to polish it up and work out how scheduling is going to work.
-
Update on this project: @Tom-Elliott has taken the PHP backend and integrated it partially into FOG Trunk. It’s available as a plugin, but not fully functioning just yet. We still need to create scheduling for it. Tom has already written a way to display everything in the checksum table, and a way to export the entries in case people want to analyze the results with a 3rd-party app.
I’ll be writing some intelligent code that will analyze the table’s contents to display concerning entries. The analysis will follow some basic principles.
- Makes decisions based on data in the DB.
- A storage group’s files should always match across all nodes in the group, both images and snapins.
- Images shared between storage groups should always match between those groups’ masters.
- If no image upload occurred between the last and current check, images are expected to match across that time period.
- If an image upload does occur, the files are expected to change.
Results of the intelligent analysis should display concerns, following the rules above, and the user should be able to “dismiss” individual file concerns so they don’t show anymore.
The integrity table will have a column that operates similarly to the pending hosts column in the hosts table. Blank or zero should be unchecked or false (bad or unprocessed); 1 should be good or dismissed.
This column in the table will be administered by the intelligent checking, and by the user’s “dismiss” clicks. Once an entry for a file is marked as “good” by either no problems being detected or being dismissed by the user, that entry is forever good. If it’s blank, when analyzed it will be marked 0 or 1 respectively.
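To illustrate the "files should match between checks" rule, here is a rough shell sketch rather than the planned PHP analysis. It assumes the checksum records have been exported as tab-separated lines of host, epoch time, checksum and path, which is an assumed format and not FOG's own export. The function name find_changes is hypothetical.

```shell
# Sketch: report files whose checksum changed between consecutive checks
# on the same host. Input: tab-separated host, epoch time, checksum, path
# (an assumed export format). find_changes is a hypothetical name.
find_changes() {
    # Sort by host, then path, then numeric time, so consecutive lines
    # for the same file on the same host are in chronological order.
    LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k4,4 -k2,2n "$1" |
    awk -F '\t' '
        {
            key = $1 FS $4
            if (key == prev_key && $3 != prev_sum)
                print $1 "\t" $4 "\t" prev_sum " -> " $3
            prev_key = key
            prev_sum = $3
        }
    '
}
```

Every line it prints is only a candidate concern. Whether a change is a problem still depends on whether an image upload happened in between, which this sketch does not know about.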
I’ll be working on this as I have time.
-