Odd intermittent issues with imaging El Capitan
We have been using CloneDeploy (Crucible) for a while now, and it has always worked flawlessly with Yosemite images. We wrote a script that resizes the drive: once a machine is imaged, we log in with a basic account and launch the script. This has worked 100% of the time with Yosemite. This summer we are imaging all of our 2010 white MacBooks with El Capitan, and we are seeing odd behavior on some imaged machines. On roughly every 15th or 20th MacBook we image, the script fails when resizing the drive and reports a file system consistency error. When we boot those clients into single-user mode and run fsck_hfs on the drive, we get "invalid b-tree node size (3 0)" errors, and Disk Utility also fails when trying to repair the volume. Once a client has this issue, it seems almost impossible to re-image that machine without getting the same result and error message. As I said, none of our Yosemite images have this issue. No matter how many new El Capitan images we create from scratch, we run into the same imaging problem. Is there something about El Capitan and the way CloneDeploy deploys the image to the drive that causes these "invalid b-tree node size (3 0)" errors? Any help would be great, as we are now setting aside almost every 15th MacBook because it will not image. One more note: if we image an affected machine with Yosemite, it is fine and has no file system errors. But if we then image that SAME computer with El Capitan again, it still gives the same "invalid b-tree node size (3 0)" error.
Another question: is there any way to manually format the drive from the command line while booted into the PXE environment, before imaging it? I noticed that when the fsck runs on /dev/sda2 during the imaging process, it gives the same "invalid b-tree node size (3 0)" error. Since I get the same error during imaging, I was thinking that formatting the drive manually in the PXE environment might help. Seeing that error during imaging suggests to me that CloneDeploy is writing over the existing /dev/sda2 rather than formatting it and creating a new one. Is that right?
I think we could do the format by creating a new file system, which should format the partition. To test whether that works, wouldn't it be easier to do a full format of the drive from OS X Recovery and then try imaging it again? It should be the same theory.
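The manual-format idea above could be sketched roughly as follows from a Linux shell such as the PXE boot environment. This is only a sketch under assumptions: it assumes `wipefs`, `mkfs.hfsplus`, and `sgdisk` are present in that environment (the thread itself only confirms `fsck.hfsplus` and gdisk), and it uses the `/dev/sda` and `/dev/sda2` names from the posts. `wipe_plan` is a hypothetical helper that only prints the commands, so nothing is erased until someone runs them by hand.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) candidate wipe commands.
# Usage: wipe_plan DISK PARTITION
wipe_plan() {
    disk=$1
    part=$2
    # Option 1: clear file system signatures on the existing partition,
    # then create a fresh, empty HFS+ file system on it.
    echo "wipefs -a $part"
    echo "mkfs.hfsplus $part"
    # Option 2: destroy the GPT and protective MBR on the whole disk.
    echo "sgdisk --zap-all $disk"
}

# Device names taken from the thread; adjust for the machine at hand.
wipe_plan /dev/sda /dev/sda2
```

Printing rather than executing is a deliberate choice here, since either option destroys data on the target disk.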
It seems like there should also be some common denominator among these Macs. There must be a reason it is a problem for only some of them and not all. Could you attach a deploy log for one that worked and one that didn't, just to see if anything jumps out?
Here is a log file from a client that is having the issue. When I get a chance I will post a log file from a working client. Based on this error, should I be using the gdisk tool to fix the drive before imaging?
Caution: Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk
verification and recovery are STRONGLY recommended.
Command (? for help): b back up GPT data to a file
c change a partition's name
d delete a partition
i show detailed information on a partition
l list known partition types
n add a new partition
o create a new empty GUID partition table (GPT)
p print the partition table
q quit without saving changes
r recovery and transformation options (experts only)
s sort partitions
t change a partition's type code
v verify disk
w write table to disk and exit
x extra functionality (experts only)
? print this menu
Command (? for help): Warning! Secondary header is placed too early on the disk! Do you want to
correct this problem? (Y/N): Have moved second header and partition table to correct location.
Final checks complete. About to write GPT data. THIS WILL OVERWRITE EXISTING
Do you want to proceed? (Y/N): Caution: invalid backup GPT header, but valid main header; regenerating
backup header from main header.
Warning! Main and backup partition tables differ! Use the 'c' and 'e' options
on the recovery & transformation menu to examine the two tables.
Warning! One or more CRCs don't match. You should repair the disk!
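As background for the "Secondary header is placed too early on the disk" warning in the log above: GPT keeps its backup header in the last sector of the disk, so the expected location depends on the disk's size. The arithmetic below is a rough illustration only, not a diagnosis of this log. It assumes 512-byte sectors and nominal 300 GB and 500 GB capacities (real drives report slightly different byte counts), and `last_lba` is a hypothetical helper name.

```shell
#!/bin/sh
# last_lba BYTES SECTOR_SIZE
# Print the LBA of the final sector, which is where GPT expects
# the backup (secondary) header to live.
last_lba() {
    echo $(( $1 / $2 - 1 ))
}

# Nominal capacities, assumed for illustration only.
small=$(last_lba 300000000000 512)
large=$(last_lba 500000000000 512)

echo "backup header expected at LBA $small on the 300 GB disk"
echo "backup header expected at LBA $large on the 500 GB disk"
echo "difference in sectors: $(( large - small ))"
```

A backup header positioned for the smaller disk would sit well short of the end of the larger one, which is the kind of mismatch gdisk reports and offers to correct.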
Here is the log from a successful image. Comparing the two, and judging by the logs, the successful one was a 300 GB drive and the one with the node tree issue is a 500 GB drive.
That is about the only difference I see. I also see that this 300 GB drive is an exact match for the one the image was made from, which typically changes how the drive is partitioned during imaging, but since you are forcing dynamic partitions I don't see that as a factor. The only other thing I see is that the one that failed was previously using Core Storage before the new image was written. Perhaps, like you said, something is not being erased completely. Have you tried doing a full format and pushing the image again?
So we tried completely wiping the drives manually and had the same issue... BUT I just tested something with interesting results. Looking at the errors and the different drive sizes, I had an idea: why not make a new profile for the 500 GB drive, enable "Only upload schema", and then upload from the 500 GB machine? That created the 500gb profile, so now I have Default and one called 500gb. When I uploaded the schema, however, it deleted the image from the server. When I then try imaging with that "image" and the 500gb profile on the 500 GB machines, it doesn't do any imaging, since the image was deleted. It does, however, run the fsck, and oddly enough that repairs the drive successfully with no node tree errors. When I then boot into the OS, my script runs fine and resizes the drive. So is what I'm seeing a bug?
1) Why, when I check "Only upload schema", does it delete the image on the server and replace it with just a file named "schema" in the image's folder under the images folder?
2) It seems to me that for any machine with a different-size HD, a new profile will need to be made and uploaded for that specific HD size. Could this be coded to happen dynamically? I.e., CloneDeploy says: "This image is from a 350 GB HD, but the drive I'm detecting is 500 GB, so I'm going to write the volume header where it should be for a 500 GB drive instead of where it was for the 350 GB drive the image was uploaded from."
This makes absolutely no sense. Before I even try to dissect this, can you attach the deploy log? I'm surprised it even tried to do anything without the imaging files.
Here is the log from yesterday for the event I described.
Image profiles are not designed to use different schemas based on different-sized hard drives. You should never need to create an image specifically for a different-sized hard drive; that was one of my main goals in designing CloneDeploy. I believe it is just by chance that it seemed to repair the computer, and that it has nothing to do with the schema matching the hard drive size.
The "upload schema only" option was never meant to add on to an existing image. It is there to give you control over how you upload the image. For example, if you wanted to upload only one partition, you would upload only the schema first, select that partition, turn off "upload schema only", and finally do a normal upload. When uploading the schema only, everything is deleted because you should eventually be uploading a new image.
Based on the log, only two things actually happened.
1st. All of the information before the start of the first partition was erased. In this case the first 40 sectors were erased. This is normal, and that information typically gets restored from the file called table, but you no longer had that file to restore, so it remained blanked out until the partition table was recreated. Is it possible something there was causing the fsck to fail? It seems unlikely, but who knows.
2nd. The fsck simply ran again. It's possible that, for some reason, just running the fsck again from within the CloneDeploy boot image magically works the second time, or maybe only after booting into OS X once and then back into the CloneDeploy boot image.
I would test the second theory. Boot one of the broken Macs to the CloneDeploy Client Console and try running the fsck on the last two partitions.
[code]fsck.hfsplus -f /dev/sda2
fsck.hfsplus -f /dev/sda3[/code]
If that still doesn't work, I'll show you how we can test the first theory.
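When repeating the check above across several machines, a small wrapper that records each partition's exit status may make the results easier to compare. This is a minimal sketch, assuming the same `fsck.hfsplus -f` invocation shown above; `check_parts` and the `FSCK` override variable are hypothetical names introduced here for illustration.

```shell
#!/bin/sh
# Checker to run. Override for a dry run, e.g. FSCK=echo.
FSCK=${FSCK:-fsck.hfsplus}

# check_parts PARTITION...
# Run a forced check on each partition and print its exit status.
# Returns non-zero if any partition's check failed.
check_parts() {
    failed=0
    for part in "$@"; do
        status=0
        $FSCK -f "$part" || status=$?
        echo "$part: exit status $status"
        [ "$status" -eq 0 ] || failed=1
    done
    return $failed
}
```

From the client console it could be called as `check_parts /dev/sda2 /dev/sda3`.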
I sent you an email with some files to look at that may help better show what I'm seeing. I tried
[code]fsck.hfsplus -f /dev/sda2
fsck.hfsplus -f /dev/sda3[/code]
but as you will see in the pic, it still errored out. I did another test with a 1TB iMac: I made an "image" that only had an uploaded schema, deployed it back down to the same machine, and it fixed the drive.
At this point I'm convinced that it's either the table file not being restored or the repartitioning that is fixing this. Take your good image, the one that actually has files, rename its table file to table.bak, and then try to deploy it to one of the broken Macs. This will replicate the idea of not restoring the table file.
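The rename test described above can be made easy to undo with a pair of small helpers. This is a sketch only: `hide_table` and `restore_table` are hypothetical names, and the directory argument is assumed to be whichever folder of the image actually contains the `table` file on your server.

```shell
#!/bin/sh
# hide_table DIR
# Rename "table" to "table.bak" so it is not restored during deployment.
hide_table() {
    [ -f "$1/table" ] || { echo "no table file in $1" >&2; return 1; }
    mv "$1/table" "$1/table.bak"
}

# restore_table DIR
# Undo hide_table once the test deployment is finished.
restore_table() {
    [ -f "$1/table.bak" ] || { echo "no table.bak in $1" >&2; return 1; }
    mv "$1/table.bak" "$1/table"
}
```

Run `hide_table` on the image directory before the test deployment and `restore_table` on the same directory afterwards, so the good image is left as it was.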
I tried your idea of adding .bak to the file name but had the same result: invalid node tree 3,0. As a test, I then imaged the machine using the blank "1TB imac" image, which has just the schema in the image folder. Again it didn't image the machine, but it did run the fsck and once again fixed the drive. Did you get the video and screenshots I emailed you as reference, by chance?
Yeah, I got them; they show exactly what you describe. I still don't think this is a schema issue. I suspect that if you took just the schema file from the original 300 GB image (or whatever it was), made a new image, copied that schema file over, and deployed with that, it would also fix it. Can you try?
OK, so file this under Bizarre... I decided to delete the schema file in the image folder containing the good image. I then copied the "1TB" schema file that was repairing the drive and pasted it into the known good image folder, and lo and behold, all my clients are imaging properly now and have no node tree errors. I have a couple of thousand MacBooks that we will be imaging over the next few weeks, so I'll see if this fix holds true for all of them. Thanks for your help, man. It's still very odd that I needed to do this, though. Thanks!
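For anyone repeating the schema swap described above, keeping the original file instead of deleting it makes the two easy to compare later. A minimal sketch, assuming each image folder holds a file named `schema` as described earlier in the thread; `swap_schema` and the `schema.orig` backup name are hypothetical.

```shell
#!/bin/sh
# swap_schema DONOR_DIR TARGET_DIR
# Copy DONOR_DIR/schema over TARGET_DIR/schema, keeping the target's
# original as schema.orig (an existing backup is never overwritten).
swap_schema() {
    donor=$1
    target=$2
    [ -f "$donor/schema" ] || { echo "no schema in $donor" >&2; return 1; }
    if [ -f "$target/schema" ] && [ ! -f "$target/schema.orig" ]; then
        mv "$target/schema" "$target/schema.orig"
    fi
    cp "$donor/schema" "$target/schema"
}
```

Afterwards, `diff` on `schema.orig` and `schema` in the target folder shows exactly what differs between the two files.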
I would double-check to make sure you can image drives smaller than 1TB with that schema.
Yep, that is what I tested. So far a 350gb and a 500gb have imaged great with no issues.
OK, I'm officially lost; I have no idea what's happening. As long as it's working, I guess. If you get a few minutes, can you attach both schemas? I'm still interested in why this works.
Sure thing, here you go. I had to add .log as the extension for it to let me upload. I added "working" to the name of the working one so you can tell them apart at first glance.