Estimating How Much of a File was Copied
August 9, 2023
We recently encountered an issue where files were copied but it wasn’t certain that the transfer was uninterrupted.
The Windows box copies large files (>50 GB) off-site over a relatively slow link. We thus needed a quick way to estimate, from the copying machine, whether any files were obviously not copied completely.
A Quick Warning on Comparing Files by Size
A word of warning for anyone who assumes that comparing sizes is a proper way to check that two files are the same: it is not. Every single bit could have been flipped in the copying process, resulting in a file of the same size with different content. Whether this happened intentionally or not does not matter if you need an exact replica.
In this case, we also copy a SHA256 checksum for each file in a separate .sha256 sidecar file, but validating those from the copying machine would require copying all the data back. Validation is thus a task better suited to the destination platform. In addition, the 7z format we use in this case stores a CRC32 checksum for each individual file contained in the archive. Validation can therefore be performed on a single contained file without copying the whole archive back. Alas, CRC32 is not cryptographically secure and could be fooled by a malicious actor, and validating the archive file by file is of course much slower than just copying the whole archive back for comparison.
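As an illustration of the sidecar check mentioned above, here is a minimal sketch of what validation on the destination could look like. The function name Test-Sha256Sidecar is hypothetical, and the sketch assumes that the sidecar sits next to the file as "<file>.sha256" and that its first whitespace-separated token is the hex digest (as written by sha256sum); our actual sidecar layout may differ.

```powershell
# Hypothetical helper, not part of the original tooling.
# Assumptions: the sidecar is named "<file>.sha256" and its first
# whitespace-separated token is the SHA256 digest in hex.
function Test-Sha256Sidecar {
    param(
        [Parameter(Mandatory)][string]$FilePath
    )
    $sidecarPath = "$FilePath.sha256"

    # Read only the first line of the sidecar and take its first token.
    $firstLine = Get-Content -LiteralPath $sidecarPath -TotalCount 1
    $expected = ($firstLine.Trim() -split '\s+')[0]

    # Hash the file itself; this reads the complete file once.
    $actual = (Get-FileHash -LiteralPath $FilePath -Algorithm SHA256).Hash

    # String comparison with -eq is case-insensitive in PowerShell,
    # so upper- and lower-case hex digests compare as equal.
    return $actual -eq $expected
}
```

Since this reads every byte of the file, it only makes sense where the data is local, which is why it belongs on the destination rather than on the copying machine.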
Establishing the Copied Amount
Now, since Windows “lies” about the true file size and reports the full size reserved for the file rather than the amount of data actually written, we can’t just compare those sizes. However, when right-clicking on a file and opening its properties, Explorer will tell us both the “Size” and the “Size on disk”, the latter being the part that was actually copied.
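To make the relation between the two numbers concrete, here is a small arithmetic sketch. The helper Get-ClusterRoundedSize is hypothetical, and the 4096-byte cluster size is only an assumption for the example; the real value depends on the volume.

```powershell
# Hypothetical helper: round a byte count up to whole clusters.
# Assumption: 4096-byte clusters; the real cluster size depends on the volume.
function Get-ClusterRoundedSize {
    param(
        [long]$Bytes,
        [long]$ClusterSize = 4096
    )
    # Integer-only ceiling to the next multiple of the cluster size.
    $n = $Bytes + $ClusterSize - 1
    return $n - ($n % $ClusterSize)
}

$size = 10000                                             # "Size" as reported for the file
$completeOnDisk = Get-ClusterRoundedSize -Bytes $size     # all 10000 bytes were written
$partialOnDisk  = Get-ClusterRoundedSize -Bytes 6000      # only 6000 bytes were written

$size - $completeOnDisk   # -2288: on-disk size is larger, nothing is missing
$size - $partialOnDisk    #  1808: roughly this many bytes are missing
```

A complete file therefore yields a difference of zero or less, while a positive difference gives a rough estimate of how much data never arrived.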
Comparing from Explorer was not a practical option due to the number of files, thus PowerShell comes to the rescue.
We also noted that the date was left in 1980. We are not quite sure why this date, since the UNIX epoch starts in 1970 (the target system is running Linux) and, if we’re not mistaken, the Windows epoch starts in 1601. Checking whether the date is too far back may thus have been another option, but it did not feel as reliable and also does not say how much of the file was left uncopied.
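For completeness, the date-based check we decided against could be sketched as follows. The function name Get-FileOlderThan is hypothetical, and both the use of LastWriteTime and the cutoff of January 1, 1981 are assumptions chosen to catch the 1980 dates we observed.

```powershell
# Hypothetical helper: list files whose modification time predates a cutoff.
# Assumptions: the stale date shows up in LastWriteTime, and anything
# before 1981 is suspicious for our data.
function Get-FileOlderThan {
    param(
        [Parameter(Mandatory)][string]$Path,
        [datetime]$Cutoff = [datetime]'1981-01-01'
    )
    Get-ChildItem -LiteralPath $Path -File -Recurse |
        Where-Object { $_.LastWriteTime -lt $Cutoff }
}
```

Such a check only flags candidates. It says nothing about how much data is missing, which is why we went with the size comparison instead.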
The Scripted Solution
Since PowerShell’s Get-Item only exposes the Length property (“Size” in Explorer) but no “LengthOnDisk”, we were looking for a way to access that value. Fortunately, someone was kind enough to post the PowerShell code needed to access that information (see the credit at the bottom).
We thus boldly check whether the “Size on disk” is at least as large as the file size (which must be the case for a complete file, e.g. due to rounding up to the cluster size):
# Needed for Get-IsFileFullyCopied
Add-Type -TypeDefinition @'
using System;
using System.Runtime.InteropServices;
using System.ComponentModel;
using System.IO;

namespace Win32Functions
{
    public class ExtendedFileInfo
    {
        // Returns the number of bytes actually allocated for the file,
        // rounded up to whole clusters of the volume it resides on.
        public static long GetFileSizeOnDisk(string file)
        {
            FileInfo info = new FileInfo(file);
            uint dummy, sectorsPerCluster, bytesPerSector;
            int result = GetDiskFreeSpaceW(info.Directory.Root.FullName, out sectorsPerCluster, out bytesPerSector, out dummy, out dummy);
            if (result == 0) throw new Win32Exception();
            uint clusterSize = sectorsPerCluster * bytesPerSector;
            uint hosize;
            uint losize = GetCompressedFileSizeW(file, out hosize);
            // 0xFFFFFFFF (INVALID_FILE_SIZE) only signals a failure if the last error is non-zero.
            if (losize == 0xFFFFFFFF && Marshal.GetLastWin32Error() != 0) throw new Win32Exception();
            // Combine the high and low 32-bit halves into one 64-bit size.
            long size = (long)hosize << 32 | losize;
            // Round up to the next multiple of the cluster size.
            return ((size + clusterSize - 1) / clusterSize) * clusterSize;
        }

        [DllImport("kernel32.dll", SetLastError = true)]
        static extern uint GetCompressedFileSizeW([In, MarshalAs(UnmanagedType.LPWStr)] string lpFileName,
            [Out, MarshalAs(UnmanagedType.U4)] out uint lpFileSizeHigh);

        [DllImport("kernel32.dll", SetLastError = true, PreserveSig = true)]
        static extern int GetDiskFreeSpaceW([In, MarshalAs(UnmanagedType.LPWStr)] string lpRootPathName,
            out uint lpSectorsPerCluster, out uint lpBytesPerSector, out uint lpNumberOfFreeClusters,
            out uint lpTotalNumberOfClusters);
    }
}
'@
function Get-IsFileFullyCopied {
    param(
        [string]$FilePath
    )
    $fullSize = (Get-Item -LiteralPath $FilePath).Length
    $diskSize = [Win32Functions.ExtendedFileInfo]::GetFileSizeOnDisk($FilePath)
    # The on-disk size is normally slightly larger than the file size (e.g. due to the cluster size).
    # The difference should thus be <= 0 for files/folders that aren't additionally compressed by the filesystem.
    if (($fullSize - $diskSize) -gt 0) {
        Write-Host "Not fully copied: ${FilePath}: $([Math]::Round(($fullSize - $diskSize) / 1MB, 3)) MB missing"
        return $false
    }
    return $true
}
Using the function, e.g. Get-IsFileFullyCopied -FilePath "R:\Foo\file.7z", on the remote shared drive, we can now quickly gauge how much was actually copied. The function returns a boolean and also prints the files for which the sizes differ. It is thus suitable for a loop that checks a whole directory including subdirectories:
# Define the directory to search
$directoryPath = "R:\Foo"
$files = Get-ChildItem -Path $directoryPath -File -Recurse
foreach ($file in $files) {
    $isFullyCopied = Get-IsFileFullyCopied -FilePath $file.FullName
    if (-not $isFullyCopied) {
        # Handle error on $file
    }
}
Thanks to Stack Overflow user CB., the original contributor of the [Win32Functions.ExtendedFileInfo]::GetFileSizeOnDisk code.