
I'm using iTextSharp to read the text from a PDF file. However, sometimes I cannot extract any text because the PDF contains only images. I download the same PDF files every day, and I want to see whether a PDF has been modified. If the text and modification date cannot be obtained, is an MD5 checksum the most reliable way to tell whether the file has changed?

If it is, some code samples would be appreciated, because I don't have much experience with cryptography.

 Answers

77

It's very simple using System.Security.Cryptography.MD5:

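// Requires: using System.IO; using System.Security.Cryptography;
// Fragment from inside a method that returns byte[].
using (var md5 = MD5.Create())
{
    using (var stream = File.OpenRead(filename))
    {
        // ComputeHash reads the stream to the end and returns the 16-byte digest.
        return md5.ComputeHash(stream);
    }
}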

(I believe that actually the MD5 implementation used doesn't need to be disposed, but I'd probably still do so anyway.)

How you compare the results afterwards is up to you; you can convert the byte array to base64 for example, or compare the bytes directly. (Just be aware that arrays don't override Equals. Using base64 is simpler to get right, but slightly less efficient if you're really only interested in comparing the hashes.)
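The same caveat about array equality applies outside .NET. As a minimal sketch in Java (the language of one of the other answers here), assuming hypothetical names `DigestCompare`, `md5` and `sameDigest`, comparing two digests by content rather than by reference could look like this:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical helper class, for illustration only.
final class DigestCompare {

    // Computes the 16-byte MD5 digest of an in-memory byte array.
    static byte[] md5(byte[] data) throws NoSuchAlgorithmException {
        return MessageDigest.getInstance("MD5").digest(data);
    }

    // Compares digests element by element. Calling a.equals(b) on arrays
    // would only test whether both variables refer to the same object.
    static boolean sameDigest(byte[] a, byte[] b) {
        return Arrays.equals(a, b);
    }
}
```

Two digests of identical input are different array objects, so `a.equals(b)` is false for them while `sameDigest(a, b)` is true.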

If you need to represent the hash as a string, you could convert it to hex using BitConverter:

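// Requires: using System; using System.IO; using System.Security.Cryptography;
static string CalculateMD5(string filename)
{
    using (var md5 = MD5.Create())
    {
        using (var stream = File.OpenRead(filename))
        {
            var hash = md5.ComputeHash(stream);
            // BitConverter yields "AB-CD-..."; strip the dashes and lowercase the result.
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    }
}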
Tuesday, June 1, 2021
 
aurelijusv
answered 7 Months ago
82

Convert the file content into a string and use the method below:

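import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Note: MD5 is a hash, not encryption; the original method name is kept.
public static String getMD5EncryptedString(String encTarget) {
    MessageDigest mdEnc;
    try {
        mdEnc = MessageDigest.getInstance("MD5");
    } catch (NoSuchAlgorithmException e) {
        // Fail fast instead of continuing with a null digest.
        throw new IllegalStateException("MD5 algorithm not available", e);
    }
    // Hash the encoded bytes: the byte count can differ from encTarget.length()
    // when the string contains non-ASCII characters.
    byte[] bytes = encTarget.getBytes(StandardCharsets.UTF_8);
    mdEnc.update(bytes, 0, bytes.length);
    String md5 = new BigInteger(1, mdEnc.digest()).toString(16);
    // BigInteger drops leading zeros, so pad back to 32 hex digits.
    while (md5.length() < 32) {
        md5 = "0" + md5;
    }
    return md5;
}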

Note that this simple approach is suitable for smallish strings, but will not be efficient for large files. For the latter, see dentex's answer.
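For large files, one possible approach is to feed the digest in fixed-size chunks so the whole file never has to be held in memory. The following is only a sketch under assumptions: the class name `FileMd5`, the method name `md5Hex` and the 8 KiB buffer size are illustrative choices, not taken from any answer here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class, for illustration only.
final class FileMd5 {

    // Returns the MD5 digest of the file as a 32-character lowercase hex string.
    static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[8192]; // assumed chunk size; any positive size works
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            // read() returns -1 at end of file; only the bytes actually read are hashed.
            while ((read = in.read(buffer)) != -1) {
                md.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder(32);
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b)); // two hex digits per byte keeps leading zeros
        }
        return hex.toString();
    }
}
```

Storing the returned string next to each download and comparing it with the next day's value is one way to detect that the file's bytes changed.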

Wednesday, June 9, 2021
 
cegfault
answered 6 Months ago
95

You can compute the MD5 checksum in chunks, as demonstrated e.g. in "Is there a MD5 library that doesn't require the whole input at the same time?".

Here is a possible implementation using Swift (now updated for Swift 5):

import Foundation
import CommonCrypto

func md5File(url: URL) -> Data? {

    let bufferSize = 1024 * 1024

    do {
        // Open file for reading:
        let file = try FileHandle(forReadingFrom: url)
        defer {
            file.closeFile()
        }

        // Create and initialize MD5 context:
        var context = CC_MD5_CTX()
        CC_MD5_Init(&context)

        // Read up to `bufferSize` bytes, until EOF is reached, and update MD5 context:
        while autoreleasepool(invoking: {
            let data = file.readData(ofLength: bufferSize)
            if data.count > 0 {
                data.withUnsafeBytes {
                    _ = CC_MD5_Update(&context, $0.baseAddress, numericCast(data.count))
                }
                return true // Continue
            } else {
                return false // End of file
            }
        }) { }

        // Compute the MD5 digest:
        var digest: [UInt8] = Array(repeating: 0, count: Int(CC_MD5_DIGEST_LENGTH))
        _ = CC_MD5_Final(&digest, &context)

        return Data(digest)

    } catch {
        print("Cannot open file:", error.localizedDescription)
        return nil
    }
}

The autorelease pool is needed to release the memory returned by file.readData(); without it, the entire (potentially huge) file would be loaded into memory. Thanks to Abhi Beckert for noticing that and providing an implementation.

If you need the digest as a hex-encoded string, change the return type to String? and replace

return Data(digest)

with

let hexDigest = digest.map { String(format: "%02hhx", $0) }.joined()
return hexDigest
Monday, August 2, 2021
 
francadaval
answered 4 Months ago
57

Here is a good article on how to calculate and check Blob MD5 checksums.

I have faced this before, and I don't know why, but you can't just do md5.ComputeHash(fileBytes). For Azure Blobs, the following approach is used to get the hash:

// Validate MD5 Value
var md5Check = System.Security.Cryptography.MD5.Create();
md5Check.TransformBlock(retrievedBuffer, 0, retrievedBuffer.Length, null, 0);     
md5Check.TransformFinalBlock(new byte[0], 0, 0);

// Get Hash Value
byte[] hashBytes = md5Check.Hash;
string hashVal = Convert.ToBase64String(hashBytes);

and it works...

And yes, as Guarav already mentioned, the MD5 hash is saved as a base64 string.
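Because the stored value is base64 while many tools print MD5 digests as hex, the two strings have to be brought to the same encoding before they are compared. A minimal Java sketch, assuming a hypothetical helper `DigestEncoding.hexToBase64` (not part of any Azure SDK), converts a hex digest to the base64 form of the same 16 bytes:

```java
import java.util.Base64;

// Hypothetical helper class, for illustration only.
final class DigestEncoding {

    // Decodes a hex digest into raw bytes, then re-encodes those bytes as base64.
    static String hexToBase64(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("hex string must have an even length");
        }
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            // Each pair of hex digits is one byte.
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return Base64.getEncoder().encodeToString(bytes);
    }
}
```

For example, the hex digest `5eb63bbbe01eeed093cb22bb8f5acdc3` (MD5 of the ASCII text `hello world`) becomes `XrY7u+Ae7tCTyyK7j1rNww==`.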

Friday, August 6, 2021
 
jsuissa
answered 4 Months ago
62

PBKDF2

You were really close actually. The link you have given shows you how you can call the Rfc2898DeriveBytes function to get PBKDF2 hash results. However, you were thrown off by the fact that the example was using the derived key for encryption purposes (the original motivation for PBKDF1 and 2 was to create "key" derivation functions suitable for using as encryption keys). Of course, we don't want to use the output for encryption but as a hash on its own.

You can try the SimpleCrypto.Net library written for exactly this purpose if you want PBKDF2. If you look at the implementation, you can see that it is actually just a thin wrapper around (you guessed it) Rfc2898DeriveBytes.
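To illustrate the idea of using the derived bytes as the hash itself, here is a rough Java analogue. It is only a sketch: it uses the JDK's `SecretKeyFactory` with `PBKDF2WithHmacSHA1` rather than .NET's Rfc2898DeriveBytes, and the class name `Pbkdf2Sketch`, the method name `derive` and its parameters are hypothetical.

```java
import java.security.NoSuchAlgorithmException;
import java.security.spec.InvalidKeySpecException;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

// Hypothetical helper class, for illustration only.
final class Pbkdf2Sketch {

    // Derives lengthBytes bytes from the password and salt; the result is stored
    // (together with the salt and iteration count) and compared on later logins.
    static byte[] derive(char[] password, byte[] salt, int iterations, int lengthBytes)
            throws NoSuchAlgorithmException, InvalidKeySpecException {
        // PBEKeySpec expects the output length in bits.
        PBEKeySpec spec = new PBEKeySpec(password, salt, iterations, lengthBytes * 8);
        try {
            return SecretKeyFactory.getInstance("PBKDF2WithHmacSHA1")
                    .generateSecret(spec)
                    .getEncoded();
        } finally {
            spec.clearPassword(); // wipe the spec's internal copy of the password
        }
    }
}
```

In real use the salt would be random per password and the iteration count would be chosen much higher than in a test; both are left to the caller here.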

BCrypt

You can try the C# implementation named (what else) BCrypt.NET if you want to experiment with this variant.

Disclaimer: I have not used or tested any of the libraries that I have linked to... YMMV

Saturday, August 7, 2021
 
Brian
answered 4 Months ago