Revisiting C++ for Image Deduplication

Posted by Oliver on the 11th of April, 2020 in category Tech
Tagged with: programming, C++, coding, flickr, languages, software, Qt, imagemagick

Hi everyone, and welcome back to my exceedingly sparse tales of technology adventures! Strap in for an excessively long introduction to this one - it's boring and tedious as hell.

I'll skim over the introduction to the introduction, but like many of you, I've had a gradually growing collection of digital images. It started in roughly 2001, when my family had a Kodak DC120 1MP digital camera (yes, one megapixel) and I took a bunch of photos on a family vacation; in the following years, as mobile phones acquired (very low quality) cameras, I began taking photos with those as well and managed to keep pretty much all of them. At one point, while working at a web hosting company, I started storing them on a webserver running some PHP-based gallery software. Later, while working at Nokia, I moved them to the fated "Ovi" platform, which had an (acquired) photo hosting product. Once that was discontinued, I decided to move all the photos to Flickr (still a viable platform at that time) and even wrote a bit about the process of getting them there. I remember having to cobble together some automation with a mouse macro to download everything from Ovi, since there was no API available for downloading the photos. At least the upload to Flickr could be scripted - I wrote an abomination of a shell script for that, and that's where the photos stayed for quite some time.

Cut to the last (or maybe second-last) time that Flickr changed hands and revised their pricing plans - I decided that I needed to get my photos off Flickr again. I had been maintaining originals of all of my uploaded photos (or so I thought) on my home storage server, so I was reasonably confident that I still had everything that had been uploaded over the years. Nevertheless, just to be safe, I downloaded the archive that Flickr made available and closed my account. Some time before that I had done something similar with Facebook, having been uncertain about their policy on how your uploaded photos could be used. We'd wanted to continue sharing photos with family members, but of course that hadn't quite survived the jump from Facebook to Flickr, and it was no longer a big requirement for us.

Cut forward to late last year, when I finally replaced our custom Linux-based home server with a small Synology device and was migrating all of our files over. I realised that in addition to hundreds of gigabytes of photos, I also had archives from Facebook and Flickr which potentially contained duplicates of the originals. Deleting the archives would have saved quite a bit of extra space (not that space was in excessive demand, but the huge 2TB drives I had purchased years ago are not so big anymore, especially as we've continued adding files). Being my usual obsessive self, I decided I needed to solve this problem - however, there were close to 40,000 photos on the storage server and several thousand in the downloaded archives. Performing the comparison manually would have been doable but exceedingly time-consuming.

The Problem Statement

Given the corpus of roughly 40,000 reference images, and a set of possible duplicates, find the duplicates and delete them. There were several sticking points right away:

  • EXIF data was not consistently present in the downloaded images, or had potentially been modified.
  • Filenames of the downloaded images were changed from the originals - sometimes embedding part of the original filename, and sometimes completely different depending on how I uploaded it in the first place.
  • Timestamps were not consistently correct on downloaded files.
  • I couldn't guarantee that downloaded images were the exact same image as I originally uploaded (although I assume that Flickr gave me as close to the original file as possible, and not all of the different-sized thumbnails that they generated).

Solution 1: Hashing

My usual go-to for detecting duplicate files was therefore mostly out of the question. The simplest (although CPU- and I/O-intensive) option is to just generate a hash of every file and then compare hashes - this gives you a list of all files that have duplicates. Return only the filenames in the directory of downloaded photos and you can delete them straight away. Unfortunately this didn't yield any matches, due to the EXIF information having been changed or removed from the files. Arguably I could have stripped the EXIF header out of all the files and then compared the remaining binary data, but I didn't like my chances.
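
For illustration, the idea looks roughly like this - a minimal sketch with hypothetical directory names ("originals" and "downloads"), using C++17's std::filesystem and std::hash over the raw bytes standing in for a proper digest:

    #include <filesystem>
    #include <fstream>
    #include <functional>
    #include <iostream>
    #include <string>
    #include <unordered_map>

    namespace fs = std::filesystem;

    // Hash the raw bytes of a file. std::hash is not cryptographic, but for
    // spotting byte-identical files it is good enough as a first pass.
    static std::size_t hashFile(const fs::path &p) {
        std::ifstream in(p, std::ios::binary);
        std::string bytes((std::istreambuf_iterator<char>(in)),
                          std::istreambuf_iterator<char>());
        return std::hash<std::string>{}(bytes);
    }

    int main() {
        std::unordered_map<std::size_t, fs::path> originals;

        // Index the reference collection first...
        for (const auto &entry : fs::recursive_directory_iterator("originals"))
            if (entry.is_regular_file())
                originals.emplace(hashFile(entry.path()), entry.path());

        // ...then report any downloaded file whose hash matches an original.
        for (const auto &entry : fs::recursive_directory_iterator("downloads")) {
            if (!entry.is_regular_file())
                continue;
            auto it = originals.find(hashFile(entry.path()));
            if (it != originals.end())
                std::cout << entry.path() << " duplicates " << it->second << "\n";
        }
    }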

Solution 2: Artificial Intelligence!

At this point you can very easily jump on the hype train and decide "I'm going to use Machine Learning". I chatted with a team-mate who has experience in exactly this field (images and ML), and he quickly put me off the idea. As well as it being far outside my technical experience, the whole process of training a model, and the likelihood of poor detection accuracy, didn't appeal to me.

Solution 3: OpenCV

This was just an idea that I didn't explore at all in practice. A few years ago I started dabbling in OpenCV with the Python bindings (and even put some of the code up somewhere), so I knew OpenCV could do some kind of image recognition. I assumed it would be able to fingerprint the images in some way so that I could compare fingerprints. I still have no idea whether it can, but some day I'll come back to OpenCV and give it a try.

Solution 4: Some sort of ... other image comparison!

The simplest possible solution I could therefore think of was to just take two images and iterate through them one pixel at a time, comparing the colours (each of which should just be three 8-bit numbers). If, by the time the end of the image is reached, no comparison has failed, the two images must be identical! However, this approach really doesn't scale. Even skipping pairs of images that are clearly not the same (failing early at the first mismatched pixel, or not comparing two images that have different dimensions) would still require far too much processing time. Furthermore, there was a good chance (I theorised) that some of the downloaded images might actually be the same, but slightly downscaled or resampled. These lower-quality images should also be marked as duplicates and deleted.
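
To make the idea concrete, the naive comparison might look something like this with the Magick++ API (a sketch only - not code I actually ran):

    #include <Magick++.h>

    // Naive exact comparison: bail out as soon as the dimensions differ or a
    // single pixel doesn't match. Fine for one pair, hopeless for thousands.
    bool identicalPixels(const Magick::Image &a, const Magick::Image &b) {
        if (a.columns() != b.columns() || a.rows() != b.rows())
            return false;
        for (size_t y = 0; y < a.rows(); ++y)
            for (size_t x = 0; x < a.columns(); ++x)
                if (a.pixelColor(x, y) != b.pixelColor(x, y))
                    return false;
        return true;
    }

    int main(int argc, char **argv) {
        Magick::InitializeMagick(*argv);
        Magick::Image left(argv[1]), right(argv[2]);
        return identicalPixels(left, right) ? 0 : 1;
    }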

My new solution, therefore, was to generate small "fingerprints" - 100x100 pixel downscaled versions of the originals - and compare only those. In many cases this meant resizing with a disregard for the original proportions, but I figured that at least duplicates of different sizes should end up the same size and (hopefully) with the same (or similar) fingerprints. I happened to pick ImageMagick for this, as it was an obvious choice for resizing images.
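
The fingerprint step itself is only a few lines of Magick++ - a sketch with hypothetical filenames (the "!" in the geometry forces the exact 100x100 size, ignoring the aspect ratio):

    #include <Magick++.h>

    int main(int argc, char **argv) {
        Magick::InitializeMagick(*argv);

        Magick::Image img("IMG_0001.JPG");         // hypothetical input
        img.resize(Magick::Geometry("100x100!"));  // "!" ignores aspect ratio
        img.write("fingerprints/IMG_0001.png");    // assumes the directory exists
        return 0;
    }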

The choice of C++ came about by accident. While generating fingerprints for my collection of images could easily be automated with a shell script, I couldn't imagine building a workable solution for the comparison stage in shell as well - that seemed to require a bit more complexity in the implementation. Since I've been itching to use C++ a bit more for the last few years, I decided this was a worthy case, and fortunately ImageMagick has a set of C++ headers that make this possible. The documentation is not the best, but it was an interesting challenge. Ostensibly I should have learned C++ in my second year of university, but it was one of the many subjects I barely paid any attention to, and all I managed to do was write non-functional programs. It was time to revisit C++ and actually use it in anger (I find that phrase very strange, but apparently it's a thing).

Results

Did it work? Well, yes. You can even have a look at the code here. There were many learnings along the way. On the tooling side I used CMake again, which I'm slowly getting accustomed to. Some of the images among the originals were in Canon's CR2 "raw" format, so that required the use of ufraw as well (which also meant very slow resizing operations).

Fingerprints

The first iterations involved just generating the fingerprints and getting the fingerprint format right. Having been spoiled by more modern languages like Ruby, JavaScript ES6 and Go in recent years, I was surprised at needing to pull in external libraries like Boost even for filesystem traversal. Already in the early iterations of generating fingerprints, I could see that scalability was going to be a concern - I was traversing my entire directory of photos in a single thread and generating the fingerprints in that same thread. This gave rise to a refactor that made it possible to start generating fingerprints as soon as the program had figured out where the first image was located.
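
The traversal side of that refactor looks roughly like this - a sketch with a hypothetical "photos" root directory, where the fingerprint function stands in for the Magick++ resize shown earlier:

    #include <boost/filesystem.hpp>
    #include <iostream>
    #include <string>

    namespace fs = boost::filesystem;

    // Stand-in for the real work: resize to 100x100 and write the fingerprint,
    // as in the earlier Magick++ sketch.
    void fingerprint(const fs::path &image) {
        std::cout << "fingerprinting " << image << "\n";
    }

    int main() {
        // Walk the whole photo tree, but hand each image off for fingerprinting
        // as soon as it is discovered instead of collecting every path first.
        for (const auto &entry : fs::recursive_directory_iterator("photos")) {
            if (!fs::is_regular_file(entry.status()))
                continue;
            const std::string ext = entry.path().extension().string();
            if (ext == ".jpg" || ext == ".JPG" || ext == ".png" || ext == ".CR2")
                fingerprint(entry.path());
        }
    }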

Once again I found myself missing Go's channels and goroutines, and having to actually build something with queues (not concurrency-safe either!) and threads was something quite different! Later I replaced my lock-protected queue with a Boost lock-free queue, more for programming convenience than for performance concerns. I didn't quite make it as far as using a thread pool as my static set of workers ended up being sufficient.
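
For what it's worth, the lock-free queue plus static worker set pattern boils down to something like this (a simplified sketch - the paths are hypothetical and the fingerprinting is reduced to a print statement):

    #include <boost/lockfree/queue.hpp>
    #include <atomic>
    #include <iostream>
    #include <string>
    #include <thread>
    #include <vector>

    // boost::lockfree::queue needs a trivially copyable element type, so paths
    // travel through it as raw pointers rather than std::string values.
    boost::lockfree::queue<std::string *> work(1024);
    std::atomic<bool> producerDone{false};

    void worker(int id) {
        std::string *path = nullptr;
        while (!producerDone || !work.empty()) {
            if (work.pop(path)) {
                // The real worker generates the 100x100 fingerprint here.
                std::cout << "worker " << id << " handling " << *path << "\n";
                delete path;
            } else {
                std::this_thread::yield();  // queue momentarily empty
            }
        }
    }

    int main() {
        // A static set of workers stands in for a proper thread pool.
        std::vector<std::thread> workers;
        for (int i = 0; i < 4; ++i)
            workers.emplace_back(worker, i);

        // The producer is normally the filesystem traversal; a few hypothetical
        // paths keep the sketch self-contained.
        for (const char *p : {"photos/a.jpg", "photos/b.jpg", "photos/c.jpg"}) {
            auto *path = new std::string(p);
            while (!work.push(path))
                std::this_thread::yield();
        }
        producerDone = true;

        for (auto &t : workers)
            t.join();
    }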

Comparisons

Comparing the fingerprints is where it got interesting (and where I spent most of my time). ImageMagick is able (via many different algorithms) to compare images and give you the result - same or not, how many pixels differ, the percentage of error over the entire image, and so on. Just picking an algorithm was a difficult choice. I started by generating my own down-scaled thumbnails and comparing them against the originals as a way of understanding the "ideal" comparison case. The results were confusing, and not very illuminating. I started with the raw count of pixels that were different, which produced a depressingly small set of suspected duplicate images. Despite tweaking the "fuzz factor" (yes) to allow a bit more variation between pixels before they stopped counting as the same colour, I didn't get very good results.
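
Stripped to its essentials, that first attempt amounts to something like this (a sketch with hypothetical filenames, assuming a reasonably recent Magick++):

    #include <Magick++.h>
    #include <iostream>

    int main(int argc, char **argv) {
        Magick::InitializeMagick(*argv);

        Magick::Image a("fingerprints/left.png");
        Magick::Image b("fingerprints/right.png");

        // The fuzz factor: pixels within ~5% of each other still count as the
        // same colour.
        a.colorFuzz(0.05 * QuantumRange);

        // AbsoluteErrorMetric reports the raw number of differing pixels.
        double differing = a.compare(b, Magick::AbsoluteErrorMetric);
        std::cout << differing << " pixels differ\n";
        return 0;
    }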

One of the overloaded methods of Image::compare allows you to provide a reference to a Magick::Image, into which it will produce a "diff" image between the source and target images. I briefly replaced some of my comparison calls with this method, and started looking at some of the actual results by hand. The outputs were completely confusing - some of my reference image pairs showed large discrepancies. I tried several different comparison algorithms and usually found either huge differences or none (which was the most suspicious). I had largely put this down to my misunderstanding of the algorithms, but a desperate post to the ImageMagick forums guided me in the direction of possible colour-space, floating-point depth or HDRI/non-HDRI mismatches.

I made a few tweaks here and there, and ultimately settled upon the "root mean-squared" error metric for the comparison. Don't ask me what it means. Combined with a couple of heuristic cut-off error rates (1% and 2%), I was able to produce a set of images that were probably based on the same original. Ultimately, the amount of time I was spending on this phase started to feel like I should have just looked at the damn images and figured out the duplicates manually, so I decided a new tactic was necessary. I'd cut down the set considerably after my amazing wife suggested I look at the time the photos were taken - I actually had the timestamp in the original photos, and many of the downloaded Flickr set had timestamps in separate JSON metadata files for each image. After coding up some more of this logic in the C++ program, I could remove duplicates simply by determining that they had been taken at the same time - they must be the same photo (or at least very similar, if taken seconds apart).
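
The comparison I settled on, reduced to a sketch (hypothetical filenames again; the 1% and 2% cut-offs were purely heuristic):

    #include <Magick++.h>
    #include <iostream>

    int main(int argc, char **argv) {
        Magick::InitializeMagick(*argv);

        Magick::Image a("fingerprints/left.png");
        Magick::Image b("fingerprints/right.png");

        // Root mean-squared error, normalised so that 0.0 means identical.
        double error = a.compare(b, Magick::RootMeanSquaredErrorMetric);

        if (error < 0.01)
            std::cout << "almost certainly the same image\n";
        else if (error < 0.02)
            std::cout << "probably the same image\n";
        else
            std::cout << "probably different images\n";
        return 0;
    }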

This left the problem of the remaining ~3500 images that I was less confident about.

Manual Comparison

I decided I did want to look at the photos after all. Once I had a list of photos that were probably the same, I wanted to check them just to make sure - then delete the duplicate once I was confident I was deleting the right one. Again, a shell script was probably out of the question. If I was going to do this for thousands of images, I wanted a really fast process.

Perhaps as a homage to my former employer (Nokia), I decided to look into the Qt framework. At one point this was THE framework running on all of the newest Nokia devices, and it had a vast number of developers working both on it and with it. To me, in 2019/2020, it represents a way to further my learning of C++, and it can also be used to build apps across several different platforms. The downsides appear to be the licensing aspects, which are complicated and costly. A couple of years ago, when I'd wanted to create a simple native UI application, I'd looked into it and unfortunately not really understood how to use it. This time around I bought an online course on Qt to get myself started. Starting from a simple layout and building the functionality around it was actually quite straightforward.

[Screenshot: the Qt image duplicate deleter UI]

This image shows the UI layout I ended up with. The goal was to have the images displayed side by side (proportions not really respected, as you can probably tell), and three buttons to control the action: delete the left image, delete the right image, or skip the pair in case neither is actually a duplicate. These choices swiftly acquired hotkeys to make the process as fast as possible. I was loading the images over the network and worried that I might have to pre-fetch them to ensure they were displayed fast enough, but in most cases this wasn't necessary.

The code in the end wasn't very complicated or verbose. Recent versions of Qt use at least C++11 and so the slots and signals can be connected via lambda functions - making most of the interactive code quite concise. The documentation is good, and the functionality you get from the framework covers just about anything you could need. Of course, this comes with the side-effect that even basic stuff like strings or exceptions become QStrings and QExceptions, but these are minor issues. I think they may have changed some of this in the upcoming Qt 6 release.
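
As a rough illustration of how little code that kind of layout and wiring needs, a stripped-down sketch might look like this (hypothetical image paths, and the button handlers reduced to stubs):

    #include <QApplication>
    #include <QHBoxLayout>
    #include <QKeySequence>
    #include <QLabel>
    #include <QPixmap>
    #include <QPushButton>
    #include <QVBoxLayout>
    #include <QWidget>

    int main(int argc, char **argv) {
        QApplication app(argc, argv);
        QWidget window;

        // Two images side by side; the paths here are placeholders.
        auto *left = new QLabel;
        auto *right = new QLabel;
        left->setPixmap(QPixmap("pair/left.jpg").scaled(400, 400, Qt::KeepAspectRatio));
        right->setPixmap(QPixmap("pair/right.jpg").scaled(400, 400, Qt::KeepAspectRatio));

        // Three actions, each with a hotkey to keep the process fast.
        auto *deleteLeft = new QPushButton("Delete left");
        auto *deleteRight = new QPushButton("Delete right");
        auto *skip = new QPushButton("Skip pair");
        deleteLeft->setShortcut(QKeySequence(Qt::Key_1));
        deleteRight->setShortcut(QKeySequence(Qt::Key_2));
        skip->setShortcut(QKeySequence(Qt::Key_3));

        // Signals connected straight to lambdas; the real handlers would delete
        // the chosen file and load the next pair.
        QObject::connect(deleteLeft, &QPushButton::clicked, [] { /* delete left */ });
        QObject::connect(deleteRight, &QPushButton::clicked, [] { /* delete right */ });
        QObject::connect(skip, &QPushButton::clicked, [] { /* next pair */ });

        auto *images = new QHBoxLayout;
        images->addWidget(left);
        images->addWidget(right);
        auto *buttons = new QHBoxLayout;
        buttons->addWidget(deleteLeft);
        buttons->addWidget(deleteRight);
        buttons->addWidget(skip);

        auto *root = new QVBoxLayout(&window);
        root->addLayout(images);
        root->addLayout(buttons);

        window.show();
        return app.exec();
    }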

Conclusion

Again, did it work? Ultimately, yes. I suspect I still have some duplicates in the image collection, but not so many that I worry about them (or can justify spending more time on). I'd like to revisit the automatic duplicate detection with ImageMagick (or another library) in the future, as it should in principle work a lot more straightforwardly than it did in the end. Would I use Qt again for something? Absolutely - there is a lot more of the framework I could explore, and since building UIs (let alone native desktop UI applications) is not a skill I've used at all in my professional career, I'd like to explore and learn much more about it. The licensing of Qt does scare me a little, but since I'm not planning on selling anything, I'm not too worried about it.

Let me know if any of this was interesting to you, or if you have had the same problem and dealt with it in a very different way! Always happy to chat about any of the technologies used or any of the fun problems we have to solve along the way.
