Hundreds of thousands of duplicate image files are clogging the digital archives of Melbourne's public institutions, according to an audit process currently underway across several Victorian government agencies. The problem wastes staff time, inflates storage costs, and, in heritage and arts collections, risks the wrong version of an image being published or permanently archived.
The issue has landed on desks at a particularly fraught moment. Victorian government agencies are midway through a digital transformation push tied to the state's 2025–2030 Digital Strategy, which explicitly targets data integrity across public sector holdings. Duplicate and unverified image assets directly undermine that goal. For institutions like the State Library of Victoria on Swanston Street and the Public Record Office Victoria based in North Melbourne, where digitisation of physical collections has accelerated since 2020, the challenge is concrete and costly.
The Scope of the Problem, in Plain Figures
Digital storage is cheap, until it isn't. A single institution running a mid-scale photographic archive can see between 15 and 30 per cent of its total holdings taken up by duplicate images, according to published benchmarking data from the Digital Preservation Coalition, a UK-based body whose research is widely referenced by Australian archivists. At the State Library of Victoria, which holds more than 800,000 digitised images across its Pictures Collection, even a conservative 15 per cent duplication rate would mean upward of 120,000 redundant files consuming server space and staff attention.
Cloud storage for large TIFF files, the format used for high-quality archival images, costs roughly $30 to $50 per terabyte per month for enterprise-grade services. A single uncompressed TIFF from a heritage scan can exceed 500 megabytes. Multiply that across thousands of duplicates and the storage bill compounds quickly. Councils are not immune. The City of Melbourne's own digital asset management system, which supports everything from planning documents to public art records, has been flagged internally as an area requiring regular deduplication reviews, though the council has not publicly disclosed the scale of any redundancy in its holdings.
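To put rough numbers on that compounding, the figures quoted above can be run through a short calculation. This is an illustrative upper-bound sketch, not an audited estimate: it assumes every one of the 120,000 hypothetical duplicates is a full 500-megabyte TIFF and uses decimal terabytes. The function name is invented for the example.

```python
def monthly_duplicate_cost(duplicate_count, avg_file_mb, price_per_tb_month):
    """Return the monthly storage cost of redundant files.

    Assumes decimal units (1 TB = 1,000,000 MB) and a flat per-terabyte price.
    """
    terabytes = duplicate_count * avg_file_mb / 1_000_000
    return terabytes * price_per_tb_month


if __name__ == "__main__":
    duplicates = 120_000   # 15 per cent of 800,000 images, as cited in the article
    file_mb = 500          # assumption: every duplicate is a large archival TIFF
    low = monthly_duplicate_cost(duplicates, file_mb, 30)
    high = monthly_duplicate_cost(duplicates, file_mb, 50)
    # 120,000 files x 500 MB = 60 TB
    print(f"Estimated cost: ${low:,.0f} to ${high:,.0f} per month")
```

Under those assumptions the duplicates alone would occupy about 60 terabytes, or $1,800 to $3,000 a month. Real collections mix file sizes and formats, so an actual bill would likely be lower.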
Digitisation programs accelerated sharply during the COVID-19 period, when physical access to reading rooms was suspended. The State Library paused in-person services from March 2020, pushing staff to prioritise batch scanning of collections. Batch processes, particularly when run by multiple contractors or across different software platforms, are a known generator of duplicate files — the same image ingested twice under different metadata tags, or uploaded from both a working drive and a backup simultaneously.
What Institutions Are Actually Doing About It
Deduplication is not glamorous work. It involves running automated hash-matching tools across file libraries, then manually reviewing flagged pairs where the algorithm is uncertain, a process that blends software efficiency with old-fashioned curatorial judgment. The Australian Institute for the Conservation of Cultural Material, which has members working across Victorian institutions including Museums Victoria in Carlton, has published guidance on establishing version-of-record protocols to prevent duplication from recurring after a clean-up.
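The automated half of that process can be sketched in a few lines of Python. The example below is a minimal illustration of exact hash matching, not the tooling any named institution uses: it walks a folder, computes a SHA-256 digest for each file, and groups files whose contents are byte-for-byte identical. The function names and the folder path are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MB chunks."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def find_exact_duplicates(root):
    """Group files under `root` that share identical contents.

    Returns a list of lists; each inner list holds two or more paths
    whose bytes match exactly.
    """
    by_digest = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]


if __name__ == "__main__":
    # "scans" is a placeholder folder name for this example.
    for group in find_exact_duplicates("scans"):
        print("Identical files:", ", ".join(str(p) for p in group))
```

Reading in chunks keeps memory use flat even for 500-megabyte TIFFs. An exact hash only catches identical bytes: the same scan saved with different embedded metadata or at a different resolution produces a different digest, which is one reason a manual review step remains.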
Museums Victoria, which manages roughly 17 million objects across its collections including the Melbourne Museum on Nicholson Street, has been among the more transparent institutions about its digital asset challenges, acknowledging in its annual reports that ongoing collection digitisation requires sustained investment in metadata standards. Those standards, when poorly enforced, are frequently the root cause of duplicate entries.
For smaller organisations (community archives, local history societies in suburbs like Footscray and Brunswick, or arts organisations in the Collingwood cluster) the duplication problem is often invisible until a storage bill spikes or a wrong image goes public. Free tools like dupeGuru or open-source scripts built on Python's hashlib module offer a starting point, but without metadata discipline on the front end, duplicates return.
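One way a small organisation might apply that front-end discipline is to check each file against a register of known checksums before accepting it. The sketch below is an assumption about how such a check could look, not a description of dupeGuru or of any archive's actual workflow. The class `IngestRegistry` is hypothetical, and it keeps its index in memory only; a real register would need to be saved somewhere durable.

```python
import hashlib


class IngestRegistry:
    """Hypothetical in-memory register of files already accepted."""

    def __init__(self):
        # Maps SHA-256 digest -> name of the first file seen with that content.
        self._seen = {}

    def register(self, name, data):
        """Record a new file, or report the earlier file it duplicates.

        Returns None if the content is new, otherwise the name of the
        file that was registered first with identical bytes.
        """
        digest = hashlib.sha256(data).hexdigest()
        existing = self._seen.get(digest)
        if existing is not None:
            return existing
        self._seen[digest] = name
        return None


if __name__ == "__main__":
    registry = IngestRegistry()
    registry.register("town_hall_1923.tif", b"example image bytes")
    clash = registry.register("town_hall_copy.tif", b"example image bytes")
    if clash:
        print(f"Rejected: identical to {clash}")
```

The first file to arrive becomes the version of record for that content, and later copies are flagged rather than stored.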
The practical advice from archivists is blunt: audit before you migrate. Any institution planning to move its image library to a new content management system — a common step during the current digital strategy rollout — should run a full deduplication pass first. Migrating clean data costs less and creates fewer long-term problems than cleaning up a mess on the other side of a system switch. That lesson, at least, is not expensive to learn.