State Library Victoria confirmed this week it is accelerating a structured audit of its digitised photographic holdings after an internal review identified tens of thousands of duplicate image files across its online catalogue — redundant scans, re-uploaded surrogates and near-identical derivatives that have compounded over more than a decade of digitisation projects. The audit, which the library says will run through to the end of 2026, targets collections held across its Swanston Street reading rooms and its off-site storage at Ballarat Road, Maidstone.
The issue matters now because Victoria is deep in a broader push to make its cultural heritage collections genuinely searchable and accessible online. Duplicate records don't just waste server space; they fragment discovery, return garbled search results and create real confusion for researchers trying to establish which version of an image is authoritative. With the state government having committed funding to expand the library's digital infrastructure as part of Creative Victoria's 2025–2028 cultural investment framework, getting the underlying data right has become a precondition for further development.
From Swanston Street to Southbank: Who's cleaning up
State Library Victoria is not alone. The City of Melbourne's own digital archives team, based at the Melbourne Town Hall annex on Little Collins Street, has been running what it internally calls a "de-duplication sprint" since June 2, targeting images held in its heritage photographic register — a collection that spans planning records, infrastructure photography and community documentation accumulated since the early 2000s. Council's digital records unit has flagged the problem as partly a legacy of COVID-era digitisation, when remote working meant multiple staff sometimes scanned and uploaded the same physical items independently.
The Australian Centre for the Moving Image at Federation Square has faced a related but distinct version of the problem. ACMI's collection management team has been reconciling stills extracted from digitised film reels: frames that automated systems flagged as unique but that turn out to be near-duplicates from adjacent moments in a sequence. The centre declined to provide specifics on the scale of its backlog, but its collection database, built on the Axiell EMu platform, was updated this financial year with new similarity-detection parameters.
Why duplicates pile up — and what fixing them costs
Digital preservation specialists point to a predictable pattern. Institutions digitise collections in project-based bursts, often under grant funding with fixed end dates, then move on without systematic reconciliation of what was already online. A 2024 survey by the Digital Preservation Coalition, which has member institutions in Australia including the National Library, found that duplicate or near-duplicate records accounted for between 8 and 22 percent of holdings across mid-sized cultural collections — a range wide enough to suggest the problem is common but poorly measured.
Fixing it is not cheap. Software tools for perceptual hashing — the standard technique for identifying visually similar images even when file names or metadata differ — typically cost between $8,000 and $40,000 annually for enterprise-grade licensing, depending on collection size. Open-source alternatives exist but require significant staff time to configure and maintain. For smaller institutions like the Footscray Community Arts Centre, which has been digitising its thirty-year photographic archive with support from the Maribyrnong City Council, the practical answer has been manual review by trained volunteers working through the collection file by file — slower, but less reliant on infrastructure budgets that don't exist.
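For readers curious about how perceptual hashing works in practice, the following is a minimal Python sketch of one simple variant, the "average hash". It is illustrative only: none of the institutions named here has said which algorithm or tool it uses, and the function names, the 8x8 grid size and the toy images are assumptions made for this example. The idea is to shrink an image to a tiny greyscale grid, mark each cell as brighter or darker than the grid's mean, and compare the resulting bit patterns. File names and metadata play no part in the result.

```python
from typing import List

Grid = List[List[int]]  # greyscale pixels (0-255), row-major


def shrink(pixels: Grid, size: int = 8) -> Grid:
    """Downsample to size x size by averaging equal blocks.

    Assumes both image dimensions are exact multiples of `size`.
    """
    block_h = len(pixels) // size
    block_w = len(pixels[0]) // size
    small = []
    for r in range(size):
        row = []
        for c in range(size):
            block = [pixels[y][x]
                     for y in range(r * block_h, (r + 1) * block_h)
                     for x in range(c * block_w, (c + 1) * block_w)]
            row.append(sum(block) // len(block))
        small.append(row)
    return small


def average_hash(pixels: Grid, size: int = 8) -> int:
    """Return a size*size-bit hash: 1 where a cell is brighter than the mean."""
    flat = [v for row in shrink(pixels, size) for v in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for v in flat:
        bits = (bits << 1) | (1 if v > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Count the bits that differ between two hashes."""
    return bin(a ^ b).count("1")


# Toy 16x16 images (hypothetical data): a left-to-right gradient,
# a slightly brightened re-scan of it, and a mirrored version.
original = [[x * 16 for x in range(16)] for _ in range(16)]
rescan = [[min(255, v + 10) for v in row] for row in original]
mirrored = [list(reversed(row)) for row in original]

dist_rescan = hamming(average_hash(original), average_hash(rescan))
dist_mirrored = hamming(average_hash(original), average_hash(mirrored))
print(dist_rescan)    # 0: the brightened re-scan hashes identically
print(dist_mirrored)  # 64: every bit differs for the mirrored image
```

In a real workflow, pairs whose distance falls below a chosen threshold would be queued for human review rather than deleted automatically, since some near-identical files are deliberate preservation copies.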
The Royal Historical Society of Victoria, headquartered on William Street, announced on Tuesday that it would partner with the University of Melbourne's School of Computing and Information Systems to pilot an automated deduplication workflow on a subset of its glass plate negative scans. The pilot, involving roughly 4,000 images, is expected to run for three months and produce a publicly available methodology report by October 2026.
For researchers and members of the public who use these collections, the immediate practical advice is straightforward: when searching state or council image catalogues online, cross-check accession numbers rather than relying on thumbnail previews alone, since duplicates often carry different identifiers. Institutions working through audits have also noted that some records flagged as duplicates will ultimately be retained as deliberate preservation copies, so not every redundant-looking result is an error. The cleanup is ongoing, and the catalogues will remain imperfect through the second half of this year.