In 2005, Hurricane Katrina slammed into the Gulf Coast of the United States. In the chaos that followed (flooded streets, collapsed infrastructure, overwhelmed emergency services) one of the most effective response tools was a simple mashup. Volunteers took NOAA’s freely available satellite imagery and overlaid it with street maps and rescue reports, creating a real-time picture of which areas were flooded and where people were stranded.
Nobody asked permission to use the satellite data. Nobody needed to negotiate a licence. The data was open (published by a government agency, freely available, machine-readable) and when disaster struck, people could build on it immediately.
That’s the promise of open data. Not just transparency for its own sake, but the raw material for things nobody anticipated.
The concept isn’t new. Weather services have shared observation data internationally since the nineteenth century: the 1853 Brussels Maritime Conference standardised weather observations at sea, and the 1873 Vienna International Meteorological Congress established protocols for exchanging measurements across borders. National mapping agencies have published geographic data for centuries. Scientific journals have published researchers’ methods and observations for others to check since the Royal Society’s earliest publications in the 1660s; the entire scientific method depends on the ability to verify and build on others’ work.
What’s changed is the scale, the format, and the expectation. The internet makes distribution trivial. Machine-readable formats make automated analysis possible. And a growing global movement, catalysed by the Obama administration’s 2009 Open Government Directive in the US (followed in 2013 by an executive order making open, machine-readable data the default), the UK’s launch of data.gov.uk in 2010, and the G8 Open Data Charter in 2013, argues that data collected with public money should be available to the public by default.
The argument is both principled and practical. Principled because in a democracy, citizens have a right to see what their government does with their money and their data. Practical because open data creates economic value: when businesses, researchers, and developers can freely access government data, they build things that benefit everyone. Studies from the European Commission and the McKinsey Global Institute have estimated that open data could generate trillions of dollars in economic value globally, though such estimates should be taken with appropriate caution, because measuring the counterfactual (what would have happened without open data) is inherently speculative.
Let me tell you what open data is, why it matters, what makes it hard, and when the correct answer is to keep the data closed.
What open data actually is
The term gets thrown around loosely, so let’s be precise. Open data is data that meets three conditions:
- Freely available. Anyone can access it without paying, without registering, without signing a restrictive licence.
- Machine-readable. Published in a format that software can parse and process: CSV, JSON, GeoJSON, XML. Not a scanned PDF. Not a photograph of a spreadsheet.
- Licensed for reuse. Explicitly published under a licence that allows anyone to use, modify, and redistribute the data. Common licences include Creative Commons CC-BY (use it, just credit the source) and CC0 (public domain, no restrictions at all).
All three conditions matter. Data behind a paywall isn’t open, even if it’s in a great format. Data in a beautiful CSV file isn’t open if the licence says “for personal, non-commercial use only.” A government report published as a 400-page PDF is technically available, but it’s not machine-readable: a human can read it, but software can’t easily extract the numbers, the tables, the time series buried inside it.
The Open Knowledge Foundation puts it simply: “Open data is data that can be freely used, re-used and redistributed by anyone, subject only, at most, to the requirement to attribute and share-alike.”
It’s worth emphasising the machine-readable requirement, because it’s where many well-intentioned publishers fall short. A council that publishes its meeting minutes as a scanned PDF has technically made the information available. But a researcher can’t search across thousands of minutes for references to a specific development site. A journalist can’t automatically extract voting records. A civic tech developer can’t build a tool that alerts residents to relevant decisions. The information is available to a human with time and patience, but not to software. And in an era where the most powerful analysis tools are computational, “not available to software” increasingly means “not practically available at all.”
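To make the contrast concrete, here is a minimal sketch of what machine-readable publication makes possible. The CSV content, the column names, and the `votes_for` helper are all invented for illustration; no real council is assumed to publish exactly this file.

```python
import csv
import io

# Hypothetical extract of council voting records, as it might look if
# published as CSV rather than as a scanned PDF.
SAMPLE_CSV = """meeting_date,item,councillor,vote
2023-03-14,DA-101 Smith St rezoning,Nguyen,for
2023-03-14,DA-101 Smith St rezoning,Patel,against
2023-04-11,DA-117 Harbour Rd carpark,Nguyen,against
2023-04-11,DA-117 Harbour Rd carpark,Patel,for
"""


def votes_for(csv_text, keyword):
    """Return (date, councillor, vote) for every agenda item mentioning keyword."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row["meeting_date"], row["councillor"], row["vote"])
        for row in reader
        if keyword.lower() in row["item"].lower()
    ]


print(votes_for(SAMPLE_CSV, "smith st"))
```

A dozen lines of code answer a question that would take hours of reading if the same records were locked inside scanned pages.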
Why it matters
The case for open data rests on three pillars: transparency, innovation, and accountability. Each one is worth taking seriously.
Transparency. In a democracy, citizens fund the government. The government collects data, enormous quantities of it, as part of its operations. Census results, budget allocations, health statistics, environmental monitoring, infrastructure records. Open data says: this information belongs to the public, and the public should be able to see it, scrutinise it, and use it.
This isn’t abstract. When the Australian government publishes the federal budget with downloadable data tables, journalists and researchers can analyse spending patterns, track changes over time, and hold the government to its promises. When a local council publishes its development applications, residents can see what’s being built in their neighbourhood and object if they need to.
Innovation. This is the one that surprises people. Open data doesn’t just serve the people who publish it; it serves people the publishers never imagined.
Consider transport. When transit agencies publish their timetables in the GTFS format (General Transit Feed Specification, originally developed by Google and Portland’s TriMet) and their real-time vehicle positions in its companion format, GTFS Realtime, anyone can build on it. Google Maps uses it. Apple Maps uses it. Citymapper uses it. Accessibility apps use it to help vision-impaired passengers navigate public transport. Research groups use it to study urban mobility patterns. None of these uses were planned by the transit agencies. They just published the data, and the ecosystem grew.
Or consider weather. Australia’s Bureau of Meteorology (BoM) and the US National Oceanic and Atmospheric Administration (NOAA) publish vast quantities of weather observation and forecast data. Nearly every weather app on your phone is built, directly or indirectly, on top of publicly funded data like this. The private weather industry, worth billions globally, exists because governments decided that weather observations should be open. The raw data is free. The value-added products (the slick apps, the hyperlocal forecasts, the agricultural decision tools) are where commercial innovation happens.
Or consider geospatial data. The US Geological Survey’s Landsat program has been capturing satellite imagery of the Earth’s surface since 1972. In 2008, the USGS made the entire archive free. The impact was immediate and enormous: the number of Landsat scenes downloaded went from about 25,000 per year (at $600 each) to over a million per year (free). Researchers used the data to track deforestation, map urban growth, monitor agricultural land use, and measure glacier retreat. The European Space Agency’s Sentinel program followed suit, publishing free high-resolution imagery of the entire planet every five days. These datasets are the foundation of modern environmental monitoring, and they’re free because policymakers decided that the public benefit of open access outweighed the revenue from selling individual images.
Health data tells a similar story. During the COVID-19 pandemic, open data was the difference between informed response and guesswork. Johns Hopkins University’s COVID-19 Dashboard, built on openly published case data from governments worldwide, became the definitive global tracker. Genomic sequences shared through GISAID enabled researchers to track variants in near-real-time. Vaccination data published by health agencies powered the models that informed reopening decisions. None of this would have been possible if each government had kept its data locked behind access agreements and bureaucratic approval processes.
Accountability. Open data is a check on power. When government spending data is published, corruption becomes harder to hide. When environmental monitoring data is open, companies can’t quietly exceed pollution limits without someone noticing. When police use-of-force data is published, patterns become visible.
In 2010, the OpenCorporates project began aggregating company registration data from open government registers around the world. The resulting database (over 200 million companies) has been used by journalists investigating tax havens, by compliance teams verifying business partners, and by researchers studying corporate ownership structures. None of this would be possible if the underlying data were locked behind individual government portals with incompatible formats and restrictive licences.
The pattern across all these examples is the same: the publisher creates the data for one purpose, and the open licence enables a thousand others. The transit agency publishes timetables so passengers can plan trips. A researcher uses the same data to model how cities grow. An entrepreneur uses it to build a business. The original purpose is served, and so are purposes that were never imagined. This is the economic argument for open data in a sentence: the value of the data exceeds the value that any single organisation can extract from it.
The five-star model
In 2010, Tim Berners-Lee (the inventor of the World Wide Web) proposed a five-star rating system for open data. It’s a useful framework for thinking about data quality, even if the upper stars remain aspirational for most publishers.
| Stars | What it means | Example |
|---|---|---|
| ★ | Available on the web, any format, open licence | A PDF of a budget report |
| ★★ | Machine-readable structured data | An Excel spreadsheet of the same data |
| ★★★ | Open, non-proprietary format | CSV instead of Excel |
| ★★★★ | URIs to identify things | Each entity has a stable web address |
| ★★★★★ | Linked to other datasets | Your data references and connects to related datasets elsewhere |
Most open data in the wild is at two or three stars. Government agencies publish CSV files with open licences. That’s genuinely useful, and it’s where most of the practical value lives.
The four- and five-star levels describe the Linked Data vision: a web of interconnected datasets where every entity has a URI and datasets reference each other using those URIs. In theory, you could follow links from a company’s registration record to its filed accounts, to the directors’ other directorships, to the properties they own, all across different datasets maintained by different organisations. It’s a beautiful idea. In practice, the overhead of maintaining stable URIs, consistent ontologies, and working linked-data infrastructure has kept this mostly in the realm of research projects and a few dedicated institutions like the BBC and Wikidata.
Don’t let the perfect be the enemy of the good. Three-star open data (CSV files, open licence, stable URLs) is enormously valuable. Publish that first. Worry about linked data later, if ever.
The five-star model is a spectrum of technical quality, not a measure of usefulness. A one-star PDF of a government budget is less technically sophisticated than a five-star linked dataset, but if nobody’s publishing the budget data at all, that PDF is infinitely more useful than nothing. The model is a guide for improvement, not a gate for participation. Start where you are. Improve when you can.
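If it helps to see the ladder as logic, here is a rough sketch of the five-star scheme as a function. The format lists and the `star_rating` name are my own simplifications, not part of Berners-Lee’s scheme, and the real four- and five-star levels assume RDF-style linked data rather than a pair of boolean flags.

```python
# Illustrative format lists only; a real assessment would be more nuanced.
OPEN_FORMATS = {"csv", "json", "geojson", "xml"}     # non-proprietary
STRUCTURED_FORMATS = OPEN_FORMATS | {"xls", "xlsx"}  # machine-readable, maybe proprietary


def star_rating(fmt, open_licence, uses_uris=False, links_out=False):
    """Approximate a dataset's star rating from a few coarse properties."""
    if not open_licence:
        return 0  # without an open licence it is not open data at all
    fmt = fmt.lower()
    if fmt not in STRUCTURED_FORMATS:
        return 1  # on the web with an open licence, e.g. a PDF
    if fmt not in OPEN_FORMATS:
        return 2  # structured but proprietary, e.g. Excel
    if not uses_uris:
        return 3  # structured and non-proprietary, e.g. CSV
    return 5 if links_out else 4


for fmt in ("pdf", "xlsx", "csv"):
    print(fmt, star_rating(fmt, open_licence=True))
```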
Standards and catalogues
If open data is going to be useful beyond the organisation that published it, people need to be able to find it, understand it, and combine it with other datasets. This is where standards come in.
DCAT (Data Catalog Vocabulary) is a W3C standard for describing datasets in a catalogue. It defines a common vocabulary for metadata: what’s the dataset called, who published it, when was it last updated, what format is it in, what’s the download URL, what licence does it use. If every data portal describes its datasets using DCAT, tools can aggregate across portals and present a unified search experience.
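As a rough illustration, a DCAT description of a single dataset might be assembled like this. The `dcat:` and `dct:` terms are from the DCAT and Dublin Core vocabularies; the `dcat_dataset` helper, the example.org URL, and the dataset itself are hypothetical. Values are plain strings for readability, whereas a production record would typically express the licence, publisher, and download URL as IRIs.

```python
import json


def dcat_dataset(title, publisher, modified, licence_url, download_url, media_type):
    """Build a simplified DCAT dataset description as a JSON-LD dictionary."""
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dct:publisher": publisher,
        "dct:modified": modified,
        "dct:license": licence_url,
        "dcat:distribution": [
            {
                "@type": "dcat:Distribution",
                "dcat:downloadURL": download_url,
                "dcat:mediaType": media_type,
            }
        ],
    }


record = dcat_dataset(
    title="Street tree inventory",            # hypothetical dataset
    publisher="Example City Council",         # hypothetical publisher
    modified="2024-05-01",
    licence_url="https://creativecommons.org/licenses/by/4.0/",
    download_url="https://data.example.org/trees.csv",
    media_type="text/csv",
)
print(json.dumps(record, indent=2))
```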
CKAN is the open-source platform that powers many of the world’s open data portals, including data.gov.au, the UK’s data.gov.uk, and the US data.gov. It provides a web interface for browsing and searching datasets, an API for programmatic access, and built-in support for DCAT metadata. If you’re looking for open data from a government, chances are the portal is running CKAN.
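For a sense of what the API looks like, here is a small sketch built around CKAN’s `package_search` action. The portal address is a placeholder and `SAMPLE_RESPONSE` is a hand-written, trimmed-down imitation of the response shape, so treat it as an illustration rather than a recording of any real portal. No network request is made.

```python
import json
from urllib.parse import urlencode


def package_search_url(portal, query, rows=5):
    """Build the URL for a CKAN package_search request."""
    params = urlencode({"q": query, "rows": rows})
    return f"{portal.rstrip('/')}/api/3/action/package_search?{params}"


def summarise(response_text):
    """Return (title, [resource formats]) for each dataset in a search response."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    return [
        (dataset["title"], [res.get("format", "") for res in dataset.get("resources", [])])
        for dataset in payload["result"]["results"]
    ]


# Hand-written stand-in for a real response body (assumption, heavily trimmed).
SAMPLE_RESPONSE = json.dumps({
    "success": True,
    "result": {
        "count": 1,
        "results": [
            {"title": "Bus stop locations",
             "resources": [{"format": "CSV"}, {"format": "GeoJSON"}]}
        ],
    },
})

print(package_search_url("https://data.example.org", "bus stops"))
print(summarise(SAMPLE_RESPONSE))
```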
Schema.org provides a vocabulary for structured data on the web. If you add Schema.org markup to a dataset’s web page, search engines can understand what the page describes: that it’s a dataset, what it covers, who published it. Google’s Dataset Search uses this markup to index datasets across the web.
GeoJSON is the de facto standard for geospatial data on the web. Property boundaries, bus routes, park outlines, sensor locations: anything with a geographic component can be expressed in GeoJSON, and virtually every mapping library (Leaflet, Mapbox, Google Maps) can consume it directly.
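A minimal sketch of turning a list of sensor locations into GeoJSON follows. The sensor names and the `to_feature_collection` helper are made up for the example. One detail that trips people up: GeoJSON positions are written longitude first, then latitude.

```python
import json


def to_feature_collection(sensors):
    """Convert dicts with name/lat/lon keys into a GeoJSON FeatureCollection."""
    return {
        "type": "FeatureCollection",
        "features": [
            {
                "type": "Feature",
                # GeoJSON coordinate order is [longitude, latitude].
                "geometry": {"type": "Point", "coordinates": [s["lon"], s["lat"]]},
                "properties": {"name": s["name"]},
            }
            for s in sensors
        ],
    }


# Hypothetical sensors placed at approximate city-centre coordinates.
SENSORS = [
    {"name": "Sensor A", "lat": -33.8688, "lon": 151.2093},
    {"name": "Sensor B", "lat": -37.8136, "lon": 144.9631},
]

collection = to_feature_collection(SENSORS)
print(json.dumps(collection, indent=2))
```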
JSON-LD (JSON for Linking Data) is a way to express linked data using ordinary JSON. It’s the bridge between the practical world of JSON APIs and the semantic web world of RDF and URIs, and what Schema.org recommends for embedding structured data in web pages.
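Here is roughly what that embedding looks like in practice: a Schema.org `Dataset` description serialised as JSON-LD and wrapped in a script tag for a dataset’s landing page. The property names are Schema.org’s; the helper name, dataset, and URLs are placeholders of my own.

```python
import json


def dataset_jsonld_script(name, description, licence_url, publisher, csv_url):
    """Return an HTML script element carrying Schema.org Dataset markup."""
    markup = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "license": licence_url,
        "creator": {"@type": "Organization", "name": publisher},
        "distribution": [
            {
                "@type": "DataDownload",
                "encodingFormat": "text/csv",
                "contentUrl": csv_url,
            }
        ],
    }
    body = json.dumps(markup, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


snippet = dataset_jsonld_script(
    name="Street tree inventory",                      # hypothetical dataset
    description="Location and species of council-managed street trees.",
    licence_url="https://creativecommons.org/licenses/by/4.0/",
    publisher="Example City Council",
    csv_url="https://data.example.org/trees.csv",
)
print(snippet)
```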
The pattern here is that standards enable interoperability. When everyone describes their data the same way, tools can work across datasets without custom integration for each one. When everyone ignores standards (and many do, because implementing them properly takes effort) you end up with thousands of data portals that each require bespoke code to access.
There’s an uncomfortable truth about standards in the open data world: there are too many of them. DCAT, Schema.org, Dublin Core, SDMX, ISO 19115 (for geospatial metadata), DataCite (for research data), INSPIRE (for European spatial data): the list goes on. Each was designed for a legitimate purpose by a legitimate standards body, and each has a community of users. But the proliferation itself creates a barrier. A government agency trying to publish its first dataset faces a bewildering array of metadata standards before it’s even chosen a file format. The result, too often, is paralysis, or a decision to publish a CSV with no metadata at all, which is less than ideal but infinitely better than not publishing.
The pragmatic advice: use DCAT if you’re running a data portal. Use Schema.org if you want search engines to find your data. Use the domain-specific standard if you’re publishing data for a specialised community (GTFS for transport, GeoJSON for maps, HL7 FHIR for health). And don’t let the perfect metadata standard prevent you from publishing the data.
The irony is that the standards community, the people who care most about interoperability, has created an interoperability problem of its own. But the answer isn’t fewer standards. It’s better guidance about which standard to use when, and more tools that handle the mapping between standards automatically. The W3C’s Data on the Web Best Practices document is a reasonable starting point for anyone trying to find their way through this landscape.
How to publish well
If you’re in a position to publish open data (whether you’re a government agency, a non-profit, a research institution, or a company choosing to share) here’s what good practice looks like.
Stable URLs. Every dataset should have a permanent URL that doesn’t change. If you reorganise your website, redirect the old URLs to the new ones. People will build systems that depend on your data being at a specific address. Breaking those URLs breaks their systems. Tim Berners-Lee wrote that “cool URIs don’t change” in 1998, and it’s as true now as it was then. If you can’t commit to keeping a URL alive, you’re not ready to publish open data at that URL.
Use HTTPS. This should go without saying, but serve your data over HTTPS. Data served over plain HTTP can be intercepted and modified in transit. If someone is building a system that makes decisions based on your data, and they will be, the integrity of that data matters. HTTPS is free (thanks to Let’s Encrypt) and there’s no excuse for not using it.
Machine-readable formats. CSV for tabular data. JSON or GeoJSON for structured or geospatial data. APIs for data that changes frequently. Not PDF. Not Word documents. Not Excel files with merged cells and colour-coded rows that make sense to a human but are impenetrable to software. If you’re unsure, publish CSV. It’s the lowest common denominator of open data formats: every programming language can read it, every spreadsheet application can open it, and it doesn’t require any special software or libraries. It’s not glamorous, but it works.
Documentation. What does each column mean? What are the units? What’s the geographic coverage? What’s the time range? What are the known limitations? A data dictionary (a document that describes each field in the dataset) is the minimum. Without it, users have to guess, and they’ll guess wrong.
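A data dictionary also makes a cheap automated check possible: does the file you are about to publish actually match what you documented? A minimal sketch, with an invented dictionary and an invented sample file:

```python
import csv
import io

# Hypothetical data dictionary: column name -> meaning and units.
DATA_DICTIONARY = {
    "station_id": "Identifier assigned by the publishing agency",
    "observed_at": "Observation time, ISO 8601, UTC",
    "rainfall_mm": "Rainfall since 9am local time, in millimetres",
}


def check_header(csv_text, dictionary):
    """Compare a CSV header row against the documented columns."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return {
        "undocumented": [col for col in header if col not in dictionary],
        "missing": [col for col in dictionary if col not in header],
    }


# Sample file whose header has drifted away from the documentation.
SAMPLE = "station_id,observed_at,rain,quality_flag\nS001,2024-01-01T00:00:00Z,1.2,A\n"

print(check_header(SAMPLE, DATA_DICTIONARY))
```

Here the check would flag `rain` and `quality_flag` as undocumented and `rainfall_mm` as missing, which is exactly the kind of drift that leaves users guessing.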
Update schedules. If the data is updated monthly, say so. If the update is late, say so. If the dataset is abandoned, say so. Users need to know whether the data they’re relying on is current, and silence is ambiguous.
Versioning. When the schema changes (new columns added, old columns renamed, date formats changed) don’t just overwrite the old file. Version your datasets. Keep the old versions available. Document what changed and when. Schema changes break downstream systems, and the people who depend on your data deserve warning.
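Documenting what changed can be partly automated. The sketch below compares the column lists of two versions of a dataset; the column names are invented, and a real changelog would also need to cover changes in types, units, and date formats, which this does not attempt.

```python
def schema_changes(old_columns, new_columns):
    """List columns added and removed between two versions of a dataset."""
    old, new = set(old_columns), set(new_columns)
    return {
        "added": sorted(new - old),
        "removed": sorted(old - new),
    }


# Hypothetical headers for version 1 and version 2 of the same dataset.
V1 = ["stop_id", "stop_name", "lat", "lon"]
V2 = ["stop_id", "stop_name", "latitude", "longitude", "wheelchair_access"]

print(schema_changes(V1, V2))
```

Note that a rename shows up as one removal plus one addition, which is precisely how it will feel to a downstream system that was reading the old column.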
Licensing. Be explicit. Put a licence on it. If you want maximum reuse, use CC0 (no restrictions) or CC-BY 4.0 (attribution required). If there’s no licence, cautious users will assume they can’t use it. In many jurisdictions, data without a licence defaults to “all rights reserved.”
The licence should be stated clearly on the dataset’s landing page, in the metadata, and ideally in a machine-readable format (a license field in your DCAT metadata, a license property in your JSON). Don’t bury it in a terms-of-service page that requires three clicks to find. Make it obvious. Developers evaluating whether to use your data will check the licence first, and if they can’t find it quickly, they’ll move on to a dataset where the terms are clear.
API access. For datasets that are large or frequently updated, provide an API. Let users query for the subset they need rather than downloading a 20-gigabyte file to find three rows. Pagination, filtering, and structured responses (JSON, not HTML) are the basics. The CKAN API is a good model if you’re running CKAN; otherwise, a simple REST API with clear documentation will do.
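From the consumer’s side, pagination usually looks something like the loop below. This is a sketch assuming a simple offset-and-limit scheme; `fake_fetch_page` stands in for a real HTTP call, and real APIs vary (some use cursors or "next" links instead).

```python
def fetch_all(fetch_page, page_size=100):
    """Yield every row from a paginated source, one page at a time."""
    offset = 0
    while True:
        page = fetch_page(offset=offset, limit=page_size)
        if not page:
            break
        yield from page
        if len(page) < page_size:
            break  # a short page means we have reached the end
        offset += page_size


# Stand-in for a remote dataset of 250 rows (assumption for the example).
ROWS = [{"id": i} for i in range(250)]


def fake_fetch_page(offset, limit):
    """Pretend API call: return one slice of the dataset."""
    return ROWS[offset:offset + limit]


rows = list(fetch_all(fake_fetch_page, page_size=100))
print(len(rows))
```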
Metadata. Every dataset should have a description, a publisher, a date of last update, a licence, a geographic and temporal coverage statement, and a contact point. This sounds like overhead, but without it, your data is a file on a server with no context. Users can’t evaluate whether it’s suitable for their purpose. Search engines can’t index it properly. Other data portals can’t catalogue it. Five minutes of metadata saves thousands of hours of user confusion.
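The checklist above is easy to enforce before anything goes live. A minimal sketch, where the field names are my own shorthand for the items listed rather than terms from any particular standard:

```python
# Shorthand field names for the checklist above (not from a formal standard).
REQUIRED_FIELDS = [
    "description",
    "publisher",
    "last_updated",
    "licence",
    "spatial_coverage",
    "temporal_coverage",
    "contact_point",
]


def missing_metadata(record):
    """Return the required fields that are absent or blank in a metadata record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Hypothetical record with two gaps.
draft = {
    "description": "Monthly rainfall totals by station",
    "publisher": "Example Weather Office",
    "last_updated": "2024-05-01",
    "licence": "CC-BY 4.0",
    "spatial_coverage": "Example State",
    "temporal_coverage": "",
}

print(missing_metadata(draft))
```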
Bulk download. APIs are great for targeted queries, but some users need the whole dataset. Researchers building models, developers building offline-first apps, archivists preserving public records: they need to download everything. Provide a bulk download option alongside the API. A compressed CSV or a database dump. Make it easy to get the lot.
When not to publish
Open data is a default worth striving for. But it’s not an absolute. There are legitimate reasons to keep some data closed, and pretending otherwise does the movement a disservice.
Privacy. The most obvious and most important constraint. Personal data (health records, financial transactions, location histories, communications) should not be published openly, full stop. But the boundary isn’t always clear, and “anonymised” data is less anonymous than people think.
In 2006, Netflix published a dataset of 100 million movie ratings from 480,000 subscribers, stripped of names, as part of the Netflix Prize competition. Researchers at the University of Texas demonstrated that by cross-referencing the “anonymous” ratings with public reviews on IMDb, they could identify individual users. The lesson: removing names doesn’t make data anonymous. If a dataset contains enough attributes, re-identification is often possible.
Similarly, in 2014, New York City published a dataset of taxi trips with medallion numbers, pickup and dropoff locations, and timestamps. The medallion numbers were “anonymised” by hashing them with MD5, but because medallion numbers follow a handful of short, fixed formats, the space of possible values is small enough to hash exhaustively. Researchers trivially reversed the hashes and reconstructed the complete trip history of every taxi in the city. Combined with public photos of celebrities entering taxis, this revealed specific individuals’ movements.
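It is worth seeing how little effort this kind of attack takes. The sketch below uses a made-up identifier format (digit, letter, digit, digit), not the real medallion scheme, and simply hashes every possible value in advance. Any unsalted hash of an identifier drawn from a small space can be reversed the same way.

```python
import hashlib
import string
from itertools import product


def md5_hex(text):
    """Return the hex MD5 digest of an ASCII string."""
    return hashlib.md5(text.encode("ascii")).hexdigest()


def build_lookup():
    """Precompute hash -> identifier for every value of a hypothetical format.

    Assumed format: digit, uppercase letter, digit, digit (e.g. "7B42"),
    which allows only 10 * 26 * 10 * 10 = 26,000 possible identifiers.
    """
    table = {}
    for d1, letter, d2, d3 in product(
        string.digits, string.ascii_uppercase, string.digits, string.digits
    ):
        identifier = f"{d1}{letter}{d2}{d3}"
        table[md5_hex(identifier)] = identifier
    return table


LOOKUP = build_lookup()

published_hash = md5_hex("7B42")   # what the "anonymised" dataset would contain
print(LOOKUP[published_hash])      # the original identifier, recovered by lookup
```

The fix is not a stronger hash function. It is to replace identifiers with random values that have no computable relationship to the originals, or to use a keyed hash whose key is never published.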
These aren’t edge cases. They’re the norm. Latanya Sweeney’s foundational research showed that 87% of the US population can be uniquely identified by just three attributes: date of birth, gender, and five-digit ZIP code. Think about how many “anonymised” datasets contain at least that much information.
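You can run a crude version of this test on your own data before publishing. The sketch below measures what fraction of records are unique on a chosen set of quasi-identifiers; the four records are fabricated for illustration and the helper name is mine.

```python
from collections import Counter


def unique_fraction(records, quasi_identifiers):
    """Fraction of records whose quasi-identifier combination appears only once."""
    keys = [tuple(record[q] for q in quasi_identifiers) for record in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(keys)


# Fabricated "anonymised" records: no names, but still identifying.
RECORDS = [
    {"dob": "1985-02-11", "gender": "F", "zip": "02139", "diagnosis": "asthma"},
    {"dob": "1985-02-11", "gender": "F", "zip": "02139", "diagnosis": "diabetes"},
    {"dob": "1990-07-30", "gender": "M", "zip": "02139", "diagnosis": "asthma"},
    {"dob": "1972-12-05", "gender": "F", "zip": "02141", "diagnosis": "fracture"},
]

print(unique_fraction(RECORDS, ["dob", "gender", "zip"]))
```

Every record that is unique on those attributes can be tied to a person by anyone holding another dataset with the same three fields and a name attached.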
De-identification is hard, and the consequences of getting it wrong fall on the people whose data was exposed, not on the people who published it. Techniques like differential privacy, which adds carefully calibrated noise to data to protect individuals while preserving aggregate statistics, offer a way forward, and major organisations like the US Census Bureau and Apple have adopted them. But differential privacy requires expertise to implement correctly, and it involves genuine tradeoffs: more privacy means less precision, and the acceptable balance depends on the use case.
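To give a flavour of the tradeoff, here is a toy version of the Laplace mechanism, one of the standard building blocks of differential privacy, applied to a single count. It is a teaching sketch under simple assumptions (one counting query, sensitivity of 1), not something to deploy: production systems also have to track a privacy budget across queries and guard against floating-point subtleties.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise.

    The difference of two independent exponential variables, each with
    mean `scale`, follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng):
    """Return a noisy count; smaller epsilon means more privacy, more noise.

    A counting query changes by at most 1 when one person is added or
    removed, so the noise scale is 1 / epsilon.
    """
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(42)  # fixed seed only so the sketch is repeatable
for epsilon in (0.1, 1.0, 10.0):
    print(epsilon, round(private_count(1000, epsilon, rng), 2))
```

Running it with different values of `epsilon` shows the balance directly: at small values the published count can be off by tens, at large values it is nearly exact but offers little protection.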
Security. Some data creates risk when published. Detailed floor plans of government buildings. The locations of critical infrastructure control systems. Vulnerability data before patches are available. The specifics of which government systems run which software versions.
This isn’t hypothetical. In 2009, a lengthy draft document detailing US civilian nuclear sites, marked as sensitive, was inadvertently posted on the Government Printing Office’s website. The document was pulled, but not before it had been downloaded and cached. The tension between transparency and security is real, and reasonable people can disagree about where to draw the line. But the line exists. There’s a reason that responsible disclosure exists in security research: some information is genuinely dangerous in the wrong hands, and publishing it openly can cause more harm than the transparency is worth.
Commercial sensitivity. Governments collect commercially sensitive information as part of regulation: tax returns, business financials, trade data at the individual-company level. Publishing this openly would harm the businesses that provided it, and would make businesses less willing to provide accurate information in future. Aggregate statistics (total tax revenue by sector) are appropriate for publication; individual returns are not. The line between “aggregate enough to be safe” and “granular enough to be useful” is a judgement call, and different jurisdictions draw it differently. Australia’s ABS has decades of experience with this balance, applying statistical techniques like suppression of small cell counts and perturbation of values to protect individual businesses while preserving the analytical value of the data.
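Small-cell suppression is simple to sketch, even though doing it properly is not. The example below blanks any count under a threshold; the threshold of 5 and the sector figures are illustrative assumptions, not the ABS’s actual rules. Real disclosure control also needs secondary suppression, so that a hidden cell cannot be recovered by subtracting the visible cells from a published total.

```python
def suppress_small_cells(table, threshold=5):
    """Replace any count below the threshold with None (meaning "suppressed")."""
    return {
        category: (count if count >= threshold else None)
        for category, count in table.items()
    }


# Made-up counts of businesses by sector in one small region.
BUSINESSES_BY_SECTOR = {
    "Retail": 120,
    "Agriculture": 48,
    "Mining": 3,       # few enough that individual firms could be inferred
    "Forestry": 1,
}

print(suppress_small_cells(BUSINESSES_BY_SECTOR))
```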
Indigenous knowledge and data sovereignty. This is one that the open data movement has been slow to grapple with, but it’s critically important.
Not all knowledge should be open. Many Indigenous communities, in Australia and globally, hold cultural knowledge that is sacred, restricted, or governed by protocols that determine who can access it and under what circumstances. Dreamtime stories, ceremonial practices, knowledge of Country, ecological knowledge passed down through generations: these are not “data” waiting to be “opened.” They belong to the communities that hold them, and the decision about whether and how to share them rests with those communities.
The CARE Principles for Indigenous Data Governance (Collective Benefit, Authority to Control, Responsibility, Ethics) provide a framework for thinking about this. They complement the FAIR Principles (Findable, Accessible, Interoperable, Reusable) that guide open science, but they centre the rights and interests of Indigenous peoples rather than the convenience of data users.
In Australia, organisations like the Maiam nayri Wingara Indigenous Data Sovereignty Collective are leading this conversation. The principle is straightforward: Indigenous people should have control over data about them and by them. Open data policies need to respect this, not override it.
This isn’t an abstract concern. Government agencies hold large quantities of data about Indigenous Australians: health records, welfare data, demographic statistics, land use records, cultural heritage surveys. Historically, this data has been collected, analysed, and published by non-Indigenous institutions, often without meaningful consultation. The open data movement’s instinct to publish everything can conflict with Indigenous communities’ right to control narratives about themselves. Getting this right requires genuine partnership, not just a licence checkbox.
Incomplete or misleading data. Data without context can be worse than no data at all. Crime statistics without information about reporting practices. Health outcomes without demographic context. School rankings without socioeconomic indicators. Publishing raw numbers and letting people draw their own conclusions sounds democratic, but in practice it often produces misleading conclusions: the “lies, damned lies, and statistics” problem that Mark Twain popularised (and probably didn’t coin).
This doesn’t mean you shouldn’t publish imperfect data. It means you should publish it with context, caveats, and documentation. Explain the methodology. Describe the limitations. Provide enough information for users to interpret the data responsibly. A dataset published with a clear statement of its limitations is vastly more useful, and less dangerous, than a dataset published with no context at all. The metadata isn’t a nice-to-have; it’s the difference between information and misinformation.
Open data and AI
There’s a newer dimension to the open data conversation that’s worth acknowledging: the relationship between open data and artificial intelligence.
Large language models and machine learning systems need training data. Lots of it. The quality and breadth of that training data directly affects the quality of the resulting model. Open datasets (Wikipedia, Common Crawl, government statistical databases, scientific publications) are among the most important sources. Without open data, the AI revolution would be significantly more expensive, more concentrated among a few companies that could afford proprietary data, and less capable overall.
This creates tensions that the open data community is still working through. When a company trains a commercial AI model on data that was published under an open licence, is that a success story for open data (the data is being used, creating value, exactly as intended) or an exploitation of the commons (a for-profit company extracting value from publicly funded work without reciprocating)? The answer depends on the licence: CC0 data explicitly permits commercial use with no obligation. But the philosophical question is live, and it’s reshaping how some publishers think about their licensing choices.
Some organisations have responded by adopting more restrictive licences that prohibit AI training. Others have doubled down on openness, arguing that restricting data use undermines the principles that made open data valuable in the first place. The debate isn’t settled, and it won’t be for some time. But it’s worth being aware of as both a publisher and a consumer of open data.
There’s a related question about the datasets that AI systems produce. When a machine learning model trained on open data generates synthetic data, predictions, or classifications, should those outputs be open too? Some argue yes: if the inputs were open, the outputs should be too, maintaining the commons. Others argue that the model itself represents significant investment and the outputs are a derivative work that the model creator should control. This is new territory, and the licences written in the pre-AI era weren’t designed to answer these questions. Expect this to be a major area of policy development in the coming years.
The Australian landscape
Australia’s open data story is a mixed bag: some genuine successes, some persistent frustrations.
data.gov.au is the national open data portal, run by the Australian Government. It hosts thousands of datasets from federal agencies, covering everything from climate observations to Medicare statistics to electoral boundaries. It runs on CKAN, it’s searchable, and the datasets are generally well-licensed, most often under Creative Commons Attribution licences.
State and territory portals add another layer. New South Wales has data.nsw.gov.au. Queensland has data.qld.gov.au. Victoria, South Australia, Western Australia, Tasmania, the ACT, and the Northern Territory all have their own portals. The quality varies. Some are well-maintained with regularly updated datasets and good metadata. Others are graveyards of stale CSV files last touched in 2017.
Australia’s federal structure creates its own problem. Similar data (health, education, transport, planning) is collected by different levels of government, in different formats, with different schemas, different update schedules, and different licences. Combining state-level transport data into a national picture requires mapping between different GTFS feeds, different stop ID systems, different route naming conventions. It’s doable, but it’s work that standardisation should have made unnecessary.
The Bureau of Meteorology is a standout. BoM publishes observations, forecasts, radar imagery, and climate data that’s used by agriculture, emergency services, aviation, and millions of people checking the weather on their phones. Its data is the foundation of weather services in Australia. But BoM’s data licensing has historically been more restrictive than you might expect for a publicly funded agency, a tension that’s been debated for years. The commercial weather industry has argued that BoM’s data should be fully open to enable innovation and competition. BoM has argued that its cost-recovery model (charging commercial users for premium data products) helps fund its operations. It’s a microcosm of the broader tension between open access and sustainable funding that runs through the entire open data movement.
The Australian Bureau of Statistics (ABS) deserves mention too. The ABS publishes an enormous range of statistical data, from the Census to labour force statistics to consumer price indices, through its website and APIs. The Census data, in particular, is extraordinarily detailed: population, age, employment, education, language, housing, and more, broken down by geography from the national level all the way down to individual mesh blocks (areas of about 30-60 dwellings). This data is the foundation of urban planning, market research, service delivery, and academic research across Australia. It’s freely available, well-documented, and regularly updated. When open data works well, it looks like this.
Geospatial data has seen significant progress. Geoscape Australia (formerly PSMA) provides authoritative address, building, and land parcel data. The Digital Earth Australia program publishes satellite-derived datasets covering the entire continent: land cover, water observations, surface reflectance. These are powerful resources for researchers, planners, and developers building location-based services.
The Digital Transformation Agency (DTA), and before it the Department of Finance, has pushed for a whole-of-government approach to open data. The Australian Government’s Open Data Policy establishes that government data should be open by default: published proactively, in machine-readable formats, under open licences. The intent is right. The execution is uneven. Some agencies are exemplary. Others treat “open data” as a compliance checkbox, publishing a handful of datasets and calling it done.
The gap between policy and practice is real, but the trajectory is positive. More data is open today than five years ago. More agencies understand why it matters. And the community of users (developers, journalists, researchers, civic technologists) continues to grow, creating demand that makes it harder for agencies to retreat to closed defaults.
Civic technology has flourished alongside the official portals. Australia has an active civic tech community that builds tools on open data. PlanningAlerts aggregates development applications from local councils across the country and lets residents subscribe to alerts for their neighbourhood, all built on openly published planning data. They Vote For You tracks how members of parliament vote, using data from Hansard and parliamentary records. OpenAustralia makes parliamentary proceedings searchable and accessible. These projects exist because the underlying data is open, and they add a layer of usability and accountability that the original publishers didn’t build.
The civic tech model is powerful: volunteers and small organisations take raw open data and transform it into tools that serve the public interest. But it’s also fragile. These projects typically run on donated time and minimal funding. When a government changes its data format or breaks a URL, the volunteer who maintains the scraper has to fix it on a Saturday. Stability of open data infrastructure isn’t just nice to have; it’s the foundation that the entire civic tech ecosystem depends on.
The economics of open data
Who pays for this? Open data isn’t free to produce. Collecting, cleaning, maintaining, documenting, and publishing data costs money. Somebody has to run the servers, maintain the APIs, respond to user queries, and update the datasets when the underlying reality changes.
For government data, the answer is straightforward: taxpayers fund it. The data is collected as part of the government’s operations (running a census, monitoring the weather, recording property transactions) and the marginal cost of publishing it openly is small compared to the cost of collection. The Lateral Economics report commissioned by the Australian Government estimated that open data could add $25 billion per year to the Australian economy through improved efficiency, innovation, and reduced transaction costs. Even if that estimate is optimistic, the return on investment is substantial.
The harder question is what happens when open data replaces a revenue stream. BoM historically charged for some data products. Ordnance Survey in the UK charged for map data. PSMA in Australia charged for address data. When these datasets become free, the agencies that produced them need alternative funding. The UK resolved this by giving Ordnance Survey an explicit government mandate and funding stream. Australia has moved in a similar direction with Geoscape. The transition isn’t always smooth, but the principle is sound: if the public funded the collection, the public should benefit from the access.
For non-government data, the economics are different. Companies that publish open data do so for strategic reasons: building ecosystems, attracting developers, establishing standards, generating goodwill. OpenStreetMap relies on volunteer contributors. Wikipedia relies on donations. Academic datasets are funded by research grants. Each model has its strengths and its fragilities. The sustainability question (who pays, for how long, and what happens if they stop) is one that the open data movement needs to take seriously. A dataset that disappears when a grant expires or a government changes is worse than a dataset that was never published, because people built things that depended on it.
The most resilient open data is data that’s embedded in an organisation’s core operations: weather data collected by a national meteorology bureau, land records maintained by a property registry, census data collected by a national statistics office. These organisations exist to produce the data, and publishing it is a marginal cost on top of the collection. The least resilient is data maintained by a single enthusiast, funded by a short-term grant, hosted on infrastructure that costs money every month. Sustainability isn’t glamorous, but it matters.
Practical advice for builders
If you’re a developer, researcher, or analyst looking to use open data, here’s what I’ve learned.
Start with the portals. data.gov.au for Australian federal data. State portals for state data. Google Dataset Search for everything else. Awesome Public Datasets on GitHub is a curated list worth bookmarking. The World Bank Open Data portal is excellent for international development and economic indicators. Our World in Data curates and visualises datasets on global issues from poverty to climate change, with all underlying data downloadable. For Australian research data specifically, the Australian Research Data Commons (ARDC) maintains a catalogue of research datasets across Australian institutions.
Check the licence. Before you build anything, read the licence. CC0 and CC-BY are safe for almost any use. More restrictive licences (non-commercial, no-derivatives) may limit what you can do. Government data in Australia is often under Creative Commons, but not always; check each dataset individually.
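If you're pulling many datasets from a portal, a first-pass licence triage can be automated. The sketch below is a rough illustration only: the `classify_licence` helper and its three labels are hypothetical, it assumes licences appear as short strings like "CC-BY 4.0", and treating share-alike licences as "review" is my own conservative choice. It is no substitute for reading the licence itself.

```python
import re


def classify_licence(licence: str) -> str:
    """Rough triage of a licence string: 'permissive', 'restricted' or 'review'.

    Hypothetical helper. Anything it does not recognise is sent to 'review',
    meaning a human should read the licence text.
    """
    tokens = re.findall(r"[a-z0-9]+", licence.lower())
    if "cc0" in tokens:
        return "permissive"
    if "cc" in tokens and "by" in tokens:
        # Non-commercial (NC) and no-derivatives (ND) limit what you can build.
        if "nc" in tokens or "nd" in tokens:
            return "restricted"
        # Share-alike (SA) imposes conditions on derived works; check by hand.
        if "sa" in tokens:
            return "review"
        return "permissive"
    return "review"


for text in ["CC0", "CC-BY 4.0", "CC BY-NC 4.0", "Custom government licence"]:
    print(f"{text!r}: {classify_licence(text)}")
```

The default matters more than the matching rules: anything unrecognised falls through to "review", never to "permissive".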
Check the freshness. When was the dataset last updated? Is the update schedule documented? If you’re building a service that depends on up-to-date data, you need to know whether the data is maintained or abandoned. A dataset that was last updated three years ago might still be useful for historical analysis, but it’s not a reliable foundation for a live service. I’ve seen projects fail because they built on a government dataset that was being actively maintained, only for a machinery-of-government change to shift responsibility to a different agency and break the update pipeline. Open data has operational risk, and you should plan for it.
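A freshness check is easy to automate once you know the last-updated date and the expected update interval. Here is a minimal sketch, assuming you can get both from the portal's metadata; the `freshness` function, its labels, and the `tolerance` factor are hypothetical choices of mine rather than any standard.

```python
from datetime import date, timedelta
from typing import Optional


def freshness(last_updated: date, expected_interval: timedelta,
              today: Optional[date] = None, tolerance: float = 2.0) -> str:
    """Label a dataset 'fresh', 'late' or 'stale' from its last update date.

    'late' means one expected update has been missed; 'stale' means the age
    exceeds the expected interval multiplied by `tolerance`.
    """
    today = today or date.today()
    age = today - last_updated
    if age <= expected_interval:
        return "fresh"
    if age <= expected_interval * tolerance:
        return "late"
    return "stale"


# A dataset documented as monthly, last updated 1 January 2024.
monthly = timedelta(days=30)
print(freshness(date(2024, 1, 1), monthly, today=date(2024, 1, 20)))
print(freshness(date(2024, 1, 1), monthly, today=date(2024, 2, 15)))
print(freshness(date(2024, 1, 1), monthly, today=date(2024, 6, 1)))
```

Run something like this on a schedule and alert on "late", not just "stale". A missed update is often the first sign that the pipeline behind the dataset has changed hands.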
Validate the data. Open data is not clean data. Expect missing values, inconsistent formatting, encoding issues, duplicate records, and schema changes between versions. Build validation into your pipeline. Don’t trust the data blindly. I’ve seen government datasets with latitude and longitude columns swapped, date fields that switch between DD/MM/YYYY and MM/DD/YYYY partway through, and numeric columns that contain the string “N/A” instead of a null value. None of this is malicious. It’s the inevitable result of data being compiled by humans, across departments, over years, with inconsistent tooling. Clean it before you use it.
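The three problems I mentioned (swapped coordinates, mixed date formats, "N/A" strings in numeric columns) can each be caught with a few lines of validation. The sketch below is illustrative only: the column names (`latitude`, `longitude`, `date`), the list of null markers, and the sample rows are assumptions made up for the example, and it uses nothing beyond the Python standard library.

```python
import csv
import io
from datetime import datetime

# Strings treated as missing values (an assumed list; extend for your data).
NULL_MARKERS = {"", "n/a", "na", "null", "-"}


def parse_number(raw):
    """Return a float, or None when the cell holds a null marker."""
    text = raw.strip()
    if text.lower() in NULL_MARKERS:
        return None
    return float(text)


def check_row(row):
    """Return a list of problems found in one CSV row (empty if none)."""
    problems = []
    try:
        lat = parse_number(row.get("latitude") or "")
        lon = parse_number(row.get("longitude") or "")
    except ValueError:
        return ["non-numeric coordinate"]
    if lat is None or lon is None:
        problems.append("missing coordinate")
    elif not -90 <= lat <= 90:
        # Only detectable when the real longitude lies outside -90..90,
        # which happens to be true for all of Australia.
        if -90 <= lon <= 90 and -180 <= lat <= 180:
            problems.append("latitude/longitude possibly swapped")
        else:
            problems.append("latitude out of range")
    elif not -180 <= lon <= 180:
        problems.append("longitude out of range")
    try:
        # Catches MM/DD/YYYY only when the day exceeds 12; a value like
        # 03/04/2021 is ambiguous and passes.
        datetime.strptime((row.get("date") or "").strip(), "%d/%m/%Y")
    except ValueError:
        problems.append("date is not DD/MM/YYYY")
    return problems


def validate(csv_text):
    """Map file line numbers (header is line 1) to the problems found."""
    report = {}
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_no, row in enumerate(reader, start=2):
        problems = check_row(row)
        if problems:
            report[line_no] = problems
    return report


SAMPLE = (
    "name,latitude,longitude,date\n"
    "Sydney,-33.87,151.21,25/03/2021\n"
    "Swapped,151.21,-33.87,25/03/2021\n"
    "Gap,N/A,151.21,03/25/2021\n"
)
for line_no, problems in validate(SAMPLE).items():
    print(line_no, problems)
```

On the sample, line 3 is flagged as a possible coordinate swap and line 4 for a missing coordinate and a non-conforming date. Note the limits written into the comments: range checks catch only some swaps and some date mix-ups, so treat a clean report as necessary, not sufficient.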
Cache and version. If you depend on an external dataset, keep a local copy. URLs break. Servers go down. Datasets get restructured without warning. Version your local copies so you can track changes and roll back if an update introduces problems.
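One simple way to do this is to store each download under a name that combines a timestamp with a hash of the content, so unchanged downloads don't pile up and every real change leaves a copy you can roll back to. This is a sketch under stated assumptions: `save_snapshot` and `fetch_and_snapshot` are hypothetical helpers, the naming scheme is just one option, and a real pipeline might prefer a tool such as Git or object storage with versioning.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path
from typing import Tuple
from urllib.request import urlopen


def save_snapshot(content: bytes, cache_dir: Path,
                  suffix: str = ".csv") -> Tuple[Path, bool]:
    """Store content as <UTC timestamp>_<hash prefix><suffix> in cache_dir.

    Returns (path, is_new). If a snapshot with the same content hash already
    exists, nothing is written and that existing path is returned.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(content).hexdigest()[:16]
    existing = sorted(cache_dir.glob(f"*_{digest}{suffix}"))
    if existing:
        return existing[0], False
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = cache_dir / f"{stamp}_{digest}{suffix}"
    path.write_bytes(content)
    return path, True


def fetch_and_snapshot(url: str, cache_dir: Path) -> Tuple[Path, bool]:
    """Download url and snapshot the response body (no retries, for brevity)."""
    with urlopen(url, timeout=30) as response:
        return save_snapshot(response.read(), cache_dir)
```

Calling `fetch_and_snapshot` on a schedule gives you a dated history of the dataset. Because the timestamp leads the filename, sorting the directory listing puts snapshots in chronological order, and the newest file is your current version.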
Contribute back. If you find errors in a dataset, report them. Most government data portals have a feedback mechanism, even if it’s just a contact email. If you build a useful tool on top of open data, share it. If you clean and enrich a dataset, publish your improvements. The open data ecosystem works because people contribute to it, not just consume from it. Some of the most valuable open data resources (OpenStreetMap, Wikidata, OpenAddresses) exist entirely because individuals contributed their time and knowledge to a shared commons.
Understand the provenance. Who collected this data? When? Using what methodology? What population does it represent? What biases might it contain? A dataset of reported crimes doesn’t tell you about crime; it tells you about reported crime, which is a very different thing. A dataset of hospital admissions doesn’t tell you about disease prevalence; it tells you about the subset of sick people who went to a hospital. Every dataset is a lens, and understanding the shape of that lens is the difference between insight and illusion.
Respect the intent. Just because data is openly licensed doesn’t mean every use is appropriate. Location data for domestic violence shelters, even if technically public, shouldn’t be aggregated into a searchable database. Re-identification of anonymised personal data isn’t illegal in every jurisdiction, but it’s ethically wrong. The licence tells you what you can do. Your judgement tells you what you should do.
Think about the people in the data. Datasets about crime, health, housing, welfare, immigration, and disability describe real people’s lives. Aggregated, those people are statistics. But behind every row is a person, and the way you analyse, present, and discuss data has consequences for how those people are perceived and treated. “Data-driven decision making” sounds neutral, but data reflects the biases of the systems that collected it. Arrest data reflects policing priorities, not just crime rates. Hospital data reflects who has access to healthcare, not just who’s sick. Treat the data with the same respect you’d want if you were in it.
Build for sustainability. If your project depends on an open dataset, plan for the day it disappears. Mirror it. Cache it. Document the source so you can find it again if the URL changes. And if you’re publishing data yourself, think about the long term. Will this URL work in five years? Will someone maintain the update schedule after you move on? Sustainability isn’t the exciting part of open data, but it’s the part that determines whether the ecosystem endures.
Internationally: who’s doing it well
A few countries and institutions stand out as leaders in open data, and they’re worth studying for what they get right.
The United Kingdom has been a pioneer. Its portal, data.gov.uk, was one of the first national open data portals, launched in 2010 after advocacy from Tim Berners-Lee. The UK government publishes detailed spending data, performance metrics, and geographic data under open licences. The Open Data Institute, co-founded by Berners-Lee and Nigel Shadbolt, has been influential in developing best practices and training governments worldwide.
The United States publishes an extraordinary volume of data through data.gov and agency-specific portals. NOAA’s weather data, NASA’s satellite imagery, the Census Bureau’s demographic data, the SEC’s financial filings: the US federal government’s commitment to open data is broad and deep, though unevenly implemented across agencies.
The European Union has taken a regulatory approach with the Open Data Directive, which requires member states to make government data available for reuse. The European Data Portal aggregates datasets from across the EU. And the Copernicus programme publishes some of the world’s best Earth observation data freely through the Sentinel satellites.
What these leaders have in common is political commitment, institutional infrastructure, clear licensing, and a community of users who hold publishers accountable. Open data doesn’t happen by accident. It requires someone (a minister, a senior official, a persistent advocate) to decide that it matters and to fund the infrastructure that makes it work. And it requires sustained attention, because the natural tendency of bureaucracies is to revert to closed defaults once the political champion moves on.
The case, stated plainly
Open data isn’t a panacea. It doesn’t automatically produce transparency, or innovation, or accountability. Those outcomes depend on people: people who publish data carefully, people who use it responsibly, people who build tools that make it accessible, and people who hold institutions accountable when the data reveals problems.
But the alternative (data locked behind paywalls, buried in PDFs, restricted by default, published only when it suits the publisher) is worse. Closed data is a missed opportunity for every app that won’t get built, every investigation that can’t proceed, every researcher who can’t replicate a result, every citizen who can’t see how their taxes are spent.
The best argument for open data is the things people build with it. Weather apps. Transit planners. Flood maps. Budget trackers. Accessibility tools. Epidemiological models. Election dashboards. Agricultural decision support. Most of these exist because someone, somewhere, decided to publish a dataset under an open licence, in a machine-readable format, at a stable URL.
That’s it. That’s the recipe. Publish the data. Make it machine-readable. Licence it for reuse. Keep the URL stable. Document it. Update it.
And then get out of the way, because the most interesting uses will be ones you never imagined.
The Katrina volunteers who mashed satellite imagery with street maps didn’t wait for permission. The developer who built the first transit app from GTFS data didn’t need a partnership agreement. The journalist who found corruption in government spending data didn’t need special access. The data was there. Open. Machine-readable. Licensed for reuse.
That’s the power of it. Not the data itself, but the permission, granted in advance, to everyone, for any purpose, to use it.
Open data is infrastructure. Like roads, like electricity, like the internet itself. It’s most valuable when it’s universal, reliable, and free at the point of use. It requires investment, maintenance, and political will.
And its greatest returns are the ones nobody planned for.