When promoting open data, we could worry less about explaining the importance of file formats and more about inspiring people to publish good data.
The argument for open data is often coupled with a call to make it ‘machine-readable’, which in turn tends to be expressed with an air of exasperation: ‘saving data in csv format is dead easy, so there’s no excuse not to’.
That may or may not be the case; I suspect it depends a lot on the experience, capacity, conceptual understanding and professional freedom of the person holding the data. It also requires some explanation of what is meant by ‘machine-readable’. Arguably all electronic formats are machine-readable, even pdfs produced from a photograph of text: technology is developed to tackle such problems, and the technology to translate an image into digital text (optical character recognition) already exists.
I agree that using an open format is important, and I appreciate that the five-star scheme suggested by Tim Berners-Lee encourages people to get into the habit of publishing at all before concerning themselves with which format to publish in. However, when I hear people argue for open data they invariably press the issue of using an open format, usually csv (comma-separated values).
What worries me is that format becomes a distraction when trying to convince other people to publish data. The issue of format is not as easy to defend as we think it should be, and so can dilute the argument. In my experience it can take a fair leap of understanding to appreciate why others might want to use the data in the first place, let alone why anyone would benefit from their being able to do so as easily as possible.
More pressing, in my opinion, is the need to raise awareness of the integrity of the data being published, and to thrash out some agreement on structuring it. As my friend Simon put it, what are the column headings they should be using?
That is, quite rightly, acknowledged as a difficult problem, as each local authority currently publishes its data to a different schema (if to any at all). But cracking that particular nut would make the data itself much more useful and interoperable, even if some of it does require a few extra steps to digitise.
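To make Simon's question a little more concrete, here is a minimal sketch in Python of what checking a published file against an agreed set of column headings might look like. The headings, the function name `check_headings` and the sample rows are all invented for illustration; I'm not aware of any agreed schema of this kind, which is rather the point.

```python
import csv
import io

# Hypothetical agreed headings, made up for this example.
AGREED_HEADINGS = ["authority", "date", "supplier", "amount"]

def check_headings(csv_text, agreed=AGREED_HEADINGS):
    """Compare the first row of csv_text with the agreed headings.

    Returns a pair of lists: (missing headings, unexpected headings).
    Headings are compared case-insensitively, ignoring surrounding spaces.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    found = [h.strip().lower() for h in header]
    missing = [h for h in agreed if h not in found]
    unexpected = [h for h in found if h not in agreed]
    return missing, unexpected

# An invented file that uses "Payee" where the agreed heading is "supplier".
sample = "Authority,Date,Payee,Amount\nExampleshire,2012-01-31,Acme Ltd,1200.00\n"
missing, unexpected = check_headings(sample)
print("missing:", missing)
print("unexpected:", unexpected)
```

The check itself is trivial; the hard part is the agreement it presupposes, since someone has to decide that the column is called ‘supplier’ and not ‘payee’.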
Certainly it’s important to encourage data to be published in as open and durable a format as possible, and currently that format appears to be csv. There is a clear need to provide a plain English explanation of the difference between, say, publishing as csv and publishing as pdf.
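For readers who are comfortable with a little code, the difference can be illustrated roughly as follows. This is only a sketch using Python's standard `csv` module, with invented figures and column names: a csv file can be read straight into named columns and totalled in a few lines, whereas the same table inside a pdf would first need its text extracted (or recognised from an image) and its rows and columns reconstructed before any of this could happen.

```python
import csv
import io

# Invented spending data, standing in for a published csv file.
csv_text = "supplier,amount\nAcme Ltd,1200.00\nBolt & Co,350.50\n"

# Each row becomes a dictionary keyed by the column headings.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Values arrive as text, so convert the amounts before adding them up.
total = sum(float(row["amount"]) for row in rows)
print(total)
```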
In the meantime though, for the purpose of arguing the toss for publishing data at all, I think the issue of format may be an unnecessary distraction.
Let’s get people enthusiastic about publishing for the sake of sharing, and about publishing good quality data. Once they’re thinking in those terms, the question of format should follow naturally.