Data is the new oil – Rich Folsom

The phrase ‘data is the new oil’ is an interesting one. Apparently first coined by Clive Humby (the ‘humby‘ of Dunhumby) in 2006 – it’s a neat-sounding metaphor that someone is sure to offer whenever the use, value or processing of data is being discussed.

I want to try to provide a breakdown of where I see the analogy as being helpful, and where it risks misdirecting us.

The meaning behind the meaning

I have to admit to feeling irritated now when I hear the phrase. The speaker could just say “data is really important“ if that was their message, or they could set out why they think data has similar properties to oil if that was relevant – but they typically don’t. Toxic oil and intangible data of course feel so different, which is what has made the saying catch on. However, it is often left as an exercise for the audience to figure out why data and oil are in fact comparable – which makes it not much of a teaching tool or explanation, and not an insight in itself.

The meaning

I think the straightforward meaning of the analogy is to say that, like oil, data is:

highly valuable;
unevenly distributed;
not much use to anyone until you process it; and
a (the?) key limiting factor of potential or differentiation in the relevant industrial era.

What is data?

Let’s define a few terms:

Information – a signal of something, manifesting in some sort of orderliness of the universe, with reduced randomness and entropy
Data – a compressed, filtered type/representation of information, in an accessible format, typically at volume, typically excluding the format itself however (e.g. a spreadsheet file is not the data, the data is ‘inside’ it)
A copy of (some) data – the physical manifestation of some data on a medium (like a hard drive, or sheet of paper)
Rights in data – a legal right to own and use (or prevent others using/enjoying) certain data. This might come in the form of confidentiality, copyright, database right, a patent (!), data protection / privacy rights, privilege, official secrets, trade secrets, etc.

This gives another insight in line with the analogy: oil is superior to coal partly because of its higher energy density (about double). Data is a valuable manifestation of information because it is a high density version of it.

With this framework, we can start applying different verbs or qualifiers to these concepts.

More!

Having more copies of data does not mean there is more data (or information, or rights in the data). Having more data, equally, does not necessarily mean there is any more information (in fact it might mean the real information is harder to find, and so the data less valuable). Generally though, when business people refer to ‘more’ data, they imply that they have more information, which therefore carries more value. To the technologist however, ‘more data = more value’ is a much more qualified statement than, for example, saying ‘more oil = more value’.
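To make the copies / data / information distinction a little more concrete, here is a rough sketch in Python. It rests on an assumption of mine rather than a rigorous measure: the size of a payload after zlib compression is used as a crude stand-in for how much information it carries. The names `compressed_size`, `original`, `ten_copies` and `padded` are made up for illustration.

```python
import random
import zlib


def compressed_size(payload: bytes) -> int:
    """Crude proxy for information content: size in bytes after zlib compression."""
    return len(zlib.compress(payload, 9))


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# 2,000 pseudo-random bytes: dense, hard-to-compress "signal".
original = bytes(rng.randrange(256) for _ in range(2_000))

# More copies: the same bytes repeated ten times.
ten_copies = original * 10

# More data, but no more signal: the original plus 18,000 zero bytes of filler.
padded = original + b"\x00" * 18_000

for label, payload in [
    ("original", original),
    ("ten copies", ten_copies),
    ("padded", padded),
]:
    print(f"{label:>10}: {len(payload):>6} raw bytes, "
          f"{compressed_size(payload):>5} compressed")
```

Both of the larger payloads are ten times the raw size of the original, but they compress to roughly the same size as it does, because neither contains anything the original did not. A barrel count does not behave like that.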

Format shifting

Transforming information into data, and data into derived data, is fundamental to its use, value and life-cycle. Choices about what sort of sensor can receive information, how to capture it, filter it, encode it, and what format to structure the resultant data in require judgment and skill. The requirement of human effort, skill, or originality is part of the policy justification and trade-off for the state granting a monopoly over the output (i.e. IP rights in the data).

For example, copyright in a database arises from the original selection or arrangement of data in a database that uses judgment, taste or discretion. This originality needs to relate to the structure of the database itself, not the data within it. A separate ‘database right’ arises from investment in obtaining (from a third party source), verifying or presenting the data in a database.

Splitting it up

We can also see that one can have custody of a data copy (and can own the medium it is on), and separately, there might be a variety of people with different rights in the data (Party A might own the hard drive, Party B the copyright and Party C the database right, while Party D might be the data controller, Party E a data subject, etc.). Each of these parties would have physical or legal methods to control the use of the data, and contracts with each other.

One might think that rights in the data, and the medium the data is stored on, are independent of each other (see, for example, the English law position that a lien is over a chose in possession, not a chose in action like a database). This is not always the case, however. For example:

in law…
- extra IP-related privileges/waivers granted to the physical owner of the medium storing the data that the IP rights subsist in (for example, the trade mark is exhausted upon first sale, so a customer can re-sell a physical item on a second-hand basis without infringing the copyright or trade mark);
- a software patent can be granted where the patent is not for the “computer program as such” – so you can patent the software where the inventive step relates to the physical world (monitor dimensions, semi-conductor design, etc.) – i.e. it depends on the physical medium on which the data (software code) is running; and
- the exemption to English copyright for a lawful user to create any backup copy of a computer program, or a private use copy of non-computer program works – the inherent rights are specific to physical copies of the data; and
in practice…
- ‘Tivoization’ where hardware and software are entangled so that even though a third party has IP rights to use the open source software, they cannot properly enjoy those rights because specific hardware is needed to do so; and
- cryptocurrencies.

To use or distribute data you need a copy of it, and should have the necessary rights (to the extent they exist) in the data. For many commercial purposes, the data might be viewed as being the ‘rights in the data’ – because the data ~~can’t~~ shouldn’t be used or exploited in excess of those rights.

These rights in data are not legal technicalities – they are made in, and only exist in the legal system. The legal permissions and restrictions that apply to them define them, and their lifecycle, entirely. Sometimes you hear of people tempted to discount rights in data as being of less importance because they are intangible. I would remind them that the data itself is also intangible.

Differences

Analogies simplify and abstract – this can help with initial understanding, but of course they have limits, beyond which they are counterproductive. Analogies taken beyond their useful limits can prime us to use a wrong mental framework. And analogies between distant things (like data and oil) can prime us to use a very wrong framework. This is quite important, especially when the analogy is meant to be an aid for people new to the space.

Fungibility

Oil is a commodity which is largely fungible. A barrel of oil of a certain grade is just that – you can figure out what it is worth based on a price index, and you would be just as happy with the next one, as this one.

The whole point of data is that it is not fungible. A copy of data is (for most purposes) fungible with other copies. But the underlying information is not fungible with other information (and if it were, it would be the same thing). As data is not a commodity, it has to be consumed and used to understand its quality.

Rivalrous-ness

Oil is rivalrous: my right to ‘enjoy’ a barrel subtracts from yours. 2 barrels are worth double what 1 barrel is.
Data is typically described as not being rivalrous. I think that is not right. It is semi-rivalrous: giving someone else a data copy may lead to the data being less useful to me, if they then go on to compete with me to consume a scarce resource (a better driving route, a stock tip, etc.).

Fuel vs exhaust

Oil is fuel – no question about that.
Data is often the ‘exhaust’ of another process. In time, that data can then become as valuable as the process itself (or more so!) – at which point you have a ‘cycle’ without a start and end, and so no clear ‘fuel’ or ‘exhaust’. Calling the data the ‘exhaust’ of this sort of process can be seen as dismissive, luddite-ish, or old-fashioned.

Default owner

Property rights in oil and data arise differently. Property rights are inherently part of tangible objects (‘chose in possession’ to use the fancy term). Within a legal system, property rights will subsist in a gallon of oil just by it existing – without it being interacted with in any way. The property rights in data arise when a human does something to information or data, and the rights subsist in the newly created data. The rights reflect the outcome of an activity, and in a sense are a creature of, and reflection of, human dynamism and creativity.

When choosing to compare something to oil, one might be referencing oil’s role in the Industrial Era, it being an oligopolistic good, and a component of the comparative wealth of nations. Here, again, I think the analogy with data falls down. For oil, the default position is a state monopoly. For data rights, the default position is fragmented, with the initial owner being the person who did the activity. In large part, the data custody and rights accrue to the companies that create the data. Of course, we then see centralisation with data hoarders, wholesalers, etc. but that is downstream of fragmented creation.

The more that oil gets handled (on the whole, with important exceptions) the more the ownership becomes distributed. The more that data gets processed (on the whole, with important exceptions), the more the ownership becomes concentrated.

So what would work better?

The classic ‘alternative’ analogy (which I have used in the past) is ‘data is like sunlight‘. It offers a few different comparisons to the oil analogy. Data and sunlight each:

are valuable for a large range of commercial uses;
are effectively unlimited and renewable; and
require tooling and resources to capture and make useful.

I admit it does have a few of the ‘data is the new oil’ flaws that I point out above, but at least it is less of a trope.

We are now post-analogy on data

My advice is that we are now ‘post-analogy’ on data: no ‘data is like…‘ quip is needed. Data can be talked about in its own terms – plainly stating its properties. If a poetic attention-grabber is needed then go for it – but for teaching and analysis, it’s time for straight talking.