You may have heard the expression "data is the new oil" and thought that it either sounds catchy or that it’s just another slogan being thrown around. I think it is quite an accurate expression, especially in relation to AI. Let’s dive into where the expression comes from and the reasons why I think it is a good comparison.
The expression has been attributed to several individuals, but it was first coined by Clive Humby, a British mathematician and data science pioneer, in 2006. Humby made the analogy that data is like oil in the sense that it needs to be refined to be useful, and that it has the potential to become a valuable resource for companies and organizations.
To understand the analogy, let’s compare the functions of oil and data. Oil has served as a primary source of energy for various applications, such as electricity generation, transportation, and residential and commercial heating. It has also been a key ingredient in industrial processes and agricultural applications. But most relevant to our comparison, oil has been a critical driver of economic growth, creating jobs, enabling the development of new industries, and generating wealth.
Data, on the other hand, has become a crucial element in modern society. It is used to inform decision-making, drive innovation, personalize experiences, conduct research, and drive economic growth. Data analysis helps individuals, businesses, governments, and organizations to identify patterns and insights, develop new products and services, and make informed decisions. Additionally, data has become an essential resource for scientific research and has contributed to numerous advancements in various fields.
Both oil and data have helped drive economic growth and develop new technologies. I would argue that while oil is on its final lap, and its importance will start diminishing from here on, data has just entered the race, and its importance is only growing. The reason for this is simple: AI and analytics. You may have heard of trends such as big data and deep learning; both refer to the fact that the more data you have access to, the more insights you will get and the better the predictions will become. This is, in my opinion, at the heart of why data is the new oil and the new king.
After the big data and deep learning trends, there was a brief moment when researchers and AI influencers were saying that actually small, optimized data is all you need to solve business use cases with machine learning. That is true for certain cases, especially in situations where the limitation is processing power or where access to data is expensive. However, as we have seen with the recent advances in large language models (LLMs) and transfer learning, that is not the case for the majority of use cases. The capabilities of these new models are growing almost in tandem with the size of the models, and the size of the models, now measured in billions of parameters, is directly related to the amount of training data that the algorithms are given.
Clive Humby stated that data, just like oil, needs to be refined to be useful. The reason for this is that the large majority of machine learning algorithms used today rely, partly or fully, on methods called supervised learning. For these methods to function, they need training data. As an aside, this is one of the products our company StageZero provides, and also why I know a lot about the topic. Training data is refined or enriched data: data that is annotated or labeled in a way that a machine learning algorithm can be trained on it to perform a predictive function on future input.
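To make the idea of supervised learning concrete, here is a minimal sketch in plain Python. The feature vectors and labels are invented for illustration; the point is simply that a model learns from human-labeled examples and then predicts labels for new inputs (here, a k-nearest-neighbors vote, one of the simplest supervised methods).

```python
# Minimal sketch of supervised learning: labeled ("annotated") examples
# train a model that predicts labels for unseen inputs.
# Features and labels below are hypothetical, for illustration only.

import math
from collections import Counter

# Each training example pairs a feature vector with a human-provided label.
training_data = [
    ([0.9, 0.1], "positive"),
    ([0.8, 0.2], "positive"),
    ([0.2, 0.9], "negative"),
    ([0.1, 0.8], "negative"),
]

def predict(features, data, k=3):
    """Classify by majority vote among the k nearest labeled examples."""
    nearest = sorted(data, key=lambda ex: math.dist(features, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(predict([0.85, 0.15], training_data))  # -> "positive"
```

Real systems use far more sophisticated models, but the dependency is the same: without the labels in `training_data`, there is nothing to learn from.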
Take speech data, for example: it can be annotated as containing different things like emotion, speaker intent, and background noise, and it can be transcribed. All these different annotations can be used to segment and enrich data, which is then fed to an algorithm so that it can be optimized to learn the patterns found in the data and provide predictions, which is called inference in machine learning terms. The result can be algorithms that, for example, transcribe speech to text, recognize emotion from speech, or recognize speaker intent for use in voice assistants and call automation.
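As a sketch of what such enriched speech data might look like, here is one hypothetical record with several annotation layers; the field names and values are invented, not a real annotation schema.

```python
# Hypothetical example of a speech clip enriched with annotation layers.
# Field names and label values are invented for illustration.
speech_sample = {
    "audio_file": "call_0001.wav",
    "transcript": "I'd like to cancel my subscription.",
    "speaker_intent": "cancel_subscription",
    "emotion": "frustrated",
    "background_noise": "office",
}

# A dataset of such records can be segmented by any annotation layer
# before being fed to a training pipeline.
def filter_by_intent(samples, intent):
    return [s for s in samples if s["speaker_intent"] == intent]

matches = filter_by_intent([speech_sample], "cancel_subscription")
print(len(matches))  # -> 1
```

Each annotation layer enables a different predictive task: the transcript trains speech-to-text, the emotion label trains emotion recognition, and so on.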
The basic principle is that the more such training data you have, the better the algorithm becomes at making predictions. In reality, it is a bit more complicated than this. There is, for example, a diminishing return on training data, meaning that the closer you come to 100% accuracy, the more data you will need to increase it further.
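The diminishing-returns pattern can be sketched with a hypothetical power-law learning curve, a shape often observed empirically. The constants below are invented; only the qualitative behavior matters: each tenfold increase in data buys a smaller accuracy gain than the last.

```python
# Illustrative only: error often falls roughly as a power law in dataset
# size, so gains shrink as accuracy approaches 100%. Constants invented.
def accuracy(n_examples, a=0.5, alpha=0.3):
    """Hypothetical learning curve: accuracy = 1 - a * n^(-alpha)."""
    return 1.0 - a * n_examples ** -alpha

for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{n:>9} examples -> accuracy {accuracy(n):.3f}")
```

Running this shows each extra order of magnitude of data adding less accuracy than the previous one, which is exactly the diminishing return described above.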
Low-quality data adds what is called noise to the algorithms, which in the worst-case scenario reduces the performance of the models. This is similar to how refined oil functions: you need to use the right type of gasoline in your car, or in the worst-case scenario the engine will break.
Data is the new oil because of the new technologies it enables, such as AI, and the innovation and growth that follow from those technologies. Additionally, data and oil share the fact that without further refinement and enhancement, they are not very useful. Simply owning a lot of data is not worth anything in itself; you need to be able to extract insights from it for it to become an advantage.