Having worked in the AI industry for over a decade, I've gained insight into the inner workings of AI algorithm creation, and the reality is somewhat unsettling. You may have heard that the industry has recently been plagued by a growing wave of unethical business practices. Notably, many emerging AI companies have catapulted their way to success by exploiting vulnerable individuals in developing nations. However, the issue extends beyond this surface layer.
From my experience, it's far more common for the partners that companies engage to attempt fraud, such as supplying misleading data or annotations. This problem has become so pervasive that companies who regularly take on new data partners are compelled to develop proprietary in-house algorithms to scrutinize data quality in various ways.
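To give a flavor of what such checks can look like, here is a minimal sketch of a first screening pass. This is not our production pipeline; the record shape, function name, and thresholds are all hypothetical. It flags exact-duplicate annotations, implausibly fast labeling, and heavily skewed label distributions, three of the cheapest signals that a partner is padding a delivery or clicking through items without reading them:

```python
from collections import Counter

# Hypothetical annotation record: (item_id, label, seconds_spent, annotator_id)
Annotation = tuple[str, str, float, str]

def screen_batch(annotations: list[Annotation],
                 min_seconds: float = 2.0,
                 max_label_share: float = 0.8) -> list[str]:
    """Return human-readable warnings for a batch of partner annotations."""
    warnings = []

    # 1. Exact duplicates: the same item labeled twice by the same annotator
    #    suggests copy-paste padding.
    seen = set()
    for item_id, _, _, annotator in annotations:
        key = (item_id, annotator)
        if key in seen:
            warnings.append(f"duplicate annotation for item {item_id} by {annotator}")
        seen.add(key)

    # 2. Implausibly fast work: labels submitted faster than a human
    #    could plausibly read the item.
    for item_id, _, seconds, annotator in annotations:
        if seconds < min_seconds:
            warnings.append(f"item {item_id} labeled in {seconds:.1f}s by {annotator}")

    # 3. Skewed label distribution: one label dominating the batch can
    #    indicate an annotator selecting the same option without reading.
    counts = Counter(label for _, label, _, _ in annotations)
    total = sum(counts.values())
    for label, n in counts.items():
        if total and n / total > max_label_share:
            warnings.append(f"label '{label}' covers {n / total:.0%} of the batch")

    return warnings
```

None of these signals proves fraud on its own, but together they are cheap enough to run on every delivery and catch the most blatant cases before any human reviewer spends time on the data.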
At StageZero, nearly half of our data partners have, in one way or another, attempted to deceive us with erroneous data or annotations. Furthermore, almost every partner we've collaborated with has supplied subpar data or annotations at some point during our cooperation.
The AI industry lives on training data: every algorithm needs data prepared in a specific way before it can be used for training. Companies invest hundreds of millions of dollars in data acquisition, data annotation, and algorithm development. Deep learning algorithms need millions of data points to produce accurate, generalizable results. Take GPT-4, for example, OpenAI's latest and most advanced large language model, which is reported to have on the order of 220 billion parameters.
In general, all of that data should be validated by humans. For large language models, however, that is likely not the case: the scope of the models is simply too vast. This means LLMs are often trained on data that has not been vetted, which is also how we have come to see products such as ChatGPT suddenly behaving strangely.
The AI data industry has two recurring issues that you need to consider when building a product:
Over the years, we've encountered several prevalent data scams:
Regrettably, the solution to this problem is a rather cynical one – trust no partner entirely. We've observed that more trustworthy partners are often those who openly discuss incidents they've witnessed. Nevertheless, even with such partners, we recommend a rigorous verification process for the data they provide.
Our approach nowadays is to run a series of machine checks and versioning checks on all collected data, ensuring that it matches the instructions and requirements. Additionally, we have defined a review process in which a trusted team or individual validates the accuracy of the data and annotations supplied by partners.
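To make that review step concrete, it can be as simple as drawing a random sample from each partner delivery and measuring how often a trusted reviewer agrees with the partner's labels. The sketch below assumes a simple item-to-label mapping; the function name `review_sample`, the sample size, and the agreement threshold are illustrative, not our actual values:

```python
import random

def review_sample(partner_labels: dict[str, str],
                  review_fn,
                  sample_size: int = 50,
                  min_agreement: float = 0.95,
                  seed: int = 0) -> bool:
    """Accept or reject a delivery based on spot-check agreement.

    partner_labels maps item_id -> label supplied by the partner;
    review_fn(item_id) returns the trusted reviewer's label for that item.
    """
    if not partner_labels:
        return False  # nothing to review; reject empty deliveries

    # Fixed seed so a disputed batch can be re-reviewed on the same items.
    rng = random.Random(seed)
    items = rng.sample(sorted(partner_labels), min(sample_size, len(partner_labels)))

    # Count how often the trusted reviewer agrees with the partner's label.
    agreed = sum(1 for item in items if review_fn(item) == partner_labels[item])
    agreement = agreed / len(items)
    print(f"reviewed {len(items)} items, agreement {agreement:.0%}")
    return agreement >= min_agreement
```

Sampling keeps the human cost fixed regardless of delivery size, and the agreement rate gives you a single number to track per partner over time, so a partner whose quality slips shows up in the numbers long before it shows up in a trained model.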