Deceptive practices and ethical concerns in AI data

Navigating the complex terrain of AI data integrity and ethics

Having worked in the AI industry for over a decade, I've gained insight into how AI algorithms are actually built, and the reality is somewhat unsettling. You may have heard that the industry has recently been plagued by a growing wave of unethical business practices: notably, many emerging AI companies have catapulted their way to success by exploiting vulnerable individuals in developing nations. However, the issue extends beyond this surface layer.

In my experience, it is far more common for the collaborators that companies engage to attempt fraud, such as supplying misleading data or annotations. This problem has become so pervasive that companies regularly dealing with new data partners are compelled to develop proprietary in-house algorithms to scrutinize data quality through various checks.

At StageZero, nearly half of our data partners have, in one way or another, attempted to deceive us with erroneous data or annotations. Furthermore, almost all the partners we've collaborated with have supplied subpar data or annotations at some point during our cooperation.

Background: AI is fueled by training data

The AI industry runs on training data: every algorithm needs data in a specific form before it can be trained. Companies invest hundreds of millions of dollars into data acquisition, data annotation, and algorithm development. Deep learning algorithms need millions of data points to produce accurate, generalizable results. Take GPT-4, for example, OpenAI's latest and most advanced large language model, reported to have over 220 billion parameters.

In general, all that data should be validated by humans. For large language models, however, that is likely not happening, because the scope of the models is simply too vast. This means LLMs are often trained on unvetted data, which is also how we have come to see products such as ChatGPT suddenly behaving strangely.

Unethical and deceitful practices in the AI industry

The AI data industry has two recurring issues that you need to consider when building a product:

  1. Unethical practices among AI companies: Major AI players often outsource their work to regions with the lowest labor costs, sometimes paying as little as $2 per hour, as documented in several published investigations. It has also been documented that workers' pay has been withheld or deducted in various cases. To me this is not surprising, but the reason is not that the major AI companies are malicious. Rather, the delivered work has not conformed to the provided guidelines. Once you understand that data that does not follow instructions is worse than no data, you will see why companies in the AI industry must be very strict about the data they accept, which is always communicated at the start of a project.
  2. Deceitful practices of data partners: In regions where outsourcing to low-wage countries is not an option, there is a proliferation of smaller data providers catering to specific languages or use cases. These smaller players, unfortunately, exhibit a recurring pattern of deceitful practices.

Deceitful data practices you may run into

Over the years, we've encountered several prevalent data scams:

  • Synthetic data substitution: Providing synthetic data instead of the authentic data agreed upon.
  • File duplication and renaming: Duplicating files, renaming them, and claiming they contain distinct content, essentially submitting identical files multiple times for multiple payments.
  • Machine-generated annotations: Using machine learning to generate annotations while falsely attributing them to human effort.
  • Unaltered annotations: Submitting annotations without any substantial changes or improvements.
  • Empty or sparse recordings: Sending empty or almost inaudible speech recordings (screened for, along with duplicated files, in the sketch below).
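Two of these patterns, renamed duplicates and near-silent recordings, lend themselves to fully mechanical screening. Below is a minimal sketch of such a check in Python; it assumes incoming audio arrives as 16-bit PCM WAV files in a single folder, and the folder name and silence threshold are illustrative, not part of any particular toolchain.

```python
import hashlib
import math
import wave
from pathlib import Path

def file_digest(path: Path) -> str:
    """Content hash of the raw bytes; renaming a file does not change it."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def rms_level(path: Path) -> float:
    """Root-mean-square amplitude of a 16-bit PCM WAV file, normalized to 0..1.

    Assumes a little-endian platform, as is typical; values near zero
    indicate an empty or nearly inaudible recording.
    """
    with wave.open(str(path), "rb") as wav:
        frames = wav.readframes(wav.getnframes())
    samples = memoryview(frames).cast("h")  # 16-bit signed samples
    if len(samples) == 0:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples)) / 32768.0

def flag_suspect_recordings(folder: str, silence_threshold: float = 0.01) -> None:
    """Report byte-identical duplicates and near-silent recordings."""
    seen: dict[str, Path] = {}
    for path in sorted(Path(folder).glob("*.wav")):
        digest = file_digest(path)
        if digest in seen:
            print(f"DUPLICATE: {path.name} is identical to {seen[digest].name}")
        else:
            seen[digest] = path
        if rms_level(path) < silence_threshold:
            print(f"NEAR-SILENT: {path.name}")

flag_suspect_recordings("incoming_batch")  # illustrative folder name
```

Note that a plain content hash only catches the crudest case: a partner who re-encodes or pads a file will defeat it, so acoustic fingerprinting is a natural next step once the obvious duplicates are filtered out.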

Verify all data from providers

Regrettably, the solution to this problem is a rather cynical one: trust no partner entirely. We've observed that the more trustworthy partners are often those who openly discuss incidents they've witnessed. Nevertheless, even with such partners, we recommend a rigorous verification process for the data they provide.

Our approach nowadays involves running a series of machine checks and versioning checks on all collected data, ensuring that it matches the instructions and requirements. Additionally, we define a review process in which a trusted team or individual validates the accuracy of the data and annotations supplied by partners.
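As one concrete illustration of such a machine check, the sketch below compares the annotations a partner returns against the pre-annotations that were sent out, to catch the unaltered-annotations pattern described earlier. The folder layout and per-item JSON files are assumptions made for the example, not a description of any specific pipeline.

```python
import json
from pathlib import Path

def unchanged_ratio(sent_dir: str, returned_dir: str) -> float:
    """Fraction of annotation files returned with no real changes.

    Assumes one JSON annotation file per item, with matching file names
    in the 'sent' and 'returned' folders (an illustrative layout).
    """
    sent = Path(sent_dir)
    returned = Path(returned_dir)
    total = unchanged = 0
    for sent_file in sent.glob("*.json"):
        returned_file = returned / sent_file.name
        if not returned_file.exists():
            continue  # missing files belong to a separate completeness check
        total += 1
        # Compare parsed content rather than raw bytes, so that reordered
        # keys or whitespace changes cannot disguise an untouched file.
        if json.loads(sent_file.read_text()) == json.loads(returned_file.read_text()):
            unchanged += 1
    return unchanged / total if total else 0.0

# A batch where most annotations come back identical to what was sent out
# is a strong signal that no real annotation work was done.
if unchanged_ratio("annotations_sent", "annotations_returned") > 0.5:
    print("WARNING: majority of annotations returned unchanged")
```

Comparing parsed content rather than raw bytes matters here: merely re-saving a file changes whitespace and key order without any real work having been done, so a byte-level check could be fooled into counting trivially re-serialized files as genuine annotation effort.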
