In the rush to build and train ever-larger AI models, developers have swept up much of the searchable internet. That quite possibly includes some of your public data, and potentially some of your private data as well.
How do AI companies gather data?
AI companies use automated programs known as web crawlers and web scrapers to gather data from the internet. Web crawlers act like digital spiders: they follow links from one URL to another, cataloging the pages they encounter. Web scrapers then download and extract the content of those cataloged pages. For instance, OpenAI has drawn on Common Crawl, a nonprofit-maintained archive of crawled web pages, as a source of training data for its models.
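To make the crawl-then-scrape idea concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular company's pipeline works. The `PAGES` dictionary is a made-up, in-memory stand-in for the web (a real crawler would fetch pages over HTTP), and the names `LinkParser` and `crawl` are hypothetical helpers invented for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web": URL -> HTML. A real crawler would fetch over HTTP.
PAGES = {
    "https://example.com/": '<a href="/about">About</a> <a href="/blog">Blog</a>',
    "https://example.com/about": '<p>About us</p> <a href="/">Home</a>',
    "https://example.com/blog": '<p>Post</p> <a href="/about">About</a> <a href="/missing">Gone</a>',
}


class LinkParser(HTMLParser):
    """Collects href values from <a> tags and the page's visible text."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


def crawl(start_url, pages):
    """Breadth-first crawl: returns {url: extracted text} for reachable pages."""
    seen = {start_url}           # URLs already discovered, to avoid loops
    queue = deque([start_url])   # URLs waiting to be visited
    scraped = {}
    while queue:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:         # stands in for a failed fetch (e.g. a dead link)
            continue
        parser = LinkParser()
        parser.feed(html)
        parser.close()
        scraped[url] = " ".join(parser.text)   # the "scraping" step
        for href in parser.links:              # the "crawling" step
            target = urljoin(url, href)        # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return scraped


corpus = crawl("https://example.com/", PAGES)
for url, text in corpus.items():
    print(url, "->", text)
```

The crawler half is the queue of discovered links; the scraper half is the line that stores each page's extracted text. Anything reachable by following links from the starting page ends up in `corpus`, which is why content that is merely linked from a public page can be collected without anyone deliberately selecting it.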
Is my private data safe from AI training?
General web crawling typically does not reach locked-down social media accounts or private posts. However, companies such as Meta have acknowledged using public posts from their own platforms, including Facebook and Instagram, to train their AI. That raises questions about how 'public' is defined and whether private information could inadvertently end up in AI training datasets.
What are the implications of biased AI models?
Bias in AI training data can lead to outputs that reflect harmful stereotypes and skewed perspectives. For example, AI image generators may produce more sexualized depictions of women than of men. This bias arises because the internet itself contains a mix of valuable and toxic information, and AI models trained on it can inadvertently amplify what they absorb. That makes it essential to rethink how we approach AI training and its applications.