Google Hungers for More Data to Train Its AI

Google is making clear it intends to feast on the content of web publishers to advance its artificial intelligence systems. The tech and search giant is proposing that companies must opt out—as they currently do for search engine indexing—if they don’t want their material scraped.

Critics of this opt-out model say the policy upends copyright laws that put the onus on entities seeking to use copyrighted material, rather than the copyright holders themselves.

Google’s plan was revealed in its submission to the Australian government’s consultation on regulating high-risk AI applications. While Australia has been considering banning certain problematic uses of AI like disinformation and discrimination, Google argues that AI developers need broad access to data.

As reported by The Guardian, Google told Australian policymakers that “copyright law should enable appropriate and fair use of copyrighted content” for AI training. The company pointed to robots.txt, the standard file that lets publishers specify which sections of their sites are closed to web crawlers.

Google offered no details on how opting out would work. In a blog post, it vaguely alluded to new “standards and protocols” that would allow web creators to choose their level of AI participation.

The company has been lobbying Australia since May to relax copyright rules after releasing its Bard AI chatbot in the country. However, Google isn’t alone in its data mining ambitions. OpenAI, creator of leading chatbot ChatGPT, aims to expand its training dataset with a new web crawler named GPTBot. Like Google, it adopts an opt-out model requiring publishers to add a “disallow” rule if they don’t want content scraped.
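In practice, the opt-out described here comes down to a couple of lines in a site's robots.txt file. The following is a minimal sketch, not an official tool from either company: it uses Python's standard-library robots.txt parser to check whether a crawler identifying itself as GPTBot would be permitted to fetch a page. The example.com URL, the sample rules, and the helper name `is_allowed` are hypothetical and chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an illustrative site: it blocks the
# GPTBot crawler everywhere and leaves all other crawlers unrestricted.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/article"  # assumed URL, for illustration only
    print("GPTBot allowed:", is_allowed(ROBOTS_TXT, "GPTBot", page))
    print("Other crawler allowed:", is_allowed(ROBOTS_TXT, "SomeOtherBot", page))
```

A rule like this only works if the crawler chooses to honor it. The robots.txt convention is voluntary, which is part of why critics object to putting the burden on publishers.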

Large-scale data collection is standard practice for many big tech companies that rely on AI (deep learning and machine learning algorithms) to map their users’ tastes and serve content and ads to match.

This push for more data comes as AI popularity has exploded. The capabilities of systems like ChatGPT and Google’s Bard rely on ingesting massive text, image, and video datasets. According to OpenAI, “GPT-4 has learned from a variety of licensed, created, and publicly available data sources, which may include publicly available personal information.”

But some experts argue web scraping without permission raises copyright and ethical issues. Publishers like News Corp. are already in talks with AI firms, seeking payment for the use of their content. AFP just released an open letter about this very issue.

“Generative AI and large language models are also often trained using proprietary media content, which publishers and others invest large amounts of time and resources to produce,” the letter reads. “Such practices undermine the media industry’s core business models, which are predicated on readership and viewership (such as subscriptions), licensing, and advertising.

“In addition to violating copyright law, the resulting impact is to meaningfully reduce media diversity and undermine the financial viability of companies to invest in media coverage, further reducing the public’s access to high-quality and trustworthy information,” the media agency added.

The debate epitomizes the tension between advancing AI through unlimited data access and respecting ownership rights. On one hand, the more content these systems consume, the more capable they become. On the other, the companies building them are profiting from others’ work without sharing the benefits.

Striking the right balance won’t be easy. Google’s proposal essentially tells publishers to “hand over your work for our AI or take action to opt out.” For smaller publishers with limited resources or knowledge, opting out may prove challenging.

Australia’s examination of AI ethics provides an opportunity to better shape how these technologies evolve. But if public discourse gives way to data-hungry tech giants pursuing self-interest, it could establish a status quo where creations are swallowed whole by AI systems unless creators jump through hoops to stop it.


Source: https://decrypt.co/151970/google-ai-web-scraping-australia-investigation-training-data