It Took Glue on Pizza to Spotlight Google’s AI Problem

Image: Shutterstock (with AI assist)

Google, the “indispensable” search engine relied on by millions for accurate and reliable results, has done it again, stepping smack into the pile of steaming excrement waiting for it in the middle of the road. Its most recent ill-starred foray into AI-generated search has yielded some hilarious results, lighting up the blogosphere and making Google the butt of many jokes. Google flubbed the public launch of its first AI-enabled service, Bard, back in early 2023, when the AI-driven search function produced the wrong answer to a simple question about the James Webb Space Telescope, overnight wiping roughly $100 billion off Google’s market value. Now its new Gemini-powered “AI Overview” service has scored another own goal with its “hallucinatory” responses to questions such as how to make cheese stay on pizza (add glue) or how many rocks a human should eat each day (just one, in case you were wondering). It also informed users that Barack Obama was the first Muslim president of the United States.

When it comes to AI, “hallucinations” are incorrect or misleading outputs arising from insufficient training data, biased or selective training data, or incorrect assumptions made by the model. Hallucinations leading to trademark dilution were among the accusations levelled against OpenAI and Microsoft by the New York Times in its landmark copyright infringement case, which is still working its way through the courts. In that instance, the AI algorithm incorrectly attributed false information to the Times, thereby undermining its journalistic credibility and diluting its trademark, or so the argument goes.

Apparently, the source of the pizza-glue misinformation was an old tongue-in-cheek post on Reddit. I guess an AI algorithm has no sense of humour and can’t tell sarcasm from reality. It also lends credibility to conspiracy theories and blatantly false information, as in the Barack Obama example. Normally, a Google search on any subject turns up a variety of sources, some clearly more authoritative than others. Searchers can weigh a Wikipedia entry against a Reddit post against information from a government website or a reputable academic institution. Even a plain old tendentious website put up by an advocacy organization can be probed and the bona fides of the source checked out. That is becoming more difficult, or at least less obvious, with the AI-generated search summary provided by Google’s AI Overview.

If the search topic falls within AI Overview’s purview (and at the moment not all topics do), viewers will see a summary of the information requested, drawn from sources chosen by the algorithm. The algorithm decides which sites are chosen and how much information is drawn from any given site. Users have the option of clicking through to access these and other sites displayed below the annoying sponsored listings. However, many consumers, looking for a quick information fix, will not bother to do this and thus risk taking the AI summary as gospel. If you are being advised to mix glue into your pizza topping, you can probably figure out that something is haywire; but if the summary is only slightly wrong, or is on a subject you are not familiar with, watch out. A good example was provided by the website Plagiarism Today, which asked Google five questions about copyright in the US. Its conclusion regarding the responses provided by AI Overview? Decidedly mixed: one A, one B, one C, one D and one resounding F.

The accuracy of the summary obviously depends on the sources of information chosen by the algorithm and the emphasis it puts on information from any given source. Unfortunately, in many instances it does not seem to prefer credible and authoritative sources, but instead goes for those that are popular. That is one of the basic problems of AI generally: quantity over quality, popularity over facts. (By the way, this account of how AI Overview works is based on reading US sources, since the feature is not yet available in Canada; that may be a good thing, given the various US posts I have read explaining that Overview is hard to avoid and impossible to disable outright.) Google intends to cram it down your throat whether you want it or not.
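To make the concern concrete, here is a minimal, purely hypothetical sketch; the actual ranking signals and weights inside AI Overview are not public, and every name and number below is invented. The point it illustrates: when popularity dominates the score, a viral joke thread outranks an authoritative source, and shifting the weight toward credibility reverses the order.

```python
# Hypothetical illustration only: Google's real ranking signals are not public.
# It shows how the choice of weights, not the facts, decides which source
# "wins" a place in a generated summary.

from dataclasses import dataclass

@dataclass
class Source:
    name: str
    popularity: float   # e.g. normalized traffic/engagement, 0-1 (invented)
    credibility: float  # e.g. editorial standards/expertise, 0-1 (invented)

def score(src: Source, w_popularity: float, w_credibility: float) -> float:
    # Simple weighted sum of the two signals.
    return w_popularity * src.popularity + w_credibility * src.credibility

sources = [
    Source("joke Reddit thread", popularity=0.9, credibility=0.1),
    Source("food-science reference site", popularity=0.3, credibility=0.9),
]

for w_pop, w_cred, label in [(0.8, 0.2, "popularity-weighted"),
                             (0.2, 0.8, "credibility-weighted")]:
    best = max(sources, key=lambda s: score(s, w_pop, w_cred))
    print(f"{label}: top source -> {best.name}")
```

Run it and the popularity-weighted pass picks the joke thread while the credibility-weighted pass picks the reference site. The numbers are made up, but the structural point stands: whatever sits in the weighting decides whose “facts” reach the summary.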

Of course, Google assures everyone that Overview is a “good thing” and that the early kinks will be ironed out. Many websites are not happy with the new interface that will now stand between them and the consumer. They lose traffic when users simply read the AI summary and move on without visiting the source website. Google used to boast that it had a symbiotic relationship with content providers because it facilitated, and even drove, eyeballs to their sites. No longer. It has appropriated content from independent sites, without permission, to feed AI Overview, in the same way that the developers of large language models (LLMs) have scooped up content, including copyrighted content, from rightsholders without permission, licence or payment to enable their AI training. It is one thing to link to third-party content, which requires a visit to the actual site to access the content; it is quite another to freely copy from it and mix it, sometimes inaccurately or inappropriately, with content from other websites that may not be reliable or acceptable sources of information.

Google clearly controls what goes into AI Overview and has said that it will apply more filters. If it can screen out sources of sarcasm and parody, it clearly has the capacity to install other filters that could differentiate trustworthy information from garbage. This might require Google to license the use of this curated information (horrors! Google having to pay for the information of others that it so freely uses!). Licensing has already begun for content used in generative AI training: News Corp has just signed a licensing deal with OpenAI, as have the AP and the Financial Times. Licensing is also at the heart of the dispute between OpenAI and the New York Times.

Licensing presupposes knowledge of what inputs are being used, a transparency requirement now enshrined in EU law, under which AI developers must maintain an inventory of the works used for training purposes. This will allow rightsholders to opt out or negotiate a licensing solution, unless the copying falls within the text and data mining exception (i.e., research conducted by research and cultural heritage organizations).
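What such an inventory might look like in practice is easy to sketch. The record format and the exception logic below are entirely my own invention, not anything prescribed by the EU AI Act; they simply illustrate the idea that each ingested work gets logged, and declared opt-outs, licences and the research exception are checked before it enters a training set.

```python
# Hypothetical sketch of a training-data inventory supporting opt-out checks.
# The record fields and the decision logic are illustrative assumptions,
# not the EU AI Act's actual prescribed format.

from dataclasses import dataclass

@dataclass
class TrainingWork:
    title: str
    rightsholder: str
    source_url: str
    opted_out: bool   # rightsholder has reserved its text-and-data-mining rights
    licensed: bool    # a licence has been negotiated

def may_ingest(work: TrainingWork, research_use: bool) -> bool:
    """Allow use if the research TDM exception applies, a licence exists,
    or the rightsholder has not opted out."""
    if research_use:
        return True
    return work.licensed or not work.opted_out

inventory = [
    TrainingWork("News article", "Example Times", "https://example.com/a",
                 opted_out=True, licensed=False),
    TrainingWork("Reference entry", "Example Press", "https://example.com/b",
                 opted_out=False, licensed=True),
]

for work in inventory:
    verdict = "ingest" if may_ingest(work, research_use=False) else "skip or license"
    print(f"{work.title}: {verdict}")
```

Even a toy record like this shows why the industry’s “we can’t possibly track it” claim rings hollow: the bookkeeping is mundane once you decide to do it.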

However, Silicon Valley has variously proclaimed that (a) it is impossible to track all the information ingested during AI training; (b) the industry would be bankrupted if it had to pay for content; (c) it needs to use copyrighted content because there is not enough public domain material available; (d) it is not feasible to filter out or identify specific works among the millions of data points ingested; (e) everything it does is fair use anyway; or (f) all of the above. Google’s embarrassment, and its apparent ability to fine-tune AI Overview, demonstrates that it is clearly feasible to filter out certain works and types of content. It is the will to do so that is lacking. Meanwhile, the number of lawsuits brought by rightsholders against AI developers continues to multiply.

By making itself the object of social media ridicule, and then admitting it can address the problem, Google has actually done us all a favour by highlighting the “garbage in, garbage out” problem. Not all copyrighted material is responsible or accurate, but a good chunk of it is, such as professional journalism and academic journals. Access to that material is essential to producing credible results. And that material needs to be paid for, on terms set by the content owners.

The solution is not to stop the development of generative AI; for one thing, that won’t happen. It is to corral it, improve it and make it more trustworthy, with penalties, if necessary, when it is not. The penalties could be imposed by the market (i.e., Google search is not reliable, so I will go elsewhere) or, in certain cases, by regulation. Licensing accurate, credible information to drive search will inevitably distinguish the fake from the real and the dubious from the trustworthy. That is what any credible search engine seeks. It is market gold.

Google, open your bulging wallet and start licensing content that will make us want to keep coming to you first for reliable information. Right now, through your clumsy rollout of AI-supported search, you are rapidly losing that trust. It is not acceptable to plagiarize someone else’s content, mix it with garbage from some other source, and serve it up on a platter to consumers on the pretext that this is the definitive answer. The result, as we have seen, is gluey pizza.

(c) Hugh Stephens, 2024. All rights reserved.

Author: hughstephensblog

I am a former Canadian foreign service officer and a retired executive with Time Warner. In both capacities I worked for many years in Asia. I have been writing this copyright blog since 2016, and recently published a book, “In Defence of Copyright”, to raise awareness of the importance of good copyright protection in Canada and globally. It is written from and for the layman’s perspective (not a legal text or scholarly work), illustrated with some of the unusual copyright stories drawn from the blog. Available on Amazon and in local bookstores.

