this post was submitted on 02 Sep 2024
Technology
This article shows rather well three reasons why I don't like the term "hallucination", when it comes to LLM output.
On the main topic of the article: are LLMs useful? Sure! I use them myself. However, only a fool would try to shove LLMs everywhere, with no regard for how intrinsically [yes] unsafe they are. And yet that's what big tech is doing, regardless of being Chinese or United-Statian or Russian or German or whatever.
I feel like "hallucination" was chosen as the word because of what it implies.
It doesn't imply a bad algorithm, which would make the company look bad, because hallucinations are out of a person's control. It doesn't imply poor training data, for the same reason.
But hallucination also masks the development of the model. A small kid might say something racist based on what they grew up with, but we would likely call that child immature. Same if an AI doesn't fully understand a question or repeats a wrong answer that was given to it by someone as a laugh.
No one is at fault. It was just a hallucination.
I don't really agree with that argument. By that logic, there's really no such thing as a software bug, since the software is always doing what it's supposed to be doing: giving predefined instructions to a processor that performs some action. It's "supposed to" provide a useful response to prompts; anything other than that is not what it should be doing, and could fairly be called a malfunction.
When it comes to the code itself you're right, there's no difference between "bug" and "not a bug". The difference is how humans classify the behaviour.
And yet there's a clear mismatch between what the developers of those large "language" models know that they're able to do, versus what LLMs are being promoted for, and that difference is what is being called "hallucination". They are not intelligent systems, the info that they output is not reliably accurate, it's often useless rubbish. But instead of acknowledging it they label it "hallucination".
Perhaps an example would be good here. Suppose that I made a text editor; it works nicely as a text editor and nothing much else. Then I make it automatically find and replace the string "=2+2" with "4", and use it to showcase my text editor as if it was a calculator. "Look, it can do maths!".
Then the user types down "=3+3", expecting the "calculator" to output "6", and it doesn't. Can we really claim that the user found a "bug"? Not really. It's just that I'm a phony and I sold him a text editor as if it was a calculator.
And yet that's exactly what happens with LLMs.
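To make the analogy concrete, here's a minimal sketch in Python. The function name `fake_calculator` is made up for illustration; it's just the single find-and-replace rule described above, nothing more.

```python
def fake_calculator(text: str) -> str:
    """A 'calculator' that is really one hard-coded find-and-replace rule."""
    # Hypothetical stand-in for the text editor in the analogy:
    # the only "maths" it knows is the literal string "=2+2".
    return text.replace("=2+2", "4")


if __name__ == "__main__":
    print(fake_calculator("=2+2"))  # the showcased case gets replaced
    print(fake_calculator("=3+3"))  # anything else comes back untouched
```

The second call isn't hitting a bug in the replace rule. The rule works exactly as written; it just never computed anything in the first place.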
I think to some extent it's a matter of scale, though. If I advertise something as a calculator capable of doing all math, and it can only do one problem, it is so drastically far away from its intended purpose that the meaning kinda breaks down. I don't think it would be wrong to say "it malfunctions in 99.999999% of use cases" but it would be easier to say that it just doesn't work.
Continuing (and torturing) that analogy: if we did the disgusting work of precomputing all two-number math problems for integers from -1,000,000 to 1,000,000, I think you could say you had a (really shitty and slow) calculator, which "malfunctions" for numbers outside that range if you don't specify the limitation ahead of time. Not crazy different from software which has issues with max_int or small buffers.
If it were the case that there had only been one case of a hallucination with LLMs, I think we could pretty safely call that a malfunction (and we wouldn't be having this conversation). If it happens 0.000001% of the time, I think we could still call it a malfunction and say that it performs better than a lot of software. If it happens 99.999% of the time, it'd be better to say that it just doesn't work. I don't think there is, or even needs to be, some unified understanding of where the line is between them.
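A rough sketch of that precomputed "calculator", scaled way down so it actually runs. `LIMIT` and `table_add` are names I'm assuming for illustration, the range is ±100 instead of ±1,000,000, and it only covers addition.

```python
LIMIT = 100  # assumption: small stand-in for the 1,000,000 in the comment

# Precompute every sum for integer pairs in [-LIMIT, LIMIT].
SUMS = {
    (a, b): a + b
    for a in range(-LIMIT, LIMIT + 1)
    for b in range(-LIMIT, LIMIT + 1)
}


def table_add(a: int, b: int) -> int:
    """Look up a precomputed sum; raises KeyError outside the table's range."""
    return SUMS[(a, b)]


if __name__ == "__main__":
    print(table_add(2, 2))  # inside the table
    try:
        table_add(LIMIT + 1, 0)  # outside the table
    except KeyError:
        print("out of range: nothing was precomputed for this input")
```

Inside the range it behaves like a calculator; outside it, the failure is a documented (or undocumented) limit, much like an integer overflow.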
Really, my point is that there are enough things to criticize about LLMs and people's use of them; this seems like a really silly one to try and push.
The comment that you're replying to is fairly specifically criticising the usage of the word "hallucination" to misrepresent the nature of the undesirable LLM output, in the context of people selling you stuff by what it is not.
It is not "pushing" another "thing to criticise about LLMs". OK? I have my fair share of criticism against LLMs themselves, but that is not what I'm doing right now.
When we extend analogies they often break in the process. That's the case here.
Originally the analogy works because it shows a phony selling a product by what it is not. By making the phony precompute 4*10¹² equations (a completely unrealistic situation), he stops being a phony and becomes a muppet doing things the hard way.
Emphases mine. Those "ifs" represent a completely unrealistic situation, that does not show anything useful about the real situation.
We know that LLMs output "hallucinations" way more than just once, or 0.000001% of the time. They're common enough to show you how LLMs work.
Except Lvxferre is actually correct; LLMs are not capable of determining what is useful or not useful, nor can they ever be as a fundamental part of their models; they are simply strings of weighted tokens/numbers. The LLM does not "know" anything, it is approximating text similar to what it was trained on.
It would be like training a parrot and then being upset that it doesn't understand what the words mean when you ask it questions and it just gives you back words it was trained on.
The only way to ensure they produce only useful output is to screen their answers against a known-good database of information, at which point you don't need the AI model anyways.
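As a minimal sketch of that screening idea (everything here is hypothetical: the `KNOWN_GOOD` table, the `screened_answer` function, and the sample question are made up for illustration, and a real system would need far more than exact string matching):

```python
from typing import Optional

# Hypothetical known-good database: normalized question -> trusted answer.
KNOWN_GOOD = {
    "what is 2+2?": "4",
}


def screened_answer(question: str, model_answer: str) -> Optional[str]:
    """Return an answer only if it can be checked against the trusted table."""
    trusted = KNOWN_GOOD.get(question.strip().lower())
    if trusted is None:
        return None  # nothing to check against, so nothing can be vouched for
    # Whether or not the model agreed, the trusted value is what gets returned.
    return trusted


if __name__ == "__main__":
    print(screened_answer("What is 2+2?", "5"))
    print(screened_answer("What is 3+3?", "6"))
```

Note that `model_answer` never affects the result: once the trusted table has the answer, the model's output is redundant, which is the point being made above.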
A software bug is not about what was intended at a design level, it's about what was intended at the developer level. If the program doesn't do what the developer intended when they wrote the code, that's a bug. If the developer coded the program to do something different than the manager requested, that's not a bug in the software, that's a management issue.
Right now LLMs are doing exactly what they're being coded to do. The disconnect is the companies selling them to customers as something other than what they are coding them to do. And they're doing it because the company heads don't want to admit what their actual limitations are.
Where I don't think your argument fits is that it could be applied to things LLMs can currently do. If I have an insufficiently trained model which produces a word salad to every prompt, one could say "that's not a malfunction, it's still applying weights."
The malfunction is in having a system that produces useful results. An LLM is just the means for achieving that result, and you could argue it's the wrong tool for the job and that's fine. If I put gasoline in my diesel car and the engine dies, I can still say the car is malfunctioning. It's my fault, and the engine wasn't ever supposed to have gas in it, but the car is now "failing to function in a normal or satisfactory manner," the definition of malfunction.
The purpose of an LLM, at a fundamental level, is to approximate text it was trained on. If it was trained on gibberish, outputting gibberish wouldn't be a bug. If it wasn't, outputting gibberish would be indicative of a bug.
A better analogy would be selling someone a diesel car, when they wanted an electric vehicle, and them being upset when it requires refueling with gas. The car isn't malfunctioning in that case, the salesman was.
I'd argue that's what an LLM is, not its purpose. Continuing the car analogy, that's like saying a car's purpose is to burn gasoline to spin its wheels. That's what a car does, the purpose of my car is to get me from place to place. The purpose of my friend's car is to look cool and go fast. The purpose of my uncle's car is to carry lumber.
I think we more or less agree on the fundamentals and it's just differences between whether they are referring to a malfunction in the system they are trying to create, in which an LLM is a key tool/component, or a malfunction in the LLM itself. At the end of the day, I think we can all agree that it did a thing they didn't want it to do, and that an LLM by itself may not be the correct tool for the job.
No, that was the purpose for you, that made you choose to buy it. Someone else could have chosen to buy a car to live in it, for example. The purpose of a tool is just to be a tool. A hammer's purpose isn't just to hit nails with, it's to be a heavy thing you can use as needed. You could hit a person with it, or straighten out dents in a metal sheet, or destroy a hard drive. I think you're conflating the intended use of something with its purpose for existing, and it's leading you to assert that the purpose of LLMs is one specific use only.
An LLM is never going to be a fact-retrieval engine, but it has plenty of legitimate uses: generating creative text is very useful. Just because OpenAI is selling their creative-text engine under false pretenses doesn't invalidate the technology itself.
Sure, 100% they are using/selling the wrong tool for the job, but the tool is not malfunctioning.
There's no objective definition of "useful". Objectively the program is working. Subjectively it's not working how certain people want it to work.
We're talking about the meaning of "malfunction" here, we don't need to overthink it and construct a rigorous proof or anything. The creator of the thing can decide what the thing they're creating is supposed to do. You can say
We don't need to go to
I wouldn't call pasting verbatim training data hallucination when it fits the prompt. It's not necessarily making stuff up.
I feel like you're unfittingly mixing the tool's target behavior with its technical limitations. Yes, it's not knowingly reasoning. But that doesn't change the fact that the user interface is prompt-style, with the goal of answering.
I think it's fitting terminology for encompassing multiple issues of false answers.
What would you call it? Only by their specific issues? Or would you use a general term, like "error" or "wrong"?
I've seen it being called hallucination plenty of times. Because the output is undesirable - even if it satisfies the prompt, it is not something you'd want the end user to see, as it shows that the whole thing is built upon the unpaid labour of everyone who uses the internet.
Calling the output what it is (false, or immoral, or nonsensical) instead of using a catch-all would be progress, I think.