Yesterday, my daughter and I went to an Italian restaurant for dinner. While she browsed the menu, I took a few photos of it with my phone, sent them to ChatGPT, and asked it to order dinner for us. I specifically told it that this was dinner for two, one of us being my 14-year-old daughter.
When ChatGPT came back with its recommendation, I noticed it had ordered enough food for three. My daughter asked curiously, "Where is that dessert with the strange name you wanted to order? I don't see it." I looked down and checked the menu carefully before I spotted the problem: the waiter had only given us the drink list, the daily specials, and the main course menu, with no dessert menu. Restaurants usually wait until customers finish the main course before bringing it out. Yet ChatGPT already knew this restaurant's dessert menu. It had not only read the photos I took, it had also pulled the dessert menu straight out of its "memory" and recommended a dessert accordingly.
This felt off to me. If the AI were working only from the menu photos I took, its process should be "multimodal understanding + logical reasoning": read the menu content first, then reason over it in context. But it had clearly retrieved information I never provided; it was not working purely from my input, but bypassed my data and pulled up its own stored knowledge. It's like an AI that, asked to solve a math problem, doesn't compute the answer but looks it up online; or, asked to analyze a trend from a news report, simply finds a summary article instead of reasoning itself. It feels a little like cheating.
If the AI really only extracts information from the pictures I provide, there is nothing wrong with that. But if it pulls the complete menu straight from its knowledge base, that is "knowledge recall + information retrieval", not pure reasoning. I started to wonder: is it feasible to make the AI analyze only the menu I provided, without calling on its memory? For example, I could tell it: "Please analyze only the menu pictures I provided, and don't use anything you already know." The problem is that such a restriction may make the AI a little "stupid".
When the AI can rely only on the pictures and cannot use external knowledge, problems appear, such as:
- It cannot identify special dish names, such as local specialties or restaurant-specific creations.
- It cannot infer implicit information. For example, a dessert may be sugar-free even though the menu does not say so explicitly.
- It cannot draw on user reviews or popularity, so it cannot recommend based on popular taste.
For example, suppose the dessert menu says "Black Forest". Everyone knows that Black Forest is a kind of chocolate cake, but an AI without "memory" might process it like this: first, it runs Named Entity Recognition (NER) and finds the noun "Forest"; then it runs dependency parsing and finds a noun phrase modified by an adjective; with no further text, it decides it has found the answer and takes the phrase literally. Black Forest? A forest? Are we going camping? It has no idea this is a dessert. In other words, restricting the AI to the input alone keeps it from "cheating", but it also cuts down its ability to understand.
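To make the contrast concrete, here is a deliberately toy Python sketch (entirely hypothetical, nothing like ChatGPT's real pipeline) of a literal reading of "Black Forest" versus a knowledge-backed one:

```python
# Toy illustration: literal parsing vs. knowledge-backed interpretation.
# This is a simplified sketch, not a real NLP pipeline.

def literal_reading(phrase: str) -> str:
    """Take the head noun of the phrase at face value."""
    tokens = phrase.split()
    head_noun = tokens[-1]           # a dependency parse would also pick "Forest" as the head
    return f"a {head_noun.lower()}"  # "Black Forest" -> "a forest" (camping, anyone?)

# Stands in for the model's parametric "memory" of world knowledge.
DESSERT_KNOWLEDGE = {
    "black forest": "a chocolate cake layered with cherries and cream",
}

def knowledge_reading(phrase: str) -> str:
    """Consult stored knowledge first; fall back to the literal reading."""
    return DESSERT_KNOWLEDGE.get(phrase.lower(), literal_reading(phrase))

print(literal_reading("Black Forest"))    # -> a forest
print(knowledge_reading("Black Forest"))  # -> a chocolate cake layered with cherries and cream
```

The literal reader is "honest" but clueless; the knowledge-backed reader understands the dish precisely because it consults information that was never in the input.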
So, is there a compromise? If I don't want the AI to pull up the full menu directly, but I do want it to combine its existing knowledge to give more valuable recommendations, I could prompt it like this:
- "Please analyze mainly the menu pictures I provided, but where necessary you may draw on your own knowledge to make better recommendations."
- "Please analyze entirely based on my menu pictures first. If anything is uncertain, you may then refer to your own knowledge."
- "Please do not retrieve the restaurant's full menu from your existing knowledge, but you may use your knowledge to help interpret what is on the menu."
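If you call a multimodal model through an API rather than the chat interface, the same constraint can be baked into the system message. A minimal sketch, assuming the OpenAI chat-completions message format; the wording of the constraint and the image handling here are illustrative, and no actual API call is made:

```python
import base64

def build_menu_request(image_bytes: bytes, diners: str) -> list[dict]:
    """Build a constrained multimodal request: analyze the photos,
    use background knowledge to interpret, but don't recall unseen items."""
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    return [
        {
            "role": "system",
            "content": (
                "Base your analysis primarily on the menu photos provided. "
                "You may use background knowledge to interpret dish names, "
                "but do not recall menu items that are not in the photos."
            ),
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": f"Please order dinner for {diners}."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        },
    ]

messages = build_menu_request(b"<jpeg bytes here>", "two people, one a 14-year-old")
# `messages` would then be passed to a chat-completions call.
print(messages[0]["role"])  # -> system
```

The point is not the exact wording but where it lives: a system-level instruction is harder for the model to drift away from than a casual remark buried in the user turn.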
This way, the AI neither leans entirely on external knowledge nor becomes too mechanical; it reasons within a reasonable range.
If the AI orders only from the uploaded menu, it is mainly doing multimodal understanding + logical reasoning. If it retrieves the stored menu directly, it is closer to "knowledge recall + information retrieval". The boundary is not always clear, but as users we can steer the AI onto the path we want with more precise prompts. Real rapport is when the AI knows where the user's expectations and boundaries lie, and then "aligns" with the user's intent. That, as it happens, is my course next semester: NLU.