Today's AI has advanced considerably. Systems that once behaved in a purely mechanical way can now perform tasks that resemble human cognition. To put it a little more technically, AI that used to be single-modal is now multimodal. A single-modal system handles one kind of input at a time: it can recognize a face in an image, or convert speech to text, but it cannot interpret the two together. A multimodal system can recognize image and audio separately, and it can also recognize them in combination.
Listening to another person's spoken words (voice) while watching their facial expression (image), and combining the two to understand how they feel, is a multimodal act, and it is something human beings do routinely. Modern AI is gaining these capabilities as well, which should make it possible to interact with AI much as we interact with other people.
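
To make the idea of "combining" concrete, here is a minimal sketch of one simple way to merge two modalities, often called late fusion: each modality produces its own emotion scores, and the scores are averaged. The function name `fuse_emotions`, the `face_weight` parameter, and all the numbers are made up for illustration. Real systems are far more elaborate, and many combine the inputs much earlier than this.

```python
def fuse_emotions(face_scores, voice_scores, face_weight=0.5):
    """Combine per-emotion scores from two modalities by weighted average.

    Hypothetical helper for illustration only: both arguments are dicts
    mapping an emotion label to a score between 0 and 1.
    """
    # Only labels that both modalities reported can be combined.
    labels = face_scores.keys() & voice_scores.keys()
    fused = {
        label: face_weight * face_scores[label]
        + (1 - face_weight) * voice_scores[label]
        for label in labels
    }
    # Return the highest-scoring label along with all fused scores.
    return max(fused, key=fused.get), fused


# Illustrative (invented) scores: the face looks happy, the voice sounds flat.
face = {"happy": 0.6, "angry": 0.1, "neutral": 0.3}
voice = {"happy": 0.2, "angry": 0.1, "neutral": 0.7}

face_only = max(face, key=face.get)        # single-modal guess: "happy"
combined, scores = fuse_emotions(face, voice)  # multimodal guess: "neutral"
```

With these numbers, the image alone suggests "happy", but once the voice is taken into account the combined guess becomes "neutral". That shift is the kind of difference a multimodal system can capture and a single-modal one cannot.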