This article introduces the NeurIPS 2022 paper "Emergent Graphical Conventions in a Visual Communication Game", whose authors include Prof. Songchun Zhu and Assistant Professor Yixin Zhu of the Institute for Artificial Intelligence, Peking University.
The co-first authors are Shuwen Qiu and Sirui Xie; the other authors are Lifeng Fan, Tao Gao, Jungseock Joo, Yixin Zhu, and Songchun Zhu.
Paper: https://yzhu.io/publication/teaming2022neurips/
01 The Formation of Written Language Systems
Cognitive science research suggests that written language systems form through a progression from pictograms to abstract symbols. As shown in the diagram, early humans used sketches to depict the sun, aiming to capture its natural appearance as closely as possible. In doing so, people gradually established a connection between visual concepts and pictographic symbols. In later communication, these symbols were reused whenever the sun needed to be described. To make communication more efficient, the icons were simplified and abstracted, gradually giving rise to the pictographic writing systems we have today.
To study this process, cognitive scientists used the game "Pictionary". In the early stages of the game, participants had to communicate using sketches. As the game progressed, they encountered content they had already communicated before. The experiments showed that, through iterated communication, a new symbol system emerged between the players. In the example shown below, players initially depicted "Parliament" with a detailed drawing of the location and the national flag. Through iteration and refinement, it was eventually represented by curves and circles. Similarly, the first sketch of "Soap Opera" depicted "Soap" and "Opera" literally, and was later simplified to a square and a line.
The paper introduced here simulates the formation of graphical symbol systems by training two AI agents to play "Pictionary". It explores the balance between accuracy and efficiency in the formation of abstract writing systems, and verifies which environmental factors are necessary for the agents' symbol systems to form in a way consistent with human graphical symbol systems.
02 Model
As shown in the figure, we formulate "Pictionary" as a multi-agent sequential decision-making game. Each round involves two players: a sender, who observes the target of the current communication (a common visual concept such as a rabbit or a cup), and a receiver, who observes a set of candidate images, one of which belongs to the category being communicated. The receiver's task is to guess which image the sender's drawing refers to. At each time step, the sender adds strokes to the canvas based on the target. After observing the newly added strokes, the receiver decides whether to ask the sender to keep drawing or to make a guess. The game terminates when the receiver makes a guess or when the maximum number of steps is reached. When the game ends, both players receive a reward of +1 or a punishment of -1, depending on the outcome. To encourage efficient communication, this reward or punishment is multiplied by a decay factor based on the total number of game steps, which gives each player's final return. The sender and the receiver are both trained to maximize this return. Eligibility traces are used to smooth convergence.
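The round structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: the names `play_round` and `episode_return`, the exponential form of the decay, the value of `gamma`, the default of 7 maximum steps, and the rule that a round without any guess counts as a failure are all hypothetical. The source only says that the +1/-1 outcome is multiplied by a decay factor based on the total number of steps.

```python
from typing import Callable, List, Optional, Tuple

Stroke = str  # placeholder type; real strokes would be drawn curves


def episode_return(correct: bool, steps: int, gamma: float = 0.9) -> float:
    """Terminal outcome (+1 or -1) scaled by an assumed exponential decay."""
    if steps < 1:
        raise ValueError("a round needs at least one drawing step")
    outcome = 1.0 if correct else -1.0
    return outcome * (gamma ** steps)


def play_round(
    draw_stroke: Callable[[List[Stroke]], Stroke],
    receiver_act: Callable[[List[Stroke], bool], Optional[int]],
    target_index: int,
    max_steps: int = 7,   # assumed value
    gamma: float = 0.9,   # assumed value
) -> Tuple[float, int]:
    """Play one round and return (shared return, number of steps used)."""
    canvas: List[Stroke] = []
    for step in range(1, max_steps + 1):
        # The sender adds a stroke based on what is already on the canvas.
        canvas.append(draw_stroke(canvas))
        # The receiver returns an image index to guess, or None to continue.
        must_guess = step == max_steps
        guess = receiver_act(canvas, must_guess)
        if guess is not None:
            return episode_return(guess == target_index, step, gamma), step
    # Assumption: a round in which the receiver never guesses is a failure.
    return episode_return(False, max_steps, gamma), max_steps


if __name__ == "__main__":
    # Toy policies: the receiver guesses image 2 once three strokes exist.
    ret, used = play_round(
        draw_stroke=lambda canvas: "stroke",
        receiver_act=lambda canvas, must: 2 if len(canvas) >= 3 else None,
        target_index=2,
    )
    print(ret, used)
```

Under this assumed decay, a correct guess made after fewer strokes yields a larger return than the same guess made later, which is the pressure toward shorter sketches that the paragraph describes.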
03 Experiments
We explored the impact of the following environmental factors on the evolution of symbolic systems:
1) Cooperative Training: Whether the sender and receiver engage in cooperative training.
2) Receiver Termination: Whether the receiver has the option to terminate the game.
3) Interactive Sequential Communication: Whether the sender and receiver engage in interactive sequential communication.
To test these factors, we designed one experimental group, called "complete", and four control groups:
1) sender-fixed: The sender's model parameters are not updated. This setting controls for the cooperative training factor.
2) max-step: The receiver is not allowed to terminate the game early. This setting controls for the receiver termination factor.
3) one-step: The players can communicate for only one time step per game. This setting controls for the interactive sequential communication factor.
4) retrieve: The sender's model parameters are not updated and the receiver is not allowed to terminate the game early, which simulates a scenario with no communication between the two players.
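The five settings listed above can be summarized as a small configuration table. The sketch below is one reading of that list, not code from the paper: the `Setting` class, its field names, and the derived `can_vary_length` property are hypothetical, and the `early_termination` value for "one-step" is an assumption, since the flag has no effect when only one step is allowed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Setting:
    sender_trained: bool      # cooperative training: sender is updated
    early_termination: bool   # receiver may end the game before max steps
    multi_step: bool          # interactive sequential communication

    @property
    def can_vary_length(self) -> bool:
        # Communication length can only change if several steps are
        # allowed and the receiver may stop before the last one.
        return self.early_termination and self.multi_step


# Hypothetical encoding of the experimental and control groups.
SETTINGS = {
    "complete":     Setting(sender_trained=True,  early_termination=True,  multi_step=True),
    "sender-fixed": Setting(sender_trained=False, early_termination=True,  multi_step=True),
    "max-step":     Setting(sender_trained=True,  early_termination=False, multi_step=True),
    # Assumption: early termination is moot when there is a single step.
    "one-step":     Setting(sender_trained=True,  early_termination=True,  multi_step=False),
    "retrieve":     Setting(sender_trained=False, early_termination=False, multi_step=True),
}

if __name__ == "__main__":
    for name, setting in SETTINGS.items():
        print(f"{name:13s} can vary length: {setting.can_vary_length}")
```

Written this way, each control group differs from "complete" in the factor it is meant to isolate, and "retrieve" combines the "sender-fixed" and "max-step" restrictions.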
In the "retrieve" setting where there is no communication, the sketches do not undergo simplification, resulting in the highest level of iconicity. We consider the experimental results of this setting as the upper bound of achievable communication.
We also show how the drawings evolve during training: from left to right, each image shows the sketch at a different training iteration, from 0 to 30,000. The sketches become progressively simpler, and for drawings of the same category the sender consistently emphasizes the category's most salient features. For example, the rabbit sketches highlight the ears. The giraffes in the images have different poses, and in the third image the giraffe's neck is bent, yet the sketch still emphasizes the giraffe's long, vertical neck.
First, we validate the effectiveness of our designed training framework using communication success rate and communication efficiency.
Communication Success Rate: We consider a new communication system to have formed between the agents once communication accuracy exceeds 80%. As shown in Figure (a), the agents formed new communication systems in every experimental setting except "one-step". This demonstrates that our training framework enables effective communication between agents, and it highlights the importance of interactive sequential communication.
Communication Length: In human experiments, the number of strokes required for drawing decreases after repeated communications. As depicted in Figure (b), for the settings that can change the communication length (complete, sender-fixed), the communication length gradually decreases. This indicates that our designed implicit rewards and punishments encourage agents to reduce communication length to improve communication efficiency.
Accuracy vs Efficiency: The agents may shorten their communication for two reasons: (1) to improve efficiency while maintaining accuracy, or (2) because longer communication is harder to learn, so training converges to shorter exchanges. Only the former is desirable. During training, we tested the receiver's accuracy on sketches of 1, 3, 5, and 7 strokes drawn by the sender. As shown in Figure (c), in the cumulative test results (using REINFORCE training as the baseline), accuracy decreases as the number of strokes increases, indicating a failure to learn from longer communications. In contrast, with our proposed training framework, accuracy on sketches with more strokes reaches the highest level first (ensuring accuracy), and accuracy on sketches with fewer strokes then rises gradually until it matches the accuracy on 7-stroke sketches (reducing strokes to improve efficiency). This shows that the agents actively balance accuracy and efficiency as they communicate.
To compare the effectiveness of the newly formed communication systems, we design three attributes for graphical symbol systems and their corresponding metrics.
Iconicity: We define iconicity as the proximity of sketches to their corresponding natural images in a mapping space. As shown in Figure 1, in the Psi space a drawing lies closer to its corresponding image than to other images. To measure iconicity, we tested the agents' communication accuracy on unseen images and unseen categories in each experimental setting. As shown in the table, agents in the "complete" and "sender-fixed" settings can adjust the communication length according to how familiar the content is. When they encounter unfamiliar images or categories, they increase the communication length to make the drawings more iconic.
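The definition above can be illustrated with a toy check: a sketch is treated as iconic for its target if, among the candidate images, the target's embedding is the nearest one. This is only a sketch under assumptions. The function names, the use of Euclidean distance, and the 2-D toy vectors are hypothetical stand-ins; the source does not specify the distance or the embedding used in the Psi space.

```python
import math
from typing import List, Sequence


def nearest_image(sketch: Sequence[float], images: List[Sequence[float]]) -> int:
    """Index of the candidate image embedding closest to the sketch embedding."""
    if not images:
        raise ValueError("at least one candidate image is required")
    distances = [math.dist(sketch, image) for image in images]
    return distances.index(min(distances))


def is_iconic(sketch: Sequence[float], images: List[Sequence[float]], target: int) -> bool:
    """True if the sketch lies closer to its target image than to any other."""
    return nearest_image(sketch, images) == target


if __name__ == "__main__":
    # Toy embeddings (made-up values): the sketch sits near candidate 0.
    candidates = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]
    sketch = (0.9, 0.1)
    print(is_iconic(sketch, candidates, target=0))
```

Averaging this check over held-out images or categories would give a retrieval accuracy, which is in the spirit of the unseen-image test described above.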
Symbolicity: We define symbolicity as the degree to which sketches of the same category are consistent with one another and easily distinguishable from other categories in a high-dimensional mapping space. As shown in Figure 2, there are clear boundaries between different categories. To measure symbolicity, we fine-tuned a pre-trained VGGNet to classify the drawings into their categories. As shown in the bar graph, the symbol system formed under the "complete" setting exhibits the highest consistency.
Semanticity: We define semanticity as the similarity between the topological structure of the sketches and that of their corresponding images in a high-dimensional mapping space. As shown in Figure 3, semantically similar concepts, such as cats and dogs, lie relatively close to each other in both the sketch space and the image space, while both are farther from the cup. To measure semanticity, we first embed the name of each category with word2vec to obtain feature A. We then use the VGG model fine-tuned for the symbolicity measurement to embed the evolved sketches, which gives feature B. We compute the correlation coefficient between the pairwise distances in feature B and the corresponding pairwise distances in feature A. The results in the table show that the "complete" setting best preserves semanticity. Additionally, we use t-SNE to project feature B of the "complete" setting onto a two-dimensional plane, where the boundaries between different categories are clear. Moreover, semantically similar categories such as cows, deer, and horses are adjacent to each other, while they are farther from categories like hamburgers and apples.
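The distance-correlation measurement described above can be sketched as follows. This is an illustrative version under assumptions, not the authors' code: it uses Euclidean distance and the Pearson correlation coefficient (the source only says "correlation coefficient"), the function names are hypothetical, and the tiny 2-D vectors stand in for real word2vec and VGG features.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence

Vectors = Dict[str, Sequence[float]]


def pairwise_distances(vectors: Vectors) -> List[float]:
    """Distances between all category pairs, in a fixed (sorted) pair order."""
    names = sorted(vectors)
    return [math.dist(vectors[a], vectors[b]) for a, b in combinations(names, 2)]


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; undefined (division by zero) if either list is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def semanticity(feature_a: Vectors, feature_b: Vectors) -> float:
    """Correlate category distances in feature A (names) and feature B (sketches)."""
    if set(feature_a) != set(feature_b):
        raise ValueError("both feature sets must cover the same categories")
    return pearson(pairwise_distances(feature_a), pairwise_distances(feature_b))


if __name__ == "__main__":
    # Made-up toy features: cat and dog are close, cup is far, in both spaces.
    words = {"cat": (0.0, 0.0), "dog": (1.0, 0.0), "cup": (10.0, 0.0)}
    sketches = {"cat": (0.0, 0.0), "dog": (0.0, 2.0), "cup": (0.0, 20.0)}
    print(semanticity(words, sketches))
```

A value near 1 means that categories whose names are close in the word space also have sketches that are close in the sketch space, which is the sense in which a setting "preserves semanticity".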
In this work, we simulated the formation of a new graphical symbol system using the game "Pictionary". We validated the effectiveness of the training framework and proposed three attributes of graphical symbol systems: iconicity, symbolicity, and semanticity. The experimental results show that three factors, namely cooperative training between the players, allowing the receiver to terminate the game early, and interactive sequential communication, encourage the newly formed graphical symbol system to exhibit high symbolicity while preserving iconicity and semanticity. We hope that this work provides insights into the evolution of pictographic writing systems.