A couple of days ago, Google CEO Sundar Pichai tweeted enthusiastically: “We have achieved quarterly revenue exceeding $100 billion for the first time in the company’s history. Moreover, we have achieved double-digit growth in every major segment of our business. (Five years ago, our quarterly revenue was only $5 billion 🚀)”
ChatGPT currently has 800 million weekly active users, but on the consumer application front, Google is catching up at a remarkable pace. With the recent viral success of Nano Banana in particular, Google has “stood out” among the crowd of models.
According to data from the app analytics firm Appfigures, Gemini downloads soared around the launch of Nano Banana, while downloads of Adobe’s generative AI image and video app Firefly fell sharply. A direct causal relationship between the two cannot yet be confirmed, but their timelines coincide almost exactly.
Josh Woodward, head of the Gemini app and vice president of Google Labs, stated that the popularity of this tool has brought about unexpected chain reactions. More importantly, many users who initially only came to use Nano Banana later started using Gemini for other tasks.
“We have seen significant changes in the app’s user demographics,” Woodward said in an interview. That includes a “substantial increase” in users aged 18-34, and a user base that was previously dominated by men now attracts more female users. Attracting young users is good news for Google, which has long worried that they spend more of their time on social platforms like TikTok.
Woodward also revealed that the number of international users of Gemini is also rising rapidly. In fact, this is not surprising—Nano Banana once set off a global trend: users used it to create their own 3D avatars. “That trend first started in Thailand,” Woodward said. “An internet celebrity posted a video, and then it quickly spread to Vietnam and Indonesia, becoming popular across Southeast Asia almost overnight.”
For Google, attracting users through popular features like Nano Banana is a clever entry point. Many people download Gemini for fun, but once they stay and use other features, Google wins. Woodward admitted that the company pays close attention to this kind of “retention stickiness”: whether users keep coming back and form usage habits. Google reportedly counts as monthly active users those who open the app on Android or iOS, or who use it through the web and perform interactive operations; the definition excludes very basic requests, such as setting a timer.
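To make that definition concrete, here is a minimal, purely illustrative sketch of how such a metric could be computed. The event names, exclusion list, and 30-day window are hypothetical assumptions for illustration, not Google’s actual implementation.

```python
from datetime import datetime, timedelta

# Hypothetical event log: (user_id, timestamp, action). "Interactive" actions
# count toward monthly actives; trivial requests (e.g. setting a timer) do not,
# mirroring the definition described above.
TRIVIAL_ACTIONS = {"set_timer", "set_alarm"}

def monthly_active_users(events, as_of):
    """Count distinct users with at least one non-trivial interaction
    in the 30 days up to `as_of`."""
    window_start = as_of - timedelta(days=30)
    active = set()
    for user_id, ts, action in events:
        if window_start <= ts <= as_of and action not in TRIVIAL_ACTIONS:
            active.add(user_id)
    return len(active)

events = [
    ("u1", datetime(2025, 9, 20), "image_edit"),
    ("u2", datetime(2025, 9, 21), "set_timer"),  # excluded: basic request
]
print(monthly_active_users(events, datetime(2025, 10, 1)))  # -> 1
```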
Recently, Oliver Wang, a principal scientist at Google DeepMind, and Nicole Brichtova, a product manager there, appeared on the a16z podcast. They spoke with a16z partners Justine Moore, Guido Appenzeller, and Yoko, who focus on artificial intelligence and infrastructure investment, about how Nano Banana was born, why it spread virally, and the future of image and video editing. We have translated, abridged, and organized the conversation without altering its original meaning.
The Origin of Nano Banana
Yoko: Could you first talk about the story behind the Nano Banana model? How was it created?
Oliver Wang: Of course. Our team has been working on image models for several years; we previously developed the Imagen model series, and Gemini already had an image generation model before Gemini 2.0 shipped its native image generation capability. Over time, the team shifted its focus toward Gemini-centric scenarios such as interaction, dialogue, and editing. Several of our teams collaborated to bring these capabilities together, and the result was the Nano Banana model everyone is now familiar with.
Brichtova: Our Imagen models have always been known for their visual quality, especially in generation and editing tasks. After Gemini 2.0 Flash launched, we felt the magic of generating text and images together for the first time: we could tell stories while generating images, and even modify images through dialogue. The only shortcoming was that the image quality at the time did not meet the bar we wanted. Nano Banana, later released as Gemini 2.5 Flash Image, was born to close that gap.
Yoko: But I have to say that the name Nano Banana is much cooler!
Brichtova: Yes, and it’s also easier to pronounce. It essentially combines Gemini’s intelligence and multimodal interaction with Imagen’s strength in visual quality. I think that’s why it resonates with so many people.
Yoko: During the development process, was there any moment when you thought, “Wow, this is going to be a hit”?
Oliver Wang: To be honest, I didn’t think it would be a hit until the model was launched on the LMArena platform. At that time, we estimated that the traffic would be similar to that of previous models, but the number of visits soared, and even continuous quota increases couldn’t keep up. It was at that moment that I realized, “Wow, so many people really like using it.” Even though the model on that website was only accessible part of the time, everyone was willing to try it. That was the first “wow” moment for me.
Brichtova: For me, the “wow” moment came a bit earlier. I often test different generations of models with the same instructions, such as “Make me look like an astronaut,” “Take me on an adventure,” or “Walk the red carpet.” Then one time, running these instructions on an internal test build, the generated images actually looked like me for the first time. In the past, this kind of result could only be achieved with dedicated fine-tuning (such as LoRA), which meant uploading several images and spending a long time training. This time it was generated directly, zero-shot, and I was shocked on the spot. Later I gave an internal presentation, and every image in the deck was of my own face.
After more colleagues tried it for themselves, they were just as amazed. It’s fun to watch others use it, but putting yourself, your family (children, spouse), or even your pets into the model creates a much stronger sense of engagement, and that emotional resonance kicks in. Later, an “80s makeover” trend swept through the company. That was when we all realized: this thing really has potential.
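For readers curious what that zero-shot path looks like in practice, here is a minimal sketch using the google-genai Python SDK to edit a single reference photo with one instruction and no fine-tuning. The model identifier and output handling are assumptions based on the public Gemini API and may differ from the current documentation.

```python
from io import BytesIO

from google import genai
from PIL import Image

# Assumes a GEMINI_API_KEY environment variable; the client picks it up automatically.
client = genai.Client()

reference = Image.open("selfie.jpg")  # one reference photo, no LoRA training

response = client.models.generate_content(
    model="gemini-2.5-flash-image",  # assumed identifier for the "Nano Banana" model
    contents=[reference, "Make the person in this photo look like an astronaut walking the red carpet."],
)

# The response interleaves text and image parts; save any returned images.
for part in response.candidates[0].content.parts:
    if part.text is not None:
        print(part.text)
    elif part.inline_data is not None:
        Image.open(BytesIO(part.inline_data.data)).save("astronaut.png")
```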
Oliver Wang: Testing this kind of model is really interesting because you can see others create all kinds of amazing works with it—many of which you never thought of before.
Is It Disrupting Professional Creative Work?
Guido: In the long run, we are actually creating a new set of tools that can change visual art. What used to require complex manual operations in Photoshop can now be done with just one command. Then, how should art creation be taught in the future? What will art classes in universities be like in five years?
Brichtova: I think we’ll see a mix of scenarios. First, in the professional field, we have heard many creators say these models take the tedious parts of the work off their hands, letting them spend 90% of their time on creativity instead of 90% on editing and manual operations as before. I am very much looking forward to that, and I believe it will bring an explosion in the creative field.
For consumers, it roughly splits into two kinds of scenarios. One is doing fun things, such as designing Halloween costumes for your children and sharing them with family and friends; the other is handling practical tasks, such as making slide decks. I used to be a consultant, and I spent a lot of time making slides look good and tightening the logic of the story. In the future, you may only need to tell an agent what you want, and it can lay out the deck and generate visuals that match the content.
Overall, I think it depends on your needs: whether you want to participate in the creative process and collaborate with the model to make adjustments, or just want the model to complete the task with less participation from yourself.
Guido: Then in such a world, what counts as “art”? Some people say that art is the ability to create “samples outside the distribution.” Do you think this definition is accurate?
Oliver Wang: I think the “out-of-distribution” definition is too strict. Many great works of art are actually extensions within an existing artistic context. The definition of art itself is a philosophical question. For me, the most crucial part of art is the creator’s intention. AI generation is just a tool; real art comes from human choices and human expression.
I am not worried about professional artists or creators, because I’ve found that even when I use these models, I can’t create anything anyone wants to see, while they can always take the latest tools and make work with soul.
Justine: Many artists were unwilling to use AI before because they thought it was too difficult to control—such as inconsistent characters and unreusable styles. When training Nano Banana, did you specifically optimize these aspects?
Oliver Wang: Yes, we paid special attention to customizability and character consistency during development and tried hard to get them right. Iterative, conversational editing also matters, because art creation is itself an iterative process: you keep modifying, see where things are heading, and adjust further. The model works well in this regard, but there is still a lot of room for improvement. For example, in long conversations its instruction following degrades. That is a key point we are improving now; we want it to feel more like a natural creative partner.
Guido: When we talk about this topic with visual artists, we always get some very skeptical responses, such as “the results just aren’t good enough.” Don’t people realize that AI is just a new kind of tool that will eventually empower artists?
Oliver Wang: I think it comes down to how much control you have over the output. The earliest text-to-image models were like one-shot tools: you typed text and got an output. Ordinary users might think, “It looks okay, and at least I made it myself.” But this kind of model can make creative professionals uncomfortable, because they know most of the decisions are driven by the model and its training data, with no real participation from them.
Indeed, that can hardly be called real creation. As a creator, you should have greater freedom of self-expression. So I believe that as models become more controllable, concerns like “the computer did all of this” will dissipate.
On the other hand, there was a period when we were amazed by the images these models produced; seeing a piece, we would sincerely marvel, “Wow, a large model can actually do this.” But that sense of novelty passed quickly. Even the images that seemed most impressive at first can now be spotted at a glance: “Oh, this was made with a single prompt, and the author didn’t put much effort in.” Once the novelty fades, the bar for creation reappears: we still have to find ways to use AI tools to make genuinely interesting things, and that has always been hard. We still need artists, and they are the ones who do it best. I also think artists are better at telling which works involve real control, adjustment, and creative intent.
Brichtova: Art requires deep technical skill and aesthetic taste, which often take decades to develop. I don’t think these models have real aesthetic judgment, and the resistance mentioned earlier may stem from that.
We do collaborate closely with artists across disciplines, including image, video, and music, hoping to push the technological frontier together with them. Many of them are passionate, and what they really contribute is the professional knowledge built up over decades of design experience. We are working with Ross Lovegrove, having the model analyze his sketches in depth to create new works, and we even built physical prototype chairs to verify the designs in the real world.
Many artists are eager to bring their accumulated expertise, and the rich vocabulary they use to describe work, into conversations with the model, and in doing so push the boundaries of creation. To be clear, this is not something you achieve with a prompt dashed off in a minute or two; it takes a great deal of aesthetic accumulation, human creativity, and craftsmanship before it rises to the level of art.
Oliver Wang: There is also this phenomenon: most consumers of creative content, even those who follow it closely, don’t actually know what they like. It takes a visionary to create something novel and unique; only when such works are put in front of people do they exclaim, “This is amazing.” In other words, people are good at appreciating, but they cannot conceive of these works on their own.
So when we optimize the model, even though we tune it toward the public’s average preferences, we also realize it is hard to produce interesting results that way. Otherwise we end up with work that everyone finds merely okay but that doesn’t truly move anyone, rather than the kind of work that can completely change how people understand art.
Guido: So in the future, when children learn to draw, will they just doodle a few strokes on a tablet and have AI turn them into polished works?
Brichtova: I hope it won’t be like that (laughs). I’m not sure we need to turn every child’s drawing into a “beautiful image.” A better approach is for AI to act like a partner or a teacher. I can’t draw and have no talent for it, but I’d love these tools to teach children the steps of painting, offer revision suggestions, even hint at the next stroke, a bit like autocomplete for images, or present a few options and explain how to do them. I don’t want a 5-year-old’s painting to become “perfect,” because that would lose something important: the child’s creativity and unique perspective.
Oliver Wang: Interestingly, the opposite also holds: it is very hard to train the model to draw in a children’s-crayon style, because that style is highly abstract. It looks simple, but it is actually difficult.
Overall, I am very optimistic about AI in education. Most people are visual learners, yet current AI tutoring is still limited to text and voice, which is not how students learn. Imagine an AI that, when explaining a concept, could walk through the principles while showing matching pictures and animations; the learning effect would improve greatly. That would make knowledge more useful and accessible, which is very promising.
AI Tools: More Professional or Simpler?
Yoko: Since you released Nano Banana, many people have been talking about “editing models.” Oliver, you used to work at Adobe. What do you think about the evolution of the model layer and traditional software editing?
Oliver Wang: Professional tools like Adobe’s are characterized by lots of controls and lots of “buttons,” offering a high degree of control. Now there is a balancing act: we want ordinary people to be able to edit on their phones with voice commands, while professional creators can still make fine adjustments. We haven’t fully solved that balance yet, but many people are already building great UIs, and there are many ways to get there.
Brichtova: Personally, I hope that in the future there will be no need to learn what every control does. The model could intelligently recommend next steps based on the operations you have already performed. That is a direction worth exploring: future UIs may not require you to learn as many complex operations as before, because the tool will proactively suggest what it can do based on your behavior.
Guido: Professionals only care about results. They are willing to accept high complexity, and they have the training and experience for it. Cursor’s interface is not a simple single text prompt either. So in the future, will there be ultra-complex interfaces for professional users and simple interfaces for everyone else?
Oliver Wang: I actually really like node-based interfaces such as ComfyUI. They are complex, but extremely powerful. Many people now use Nano Banana to generate storyboards and video keyframes, and chain different models together into workflows, with amazing results. I think such interfaces are great for both professional and ordinary users. As for what professional tooling will look like in the future, that is still an open question.
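As a rough illustration of the kind of chained, multi-model workflow Wang describes, here is a tiny Python sketch of the same idea in linear form. The two helper functions are hypothetical placeholders standing in for real image and video model calls, not actual APIs.

```python
# Conceptual sketch: storyboard prompts -> keyframes -> short clips.
# Both helpers are hypothetical stand-ins for real model calls.

def generate_keyframe(prompt: str) -> bytes:
    """Placeholder for an image-model node (e.g. a Nano-Banana-style editor)."""
    return f"<image for: {prompt}>".encode()

def animate_between(frame_a: bytes, frame_b: bytes, seconds: float) -> bytes:
    """Placeholder for a video-model node that interpolates between two keyframes."""
    return frame_a + frame_b + f"<{seconds}s clip>".encode()

storyboard = [
    "Wide shot: a lighthouse at dawn, watercolor style",
    "Close-up: the keeper opening the door, same watercolor style",
    "Aerial shot: gulls circling the lighthouse, same watercolor style",
]

keyframes = [generate_keyframe(p) for p in storyboard]
clips = [animate_between(a, b, seconds=4.0) for a, b in zip(keyframes, keyframes[1:])]
print(f"{len(keyframes)} keyframes -> {len(clips)} clips")
```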
Brichtova: It depends on the target users. For people like my parents, the chat interface is very easy: they just upload a picture and say “help me change this,” with no new tools to learn. Professional creators need strong control. And there is a group in the middle who want to create but are intimidated by professional tools; new interface forms will emerge for them too, and there is a big opportunity and a lot of unmet need there.
Yoko: Will the future be dominated by “one model for all purposes” or “collaboration between multiple models”?
Oliver Wang: I definitely don’t think any single model can meet every need; the future will have a variety of models. For example, we will optimize some models for instruction following, to make sure they do exactly what the user asks, but such models may not suit scenarios that call for inspiration, where users want the model to be freer and to go off-script to spark ideas.
Multimodal Capabilities Become a Must
Yoko: Do you think that, in the future, a model must have multimodal capabilities spanning images, language, and audio all at once in order to be a leading large language model or visual model?
Oliver Wang: I agree 100%, and I firmly believe it should be the case. What excites me most about AI models is that they can become tools that help humans achieve more. Imagine a future of autonomously operating models that talk to each other and do all the work; in that world, the need for visual communication would certainly decrease. But as long as humans are involved, and as long as the motivation to solve a task comes from humans, the visual modality will remain crucial for AI agents. That seems like an entirely logical conclusion.
Guido: We will eventually have a model where you give it an image-generation request and it thinks for an hour or two, sketches drafts, explores different directions, and finally presents the results.
Brichtova: And it won’t be limited to a single image. Suppose someone is redesigning their house and doesn’t want to get involved in the details. They could simply provide inspiration material, say “I like this style,” and hand it to the model the way they would brief a designer.