Google has introduced Gemini, a new AI model aimed at enhancing the capabilities of its Bard AI chatbot by incorporating an understanding of video, audio, and photos. Initially available to Google Pixel 8 phone owners, Gemini is set to expand its reach to Gmail and other Google Workspace tools in early 2024.
Gemini’s deployment began in December, primarily in English, with text-based chat abilities that improve AI performance on complex tasks such as document summarization, reasoning, planning, and writing programming code. A forthcoming update will add multimedia capabilities, enabling the chatbot to interpret elements like hand gestures in videos or children’s dot-to-dot drawings.
This development highlights the rapid progress in generative AI, where chatbots generate responses from plain-language prompts rather than traditional programming instructions. With this third major revision of its AI model, Google aims to integrate Gemini across widely used products such as Search, Chrome, Google Docs, and Gmail.
Gemini is not only targeting end users but also developers. Google has made Gemini available through its AI Studio web interface and Vertex AI, offering reduced prices to encourage adoption, with the aim of getting developers to build Gemini into a wide range of software tools. Google also plans to bring Gemini to its own services, including the Duet AI assistant in Gmail, Google Docs, Meet, and other parts of Google Workspace.
Thomas Kurian, CEO of the Google Cloud division, revealed that Duet AI for Workspace will transition to Gemini in early 2024. This transition promises richer functionalities, such as turning hand drawings into photorealistic versions for presentations or providing enhanced understanding during multilingual video conferences.
Gemini represents a departure from traditional text-based chat, aiming to mimic human understanding of the dynamic, three-dimensional world through complex communication abilities like speech and imagery.
Gemini comes in three versions tailored for different computing power levels:
- Gemini Nano: Designed for mobile phones, it powers features on Google’s Pixel 8 phones and offers capabilities like conversation summarization and message reply suggestions.
- Gemini Pro: Tuned for fast responses, it runs in Google’s data centers and powers the latest version of Bard.
- Gemini Ultra: Currently limited to a test group, it will be available in a new Bard Advanced chatbot in early 2024, with pricing details yet to be disclosed.
Eli Collins, a product vice president at Google’s DeepMind division, expressed that Gemini represents a step closer to building AI models that emulate a helpful collaborator rather than a mere piece of software.
The competitive landscape includes OpenAI, whose technology powers Microsoft’s Copilot and whose recently released GPT-4 Turbo model competes directly with Gemini. Both Google and Microsoft are integrating AI features into significant products such as Office, Windows, and more.
AI Is Advancing In Intelligence, But It Remains Imperfect
While multimedia capabilities are expected to bring significant changes compared to text-based interactions, the underlying challenges persist. AI models, trained on extensive real-world data to recognize patterns, can generate sophisticated responses to complex prompts. However, there’s a lingering issue of trust, as these models may provide plausible answers rather than accurate ones. Google’s chatbot, Bard, explicitly warns users about potential response inaccuracies.
Gemini represents the next generation of Google’s large language model, succeeding PaLM and PaLM 2, the foundations of Bard. Gemini stands out by being trained simultaneously on text, programming code, images, audio, and video, allowing it to handle multimedia input more efficiently than separate models for each mode.
Google’s research paper highlights diverse examples of Gemini’s capabilities. Gemini demonstrates versatility, from predicting the next shape in a series to identifying links between photos of the moon and a golf ball, or converting bar charts into labeled tables. It can even process handwritten physics problems, detect errors, and suggest corrections. While Google showcased a demo video featuring Gemini recognizing hand gestures and solving visual challenges, it’s important to note that these were dramatizations rather than real-time demonstrations.
Google’s Gemini Viral Promotional Video
The promotional video, while not fundamentally misrepresenting Gemini’s abilities, carried disclaimers about response speed and a link to a discussion of how the demo worked. Despite this transparency, external testing remains limited, leaving open questions about Gemini’s real-world performance. Nevertheless, Gemini is designed to accept both spoken and video input, pointing toward more interactive and varied AI interactions in the future.
Gemini Ultra is undergoing extensive testing before its anticipated release next year.
The testing process includes “red teaming,” where external individuals are enlisted to identify security vulnerabilities and other issues associated with Gemini Ultra. Evaluating the model’s performance, especially with multimedia input data, poses additional complexities. The significance lies in understanding how seemingly harmless individual elements, such as text messages and photos, can convey drastically different meanings when combined.
In addressing this challenge, Google CEO Sundar Pichai emphasized a bold and responsible approach: ambitious research with significant potential benefits, accompanied by safeguards. Pichai also highlighted the importance of collaborating with governments and other stakeholders to proactively address risks as AI capabilities advance.