Gemini: A Family of Highly Capable Multimodal Models

Google DeepMind Gemini Team · 2023 · 5 sections · 11 passages

Introduction

We introduce Gemini, a new family of multimodal models developed at Google. Gemini models are designed from the ground up to be natively multimodal, processing and reasoning over text, images, audio, video, and code.

We introduce a new family of models in three sizes: Ultra, Pro, and Nano. These models are designed to run efficiently across a wide range of hardware, from data centers to mobile devices.

Gemini Ultra achieves state-of-the-art performance on a wide range of benchmarks, being the first model to outperform human experts on MMLU, one of the most popular methods to test the knowledge and problem solving abilities of AI models.

Model Architecture

Gemini models are based on Transformer decoders, enhanced with improvements in architecture and model optimization to enable stable training at scale. We train Gemini models using Google's TPU infrastructure.
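
The passage above says only that the models are built on Transformer decoders. As general background on what "decoder" implies, here is a minimal sketch of the causal attention mask used by decoder-style Transformers, where each position may attend only to itself and to earlier positions. The function names are hypothetical and this is not taken from Gemini's implementation.

```python
from typing import List


def causal_mask(seq_len: int) -> List[List[bool]]:
    """Build a causal mask: mask[i][j] is True when position i may attend to j."""
    # A decoder position can see itself and everything before it (j <= i).
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def visible_positions(mask: List[List[bool]], i: int) -> List[int]:
    """List the positions that position i is allowed to attend to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


mask = causal_mask(4)
# The last position sees the whole prefix; the first sees only itself.
first_view = visible_positions(mask, 0)
last_view = visible_positions(mask, 3)
```

In a real model this mask is applied to the attention scores before the softmax, so that generation at each step depends only on tokens already produced.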

Gemini supports interleaved sequences of text, image, audio, and video as inputs. Images are encoded using a variant of Vision Transformer (ViT). Audio is encoded using a Universal Speech Model (USM) encoder.
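
To make "interleaved sequences" concrete, the following is a small sketch of how such an input could be represented as an ordered list of typed segments. The `Segment` class, the `modality_order` helper, and the file names are all hypothetical illustrations; the passage does not describe Gemini's actual input format or API at this level.

```python
from dataclasses import dataclass
from typing import List, Literal

Modality = Literal["text", "image", "audio", "video"]


@dataclass(frozen=True)
class Segment:
    """One piece of an interleaved prompt (hypothetical structure)."""
    modality: Modality
    payload: str  # the text itself, or a placeholder file reference


def modality_order(segments: List[Segment]) -> List[str]:
    """Return the modalities in input order, merging adjacent repeats."""
    order: List[str] = []
    for seg in segments:
        if not order or order[-1] != seg.modality:
            order.append(seg.modality)
    return order


# Text and media alternate freely within a single input sequence.
prompt = [
    Segment("text", "What is happening in this picture?"),
    Segment("image", "photo.png"),
    Segment("text", "And what is said in this clip?"),
    Segment("audio", "clip.wav"),
]
order = modality_order(prompt)
```

The point of the sketch is only that order is preserved across modalities: the model receives one sequence, not a separate bundle per modality.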

Evaluation

On text benchmarks, Gemini Ultra achieves state-of-the-art performance on 30 out of 32 benchmarks we evaluated. It outperforms GPT-4 on 24 of 26 benchmarks where both models are compared.

On multimodal benchmarks, Gemini Ultra achieves state-of-the-art on all image understanding benchmarks. In video understanding, it outperforms all previous models by a significant margin.

On coding benchmarks, Gemini Ultra achieves state-of-the-art performance on HumanEval (74.4%), representing a significant improvement over previous models. It also demonstrates strong performance on competition-level math problems.
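
HumanEval scores are commonly reported as pass@k, the estimated probability that at least one of k sampled solutions passes a problem's unit tests. The sketch below implements the standard unbiased pass@k estimator that was introduced alongside the benchmark. It assumes the 74.4% figure is a pass@1 score, which the passage above does not state, and the sample results used here are made up for illustration.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k samples include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results: List[Tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over problems, given (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Illustrative data only: four problems, one sample each, three solved.
toy_results = [(1, 1), (1, 1), (1, 0), (1, 1)]
toy_score = benchmark_score(toy_results, k=1)
```

With one sample per problem, pass@1 reduces to the fraction of problems solved, which is how a single headline percentage like the one above is usually read.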

Responsible Deployment

We conducted extensive safety evaluations of Gemini models, including testing for harmful content, bias, and the potential for misuse. We use a combination of automated testing and human evaluation to assess safety.

We are committed to working with governments, researchers, and civil society to develop appropriate governance frameworks for powerful AI systems.

Conclusion

We present Gemini, a family of highly capable multimodal models that achieve state-of-the-art performance across a wide variety of tasks. The native multimodal capabilities of Gemini represent a significant advance over prior approaches.

Original source: Gemini: A Family of Highly Capable Multimodal Models by Google DeepMind Gemini Team (2023)

This content has been restructured for learning purposes.