GPT-4

GPT-4 Technical Report

OpenAI · 2023 · 6 sections · 14 sentences

Introduction

We report the development of GPT-4, a large multimodal model capable of processing image and text inputs and producing text outputs.

While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers.

One of the primary goals of developing GPT-4 was to improve the model's ability to follow human intent and to be helpful, harmless, and honest.

Predictable Scaling

A large focus of the GPT-4 project was developing scalable training infrastructure and methods. A central challenge was to predict the performance of GPT-4 using much smaller models trained with the same methodology.

We were able to make predictions about the final performance of GPT-4 on internal evaluations before training was completed, validating the approach and instilling confidence in our ability to train large models reliably.
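
The idea of predicting a large model's performance from much smaller runs can be illustrated with a minimal sketch. It assumes that final loss follows a simple power law in training compute, loss = a * compute^b, and fits that curve in log-log space. The functional form, the synthetic data points, and the names `fit_power_law` and `predict_loss` are illustrative assumptions, not the procedure or data used for GPT-4.

```python
import math


def fit_power_law(compute, loss):
    """Least-squares fit of loss = a * compute**b, done in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)  # intercept in log-log space = log(a)
    return a, b


def predict_loss(a, b, compute):
    """Extrapolate the fitted curve to a new compute budget."""
    return a * compute ** b


# Hypothetical small-model runs: compute is expressed as a fraction of the
# full-size run, and the losses are synthetic (generated from a known curve).
small_compute = [1e-6, 1e-5, 1e-4, 1e-3]
small_loss = [3.0 * c ** -0.05 for c in small_compute]

a, b = fit_power_law(small_compute, small_loss)
predicted = predict_loss(a, b, 1.0)  # extrapolate to the full-size run
print(f"a={a:.3f}, b={b:.3f}, predicted full-scale loss={predicted:.3f}")
```

Because the synthetic points lie exactly on a power law, the fit recovers the generating parameters. Real training runs are noisy, so a practical fit would need more points and possibly a different functional form.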

Capabilities and Evaluation

We evaluated GPT-4 on a diverse set of benchmarks, including simulating exams that were originally designed for humans. We selected 57 exams, including the LSAT, SAT, GRE, AP exams, and professional licensing exams.

GPT-4 substantially improves over GPT-3.5. It outperforms GPT-3.5 by about 16 percentage points on the MMLU benchmark (86.4% versus 70.0%), and it scores around the top 10% of test takers on a simulated bar exam, whereas GPT-3.5 scored around the bottom 10%.

GPT-4 accepts any combination of text and images as input. Across a range of domains, including documents with text and photographs, diagrams, or screenshots, GPT-4 demonstrates capabilities similar to those it shows on text-only inputs.

Limitations

GPT-4 still has many known limitations that we are working to address. It is not fully reliable—it can suffer from hallucinations and often makes reasoning errors.

GPT-4's knowledge cutoff is September 2021, meaning it lacks knowledge of events after that date. It also doesn't learn from experience—every conversation starts fresh.

GPT-4 does not have a persistent memory of past conversations. Its context window of 8,192 to 32,768 tokens limits the amount of information it can consider at once.
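
One practical consequence of a fixed context window is that an application must decide which parts of a long conversation to send. The sketch below keeps only the most recent messages that fit within a token budget. It is an illustration under stated assumptions: `count_tokens` uses a whitespace word count as a crude stand-in for a real tokenizer, and `fit_to_context` is a hypothetical helper, not part of any OpenAI API.

```python
def count_tokens(text):
    # Assumption: whitespace-separated words approximate tokens.
    # A real tokenizer would give different (usually higher) counts.
    return len(text.split())


def fit_to_context(messages, max_tokens):
    """Keep the newest messages whose combined token count fits the budget."""
    kept = []
    used = 0
    # Walk from newest to oldest so recent context is preserved first.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    "first message here",
    "second message",
    "the most recent message in the chat",
]
trimmed = fit_to_context(history, max_tokens=9)
print(trimmed)
```

With a budget of 9 approximate tokens, the oldest message is dropped and the two most recent ones are kept. Dropping the oldest messages first is only one possible policy; summarizing older turns is another common choice.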

Safety and Alignment

GPT-4 poses risks similar to those of prior models. The model can generate potentially harmful content such as advice for planning attacks, misinformation, hate speech, and private information.

We used reinforcement learning from human feedback (RLHF) to make GPT-4 follow a set of principles derived from our usage policies. These principles prioritize safety, helpfulness, and honesty.

Conclusion

GPT-4 is a large multimodal model with human-level performance on various professional and academic benchmarks. We have shown that it is possible to train large-scale AI systems that are capable of complex reasoning tasks and are safer and more aligned with human values.

Source: GPT-4 Technical Report by OpenAI (2023)

This content has been restructured for study purposes.