# Choosing the right model

## Cloud-based models

LLM Vision is compatible with multiple providers, each of which has different models available. Some providers run in the cloud, while others are self-hosted.\
To see which model is best for your use case, check the figure below. It visualizes the averaged MMMU[^1] scores of available cloud-based models. The higher the score, the more accurate the output.

{% hint style="info" %}
**`gpt-5-mini`** is the recommended model due to its strong performance-to-price ratio.
{% endhint %}

<figure><img src="https://2802862115-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FFIhUccwfwWnHypSNsrKL%2Fuploads%2FkckidtbmhZeJNSqem1HM%2Fbenchmark_visualization.jpg?alt=media&#x26;token=67b8656f-a791-44fb-9f71-103d27c0a275" alt=""><figcaption><p>Data is based on the <a href="https://mmmu-benchmark.github.io/#leaderboard">MMMU Leaderboard</a></p></figcaption></figure>
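
If you want to sanity-check a cloud model on one of your own snapshots before configuring it in LLM Vision, the sketch below sends an image to `gpt-5-mini` through the OpenAI Python SDK. This is an illustrative test outside of LLM Vision, not part of the integration; the `snapshot.jpg` file name and the prompt are assumptions.

```python
# Minimal sketch: ask gpt-5-mini to describe a local image via the OpenAI API.
# Assumes `pip install openai` and an OPENAI_API_KEY environment variable.
import base64

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode a sample snapshot (hypothetical file name) as base64 for the request.
with open("snapshot.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-5-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is happening in this image."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```

If the description is accurate enough for your automations, the model is a reasonable pick; otherwise, move up to a higher-scoring model from the chart.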

## Self-hosted models

{% hint style="info" %}
**`gemma3:12b`** is the recommended model for self-hosting, offering performance comparable to `gpt-4o-mini` while fitting within 12GB of VRAM.
{% endhint %}

<figure><img src="https://2802862115-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FFIhUccwfwWnHypSNsrKL%2Fuploads%2FQWGyOZugogq2P5G3L6YF%2Fopen_source_benchmark_visualization.jpg?alt=media&#x26;token=a8792937-e1f5-4604-8ae3-daed663c20c5" alt=""><figcaption><p>Data is based on the <a href="https://mmmu-benchmark.github.io/#leaderboard">MMMU Leaderboard</a></p></figcaption></figure>
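
A similar check can be run against a self-hosted model served by Ollama. The sketch below assumes the `ollama` Python package, a running Ollama server with `gemma3:12b` already pulled (`ollama pull gemma3:12b`), and a hypothetical `snapshot.jpg`; it is not part of LLM Vision itself.

```python
# Minimal sketch: ask a local gemma3:12b (served by Ollama) to describe an image.
# Assumes `pip install ollama` and a running Ollama server with the model pulled.
import ollama

response = ollama.chat(
    model="gemma3:12b",
    messages=[
        {
            "role": "user",
            "content": "Describe what is happening in this image.",
            # Hypothetical sample image; Ollama accepts file paths or base64 strings.
            "images": ["snapshot.jpg"],
        }
    ],
)

print(response["message"]["content"])
```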

[^1]: MMMU stands for "Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark". It assesses multimodal capabilities including image understanding.
