Language Model Benchmark Results

Last updated:

Gemma 3 Small

Model: gemma3:1b:Q4_K_M

Launch Date: 2025-03-12

Size: 815 MB

License: Gemma License

QWEN 2.5 Small

Model: qwen2.5:1.5b:Q4_K_M

Launch Date: 2024-09-15

Size: 986 MB

License: Apache License

Gemma 2 Small

Model: gemma2:2b:Q4_0

Launch Date: 2024-06-07

Size: 1600 MB

License: Gemma License

SmolLM 2

Model: smollm2:1.7b:Q8_0

Launch Date: 2024-10-31

Size: 1800 MB

License: Apache License

Llama 3.2

Model: llama3.2:3b:Q4_K_M

Launch Date: 2024-09-25

Size: 2000 MB

License: Llama 3.2 License

GPT-4o-mini

Model: gpt-4o-mini-2024-07-18

Launch Date: 2024-07-18

Size: 2047 MB

License: Closed Model

Gemma 3

Model: gemma3:4b:Q4_K_M

Launch Date: 2025-03-12

Size: 4300 MB

License: Gemma License

QWEN 2.5

Model: qwen2.5:7b:Q4_K_M

Launch Date: 2024-09-15

Size: 4700 MB

License: Apache License

Gemma 2

Model: gemma2:9b:Q4_0

Launch Date: 2024-06-27

Size: 5400 MB

License: Gemma License

Phi 4

Model: phi4:14b:Q4_K_M

Launch Date: 2025-01-08

Size: 9100 MB

License: MIT License

Word Length

Benchmark ID: 0011_word_length

Description: A benchmark to evaluate a model's ability to count the total number of letters in a given word.

Letter Count

Benchmark ID: 0012_letter_count

Description: A benchmark to evaluate a model's ability to count how many times a specific letter appears in a word.
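For the two counting benchmarks (Word Length and Letter Count), the expected answer can be computed mechanically. The following is a minimal sketch of such a reference computation, not the benchmark's actual scorer, which the source does not describe. The function names are hypothetical, and the choices to ignore non-letter characters and to match case-insensitively are assumptions.

```python
def word_length(word: str) -> int:
    """Count the alphabetic characters in a word.

    Assumption: hyphens, apostrophes, and digits are not counted as letters.
    """
    return sum(1 for ch in word if ch.isalpha())


def letter_count(word: str, letter: str) -> int:
    """Count how many times a letter appears in a word.

    Assumption: matching is case-insensitive.
    """
    return word.lower().count(letter.lower())


if __name__ == "__main__":
    print(word_length("strawberry"))        # 10
    print(letter_count("strawberry", "r"))  # 3
```

A model's reply would then be compared against these values, for example after extracting the first integer from its response.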

Spell Check

Benchmark ID: 0015_spell_check

Description: A benchmark to evaluate a model's ability to identify misspelled words in a sentence and provide their correct spelling.

Antonym Identification

Benchmark ID: 0016_antonym

Description: A benchmark to evaluate a model's ability to identify the correct antonym of a word from a list of options.

Definitions

Benchmark ID: 0020_definitions

Description: A benchmark to evaluate a model's ability to identify the correct definition of words.

Unit Conversion

Benchmark ID: 0022_unit_conversion

Description: A benchmark to evaluate a model's ability to accurately convert between different units of measurement.
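Numeric answers to conversion questions rarely match the reference digit for digit, so a checker typically needs a tolerance. The sketch below illustrates one way to do that for length units. It is an assumption about how such a check could work, not the benchmark's documented method; the unit table, the function names, and the 1% relative tolerance are all hypothetical.

```python
import math

# Conversion factors to metres (these values are exact by definition).
TO_METRES = {
    "m": 1.0,
    "km": 1000.0,
    "cm": 0.01,
    "in": 0.0254,
    "ft": 0.3048,
    "mi": 1609.344,
}


def convert_length(value: float, src: str, dst: str) -> float:
    """Convert a length from unit src to unit dst via metres."""
    return value * TO_METRES[src] / TO_METRES[dst]


def is_correct(answer: float, value: float, src: str, dst: str,
               rel_tol: float = 0.01) -> bool:
    """Accept an answer within rel_tol of the reference (assumed tolerance)."""
    expected = convert_length(value, src, dst)
    return math.isclose(answer, expected, rel_tol=rel_tol)


if __name__ == "__main__":
    print(is_correct(3.28, 1, "m", "ft"))  # True: the reference is about 3.28084
    print(is_correct(3.0, 1, "m", "ft"))   # False: off by more than 1%
```

A tighter or looser tolerance changes how much rounding in the model's answer is forgiven, so the chosen value should be reported alongside the scores.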

Part of Speech

Benchmark ID: 0032_part_of_speech

Description: A benchmark to evaluate a model's ability to identify the part of speech of a specific word in a sentence.

Translation (EN → FR)

Benchmark ID: 0050_translation_en_fr

Description: Tests a model's ability to translate EN words to FR, with multiple-choice validation.

Translation (EN → ZH)

Benchmark ID: 0050_translation_en_zh

Description: Tests a model's ability to translate EN words to ZH, with multiple-choice validation.

Translation (SW → KO)

Benchmark ID: 0050_translation_sw_ko

Description: Tests a model's ability to translate SW words to KO, with multiple-choice validation.
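The translation benchmarks above are validated through multiple-choice questions. The source does not say how responses are parsed or scored, so the following is only a sketch under stated assumptions: options are labelled with capital letters starting at A, the model's choice is the first standalone option letter in its reply, and the score is plain accuracy. The names extract_choice and accuracy are hypothetical.

```python
import re
from typing import Optional, Sequence


def extract_choice(response: str, n_options: int = 4) -> Optional[str]:
    """Return the first standalone capital option letter, or None.

    Limitation: a reply that begins with the article "A" is read as choice A.
    """
    letters = "ABCDEFGH"[:n_options]
    match = re.search(rf"\b([{letters}])\b", response)
    return match.group(1) if match else None


def accuracy(responses: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of responses whose extracted choice equals the answer key."""
    if not answers:
        return 0.0
    correct = sum(extract_choice(r) == a for r, a in zip(responses, answers))
    return correct / len(answers)


if __name__ == "__main__":
    replies = ["B", "Answer: A", "D"]
    key = ["B", "A", "C"]
    print(accuracy(replies, key))  # two of three replies match the key
```

Unparseable replies are scored as wrong here; a stricter harness might instead track them separately to distinguish formatting failures from wrong answers.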

Pinyin Letter Count

Benchmark ID: 0051_pinyin_letters

Description: A benchmark to evaluate a model's ability to count how many times a specific letter appears in the Pinyin representation of a Chinese sentence.
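Counting letters in Pinyin has one subtlety: tone marks turn vowels into distinct code points, so "ǎ" does not match "a" in a naive string count. A minimal sketch of a reference count, assuming the Pinyin string is already available (converting Chinese characters to Pinyin is outside this sketch), and assuming toned vowels count as their base letter. Note that this approach also folds "ü" into "u", which may or may not match the benchmark's rules. The function names are hypothetical.

```python
import unicodedata


def strip_tone_marks(pinyin: str) -> str:
    """Remove combining diacritics after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", pinyin)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def pinyin_letter_count(pinyin: str, letter: str) -> int:
    """Case-insensitive count of a base letter in a Pinyin string."""
    return strip_tone_marks(pinyin).lower().count(letter.lower())


if __name__ == "__main__":
    print(strip_tone_marks("nǐ hǎo ma"))          # ni hao ma
    print(pinyin_letter_count("nǐ hǎo ma", "a"))  # 2
```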

Geography Knowledge

Benchmark ID: 0120_geography

Description: A benchmark to evaluate a model's knowledge of world geography through multiple-choice questions about countries, capitals, physical features, and other geographical information.