Code Generation Evaluation Results

Generated on: 2025-04-02 10:34:58

Query Types Explained

Summary Statistics

Model Performance by Category
Each category column lists the orig / suffix / alt_suffix scores, in that order.

Model                                total_prompts  Overall      Filter       Group        Iter         Sort         Sorting      Tform        Viz          Window
api/deepseek/deepseek-chat-v3-0324   86             86/91/88     100/87/87    85/100/92    100/100/100  100/100/100  100/100/100  84/92/88     0/0/0        100/100/100
api/claude-3-7-sonnet-20250219       86             75/95/90     100/100/100  100/85/92    100/100/100  100/100/100  100/100/75   64/100/92    0/0/0        100/100/100
api/claude-3-5-sonnet-20241022       86             68/93/87     100/100/100  42/92/78     100/100/50   100/100/100  100/100/100  67/94/90     0/0/0        100/100/100
api/google/gemma-3-27b-it            86             61/87/75     100/100/100  35/78/85     100/50/50    100/100/100  75/100/75    64/90/75     0/0/0        0/100/0
api/gpt-4o                           86             56/79/86     100/87/100   21/64/92     50/50/50     100/100/100  75/100/100   62/83/88     0/0/0        0/100/0
api/meta-llama/llama-3-70b-instruct  86             39/72/41     87/87/75     21/35/14     0/50/0       100/100/100  75/100/75    37/81/45     0/0/0        0/50/0
api/gemini-2.0-flash                 86             15/13/15     50/50/50     14/14/14     0/0/0        0/0/0        0/0/0        13/11/13     0/0/0        0/0/0
api/gemini-2.5-pro-exp-03-25         86             2/2/2        0/0/0        7/7/7        0/0/0        0/0/0        0/0/0        1/1/1        0/0/0        0/0/0
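
One way to read the overall columns is to look at how far each model's score moves between the orig prompts and the two suffix variants. The sketch below is only an illustration: the scores are copied from the overall_orig, overall_suffix, and overall_alt_suffix columns of the table above, the function names are hypothetical, and it assumes the three columns are on the same scale so that a plain difference is meaningful.

```python
# Overall scores copied from the table above: (orig, suffix, alt_suffix).
OVERALL = {
    "api/deepseek/deepseek-chat-v3-0324": (86, 91, 88),
    "api/claude-3-7-sonnet-20250219": (75, 95, 90),
    "api/claude-3-5-sonnet-20241022": (68, 93, 87),
    "api/google/gemma-3-27b-it": (61, 87, 75),
    "api/gpt-4o": (56, 79, 86),
    "api/meta-llama/llama-3-70b-instruct": (39, 72, 41),
    "api/gemini-2.0-flash": (15, 13, 15),
    "api/gemini-2.5-pro-exp-03-25": (2, 2, 2),
}


def prompt_variant_deltas(scores):
    """Return {model: (suffix - orig, alt_suffix - orig)} (hypothetical helper)."""
    return {
        model: (suffix - orig, alt - orig)
        for model, (orig, suffix, alt) in scores.items()
    }


def rank_by_suffix_gain(scores):
    """Sort models by their suffix-minus-orig difference, largest first."""
    deltas = prompt_variant_deltas(scores)
    return sorted(deltas.items(), key=lambda item: item[1][0], reverse=True)


if __name__ == "__main__":
    for model, (d_suffix, d_alt) in rank_by_suffix_gain(OVERALL):
        print(f"{model:<36} suffix {d_suffix:+4d}  alt_suffix {d_alt:+4d}")
```

The differences say nothing about why a variant helps or hurts; they only make the gaps between columns easier to compare across models.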

Visualizations