How we evaluate LLM sampling strategies through creative writing quality and MMLU-Pro accuracy benchmarks.
5 Sampling Strategies
20 Samples Per Strategy
2 LLM Judges
5 Quality Criteria
Local inference using KoboldCpp server
20 samples per strategy: 5 creative prompts, 4 repetitions each
300-400 words per story with compliance scoring
Multiple-choice questions with single-letter answers, scored by exact match
Kimi-K2 (Chinese) and Mistral Medium 3 (European)
Both judges evaluate each sample, and their scores are averaged
Word-count compliance and instruction-following analysis
Fraction of correct responses; no judges needed
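The bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: the function names, the judge labels, and the criterion keys (`coherence`, `originality`) are hypothetical, and normalizing answers by trimming whitespace and ignoring case before the exact-match comparison is an assumption.

```python
from statistics import mean


def build_sample_plan(prompts, reps=4):
    """Pair every prompt with each repetition index (5 prompts x 4 reps = 20)."""
    return [(prompt, rep) for prompt in prompts for rep in range(reps)]


def average_judge_scores(scores_by_judge):
    """Average each criterion across judges for a single sample.

    scores_by_judge maps a judge name to a {criterion: score} dict.
    Assumes every judge scored the same set of criteria.
    """
    criteria = next(iter(scores_by_judge.values())).keys()
    return {
        criterion: mean(scores[criterion] for scores in scores_by_judge.values())
        for criterion in criteria
    }


def exact_match_accuracy(predictions, answers):
    """Fraction of predicted letters equal to the reference letters.

    Assumption: whitespace is trimmed and case is ignored before comparing.
    """
    correct = sum(
        pred.strip().upper() == ans.strip().upper()
        for pred, ans in zip(predictions, answers)
    )
    return correct / len(answers)


plan = build_sample_plan(["p1", "p2", "p3", "p4", "p5"])
averaged = average_judge_scores({
    "judge_a": {"coherence": 8, "originality": 6},  # hypothetical criterion keys
    "judge_b": {"coherence": 7, "originality": 9},
})
accuracy = exact_match_accuracy(["A", "b ", "C", "D"], ["A", "B", "D", "D"])
```

Averaging per criterion, rather than collapsing to a single number first, keeps the per-criterion detail available for later comparison between strategies.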
How well the story flows and maintains logical consistency throughout the narrative arc.
Uniqueness of ideas, plot elements, and creative expression that distinguishes the work.
Depth and believability of characters, their motivations, and growth within the story.
How engaging and accessible the text is for readers, maintaining interest throughout.
Writing style, language use, sentence variety, and mastery of literary technique.
The five sampling configurations tested across all models. Each strategy represents a different approach to controlling token selection randomness.
Resolves to model-specific optimal settings (e.g., Llama: temp 0.6, top_p 0.9)
Conservative min-p sampling with lower temperature for controlled creativity
Moderate min-p sampling with standard temperature for balanced output
Top-nσ with high temperature and tight deviation threshold
Top-nσ with moderate temperature and relaxed sigma threshold
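To make the two less familiar filters concrete, the sketch below shows how min-p and top-nσ are commonly defined. It is an illustration under stated assumptions, not the inference server's implementation: the function names are hypothetical, the threshold values in the usage lines are made up for the example, temperature is assumed to be applied before the min-p filter, and the population standard deviation is assumed for top-nσ.

```python
import math
from statistics import pstdev


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def min_p_keep(logits, min_p, temperature=1.0):
    """Indices of tokens whose probability is at least min_p times the top probability."""
    probs = softmax(logits, temperature)
    cutoff = min_p * max(probs)
    return [i for i, p in enumerate(probs) if p >= cutoff]


def top_n_sigma_keep(logits, n):
    """Indices of tokens whose logit lies within n standard deviations of the max logit.

    Assumption: population standard deviation over the full logit vector.
    """
    threshold = max(logits) - n * pstdev(logits)
    return [i for i, x in enumerate(logits) if x >= threshold]


logits = [5.0, 4.0, 1.0, 0.0]  # toy four-token vocabulary
strict_min_p = min_p_keep(logits, min_p=0.1)
loose_min_p = min_p_keep(logits, min_p=0.01)
tight_sigma = top_n_sigma_keep(logits, n=1.0)
relaxed_sigma = top_n_sigma_keep(logits, n=2.0)
```

In both filters the surviving tokens are renormalized and sampled from as usual. A smaller `min_p` or a larger `n` admits more of the tail, which is the sense in which a threshold is "relaxed" rather than "tight".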
Objective measure of instruction adherence
Each prompt asks for a story of 300-400 words. We track the share of stories that land in that range as an objective measure of instruction following; a higher compliance rate indicates better adherence.
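A compliance check of this kind can be sketched as follows. The function names are hypothetical, and counting words by splitting on whitespace is an assumption; the pipeline may tokenize differently.

```python
def word_count(text):
    """Count whitespace-separated words (assumed counting rule)."""
    return len(text.split())


def is_compliant(text, low=300, high=400):
    """True when the story length falls inside the requested range, inclusive."""
    return low <= word_count(text) <= high


def compliance_rate(stories, low=300, high=400):
    """Fraction of stories whose word count is within [low, high]."""
    if not stories:
        raise ValueError("no stories to score")
    return sum(is_compliant(s, low, high) for s in stories) / len(stories)


stories = [
    " ".join(["word"] * 300),  # exactly at the lower bound: compliant
    " ".join(["word"] * 350),  # inside the range: compliant
    " ".join(["word"] * 401),  # one word over: not compliant
    "too short",               # far under: not compliant
]
rate = compliance_rate(stories)
```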
What we measure for each sample
Understanding the 1-10 quality scale used for creative writing evaluation.
9-10: Outstanding creative writing with excellence across all criteria
7-8: Strong writing with minor areas for improvement
5-6: Adequate quality with balanced strengths and weaknesses
1-4: Significant issues requiring attention
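Because judge scores are averaged, a sample can end up with a fractional score such as 8.5 that sits between two of the listed ranges. The sketch below shows one way to map a score to a band; the function name is hypothetical, and treating each band's lower bound as the cutoff (so 8.5 counts as 7-8) is an assumption, not a documented rule.

```python
def score_band(score):
    """Map a 1-10 score to its band label, using lower bounds as cutoffs (assumed)."""
    if not 1 <= score <= 10:
        raise ValueError("score must be between 1 and 10")
    if score >= 9:
        return "9-10"
    if score >= 7:
        return "7-8"
    if score >= 5:
        return "5-6"
    return "1-4"


bands = [score_band(s) for s in (9.5, 8.5, 7, 5.0, 4.9)]
```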