
Methodology

How we evaluate LLM sampling strategies through creative writing quality and MMLU-Pro accuracy benchmarks.

  • 5 sampling strategies
  • 20 samples per strategy
  • 2 LLM judges
  • 5 quality criteria

Evaluation Process

Generation

Model Setup

Local inference using KoboldCpp server

Sampling

20 samples per strategy using 5 creative prompts (4 reps each)

Target Length

300-400 words per story with compliance scoring

MMLU Mode

Multiple-choice questions answered with a single letter and scored by exact match
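The creative-writing generation plan above (5 prompts, 4 repetitions each, for every strategy) can be sketched as a simple job list. This is a minimal sketch, not the pipeline's actual code: the prompt identifiers and the `build_jobs` helper are hypothetical names introduced for illustration.

```python
from itertools import product

# Strategy names as listed in this document.
STRATEGIES = [
    "model_default",
    "standard_minp",
    "creative_minp",
    "standard_sigma",
    "creative_sigma",
]

# Hypothetical placeholder IDs; the real prompts are not reproduced here.
PROMPT_IDS = [f"prompt_{i}" for i in range(1, 6)]
REPS_PER_PROMPT = 4


def build_jobs(strategies=STRATEGIES, prompt_ids=PROMPT_IDS, reps=REPS_PER_PROMPT):
    """Return one job per (strategy, prompt, repetition) combination."""
    return [
        {"strategy": s, "prompt_id": p, "rep": r}
        for s, p, r in product(strategies, prompt_ids, range(reps))
    ]


jobs = build_jobs()
per_strategy = sum(1 for job in jobs if job["strategy"] == "standard_minp")
print(len(jobs), per_strategy)  # 100 jobs per model, 20 per strategy
```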

Evaluation

Judge Models

Kimi-K2 (developed in China) and Mistral Medium 3 (developed in Europe)

Consensus Scoring

Multiple judges evaluate each sample, and their scores are averaged

Quality Control

Word-count compliance and instruction-following analysis

MMLU Accuracy

Fraction of correct responses; no judges needed
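MMLU accuracy as described here reduces to exact-match counting. The sketch below assumes that responses are trimmed and upper-cased before comparison; that normalization step, and the function names, are assumptions rather than details taken from the pipeline.

```python
def normalize_answer(response: str) -> str:
    """Trim whitespace and upper-case the response (assumed normalization)."""
    return response.strip().upper()


def mmlu_accuracy(responses, gold_letters):
    """Fraction of responses that exactly match the gold answer letter."""
    if len(responses) != len(gold_letters):
        raise ValueError("responses and gold_letters must have the same length")
    if not gold_letters:
        return 0.0
    correct = sum(
        normalize_answer(r) == g.strip().upper()
        for r, g in zip(responses, gold_letters)
    )
    return correct / len(gold_letters)


# A verbose reply such as "The answer is D" does not exactly match "D".
print(mmlu_accuracy(["A", "b ", "C", "The answer is D"], ["A", "B", "D", "D"]))  # 0.5
```

Under exact-match scoring, a response that contains the right letter inside a longer sentence counts as wrong, which is why the single-letter answer format matters.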

Quality Criteria

25%

Narrative Coherence

How well the story flows and maintains logical consistency throughout the narrative arc.

25%

Creativity & Originality

Uniqueness of ideas, plot elements, and creative expression that distinguishes the work.

20%

Character Development

Depth and believability of characters, their motivations, and growth within the story.

20%

Engagement & Readability

How engaging and accessible the text is for readers, maintaining interest throughout.

10%

Stylistic Quality

Writing style, language use, sentence variety, and literary technique mastery.
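One way to read the weights above is as coefficients of a weighted sum over per-criterion scores, averaged across judges. The sketch below assumes exactly that; the key names and both helper functions are hypothetical, and the pipeline may combine scores differently.

```python
# Weights as listed in this document; they sum to 1.0.
WEIGHTS = {
    "narrative_coherence": 0.25,
    "creativity_originality": 0.25,
    "character_development": 0.20,
    "engagement_readability": 0.20,
    "stylistic_quality": 0.10,
}


def weighted_score(criterion_scores):
    """Combine per-criterion scores (1-10) into one weighted score."""
    return sum(WEIGHTS[name] * criterion_scores[name] for name in WEIGHTS)


def consensus_score(judge_scores):
    """Average the weighted scores produced by each judge."""
    per_judge = [weighted_score(scores) for scores in judge_scores]
    return sum(per_judge) / len(per_judge)


judge_a = {name: 8 for name in WEIGHTS}
judge_b = {name: 6 for name in WEIGHTS}
print(round(consensus_score([judge_a, judge_b]), 2))  # 7.0
```

Because the weighted sum is linear, averaging the judges before or after applying the weights gives the same result.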


Sampling Strategies

Five sampling configurations are tested across all models. Each strategy takes a different approach to controlling the randomness of token selection.

model_default

Dynamic

Resolves to model-specific optimal settings (e.g., Llama: temp 0.6, top_p 0.9)

standard_minp

temp 0.7, min_p 0.02

Conservative min-p sampling with lower temperature for controlled creativity

creative_minp

temp 1.0, min_p 0.02

Moderate min-p sampling with standard temperature for balanced output

standard_sigma

temp 1.5, σ 1.0

Top-nσ with high temperature and tight deviation threshold

creative_sigma

temp 1.0, σ 1.5

Top-nσ with moderate temperature and relaxed sigma threshold
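To make the two truncation rules concrete, here is a minimal sketch of which tokens each one keeps. It follows the usual definitions: min-p keeps tokens whose probability is at least min_p times the top token's probability, and top-nσ keeps tokens whose logit lies within n standard deviations of the maximum logit. The order in which a backend applies temperature and truncation, and the use of the population standard deviation, are assumptions here; `model_default` is omitted because it resolves per model.

```python
import math
import statistics

# Settings as listed in this document.
STRATEGIES = {
    "standard_minp": {"temperature": 0.7, "min_p": 0.02},
    "creative_minp": {"temperature": 1.0, "min_p": 0.02},
    "standard_sigma": {"temperature": 1.5, "n_sigma": 1.0},
    "creative_sigma": {"temperature": 1.0, "n_sigma": 1.5},
}


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def min_p_keep(logits, temperature, min_p):
    """Indices of tokens with probability >= min_p * (top token probability)."""
    probs = softmax(logits, temperature)
    threshold = min_p * max(probs)
    return [i for i, p in enumerate(probs) if p >= threshold]


def top_nsigma_keep(logits, n_sigma):
    """Indices of tokens with logit >= max(logits) - n_sigma * std(logits)."""
    threshold = max(logits) - n_sigma * statistics.pstdev(logits)
    return [i for i, x in enumerate(logits) if x >= threshold]


logits = [5.0, 4.0, 1.0, -3.0]  # toy four-token vocabulary
print(min_p_keep(logits, temperature=1.0, min_p=0.02))  # [0, 1]
print(top_nsigma_keep(logits, n_sigma=1.0))             # [0, 1]
print(top_nsigma_keep(logits, n_sigma=1.5))             # [0, 1, 2]
```

In this formulation the top-nσ kept set does not depend on temperature, since dividing every logit by the same temperature scales the maximum and the standard deviation equally. That is one reason it can be paired with a temperature as high as 1.5.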

Instruction Following

Word Count Compliance

Objective measure of instruction adherence

Each prompt asks for a story of 300-400 words. We track the share of outputs that fall within that range as an objective measure of instruction following; a higher compliance rate indicates better adherence.

Tracked Metrics

What we measure for each sample

  • Word count compliance percentage
  • Average deviation from target range
  • Instruction following consistency
  • Generation failure detection
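The tracked metrics above can be computed from raw outputs along these lines. This is a sketch under stated assumptions: words are counted by whitespace splitting, an empty output is treated as a generation failure, and the average deviation is taken over all samples (compliant ones contribute zero). The function names are hypothetical.

```python
TARGET_MIN, TARGET_MAX = 300, 400  # target range stated in the prompts


def word_count(text: str) -> int:
    """Count whitespace-separated words (assumed counting rule)."""
    return len(text.split())


def deviation_from_range(count: int, low: int = TARGET_MIN, high: int = TARGET_MAX) -> int:
    """Distance in words from the target range; 0 when inside it."""
    if count < low:
        return low - count
    if count > high:
        return count - high
    return 0


def compliance_report(texts):
    """Summarize word-count compliance for a batch of generated stories."""
    counts = [word_count(t) for t in texts]
    deviations = [deviation_from_range(c) for c in counts]
    n = len(counts)
    return {
        "compliance_pct": 100.0 * sum(d == 0 for d in deviations) / n if n else 0.0,
        "avg_deviation": sum(deviations) / n if n else 0.0,
        "failures": sum(c == 0 for c in counts),  # empty output = failure (assumed)
    }


samples = ["word " * 350, "word " * 250, "word " * 420, ""]
print(compliance_report(samples))
# {'compliance_pct': 25.0, 'avg_deviation': 92.5, 'failures': 1}
```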

Score Interpretation

Understanding the 1-10 quality scale used for creative writing evaluation.

9-10

Exceptional

Outstanding creative writing with excellence across all criteria

7-8

Good

Strong writing with minor areas for improvement

5-6

Average

Adequate quality with balanced strengths and weaknesses

1-4

Below Average

Significant issues requiring attention
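The bands above are listed for whole-number scores, while averaged scores can be fractional. The sketch below assumes each band starts at its lower bound, so 8.5 counts as Good and 6.9 as Average; that cutoff rule and the function name are assumptions, not part of the published scale.

```python
def interpret_score(score: float) -> str:
    """Map a 1-10 quality score to its band label."""
    if not 1 <= score <= 10:
        raise ValueError("score must be between 1 and 10")
    if score >= 9:
        return "Exceptional"
    if score >= 7:
        return "Good"
    if score >= 5:
        return "Average"
    return "Below Average"


for value in (9.2, 7, 6.9, 4.9):
    print(value, interpret_score(value))
```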