How out of seven Works

We predict how much you will personally enjoy a beer, then show you exactly how we do it. The model comes from psychometric measurement, the same field that calibrates standardized tests and medical outcomes research. It has over 60 years of peer-reviewed validation.

The Scale

Seven levels of enjoyment

1 Not for me: Rejection, the only truly negative response

2 Fine: Tolerable, minimal positive endorsement

3 Like: Mild positive enjoyment

4 Really like: Clear positive enjoyment

5 Love: Strong positive enjoyment

6 Blown away: Exceptional, memorable experience

7 Best I can imagine: Peak enjoyment for this rater

This scale is unipolar: it measures degrees of positive enjoyment. Score 1 is the only true negative.

Raw Ratings: individual 1-7 scores

Rasch Calibration: adjusts for rater tendencies

Calibrated Score: out of 7

Calibration

The calibrated ruler

Think of a calibrated ruler where beers and raters are placed on the same measurement scale. Each beer has a "difficulty," how much of an acquired taste it is. Low difficulty means nearly everyone rates it highly; high difficulty means fewer people do. Each rater has a "leniency," their tendency to rate high or low. The Rasch Rating Scale Model estimates both simultaneously, separating genuine enjoyment from rater bias.

A score of 5/7 from a critical taster who rarely goes above 4 tells us something very different from a 5/7 from a generous taster who gives everything a 5. The model accounts for this. The transformation is population-independent: the same difficulty always maps to the same score, regardless of what other beers exist.

Crowd Favorite: top-tier calibrated enjoyment

Well Rated: above-average enjoyment

Mixed: raters disagree

Predictions

Your personal forecast

Once you've rated at least 5 beers, we can forecast how much you'll enjoy any calibrated beer. The base prediction accounts for your rating tendency. We then adjust for patterns in your taste that leniency alone can't explain. The result is a full probability distribution across the enjoyment scale. The more you rate, the better the predictions become.

Predicted score: your expected rating on the 1-7 scale
Probability curve: shows the likelihood of each score
Crowd comparison: see how your forecast differs from the average

Hidden Gems are beers where your predicted enjoyment is high but the crowd score is lukewarm. Not Your Style is the inverse: crowd favorites the model predicts you won't enjoy as much.

Example forecast: Hopslam Ale (Bell's Brewery), predicted 5.2 / 7 ("Love")

Your Profile

Know yourself as a taster

Most people think they're balanced raters. The model often reveals they're significantly more critical, or more generous, than they realize. Your leniency is one dimension of your taste profile. The other is your taste pattern: which styles and flavors you gravitate toward. Together they produce predictions tuned to your specific palate.

Your taster label (for example, more critical or more generous) is relative to the current rater population, not to fixed thresholds. As the community grows and the rater distribution shifts, your label may change even if your rating behavior stays the same. This is intentional: it keeps the labels meaningful as the population evolves.

Accuracy

How accurate are we?

We hold ourselves to a simple standard: predict your rating, then check if we were right. As our community grows, we'll publish accuracy metrics here, comparing predictions against actual ratings from real users.

Mean Error: -- (average distance from actual rating)

Within 1 Point: --% (predictions close to actual)

Bias: -- (systematic over/under-prediction)

Metrics will appear once we have enough real users with verified predictions. Transparency is a feature, not a marketing claim.
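
The three metrics above can be computed from pairs of predicted and actual ratings. The following is a minimal sketch, assuming "Mean Error" means mean absolute error and "Within 1 Point" means an absolute error of at most 1.0; the function name and the sample numbers are hypothetical and are not real results.

```python
def accuracy_metrics(predicted, actual):
    """Summarize prediction accuracy for paired predicted/actual ratings.

    Assumption: "mean error" is the mean absolute error, and a prediction
    counts as "within 1 point" when |predicted - actual| <= 1.0.
    """
    errors = [p - a for p, a in zip(predicted, actual)]
    n = len(errors)
    return {
        # Average distance from the actual rating
        "mean_error": sum(abs(e) for e in errors) / n,
        # Share of predictions within one scale point, as a percentage
        "within_one_pct": 100.0 * sum(abs(e) <= 1.0 for e in errors) / n,
        # Signed mean error: positive means systematic over-prediction
        "bias": sum(errors) / n,
    }

# Made-up illustration data, not real user ratings
metrics = accuracy_metrics(predicted=[5.2, 3.0, 6.5, 4.0], actual=[5, 4, 4, 4])
```

A positive bias would mean the forecasts run high on average; a negative one would mean they run low.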

Under the Hood

Technical Details

The Rasch Rating Scale Model

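The RSM models the probability of a person n responding in category k to item i as:

P(X_ni = k) = exp(Σ_{j=0..k} (θ_n − β_i − τ_j)) / Σ_{m=0..6} exp(Σ_{j=0..m} (θ_n − β_i − τ_j))

where θ_n is person leniency, β_i is item difficulty, and τ_j are the step thresholds shared across all items. Categories are indexed k = 0..6, corresponding to display scores 1 to 7. By convention τ_0 = 0; the j = 0 term appears in the numerator and in every term of the denominator, so it cancels. That leaves 6 free thresholds, τ_1 to τ_6 (7 categories produce 6 transitions). Parameters are estimated using Joint Maximum Likelihood Estimation (JMLE) with Bayesian regularization. Convergence criterion: maximum parameter change < 0.01 logits.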
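
As an illustration of the formula above, here is a minimal sketch that evaluates the seven category probabilities for one rater and one beer. The function name and the threshold values are hypothetical; real thresholds come out of the calibration, which this sketch does not perform.

```python
import math

def rsm_probabilities(theta, beta, taus):
    """Category probabilities under the Rasch Rating Scale Model.

    theta: rater leniency in logits
    beta:  beer difficulty in logits
    taus:  the 6 step thresholds tau_1..tau_6 (tau_0 is taken as 0)
    Returns 7 probabilities, for display scores 1..7 in order.
    """
    # Running sums of (theta - beta - tau_j); the lowest category has exponent 0
    exponents = [0.0]
    for tau in taus:
        exponents.append(exponents[-1] + (theta - beta - tau))
    # Subtract the largest exponent before exponentiating for numerical stability
    peak = max(exponents)
    weights = [math.exp(e - peak) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical thresholds, chosen only for illustration
example_taus = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
probs = rsm_probabilities(theta=0.5, beta=0.0, taus=example_taus)
```

Raising theta (a more lenient rater) or lowering beta (an easier beer) shifts probability toward the higher scores.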

Score Transformation

Beer difficulty (β in logits) is transformed to a 1-7 display score by computing the RSM expected score for an average rater (θ = 0). This gives the score an average person would assign to a beer of that difficulty, based on the model's estimated step thresholds.

The transformation is population-independent: the same difficulty always produces the same score, regardless of what other beers exist. Crowd labels (Crowd Favorite, Well Rated, Mixed) are derived from where a beer falls on this calibrated scale.
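
The transformation described above can be sketched as the RSM expected score evaluated at θ = 0. The function name and the threshold values below are hypothetical and serve only as an illustration.

```python
import math

def display_score(beta, taus, theta=0.0):
    """Expected 1-7 score for a beer of difficulty beta, rated by a rater
    of leniency theta (0 = average rater), under the Rating Scale Model."""
    # Running sums of (theta - beta - tau_j); the lowest category has exponent 0
    exponents = [0.0]
    for tau in taus:
        exponents.append(exponents[-1] + (theta - beta - tau))
    peak = max(exponents)  # stabilizes the exponentials
    weights = [math.exp(e - peak) for e in exponents]
    total = sum(weights)
    # Categories are indexed 0..6; display scores are 1..7
    return sum((k + 1) * w / total for k, w in enumerate(weights))

# Hypothetical thresholds, symmetric around zero
example_taus = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
easy_beer = display_score(beta=-1.0, taus=example_taus)
hard_beer = display_score(beta=1.0, taus=example_taus)
```

Because the inputs are only the beer's own difficulty and the shared thresholds, the same beta gives the same display score no matter which other beers are in the catalog.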

Fit Statistics

Each beer and rater receives infit and outfit mean-square statistics measuring how well their response patterns match the model. Infit is information-weighted, so it is most sensitive to unexpected responses where a beer's difficulty is close to the rater's leniency; outfit is unweighted, so it is most sensitive to outlier responses. Both have an expected value of 1.0 when the data fit the model.

Acceptable infit range: 0.7-1.3
"Divisive" flag: outfit > 1.5
Threshold standard errors computed from Fisher information
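
The sketch below shows how the two mean-square statistics could be computed for a single beer using the standard Rasch definitions: outfit is the mean squared standardized residual, and infit is the sum of squared residuals divided by the sum of model variances. Function names, thresholds, and ratings are hypothetical.

```python
import math

def rsm_probabilities(theta, beta, taus):
    """Rating Scale Model probabilities for display scores 1..7."""
    exponents = [0.0]
    for tau in taus:
        exponents.append(exponents[-1] + (theta - beta - tau))
    peak = max(exponents)
    weights = [math.exp(e - peak) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

def beer_fit_statistics(ratings, thetas, beta, taus):
    """Return (infit, outfit) mean-squares for one beer.

    ratings: observed 1..7 scores; thetas: leniency of each rater, same order.
    """
    squared_residuals = 0.0
    variances = 0.0
    standardized = 0.0
    for score, theta in zip(ratings, thetas):
        probs = rsm_probabilities(theta, beta, taus)
        expected = sum((k + 1) * p for k, p in enumerate(probs))
        variance = sum(((k + 1) - expected) ** 2 * p for k, p in enumerate(probs))
        residual_sq = (score - expected) ** 2
        squared_residuals += residual_sq
        variances += variance
        standardized += residual_sq / variance
    infit = squared_residuals / variances    # information-weighted
    outfit = standardized / len(ratings)     # unweighted mean of z-squared
    return infit, outfit

example_taus = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]  # hypothetical thresholds
# A polarizing beer: average raters split between 1 and 7
infit, outfit = beer_fit_statistics([1, 7, 1, 7], [0.0] * 4, 0.0, example_taus)
is_divisive = outfit > 1.5
```

In this made-up case the ratings are far more spread out than the model expects, so the outfit value lands well above 1.5.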

Prediction Confidence

Confidence is measured via normalized Shannon entropy:

confidence = 1 − H / log(7)

where H = −Σ P(k) log P(k) over 7 categories. A value of 1.0 means the model is certain about one category; 0.0 means maximum uncertainty (uniform distribution across all 7).

High: confidence ≥ 0.7
Medium: 0.4 ≤ confidence < 0.7
Low: confidence < 0.4
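
A minimal sketch of the confidence calculation and the three bands above. The function names are hypothetical; terms with zero probability are skipped, which follows the usual convention that 0 · log 0 = 0.

```python
import math

N_CATEGORIES = 7

def prediction_confidence(probs):
    """1 minus the normalized Shannon entropy of a 7-category distribution."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - entropy / math.log(N_CATEGORIES)

def confidence_label(confidence):
    """Map a confidence value to the High / Medium / Low bands."""
    if confidence >= 0.7:
        return "High"
    if confidence >= 0.4:
        return "Medium"
    return "Low"

# Made-up distribution concentrated on scores 4 and 5
example = [0.0, 0.0, 0.05, 0.45, 0.45, 0.05, 0.0]
label = confidence_label(prediction_confidence(example))
```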

Taste Pattern Analysis

After Rasch calibration, residuals (observed rating minus Rasch-expected rating) isolate the component of each rating that leniency and difficulty can't explain: personal taste. Matrix factorization decomposes these residuals into latent factors that capture style-level and flavor-level preferences.

The final prediction uses a BellKor-style combination: the Rasch baseline (expected score from leniency + difficulty + thresholds) plus the MF residual (dot product of user and beer latent factors plus biases). The combined target score is then mapped back into the Rating Scale Model's probability framework to produce a full distribution across all 7 categories.
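
The text does not say how the combined target score is mapped back into the Rating Scale Model, so the sketch below shows one possible approach, stated as an assumption: search for an effective leniency whose RSM expected score equals the target, then read off the distribution at that leniency. All names, the clamping bounds, the search range, and the numbers are hypothetical.

```python
import math

def rsm_probabilities(theta, beta, taus):
    """Rating Scale Model probabilities for display scores 1..7."""
    exponents = [0.0]
    for tau in taus:
        exponents.append(exponents[-1] + (theta - beta - tau))
    peak = max(exponents)
    weights = [math.exp(e - peak) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

def expected_score(theta, beta, taus):
    return sum((k + 1) * p for k, p in enumerate(rsm_probabilities(theta, beta, taus)))

def mf_residual(user_factors, beer_factors, user_bias, beer_bias):
    """Dot product of latent factors plus the two bias terms."""
    return sum(p * q for p, q in zip(user_factors, beer_factors)) + user_bias + beer_bias

def personalized_distribution(theta, beta, taus, residual):
    """Combine the Rasch baseline with an MF residual, then map back to the RSM.

    Assumption: the mapping is done by bisection on an effective leniency.
    """
    baseline = expected_score(theta, beta, taus)
    # Keep the target strictly inside the scale so a solution exists
    target = min(6.99, max(1.01, baseline + residual))
    low, high = -20.0, 20.0  # assumed search range in logits
    for _ in range(60):
        mid = (low + high) / 2.0
        # The expected score increases with leniency, so bisection applies
        if expected_score(mid, beta, taus) < target:
            low = mid
        else:
            high = mid
    effective_theta = (low + high) / 2.0
    return target, rsm_probabilities(effective_theta, beta, taus)

example_taus = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]  # hypothetical thresholds
residual = mf_residual([0.8, -0.2], [1.0, 0.5], user_bias=0.1, beer_bias=0.2)
target, distribution = personalized_distribution(0.0, 0.0, example_taus, residual)
```

With these made-up inputs the Rasch baseline is 4.0 and the residual is about 1.0, so the resulting distribution is centered near 5.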

The MF model is trained using Alternating Least Squares (ALS) with L2 regularization on the residual matrix. A temporal train/validation split ensures the model generalizes to future ratings, not just past ones.
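
To make the training step concrete, here is a small dependency-free sketch of Alternating Least Squares with L2 regularization on a sparse residual matrix. It omits the bias terms and the temporal split for brevity. All names, hyperparameters, and data are hypothetical, and a production pipeline would more likely use a numerical library.

```python
import random

def solve_spd(matrix, rhs):
    """Solve a small symmetric positive-definite system by Gaussian elimination."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[row][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][c] * solution[c] for c in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution

def ridge_update(fixed_factors, observations, reg):
    """Regularized least-squares factors for one user (or beer), holding the
    other side fixed. observations: list of (index_into_fixed, residual)."""
    k = len(fixed_factors[0])
    gram = [[reg if a == b else 0.0 for b in range(k)] for a in range(k)]
    rhs = [0.0] * k
    for index, residual in observations:
        f = fixed_factors[index]
        for a in range(k):
            rhs[a] += f[a] * residual
            for b in range(k):
                gram[a][b] += f[a] * f[b]
    return solve_spd(gram, rhs)

def als_factorize(residuals, n_users, n_beers, n_factors=2, reg=0.1, n_iters=20, seed=0):
    """residuals: dict mapping (user, beer) to an observed residual."""
    rng = random.Random(seed)
    users = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    beers = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_beers)]
    by_user = {u: [] for u in range(n_users)}
    by_beer = {i: [] for i in range(n_beers)}
    for (u, i), r in residuals.items():
        by_user[u].append((i, r))
        by_beer[i].append((u, r))
    for _ in range(n_iters):
        for u in range(n_users):          # update users with beers fixed
            if by_user[u]:
                users[u] = ridge_update(beers, by_user[u], reg)
        for i in range(n_beers):          # update beers with users fixed
            if by_beer[i]:
                beers[i] = ridge_update(users, by_beer[i], reg)
    return users, beers

# Made-up residuals: users 0-1 and beers 0-1 share one taste, the rest the opposite
sign = [1, 1, -1, -1]
held_out = {(0, 3), (3, 0)}
data = {(u, i): float(sign[u] * sign[i])
        for u in range(4) for i in range(4) if (u, i) not in held_out}
users, beers = als_factorize(data, n_users=4, n_beers=4)
predicted = sum(p * q for p, q in zip(users[0], beers[3]))
```

The two held-out pairs play the role of a validation set here: the model never sees them, yet the learned factors should predict a negative residual for user 0 on beer 3.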

Data Requirements

Global calibration

20+ ratings per beer
5+ beers rated per user
5+ beers rated for personalized predictions

Per-style-category calibration

10+ ratings per beer
5+ beers rated per user
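
The global thresholds above could be applied as a simple eligibility filter. This is a single-pass sketch with hypothetical names; it counts distinct beers per user and does not re-check beer counts after excluding users, which a real pipeline might need to do.

```python
from collections import Counter, defaultdict

MIN_RATINGS_PER_BEER = 20  # global calibration threshold
MIN_BEERS_PER_USER = 5

def eligible_for_global_calibration(ratings):
    """ratings: iterable of (user_id, beer_id, score) tuples.

    Returns (eligible_beers, eligible_users) as sets.
    """
    ratings_per_beer = Counter()
    beers_per_user = defaultdict(set)
    for user_id, beer_id, _score in ratings:
        ratings_per_beer[beer_id] += 1
        beers_per_user[user_id].add(beer_id)
    eligible_beers = {b for b, n in ratings_per_beer.items() if n >= MIN_RATINGS_PER_BEER}
    eligible_users = {u for u, seen in beers_per_user.items() if len(seen) >= MIN_BEERS_PER_USER}
    return eligible_beers, eligible_users

# Made-up data: 20 users rate beer "A"; user 0 also rates four more beers
sample = [(user, "A", 4) for user in range(20)]
sample += [(0, beer, 3) for beer in ("B", "C", "D", "E")]
beers_ok, users_ok = eligible_for_global_calibration(sample)
```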

FAQ

Frequently Asked Questions

How many ratings does a beer need?

At least 20 ratings for a global calibrated score, or 10 ratings within a single style category for a per-category score.

How often are scores updated?

The Rasch calibration pipeline runs daily, recalculating all beer scores and user profiles with the latest ratings.

What does "Divisive" mean?

It means raters strongly disagree about the beer. The outfit statistic exceeds 1.5, indicating more variation in responses than the model expects. This is not a quality judgment — divisive beers can be excellent or mediocre. It simply means opinions are split.

Can breweries influence their scores?

No. The Rasch model adjusts for each rater's individual baseline tendencies, so inflated ratings from a single source are automatically down-weighted. Scores reflect the calibrated consensus of all raters.

Why 7 points instead of 5 or 10?

Psychometric research shows 7 points is the sweet spot for single-item ratings. A study of 172 participants (Finstad, 2010) found that 2.5% of respondents on 5-point scales tried to answer between points; on 7-point scales, that dropped to zero. Reliability and discriminating power are significantly higher up to 7 points, with diminishing returns beyond (Colman et al., 1997). Netflix abandoned 5-star ratings entirely because compression made them useless for personalization. We kept a numeric scale but chose the length that maximizes what each rating tells us.

How are predictions different from crowd scores?

A crowd score averages ratings across all users: a lenient rater's 5 and a critical rater's 5 count the same, and a pilsner lover's opinion is mixed with a stout lover's. Your prediction accounts for both your rating tendencies and your specific taste patterns to estimate what you would rate a beer. The crowd score might be 4.1, but your personal prediction might be 5.8, or 2.3.

Is the Rasch model peer-reviewed?

Yes — over 60 years of peer-reviewed validation across psychometric measurement, educational testing, medical outcomes research, and sensory science. out of seven uses the Rating Scale Model variant introduced by David Andrich in 1978.

References

Rasch, G. (1960). Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen: Danish Institute for Educational Research.
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43(4), 561-573. doi:10.1007/BF02294208
Masters, G.N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149-174. doi:10.1007/BF02296272
Finstad, K. (2010). Response Interpolation and Scale Sensitivity: Evidence Against 5-Point Scales. Journal of Usability Studies, 5(3), 104-110.
Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix Factorization Techniques for Recommender Systems. Computer, 42(8), 30-37. doi:10.1109/MC.2009.263
Colman, A.M., Norris, C.E., & Preston, C.C. (1997). Comparing Rating Scales of Different Lengths. Psychological Reports, 80(2), 355-362. doi:10.2466/pr0.1997.80.2.355