How out of seven Works

We predict how much you will personally enjoy a beer, then show you exactly how we do it. The model comes from psychometric measurement, the same field that calibrates standardized tests and medical outcomes research. It has over 60 years of peer-reviewed validation.

Seven levels of enjoyment

1. Not for me: Rejection, the only truly negative response
2. Fine: Tolerable, minimal positive endorsement
3. Like: Mild positive enjoyment
4. Really like: Clear positive enjoyment
5. Love: Strong positive enjoyment
6. Blown away: Exceptional, memorable experience
7. Best I can imagine: Peak enjoyment for this rater

This scale is unipolar — it measures degrees of positive enjoyment. Score 1 is the only true negative.

Raw Ratings (individual 1-7 scores) → Rasch Calibration (adjusts for rater tendencies) → Calibrated Score (out of 7)

The calibrated ruler

Think of a calibrated ruler where beers and raters are placed on the same measurement scale. Each beer has a "difficulty," how much of an acquired taste it is. Low difficulty means nearly everyone rates it highly; high difficulty means fewer people do. Each rater has a "leniency," their tendency to rate high or low. The Rasch Rating Scale Model estimates both simultaneously, separating genuine enjoyment from rater bias.

A 5/7 from a critical taster who rarely goes above 4 tells us something very different from a 5/7 from a generous taster who gives everything a 5, and the model accounts for this. The conversion from a beer's difficulty to its displayed score is also population-independent: the same difficulty always maps to the same score, regardless of what other beers exist.

  • Crowd Favorite: Top-tier calibrated enjoyment
  • Well Rated: Above-average enjoyment
  • Mixed: Raters disagree

Your personal forecast

Once you've rated at least 5 beers, we can forecast how much you'll enjoy any calibrated beer. The base prediction accounts for your rating tendency. We then adjust for patterns in your taste that leniency alone can't explain. The result is a full probability distribution across the enjoyment scale. The more you rate, the better the predictions become.

  • Predicted score: your expected rating on the 1-7 scale
  • Probability curve: shows the likelihood of each score
  • Crowd comparison: see how your forecast differs from the average

Hidden Gems are beers where your predicted enjoyment is high but the crowd score is lukewarm. Not Your Style is the inverse: crowd favorites the model predicts you won't enjoy as much.
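
As a rough illustration, the two tags could be assigned by comparing a personal forecast with the crowd score. This is a minimal sketch: the function name `forecast_tag` and the cutoffs `high=5.0` and `lukewarm=4.0` are hypothetical, since the actual thresholds are not stated here.

```python
def forecast_tag(predicted, crowd, high=5.0, lukewarm=4.0):
    """Tag a beer by comparing a personal forecast with the crowd score.

    Both scores are on the 1-7 scale. The cutoffs are illustrative
    assumptions, not published values.
    """
    if predicted >= high and crowd < lukewarm:
        return "Hidden Gem"        # you: high, crowd: lukewarm
    if crowd >= high and predicted < lukewarm:
        return "Not Your Style"    # crowd: high, you: lukewarm
    return None                    # neither tag applies


print(forecast_tag(predicted=5.8, crowd=3.6))
print(forecast_tag(predicted=2.3, crowd=5.4))
```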

[Example forecast card: Hopslam Ale, Bell's Brewery. Predicted 5.2 / 7 (Love), with a probability chart over scores 1-7 showing 38%.]

Know yourself as a taster

Most people think they're balanced raters. The model often reveals they're significantly more critical, or more generous, than they realize. Your leniency is one dimension of your taste profile. The other is your taste pattern: which styles and flavors you gravitate toward. Together they produce predictions tuned to your specific palate.

  • Critical (bottom 25%): More selective than most raters
  • Balanced (middle 50%): Close to the population average
  • Generous (top 25%): Rates higher than most raters

These labels are relative to the current population, not fixed thresholds. As the community grows and the rater distribution shifts, your label may change even if your rating behavior stays the same. This is a feature, not a bug — it keeps the labels meaningful as the population evolves.
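
A minimal sketch of a population-relative label, assuming the label is derived from the rater's percentile rank among all estimated leniencies. The function name `leniency_label` and the sample populations are hypothetical.

```python
from bisect import bisect_left


def leniency_label(theta, population_thetas):
    """Label a rater by where their leniency falls in the population.

    theta: this rater's leniency in logits.
    population_thetas: leniency estimates for all raters.
    """
    ordered = sorted(population_thetas)
    # Fraction of raters who are strictly less lenient than this one.
    rank = bisect_left(ordered, theta) / len(ordered)
    if rank < 0.25:
        return "Critical"
    if rank < 0.75:
        return "Balanced"
    return "Generous"


# The same leniency can earn a different label in a different population.
today = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 3.0]
later = [-3.0, -2.5, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5]
print(leniency_label(0.5, today), leniency_label(0.5, later))
```

Here the rater's own leniency (0.5 logits) is unchanged, but the second population is more critical overall, so the label moves.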

How accurate are we?

We hold ourselves to a simple standard: predict your rating, then check if we were right. As our community grows, we'll publish accuracy metrics here, comparing predictions against actual ratings from real users.

  • Mean Error: -- (average distance from actual rating)
  • Within 1 Point: --% (predictions close to actual)
  • Bias: -- (systematic over/under-prediction)

Metrics will appear once we have enough real users with verified predictions. Transparency is a feature, not a marketing claim.
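
The three metrics could be computed along these lines. This is a sketch that assumes "Mean Error" means mean absolute error and that "Within 1 Point" counts predictions at most 1.0 away from the actual rating; the function name and sample numbers are hypothetical.

```python
def accuracy_metrics(predicted, actual):
    """Compare predicted scores with the ratings users actually gave."""
    errors = [p - a for p, a in zip(predicted, actual)]
    n = len(errors)
    return {
        # Average distance from the actual rating, ignoring direction.
        "mean_error": sum(abs(e) for e in errors) / n,
        # Share of predictions that land within one point of the rating.
        "within_one": sum(1 for e in errors if abs(e) <= 1.0) / n,
        # Signed average: positive means the model over-predicts.
        "bias": sum(errors) / n,
    }


metrics = accuracy_metrics(predicted=[5.2, 3.0, 6.0, 4.0], actual=[5, 4, 4, 4])
print(metrics)
```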

Technical Details

The Rasch Rating Scale Model

The RSM models the probability of a person n responding in category k to item i as:

$$P(X_{ni} = k) = \frac{\exp\left(\sum_{j=0}^{k} (\theta_n - \beta_i - \tau_j)\right)}{\sum_{m=0}^{6} \exp\left(\sum_{j=0}^{m} (\theta_n - \beta_i - \tau_j)\right)}$$

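where θ_n is person (rater) leniency, β_i is item (beer) difficulty, and τ_j are the 6 step thresholds shared across all items (7 categories produce 6 transitions). The index k = 0, ..., 6 runs over the seven categories, displayed as scores 1-7; the j = 0 term appears in every category's sum and cancels in the ratio, which leaves six effective thresholds. Parameters are estimated using Joint Maximum Likelihood Estimation (JMLE) with Bayesian regularization. The convergence criterion is a maximum parameter change below 0.01 logits.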
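
The category probabilities can be sketched in a few lines of Python. This is an illustration of the formula only, not the estimation pipeline; the function name `rsm_probabilities` and the evenly spaced thresholds in `EXAMPLE_TAUS` are assumptions chosen for the example.

```python
import math


def rsm_probabilities(theta, beta, taus):
    """Return P(score = 1..7) under the Rating Scale Model.

    theta: rater leniency (logits); beta: beer difficulty (logits);
    taus: the 6 step thresholds shared by all beers.
    """
    # Running sums of (theta - beta - tau_j); the lowest category is 0.
    logits = [0.0]
    for tau in taus:
        logits.append(logits[-1] + (theta - beta - tau))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical thresholds, evenly spaced and centered on zero.
EXAMPLE_TAUS = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
average_rater = rsm_probabilities(theta=0.0, beta=0.0, taus=EXAMPLE_TAUS)
lenient_rater = rsm_probabilities(theta=1.0, beta=0.0, taus=EXAMPLE_TAUS)
print([round(p, 3) for p in average_rater])
print([round(p, 3) for p in lenient_rater])
```

For the same beer, a larger theta shifts probability toward the higher scores.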

Score Transformation

Beer difficulty (β in logits) is transformed to a 1-7 display score by computing the RSM expected score for an average rater (θ = 0). This gives the score an average person would assign to a beer of that difficulty, based on the model's estimated step thresholds.
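
A minimal sketch of this transformation, assuming the display score is the probability-weighted mean of the seven scores at θ = 0. The function names and the example thresholds are hypothetical.

```python
import math


def rsm_expected_score(theta, beta, taus):
    """Expected 1-7 score under the Rating Scale Model."""
    logits = [0.0]
    for tau in taus:
        logits.append(logits[-1] + (theta - beta - tau))
    peak = max(logits)
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    # Category index k = 0..6 is displayed as score k + 1.
    return sum((k + 1) * w / total for k, w in enumerate(weights))


def display_score(beta, taus):
    """Score an average rater (theta = 0) would be expected to give."""
    return rsm_expected_score(0.0, beta, taus)


# Hypothetical thresholds; lower difficulty yields a higher display score.
taus = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
for beta in (-1.0, 0.0, 1.0):
    print(beta, round(display_score(beta, taus), 2))
```

Because the function depends only on the beer's own difficulty and the shared thresholds, adding or removing other beers does not change its output.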

The transformation is population-independent: the same difficulty always produces the same score, regardless of what other beers exist. Crowd labels (Crowd Favorite, Well Rated, Mixed) are derived from where a beer falls on this calibrated scale.

Fit Statistics

Each beer and rater receives infit and outfit mean-square statistics measuring how well their response patterns match the model. Infit is information-weighted, so it is most sensitive to unexpected responses where the rater's leniency and the beer's difficulty are closely matched; outfit is unweighted, so it is most sensitive to outlier responses.

  • Acceptable infit range: 0.7-1.3
  • "Divisive" flag: outfit > 1.5
  • Threshold standard errors computed from Fisher information
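
The two mean-square statistics can be sketched as follows, using the standard definitions: outfit is the mean squared standardized residual, and infit is the sum of squared residuals divided by the sum of model variances. The function names and input layout are assumptions for illustration, and the sketch assumes every model variance is positive.

```python
def fit_statistics(observations):
    """Infit and outfit mean-squares for one beer or one rater.

    observations: list of (observed_score, probs) pairs, where probs[k]
    is the model probability of score k + 1 for that rating.
    """
    sq_resid_sum = 0.0
    variance_sum = 0.0
    z_sq_sum = 0.0
    for observed, probs in observations:
        expected = sum((k + 1) * p for k, p in enumerate(probs))
        variance = sum(((k + 1) - expected) ** 2 * p
                       for k, p in enumerate(probs))
        sq_resid = (observed - expected) ** 2
        sq_resid_sum += sq_resid
        variance_sum += variance
        z_sq_sum += sq_resid / variance   # squared standardized residual
    infit = sq_resid_sum / variance_sum   # information-weighted
    outfit = z_sq_sum / len(observations) # unweighted
    return infit, outfit


def infit_acceptable(infit):
    return 0.7 <= infit <= 1.3


def is_divisive(outfit):
    return outfit > 1.5
```

Values near 1.0 mean the ratings vary about as much as the model expects; a single far-off rating inflates outfit more than infit.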

Prediction Confidence

Confidence is measured via normalized Shannon entropy:

confidence = 1 − H / log(7)

where H = −Σ P(k) log P(k) over 7 categories. A value of 1.0 means the model is certain about one category; 0.0 means maximum uncertainty (uniform distribution across all 7).

  • High: confidence ≥ 0.7
  • Medium: 0.4 ≤ confidence < 0.7
  • Low: confidence < 0.4
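
The confidence formula and its three bands can be sketched directly. The function names are hypothetical; the formula and cutoffs are the ones given in this section.

```python
import math


def prediction_confidence(probs):
    """1 minus normalized Shannon entropy of a 7-category distribution."""
    # Terms with p = 0 contribute nothing to the entropy.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - entropy / math.log(len(probs))


def confidence_label(confidence):
    if confidence >= 0.7:
        return "High"
    if confidence >= 0.4:
        return "Medium"
    return "Low"


uniform = [1 / 7] * 7
peaked = [0.0, 0.0, 0.0, 0.02, 0.95, 0.03, 0.0]
for dist in (uniform, peaked):
    c = prediction_confidence(dist)
    print(round(c, 3), confidence_label(c))
```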

Taste Pattern Analysis

After Rasch calibration, residuals (observed rating minus Rasch-expected rating) isolate the component of each rating that leniency and difficulty can't explain: personal taste. Matrix factorization decomposes these residuals into latent factors that capture style-level and flavor-level preferences.

The final prediction uses a BellKor-style combination: the Rasch baseline (expected score from leniency + difficulty + thresholds) plus the MF residual (dot product of user and beer latent factors plus biases). The combined target score is then mapped back into the Rating Scale Model's probability framework to produce a full distribution across all 7 categories.
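
How the combined target is mapped back into the Rating Scale Model is not spelled out here. One plausible approach, sketched below as an assumption rather than the actual method, is to search for the logit location (θ − β) whose expected score equals the target, then read off the category probabilities at that location. All function names, the clipping bounds, and the search bracket are hypothetical.

```python
import math


def rsm_probs(location, taus):
    """RSM probabilities for scores 1-7 at location = theta - beta."""
    logits = [0.0]
    for tau in taus:
        logits.append(logits[-1] + location - tau)
    peak = max(logits)
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def expected_score(probs):
    return sum((k + 1) * p for k, p in enumerate(probs))


def combined_forecast(theta, beta, taus, mf_residual):
    """Rasch baseline plus MF residual, mapped back to a distribution."""
    baseline = expected_score(rsm_probs(theta - beta, taus))
    # Keep the target strictly inside the scale so a finite location exists.
    target = min(max(baseline + mf_residual, 1.05), 6.95)
    # Expected score rises monotonically with location, so bisection works.
    # The bracket assumes thresholds of moderate size (a few logits).
    lo, hi = -15.0, 15.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if expected_score(rsm_probs(mid, taus)) < target:
            lo = mid
        else:
            hi = mid
    return target, rsm_probs((lo + hi) / 2.0, taus)


taus = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]  # hypothetical thresholds
target, probs = combined_forecast(0.0, 0.0, taus, mf_residual=1.2)
print(round(target, 2), [round(p, 3) for p in probs])
```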

The MF model is trained using Alternating Least Squares (ALS) with L2 regularization on the residual matrix. A temporal train/validation split ensures the model generalizes to future ratings, not just past ones.
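
To make the ALS step concrete, here is a deliberately simplified sketch with a single latent factor and no bias terms, where each update has a closed-form ridge solution. The function names, the regularization value, and the toy data are hypothetical, and the temporal train/validation split is not shown.

```python
def als_rank1(residuals, n_users, n_beers, reg=0.1, sweeps=20):
    """Fit one latent factor per user and beer by alternating least squares.

    residuals: dict mapping (user_index, beer_index) to the residual
    (observed rating minus Rasch-expected rating). reg is the L2 penalty.
    """
    user_f = [0.1] * n_users
    beer_f = [0.1] * n_beers
    for _ in range(sweeps):
        # Hold beer factors fixed; solve each user's ridge regression.
        num, den = [0.0] * n_users, [reg] * n_users
        for (u, b), r in residuals.items():
            num[u] += r * beer_f[b]
            den[u] += beer_f[b] ** 2
        user_f = [n / d for n, d in zip(num, den)]
        # Hold user factors fixed; solve each beer's ridge regression.
        num, den = [0.0] * n_beers, [reg] * n_beers
        for (u, b), r in residuals.items():
            num[b] += r * user_f[u]
            den[b] += user_f[u] ** 2
        beer_f = [n / d for n, d in zip(num, den)]
    return user_f, beer_f


def predict_residual(user_f, beer_f, user, beer):
    return user_f[user] * beer_f[beer]


# Toy residuals generated from one hidden taste dimension.
true_u, true_v = [1.0, -1.0, 0.5], [1.0, -0.5]
data = {(i, j): true_u[i] * true_v[j] for i in range(3) for j in range(2)}
user_f, beer_f = als_rank1(data, n_users=3, n_beers=2, reg=0.01, sweeps=30)
print(round(predict_residual(user_f, beer_f, 0, 1), 3))
```

A production model would use several factors and solve a small linear system per user and per beer instead of the scalar division used here.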

Data Requirements

Global calibration

  • 20+ ratings per beer
  • 10+ ratings per user
  • 5+ ratings for personalized predictions

Per-style-category calibration

  • 10+ ratings per beer
  • 5+ ratings per user
  • 3+ ratings for per-category predictions
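
The global calibration minimums could be applied as a simple filter before fitting. This sketch assumes a single pass in which counts are taken on the full data set; the constant and function names are hypothetical, and the real pipeline may filter differently.

```python
from collections import Counter

MIN_RATINGS_PER_BEER = 20
MIN_RATINGS_PER_USER = 10


def calibration_subset(ratings,
                       min_per_beer=MIN_RATINGS_PER_BEER,
                       min_per_user=MIN_RATINGS_PER_USER):
    """Keep ratings whose beer and user both meet the minimum counts.

    ratings: list of (user_id, beer_id, score) tuples.
    """
    beer_counts = Counter(beer for _, beer, _ in ratings)
    user_counts = Counter(user for user, _, _ in ratings)
    return [
        (user, beer, score)
        for user, beer, score in ratings
        if beer_counts[beer] >= min_per_beer
        and user_counts[user] >= min_per_user
    ]
```

Per-style-category calibration would call the same function on one category's ratings with `min_per_beer=10` and `min_per_user=5`.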

Frequently Asked Questions

How many ratings does a beer need?

At least 20 ratings for a global calibrated score, or 10 ratings within a single style category for a per-category score.

How often are scores updated?

The Rasch calibration pipeline runs daily, recalculating all beer scores and user profiles with the latest ratings.

What does "Divisive" mean?

It means raters strongly disagree about the beer. The outfit statistic exceeds 1.5, indicating more variation in responses than the model expects. This is not a quality judgment — divisive beers can be excellent or mediocre. It simply means opinions are split.

Can breweries influence their scores?

No. The Rasch model adjusts for each rater's individual baseline tendencies, so inflated ratings from a single source are automatically down-weighted. Scores reflect the calibrated consensus of all raters.

Why 7 points instead of 5 or 10?

Psychometric research shows 7 points is the sweet spot for single-item ratings. A study of 172 participants (Finstad, 2010) found that 2.5% of respondents on 5-point scales tried to answer between points; on 7-point scales, that dropped to zero. Reliability and discriminating power are significantly higher up to 7 points, with diminishing returns beyond (Colman et al., 1997). Netflix abandoned 5-star ratings entirely because compression made them useless for personalization. We kept a numeric scale but chose the length that maximizes what each rating tells us.

How are predictions different from crowd scores?

A crowd score averages ratings across all users: a lenient rater's 5 and a critical rater's 5 count the same, and a pilsner lover's opinion is mixed with a stout lover's. Your prediction accounts for both your rating tendencies and your specific taste patterns to estimate what you would rate a beer. The crowd score might be 4.1, but your personal prediction might be 5.8, or 2.3.

Is the Rasch model peer-reviewed?

Yes — over 60 years of peer-reviewed validation across psychometric measurement, educational testing, medical outcomes research, and sensory science. out of seven uses the Rating Scale Model variant introduced by David Andrich in 1978.

  1. Rasch, G. (1960). Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen: Danish Institute for Educational Research.
  2. Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43(4), 561-573. doi:10.1007/BF02294208
  3. Masters, G.N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149-174. doi:10.1007/BF02296272
  4. Finstad, K. (2010). Response Interpolation and Scale Sensitivity: Evidence Against 5-Point Scales. Journal of Usability Studies, 5(3), 104-110.
  5. Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix Factorization Techniques for Recommender Systems. Computer, 42(8), 30-37. doi:10.1109/MC.2009.263
  6. Colman, A.M., Norris, C.E., & Preston, C.C. (1997). Comparing Rating Scales of Different Lengths. Psychological Reports, 80(2), 355-362. doi:10.2466/pr0.1997.80.2.355