@howardrotterdam
Created March 30, 2025 09:09
How to evaluate the performance of different recommendation models

How to evaluate the performance of different recommendation models like ALS and FP-growth? Are there any common metrics that are meaningful across different models?

Grok3:

Evaluating the performance of recommendation models like Alternating Least Squares (ALS) and FP-growth involves assessing how well they predict user preferences or generate relevant recommendations. While these two models differ in their approach—ALS is a matrix factorization technique commonly used in collaborative filtering, and FP-growth is an association rule mining algorithm often applied to market basket analysis—they share some common evaluation principles. The choice of metrics depends on the task (e.g., rating prediction or item recommendation) and the data available (e.g., explicit ratings or implicit feedback like clicks). Below, I’ll outline key methods and metrics that are meaningful across different recommendation models.

1. Evaluation Approaches

Before diving into metrics, it’s worth noting the typical ways to evaluate these models:

  • Offline Evaluation: Use historical data, split into training and test sets, to measure how well the model predicts withheld data.
  • Online Evaluation: Test the model in a live system (e.g., A/B testing) to observe real user interactions.
  • Cross-Validation: For offline settings, k-fold cross-validation helps ensure robustness by averaging performance across multiple splits.

ALS is often evaluated offline with rating prediction or ranking tasks, while FP-growth is typically assessed on its ability to generate frequent itemsets or rules. Still, both can be adapted to recommendation tasks, making some metrics comparable.
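
As a concrete starting point for offline evaluation, the sketch below shows a holdout split in PySpark (where both ALS and FP-growth implementations are available); the file path and column names are placeholders for your own data.

```python
# Minimal offline-evaluation setup sketch; "ratings.parquet" and the column
# names ("user", "item", "rating") are placeholders for your own data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rec-eval").getOrCreate()
ratings = spark.read.parquet("ratings.parquet")

# Random 80/20 holdout; repeat with different seeds (or use k folds)
# to average performance across splits.
train, test = ratings.randomSplit([0.8, 0.2], seed=42)
```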

2. Common Metrics

Here are widely used metrics that apply to both models, depending on whether you’re predicting ratings or ranking items:

For Rating Prediction (Explicit Feedback)

If your task involves predicting numerical ratings (e.g., 1–5 stars), as is common with ALS:

  • Mean Squared Error (MSE): Measures the average squared difference between predicted and actual ratings. Lower is better.
    • Formula: \( MSE = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 \), where \( y_i \) is the true rating and \( \hat{y}_i \) is the predicted rating.
  • Root Mean Squared Error (RMSE): The square root of MSE, providing error in the same units as the ratings. Easier to interpret.
    • Formula: \( RMSE = \sqrt{MSE} \).
  • Mean Absolute Error (MAE): Average absolute difference between predicted and actual ratings. Less sensitive to outliers than MSE.
    • Formula: \( MAE = \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i| \).

These metrics can apply to FP-growth if you convert its rules into rating predictions (e.g., assigning confidence scores as proxies for ratings), though this is less common.
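
As an illustration, here is a minimal sketch of computing RMSE and MAE for ALS in PySpark, assuming the train/test DataFrames and column names from the split sketch above:

```python
# Sketch: RMSE and MAE for ALS rating prediction on explicit feedback.
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

als = ALS(userCol="user", itemCol="item", ratingCol="rating",
          rank=10, regParam=0.1, coldStartStrategy="drop")
model = als.fit(train)
predictions = model.transform(test)

for metric in ("rmse", "mae"):
    score = RegressionEvaluator(metricName=metric, labelCol="rating",
                                predictionCol="prediction").evaluate(predictions)
    print(metric, round(score, 3))
```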

For Ranking/Top-N Recommendations (Implicit or Explicit Feedback)

For tasks where the goal is to recommend a list of items (common for both ALS and FP-growth), ranking metrics are more relevant:

  • Precision@k: Fraction of recommended items in the top-k list that are relevant (e.g., liked or interacted with by the user).
    • Formula: \( Precision@k = \frac{\text{# of relevant items in top k}}{k} \).
  • Recall@k: Fraction of all relevant items that appear in the top-k recommendations.
    • Formula: \( Recall@k = \frac{\text{# of relevant items in top k}}{\text{total # of relevant items}} \).
  • F1 Score@k: Harmonic mean of Precision@k and Recall@k, balancing the trade-off between them.
    • Formula: \( F1 = 2 \cdot \frac{Precision@k \cdot Recall@k}{Precision@k + Recall@k} \).
  • Mean Average Precision (MAP@k): Average precision across multiple users, accounting for the order of relevant items in the top-k list. Higher weight is given to relevant items ranked higher.
  • Normalized Discounted Cumulative Gain (NDCG@k): Measures ranking quality by assigning higher scores to relevant items appearing earlier in the list, normalized by the ideal ranking.
    • Formula involves logarithmic discounting: \( NDCG@k = \frac{DCG@k}{IDCG@k} \), where \( DCG@k = \sum_{i=1}^k \frac{rel_i}{\log_2(i+1)} \).

ALS excels at ranking when trained on implicit feedback (e.g., clicks), while FP-growth can generate top-N recommendations by mining frequent itemsets and ranking them by support or confidence.
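
Spark 3.x includes a RankingEvaluator that covers Precision@k, Recall@k, NDCG@k, and MAP@k. The sketch below scores ALS top-k lists against held-out interactions; it assumes the model and test DataFrames from the previous sketch.

```python
# Sketch: ranking metrics for ALS top-N recommendations.
from pyspark.sql import functions as F
from pyspark.ml.evaluation import RankingEvaluator

k = 10
# Predicted top-k item ids per user, as an array of doubles.
recs = (model.recommendForAllUsers(k)
             .select("user",
                     F.expr("transform(recommendations, r -> cast(r.item as double))")
                      .alias("prediction")))
# Ground truth: the items each user actually interacted with in the test set.
truth = (test.groupBy("user")
             .agg(F.collect_list(F.col("item").cast("double")).alias("label")))
joined = recs.join(truth, on="user")

for metric in ("precisionAtK", "recallAtK", "ndcgAtK", "meanAveragePrecisionAtK"):
    score = RankingEvaluator(predictionCol="prediction", labelCol="label",
                             metricName=metric, k=k).evaluate(joined)
    print(metric, round(score, 3))
```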

For Implicit Feedback (Binary Relevance)

When dealing with implicit data (e.g., purchases, views), where relevance is binary (1 for interaction, 0 otherwise):

  • Area Under the ROC Curve (AUC): Measures the model’s ability to rank relevant items higher than irrelevant ones across all possible thresholds. An AUC of 1 is perfect; 0.5 is random.
  • Hit Rate@k: Proportion of test cases where at least one recommended item in the top-k is relevant.

Both ALS and FP-growth can be evaluated this way—ALS by ranking items based on predicted scores, FP-growth by ranking itemsets or rules.
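
Hit Rate@k can be computed directly from the same per-user prediction and label lists; here is a sketch reusing the joined DataFrame and k from the ranking example (AUC is omitted because it requires scoring every user-item pair, which is heavier to set up):

```python
# Sketch: Hit Rate@k - the share of test users whose top-k list contains
# at least one held-out item. Reuses `joined` and `k` from the ranking sketch.
from pyspark.sql import functions as F

hit_rate = (joined
            .withColumn("hit",
                        (F.size(F.array_intersect("prediction", "label")) > 0)
                        .cast("double"))
            .agg(F.avg("hit"))
            .first()[0])
print(f"HitRate@{k} = {hit_rate:.3f}")
```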

3. Model-Specific Considerations

  • ALS: Since it optimizes a loss function (e.g., squared error with regularization), RMSE or NDCG@k are natural fits. It’s sensitive to hyperparameters like rank (number of latent factors) and regularization strength, so tuning these is key.
  • FP-growth: Focuses on frequent itemsets, so metrics like support (frequency of itemsets) and confidence (strength of rules) are internal measures. For recommendation, you’d convert these into Precision@k or Recall@k by recommending items from high-confidence rules.
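
On the FP-growth side, PySpark's FPGrowth exposes frequent itemsets and association rules directly, and its transform() turns those rules into per-basket recommendations that can then be scored with Precision@k or Recall@k. A sketch, assuming a baskets DataFrame with an items array column; the support and confidence thresholds are placeholders:

```python
# Sketch: FP-growth rules as a recommender (thresholds are placeholders).
from pyspark.ml.fpm import FPGrowth

fp = FPGrowth(itemsCol="items", minSupport=0.01, minConfidence=0.2)
fp_model = fp.fit(baskets)

fp_model.freqItemsets.show(5)       # frequent itemsets with their counts
fp_model.associationRules.show(5)   # antecedent, consequent, confidence, lift

# transform() adds a "prediction" column containing the consequents of every
# rule whose antecedent is contained in the row's items, i.e. rule-based
# recommendations that can be evaluated like any other top-N list.
fp_model.transform(baskets).select("items", "prediction").show(5, truncate=False)
```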

4. Practical Tips

  • Data Splitting: Use time-based splits (e.g., train on older data, test on newer) to simulate real-world usage, especially for dynamic systems; see the sketch after this list.
  • Baseline Comparison: Compare against simple baselines (e.g., popularity-based recommendations) to ensure your model adds value.
  • Diversity and Novelty: Beyond accuracy, consider recommendation diversity (are suggestions varied?) or novelty (are they non-obvious?), though these are harder to quantify across models uniformly.
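
For the time-based split mentioned above, a minimal sketch, assuming the ratings DataFrame from earlier carries a numeric epoch "timestamp" column:

```python
# Sketch: train on the oldest ~80% of events, test on the newest ~20%.
# Assumes a numeric "timestamp" column (e.g., epoch seconds).
from pyspark.sql import functions as F

cutoff = ratings.approxQuantile("timestamp", [0.8], 0.01)[0]
train = ratings.filter(F.col("timestamp") <= cutoff)
test = ratings.filter(F.col("timestamp") > cutoff)
```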

5. Which Metrics to Choose?

  • Use RMSE/MAE if predicting explicit ratings (more relevant for ALS).
  • Use Precision@k, Recall@k, NDCG@k for top-N recommendations, as they work well for both ALS and FP-growth.
  • Use AUC for implicit feedback ranking tasks.
  • Pick k (e.g., 5, 10) based on how many recommendations users typically see.

In short, Precision@k, Recall@k, and NDCG@k are the most versatile metrics across models like ALS and FP-growth, especially for ranking tasks. They focus on what matters in recommendation systems: delivering relevant items to users. Tailor your choice to the specific problem—rating prediction or item ranking—and the type of feedback available.
