
When a Trained Model Performs Worse Than Theory

Our logistic regression scored ROC AUC = 0.505. Replacing it with a hand-weighted theory prior lifted the lottery top-decile hit rate 51% above the random baseline.

The instinct in any empirical domain is to throw a model at it. We have N=442 labeled examples — 196 life-peak events, 246 documented setbacks. We have 14 astrological features per example. Fit a logistic regression, extract the weights, ship the model.

We did that. The trained model came back with ROC AUC = 0.505, essentially a coin flip at separating peaks from setbacks, plus several weights with signs the theory flatly rules out:

  • benefic_harmonious_count weight: −0.34 (more harmonious transits from Jupiter/Venus → lower luck score. No serious astrology theory predicts this.)
  • progressed_harmonious_count: −0.28
  • profection_money_house: −0.31 (profection year lord in a money house → lower luck score. Classical Hellenistic wealth doctrine is the exact opposite.)

This wasn't a silly feature-engineering error. The model was trained correctly on the data we had; it converged to those weights because the data, at N=442 with modest effect sizes and multicollinearity, couldn't tell the trainer which sign was right.

Throwing more features at it made it worse. Adding regularization flattened the weights without making them correct. Trying sign constraints did help — but then we were just telling the model what the right answers were, which defeats the purpose of training.

The actual problem

Logistic regression needs one of two things to produce sensible weights: big effects, or big data. We had neither.

Most of our astrological features are 0/1 indicators that fire 5–25% of the time. The between-class effect sizes (Cohen's h) live in the 0.10–0.20 range, at or below Cohen's conventional "small" threshold of 0.2. To estimate effects that small precisely via logistic regression on binary features, you need sample sizes in the low thousands. We had 442.
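To make the sample-size claim concrete, here is a back-of-the-envelope check. The firing rates below (15% on peaks vs. 10% on setbacks) are illustrative, not taken from our feature set, and the calculation only covers detecting a single feature, which is a lower bar than estimating its coefficient among 14 correlated ones.

```python
# Illustrative power calculation: how many examples per class does a single
# small-effect binary feature need just to be *detected*? (Rates are made up.)
from math import asin, sqrt
from scipy.stats import norm

def cohens_h(p1: float, p2: float) -> float:
    """Effect size between two proportions (arcsine transform)."""
    return abs(2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2)))

def n_per_class(h: float, alpha: float = 0.05, power: float = 0.80) -> float:
    """Per-class sample size for a two-proportion test at the given alpha/power."""
    return (norm.ppf(1 - alpha / 2) + norm.ppf(power)) ** 2 / h ** 2

h = cohens_h(0.15, 0.10)   # fires 15% of the time on peaks, 10% on setbacks
print(f"h = {h:.2f}, n per class ≈ {n_per_class(h):.0f}")
# h ≈ 0.15 needs roughly 340 examples per class just to be detected; estimating
# its coefficient precisely alongside 13 correlated features needs far more.
```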

At N=442, the trainer's estimates of individual feature coefficients are noise-dominated. It picks up on incidental correlations in the specific sample — the handful of "setback" examples that happened to fall during planetary-hour-benefic times — and compensates by pushing the sign in whatever direction minimizes training loss on this sample. The result is a fitted model that performs at chance on held-out data.
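A toy simulation, using made-up features rather than our data, shows the failure mode: generate 14 correlated, low-firing-rate indicators whose true effects are all positive, fit a logistic regression at N = 442, and count how often a fitted coefficient comes out negative.

```python
# Toy demonstration of sign instability at N=442 with small, all-positive effects.
# Everything here (feature structure, effect size) is illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
N, P, BETA = 442, 14, 0.3          # cohort size, feature count, true per-feature effect

wrong_sign, trials = 0, 200
for _ in range(trials):
    # correlated 0/1 indicators that fire roughly 5-25% of the time
    shared = rng.normal(size=(N, 1))
    latent = 0.6 * shared + rng.normal(size=(N, P))
    X = (latent > rng.uniform(0.7, 1.7, size=P)).astype(float)
    # by construction every feature pushes the outcome probability up
    p = 1.0 / (1.0 + np.exp(-(-1.0 + BETA * X.sum(axis=1))))
    y = rng.binomial(1, p)
    coef = LogisticRegression(max_iter=1000).fit(X, y).coef_.ravel()
    wrong_sign += int((coef < 0).sum())   # any negative fitted sign is wrong here

print(f"wrong-sign coefficients: {wrong_sign / (trials * P):.1%} of fitted weights")
```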

What replaced it

We abandoned the trained model and built a theory-anchored prior model instead. No training. The weights come from three sources:

  1. Classical theory for features that match classical claims. Example: benefic_harmonious_count gets weight +0.82 because the classical doctrine (Brennan, Brady, Ebertin) says harmonious aspects from Jupiter/Venus are favorable, full stop.
  2. Permutation-study effect sizes for features where we have empirical evidence. Example: h1_zr_peak gets weight −0.40 (twice the observed Cohen's h) because a 10,000-permutation test said the signal is significantly negative.
  3. Peak-vs-setback gap as the sizing rule for features that fire above null but symmetrically across peaks and setbacks. This prevents "eventful day" signals like h6_vedic_benefic_md from dominating a luck-specific scoring model.

No feature gets a sign that contradicts either classical theory (where we have no empirical evidence) or permutation-test evidence (where we do). This is a much stronger guarantee than any trained model can give you at our sample size.
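Structurally, the prior model is just a hand-written weight table plus a sign check, something like the sketch below. The two weights quoted above are real; the third entry and the ALLOWED_SIGN table are placeholders to show the shape, not production names or values.

```python
# Minimal sketch of a theory-anchored scorer. Only the first two weights come
# from the post; everything else is a placeholder for illustration.
from typing import Mapping

WEIGHTS: Mapping[str, float] = {
    "benefic_harmonious_count": +0.82,  # classical doctrine: harmonious Jupiter/Venus
    "h1_zr_peak":               -0.40,  # 2x observed Cohen's h; sign from permutation test
    "profection_money_house":   +0.50,  # placeholder value; classical sign is positive
}

ALLOWED_SIGN = {  # +1 / -1 per feature, from classical theory or the permutation study
    "benefic_harmonious_count": +1,
    "h1_zr_peak":               -1,
    "profection_money_house":   +1,
}

# The guarantee the trained model couldn't give: no weight contradicts its allowed sign.
assert all(w * ALLOWED_SIGN[name] > 0 for name, w in WEIGHTS.items())

def luck_score(features: Mapping[str, float]) -> float:
    """Plain weighted sum over the hand-set weights; nothing is fit to data."""
    return sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
```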

Did it work?

The test we care about: for each of 53 lottery winners, where does the actual moment of winning rank within the 96 fifteen-minute slots of the same calendar day?

  • Random baseline: each slot is equally likely → median percentile = 0.500.
  • Old trained model: median percentile = 0.677, top-quartile rate = 0.245 (below random — the model was worse than random at putting the winning moment in the top 25%).
  • New theory-anchored model: median percentile = 0.708, top-quartile rate = 0.302, top-decile rate = 0.151.

The new model's top-decile rate — fraction of winners whose actual win moment scored in the top 10% of the day's slots — is 51% higher than random. The old trained model's top-quartile rate was below random.

That's a measurable, product-relevant improvement from deleting a trained model and writing weights by hand.
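For anyone who wants to reproduce the metric, the percentile ranking above reduces to a few lines. The function and variable names below are hypothetical (this is not the evaluation script), and ties between slot scores are ignored.

```python
# Sketch of the per-winner percentile metric; names are hypothetical.
import numpy as np

def win_percentile(day_scores: np.ndarray, win_idx: int) -> float:
    """Fraction of the day's 96 slot scores that the actual win slot strictly beats."""
    return float(np.mean(day_scores[win_idx] > day_scores))

def summarize(percentiles) -> dict:
    """Median percentile plus the top-quartile / top-decile hit rates."""
    p = np.asarray(percentiles, dtype=float)
    return {
        "median_percentile": float(np.median(p)),
        "top_quartile_rate": float(np.mean(p >= 0.75)),  # random baseline ≈ 0.25
        "top_decile_rate":   float(np.mean(p >= 0.90)),  # random baseline ≈ 0.10
    }
```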

The meta-point

This isn't a blanket argument against machine learning. Trained models work great when you have enough data and enough effect size to overwhelm the noise floor. That's why they dominate in adversarial-search domains (chess, Go) and perception (vision, speech): there is real structure, and there are millions of examples.

For emerging empirical domains with small cohorts and small effects — astrology, behavioral genetics, many parts of social science — a hand-tuned model anchored to domain theory and permutation-test evidence will often beat a trained model because it respects the fact that the data is insufficient to infer signs from correlations.

Machine learning isn't always the answer. Sometimes the answer is to do the statistics carefully, write down what the theory says, and use the data to check the theory rather than to derive the weights.

What we changed, concretely

Anyone can inspect the diff: the application code lives at services/astro-api/app/services/forecast_model.py, and the only change to the runtime was in the weights and the feature list, not the scoring logic.

What we're watching for

The theory-anchored model has one weakness: it's not learning from data the way a trained model would, which means it can't improve automatically as the cohort grows. The plan is to re-run the permutation study at each cohort-size milestone (266 peaks → 400 → 600) and re-anchor the weights using the sharper effect sizes.
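For reference, the label-permutation test we keep leaning on is conceptually simple. The sketch below uses a plain difference in firing rates as the statistic, which may not be the exact statistic the study uses; names are illustrative.

```python
# Generic label-permutation test: shuffle the peak/setback labels to build a null
# distribution for a feature's between-class gap. Statistic choice is illustrative.
import numpy as np

def permutation_test(feature: np.ndarray, is_peak: np.ndarray,
                     n_perm: int = 10_000, seed: int = 0) -> tuple[float, float]:
    rng = np.random.default_rng(seed)
    observed = feature[is_peak].mean() - feature[~is_peak].mean()
    null = np.empty(n_perm)
    for i in range(n_perm):
        shuffled = rng.permutation(is_peak)        # break the label/feature link
        null[i] = feature[shuffled].mean() - feature[~shuffled].mean()
    p_value = float(np.mean(np.abs(null) >= abs(observed)))   # two-sided
    return float(observed), p_value
```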

When the cohort is large enough that a constrained logistic regression (with sign constraints pinned by the permutation study) can reliably beat the theory prior on held-out data, we'll switch. Until then, the theory prior is the better bet.
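When that switch happens, one plausible shape for the constrained fit is a logistic regression with box constraints on the coefficients, signs pinned by the permutation study. This is a sketch under assumed names, not the planned implementation.

```python
# Sketch: logistic regression whose coefficient signs are pinned by a sign table.
# L-BFGS-B handles the box constraints; gradients are left to numerical estimation.
import numpy as np
from scipy.optimize import minimize

def fit_sign_constrained(X: np.ndarray, y: np.ndarray, signs: np.ndarray,
                         l2: float = 1.0) -> np.ndarray:
    """signs[j] = +1 forces coef[j] >= 0; signs[j] = -1 forces coef[j] <= 0."""
    n, p = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])          # prepend an intercept column

    def penalized_nll(beta):
        z = Xb @ beta
        # negative log-likelihood of the logistic model + L2 penalty (intercept exempt)
        return -(y * z - np.logaddexp(0, z)).sum() + l2 * (beta[1:] ** 2).sum()

    bounds = [(None, None)] + [(0, None) if s > 0 else (None, 0) for s in signs]
    fit = minimize(penalized_nll, x0=np.zeros(p + 1), method="L-BFGS-B", bounds=bounds)
    return fit.x                                   # [intercept, coef_1, ..., coef_p]
```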