IMO that's the fundamental difference between statistics and ML. The culture of stats is about fitting a model and interpreting the fit, while the culture of ML is to treat the model as a black box.
That's one of the reasons that multicollinearity is seen as a big deal by statisticians, but ML practitioners couldn't give a hoot.
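To make that concrete, here is a minimal sketch (numpy only; the data and variable names are mine, not from any particular study): two nearly collinear predictors produce coefficient estimates that swing wildly from sample to sample, while the fitted predictions barely move. That is exactly why the statistician interpreting coefficients worries and the black-box predictor doesn't.

    # Toy demo: unstable coefficients, stable predictions under
    # near-perfect collinearity. Numbers below are illustrative only.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    for trial in range(3):
        x1 = rng.normal(size=n)
        x2 = x1 + 0.01 * rng.normal(size=n)      # corr(x1, x2) ~ 0.999
        y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)
        X = np.column_stack([x1, x2])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        mse = float(np.mean((y - X @ beta) ** 2))
        # betas jump around from trial to trial; mse stays ~1
        print(trial, np.round(beta, 2), round(mse, 2))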
You are describing the difference between academic, mathematically oriented statisticians and the "applied/engineering/actuarial/business" people who use statistics. The "black box" culture goes back to before ML, before both computing Machines and statistical Learning (iterative models).
I suspect that the "black box" philosophy for statistics/ML is actually bad if you don't have a quick way of verifying the predictions. For instance, using PCA as a "black box" is perfectly fine if you're using it to de-noise readings from a camera or other instrument, because a human being can quickly tell if the de-noising is working correctly or not. But if you're using PCA to make novel discoveries, where you don't have an independent way of checking those discoveries, then it might be outright essential to have a deep definition-theorem-proof style understanding of PCA. What do people think of this hunch?
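For what it's worth, here is a hedged sketch of the "verifiable black box" case with synthetic data (scikit-learn's PCA; the array shapes and noise level are arbitrary choices of mine): keep the top components, reconstruct, and the denoising either visibly works or it doesn't.

    # Sketch: PCA as a denoiser you can check by eye. 300 noisy "frames"
    # of one underlying signal; reconstruct from the top 2 components.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    t = np.linspace(0, 1, 100)
    clean = np.sin(2 * np.pi * 5 * t)                 # shared low-rank structure
    X = clean + 0.5 * rng.normal(size=(300, 100))     # noisy observations

    pca = PCA(n_components=2)
    denoised = pca.inverse_transform(pca.fit_transform(X))

    # reconstruction error should drop well below the raw noise level
    print(float(np.mean((X - clean) ** 2)), float(np.mean((denoised - clean) ** 2)))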
The point about PCA applies to population genetics and psychometrics (IQ). In both fields, some conclusions derived using PCA appear to be supported by little else, and they have been called into question.
You make a good point, though the difference between ML and statistics isn't just about interpreting and validating the model. It's about the "novel discoveries" part aka Doing Science.
Statistical modeling is done primarily in service of scientific discovery: making an inference (a population estimate from a sample) or a comparison to test a hypothesis derived from a theoretical causal model of a real-world process, specified before viewing the data. The parameters of a model are interpreted because they represent an estimate of the treatment effect of some intervention.
Methods like PCA can be part of that modeling process either way, but analyzing and fitting models to data to mine it for patterns without an a priori hypothesis is not science.
Only perfect multicollinearity (an exact linear dependence among predictors, e.g., a pairwise correlation of exactly 1.0 or -1.0) is a problem at the linear-algebra level when fitting a statistical model: the design matrix becomes rank-deficient, so the normal equations have no unique solution.
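A quick numerical illustration of that point (numpy; the example matrices are mine): an exactly collinear column makes X'X singular, while a nearly collinear one merely makes it badly conditioned.

    # Perfect vs. near-perfect collinearity: rank and conditioning of X'X.
    import numpy as np

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=100)
    X_perfect = np.column_stack([x1, 2 * x1])                          # x2 = 2*x1 exactly
    X_near = np.column_stack([x1, 2 * x1 + 1e-6 * rng.normal(size=100)])

    for name, X in [("perfect", X_perfect), ("near", X_near)]:
        g = X.T @ X
        # perfect: rank 1, effectively infinite condition number;
        # near: full rank 2, just a very large condition number
        print(name, np.linalg.matrix_rank(g), np.linalg.cond(g))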
But theoretically speaking, in a scientific context, why would you want to fit an explanatory model that includes multiple highly (but not perfectly) correlated independent variables?
It shouldn't be an accident. Usually it's because you've intentionally taken multiple proxy measurements of the same theoretical latent variable and you want to reduce measurement error. So that becomes a part of your measurement and modeling strategy.
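As a toy sketch of that strategy (numpy; the latent variable, number of proxies, and noise level are made up): several deliberately correlated proxies of one latent variable, whose composite (a simple mean here; a first principal component would behave similarly) tracks the latent variable better than any single proxy does.

    # Multiple noisy proxies of one latent variable; averaging them
    # reduces measurement error, which is why the proxies correlate.
    import numpy as np

    rng = np.random.default_rng(0)
    latent = rng.normal(size=1000)
    proxies = latent[:, None] + 0.8 * rng.normal(size=(1000, 4))   # 4 noisy measures
    composite = proxies.mean(axis=1)

    # each single proxy correlates ~0.78 with the latent variable;
    # the composite correlates ~0.93
    print([round(float(np.corrcoef(latent, p)[0, 1]), 3) for p in proxies.T])
    print(round(float(np.corrcoef(latent, composite)[0, 1]), 3))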