Hacker News
Machine Learning: Curse of Dimensionality (edupristine.com)
54 points by brendamorgan on Aug 5, 2015 | 13 comments


As a counterpoint, the Blessing of Dimensionality is that local minima are mostly as good as the global minima.

In David Silver's RLDM 2015 talk @20 mins in

http://videolectures.net/rldm2015_silver_reinforcement_learn...

he recounts this as his experience.

The paper The Loss Surfaces of Multilayer Networks goes into this in depth: http://arxiv.org/abs/1412.0233

[edit] and the paper Qualitative Characterisation of Neural Net Optimisation http://arxiv.org/abs/1412.6544v4


Local minima are only nearly as good as global minima in very very deep networks.


And yet Goodfellow et al.'s paper uses only 2 hidden layers and 1 softmax output layer (on MNIST), which is not so very deep, yet the effect is still present.

(Qualitative Characterisation of Neural Net Optimisation ICML 2015) http://arxiv.org/abs/1412.6544v4

On p. 6, Goodfellow et al. compare the neural nets to linear solutions by removing the non-linear layers, so that a single linear function is learned, and find for deep linear networks that:

" All minima are global minima, and are linked to each other in a continuous manifold. "

Which is pretty exciting, because the received wisdom was that local minima are a really big problem.
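
For anyone curious, here is a minimal numpy sketch of the linear-interpolation experiment that the paper popularised: evaluate the loss along the straight line between an initial parameter vector and a trained solution. A toy quadratic loss stands in for the network's training loss, so this illustrates the procedure rather than reproducing the paper's results.

    import numpy as np

    # Walk along the straight line between an initial parameter vector theta_0
    # and a trained solution theta_1, recording the loss at each point.
    # A toy quadratic loss stands in for a network's training loss here.
    def loss(theta):
        return np.sum((theta - 3.0) ** 2)

    rng = np.random.default_rng(0)
    theta_0 = rng.normal(size=10)      # "initialisation"
    theta_1 = np.full(10, 3.0)         # "trained" parameters (minimum of the toy loss)

    for alpha in np.linspace(0.0, 1.0, 11):
        theta = (1 - alpha) * theta_0 + alpha * theta_1
        print(f"alpha={alpha:.1f}  loss={loss(theta):.4f}")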


Another problem worth mentioning is that when computing e.g. a Euclidean distance you end up adding together quantities of completely different types, and their relative weights are often entirely arbitrary, e.g. adding a distance to a weight.

Imagine three points on a two-dimensional plane: A(1,1), B(5,10), C(2,100). Clustered like this, A and B will end up in one cluster, while C is in another. Now let's scale the second axis down by a factor of 100, so we have A(1,0.01), B(5,0.1), C(2,1); now A and C are clustered together, while B is the second cluster. And that's just linear scaling; imagine how much havoc I could wreak with a logarithmic scale.
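
(For reference, a quick numpy sketch of the example above: computing the three pairwise Euclidean distances before and after the rescaling shows the closest pair flipping from (A, B) to (A, C).)

    import numpy as np

    # The three points from the example, before and after scaling the
    # second axis down by a factor of 100.
    A, B, C = np.array([1.0, 1.0]), np.array([5.0, 10.0]), np.array([2.0, 100.0])
    scale = np.array([1.0, 0.01])

    def pairwise(p, q, r):
        return {"AB": np.linalg.norm(p - q),
                "BC": np.linalg.norm(q - r),
                "AC": np.linalg.norm(p - r)}

    print("raw:   ", pairwise(A, B, C))                           # AB is by far the smallest
    print("scaled:", pairwise(A * scale, B * scale, C * scale))   # now AC is the smallest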

I haven't done anything to research this problem, it's just something I came across that made me uneasy about clustering as a whole.


You're supposed to normalize all dimensions so that they have the same numerical magnitude/range (as the simplest form of normalization).

The issue is definitely not specific to clustering; without normalization you would have the same problem of one dimension dominating the weighting even in simple regression.
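
A minimal sketch of the simplest fix, min-max normalization: rescale every feature to [0, 1] so that no single dimension dominates the distance (or the fit).

    import numpy as np

    # Min-max normalization: rescale every column to the [0, 1] range.
    X = np.array([[1.0, 1.0],
                  [5.0, 10.0],
                  [2.0, 100.0]])

    X_min, X_max = X.min(axis=0), X.max(axis=0)
    X_norm = (X - X_min) / (X_max - X_min)
    print(X_norm)    # every column now spans the same range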


That depends on your clustering algorithm. If it is, for example, based on shared-neighbor information, then you end up with the same clusters as you did before scaling.

And in your example, I get: (A, B) < (B, C) < (A, C) before and (A, B) < (B, C) < (A, C) after scaling.

So the order is preserved. It really depends on your clustering algorithm. DBSCAN for example could end up with a different result due to its epsilon-range parameter.
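
For the curious, a minimal sketch of a shared-nearest-neighbor similarity (the function name and the choice of k are just illustrative): two points are scored by how many of their k nearest neighbors they have in common, so only neighbor ranks matter, not raw distance values.

    import numpy as np

    # Shared-nearest-neighbour (SNN) similarity: count how many of the
    # k nearest neighbours two points have in common.
    def snn_similarity(X, k=3):
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)              # ignore self-distances
        knn = np.argsort(d, axis=1)[:, :k]       # indices of the k nearest neighbours
        sets = [set(row) for row in knn]
        n = len(X)
        return np.array([[len(sets[i] & sets[j]) for j in range(n)]
                         for i in range(n)])

    X = np.random.default_rng(0).normal(size=(8, 2))
    print(snn_similarity(X, k=3))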


The curse of dimensionality is real and problematic, but it only applies to the real dimension of the problem, not the measured/perceived one.

Let's say you have samples with 5 dimensions. That's still not (usually) considered within the realm of the curse.

Now, let's assume that because of the way they are measured, you actually have 10,000,000 measured dimensions for each sample. Now it starts to be a problem, especially if you don't know how that expansion happens: it might be linear, nonlinear, with noise, etc.

Even though we are now working in 10,000,000-dimensional space, and it looks like the curse of dimensionality applies, it does not. You just have to find a reasonable way to compact everything back down to the real problem dimension, or somewhere close enough to it.

The most popular tool for this is Random Projections, which follows from the work of Johnson and Lindenstrauss in 1984. Compressive Sensing is a complementary field.
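
A minimal numpy sketch of a Gaussian random projection (the sizes are made up for illustration): project high-dimensional samples onto a much lower-dimensional random basis, which approximately preserves pairwise distances in the Johnson-Lindenstrauss sense.

    import numpy as np

    # Gaussian random projection: X (n x d_high) is mapped to X @ R (n x d_low)
    # with a random matrix R; pairwise distances are approximately preserved.
    rng = np.random.default_rng(0)
    n_samples, d_high, d_low = 100, 10_000, 200   # illustrative sizes only

    X = rng.normal(size=(n_samples, d_high))
    R = rng.normal(size=(d_high, d_low)) / np.sqrt(d_low)
    X_low = X @ R

    # Compare an original and a projected distance.
    i, j = 0, 1
    print(np.linalg.norm(X[i] - X[j]), np.linalg.norm(X_low[i] - X_low[j]))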


What you want to get down to is the intrinsic dimensionality. Go below that and you start discarding useful information.


Relevant HN post from some time ago:

https://news.ycombinator.com/item?id=3995615

The problem with the super simple explanation at the beginning (searching for the penny) is that it doesn't really expose where our insight goes wrong in higher dimensions: everybody knows intuitively that as the number of dimensions goes up there's more "space". Insight fails, I think, in trying to understand what this statement really means and its effects on point distributions, i.e. on the objects themselves.


The curse of dimensionality refers to the non-intuitive properties of data observed when working in high-dimensional space, specifically those related to the usability and interpretation of distances and volumes.
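
One way to see the distance part of this is to look at the relative contrast between the nearest and the farthest neighbor as the dimension grows; a rough numpy sketch with uniform random data, just for illustration:

    import numpy as np

    # As the dimension grows, the gap between the nearest and the farthest
    # neighbour shrinks relative to the nearest distance, so "near" and "far"
    # become harder to tell apart.
    rng = np.random.default_rng(0)
    for dim in (2, 10, 100, 1000):
        X = rng.uniform(size=(500, dim))
        d = np.linalg.norm(X[1:] - X[0], axis=1)   # distances from one query point
        print(dim, (d.max() - d.min()) / d.min())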


You can get around the curse a little bit by using shared nearest-neighbor methods, but that only helps so much.

That said, the curse of dimensionality is not a precise notion. The real problem is the intrinsic dimensionality. There's a nice paper on how to estimate it: http://www.nii.ac.jp/TechReports/14-001E.pdf
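
A very crude way to get a feel for intrinsic dimensionality (not the estimator from the linked paper, just a PCA-style variance count on synthetic data):

    import numpy as np

    # Data that really lives in 5 dimensions, linearly embedded in 100.
    rng = np.random.default_rng(0)
    latent = rng.normal(size=(500, 5))
    X = latent @ rng.normal(size=(5, 100))

    # Count how many principal components explain 99% of the variance.
    _, s, _ = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    var = s ** 2 / np.sum(s ** 2)
    print(np.searchsorted(np.cumsum(var), 0.99) + 1)   # ~5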


I remember someone telling me that they prefer cosine similarity over Euclidean distance in many cases. I was looking into the reason why and found an interesting paper, "On the Surprising Behavior of Distance Metrics in High Dimensional Space" [1], which at one point argues that fractional metrics perform better than the more widely used distance metrics in high-dimensional spaces.

[1] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.23....
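
A quick sketch of what a fractional metric looks like in code, i.e. a Minkowski distance with exponent p < 1 (the values of p are just examples):

    import numpy as np

    # Minkowski distance with exponent p; p < 1 gives the "fractional"
    # metrics discussed in the paper.
    def minkowski(x, y, p):
        return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

    rng = np.random.default_rng(0)
    x, y = rng.uniform(size=100), rng.uniform(size=100)
    for p in (0.3, 0.5, 1.0, 2.0):
        print(p, minkowski(x, y, p))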


There is a cornucopia of distances to choose from. It's not as simple as "just use the best one." Cosine similarity assumes a certain projection of the data (by ignoring how far points are from the origin). It performs well whenever that's a good projection. I think it's best to think of it as one way to reduce dimensionality, and it should be used when you think that's the right way to reduce dimensionality.
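
A tiny sketch of the point about ignoring distance from the origin: scaling a vector leaves its cosine similarity unchanged but moves it far away in Euclidean terms.

    import numpy as np

    # Cosine similarity compares directions only.
    def cosine_sim(x, y):
        return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

    x = np.array([1.0, 2.0, 3.0])
    y = 10.0 * x                       # same direction, much farther from the origin
    print(cosine_sim(x, y))            # 1.0
    print(np.linalg.norm(x - y))       # clearly non-zero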



