Sunday, February 22, 2009

Comparing likelihoods of heterogeneous models

Given a dataset X, which model is the best model to fit it?
In a parametric family of models, this problem is often tackled by maximum likelihood estimation of the parameters. However, if the family of models has too much freedom, it may overfit, just as function approximation and regression do. A model A that, given the dataset X, predicts the dataset will be X and only X would trivially attain the maximum likelihood. It is like doing pdf estimation with kernel density estimation using a Dirac delta kernel.
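A minimal sketch of that degenerate case (my own toy example, assuming a Gaussian kernel whose bandwidth h is shrunk toward the Dirac delta limit; the data and bandwidths are arbitrary): the training log-likelihood grows without bound as h → 0, even though the model generalizes worse and worse.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=50)  # toy training set

def kde_loglik(data, points, h):
    """Total log-likelihood of `points` under a Gaussian KDE fit on `data`."""
    z = (points[:, None] - data[None, :]) / h  # pairwise standardized distances
    dens = np.exp(-0.5 * z**2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))
    return np.log(dens).sum()

# Shrinking the bandwidth toward zero inflates the training likelihood:
for h in [1.0, 0.1, 0.01, 0.001]:
    print(f"h = {h:6.3f}  train log-lik = {kde_loglik(X, X, h):10.2f}")
```

Each training point sits under its own kernel, so as h shrinks the density spike at that point (and hence the training likelihood) can be made arbitrarily large.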
How can we avoid this situation?
Since it is caused by the model's lack of generalization ability, one obvious remedy is to use a test set (or cross-validation). If the likelihoods on the test set and the training set do not differ significantly, the model is not overfitting.
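That diagnostic can be sketched as follows (reusing a Gaussian KDE as the over-flexible model; the split sizes and bandwidths are arbitrary choices of mine, and the log-density is computed via a log-sum-exp for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(1)
train, test = rng.normal(size=200), rng.normal(size=200)

def kde_loglik(data, points, h):
    """Average log-likelihood of `points` under a Gaussian KDE fit on `data`."""
    e = -0.5 * ((points[:, None] - data[None, :]) / h) ** 2
    m = e.max(axis=1)  # log-sum-exp trick: factor out the largest exponent
    logdens = m + np.log(np.exp(e - m[:, None]).sum(axis=1)) \
        - np.log(len(data) * h * np.sqrt(2 * np.pi))
    return logdens.mean()

# A large train/test gap flags overfitting; a small gap suggests the fit generalizes.
for h in [1.0, 0.3, 0.03]:
    tr, te = kde_loglik(train, train, h), kde_loglik(train, test, h)
    print(f"h = {h:5.2f}  train = {tr:7.3f}  test = {te:7.3f}  gap = {tr - te:6.3f}")
```

At a reasonable bandwidth the two likelihoods stay close; at a tiny bandwidth the training likelihood inflates while the test likelihood collapses, which is exactly the overfitting signature described above.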
Can we compare the likelihoods of two MLEs from two different model families, as long as neither overfits the data? I think so.
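As an illustration of such a heterogeneous comparison (a sketch of mine, not a general argument): fit a Gaussian and a Laplace distribution to the same data by maximum likelihood. Both families have just two parameters and little room to overfit, so their likelihoods can be compared directly.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=500)  # data actually drawn from a Gaussian

# Gaussian MLE: sample mean and (biased) standard deviation
mu, sigma = X.mean(), X.std()
gauss_ll = np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (X - mu)**2 / (2 * sigma**2))

# Laplace MLE: median and mean absolute deviation from it
loc = np.median(X)
b = np.abs(X - loc).mean()
laplace_ll = np.sum(-np.log(2 * b) - np.abs(X - loc) / b)

print(f"Gaussian MLE log-likelihood: {gauss_ll:9.2f}")
print(f"Laplace  MLE log-likelihood: {laplace_ll:9.2f}")
```

Since the data really are Gaussian, the Gaussian fit typically achieves the higher likelihood here; a held-out set, as above, can confirm that neither fit is overfitting before trusting the comparison.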
