Monday, September 23, 2013

Convnet issue: got nan or inf!

I was able to install the convnet library and tried to replicate the claimed 13% error rate on the CIFAR-10 dataset. However, my run would train for a few epochs and then die with the message "got nan or inf!".
The output looked like this:

5.1... logprob:  1.224800, 0.430300 (2.227 sec)
5.2... logprob:  1.251499, 0.436500 (2.165 sec)
5.3... logprob:     nan, 1.000000 ^ got nan or inf!
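
Since the failure point kept moving, it can help to pull it out of a saved log automatically. The following is only a minimal sketch: it assumes the log lines look exactly like the ones above, and that the two leading numbers are the epoch and the batch. The function name `first_bad_step` is made up for this example and is not part of the convnet library.

```python
import math
import re

# Matches lines such as "5.3... logprob:     nan, 1.000000 ^ got nan or inf!"
# Assumption: the leading "5.3" means epoch 5, batch 3.
LINE_RE = re.compile(r"^\s*(\d+)\.(\d+)\.\.\.\s+logprob:\s+(\S+),\s+(\S+)")

def first_bad_step(log_text):
    """Return (epoch, batch) of the first non-finite logprob, or None."""
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip anything that is not a training progress line
        epoch, batch = int(match.group(1)), int(match.group(2))
        try:
            logprob = float(match.group(3))  # float() accepts "nan" and "inf"
        except ValueError:
            continue  # unexpected token where the logprob should be
        if not math.isfinite(logprob):
            return epoch, batch
    return None
```

Running it over the logs of several attempts shows at a glance whether the blow-up always happens at the same step or at a different one each time.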


Sometimes it would run for 5 epochs before dying; other times it would last 40.

I emailed Alex Krizhevsky, the author of the library, and he said, "it can happen if the optimization becomes numerically unstable, but it can also be caused by a faulty GPU."
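
One rough way to tell the two causes apart is to repeat the exact same computation several times and compare the results bit for bit. The sketch below is my own illustration, not something from the library or from the email. `compute` is a hypothetical callable that you would have to write yourself; it should run a fixed computation on fixed inputs and return a flat sequence of floats. A mismatch is only a hint that the hardware may be misbehaving, not proof.

```python
import struct

def bit_pattern(values):
    """Pack floats into bytes so the comparison is bitwise (NaN included)."""
    values = list(values)
    return struct.pack("%dd" % len(values), *values)

def is_repeatable(compute, runs=3):
    """Call compute() `runs` times; True if every result matches the first."""
    reference = bit_pattern(compute())
    return all(bit_pattern(compute()) == reference for _ in range(runs - 1))
```

For example, `is_repeatable(lambda: [0.1 + 0.2])` returns True on a healthy machine, because the same arithmetic gives the same bits every time.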

So I switched my GPU for a different one and it worked!
If you see this problem, one simple thing to try is a different GPU. Good luck!