Something fun to do at a summer school?
What would be a good assignment for a computer vision summer school? That was the question I suddenly found myself puzzling over, after I agreed to be a last-minute replacement for another teacher at the Vision and Sports School 2022 in Prague.
What would I have liked to get as a student? Something fun, probably related to computer vision, but not your typical homework or coursework. Something light enough to enjoy, yet somewhat useful. Something with a take-home message, but learned from personal experience, not written on a whiteboard.
And what is the most important thing in all of machine learning? Data, of course. So… I got an idea.
The task
On the day of my course I gave a (quick) talk on deep learning, and we proceeded to the computer lab. The students got a link to a Google Colab notebook with a working fastai example of training an image classification model (a minimal sketch of such a notebook is shown after the task description below). No coding was required, but they were free to code if they wanted, e.g. to explore various CNN/ViT architectures. The task was the following:
You have to train a two-way image classification model.
Class 1 is “Beer” – it can hardly get more Czech than this.
Class 2 is “Trdelnik”, or “Chimney cake” as some translate it into English – a tourist-trap pastry, not traditional at all, but sold everywhere in Prague. I actually like it – what can be wrong with dough + sugar, right?
Your job is to produce a model that is as good as possible. You may write a Google search scraper, look for ready-made datasets, draw images yourself, generate synthetic data – I do not care. You can use ImageNet-pretrained models, or anything else.
When you think you are done, you call me. I come to your desk, download the test set from my super secret website, and we benchmark your model together (but nobody else gets to see the images).
The leaderboard with the best scores is kept on the physical whiteboard in the class. It is initialized with a single entry: 0% – the best result so far.
The person whose model performs best wins.
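For reference, here is a minimal sketch of the kind of fastai Colab notebook the students got. The checkpoint, hyperparameters, and the `data/beer/` and `data/trdelnik/` folder layout are my assumptions for illustration, not the actual notebook:

```python
from fastai.vision.all import *

# Assumed layout: data/beer/*.jpg and data/trdelnik/*.jpg,
# collected by the student (scraping, existing datasets, etc.).
path = Path("data")

# Label each image by its parent folder name, hold out 20% for validation.
dls = ImageDataLoaders.from_folder(
    path, valid_pct=0.2, seed=42, item_tfms=Resize(224)
)

# Fine-tune an ImageNet-pretrained ResNet for a few epochs.
learn = vision_learner(dls, resnet34, metrics=accuracy)
learn.fine_tune(3)

# Export so the model can be benchmarked on the hidden test set.
learn.export("beer_vs_trdelnik.pkl")
```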
What is the catch?
For the next hour or so, people were constructing their datasets. The most advanced ones even created validation sets. When the accuracy on the validation set reached 98%, someone got confident enough and called me.
We ran the benchmark and the result was 28.5% accuracy. The class giggled a bit. For the next 20 minutes nobody could cross that line – random chance would have done better than any of the models. Finally a new breakthrough came: 42%. People started to realise the task was not as easy as they had thought.
An hour later people finally got to 57%. The class was giggling and laughing with every trial – the number on the leaderboard barely moved. Finally someone used a CLIP model and got an astonishing 71%.
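I do not know exactly how that student used CLIP, but a typical zero-shot setup with the Hugging Face transformers API looks roughly like this; the checkpoint and the prompt wording are my guesses:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Zero-shot classification: compare the image against text prompts for each class.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of beer", "a photo of trdelnik, a chimney cake pastry"]
image = Image.open("test_image.jpg")

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity; softmax turns it into class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
print(labels[probs.argmax().item()])
```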
The catch is the data
What do you usually imagine when someone says “beer”? Maybe a nice glass of pilsner or IPA. Or maybe a bottle. Rarely a can. And what do you imagine when you hear “trdelnik”? Probably nothing, unless you live in Prague or have visited the city. Then you picture that special fire cage where the pastry is baked.
So, for the Beer class I selected canned beer, beer next to a barbecue fire pit, Duff from The Simpsons, and a top-down view of beer glasses next to shashlik. For the Trdelnik class I selected a close-up view with toppings – not yet popular at that time – that resemble beer foam, and a kiosk selling trdelnik. Finally, to make people less sad, I found an image containing both, so any answer would be correct. Here is my test set in full:
I also created another test set built on similar ideas of naturally adversarial examples – for those who wanted to try again. It was a bit simpler, but still challenging.
Everything good comes to an end
This trick worked for a couple more years – and then VLMs came. I tested PaliGemma on my tiny dataset – it gave me 85% accuracy out of the box. Time to invent something new.
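For the curious, asking PaliGemma through transformers looks roughly like the sketch below; the specific checkpoint and the question prompt are my assumptions, not the exact setup I used:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

# Assumed checkpoint; the "mix" models accept free-form "answer en ..." prompts.
model_id = "google/paligemma-3b-mix-224"
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("test_image.jpg")
prompt = "answer en Is this beer or trdelnik?"

inputs = processor(text=prompt, images=image, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=10)

# Strip the prompt tokens and keep only the generated answer.
answer = processor.decode(
    out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answer)
```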