Zero-Shot Image Tagging: Harnessing AI for Efficient and Flexible Image Classification

Maria James
12 min read · May 29, 2023


Humans possess an incredible ability to distinguish an extensive range of object categories, numbering at least 30,000 basic classifications, and countless more specific subcategories. What's even more remarkable is our capacity to create new categories on the fly, either from just a few examples or based solely on high-level descriptions. In contrast, traditional computer vision techniques often rely on thousands of labeled samples for each object class to train a recognition model effectively. Taking inspiration from the human ability to recognize without extensive exposure, the fields of learning-to-learn and lifelong learning have gained significant attention. Researchers are exploring ways to enable machines to acquire knowledge and continually expand their recognition capabilities, mimicking the adaptability and flexibility exhibited by human cognition.

Automatic Image Tagging

Automatic image tagging refers to the process of assigning relevant descriptive tags or labels to images using AI algorithms. Instead of relying on manual efforts, these algorithms leverage machine learning techniques to analyze the visual features and content of images, enabling automated classification and tagging.

How Does Automatic Image Tagging Work?

Automatic image tagging relies on powerful AI models, such as convolutional neural networks (CNNs), to extract meaningful features from images. These models are trained on vast labeled datasets, allowing them to learn patterns, textures, shapes, and other visual characteristics associated with different objects, scenes, or concepts.

Once trained, the AI model can process new, unlabeled images and predict relevant tags based on the learned patterns. This process involves passing the image through the model, which generates a set of probabilities or confidence scores for various predefined tags. The top-ranking tags with high probabilities are then assigned to the image.
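
To make this pipeline concrete, here is a minimal sketch of the forward-pass-and-top-K procedure using a pretrained classifier. It assumes a recent torchvision install; the file name `photo.jpg`, the choice of ResNet-50, and the use of the ImageNet class names as the predefined tag set are illustrative placeholders, not a specific production setup.

```python
# Minimal sketch: pass an image through a pretrained CNN and keep the
# highest-confidence predefined tags. Assumes torchvision >= 0.13.
import torch
from PIL import Image
from torchvision import models

weights = models.ResNet50_Weights.IMAGENET1K_V2
model = models.resnet50(weights=weights).eval()

preprocess = weights.transforms()                      # resize, crop, normalize
image = preprocess(Image.open("photo.jpg")).unsqueeze(0)

with torch.no_grad():
    probs = model(image).softmax(dim=1)                # confidence per predefined tag

top_probs, top_ids = probs.topk(5)                     # keep the 5 top-ranking tags
tags = [weights.meta["categories"][int(i)] for i in top_ids[0]]
print(list(zip(tags, top_probs[0].tolist())))
```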

Zero-Shot Image Tagging

Zero-shot image tagging is an exciting extension of automatic image tagging that goes beyond the limitations of predefined tags. Traditional image tagging methods require a predefined set of tags for training and prediction. In contrast, zero-shot image tagging allows the assignment of tags to images that were not seen during the training phase, enabling a more flexible and versatile tagging approach.

Humans are capable of recognizing novel objects that they have never seen in the past given some information/description about them.

What is Zero-shot learning?

Zero-shot learning (ZSL) is a problem setup in machine learning, where at test time, a learner observes samples from classes that were not observed during training, and needs to predict the category they belong to.

It can be considered as a special case of transfer learning where the source and target domains have different tasks/label spaces and the target domain is unlabelled, providing little guidance for the knowledge transfer.

Example: Zero-shot tagging for detecting COVID-19 presence

In the context of medical imaging, such as lung scans, zero-shot tagging can be employed to detect and attribute specific visual characteristics indicative of COVID-19 infection.

For instance, in lung scans of COVID-19 patients, certain visual attributes emerge that signify infection. These attributes include foggy effects, white spot features spread across different lung areas, reduced visibility of bones and other organs due to a dense distribution of inflammation, and a dominance of white or low-intensity pixels within the lung region. By leveraging zero-shot tagging techniques, the AI model can learn to recognize and associate these visual attributes with COVID-19 infection, enabling accurate identification and classification of lung scans from patients with COVID-19.

Now, let’s delve into the problem statement at hand!

What we have: training and zero-shot classes

Considering that no samples from the zero-shot classes are used during training, how is it possible to recognize objects that were never seen before?

Let's find out!

In zero-shot learning, the data consists of:

  • Seen classes: classes for which labelled images are available during training.
  • Unseen classes: classes for which no labelled images are available during training.
  • Auxiliary information: descriptions, semantic attributes, or word embeddings for both seen and unseen classes, available at training time. This information acts as the bridge between seen and unseen classes (see the sketch below).
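
To make the role of this auxiliary information concrete, here is a minimal, hedged sketch of the core idea: an image feature is mapped into the word-embedding space and compared against class embeddings, so that even a class with no training images (the made-up "okapi" below) can still be scored. All names, shapes, and the random placeholder embeddings are illustrative; in practice the projection W would be learned from seen-class data only, and the word vectors would come from word2vec or GloVe.

```python
# Sketch of the "semantic bridge": score classes by comparing a projected
# image feature against class word vectors, so unseen classes can be scored too.
import numpy as np

d = 300                                       # word-embedding dimension (e.g., GloVe)
class_names = ["zebra", "horse", "okapi"]     # "okapi" has no training images (unseen)
word_vecs = {c: np.random.randn(d) for c in class_names}   # stand-ins for real embeddings

def predict(image_feature, W):
    v = W @ image_feature                     # visual -> semantic projection
    v /= np.linalg.norm(v)
    scores = {c: float(v @ (u / np.linalg.norm(u))) for c, u in word_vecs.items()}
    return max(scores, key=scores.get)        # nearest class embedding, seen or unseen

W = np.random.randn(d, 2048)                  # would be learned from seen-class data only
print(predict(np.random.randn(2048), W))
```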

However, there are three pertinent issues that underpin the zero-shot image tagging task:

  1. Any object or concept can either be present at a localized region or be inferred from the holistic scene information (e.g., ‘sun’ vs ‘sunset’).
  2. Objects and concepts are often occluded in natural scenes and scene context provides valuable information for image tagging.
  3. The assignment of multiple image tags (seen and unseen) requires an accurate mapping function from the visual to the semantic domain. The available label space is significantly larger, comprising thousands of possible tags, and an ideal zero-shot framework should have the flexibility to incorporate new unseen tags on the fly during testing.

Steps in Zero-Shot Image Tagging

  1. Problem Formulation:
  • The set of ‘seen’ tags is denoted by S = {1, . . ., S}.
  • The set of ‘unseen’ tags is denoted by U = {S+1, . . ., S+U}.
  • S ∩ U = ∅, where S and U also denote the total number of seen and unseen tags, respectively.
  • C = S ∪ U, so that C = S + U is the cardinality of the tag-label space.
  • For each tag c ∈ C, we can obtain a d-dimensional word vector vc (e.g., word2vec or GloVe) as its semantic embedding.
  • The training examples are defined as a set of tuples {(Xs, ys) : s ∈ [1, M]}, where Xs is the s-th input image and ys ⊂ S is its set of relevant seen tags.
  • The u-th testing image is represented as Xu, which corresponds to a set of relevant seen and/or unseen tags yu ⊂ C.
  • Xu, yu, U and the corresponding unseen word vectors are not observed during training.

Now, we define the following problems:

• Conventional tagging: Given Xu as input, assign relevant seen tags yu ⊂ S.

• Zero-shot tagging (ZST): Given Xu, assign relevant unseen tags yu ⊂ U.

• Generalized zero-shot tagging (GZST): Given Xu as input, assign relevant tags from both seen and unseen yu ⊂ C.
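
Seen this way, the three tasks differ only in the candidate set from which the top-K tags are drawn. A tiny, hypothetical sketch (the tags and scores below are made up):

```python
# The three settings share the same scores and differ only in the candidate set.
def assign_tags(scores, candidates, k=2):
    return sorted(candidates, key=lambda t: scores[t], reverse=True)[:k]

seen, unseen = {"dog", "grass", "sky"}, {"frisbee", "meadow"}
scores = {"dog": 0.9, "grass": 0.7, "sky": 0.4, "frisbee": 0.8, "meadow": 0.3}

conventional = assign_tags(scores, seen)            # tags from S only
zst          = assign_tags(scores, unseen)          # tags from U only
gzst         = assign_tags(scores, seen | unseen)   # tags from C = S ∪ U
```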

2. MIL formulation:

  • We formulate the above problem definitions in the Multiple Instance Learning (MIL) framework.
  • Let us represent the s-th training image with a bag of n + 1 instances Xs = {xs,0, . . ., xs,n}, where the i-th instance xs,i represents either an image patch (for i > 0) or the complete image itself (for i = 0).
  • We assume that each instance xs,i has an individual label ls,i which is unknown during training.
  • As ys represents the relevant seen tags of Xs, according to the MIL assumption the bag has at least one instance for each tag in the set ys and no instance for the tags in {S \ ys}.
  • The bag- and instance-level labels are therefore related by: y ∈ ys ⇔ ∃ i ∈ {0, . . ., n} such that ls,i = y.
  • Thus, instances in Xs can work as positive examples for y ∈ ys and negative examples for y′ ∈ {S \ ys}.
  • This formulation does not use instance-level tag annotations, which makes it a weakly supervised problem.

Our aim is to design and learn an end-to-end deep learning model that can itself generate the appropriate bag-of-instances and simultaneously assign relevant tags to the bag.
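
As a data-layout sketch (all sizes are illustrative), a bag is simply the feature of the whole image stacked with the features of its n regions, and the only supervision is a bag-level multi-hot vector over the seen tags:

```python
# MIL data layout: one bag per image, supervision only at the bag level.
import torch

n, D, S = 32, 2048, 925                 # illustrative sizes
bag = torch.randn(n + 1, D)             # row 0: whole image; rows 1..n: region instances
bag_label = torch.zeros(S)              # multi-hot y_s over seen tags
bag_label[[12, 87, 310]] = 1.0          # relevant seen tags for this image
# No instance-level labels exist: which instance explains which tag stays latent.
```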

Network Architecture

  • It is composed of two parts: bag generation network and Multiple Instance Learning (MIL) network.
  • The bag generation network generates a bag-of-instances as well as their visual features.
  • The MIL network processes the resulting bag of instance features to produce the final multi-label prediction, which is calculated by globally pooling the prediction scores of the individual instances.
  • In this manner, bag generation and zero-shot prediction steps are combined in a unified framework that effectively transfers learning from seen to unseen tags.
  1. Bag generation
  • A Faster-RCNN model with Region Proposal Network (RPN) is learned using the ILSVRC-2017 detection dataset.
  • A base network (e.g., ResNet-50, VGG, or GoogLeNet), shared between the RPN and the classification/localization network, is used as the backbone. The base network is initialized with pre-trained weights.
  • Now, given a training image Xs, the RPN can produce a fixed number (n) of region-of-interest (ROI) proposals {xs,1, . . ., xs,n} with a high recall rate.
  • For image tagging, all tags may not represent an object. Rather, tags can be concepts that describe the whole image, e.g., nature and landscape.
  • To address this issue, we add a global image ROI (denoted by xs,0), comprising the complete image, to the ROI proposal set generated by the RPN.
  • Afterwards, the ROIs are fed to ROI-Pooling and subsequent densely connected layers to calculate the D-dimensional feature set Fs = [fs,0 . . . fs,n] ∈ R^(D×(n+1)), where fs,0 is the feature representation of the whole image.
  • This bag is then forwarded to the MIL network for prediction.
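
A minimal sketch of this bag-generation step is below. It pools one feature vector per ROI from a shared backbone feature map with RoI Align, after prepending a whole-image ROI to the proposals. The proposal boxes are stand-ins for what an RPN would output, and the specific backbone, image size, and pooling choices are assumptions, not the original implementation.

```python
# Bag generation sketch: features for n region proposals plus one global ROI.
import torch
from torchvision.models import resnet50, ResNet50_Weights
from torchvision.ops import roi_align

backbone = torch.nn.Sequential(
    *list(resnet50(weights=ResNet50_Weights.DEFAULT).children())[:-2]
).eval()                                            # conv feature map, stride 32

image = torch.randn(1, 3, 800, 800)                 # a preprocessed input image
with torch.no_grad():
    fmap = backbone(image)                          # (1, 2048, 25, 25)

proposals = torch.tensor([[  0.,   0., 400., 400.],
                          [300., 200., 780., 640.]])   # stand-ins for RPN proposals
global_roi = torch.tensor([[0., 0., 800., 800.]])      # whole-image ROI x_{s,0}
boxes = [torch.cat([global_roi, proposals])]           # one box list per image

pooled = roi_align(fmap, boxes, output_size=(7, 7), spatial_scale=1 / 32)
bag_features = pooled.mean(dim=(2, 3))              # F_s: (n + 1, 2048) bag of instances
```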

2. MIL Network:

  • Our network design then comprises two component blocks within the MIL network: a ‘bag processing block’ and a ‘semantic alignment block’.
  • The bag processing block has two fully connected layers with a non-linear activation ReLU.
  • The role of this block is to remap the bag of features to the dimension of the semantic embedding space, producing F′s ∈ R^(d×(n+1)) from Fs.
  • F′s is forwarded to the semantic alignment block.
  • This block performs two important operations to calculate the final score for each bag: (i) semantic projection and (ii) MIL pooling operation.

Based on the order of these two operations, this block can be implemented in two ways:

Case 1- Semantic domain aggregation:

  • In this case, the visual domain features are first mapped to semantic space and their responses are aggregated.
  • Specifically, given a bag of instance features F′s, we first compute the prediction scores of the individual instances, Ps = [ps,0 . . . ps,n] ∈ R^(S×(n+1)), by projecting them onto the fixed semantic embedding W = [v1 . . . vS] ∈ R^(d×S) containing the word vectors of the seen tags: Ps = W^T F′s.
  • Since supervision is only available for bag-level predictions (i.e., image tags), we require an aggregation mechanism A(·) to combine the prediction scores Ps of the individual instances in a bag.
  • Using semantic-domain aggregation, we obtain the final bag score as zs = A(Ps) ∈ R^S, where A(·) pools over the instance dimension.

Case 2- Visual domain aggregation:

  • In this case, the visual features are first aggregated using a pooling operation and then transformed to the semantic domain.
  • Specifically, given a bag of instance features F′s, we first perform the pooling operation to obtain a universal feature representation of the bag, f″s = A(F′s) ∈ R^d.
  • After that, we project f″s onto the semantic embedding space to calculate the final score of the bag: zs = W^T f″s ∈ R^S.
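
The two cases can be contrasted in a few lines. This sketch uses mean pooling for A(·) purely for brevity; with a linear projection and mean pooling the two orders actually coincide, and they only differ for non-linear or learned pooling operators. Shapes are illustrative.

```python
# Contrast of the two aggregation orders (mean pooling used for A(·) for brevity).
import torch

d, n, S = 300, 32, 925
F_prime = torch.randn(d, n + 1)        # bag after the bag-processing block
W = torch.randn(d, S)                  # word vectors of the seen tags

# Case 1 - semantic-domain aggregation: project every instance, then pool scores.
P = W.t() @ F_prime                    # (S, n+1) per-instance tag scores
z_case1 = P.mean(dim=1)                # (S,) final bag score

# Case 2 - visual-domain aggregation: pool instance features, then project once.
f_pooled = F_prime.mean(dim=1)         # (d,) universal bag representation
z_case2 = W.t() @ f_pooled             # (S,) final bag score
```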

3. Aggregation Mechanism:

  • The aggregation mechanism can be implemented using a non-parametric (fixed) or a parametric (learnable) function.
  • Fixed Pooling: Given a set of input vectors, the aggregated output o can be obtained via max, mean, or log-sum-exp (LSE) pooling, as written out below.
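
Written out, the standard forms of these fixed pooling operators, applied to per-instance values h0, . . ., hn of a single output dimension, are as follows (this notation and the sharpness parameter r > 0 are mine, not the paper's):

```latex
\begin{align}
  o_{\max}        &= \max_{i}\; h_i \\
  o_{\text{mean}} &= \frac{1}{n+1} \sum_{i=0}^{n} h_i \\
  o_{\text{LSE}}  &= \frac{1}{r} \log\!\Bigg( \frac{1}{n+1} \sum_{i=0}^{n} \exp\bigl(r\, h_i\bigr) \Bigg)
\end{align}
```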

4) Loss formulation: Suppose, for the s-th training image, zs = [z1 . . . zS] contains the final multi-label prediction scores of the bag for the seen tags.

  • This bag is a positive example for each tag y ∈ ys and a negative example for each tag y′ ∈ {S \ ys}.
  • Thus, for each pair (y, y′), the difference zy′ − zy represents the disparity between the predictions for a negative and a positive tag. Our goal is to minimize these differences in each iteration.
  • We formalize the loss of a bag considering it to contain both positive and negative examples for different tags:
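
One loss that matches this description (a standard soft-margin pairwise ranking loss, given here as an assumption rather than the exact form from the original work) penalizes every positive/negative tag pair whose scores are in the wrong order:

```latex
\begin{equation}
  \mathcal{L}(X_s, y_s) =
  \frac{1}{\lvert y_s \rvert \, \lvert S \setminus y_s \rvert}
  \sum_{y \in y_s} \; \sum_{y' \in S \setminus y_s}
  \log\bigl(1 + \exp(z_{y'} - z_{y})\bigr)
\end{equation}
```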

5) Prediction:

  • During testing, we modify the fixed embedding W to include both seen and unseen word vectors instead of only seen word vectors.
  • Suppose, after this modification, W becomes W′ = [v1 . . . vS, vS+1 . . . vS+U] ∈ R^(d×C). With W′, we get prediction scores for both seen and unseen tags for each individual instance in the bag.
  • Then, after the global pooling step, we get the final prediction score for each seen and unseen tag.
  • Finally, based on the tagging task (conventional / zero-shot / generalized zero-shot), we assign the top-K tags with the highest scores (from the set S, U or C, respectively) to the input image, as sketched below.
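
A minimal sketch of this test-time step (the shapes and the use of mean pooling are illustrative):

```python
# Test-time prediction sketch: extend the embedding with unseen word vectors,
# score every instance, pool over the bag, then keep the top-K tags.
import torch

d, n, S, U, K = 300, 32, 925, 400, 5
F_prime  = torch.randn(d, n + 1)                 # bag of a test image
W_seen   = torch.randn(d, S)
W_unseen = torch.randn(d, U)                     # word vectors used only at test time
W_prime  = torch.cat([W_seen, W_unseen], dim=1)  # W' with shape (d, S + U)

scores = (W_prime.t() @ F_prime).mean(dim=1)     # (S + U,) bag-level tag scores
gzst_tags = scores.topk(K).indices               # GZST: rank over C = S ∪ U
zst_tags  = scores[S:].topk(K).indices + S       # ZST: rank over unseen tags only
```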

Outcome of zero-shot image tagging

Challenges

  1. Domain shift:
  • The problem where the training and testing data come from two different distributions.
  • The distribution of seen classes on which the model is trained is different from the distribution of unseen classes on which the model has to be tested.
  • Since our deep network learns the mapping function using only seen classes during training, it might not generalize well to out-of-distribution unseen classes at test time.

2. Bias:

  • The model has access only to seen-class image-label pairs during training; no unseen-class images are available.
  • This makes the model inherently biased towards predicting the seen classes as the correct class at test time.
  • This becomes especially crucial in the case of generalized ZSL, where a test image can belong to either a seen or an unseen class.
  • Since the model is biased towards seen classes, it often misclassifies unseen-class images as one of the seen classes, which reduces recognition performance drastically.

3. Cross-domain knowledge transfer

  • While training a zero-shot model, we have visual image features for seen classes and only semantic information for unseen classes.
  • However, at test time we need to recognize visual features from unseen classes.
  • Thus, how well our model is at knowledge transfer from semantic domain to visual domain plays a crucial role in zero-shot recognition.

4. Semantic loss

  • While learning a classifier on the seen classes during training, the model might ignore attributes/features that do not help it differentiate between the seen classes.
  • However, those ignored features could be helpful to differentiate between unseen classes at test time.

5. Hubness

The problem of hubness arises when high-dimensional vectors are projected into a low-dimensional space. Such a projection reduces variance, so the mapped points cluster together and a few points become ‘hubs’ that appear as the nearest neighbours of many queries, which hurts nearest-neighbour-based prediction.

Practical Applications

  1. Fine-grained object recognition and zero-shot learning in remote sensing imagery (e.g., recognizing and classifying trees)
  2. COVID-19 screening
  3. Neural decoding from fMRI images
  4. Face verification
  5. Object recognition
  6. Video understanding
  7. Natural language processing

Benefits of Zero-Shot Image Tagging

  1. Improved Flexibility: Zero-shot image tagging provides greater flexibility in handling novel or rare concepts that may not have been part of the initial training data. It allows AI models to assign relevant tags to unseen images accurately, expanding the range of objects or scenes that can be tagged automatically.
  2. Enhanced Image Understanding: Zero-shot image tagging encourages AI models to learn rich visual representations and understand the relationships between different concepts. By associating visual features with textual descriptions, these models can capture more nuanced and abstract visual concepts, enabling more accurate and descriptive image tagging.
  3. Multilingual Tagging: Zero-shot image tagging also opens up possibilities for multilingual tagging. Models trained with textual descriptions in one language can infer tags in another language, facilitating cross-lingual image understanding and improving accessibility for global audiences.
  4. Contextual Image Tagging: Zero-shot image tagging can consider contextual information to assign tags to images. By analyzing the surrounding text or the broader context in which an image appears, the model can infer relevant tags based on the textual cues, further enhancing the accuracy and relevance of image tagging.

Conclusion:

Automatic image tagging, including the exciting realm of zero-shot image tagging, has revolutionized the way we organize and categorize visual content. By leveraging AI algorithms and powerful neural networks, we can automate the process of assigning descriptive tags to images, enhancing searchability, and enabling efficient content management. As advancements continue, zero-shot image tagging holds tremendous potential for addressing new tagging challenges, facilitating broader generalization, and expanding the applications of automated image classification in various domains.
