When a vision-language model (VLM) is prompted to identify an entity depicted in an image, it may answer "I see a conifer," rather than the specific label "Norway spruce". This raises two issues for evaluation: Firstly, the unconstrained generated text needs to be mapped to the evaluation label space (i.e., "conifer"). Secondly, a useful classification measure should give partial credit to less specific, but not incorrect, answers ("Norway spruce" being a type of "conifer"). To meet these requirements, we propose a framework for evaluating unconstrained text predictions, such as those generated by a vision-language model, against a taxonomy. Specifically, we propose the use of hierarchical precision and recall measures to assess the correctness and specificity of predictions with regard to a taxonomy. Experimentally, we first show that existing text similarity measures do not capture taxonomic similarity well. We then develop and compare different methods to map textual VLM predictions onto a taxonomy. This allows us to compute hierarchical similarity measures between the generated text and the ground truth labels. Finally, we analyze modern VLMs on fine-grained visual classification tasks based on our proposed taxonomic evaluation scheme.
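To illustrate the idea of hierarchical precision and recall, the following is a minimal sketch using the standard ancestor-set formulation (hierarchical precision is the overlap of the prediction's and ground truth's ancestor sets divided by the size of the prediction's set; hierarchical recall divides by the ground truth's set). The toy taxonomy and function names here are illustrative assumptions, not the paper's actual implementation:

```python
# Toy taxonomy as a child -> parent mapping (illustrative, not from the paper).
TAXONOMY = {
    "Norway spruce": "spruce",
    "spruce": "conifer",
    "conifer": "tree",
    "tree": "plant",
    "oak": "tree",
}

def ancestors(label):
    """Return the set containing the label itself and all of its ancestors."""
    nodes = {label}
    while label in TAXONOMY:
        label = TAXONOMY[label]
        nodes.add(label)
    return nodes

def hierarchical_pr(pred, truth):
    """Hierarchical precision and recall via ancestor-set overlap."""
    p, t = ancestors(pred), ancestors(truth)
    overlap = len(p & t)
    return overlap / len(p), overlap / len(t)

# A correct but less specific answer keeps full precision but loses recall:
# "conifer" vs. ground truth "Norway spruce".
print(hierarchical_pr("conifer", "Norway spruce"))  # → (1.0, 0.6)
```

Under this formulation, answering "conifer" for a Norway spruce receives partial credit (reduced recall) rather than being scored as fully wrong, while an unrelated answer such as "oak" would lose both precision and recall.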