# vlm-competence-dev Documentation
## Overview

This repository provides utilities for extracting hidden states from state-of-the-art vision-language models. By surfacing these intermediate representations, you can analyze the knowledge encoded within each model.
## Supported Models
We currently support extracting hidden states from the following vision-language models. The architecture name (used for model selection) is shown in square brackets:
- BLIP-2 [`blip2`]
- CLIP [`clip`]
- CogVLM [//PENDING//]
- Glamm [`glamm`]
- InternLM-XComposer [`internlm-xcomposer`]
- InternVL [`internvl`]
- Janus [`janus`]
- LLaVa [`llava`]
- MiniCPM-V2 [`minicpm`]
- MiniCPM-o [`minicpm`]
- Molmo [`molmo`]
- OMG-LLaVa [//PENDING//]
- PaliGemma [`paligemma`]
- Qwen [`qwen`]
## Setup
First, clone the repository:
```bash
git clone https://github.com/repo/repo.git
cd repo
```
Because each model may have different dependencies, it is recommended to use a separate virtual environment for each model you run.
For example, using `conda`:

```bash
conda create -n <env_name> python=3.10
conda activate <env_name>
```
Or, using Python's built-in `venv` module:

```bash
python -m venv <env_name>
source <env_name>/bin/activate
```
After activating your environment, install the dependencies for your desired model architecture, replacing `<architecture>` with the appropriate value (e.g., `blip2`, `llava`):

```bash
pip install -r envs/<architecture>.requirements.txt
```
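For instance, to install the dependencies for BLIP-2:

```bash
pip install -r envs/blip2.requirements.txt
```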
## Usage
```bash
python src/main.py \
    --architecture <architecture> \
    --model-path <model-path> \
    --debug \
    --config <config-file-path> \
    --input-dir <input-dir> \
    --output-db <output-db>
```
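As a concrete sketch, a LLaVa run might look like the following. The model path, config file, input directory, and output database names below are illustrative placeholders, not files shipped with this repository:

```bash
# All paths and names below are illustrative examples, not repository defaults.
python src/main.py \
    --architecture llava \
    --model-path liuhaotian/llava-v1.5-7b \
    --debug \
    --config configs/llava.yaml \
    --input-dir data/images \
    --output-db results/llava_hidden_states.db
```

Here `--debug` presumably enables more verbose diagnostic output and can be omitted for normal runs.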