clip-japanese-base

This is a Japanese CLIP (Contrastive Language-Image Pre-training) model developed by LY Corporation. It was trained on roughly 1B web-collected image-text pairs and can be applied to a range of visual tasks, including zero-shot image classification and text-to-image or image-to-text retrieval.

How to use

Install packages

pip install pillow requests sentencepiece transformers torch timm

import io
import requests
from PIL import Image
import torch
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

# Load the tokenizer, image processor, and model from the Hugging Face Hub.
# trust_remote_code=True runs the custom code shipped with the repository.
HF_MODEL_PATH = 'line-corporation/clip-japanese-base'
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
processor = AutoImageProcessor.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
model = AutoModel.from_pretrained(HF_MODEL_PATH, trust_remote_code=True).to(device)

# Download a sample image and preprocess it into model inputs.
image = Image.open(io.BytesIO(requests.get('https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg?auto=compress&cs=tinysrgb&dpr=3&h=750&w=1260').content))
image = processor(image, return_tensors="pt").to(device)
# Candidate labels: "dog", "cat", "elephant".
text = tokenizer(["犬", "猫", "象"]).to(device)

# Encode both modalities without tracking gradients, then turn the scaled
# image-text similarities into a probability distribution over the labels.
with torch.no_grad():
    image_features = model.get_image_features(**image)
    text_features = model.get_text_features(**text)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
# [[1., 0., 0.]]
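
The last step of the snippet turns image-text similarity scores into label probabilities. Below is a minimal NumPy sketch of that arithmetic. It assumes the features are meant to be compared by cosine similarity (the fixed scale of 100.0 suggests this, but the card does not state it), so it normalizes them explicitly. The function name `zero_shot_probs` and the toy vectors are illustrative only and are not part of the model's API.

```python
import numpy as np


def zero_shot_probs(image_features, text_features, scale=100.0):
    """Softmax over scaled image-text cosine similarities, one row per image."""
    image_features = np.asarray(image_features, dtype=np.float64)
    text_features = np.asarray(text_features, dtype=np.float64)
    # Normalize so that the dot product below is a cosine similarity.
    image_features = image_features / np.linalg.norm(image_features, axis=-1, keepdims=True)
    text_features = text_features / np.linalg.norm(text_features, axis=-1, keepdims=True)
    logits = scale * image_features @ text_features.T
    # Subtract the row maximum before exponentiating for numerical stability.
    logits = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)


# Toy 2-D "embeddings" (assumed values): one image, three candidate labels.
image = [[1.0, 0.0]]
labels = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
probs = zero_shot_probs(image, labels)
print(probs.argmax(axis=-1))  # index of the most similar label
```

With a scale as large as 100, small differences in cosine similarity already produce a near one-hot distribution, which is consistent with the `[[1., 0., 0.]]` output shown above.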

Model architecture

The model uses an Eva02-B Transformer architecture as the image encoder and a 12-layer BERT as the text encoder. The text encoder was initialized from rinna/japanese-clip-vit-b-16.

Evaluation

Dataset

STAIR Captions (v2014 val set of MSCOCO) for image-to-text (i2t) and text-to-image (t2i) retrieval. We measure performance using R@1, which is the average recall of i2t and t2i retrieval.
Recruit Datasets for image classification.
ImageNet-1K for image classification. We translated all classnames into Japanese. The classnames and templates can be found in ja-imagenet-1k-classnames.txt and ja-imagenet-1k-templates.txt.
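
One common way to combine class names with prompt templates for zero-shot classification is to embed every filled-in template and average the embeddings per class. The sketch below assumes that procedure and assumes the templates contain a `{}` placeholder; this card confirms neither. `encode_text`, `toy_encoder`, and the example template strings are hypothetical stand-ins for the real text encoder and the contents of the two files.

```python
import numpy as np


def build_class_embeddings(classnames, templates, encode_text):
    """Return one L2-normalized embedding per class, averaged over templates."""
    class_embeddings = []
    for name in classnames:
        prompts = [template.format(name) for template in templates]
        features = np.asarray(encode_text(prompts), dtype=np.float64)
        # Normalize each prompt embedding, average them, then normalize again.
        features = features / np.linalg.norm(features, axis=-1, keepdims=True)
        mean = features.mean(axis=0)
        class_embeddings.append(mean / np.linalg.norm(mean))
    return np.stack(class_embeddings)


def toy_encoder(prompts):
    """Hypothetical stand-in for a text encoder: deterministic pseudo-embeddings."""
    rows = []
    for prompt in prompts:
        rng = np.random.default_rng(sum(ord(ch) for ch in prompt))
        rows.append(rng.normal(size=8))
    return np.stack(rows)


# Assumed template format; the real templates may differ.
embeddings = build_class_embeddings(["犬", "猫"], ["{}の写真", "{}の絵"], toy_encoder)
print(embeddings.shape)  # (2, 8)
```

In practice `encode_text` would wrap the tokenizer and `model.get_text_features`, and the resulting matrix would take the place of `text_features` when scoring an image.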

Result

Model	Image Encoder Params	Text Encoder Params	STAIR Captions (R@1)	Recruit Datasets (acc@1)	ImageNet-1K (acc@1)
Ours	86M (Eva02-B)	100M (BERT)	0.30	0.89	0.58
Stable-ja-clip	307M (ViT-L)	100M (BERT)	0.24	0.77	0.68
Rinna-ja-clip	86M (ViT-B)	100M (BERT)	0.13	0.54	0.56
Laion-clip	632M (ViT-H)	561M (XLM-RoBERTa)	0.30	0.83	0.58
Hakuhodo-ja-clip	632M (ViT-H)	100M (BERT)	0.21	0.82	0.46
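
The STAIR Captions column reports recall@1 averaged over both retrieval directions. Below is a simplified sketch of that metric. It assumes exactly one matching caption per image, so the correct pairs lie on the diagonal of the similarity matrix. STAIR Captions provides several captions per image, so a real evaluation would need a caption-to-image mapping instead. `mean_recall_at_1` is an illustrative name, not the authors' evaluation code.

```python
import numpy as np


def mean_recall_at_1(similarity):
    """Average of i2t and t2i recall@1 for a square image-by-text similarity matrix."""
    similarity = np.asarray(similarity, dtype=np.float64)
    targets = np.arange(similarity.shape[0])
    # i2t: for each image (row), is the top-ranked text the paired one?
    i2t = np.mean(similarity.argmax(axis=1) == targets)
    # t2i: for each text (column), is the top-ranked image the paired one?
    t2i = np.mean(similarity.argmax(axis=0) == targets)
    return float((i2t + t2i) / 2.0)


# Toy similarity matrix (assumed values): rows are images, columns are texts.
sim = np.array([
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.1, 0.2, 0.7],
])
score = mean_recall_at_1(sim)
print(round(score, 4))  # 0.8333
```

In this toy case the second image retrieves the wrong text, giving an i2t recall of 2/3, while every text retrieves its own image, giving a t2i recall of 1; the average is 5/6.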

Licenses

The Apache License, Version 2.0

Citation

@misc{clip-japanese-base,
    title = {CLIP Japanese Base},
    author = {Shuhei Yokoo and Shuntaro Okada and Peifei Zhu and Shuhei Nishimura and Naoki Takayama},
    url = {https://huggingface.co/line-corporation/clip-japanese-base},
}
