This guide provides detailed instructions for optimizing the microsoft/trocr-base-printed model (https://huggingface.co/microsoft/trocr-base-printed) with OpenVINO.
To prepare your environment for model optimization and inference, run the following commands:
sudo apt update
sudo apt install git-lfs -y
python3 -m venv openvino-env
source openvino-env/bin/activate
pip install --upgrade pip
python -m pip install "optimum-intel[openvino]@git+https://github.com/huggingface/optimum-intel.git"
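To verify the installation, you can run a quick import check. This is an optional sanity step, a minimal sketch assuming a recent OpenVINO release that exposes a __version__ attribute:
import openvino
from optimum.intel.openvino import OVModelForVision2Seq  # confirms the optimum-intel OpenVINO backend is importable
print(openvino.__version__)  # prints the installed OpenVINO runtime version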
Optimize your Hugging Face models for inference using the OpenVINO runtime by replacing standard transformers model classes with the corresponding OpenVINO classes. For example, AutoModelForXxx becomes OVModelForXxx. For TrOCR, use OVModelForVision2Seq as shown below:
from transformers import TrOCRProcessor
from optimum.intel.openvino import OVModelForVision2Seq
from PIL import Image
import requests
# Load image from the IAM database
url = 'https://fki.tic.heia-fr.ch/static/img/a01-122-02-00.jpg'
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
processor = TrOCRProcessor.from_pretrained('microsoft/trocr-small-handwritten')
model = OVModelForVision2Seq.from_pretrained('microsoft/trocr-small-handwritten', export=True)
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
For 8-bit quantization during model loading, set load_in_8bit=True when calling from_pretrained():
model = OVModelForVision2Seq.from_pretrained('microsoft/trocr-small-handwritten', load_in_8bit=True, export=True)
NOTE: The load_in_8bit option is enabled by default for models with more than 1 billion parameters; you can disable it by passing load_in_8bit=False.
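If you prefer an explicit configuration object over the load_in_8bit shortcut, optimum-intel also provides OVWeightQuantizationConfig. The following is a minimal sketch; the exact parameter set may vary between optimum-intel versions, so consult the documentation for your installed release:
from optimum.intel import OVWeightQuantizationConfig
from optimum.intel.openvino import OVModelForVision2Seq

# 8-bit weight compression, the explicit equivalent of load_in_8bit=True
q_config = OVWeightQuantizationConfig(bits=8)
model = OVModelForVision2Seq.from_pretrained('microsoft/trocr-small-handwritten', quantization_config=q_config, export=True)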
Use the Optimum Intel CLI to export models from the Hugging Face Hub to OpenVINO IR with various levels of weight compression:
optimum-cli export openvino --model MODEL_ID --weight-format WEIGHT_FORMAT EXPORT_PATH
Replace the placeholders as follows:
- MODEL_ID: the ID of the Hugging Face model.
- WEIGHT_FORMAT: the desired weight format, one of {fp32, fp16, int8, int4, int4_sym_g128, int4_asym_g128, int4_sym_g64, int4_asym_g64}. Refer to the Optimum Intel documentation for more details.
- EXPORT_PATH: the directory path for storing the exported OpenVINO model.
- --ratio RATIO: optional compression ratio between primary and backup precisions (default: 0.8). For INT4, NNCF evaluates layer sensitivity and keeps the most impactful layers in INT8 precision (by default, 20% of layers stay in INT8), which helps preserve accuracy after weight compression; see the sketch after this list.
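The ratio control is also available from Python when quantizing at load time. Below is a hedged sketch assuming the OVWeightQuantizationConfig API shown earlier accepts bits and ratio parameters; check the Optimum Intel documentation for the exact signature:
from optimum.intel import OVWeightQuantizationConfig
from optimum.intel.openvino import OVModelForVision2Seq

# INT4 weight compression; ratio=0.8 keeps roughly 20% of the most
# sensitive layers in INT8, mirroring the CLI's --ratio default.
q_config = OVWeightQuantizationConfig(bits=4, ratio=0.8)
model = OVModelForVision2Seq.from_pretrained('microsoft/trocr-base-printed', quantization_config=q_config, export=True)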
To see complete usage, execute:
optimum-cli export openvino -h
Example commands to export microsoft/trocr-base-printed with different precision formats (FP16, INT8, and INT4):
optimum-cli export openvino --model microsoft/trocr-base-printed --weight-format fp16 ov_model_fp16
optimum-cli export openvino --model microsoft/trocr-base-printed --weight-format int8 ov_model_int8
optimum-cli export openvino --model microsoft/trocr-base-printed --weight-format int4 ov_model_int4
After conversion, pass the converted model path as model_id when calling from_pretrained().
You can also specify the target device (CPU, GPU, or MULTI:CPU,GPU) through the device argument of the same method. See the documentation for other supported device options: AUTO, HETERO, and BATCH.
from transformers import TrOCRProcessor
from optimum.intel.openvino import OVModelForVision2Seq
from PIL import Image
import requests
url = 'https://fki.tic.heia-fr.ch/static/img/a01-122-02-00.jpg'
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
device = "CPU"
ov_config = {"PERFORMANCE_HINT": "LATENCY", "CACHE_DIR": "./ov_cache"}
processor = TrOCRProcessor.from_pretrained('microsoft/trocr-base-printed')
model = OVModelForVision2Seq.from_pretrained(model_id='./ov_model_int8', device=device, ov_config=ov_config, export=False)
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
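If you later want to retarget the loaded model to a different device without reloading it from disk, OVModel instances expose a to() method that triggers recompilation for the new device. A brief sketch continuing the example above, assuming an Intel GPU is available on the system:
# Recompile the already-loaded model for another device; compilation
# happens lazily on the next inference call.
model.to("GPU")
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]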