---
name: image-ocr
description: Extract text content from images using Tesseract OCR via Python
---
Image OCR Skill
Purpose
This skill enables accurate text extraction from image files (JPG, PNG, etc.) using Tesseract OCR via the pytesseract Python library. It is suitable for scanned documents, screenshots, photos of text, receipts, forms, and other visual content containing text.
When to Use
- Extracting text from scanned documents or photos
- Reading text from screenshots or image captures
- Processing batch image files that contain textual information
- Converting visual documents to machine-readable text
- Extracting structured data from forms, receipts, or tables in images
Required Libraries
The following Python libraries are required:
import pytesseract
from PIL import Image
import json
import os
Input Requirements
- File formats: JPG, JPEG, PNG, WEBP
- Image quality: Minimum 300 DPI recommended for printed text; clear and legible text
- File size: Under 5MB per image (resize if necessary)
- Text language: Specify if non-English to improve accuracy
Output Schema
All extracted content must be returned as valid JSON conforming to this schema:
{
"success": true,
"filename": "example.jpg",
"extracted_text": "Full raw text extracted from the image...",
"confidence": "high|medium|low",
"metadata": {
"language_detected": "en",
"text_regions": 3,
"has_tables": false,
"has_handwriting": false
},
"warnings": [
"Text partially obscured in bottom-right corner",
"Low contrast detected in header section"
]
}
Field Descriptions
- `success`: Boolean indicating whether text extraction completed
- `filename`: Original image filename
- `extracted_text`: Complete text content in reading order (top-to-bottom, left-to-right)
- `confidence`: Overall OCR confidence level based on image quality and text clarity
- `metadata.language_detected`: ISO 639-1 language code
- `metadata.text_regions`: Number of distinct text blocks identified
- `metadata.has_tables`: Whether tabular data structures were detected
- `metadata.has_handwriting`: Whether handwritten text was detected
- `warnings`: Array of quality issues or potential errors
Code Examples
Basic OCR Extraction
```python
import pytesseract
from PIL import Image

def extract_text_from_image(image_path):
    """Extract text from a single image using Tesseract OCR."""
    img = Image.open(image_path)
    text = pytesseract.image_to_string(img)
    return text.strip()
```
OCR with Confidence Data
```python
import pytesseract
from PIL import Image

def extract_with_confidence(image_path):
    """Extract text with per-word confidence scores."""
    img = Image.open(image_path)
    # Get detailed OCR data including confidence
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
    words = []
    confidences = []
    for i, word in enumerate(data['text']):
        if word.strip():  # Skip empty strings
            words.append(word)
            confidences.append(data['conf'][i])
    # Average over positive confidences only (Tesseract uses -1 for non-word boxes)
    positive = [c for c in confidences if c > 0]
    avg_confidence = sum(positive) / len(positive) if positive else 0
    return {
        'text': ' '.join(words),
        'average_confidence': avg_confidence,
        'word_count': len(words)
    }
```
Full OCR with JSON Output
```python
import pytesseract
from PIL import Image
import json
import os

def ocr_to_json(image_path):
    """Perform OCR and return results as JSON."""
    filename = os.path.basename(image_path)
    warnings = []
    try:
        img = Image.open(image_path)
        # Get detailed OCR data
        data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
        # Extract text preserving structure
        text = pytesseract.image_to_string(img)
        # Calculate confidence (Tesseract uses -1 for non-word boxes)
        confidences = [c for c in data['conf'] if c > 0]
        avg_conf = sum(confidences) / len(confidences) if confidences else 0
        # Determine confidence level
        if avg_conf >= 80:
            confidence = "high"
        elif avg_conf >= 50:
            confidence = "medium"
        else:
            confidence = "low"
            warnings.append(f"Low OCR confidence: {avg_conf:.1f}%")
        # Count text regions (blocks); block 0 is the page itself
        block_nums = set(data['block_num'])
        text_regions = len([b for b in block_nums if b > 0])
        result = {
            "success": True,
            "filename": filename,
            "extracted_text": text.strip(),
            "confidence": confidence,
            "metadata": {
                "language_detected": "en",
                "text_regions": text_regions,
                "has_tables": False,
                "has_handwriting": False
            },
            "warnings": warnings
        }
    except Exception as e:
        result = {
            "success": False,
            "filename": filename,
            "extracted_text": "",
            "confidence": "low",
            "metadata": {
                "language_detected": "unknown",
                "text_regions": 0,
                "has_tables": False,
                "has_handwriting": False
            },
            "warnings": [f"OCR failed: {str(e)}"]
        }
    return result

# Usage
result = ocr_to_json("document.jpg")
print(json.dumps(result, indent=2))
```
Batch Processing Multiple Images
```python
import json
from pathlib import Path

# Uses ocr_to_json() from the previous example
def process_image_directory(directory_path, output_file):
    """Process all images in a directory and save results."""
    image_extensions = {'.jpg', '.jpeg', '.png', '.webp'}
    results = []
    for file_path in sorted(Path(directory_path).iterdir()):
        if file_path.suffix.lower() in image_extensions:
            result = ocr_to_json(str(file_path))
            results.append(result)
            print(f"Processed: {file_path.name}")
    # Save results
    with open(output_file, 'w') as f:
        json.dump(results, f, indent=2)
    return results
```
Tesseract Configuration Options
Language Selection
```python
# Specify language (default is English)
text = pytesseract.image_to_string(img, lang='eng')

# Multiple languages
text = pytesseract.image_to_string(img, lang='eng+fra+deu')
```
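Requesting a language whose traineddata file is not installed raises an error, so it can help to check availability first. A small sketch using pytesseract's `get_languages()` (available since pytesseract 0.3.7); the `pick_language` helper and its fallback behavior are illustrative assumptions, and the import is deferred so the helper only touches the Tesseract binary when it actually has to query it:

```python
def pick_language(requested, installed=None, fallback='eng'):
    """Return the requested language code if its traineddata is installed.

    When installed is None, query the local Tesseract binary via
    pytesseract.get_languages(); otherwise use the provided list.
    """
    if installed is None:
        import pytesseract
        installed = pytesseract.get_languages(config='')
    return requested if requested in installed else fallback
```

For example, `pick_language('fra', installed=['eng', 'fra'])` returns `'fra'`, while an uninstalled language falls back to `'eng'`.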
Page Segmentation Modes (PSM)
Use the `--psm` option to control how Tesseract segments the image:
```python
# PSM 3: Fully automatic page segmentation (default)
text = pytesseract.image_to_string(img, config='--psm 3')

# PSM 4: Assume single column of text
text = pytesseract.image_to_string(img, config='--psm 4')

# PSM 6: Assume uniform block of text
text = pytesseract.image_to_string(img, config='--psm 6')

# PSM 11: Sparse text - find as much text as possible
text = pytesseract.image_to_string(img, config='--psm 11')
```
Common PSM values:
- `0`: Orientation and script detection (OSD) only
- `3`: Fully automatic page segmentation (default)
- `4`: Single column of text of variable sizes
- `6`: Uniform block of text
- `7`: Single text line
- `11`: Sparse text
- `13`: Raw line
Image Preprocessing
For better OCR accuracy, preprocess images:
```python
import pytesseract
from PIL import Image, ImageFilter, ImageOps

def preprocess_image(image_path):
    """Preprocess image for better OCR results."""
    img = Image.open(image_path)
    # Convert to grayscale
    img = img.convert('L')
    # Increase contrast
    img = ImageOps.autocontrast(img)
    # Apply slight sharpening
    img = img.filter(ImageFilter.SHARPEN)
    return img

# Use preprocessed image for OCR
img = preprocess_image("document.jpg")
text = pytesseract.image_to_string(img)
```
Advanced Preprocessing Strategies
For difficult images (low contrast, faded text, dark backgrounds), try multiple preprocessing approaches:
- **Grayscale + Autocontrast**: basic enhancement for most images
- **Inverted**: use `ImageOps.invert()` for dark backgrounds with light text
- **Scaling**: upscale small images (e.g., 2x) before OCR to improve character recognition
- **Thresholding**: convert to binary using `img.point(lambda p: 255 if p > threshold else 0)` with different threshold values (e.g., 100, 128)
- **Sharpening**: apply `ImageFilter.SHARPEN` to improve edge clarity
Multi-Pass OCR Strategy
For challenging images, a single OCR pass may miss text. Use multiple passes with different configurations:
1. **Try multiple PSM modes**: different page segmentation modes work better for different layouts (e.g., `--psm 6` for blocks, `--psm 4` for columns, `--psm 11` for sparse text)
2. **Try multiple preprocessing variants**: run OCR on several preprocessed versions of the same image
3. **Combine results**: aggregate text from all passes to maximize extraction coverage
```python
import pytesseract
from PIL import Image, ImageFilter, ImageOps

def multi_pass_ocr(image_path):
    """Run OCR with multiple strategies and combine results."""
    img = Image.open(image_path)
    gray = ImageOps.grayscale(img)
    # Generate preprocessing variants
    variants = [
        ImageOps.autocontrast(gray),
        ImageOps.invert(ImageOps.autocontrast(gray)),
        gray.filter(ImageFilter.SHARPEN),
    ]
    # PSM modes to try
    psm_modes = ['--psm 6', '--psm 4', '--psm 11']
    all_text = []
    for variant in variants:
        for psm in psm_modes:
            try:
                text = pytesseract.image_to_string(variant, config=psm)
                if text.strip():
                    all_text.append(text)
            except Exception:
                pass
    # Combine all extracted text
    return "\n".join(all_text)
```
This approach improves extraction for receipts, faded documents, and images with varying quality.
Error Handling
Common Issues and Solutions
Issue: Tesseract not found
```python
import pytesseract

# Verify Tesseract is installed
try:
    pytesseract.get_tesseract_version()
except pytesseract.TesseractNotFoundError:
    print("Tesseract is not installed or not in PATH")
```
Issue: Poor OCR quality
- Preprocess image (grayscale, contrast, sharpen)
- Use appropriate PSM mode for the document type
- Ensure image resolution is sufficient (300+ DPI)
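When the source resolution cannot be controlled, upscaling before OCR often recovers some accuracy. A rough sketch (the 1000 px threshold is an arbitrary assumption, not a Tesseract requirement):

```python
from PIL import Image

def upscale_if_small(img, min_side=1000):
    """Upscale an image whose shorter side is below min_side.

    Uses LANCZOS resampling to keep character edges reasonably sharp.
    """
    w, h = img.size
    if min(w, h) < min_side:
        scale = min_side / min(w, h)
        img = img.resize((int(w * scale), int(h * scale)), Image.LANCZOS)
    return img
```

Pass the returned image to `pytesseract.image_to_string()` as usual; images already above the threshold are returned unchanged.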
Issue: Empty or garbage output
- Check if image contains actual text
- Try different PSM modes
- Verify image is not corrupted
Quality Self-Check
Before returning results, verify:
- Output is valid JSON (use `json.loads()` to validate)
- All required fields are present (`success`, `filename`, `extracted_text`, `confidence`, `metadata`)
- Text preserves logical reading order
- Confidence level reflects actual OCR quality
- Warnings array includes all detected issues
- Special characters are properly escaped in JSON
Limitations
- Tesseract works best with printed text; handwriting recognition is limited
- Accuracy decreases with decorative fonts, artistic text, or extreme stylization
- Mathematical equations and special notation may not extract accurately
- Redacted or watermarked text cannot be recovered
- Severe image degradation (blur, noise, low resolution) reduces accuracy
- Complex multi-column layouts may require custom PSM configuration
Version History
- 1.0.0 (2026-01-13): Initial release with Tesseract/pytesseract OCR
