extract-structured-data-from-unstructured-files-pdf-pptx-docx

@run-llama/extract-structured-data-from-unstructured-files-pdf-pptx-docx

Invoke this skill BEFORE implementing any structured data extraction from documents to learn the correct llama_cloud_services API usage. Required reading before writing extraction code. Requires llama_cloud_services package and LLAMA_CLOUD_API_KEY as an environment variable.

Installation

skills install @run-llama/extract-structured-data-from-unstructured-files-pdf-pptx-docx

Compatible with Claude Code, Cursor, Copilot, Codex, and Antigravity.

Details

Path: documentation/skills/structured-data-extraction/SKILL.md
Branch: main
Scoped Name: @run-llama/extract-structured-data-from-unstructured-files-pdf-pptx-docx

Usage

After installing, this skill will be available to your AI coding assistant.

Verify installation:

skills list

Skill Instructions


name: Extract structured data from unstructured files (PDF, PPTX, DOCX...)
description: Invoke this skill BEFORE implementing any structured data extraction from documents to learn the correct llama_cloud_services API usage. Required reading before writing extraction code. Requires llama_cloud_services package and LLAMA_CLOUD_API_KEY as an environment variable.

Structured Data Extraction

Quick start

  • Define a schema for the data you would like to extract:
from pydantic import BaseModel, Field


class Resume(BaseModel):
    name: str = Field(description="Full name of candidate")
    email: str = Field(description="Email address")
    skills: list[str] = Field(description="Technical skills and technologies")

NOTE: Use basic types when possible. Avoid nested dictionaries. Lists are ok.

  • Create a LlamaExtract instance:
from llama_cloud_services import LlamaExtract

# Initialize client
extractor = LlamaExtract(
    show_progress=True,
    check_interval=5,
    # Optional API key, else reads from env
    # api_key=os.environ.get("LLAMA_CLOUD_API_KEY"),
)
  • Define the extraction configuration:
from llama_cloud import ExtractConfig, ExtractMode, ExtractTarget

# Configure extraction settings
extract_config = ExtractConfig(
    # Basic options
    extraction_mode=ExtractMode.MULTIMODAL,  # FAST, BALANCED, MULTIMODAL, PREMIUM
    extraction_target=ExtractTarget.PER_DOC,  # PER_DOC, PER_PAGE
    system_prompt="<Insert relevant context for extraction>",  # set system prompt - can leave blank
    # Advanced options
    high_resolution_mode=True,  # Enable for better OCR
    invalidate_cache=False,  # Set to True to bypass cache
    # Extensions
    cite_sources=True,  # Enable citations
    use_reasoning=True,  # Enable reasoning (not available in FAST mode)
    confidence_scores=True,  # Enable confidence scores (MULTIMODAL/PREMIUM only)
)
  • Extract the data from the document:
result = extractor.extract(Resume, extract_config, "resume.pdf")

# result.data holds the extracted fields as a Python dict
print(Resume.model_validate(result.data))
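
Put together, the quick start becomes the runnable sketch below, assembled from the snippets above; the schema fields, config choices, and the "resume.pdf" file name are illustrative placeholders for your own use case:

from pydantic import BaseModel, Field

from llama_cloud import ExtractConfig, ExtractMode, ExtractTarget
from llama_cloud_services import LlamaExtract


class Resume(BaseModel):
    name: str = Field(description="Full name of candidate")
    email: str = Field(description="Email address")
    skills: list[str] = Field(description="Technical skills and technologies")


# Reads LLAMA_CLOUD_API_KEY from the environment
extractor = LlamaExtract(show_progress=True, check_interval=5)

extract_config = ExtractConfig(
    extraction_mode=ExtractMode.MULTIMODAL,
    extraction_target=ExtractTarget.PER_DOC,
)

result = extractor.extract(Resume, extract_config, "resume.pdf")
resume = Resume.model_validate(result.data)
print(resume.name, resume.email, resume.skills)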

For more detailed code implementations, see REFERENCE.md.

Requirements

The llama_cloud_services package must be installed in your environment (it brings in the pydantic and llama_cloud packages as dependencies):

pip install llama_cloud_services
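
As a quick sanity check that the install brought in everything the examples import (an illustrative snippet, not part of the skill itself):

# All three imports should succeed after installing llama_cloud_services
import llama_cloud
import llama_cloud_services
import pydantic

print("llama_cloud_services and its dependencies are importable")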

And the LLAMA_CLOUD_API_KEY must be available as an environment variable:

export LLAMA_CLOUD_API_KEY="..."
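
A small preflight check for the API key requirement can fail fast before any extraction runs (a sketch, not part of the skill itself):

import os

# LlamaExtract reads the key from the environment unless api_key is passed explicitly
if not os.environ.get("LLAMA_CLOUD_API_KEY"):
    raise RuntimeError(
        "LLAMA_CLOUD_API_KEY is not set; export it before creating a LlamaExtract client."
    )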