extract-structured-data-from-unstructured-files-pdf-pptx-docx

@run-llama/extract-structured-data-from-unstructured-files-pdf-pptx-docx

Invoke this skill BEFORE implementing any structured data extraction from documents to learn the correct llama_cloud_services API usage. Required reading before writing extraction code. Requires llama_cloud_services package and LLAMA_CLOUD_API_KEY as an environment variable.

Installation

skills install @run-llama/extract-structured-data-from-unstructured-files-pdf-pptx-docx

Compatible with Claude Code, Cursor, Copilot, Codex, and Antigravity.

Details

Path: documentation/skills/structured-data-extraction/SKILL.md
Branch: main
Scoped Name: @run-llama/extract-structured-data-from-unstructured-files-pdf-pptx-docx

Usage

After installing, this skill will be available to your AI coding assistant.

Verify installation:

skills list

Skill Instructions


name: Extract structured data from unstructured files (PDF, PPTX, DOCX...)
description: Invoke this skill BEFORE implementing any structured data extraction from documents to learn the correct llama_cloud_services API usage. Required reading before writing extraction code. Requires llama_cloud_services package and LLAMA_CLOUD_API_KEY as an environment variable.

Structured Data Extraction

Quick start

  • Define a schema for the data you would like to extract:
from pydantic import BaseModel, Field


class Resume(BaseModel):
    name: str = Field(description="Full name of candidate")
    email: str = Field(description="Email address")
    skills: list[str] = Field(description="Technical skills and technologies")

NOTE: Use basic types when possible. Avoid nested dictionaries. Lists are ok.

  • Create a LlamaExtract instance:
from llama_cloud_services import LlamaExtract

# Initialize client
extractor = LlamaExtract(
    show_progress=True,
    check_interval=5,
    # Optional API key, else reads from env
    # api_key=os.environ.get("LLAMA_CLOUD_API_KEY"),
)
  • Define the extraction configuration:
from llama_cloud import ExtractConfig, ExtractMode, ExtractTarget

# Configure extraction settings
extract_config = ExtractConfig(
    # Basic options
    extraction_mode=ExtractMode.MULTIMODAL,  # FAST, BALANCED, MULTIMODAL, PREMIUM
    extraction_target=ExtractTarget.PER_DOC,  # PER_DOC, PER_PAGE
    system_prompt="<Insert relevant context for extraction>",  # set system prompt - can leave blank
    # Advanced options
    high_resolution_mode=True,  # Enable for better OCR
    invalidate_cache=False,  # Set to True to bypass cache
    # Extensions
    cite_sources=True,  # Enable citations
    use_reasoning=True,  # Enable reasoning (not available in FAST mode)
    confidence_scores=True,  # Enable confidence scores (MULTIMODAL/PREMIUM only)
)
  • Extract the data from the document:
result = extractor.extract(Resume, extract_config, "resume.pdf")

# result.data holds the extracted fields as a Python dict
print(Resume.model_validate(result.data))
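
Put together, the quick start becomes the runnable sketch below, assembled from the snippets above; the schema fields, config choices, and the "resume.pdf" file name are illustrative placeholders for your own use case:

from pydantic import BaseModel, Field

from llama_cloud import ExtractConfig, ExtractMode, ExtractTarget
from llama_cloud_services import LlamaExtract


class Resume(BaseModel):
    name: str = Field(description="Full name of candidate")
    email: str = Field(description="Email address")
    skills: list[str] = Field(description="Technical skills and technologies")


# Reads LLAMA_CLOUD_API_KEY from the environment
extractor = LlamaExtract(show_progress=True, check_interval=5)

extract_config = ExtractConfig(
    extraction_mode=ExtractMode.MULTIMODAL,
    extraction_target=ExtractTarget.PER_DOC,
)

result = extractor.extract(Resume, extract_config, "resume.pdf")
resume = Resume.model_validate(result.data)
print(resume.name, resume.email, resume.skills)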

For more detailed code implementations, see REFERENCE.md.

Requirements

The llama_cloud_services package must be installed in your environment (it brings in the pydantic and llama_cloud packages as dependencies):

pip install llama_cloud_services
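
As a quick sanity check that the install brought in everything the examples import (an illustrative snippet, not part of the skill itself):

# All three imports should succeed after installing llama_cloud_services
import llama_cloud
import llama_cloud_services
import pydantic

print("llama_cloud_services and its dependencies are importable")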

And the LLAMA_CLOUD_API_KEY must be available as an environment variable:

export LLAMA_CLOUD_API_KEY="..."
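
A small preflight check for the API key requirement can fail fast before any extraction runs (a sketch, not part of the skill itself):

import os

# LlamaExtract reads the key from the environment unless api_key is passed explicitly
if not os.environ.get("LLAMA_CLOUD_API_KEY"):
    raise RuntimeError(
        "LLAMA_CLOUD_API_KEY is not set; export it before creating a LlamaExtract client."
    )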