Overview
The Extraction step in Supplier Data Manager (SDM) reads text and asset fields and uses an AI model to populate other fields with the extracted data. This is useful for converting large, unstructured product descriptions into specific, structured attributes.
For example, the Extraction step can analyze a detailed guitar description and extract values to fill attributes like Wood Composition, Number of Strings, and similar structured fields.
Interface and use cases
The Extraction step interface in Supplier Data Manager is divided into three columns:
- Left column: Displays additional context for the user. This information can be relevant to the extraction process but is not directly used by the AI model. Configured via `additional_information` in admin.
- Center column: Shows the source fields used to feed the AI model.
- Right column: Contains the target attributes that the AI model fills.
Notable UI features:
- Attributes in the right column that have been filled by the AI model are marked with a small bot icon.
- If multiple products have identical source content, extraction is performed only once to avoid redundant processing.

The screenshot above shows the Extraction step interface with the three-column layout: left column displaying additional context, center column showing source fields, and right column showing target attributes filled by the AI model.

The screenshot above shows the Extraction step target attributes panel with bot icons marking AI-filled fields and origin filter options.
The Origin tab filters attributes by how their value was set:
- Previously filled: Fields that already had a value at the start of the step. These can be overridden if `replace_existing` is set to `true`.
- Model: Fields filled by the AI model, extracted from the `sources` fields defined in the model configuration.
- User created: Field was empty and has been completed by the user.
- User modified: Field was filled by the AI model but has been amended by the user.
- Empty: Field has no value.
The importance filter is defined at the project > field level and is also used during the Normalization step. See Requirement Levels in SDM for details.
The `max_requirement_level` parameter is often set to `important` for the Extraction step: if a required piece of information is absent from the description, you generally do not want that absence to block the AI model.
Configuration
Always refer to the API docs as the authoritative source. The information below may lag behind the API docs.
Params configuration
- `fields`, `groups`, `include_fields`, `exclude_fields`, `max_requirement_level`: Standard configuration options.
- `additional_information`: List of attribute names to display in the left column of the Extraction step UI. These fields are shown as context for the operator but are not passed to the AI model.
- `misc`: Legacy field that allowed creation of new attributes on the fly in the UI. No longer used.

  !!! info
      The `misc` option is not used in typical Akeneo use cases because it can lead to unintended catalog structure changes. Use it with caution: creating ad hoc attributes early in a workflow can cause ripple effects downstream.

  - `allow`: Boolean to enable or disable on-the-fly attribute creation.
  - `multiple`: Allow creation of multiple new attributes.
  - `separator`: Defines the separator used when multiple attributes are created.
- `automated_audit_rate`: A float between 0 and 1. Defines the fraction of automated predictions that are randomly sampled for human review, similar to the Classification step. Required when `use_model` is not null.
- `replace_existing`: Boolean. Determines whether the AI model attempts to replace values that were already filled before the Extraction step started.
- `max_options`: Integer between 1 and 1000 (default: 1000). For select fields with many options, if the number of options exceeds this value, the field is treated as a string and a different prompt is used behind the scenes. Use this to optimize processing and prompt generation for large option sets.
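Assembled, a params configuration might resemble the following sketch. Attribute names are illustrative, and the exact schema should be taken from the API docs:

```json
{
  "fields": ["wood_composition", "number_of_strings"],
  "additional_information": ["supplier_name", "product_reference"],
  "max_requirement_level": "important",
  "automated_audit_rate": 0.1,
  "replace_existing": false,
  "max_options": 1000
}
```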
Model configuration
- `sources`: Fields used as input to the AI model. Asset fields can be used as sources with the following limitations:
  - Assets must be of type `media`; `image_url` assets are not supported.
  - Supported image formats: JPEG, JPG, PNG, WebP, TIFF, GIF.
  - Maximum image file size: 20 MB per image; maximum 5 images per product.
  - Maximum PDF file size: 50 MB; maximum 20 pages; maximum 1 PDF per product.
- `use_model`: Specifies which AI model to use. Valid values: `coreai_api`, `demo`, or `null` (no model). When set to a non-null value, `thresholds` and `automated_audit_rate` (in params) are both required.
- `thresholds`: Confidence levels used to classify predictions by field type (`number`, `string`, `select`, `boolean`). Each threshold object has three float fields:
  - `automated`: Predictions with confidence above this value are accepted automatically.
  - `to_check`: Predictions with confidence above this value (but not above `automated`) are flagged for review.
  - `to_complete`: Predictions that do not meet the `to_check` threshold are marked as needing manual completion.

  The AI model outputs a pseudo-confidence score of 1 for every prediction it makes. Because comparisons use strict inequalities:
  - Setting `automated` to `1` means all predictions are classified as `to_check` (since `1 > 1` is false).
  - Setting `automated` to any value strictly less than `1` means all predictions are classified as `automated`.
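The strict-inequality behavior can be sketched in Python. This is a hypothetical helper that mirrors the comparison logic described above, not actual SDM code:

```python
def classify_prediction(confidence, automated, to_check):
    """Classify a prediction using strict-inequality threshold comparisons
    (illustrative sketch of the documented behavior, not SDM source code)."""
    if confidence > automated:
        return "automated"
    if confidence > to_check:
        return "to_check"
    return "to_complete"

# The model always reports a pseudo-confidence of 1:
print(classify_prediction(1.0, automated=1.0, to_check=0.5))   # -> to_check (1 > 1 is false)
print(classify_prediction(1.0, automated=0.99, to_check=0.5))  # -> automated
```

This is why setting `automated` to exactly `1` routes every prediction to review, while any value strictly below `1` automates everything.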
- `misc_source`: Configures a virtual source field built from unmapped supplier fields that may contain useful additional data.
  - `use`: Boolean. Set to `true` to activate misc sources; defaults to `false`.
  - `name`: Name of the block displayed in the center column of the UI and in the data table.
  - `template`: Template string for formatting each misc field entry. Defaults to `{name}: {value}`.
  - `separator`: String used to join multiple misc field entries into a single source value.
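Putting these options together, a model configuration might look like the following sketch. Field values are illustrative; refer to the API docs for the authoritative schema:

```json
{
  "sources": ["description", "packshot"],
  "use_model": "coreai_api",
  "thresholds": {
    "string": {"automated": 0.9, "to_check": 0.5, "to_complete": 0.0},
    "select": {"automated": 0.9, "to_check": 0.5, "to_complete": 0.0}
  },
  "misc_source": {
    "use": true,
    "name": "Unmapped supplier data",
    "template": "{name}: {value}",
    "separator": "\n"
  }
}
```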
Per-field model configuration
Each field in the `fields` list can include the following optional model-specific keys:
- `queries`: List of questions to ask the AI model when extracting the value for this field. Useful for guiding the model toward the correct attribute.
- `hypothesis_template`: Custom template used for extractive QA inference on this field.
- `threshold`: Per-field confidence threshold (overrides the global `thresholds` setting for this field).
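A per-field entry combining these keys might look like this sketch. The attribute name and query text are illustrative, and the exact shape should be checked against the API docs:

```json
{
  "name": "number_of_strings",
  "queries": ["How many strings does this guitar have?"],
  "hypothesis_template": "This guitar has {} strings.",
  "threshold": 0.8
}
```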
Limitations and best practices
- Maximum products per job: 30,000 rows.
- Recommended maximum source attributes for the AI: 50. Using more than 50 attributes as sources may reduce extraction accuracy.
- Identical source content: For products with identical values across all source fields, extraction is performed only once. This optimizes processing time and is why the Extraction step may show fewer rows than other steps in the same job.
- Complex UI for bulk editing: Reading large amounts of data in table format makes bulk editing challenging. Consider breaking large jobs into smaller batches.
- AI confidence tuning: Pay attention to the `thresholds` configuration for each field type to balance automation against accuracy.
- On-the-fly attribute creation (`misc`): Use cautiously. Adding attributes dynamically early in a workflow can create inconsistencies and unintended ripple effects later in the pipeline.
- Large option sets: For select fields with many options, use `max_options` to control the point at which the field is treated as a free-text string rather than a constrained select.
- Overwriting existing data: Consider the implications of `replace_existing` carefully. It is useful for data updates but can overwrite manually curated values.
- Testing at scale: Before running large jobs, test with smaller batches (fewer than 200 products, fewer than 20 source attributes) to verify extraction quality and adjust configuration as needed.