Extraction

Summary

Overview

The Extraction step reads text fields, extracts information from them, and populates other fields with the extracted data. It is particularly useful for converting large, unstructured descriptions into specific, structured attributes.

For example, it can analyze a detailed description of a guitar and extract information to fill attributes like Wood Composition, Number of Strings, etc.

Interface and Use Cases

See our dedicated guide for full details.

The user interface for the Extraction step is divided into three main sections:

  1. Left Column: Displays additional information for the user. This information can be relevant to the extraction process but is not directly used by the AI model (configured in the admin).
  2. Center Column: Shows the fields used for extractions (source fields).
  3. Right Column: Contains the attributes that need to be filled (target fields).

Notable UI features:

  • Attributes in the right column that have been filled by AI are marked with a small bot emoji.
  • If multiple products have the same source content, the extraction will only be performed once to avoid redundant processing.

  • The Origin tab allows filtering the attributes using the following options:
    • Previously filled: Field was already filled at the start of the step. It can be overridden if replace_existing is set to true in the params.
    • Model: Field was filled by the AI model, extracted from the source field(s) defined in the model configuration.
    • User created: Field was empty and has been completed by the user.
    • User modified: Field was filled by the AI but has been amended by the user.
    • Empty: Field has no value.
  • The importance filter is defined at the project > field level and is also used during the Normalization step. For more details, see Requirement Levels in SDM.
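
The Origin options above can be read as a small decision rule. The sketch below is illustrative only: classify_origin and its three arguments are hypothetical names, and the precedence between the labels (for example, a previously filled field that the model overrides counting as "Model") is an assumption, not documented behavior.

```python
from typing import Optional


def classify_origin(
    initial: Optional[str],      # value at the start of the step
    model_value: Optional[str],  # value written by the AI model, if any
    final: Optional[str],        # value currently shown in the right column
) -> str:
    """Return an Origin label for one field (hypothetical helper)."""
    if final is None:
        return "Empty"
    if model_value is not None:
        # The model wrote a value; check whether the user amended it afterwards.
        return "Model" if final == model_value else "User modified"
    if initial is not None:
        return "Previously filled"
    # Nothing at the start, nothing from the model: the user typed it in.
    return "User created"
```

For instance, a field left empty by both the supplier data and the model, then typed in by the user, would come out as "User created" under these assumptions.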

The max_requirement_level is often set to "important" for the Extraction step: if a piece of information is required but is not in the description, we do not want it to influence the model too much.

Configuration

Always refer to the API docs. They are always up to date and should remain the source of truth, so it is recommended to double-check the information presented here against them.

Params Configuration

  • fields, groups, include_fields, exclude_fields, max_requirement_level: Standard configuration options.
  • additional_informations: List of attribute names to be displayed in the left column of the form in the UI.
  • misc: Legacy field that allows creation of new attributes on the fly in the UI. Not used anymore. It was not very common, especially in the Akeneo use case, because we don't modify the PIM structure.
    • allow: Boolean to enable/disable this feature.
    • multiple: Allow creation of multiple new attributes.
    • separator: Defines the separator for multiple attributes.
  • automated_audit_rate: A number between 0 and 1, similar to the Classification step.
  • replace_existing: Determines if the model should attempt to replace already existing values that are supposed to be filled during the extraction step.
  • max_options: An integer between 1 and 1000. For select fields with numerous options, if the number of options exceeds this value, the field is treated as a string, using a different prompt behind the scenes.
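
As an illustration of the ranges described above, here is a minimal sketch of a params block with a few sanity checks. The attribute names, the example values, and the helpers validate_params and treated_as_string are hypothetical; the exact payload shape should be checked against the API docs.

```python
from typing import List

# Hypothetical params payload; keys come from the list above, values are examples.
params = {
    "additional_informations": ["brand", "supplier_reference"],  # assumed attribute names
    "max_requirement_level": "important",
    "automated_audit_rate": 0.1,
    "replace_existing": False,
    "max_options": 200,
}


def validate_params(params: dict) -> List[str]:
    """Check the documented ranges; returns a list of error messages."""
    errors = []
    rate = params.get("automated_audit_rate")
    if rate is not None and not 0 <= rate <= 1:
        errors.append("automated_audit_rate must be between 0 and 1")
    max_options = params.get("max_options")
    if max_options is not None:
        is_int = isinstance(max_options, int) and not isinstance(max_options, bool)
        if not is_int or not 1 <= max_options <= 1000:
            errors.append("max_options must be an integer between 1 and 1000")
    if not isinstance(params.get("replace_existing", False), bool):
        errors.append("replace_existing must be a boolean")
    return errors


def treated_as_string(option_count: int, max_options: int) -> bool:
    """A select field is handled as a string when it exceeds max_options."""
    return option_count > max_options
```

With max_options set to 200, a select field with 250 options would be handled as a string, while one with exactly 200 options would stay a select.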

Model Configuration

  • sources: Fields used to feed the AI model.
  • use_model: Specifies the AI model to use (coreai_api, demo, or null).
  • thresholds: Confidence levels used to assign scores by field type (number, string, select, boolean).
    • to_check: float
    • automated: float
    • to_complete: float
  • misc_source: Indicates if there are unmapped supplier fields that may contain additional info that could be used by the model.
    • use: Boolean to activate misc sources (true) or not (false, default)
    • name: Name of the block to be displayed in the center column of the UI and the data table.
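
To make the shape of the model configuration more concrete, here is a hedged sketch. The nesting of thresholds by field type, the example values, and the status_for helper are all assumptions: in particular, the rule "pick the strictest threshold that the confidence reaches" is only one plausible reading of how thresholds are applied, so confirm the real behavior in the API docs.

```python
from typing import Optional

# Hypothetical model configuration; keys come from the list above, values are examples.
model = {
    "sources": ["description"],
    "use_model": "coreai_api",
    "thresholds": {
        "string": {"to_complete": 0.2, "to_check": 0.5, "automated": 0.9},
        "select": {"to_complete": 0.3, "to_check": 0.6, "automated": 0.95},
    },
    "misc_source": {"use": False, "name": "Supplier data"},
}


def status_for(field_type: str, confidence: float, thresholds: dict) -> Optional[str]:
    """Assumed rule: return the strictest threshold name the confidence reaches."""
    levels = thresholds[field_type]
    # Walk the thresholds from highest to lowest minimum confidence.
    for name, minimum in sorted(levels.items(), key=lambda kv: kv[1], reverse=True):
        if confidence >= minimum:
            return name
    return None  # below every threshold
```

Raising the automated threshold for a field type sends more values to manual review; lowering it automates more at the cost of accuracy.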

Limitations & Best Practices

  1. Complex UI: The interface can be complicated, especially for bulk editing. The complexity of reading large amounts of data in a table format makes implementing bulk edit features challenging.
  2. Performance considerations: For products with identical source content, extraction is performed only once to optimize processing time and resource usage.
  3. AI Confidence: Pay attention to the confidence levels (thresholds) set for different field types to balance between automation and accuracy.
  4. New attribute creation: Use the misc configuration cautiously. It allows new attributes to be created on the fly, but it can lead to inconsistencies if not managed properly, and adding ad hoc attributes early in the customer's workflow can have ripple effects across the rest of it.
  5. Large option sets: For select fields with many options, consider using the max_options parameter to optimize processing and prompt generation.
  6. Existing data handling: Carefully consider the implications of using replace_existing. It might be beneficial for data updates but could potentially overwrite manually curated information.
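
The "extract once per identical source content" behavior mentioned in point 2 can be pictured as a cache keyed on the source values. This is a minimal sketch of that idea, not the actual implementation: source_key, extract_all, and the extract callback are hypothetical names.

```python
import hashlib
import json
from typing import Callable, Dict, List


def source_key(product: dict, sources: List[str]) -> str:
    """Build a stable key from the values of the source fields only."""
    payload = json.dumps([product.get(name) for name in sources], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def extract_all(
    products: List[dict],
    sources: List[str],
    extract: Callable[[dict], dict],  # stand-in for the call to the AI model
) -> List[dict]:
    """Run extract once per distinct source content and reuse the result."""
    cache: Dict[str, dict] = {}
    results = []
    for product in products:
        key = source_key(product, sources)
        if key not in cache:
            cache[key] = extract({name: product.get(name) for name in sources})
        results.append(cache[key])
    return results
```

Two products that share the same description would trigger a single model call here, and both would receive the same extracted values.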