Inputs

Overview

Every Job in the SDM system starts with an input file.

By default, when using a file type that supports multiple sheets (like Excel), only the first sheet will be processed. However, this behavior can be customized using Custom Input Formats in the SDM admin.

Custom Input Formats

SDM admin > Home > Files > Input formats

Custom Input Formats are predefined templates that specify how multi-sheet files should be processed.

They allow for more complex data imports by defining:

  1. Which sheets should be present in the file
  2. What columns each sheet should contain
  3. How different sheets could/should be joined together
  4. Which columns are required and which are optional
  5. Any unique identifiers across the dataset

Custom Input Formats serve two primary functions:

  1. Matching: They define criteria to match against uploaded files.
  2. Transformation: They specify how data should be extracted and combined from matching files into the pandas DataFrame that will be used by the job.

Data Transformation

Once a file matches a Custom Input Format:

  1. The system extracts data based on the format's specifications:
    • It includes all specified columns.
    • If a null entry is present in the columns list, all unnamed columns are also included.
  2. Sheets are joined together using pandas join operations as specified in the format.
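
The transformation steps above can be sketched with pandas. This is an illustration, not SDM's actual internals: the helper name, the example sheets, and the in-memory representation are all assumptions; only the null-entry rule and the pandas join come from the description above.

```python
import pandas as pd

def extract_sheet(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Keep the named columns; a None entry means 'also include all unnamed columns'."""
    named = [c for c in columns if c is not None]
    if None in columns:
        # Append every column not explicitly named, preserving sheet order.
        extra = [c for c in df.columns if c not in named]
        return df[named + extra]
    return df[named]

# Two example "sheets" standing in for a multi-sheet Excel file.
commerce = pd.DataFrame({"REFCIALE": ["A1", "A2"], "LIBELLE240": ["x", "y"], "EXTRA": [1, 2]})
logistique = pd.DataFrame({"REFCIALE": ["A1", "A2"], "MARQUE": ["m1", "m2"]})

left = extract_sheet(commerce, ["REFCIALE", "LIBELLE240", None])   # null entry: EXTRA kept
right = extract_sheet(logistique, ["REFCIALE", "MARQUE"])

# Sheets are joined with a pandas merge, as specified by the format's "join" block.
result = left.merge(right, on=["REFCIALE"], how="left")
print(list(result.columns))  # ['REFCIALE', 'LIBELLE240', 'EXTRA', 'MARQUE']
```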

Custom Input Formats can be defined and managed in the admin panel.

Job creation

1. Using the interface

When uploading a file, the system handles Custom Input Formats in the following way:

Default Behavior: The uploaded file is automatically compared against all of the organization's Custom Input Formats. The first format whose structure matches the file is used. If no format matches, or no Custom Input Formats exist in the first place, the default reader is used for that file.
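
The first-match-wins selection can be sketched as follows. This is a simplified assumption of the matching logic (real matching also checks the columns each sheet should contain); the helper names and format structures are illustrative only.

```python
def sheet_matches(sheet_names, fmt):
    """A format matches when every required sheet is present in the file."""
    required = [s["name"] for s in fmt["sheets"] if s.get("required", True)]
    return all(name in sheet_names for name in required)

def pick_format(sheet_names, formats):
    """Return the first matching Custom Input Format, or None for the default reader."""
    for fmt in formats:
        if sheet_matches(sheet_names, fmt):
            return fmt
    return None  # fall back to the default reader (first sheet only)

formats = [
    {"sheets": [{"name": "01_COMMERCE"}, {"name": "02_LOGISTIQUE"}]},
    {"sheets": [{"name": "Products"}]},
]
pick_format(["Products", "Ignored"], formats)  # picks the second format
```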

Advanced Options: Users can fine-tune this behavior using the advanced options in the job creation form.

  • No Input Format: This option is a bit of a misnomer: it keeps the default behavior, attempting to match the file against existing Custom Input Formats.
  • Specific Format Selection: Users can manually select a specific Custom Input Format to use. This is particularly useful when a file could potentially match multiple formats, ensuring predictable processing.

2. Using FTP

Enable SDM FTP for Inputs

FTP access needs to be enabled at two different levels:

  • Organization level: check the Ftp allowed checkbox.

  • User level: check the Ftp allowed checkbox.

Connect to SDM FTP

  • Connect to ftp.sdm.akeneo.cloud (ftp + environment URL) using the username and password that you use to connect to the app.
  • Once connected to the FTP you should see two directories: inputs and outputs.
  • Navigate to the inputs directory. You should see all the projects in your organization listed as directories. Simply drop an input file (CSV or XLS) into a project's directory to trigger a job creation for that project.

FTP Jobs do not support media, and will only accept CSV or XLS files.
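
The upload steps above can be sketched with Python's standard ftplib. The host comes from this section; the credentials, project name, and use of TLS are placeholders and assumptions, not documented SDM requirements.

```python
from ftplib import FTP_TLS
from pathlib import Path

def input_path(project: str, filename: str) -> str:
    """Remote path inside the inputs directory for a given project."""
    return f"inputs/{project}/{filename}"

def upload_input(host: str, user: str, password: str, project: str, local_file: str) -> None:
    """Drop a CSV/XLS file into a project's inputs directory to trigger a job."""
    filename = Path(local_file).name
    with FTP_TLS(host) as ftp:  # plain FTP() may also work if TLS is not required
        ftp.login(user, password)
        with open(local_file, "rb") as fh:
            ftp.storbinary(f"STOR {input_path(project, filename)}", fh)

# Example (placeholder credentials):
# upload_input("ftp.sdm.akeneo.cloud", "me@example.com", "secret",
#              "my-project", "products.csv")
```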

 

Retrieving the outputs of a job through FTP

Similarly, the outputs directory lists all the existing projects in your organization. Navigate to one of them to download the output files created for that project's jobs.

 

Input Upload Process

The following diagram illustrates the process of uploading a file and matching it with Custom Input Formats:

Configuration

The configuration outlined in this document aims to provide a more digestible version of what is presented in the API docs. However, the API docs are always up to date and remain the source of truth; it is recommended to double-check the information presented here against them.

 

Configuration Structure

The configuration object has the following structure:

{
  "sheets": [ ... ],
  "unique_columns": [ ... ]
}

Configuration Options

Sheets (required)

An array of objects, each representing a sheet in the Excel file. Each sheet object has the following properties:

  • name (string, required): The name of the sheet in the Excel file.
  • columns (array of objects, required): Specifies the columns to be processed in the sheet. Each column object can have one of two structures:
    • Standard column:
      • name (string): The name of the column in the Excel file.
    • Name-value pair column:
      • name_from (string): The name of the column containing the attribute names.
      • value_from (string): The name of the column containing the attribute values.

        You cannot use name and name_from/value_from in the same column object.

         
  • skip_rows (integer, optional): The number of rows to skip at the beginning of the sheet. Useful for files with header information at the top.
  • drop_duplicated_columns (boolean, optional, default: false): When set to true, it removes duplicate columns across multiple sheets. This is useful when the same information appears in multiple sheets.
  • join (object, optional): Specifies how this sheet should be joined with the previous sheet. Properties include:
    • on (array of strings): The column(s) to join on.
    • how (string): The type of join to perform (e.g., "left", "right", "inner", "outer").

      For more information on the join options, please refer to the Pandas docs as this is what is used behind the scenes.

       
  • required (boolean, optional, default: true): Indicates whether this sheet must be present in the Excel file for the format to match.
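
Since the join block maps onto a pandas merge behind the scenes, the difference between the "how" values can be shown directly. The example data and variable names below are assumptions for illustration.

```python
import pandas as pd

a = pd.DataFrame({"ProductID": [1, 2, 3], "Name": ["a", "b", "c"]})
b = pd.DataFrame({"ProductID": [2, 3, 4], "Stock": [10, 20, 30]})

# "left": keep every row of the first sheet; unmatched rows get NaN.
left = a.merge(b, on=["ProductID"], how="left")
# "inner": keep only ProductIDs present in both sheets.
inner = a.merge(b, on=["ProductID"], how="inner")
# "outer": keep every ProductID from either sheet.
outer = a.merge(b, on=["ProductID"], how="outer")

print(len(left), len(inner), len(outer))  # 3 2 4
```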

Unique_columns (optional)

An array of strings representing the column names that serve as unique identifiers across all sheets. These columns are typically used for joining data between sheets.

Configuration Example

Here's an example configuration demonstrating various features:

{
  "params": {
    "sheets": [
      {
        "name": "Products",
        "columns": [
          { "name": "ProductID" },
          { "name": "ProductName" },
          { "name": "Category" }
        ],
        "skip_rows": 2,
        "required": true
      },
      {
        "name": "Attributes",
        "columns": [
          { "name": "ProductID" },
          { "name_from": "AttributeName_3", "value_from": "AttributeValue_3" }
        ],
        "drop_duplicated_columns": true,
        "join": {
          "on": ["ProductID"],
          "how": "left"
        }
      }
    ],
    "unique_columns": ["ProductID"]
  }
}
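
One plausible interpretation of the name-value pair column in the example above is a pivot from long to wide form before joining. This is a hedged sketch, not SDM's actual implementation; it assumes each (ProductID, attribute) pair appears at most once in the sheet.

```python
import pandas as pd

# Long-form sheet: one row per (product, attribute) pair.
attributes = pd.DataFrame({
    "ProductID": [1, 1, 2],
    "AttributeName_3": ["Color", "Size", "Color"],
    "AttributeValue_3": ["Red", "M", "Blue"],
})

# Widen: attribute names become columns, attribute values become cell values.
wide = (
    attributes
    .pivot(index="ProductID", columns="AttributeName_3", values="AttributeValue_3")
    .reset_index()
)
wide.columns.name = None
print(list(wide.columns))  # ['ProductID', 'Color', 'Size']
```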

Use case

Consider a Custom Input Format that specifies:

{
  "sheets": [
    {
      "name": "01_COMMERCE",
      "columns": ["REFCIALE", "LIBELLE240", "LIBELLE80", "GTIN13", null],
      "required": true
    },
    {
      "name": "02_LOGISTIQUE",
      "columns": ["MARQUE", "REFCIALE", null],
      "required": true,
      "join": {
        "on": ["REFCIALE"],
        "how": "left"
      }
    }
  ],
  "unique_columns": ["REFCIALE"]
}

This format will:

  1. Match files containing sheets named "01_COMMERCE" and "02_LOGISTIQUE".
  2. Ensure specified columns exist in each sheet.
  3. Include all columns from both sheets in the final dataset (due to the null in columns).
  4. Join the "02_LOGISTIQUE" sheet to "01_COMMERCE" using a left join on the "REFCIALE" column.

Implications for what’s next in the job

  • The resulting dataframe used in jobs will contain only the data specified by the Custom Input Format.
  • By removing null entries from the columns list, you can restrict the dataframe to only the specified columns.