Schema Layer

The Schema layer is the foundational element of the Denzing Metadata framework. It contains the physical schema-level information for your data sources, encompassing column-level details, primary keys, and brief descriptions.

Purpose

To enable LLMs to accurately comprehend the physical structure of input data from sources like database tables and CSV files.
To establish a clear mapping from raw data fields to logical representations that the model can process.

File Format and Registry

All schema definitions must be maintained in a valid YAML file format. For a single data source, multiple YAML files can be created, with each corresponding to a specific table or view.

Allowed Top-Level Keys

When defining a schema file, only the following lowercase, top-level keys are permitted:

subject_area
table_info
columns

Components of the Schema Layer

`subject_area` (Required)

This key identifies the specific subdomain or business area to which the schema belongs (e.g., 'sales' or 'inventory').

subject_area: sales

`table_info` (Required)

This key defines one or more tables and their associations based on foreign key relationships. It must be a list containing definitions for each table, and each table within table_info must have a unique name.

table (Required): The actual name of the table.
grain (Optional): Defines the data granularity. The hierarchy mentioned must correspond to an existing definition in a hierarchy.yml file.
joins (Optional): A list of join conditions that define relationships between this table and other tables.

Example: table_info with grain and joins
table_info:
  grain:
    merchandise_hierarchy: item
    item_hierarchy: style
  table: sales_ild_f
  joins:
    - join: product_d
      on: product_d.item_id sales_ild_f.item_id
    - join: calendar_d
      on: calendar_d.day_id sales_ild_f.day_id
    - join: org_d
      on: org_d.organization_id sales_ild_f.organization_id

`columns` (Optional)

This key is a collection of database columns that serve as the building blocks for defining metrics and attributes. Each column is uniquely identified by an identity name.

name (Required): The human-readable and searchable display name for the column.
type (Required): The data type of the column (e.g., string, number, varchar).
column (Required): The actual column name as it appears in the database table.
desc (Required): A brief explanation of the column's purpose or content.
primary_key (Optional): A boolean flag (true) indicating the column is a primary key.
foreign_key (Optional): A boolean flag (true) indicating the column is a foreign key.
fetch (Optional): A boolean flag (true) indicating to cache the unique values of columns.

Example: Column Definitions
columns:
  order_id:
    name: Order ID
    type: number
    column: order_id
    primary_key: true
    desc: Represents the sales order ID.
    fetch: false
  product_name:
    name: Product Name
    table: product_t
    type: varchar
    column: pdt_name
    desc: Represents the product name.
    fetch: true

YAML Formatting and Validation

Indentation: Correct indentation is paramount for valid YAML, as it ensures the hierarchical structure is accurately defined and parsed.
Key Validation: Every key defined within the column definitions (like name, type, desc) must have a non-empty value if it is present in the schema.

Best Practices for Schema Design

Adhering to these best practices is crucial for ensuring robust, consistent, and usable Denzing metadata.

Clear Column Names: Use descriptive and unambiguous names for all columns.
- Example: Use customer_name instead of cust_nm; order_id instead of id.
Consistent Naming Conventions: Employ a uniform naming style across your schema. For multi-word names, separate words using an underscore (_).
- Example: Use customer_name instead of customer name.
Appropriate Data Types: Ensure that the correct data type is assigned to each field. Review all records to identify the most suitable data type for every column.
Primary & Foreign Keys: Explicitly define primary and foreign keys within your table's Data Definition Language (DDL) to establish clear relationships between tables.
Consistent Column Types: Avoid mixing different data types (e.g., strings and integers) within the same column. Columns should maintain consistent data types across all records.

Schema Validation Scenarios

This section provides practical YAML examples for validating schema definitions, illustrating adherence to the framework rules.

1. Valid Schema Use Cases

Use Case 1: Complete and Correctly Structured Schema

Description: Validates a schema with a non-empty subject_area, a list for table_info, and a dictionary of valid columns.
Expected Outcome: Validation passes.

subject_area: sales
table_info:
  - table: sales_transactions
    joins:
      - join: products
        on: sales_transactions.product_id  products.product_id
  - table: customers
    joins:
      - join: sales_transactions
        on: customers.customer_id = sales_transactions.customer_id
columns:
  transaction_id:
    name: Transaction ID
    type: string
    column: transaction_id
    desc: Unique identifier for each sales transaction
    primary_key: true
  sale_date:
    name: Sale Date
    type: date
    column: sale_date
    desc: Date of the sales transaction

2. Missing Required Keys Use Cases

Use Case 2: Missing subject_area Key

Description: Schema is missing the subject_area key.
Expected Outcome: Validation fails.

# subject_area: sales (MISSING)
table_info:
  - table: sales_transactions
    joins: []
columns:
  transaction_id:
    name: Transaction ID
    type: string

3. Invalid Format Use Cases

Use Case 5: table_info is Not a List
- Description: table_info is defined as a string or dictionary instead of a list.
- Expected Outcome: Validation fails.
```
subject_area: sales
table_info: "sales_transactions" # Invalid: Should be a list
columns:
  transaction_id:
    name: Transaction ID
```

4. Column Structure Use Cases

Use Case 13: Column Entry Missing Required Keys

Description: A column entry is missing one or more of the required keys (name, type, column, desc).
Expected Outcome: Validation fails.

subject_area: sales
table_info:
  - table: sales_transactions
    joins: []
columns:
  transaction_id:
    # name: Transaction ID (MISSING)
    type: string
    column: transaction_id
    desc: Unique identifier for each sales transaction

Use Case 14: Column Entry Has Empty Values for Required Keys

Description: Required keys within a column entry have empty values.
Expected Outcome: Validation fails.

subject_area: sales
table_info:
  - table: sales_transactions
    joins: []
columns:
  transaction_id:
    name: "" # Invalid: Empty value for a required key
    type: string

5. Duplicate Column IDs Use Cases

Use Case 15: Duplicate Column IDs within the Same Schema

Description: Two or more column entries have the same identity name within the same schema file.
Expected Outcome: Validation fails.

subject_area: sales
table_info:
  - table: sales_transactions
    joins: []
columns:
  transaction_id:
    name: Transaction ID
    type: string
  transaction_id: # Invalid: Duplicate column ID
    name: Another Transaction ID

Purpose​

File Format and Registry​

Allowed Top-Level Keys​

Components of the Schema Layer​

subject_area (Required)​

table_info (Required)​

columns (Optional)​

YAML Formatting and Validation​

Best Practices for Schema Design​

Schema Validation Scenarios​

1. Valid Schema Use Cases​

2. Missing Required Keys Use Cases​

3. Invalid Format Use Cases​

4. Column Structure Use Cases​

5. Duplicate Column IDs Use Cases​