Skip to main content

Schema Layer

The Schema layer is the foundational element of the Denzing Metadata framework. It contains the physical schema-level information for your data sources, encompassing column-level details, primary keys, and brief descriptions.

Purpose

  • To enable LLMs to accurately comprehend the physical structure of input data from sources like database tables and CSV files.
  • To establish a clear mapping from raw data fields to logical representations that the model can process.

File Format and Registry

All schema definitions must be maintained in a valid YAML file format. For a single data source, multiple YAML files can be created, with each corresponding to a specific table or view.

Allowed Top-Level Keys

When defining a schema file, only the following lowercase, top-level keys are permitted:

  • subject_area
  • table_info
  • columns

Components of the Schema Layer

subject_area (Required)

This key identifies the specific subdomain or business area to which the schema belongs (e.g., 'sales' or 'inventory').

subject_area: sales

table_info (Required)

This key defines one or more tables and their associations based on foreign key relationships. It must be a list containing definitions for each table, and each table within table_info must have a unique name.

  • table (Required): The actual name of the table.
  • grain (Optional): Defines the data granularity. The hierarchy mentioned must correspond to an existing definition in a hierarchy.yml file.
  • joins (Optional): A list of join conditions that define relationships between this table and other tables.
Example: table_info with grain and joins
table_info:
grain:
merchandise_hierarchy: item
item_hierarchy: style
table: sales_ild_f
joins:
- join: product_d
on: product_d.item_id sales_ild_f.item_id
- join: calendar_d
on: calendar_d.day_id sales_ild_f.day_id
- join: org_d
on: org_d.organization_id sales_ild_f.organization_id

columns (Optional)

This key is a collection of database columns that serve as the building blocks for defining metrics and attributes. Each column is uniquely identified by an identity name.

  • name (Required): The human-readable and searchable display name for the column.
  • type (Required): The data type of the column (e.g., string, number, varchar).
  • column (Required): The actual column name as it appears in the database table.
  • desc (Required): A brief explanation of the column's purpose or content.
  • primary_key (Optional): A boolean flag (true) indicating the column is a primary key.
  • foreign_key (Optional): A boolean flag (true) indicating the column is a foreign key.
  • fetch (Optional): A boolean flag (true) indicating to cache the unique values of columns.
Example: Column Definitions
columns:
order_id:
name: Order ID
type: number
column: order_id
primary_key: true
desc: Represents the sales order ID.
fetch: false
product_name:
name: Product Name
table: product_t
type: varchar
column: pdt_name
desc: Represents the product name.
fetch: true

YAML Formatting and Validation

  • Indentation: Correct indentation is paramount for valid YAML, as it ensures the hierarchical structure is accurately defined and parsed.
  • Key Validation: Every key defined within the column definitions (like name, type, desc) must have a non-empty value if it is present in the schema.

Best Practices for Schema Design

Adhering to these best practices is crucial for ensuring robust, consistent, and usable Denzing metadata.

  • Clear Column Names: Use descriptive and unambiguous names for all columns.
    • Example: Use customer_name instead of cust_nm; order_id instead of id.
  • Consistent Naming Conventions: Employ a uniform naming style across your schema. For multi-word names, separate words using an underscore (_).
    • Example: Use customer_name instead of customer name.
  • Appropriate Data Types: Ensure that the correct data type is assigned to each field. Review all records to identify the most suitable data type for every column.
  • Primary & Foreign Keys: Explicitly define primary and foreign keys within your table's Data Definition Language (DDL) to establish clear relationships between tables.
  • Consistent Column Types: Avoid mixing different data types (e.g., strings and integers) within the same column. Columns should maintain consistent data types across all records.

Schema Validation Scenarios

This section provides practical YAML examples for validating schema definitions, illustrating adherence to the framework rules.

1. Valid Schema Use Cases

  • Use Case 1: Complete and Correctly Structured Schema

    • Description: Validates a schema with a non-empty subject_area, a list for table_info, and a dictionary of valid columns.
    • Expected Outcome: Validation passes.
    subject_area: sales
    table_info:
    - table: sales_transactions
    joins:
    - join: products
    on: sales_transactions.product_id products.product_id
    - table: customers
    joins:
    - join: sales_transactions
    on: customers.customer_id = sales_transactions.customer_id
    columns:
    transaction_id:
    name: Transaction ID
    type: string
    column: transaction_id
    desc: Unique identifier for each sales transaction
    primary_key: true
    sale_date:
    name: Sale Date
    type: date
    column: sale_date
    desc: Date of the sales transaction

2. Missing Required Keys Use Cases

  • Use Case 2: Missing subject_area Key

    • Description: Schema is missing the subject_area key.
    • Expected Outcome: Validation fails.
    # subject_area: sales (MISSING)
    table_info:
    - table: sales_transactions
    joins: []
    columns:
    transaction_id:
    name: Transaction ID
    type: string

3. Invalid Format Use Cases

  • Use Case 5: table_info is Not a List

    • Description: table_info is defined as a string or dictionary instead of a list.
    • Expected Outcome: Validation fails.
    subject_area: sales
    table_info: "sales_transactions" # Invalid: Should be a list
    columns:
    transaction_id:
    name: Transaction ID

4. Column Structure Use Cases

  • Use Case 13: Column Entry Missing Required Keys

    • Description: A column entry is missing one or more of the required keys (name, type, column, desc).
    • Expected Outcome: Validation fails.
    subject_area: sales
    table_info:
    - table: sales_transactions
    joins: []
    columns:
    transaction_id:
    # name: Transaction ID (MISSING)
    type: string
    column: transaction_id
    desc: Unique identifier for each sales transaction
  • Use Case 14: Column Entry Has Empty Values for Required Keys

    • Description: Required keys within a column entry have empty values.
    • Expected Outcome: Validation fails.
    subject_area: sales
    table_info:
    - table: sales_transactions
    joins: []
    columns:
    transaction_id:
    name: "" # Invalid: Empty value for a required key
    type: string

5. Duplicate Column IDs Use Cases

  • Use Case 15: Duplicate Column IDs within the Same Schema

    • Description: Two or more column entries have the same identity name within the same schema file.
    • Expected Outcome: Validation fails.
    subject_area: sales
    table_info:
    - table: sales_transactions
    joins: []
    columns:
    transaction_id:
    name: Transaction ID
    type: string
    transaction_id: # Invalid: Duplicate column ID
    name: Another Transaction ID