Schema Layer
The Schema layer is the foundational element of the Denzing Metadata framework. It contains the physical schema-level information for your data sources, encompassing column-level details, primary keys, and brief descriptions.
Purpose
- To enable LLMs to accurately comprehend the physical structure of input data from sources like database tables and CSV files.
 - To establish a clear mapping from raw data fields to logical representations that the model can process.
 
File Format and Registry
All schema definitions must be maintained in a valid YAML file format. For a single data source, multiple YAML files can be created, with each corresponding to a specific table or view.
Allowed Top-Level Keys
When defining a schema file, only the following lowercase, top-level keys are permitted:
subject_areatable_infocolumns
Components of the Schema Layer
subject_area (Required)
This key identifies the specific subdomain or business area to which the schema belongs (e.g., 'sales' or 'inventory').
subject_area: sales
table_info (Required)
This key defines one or more tables and their associations based on foreign key relationships. It must be a list containing definitions for each table, and each table within table_info must have a unique name.
table(Required): The actual name of the table.grain(Optional): Defines the data granularity. The hierarchy mentioned must correspond to an existing definition in ahierarchy.ymlfile.joins(Optional): A list of join conditions that define relationships between this table and other tables.
table_info:
  grain:
    merchandise_hierarchy: item
    item_hierarchy: style
  table: sales_ild_f
  joins:
    - join: product_d
      on: product_d.item_id sales_ild_f.item_id
    - join: calendar_d
      on: calendar_d.day_id sales_ild_f.day_id
    - join: org_d
      on: org_d.organization_id sales_ild_f.organization_id
columns (Optional)
This key is a collection of database columns that serve as the building blocks for defining metrics and attributes. Each column is uniquely identified by an identity name.
name(Required): The human-readable and searchable display name for the column.type(Required): The data type of the column (e.g.,string,number,varchar).column(Required): The actual column name as it appears in the database table.desc(Required): A brief explanation of the column's purpose or content.primary_key(Optional): A boolean flag (true) indicating the column is a primary key.foreign_key(Optional): A boolean flag (true) indicating the column is a foreign key.fetch(Optional): A boolean flag (true) indicating to cache the unique values of columns.
columns:
  order_id:
    name: Order ID
    type: number
    column: order_id
    primary_key: true
    desc: Represents the sales order ID.
    fetch: false
  product_name:
    name: Product Name
    table: product_t
    type: varchar
    column: pdt_name
    desc: Represents the product name.
    fetch: true
YAML Formatting and Validation
- Indentation: Correct indentation is paramount for valid YAML, as it ensures the hierarchical structure is accurately defined and parsed.
 - Key Validation: Every key defined within the column definitions (like 
name,type,desc) must have a non-empty value if it is present in the schema. 
Best Practices for Schema Design
Adhering to these best practices is crucial for ensuring robust, consistent, and usable Denzing metadata.
- Clear Column Names: Use descriptive and unambiguous names for all columns.
- Example: Use 
customer_nameinstead ofcust_nm;order_idinstead ofid. 
 - Example: Use 
 - Consistent Naming Conventions: Employ a uniform naming style across your schema. For multi-word names, separate words using an underscore (
_).- Example: Use 
customer_nameinstead ofcustomer name. 
 - Example: Use 
 - Appropriate Data Types: Ensure that the correct data type is assigned to each field. Review all records to identify the most suitable data type for every column.
 - Primary & Foreign Keys: Explicitly define primary and foreign keys within your table's Data Definition Language (DDL) to establish clear relationships between tables.
 - Consistent Column Types: Avoid mixing different data types (e.g., strings and integers) within the same column. Columns should maintain consistent data types across all records.
 
Schema Validation Scenarios
This section provides practical YAML examples for validating schema definitions, illustrating adherence to the framework rules.
1. Valid Schema Use Cases
- 
Use Case 1: Complete and Correctly Structured Schema
- Description: Validates a schema with a non-empty 
subject_area, a list fortable_info, and a dictionary of valid columns. - Expected Outcome: Validation passes.
 
subject_area: sales
table_info:
- table: sales_transactions
joins:
- join: products
on: sales_transactions.product_id products.product_id
- table: customers
joins:
- join: sales_transactions
on: customers.customer_id = sales_transactions.customer_id
columns:
transaction_id:
name: Transaction ID
type: string
column: transaction_id
desc: Unique identifier for each sales transaction
primary_key: true
sale_date:
name: Sale Date
type: date
column: sale_date
desc: Date of the sales transaction - Description: Validates a schema with a non-empty 
 
2. Missing Required Keys Use Cases
- 
Use Case 2: Missing
subject_areaKey- Description: Schema is missing the 
subject_areakey. - Expected Outcome: Validation fails.
 
# subject_area: sales (MISSING)
table_info:
- table: sales_transactions
joins: []
columns:
transaction_id:
name: Transaction ID
type: string - Description: Schema is missing the 
 
3. Invalid Format Use Cases
- 
Use Case 5:
table_infois Not a List- Description: 
table_infois defined as a string or dictionary instead of a list. - Expected Outcome: Validation fails.
 
subject_area: sales
table_info: "sales_transactions" # Invalid: Should be a list
columns:
transaction_id:
name: Transaction ID - Description: 
 
4. Column Structure Use Cases
- 
Use Case 13: Column Entry Missing Required Keys
- Description: A column entry is missing one or more of the required keys (
name,type,column,desc). - Expected Outcome: Validation fails.
 
subject_area: sales
table_info:
- table: sales_transactions
joins: []
columns:
transaction_id:
# name: Transaction ID (MISSING)
type: string
column: transaction_id
desc: Unique identifier for each sales transaction - Description: A column entry is missing one or more of the required keys (
 - 
Use Case 14: Column Entry Has Empty Values for Required Keys
- Description: Required keys within a column entry have empty values.
 - Expected Outcome: Validation fails.
 
subject_area: sales
table_info:
- table: sales_transactions
joins: []
columns:
transaction_id:
name: "" # Invalid: Empty value for a required key
type: string 
5. Duplicate Column IDs Use Cases
- 
Use Case 15: Duplicate Column IDs within the Same Schema
- Description: Two or more column entries have the same identity name within the same schema file.
 - Expected Outcome: Validation fails.
 
subject_area: sales
table_info:
- table: sales_transactions
joins: []
columns:
transaction_id:
name: Transaction ID
type: string
transaction_id: # Invalid: Duplicate column ID
name: Another Transaction ID