Skip to main content

Guidelines to creating schema and semantics

The guidelines provided below outline best practices to follow when creating your own metadata or modifying existing metadata generated by the agent. Adhering to these practices ensures consistency, accuracy, and clarity in metadata management. This helps improve data discoverability and usability.

Key Points to Note During Schema Creation

The schema of the data provides a clear definition of all the columns in the data source, making it easily understandable and interpretable by the agent. It is generated based on the data's DDL (Data Definition Language). While the model-generated schema is reliable, its effectiveness depends on how well the column names are chosen—ideally, they should be self-explanatory and unambiguous. Here are some points to note while tweaking the metadata.

1. Primary Key and Foreign Key Flags

  • All columns default to primary_key: false and foreign_key: false.
  • Update them according to the DDL (data definition language of the database).

Example from games.yml:

GAME_ID:
name: Game ID
type: VARCHAR
column: GAME_ID
desc: Unique identifier for the game
primary_key: true
foreign_key: false
HOME_TEAM_ID:
name: Home Team ID
type: VARCHAR
column: HOME_TEAM_ID
desc: Identifier for the home team
primary_key: false
foreign_key: true

2. Using fetch: true for Categorical or Repeated Values

  • Set fetch: true for values that are:

    • Categorical (e.g., status, game type)
    • Repetitive (e.g., team names)
    • Frequently queried
  • These values are used for fuzzy matching in key-value stores by the agent.

Example from games.yml:

GAME_TYPE:
name: Game Type
type: VARCHAR
column: GAME_TYPE
desc: Type of the game (e.g., regular, playoff)
fetch: true
HOME_TEAM:
name: Home Team
type: VARCHAR
column: HOME_TEAM
desc: Name of the home team
fetch: true

3. Defining Joins for Multi-Table Queries

  • Add joins manually under table_info.

  • Format for joins:

    - join: <TABLE_NAME>
    on: <JOIN_CONDITION>

Example from games.yml:

table_info:
- table: GAMES
joins:
- join: TEAMS
on: TEAMS.ID = GAMES.HOME_TEAM_ID
- join: TEAMS
on: TEAMS.ID = GAMES.VISITOR_TEAM_ID

Note: When joining to the same table multiple times, use precise conditions (for example: HOME_TEAM_ID, VISITOR_TEAM_ID, etc.).

📄 Schema Template Reference

Below is the general structure for a schema YAML file:

<SCHEMA_NAME>:
subject_area: <Domain>
table_info:
- table: <TABLE_NAME>
joins:
- join: <OTHER_TABLE>
on: <JOIN_CONDITION>
columns:
<COLUMN_NAME>:
name: <Readable Name>
type: <Data Type>
column: <DB Column>
desc: <Description>
primary_key: <true|false>
foreign_key: <true|false>
fetch: <true|false>

Example Reference

See the following schema file for a complete reference:

GAMES.yml
GAMES:
subject_area: Sports
table_info:
- table: GAMES
joins:
- join: TEAMS
on: TEAMS.ID = GAMES.HOME_TEAM_ID
- join: TEAMS
on: TEAMS.ID = GAMES.VISITOR_TEAM_ID

columns:
GAME_ID:
name: Game ID
type: VARCHAR
column: GAME_ID
desc: Unique identifier for the game
primary_key: true
foreign_key: false
GAME_DATE:
name: Game Date
type: DATE
column: GAME_DATE
desc: Date when the game was played
primary_key: false
foreign_key: false
SEASON:
name: Season
type: VARCHAR
column: SEASON
desc: Season during which the game took place
primary_key: false
foreign_key: false
GAME_TYPE:
name: Game Type
type: VARCHAR
column: GAME_TYPE
desc: Type of the game (e.g., regular, playoff)
primary_key: false
foreign_key: false
fetch: true
STATUS:
name: Status
type: VARCHAR
column: STATUS
desc: Current status of the game
primary_key: false
foreign_key: false
fetch: true
START_TIME:
name: Start Time
type: TIMESTAMP_NTZ
column: START_TIME
desc: Scheduled start time of the game
primary_key: false
foreign_key: false
HOME_TEAM_ID:
name: Home Team ID
type: VARCHAR
column: HOME_TEAM_ID
desc: Identifier for the home team
primary_key: false
foreign_key: true
HOME_TEAM:
name: Home Team
type: VARCHAR
column: HOME_TEAM
desc: Name of the home team
primary_key: false
foreign_key: false
fetch: true
HOME_SCORE:
name: Home Score
type: NUMBER
column: HOME_SCORE
desc: Score of the home team
primary_key: false
foreign_key: false
VISITOR_TEAM_ID:
name: Visitor Team ID
type: VARCHAR
column: VISITOR_TEAM_ID
desc: Identifier for the visitor team
primary_key: false
foreign_key: true
VISITOR_TEAM:
name: Visitor Team
type: VARCHAR
column: VISITOR_TEAM
desc: Name of the visitor team
primary_key: false
foreign_key: false
fetch: true
VISITOR_POINT:
name: Visitor Points
type: NUMBER
column: VISITOR_POINT
desc: Points scored by the visitor team
primary_key: false
foreign_key: false

Key Points to Note During Semantics Creation

The agent generates the semantics after constructing the schema, so the quality of the semantics depends on how well the schema is defined. The semantics section includes domain-specific attributes and metrics, which are essential for generating insights and extracting key performance indicators (KPIs) from the data. Some points to consider while creating and chaging the semantics are:

1. Extent of Metrics and Attributes

  • Metrics and attributes follow a similar format and must be clear and unambigious.
  • The attributes/metrics generated by the agent depends on the schema clarity.
  • Well-described columns = better semantic generation.

2. Default Column Reference

  • By default, all columns in the schema are included.
columns:
- <all>

Limiting Column References

  • To restrict which columns are referenced, list them explicitly under columns:.
columns:
- GAME_ID
- GAME_DATE

3. Overwriting Schema Definitions

  • An attribute with the same key as a schema column will override it.

  • For example: If you define GAME_ID in sematics as the game identifier, it replaces the schema-level metadata for GAME_ID.

4. Using include in Attributes

  • The include: keyword specifies which columns to be included for the attribute (like SELECT in SQL).
game_info:
include:
- GAME_ID
- GAME_DATE
- GAME_TYPE

5. Calculation Field

  • Use SQL-like functions in calculation:, and wrap columns in [].
calculation: "SUM([HOME_SCORE])"

6. Using Filters (Optional)

  • Use filters: for conditions, and write them in conditional expression or natural language.
filters: "[YEAR]=2025"
-- or --
filters: "Only include games where the home team scored above 100"

Example Snippets from semantics for the NBA Games table

Attribute: Current Season

current_season:
name: Current season
synonym:
- This season
- Running season
- Current season games
description: The current season or the running season.
fliters: "[SEASON]=2024"
include:
- SEASON

Metric: Visitor Team Wins

visitor_team_wins:
name: Visitor Team Wins
synonym:
- Visitor Wins
- Visitor Team Victories
description: Count of games where the visitor team scored more points than the
home team.
calculation: "COUNT([GAME_ID]) WHERE [VISITOR_POINTS] > [HOME_SCORE]"

📄 Semantic Template Reference

<TABLE_NAME>:
folder: <Name>
type: fact | dimension
source:
schema.<TABLE_NAME>:
columns:
- <column_name> | <all>
attributes:
<attribute_key>:
name: <Readable Name>
synonym: [<Synonyms>]
description: <Description>
include: [<Schema Columns>]

metrics:
<metric_key>:
name: <Metric Name>
synonym: [<Synonyms>]
description: <Metric Description>
calculation: "<SQL-Like Calculation>"
filters: <Optional Natural Language Filter>

Example Reference

See the following semantics file for a complete reference:

GAMES.yml
GAMES:
folder: GAMES
type: fact
source:
schema.GAMES:
columns:
- <all>
attributes:
game_identifier:
name: Game ID
synonym:
- Game Identifier
- Match ID
- Contest ID
description: Unique identifier for the game.
include:
- GAME_ID

game_info:
name: Game Information
synonym:
- Information of Games
- Games Info
description: Summarized information about the games
include:
- GAME_ID
- GAME_DATE
- SEASON
- GAME_TYPE
- HOME_TEAM
- HOME_SCORE
- VISITOR_TEAM
- VISITOR_POINTS
game_category:
name: Game Type
synonym:
- Type of Game
- Game Classification
- Match Type
description: Type of the game (e.g., regular, playoff).
include:
- GAME_TYPE

home_team_score:
name: Home Score
synonym:
- Home Points
- Home Team Points
- Points Scored by Home
description: Score of the home team.
include:
- HOME_SCORE

visitor_team_score:
name: Visitor Points
synonym:
- Visitor Score
- Away Points
- Points Scored by Visitor
description: Points scored by the visitor team.
include:
- VISITOR_POINTS

current_season:
name: Current season
synonym:
- This season
- Running season
- Current season games
description: The current season or the running season.
fliters: "[SEASON]=2024"

metrics:
total_games:
name: Total Games Played
synonym:
- Game Count
- Number of Games
description: Total number of games played in the dataset.
calculation: "COUNT([GAME_ID])"

total_home_score:
name: Total Home Score
synonym:
- Home Points Scored
- Home Team Points
description: Total points scored by home teams across all games.
calculation: "SUM([HOME_SCORE])"

total_visitor_score:
name: Total Visitor Score
synonym:
- Visitor Points Scored
- Visitor Team Points
description: Total points scored by visitor teams across all games.
calculation: "SUM([VISITOR_POINTS])"


home_team_wins:
name: Home Team Wins
synonym:
- Home Wins
- Home Team Victories
description: Count of games where the home team scored more points than the visitor
team.
calculation: "COUNT([GAME_ID]) WHERE [HOME_SCORE] > [VISITOR_POINTS]"

visitor_team_wins:
name: Visitor Team Wins
synonym:
- Visitor Wins
- Visitor Team Victories
description: Count of games where the visitor team scored more points than the
home team.
calculation: "COUNT([GAME_ID]) WHERE [VISITOR_POINTS] > [HOME_SCORE]"

games_by_status:
name: Games by Status
synonym:
- Status Count
- Games Status Distribution
description: Count of games grouped by their current status.
calculation: "COUNT([GAME_ID]) GROUP BY [STATUS]"

average_home_score:
name: Average Home Score
synonym:
- Average Points by Home Team
- Home Team Average Score
description: Average points scored by home teams across all games.
calculation: "AVG([HOME_SCORE])"

average_visitor_score:
name: Average Visitor Score
synonym:
- Average Points by Visitor Team
- Visitor Team Average Score
description: Average points scored by visitor teams across all games.
calculation: "AVG([VISITOR_POINTS])"