Building Semantics: Adding Business Context 🧠

If the schema is your data's blueprint, the semantics layer is its brain. It adds the business context, logic, and human-friendly language needed to transform raw data columns into meaningful insights. The quality of your semantics directly depends on a well-defined schema.

This guide will cover the foundational concepts and then walk through practical scenarios to build a complete semantics file for our NBA dataset.

Foundational Concepts

Before building, it's crucial to understand the main tools at your disposal.

1. Overwriting Schema Definitions

You can define a semantic attribute with the same name as a schema column. When you do this, the semantic definition overrides the schema's definition.

Why do this? To enrich a basic schema column with business-friendly synonyms, a better description, or to apply a filter. For example, a column named TXN_ID in the schema could be overridden in semantics to have the name "Transaction ID" and synonyms like "Order Number."

2. Using `include` in Attributes

The include keyword lets you bundle multiple schema columns into a single, logical attribute.

Why do this? It's perfect for creating quick summaries. A user can ask for "Game Info," and the agent knows to pull all the relevant columns you've included, like the date, teams, and scores, all at once. It acts like a pre-defined SELECT statement for a group of related columns.

3. The `calculation` Field

This is where you embed business logic to create metrics. You can use SQL-like functions, and you must wrap any referenced column or attribute in square brackets [].

Why do this? This is how you define your Key Performance Indicators (KPIs). Instead of just showing the raw score, you can calculate home_team_wins, average_score, or total_points, which are far more valuable for analysis.

4. Using `filters`

The filters keyword applies a permanent condition to an attribute or metric.

Why do this? It allows you to create powerful "shortcuts." For instance, you can create an attribute called current_season_games with a filter for "[SEASON]=2024". Now, users can ask questions about the "current season" without needing to remember and specify the year, making queries faster and less error-prone.

Building Semantics: Practical Scenarios 💡

Let's build our games.yml semantics file step-by-step, from the simplest attributes to more complex metrics.

Scenario A: Basic Attributes - Aliases and Groups

First, let's make our schema columns more accessible and create a useful summary group.

Create a Simple Alias: The column GAME_ID is a technical name. Let's create a user-friendly attribute with synonyms.

attributes:
  game_identifier:
    name: Game ID
    synonym:
      - Game Identifier
      - Match ID
    description: Unique identifier for the game.
    include:
      - GAME_ID

Group Columns for a Summary: Let's create a game_info attribute that provides a quick overview of a game.

  game_info:
    name: Game Information
    synonym:
      - Information of Games
      - Games Info
    description: Summarized information about the games.
    include:
      - GAME_ID
      - GAME_DATE
      - SEASON
      - HOME_TEAM
      - HOME_SCORE
      - VISITOR_TEAM
      - VISITOR_POINT

Scenario B: Simple Metrics - Basic Aggregations

Now, let's start calculating basic KPIs using simple aggregation functions.

Count the Total Number of Games:

metrics:
  total_games:
    name: Total Games Played
    synonym:
      - Game Count
      - Number of Games
    description: Total number of games played in the dataset.
    calculation: "COUNT([GAME_ID])"

Sum the Total Scores:

  total_home_score:
    name: Total Home Score
    synonym:
      - Home Points Scored
    description: Total points scored by home teams across all games.
    calculation: "SUM([HOME_SCORE])"

Scenario C: Conditional Metrics - Adding Business Logic

This is where semantics become truly powerful. Let's define metrics that answer more specific questions by embedding a WHERE clause in the calculation.

Goal: Calculate the number of wins for home teams vs. visitor teams.

metrics:
  home_team_wins:
    name: Home Team Wins
    synonym:
      - Home Wins
      - Home Team Victories
    description: Count of games where the home team scored more points than the visitor team.
    calculation: "COUNT([GAME_ID]) WHERE [HOME_SCORE] > [VISITOR_POINT]"

  visitor_team_wins:
    name: Visitor Team Wins
    synonym:
      - Visitor Wins
    description: Count of games where the visitor team scored more points than the home team.
    calculation: "COUNT([GAME_ID]) WHERE [VISITOR_POINT] > [HOME_SCORE]"

Explanation: The calculation here isn't just a simple aggregation; it contains conditional logic that compares two columns to determine the outcome, directly answering a common business question.

Scenario D: Advanced Metrics - Grouping and Filtering

Finally, let's create metrics that provide categorical breakdowns and use pre-filtered data.

Group Data by Category: Let's see how many games fall into each status category (e.g., Final, Scheduled).

metrics:
  games_by_status:
    name: Games by Status
    synonym:
      - Status Count
    description: Count of games grouped by their current status.
    calculation: "COUNT([GAME_ID]) GROUP BY [STATUS]"

Use a Pre-Filtered Attribute: Let's create our current_season attribute with a filter.

attributes:
  current_season:
    name: Current season
    synonym:
      - This season
      - Running season
    description: The current season or the running season.
    filters: "[SEASON]=2024"

Now you could create a metric that uses this attribute to, for example, count wins only in the current season.

The Complete Picture: Full Example Reference

When you combine all these scenarios, you get a comprehensive and powerful semantics file that makes your data truly conversational.

GAMES:
  folder: GAMES
  type: fact
  source:
    schema.GAMES:
      columns:
        - <all>
  attributes:
    game_identifier:
      name: Game ID
      synonym: [Game Identifier, Match ID]
      description: Unique identifier for the game.
      include: [GAME_ID]
    game_info:
      name: Game Information
      synonym: [Information of Games, Games Info]
      description: Summarized information about the games.
      include: [GAME_ID, GAME_DATE, SEASON, HOME_TEAM, HOME_SCORE, VISITOR_TEAM, VISITOR_POINT]
    current_season:
      name: Current season
      synonym: [This season, Running season]
      description: The current season or the running season.
      filters: "[SEASON]=2024"
  metrics:
    total_games:
      name: Total Games Played
      synonym: [Game Count, Number of Games]
      description: Total number of games played in the dataset.
      calculation: "COUNT([GAME_ID])"
    total_home_score:
      name: Total Home Score
      synonym: [Home Points Scored]
      description: Total points scored by home teams across all games.
      calculation: "SUM([HOME_SCORE])"
    home_team_wins:
      name: Home Team Wins
      synonym: [Home Wins, Home Team Victories]
      description: Count of games where the home team scored more.
      calculation: "COUNT([GAME_ID]) WHERE [HOME_SCORE] > [VISITOR_POINT]"
    visitor_team_wins:
      name: Visitor Team Wins
      synonym: [Visitor Wins, Visitor Team Victories]
      description: Count of games where the visitor team scored more.
      calculation: "COUNT([GAME_ID]) WHERE [VISITOR_POINT] > [HOME_SCORE]"
    games_by_status:
      name: Games by Status
      synonym: [Status Count, Games Status Distribution]
      description: Count of games grouped by their current status.
      calculation: "COUNT([GAME_ID]) GROUP BY [STATUS]"
    average_home_score:
      name: Average Home Score
      synonym: [Average Points by Home Team]
      description: Average points scored by home teams across all games.
      calculation: "AVG([HOME_SCORE])"

Foundational Concepts​

1. Overwriting Schema Definitions​

2. Using include in Attributes​

3. The calculation Field​

4. Using filters​

Building Semantics: Practical Scenarios 💡​

Scenario A: Basic Attributes - Aliases and Groups​

Scenario B: Simple Metrics - Basic Aggregations​

Scenario C: Conditional Metrics - Adding Business Logic​

Scenario D: Advanced Metrics - Grouping and Filtering​

The Complete Picture: Full Example Reference​