Skip to main content

Building Semantics: Adding Business Context 🧠

If the schema is your data's blueprint, the semantics layer is its brain. It adds the business context, logic, and human-friendly language needed to transform raw data columns into meaningful insights. The quality of your semantics directly depends on a well-defined schema.

This guide will cover the foundational concepts and then walk through practical scenarios to build a complete semantics file for our NBA dataset.


Foundational Concepts​

Before building, it's crucial to understand the main tools at your disposal.

1. Overwriting Schema Definitions​

You can define a semantic attribute with the same name as a schema column. When you do this, the semantic definition overrides the schema's definition.

  • Why do this? To enrich a basic schema column with business-friendly synonyms, a better description, or to apply a filter. For example, a column named TXN_ID in the schema could be overridden in semantics to have the name "Transaction ID" and synonyms like "Order Number."

2. Using include in Attributes​

The include keyword lets you bundle multiple schema columns into a single, logical attribute.

  • Why do this? It's perfect for creating quick summaries. A user can ask for "Game Info," and the agent knows to pull all the relevant columns you've included, like the date, teams, and scores, all at once. It acts like a pre-defined SELECT statement for a group of related columns.

3. The calculation Field​

This is where you embed business logic to create metrics. You can use SQL-like functions, and you must wrap any referenced column or attribute in square brackets [].

  • Why do this? This is how you define your Key Performance Indicators (KPIs). Instead of just showing the raw score, you can calculate home_team_wins, average_score, or total_points, which are far more valuable for analysis.

4. Using filters​

The filters keyword applies a permanent condition to an attribute or metric.

  • Why do this? It allows you to create powerful "shortcuts." For instance, you can create an attribute called current_season_games with a filter for "[SEASON]=2024". Now, users can ask questions about the "current season" without needing to remember and specify the year, making queries faster and less error-prone.

Building Semantics: Practical Scenarios 💡​

Let's build our games.yml semantics file step-by-step, from the simplest attributes to more complex metrics.

Scenario A: Basic Attributes - Aliases and Groups​

First, let's make our schema columns more accessible and create a useful summary group.

  1. Create a Simple Alias: The column GAME_ID is a technical name. Let's create a user-friendly attribute with synonyms.
    attributes:
    game_identifier:
    name: Game ID
    synonym:
    - Game Identifier
    - Match ID
    description: Unique identifier for the game.
    include:
    - GAME_ID
  2. Group Columns for a Summary: Let's create a game_info attribute that provides a quick overview of a game.
      game_info:
    name: Game Information
    synonym:
    - Information of Games
    - Games Info
    description: Summarized information about the games.
    include:
    - GAME_ID
    - GAME_DATE
    - SEASON
    - HOME_TEAM
    - HOME_SCORE
    - VISITOR_TEAM
    - VISITOR_POINT

Scenario B: Simple Metrics - Basic Aggregations​

Now, let's start calculating basic KPIs using simple aggregation functions.

  1. Count the Total Number of Games:
    metrics:
    total_games:
    name: Total Games Played
    synonym:
    - Game Count
    - Number of Games
    description: Total number of games played in the dataset.
    calculation: "COUNT([GAME_ID])"
  2. Sum the Total Scores:
      total_home_score:
    name: Total Home Score
    synonym:
    - Home Points Scored
    description: Total points scored by home teams across all games.
    calculation: "SUM([HOME_SCORE])"

Scenario C: Conditional Metrics - Adding Business Logic​

This is where semantics become truly powerful. Let's define metrics that answer more specific questions by embedding a WHERE clause in the calculation.

Goal: Calculate the number of wins for home teams vs. visitor teams.

metrics:
home_team_wins:
name: Home Team Wins
synonym:
- Home Wins
- Home Team Victories
description: Count of games where the home team scored more points than the visitor team.
calculation: "COUNT([GAME_ID]) WHERE [HOME_SCORE] > [VISITOR_POINT]"

visitor_team_wins:
name: Visitor Team Wins
synonym:
- Visitor Wins
description: Count of games where the visitor team scored more points than the home team.
calculation: "COUNT([GAME_ID]) WHERE [VISITOR_POINT] > [HOME_SCORE]"

Explanation: The calculation here isn't just a simple aggregation; it contains conditional logic that compares two columns to determine the outcome, directly answering a common business question.

Scenario D: Advanced Metrics - Grouping and Filtering​

Finally, let's create metrics that provide categorical breakdowns and use pre-filtered data.

  1. Group Data by Category: Let's see how many games fall into each status category (e.g., Final, Scheduled).
    metrics:
    games_by_status:
    name: Games by Status
    synonym:
    - Status Count
    description: Count of games grouped by their current status.
    calculation: "COUNT([GAME_ID]) GROUP BY [STATUS]"
  2. Use a Pre-Filtered Attribute: Let's create our current_season attribute with a filter.
    attributes:
    current_season:
    name: Current season
    synonym:
    - This season
    - Running season
    description: The current season or the running season.
    filters: "[SEASON]=2024"
    Now you could create a metric that uses this attribute to, for example, count wins only in the current season.

The Complete Picture: Full Example Reference​

When you combine all these scenarios, you get a comprehensive and powerful semantics file that makes your data truly conversational.

GAMES:
folder: GAMES
type: fact
source:
schema.GAMES:
columns:
- <all>
attributes:
game_identifier:
name: Game ID
synonym: [Game Identifier, Match ID]
description: Unique identifier for the game.
include: [GAME_ID]
game_info:
name: Game Information
synonym: [Information of Games, Games Info]
description: Summarized information about the games.
include: [GAME_ID, GAME_DATE, SEASON, HOME_TEAM, HOME_SCORE, VISITOR_TEAM, VISITOR_POINT]
current_season:
name: Current season
synonym: [This season, Running season]
description: The current season or the running season.
filters: "[SEASON]=2024"
metrics:
total_games:
name: Total Games Played
synonym: [Game Count, Number of Games]
description: Total number of games played in the dataset.
calculation: "COUNT([GAME_ID])"
total_home_score:
name: Total Home Score
synonym: [Home Points Scored]
description: Total points scored by home teams across all games.
calculation: "SUM([HOME_SCORE])"
home_team_wins:
name: Home Team Wins
synonym: [Home Wins, Home Team Victories]
description: Count of games where the home team scored more.
calculation: "COUNT([GAME_ID]) WHERE [HOME_SCORE] > [VISITOR_POINT]"
visitor_team_wins:
name: Visitor Team Wins
synonym: [Visitor Wins, Visitor Team Victories]
description: Count of games where the visitor team scored more.
calculation: "COUNT([GAME_ID]) WHERE [VISITOR_POINT] > [HOME_SCORE]"
games_by_status:
name: Games by Status
synonym: [Status Count, Games Status Distribution]
description: Count of games grouped by their current status.
calculation: "COUNT([GAME_ID]) GROUP BY [STATUS]"
average_home_score:
name: Average Home Score
synonym: [Average Points by Home Team]
description: Average points scored by home teams across all games.
calculation: "AVG([HOME_SCORE])"