Building Semantics: Adding Business Context ðŸ§
If the schema is your data's blueprint, the semantics layer is its brain. It adds the business context, logic, and human-friendly language needed to transform raw data columns into meaningful insights. The quality of your semantics directly depends on a well-defined schema.
This guide will cover the foundational concepts and then walk through practical scenarios to build a complete semantics file for our NBA dataset.
Foundational Concepts​
Before building, it's crucial to understand the main tools at your disposal.
1. Overwriting Schema Definitions​
You can define a semantic attribute with the same name as a schema column. When you do this, the semantic definition overrides the schema's definition.
- Why do this? To enrich a basic schema column with business-friendly
synonyms, a betterdescription, or to apply afilter. For example, a column namedTXN_IDin the schema could be overridden in semantics to have the name "Transaction ID" and synonyms like "Order Number."
2. Using include in Attributes​
The include keyword lets you bundle multiple schema columns into a single, logical attribute.
- Why do this? It's perfect for creating quick summaries. A user can ask for "Game Info," and the agent knows to pull all the relevant columns you've included, like the date, teams, and scores, all at once. It acts like a pre-defined
SELECTstatement for a group of related columns.
3. The calculation Field​
This is where you embed business logic to create metrics. You can use SQL-like functions, and you must wrap any referenced column or attribute in square brackets [].
- Why do this? This is how you define your Key Performance Indicators (KPIs). Instead of just showing the raw score, you can calculate
home_team_wins,average_score, ortotal_points, which are far more valuable for analysis.
4. Using filters​
The filters keyword applies a permanent condition to an attribute or metric.
- Why do this? It allows you to create powerful "shortcuts." For instance, you can create an attribute called
current_season_gameswith a filter for"[SEASON]=2024". Now, users can ask questions about the "current season" without needing to remember and specify the year, making queries faster and less error-prone.
Building Semantics: Practical Scenarios 💡​
Let's build our games.yml semantics file step-by-step, from the simplest attributes to more complex metrics.
Scenario A: Basic Attributes - Aliases and Groups​
First, let's make our schema columns more accessible and create a useful summary group.
- Create a Simple Alias: The column
GAME_IDis a technical name. Let's create a user-friendly attribute with synonyms.attributes:
game_identifier:
name: Game ID
synonym:
- Game Identifier
- Match ID
description: Unique identifier for the game.
include:
- GAME_ID - Group Columns for a Summary: Let's create a
game_infoattribute that provides a quick overview of a game.game_info:
name: Game Information
synonym:
- Information of Games
- Games Info
description: Summarized information about the games.
include:
- GAME_ID
- GAME_DATE
- SEASON
- HOME_TEAM
- HOME_SCORE
- VISITOR_TEAM
- VISITOR_POINT
Scenario B: Simple Metrics - Basic Aggregations​
Now, let's start calculating basic KPIs using simple aggregation functions.
- Count the Total Number of Games:
metrics:
total_games:
name: Total Games Played
synonym:
- Game Count
- Number of Games
description: Total number of games played in the dataset.
calculation: "COUNT([GAME_ID])" - Sum the Total Scores:
total_home_score:
name: Total Home Score
synonym:
- Home Points Scored
description: Total points scored by home teams across all games.
calculation: "SUM([HOME_SCORE])"
Scenario C: Conditional Metrics - Adding Business Logic​
This is where semantics become truly powerful. Let's define metrics that answer more specific questions by embedding a WHERE clause in the calculation.
Goal: Calculate the number of wins for home teams vs. visitor teams.
metrics:
home_team_wins:
name: Home Team Wins
synonym:
- Home Wins
- Home Team Victories
description: Count of games where the home team scored more points than the visitor team.
calculation: "COUNT([GAME_ID]) WHERE [HOME_SCORE] > [VISITOR_POINT]"
visitor_team_wins:
name: Visitor Team Wins
synonym:
- Visitor Wins
description: Count of games where the visitor team scored more points than the home team.
calculation: "COUNT([GAME_ID]) WHERE [VISITOR_POINT] > [HOME_SCORE]"
Explanation: The calculation here isn't just a simple aggregation; it contains conditional logic that compares two columns to determine the outcome, directly answering a common business question.
Scenario D: Advanced Metrics - Grouping and Filtering​
Finally, let's create metrics that provide categorical breakdowns and use pre-filtered data.
- Group Data by Category: Let's see how many games fall into each status category (e.g., Final, Scheduled).
metrics:
games_by_status:
name: Games by Status
synonym:
- Status Count
description: Count of games grouped by their current status.
calculation: "COUNT([GAME_ID]) GROUP BY [STATUS]" - Use a Pre-Filtered Attribute: Let's create our
current_seasonattribute with a filter.Now you could create a metric that uses this attribute to, for example, count wins only in the current season.attributes:
current_season:
name: Current season
synonym:
- This season
- Running season
description: The current season or the running season.
filters: "[SEASON]=2024"
The Complete Picture: Full Example Reference​
When you combine all these scenarios, you get a comprehensive and powerful semantics file that makes your data truly conversational.
GAMES:
folder: GAMES
type: fact
source:
schema.GAMES:
columns:
- <all>
attributes:
game_identifier:
name: Game ID
synonym: [Game Identifier, Match ID]
description: Unique identifier for the game.
include: [GAME_ID]
game_info:
name: Game Information
synonym: [Information of Games, Games Info]
description: Summarized information about the games.
include: [GAME_ID, GAME_DATE, SEASON, HOME_TEAM, HOME_SCORE, VISITOR_TEAM, VISITOR_POINT]
current_season:
name: Current season
synonym: [This season, Running season]
description: The current season or the running season.
filters: "[SEASON]=2024"
metrics:
total_games:
name: Total Games Played
synonym: [Game Count, Number of Games]
description: Total number of games played in the dataset.
calculation: "COUNT([GAME_ID])"
total_home_score:
name: Total Home Score
synonym: [Home Points Scored]
description: Total points scored by home teams across all games.
calculation: "SUM([HOME_SCORE])"
home_team_wins:
name: Home Team Wins
synonym: [Home Wins, Home Team Victories]
description: Count of games where the home team scored more.
calculation: "COUNT([GAME_ID]) WHERE [HOME_SCORE] > [VISITOR_POINT]"
visitor_team_wins:
name: Visitor Team Wins
synonym: [Visitor Wins, Visitor Team Victories]
description: Count of games where the visitor team scored more.
calculation: "COUNT([GAME_ID]) WHERE [VISITOR_POINT] > [HOME_SCORE]"
games_by_status:
name: Games by Status
synonym: [Status Count, Games Status Distribution]
description: Count of games grouped by their current status.
calculation: "COUNT([GAME_ID]) GROUP BY [STATUS]"
average_home_score:
name: Average Home Score
synonym: [Average Points by Home Team]
description: Average points scored by home teams across all games.
calculation: "AVG([HOME_SCORE])"