Skip to main content

Building Metadata for NBA

The guidelines provided below outline best practices to follow when creating your own metadata or modifying existing metadata generated by the agent. Adhering to these practices ensures consistency, accuracy, and clarity in metadata management. This helps improve data discoverability and usability.

Overview

Before building the metadata, it's important to understand the data you're working with. This NBA dataset captures detailed information about basketball games. It contains game-level data such as dates, seasons, teams (home and visitor), and final scores.

The structure suggests there are at least two primary tables, GAMES and TEAMS, which can be linked. This allows you to not only see the results of a single game but also to analyze performance across multiple games for specific teams. The goal of the metadata is to make this raw data easily searchable and understandable for asking complex questions in natural language.

Before a you create metadata schema, you need to know the exact structure of your database tables. This structure is defined by the DDL (Data Definition Language), which is the set of SQL commands (CREATE TABLE, ALTER TABLE, etc.) used to build and modify your database.

The DDL is the source of truth for your data's physical layout. The schema file (games.yml) is simply a YAML representation of that DDL, designed to be readable by both humans and the agent. The agent uses the DDL to auto-generate an initial schema, which you then refine. An accurate schema must perfectly mirror the DDL's definitions for column names, data types, and key constraints.

Example DDL for the GAMES Table

Here is a sample CREATE TABLE statement that could be the DDL for our NBA GAMES table. Notice how it defines data types, primary keys, and foreign keys.

CREATE TABLE GAMES (
GAME_ID VARCHAR PRIMARY KEY,
GAME_DATE DATE,
SEASON VARCHAR,
GAME_TYPE VARCHAR,
STATUS VARCHAR,
HOME_TEAM_ID VARCHAR,
HOME_TEAM VARCHAR,
HOME_SCORE NUMBER,
VISITOR_TEAM_ID VARCHAR,
VISITOR_TEAM VARCHAR,
VISITOR_POINT NUMBER,

-- Defining the relationships to the TEAMS table
FOREIGN KEY (HOME_TEAM_ID) REFERENCES TEAMS(ID),
FOREIGN KEY (VISITOR_TEAM_ID) REFERENCES TEAMS(ID)
);

This DDL is then directly translated into the games.yml schema file. For example, the PRIMARY KEY and FOREIGN KEY constraints in the DDL are what inform the primary_key: true and foreign_key: true flags you set during schema refinement.

The Schema Layer

The Schema layer is the foundational element of the metadata framework. It contains the purely physical, structural information about your data sources, including details about columns, their data types, and primary keys.

Think of the Schema layer as the architectural blueprint of a building. It tells you exactly how the building is constructed: the number of floors, the layout of rooms, the location of doors and windows, and the materials used. It describes the physical facts but doesn't explain the purpose of the rooms.

The primary goals of the Schema layer are:

  • To enable a Large Language Model (LLM) to accurately understand the physical structure of your data, whether it's in a database table or a CSV file.
  • To establish a direct mapping from the raw data fields to a logical representation that the model can process.

The Semantics Layer

The Semantics layer is the second level of the hierarchy, and it's where you add business meaning and context to the physical structure defined in the schema. This layer builds directly on the schema, and it can even override schema-level information to provide a more business-centric definition.

Following our building analogy, if the Schema is the blueprint, the Semantics layer is the interior designer's plan. It explains the purpose of each room. It tells you, "this room is the kitchen, and its purpose is for cooking," and "that area is the master bedroom for resting." It can also define new, abstract concepts like "total living area" by combining the square footage of the living room and the dining room.

The primary goals of the Semantics layer are:

  • To provide deep contextual understanding of Key Performance Indicators (KPIs) and metrics, going beyond just their database definitions.
  • To ensure the LLM interprets business terms correctly, aligning with their intended purpose and meaning.
  • To allow users to define new, complex metrics and attributes by creating calculations that use existing columns from the schema or even other pre-defined semantic items.