Skip to main content

Building Schema: The Map to Your Logical Structure of Data 🗺️

Think of your data as a vast library of books. The schema is the library's cataloging system—the table of contents and index. It doesn't contain the data itself, but it provides a clear definition of every table (book) and column (chapter) in your data source.

This "map" is generated from your database's DDL (Data Definition Language). While the auto-generated schema is a good start, refining it is the most critical step to ensure the agent can understand and query your data accurately. A well-crafted schema, with clear, unambiguous column names, is the foundation for everything else.


Key Refinements for an Effective Schema​

Here are the key points to focus on when refining your schema, with explanations on why they matter.

1. Defining Keys: The Foundation of Relationships​

Your schema needs to know which columns are primary keys and which are foreign keys. These flags are the bedrock of table relationships.

  • primary_key: true: Marks the column that uniquely identifies each row in a table (e.g., GAME_ID in the GAMES table).
  • foreign_key: true: Marks a column that links to a primary key in another table (e.g., HOME_TEAM_ID links from GAMES to the TEAMS table).

Why It Matters: Without these flags, the agent sees your tables as disconnected islands of data. It cannot form relationships or perform joins. A simple question like, "Which games did the Lakers play last season?" would be impossible to answer because the agent wouldn't know how to connect the GAMES table to the TEAMS table to find the team named "Lakers." Correctly setting these keys is what makes multi-table queries possible.

Example:

# In games.yml
GAME_ID:
name: Game ID
type: VARCHAR
column: GAME_ID
desc: Unique identifier for the game
primary_key: true # This tells the agent it's the unique ID for a game.

HOME_TEAM_ID:
name: Home Team ID
type: VARCHAR
column: HOME_TEAM_ID
desc: Identifier for the home team
foreign_key: true # This tells the agent it can be used to link to another table.

2. Optimizing Queries with fetch: true​

The fetch: true flag is a powerful optimization for columns with categorical or frequently repeated values. It tells the agent to cache the unique values in that column.

Why It Matters: This feature enables intelligent fuzzy matching, which is essential for a smooth conversational experience. Users rarely type things perfectly. A user might ask for "Lakers games," but the data might store the team name as "Los Angeles Lakers."

By setting fetch: true on the HOME_TEAM and VISITOR_TEAM columns, the agent caches all team names. It can then intelligently map "Lakers" to "Los Angeles Lakers," successfully finding the data. Without this, queries that aren't exact matches would likely fail. It makes the agent more forgiving and a lot smarter.

Example:

# In games.yml
GAME_TYPE:
name: Game Type
type: VARCHAR
column: GAME_TYPE
desc: Type of the game (e.g., regular, playoff)
fetch: true # Cache values like "regular" and "playoff" for better matching.

HOME_TEAM:
name: Home Team
type: VARCHAR
column: HOME_TEAM
desc: Name of the home team
fetch: true # Cache all team names to handle user variations like "Lakers" vs "Los Angeles Lakers".

3. Enabling Complex Questions with joins​

Once your keys are defined, you must specify the joins in the table_info section. This is where you explicitly tell the agent how to connect the tables using those keys.

Why It Matters: Defining joins is the final step that unlocks the ability to ask complex, cross-table questions.

  • Without Joins, you can only ask questions about a single table:

    • "Show me all the game IDs." (Works, GAMES table only)
    • "List all team names." (Works, TEAMS table only)
    • "Show me games played by the Lakers." (Fails, requires both tables)
  • With Joins, you can ask rich, analytical questions:

    • "What was the score of the last game between the Lakers and the Celtics?"
    • "Show me all the playoff games from the 2023 season where the home team scored over 120 points."

The join instructions are the agent's guide to combining information from different sources to answer a single, complex question.

Example:

# In games.yml
table_info:
- table: GAMES
joins:
# This line tells the agent how to connect GAMES to TEAMS for the home team
- join: TEAMS
on: TEAMS.ID = GAMES.HOME_TEAM_ID
# This line tells the agent how to connect GAMES to TEAMS again for the visitor team
- join: TEAMS
on: TEAMS.ID = GAMES.VISITOR_TEAM_ID

The Complete Picture: Schema Template and Example​

When you apply all these refinements, you get a robust schema that serves as a solid foundation for your semantics.

📄 Schema Template Reference​

<SCHEMA_NAME>:
subject_area: <Domain>
table_info:
- table: <TABLE_NAME>
joins:
- join: <OTHER_TABLE>
on: <JOIN_CONDITION>
columns:
<COLUMN_NAME>:
name: <Readable Name>
type: <Data Type>
column: <DB Column>
desc: <Description>
primary_key: <true|false>
foreign_key: <true|false>
fetch: <true|false>

Full Example: games.yml​

GAMES:
subject_area: Sports
table_info:
- table: GAMES
joins:
- join: TEAMS
on: TEAMS.ID = GAMES.HOME_TEAM_ID
- join: TEAMS
on: TEAMS.ID = GAMES.VISITOR_TEAM_ID

columns:
GAME_ID:
name: Game ID
type: VARCHAR
column: GAME_ID
desc: Unique identifier for the game
primary_key: true
GAME_DATE:
name: Game Date
type: DATE
column: GAME_DATE
desc: Date when the game was played
SEASON:
name: Season
type: VARCHAR
column: SEASON
desc: Season during which the game took place
GAME_TYPE:
name: Game Type
type: VARCHAR
column: GAME_TYPE
desc: Type of the game (e.g., regular, playoff)
fetch: true
STATUS:
name: Status
type: VARCHAR
column: STATUS
desc: Current status of the game
fetch: true
HOME_TEAM_ID:
name: Home Team ID
type: VARCHAR
column: HOME_TEAM_ID
desc: Identifier for the home team
foreign_key: true
HOME_TEAM:
name: Home Team
type: VARCHAR
column: HOME_TEAM
desc: Name of the home team
fetch: true
HOME_SCORE:
name: Home Score
type: NUMBER
column: HOME_SCORE
desc: Score of the home team
VISITOR_TEAM_ID:
name: Visitor Team ID
type: VARCHAR
column: VISITOR_TEAM_ID
desc: Identifier for the visitor team
foreign_key: true
VISITOR_TEAM:
name: Visitor Team
type: VARCHAR
column: VISITOR_TEAM
desc: Name of the visitor team
fetch: true
VISITOR_POINT:
name: Visitor Points
type: NUMBER
column: VISITOR_POINT
desc: Points scored by the visitor team