Skip to main content

Data Transformation for KPI

To enable effective KPI tracking and analytics, the transformation layer should:

  • Extract and clean raw data to ensure consistency and reliability.

  • Compute intermediate columns (e.g., derived metrics) based on rules.

  • Pre-calculate useful fields to make downstream view generation more efficient.

Example: NBA Dataset

The NBA dataset is used here as an example to illustrate the cleansing and transformation process.

To support accurate and performant KPI calculations related to player performance (e.g., Points per Minute, Efficiency per Time Played), raw playing time in the format "MM:SS" needs to be normalized.

Transformation

The column Minutes (string format like "34:15") has been converted into total Seconds as a numeric value. It enables calculation of KPIs such as:

  • Points per Minute = POINTS / (SECONDS / 60)

  • Rebounds per Minute

  • Player efficiency metrics normalized by time played

  • Handles malformed or missing values gracefully using TRY_TO_NUMBER and default 0.

Before transformation

alt text

Transformation Logic

CASE 
WHEN REGEXP_LIKE(MINUTES, '^[0-9]+(\.[0-9]+)?:[0-5][0-9]$') THEN
TRY_TO_NUMBER(SPLIT_PART(SPLIT_PART(MINUTES, ':', 1), '.', 1)) * 60 +
TRY_TO_NUMBER(SPLIT_PART(MINUTES, ':', 2))
ELSE 0
END AS SECONDS

After transformation

alt text

Data Cleaning and Standardization

To ensure the data fed is clean, consistent, and reliable for analytics and KPI computation, we can apply the following cleansing and standardization rules across our NBA data pipeline.

RuleDescriptionNBA Data Example
Uniform UnitsUse consistent units across the dataset (e.g., USD, kg, seconds).Converted MINUTES (string in "MM:SS" format) to numeric SECONDS in V_BOX_SCORE using a CASE + REGEXP logic.
Standard Date FormatStore dates in the format YYYY-MM-DD HH:MM:SS, ideally in UTC.All timestamps like START_TIME and INSERTED_DATE in views are stored as TIMESTAMP_NTZ(9).
Boolean NormalizationNormalize boolean fields to consistent values like Y/N.Fields like IS_HOME_WIN can use Y/N.
Remove RedundancyEliminate duplicate rows and conflicting records.Creating final tables (BOX_SCORE, GAMES, TEAMS) from views ensures deduplicated, consistent records per game ID.
Case StandardizationEnforce consistent casing for textual fields.Fields like PLAYER_NAME, START_POSITION, CITY, and TEAM_NAME are standardized to Title Case or UPPER as needed.
Consistent CategoriesValidate categorical fields (e.g., position, game type) against allowed domain values.START_POSITION is expected to be from a known set (G, F, C, etc.). Invalid entries can be cleaned or flagged.