Data Transformation for KPI
To enable effective KPI tracking and analytics, the transformation layer should:
-
Extract and clean raw data to ensure consistency and reliability.
-
Compute intermediate columns (e.g., derived metrics) based on rules.
-
Pre-calculate useful fields to make downstream view generation more efficient.
Example: NBA Dataset
The NBA dataset is used here as an example to illustrate the cleansing and transformation process.
To support accurate and performant KPI calculations related to player performance (e.g., Points per Minute, Efficiency per Time Played), raw playing time in the format "MM:SS" needs to be normalized.
Transformation
The column Minutes
(string format like "34:15") has been converted into total Seconds
as a numeric value. It enables calculation of KPIs such as:
-
Points per Minute = POINTS / (SECONDS / 60)
-
Rebounds per Minute
-
Player efficiency metrics normalized by time played
-
Handles malformed or missing values gracefully using TRY_TO_NUMBER and default 0.
Before transformation
Transformation Logic
CASE
WHEN REGEXP_LIKE(MINUTES, '^[0-9]+(\.[0-9]+)?:[0-5][0-9]$') THEN
TRY_TO_NUMBER(SPLIT_PART(SPLIT_PART(MINUTES, ':', 1), '.', 1)) * 60 +
TRY_TO_NUMBER(SPLIT_PART(MINUTES, ':', 2))
ELSE 0
END AS SECONDS
After transformation
Data Cleaning and Standardization
To ensure the data fed is clean, consistent, and reliable for analytics and KPI computation, we can apply the following cleansing and standardization rules across our NBA data pipeline.
Rule | Description | NBA Data Example |
---|---|---|
Uniform Units | Use consistent units across the dataset (e.g., USD, kg, seconds). | Converted MINUTES (string in "MM:SS" format) to numeric SECONDS in V_BOX_SCORE using a CASE + REGEXP logic. |
Standard Date Format | Store dates in the format YYYY-MM-DD HH:MM:SS , ideally in UTC. | All timestamps like START_TIME and INSERTED_DATE in views are stored as TIMESTAMP_NTZ(9) . |
Boolean Normalization | Normalize boolean fields to consistent values like Y/N . | Fields like IS_HOME_WIN can use Y/N . |
Remove Redundancy | Eliminate duplicate rows and conflicting records. | Creating final tables (BOX_SCORE , GAMES , TEAMS ) from views ensures deduplicated, consistent records per game ID. |
Case Standardization | Enforce consistent casing for textual fields. | Fields like PLAYER_NAME , START_POSITION , CITY , and TEAM_NAME are standardized to Title Case or UPPER as needed. |
Consistent Categories | Validate categorical fields (e.g., position, game type) against allowed domain values. | START_POSITION is expected to be from a known set (G , F , C , etc.). Invalid entries can be cleaned or flagged. |