data_transformation
The Goal of Data Transformation: From Raw Ingredients to a Finished Meal 🍳
Think of your raw data as a collection of random grocery items—flour, eggs, vegetables, spices. In this raw state, they aren't very useful. You can't learn much by looking at them on the counter.
The data transformation layer is the kitchen. It’s where you perform the essential work of cleaning, measuring, combining, and cooking those raw ingredients to prepare a finished meal. This transformed "meal"—a clean, reliable, and structured dataset—is what you can actually serve to your stakeholders for analysis and KPI tracking. Without this step, you're just serving them raw flour and expecting them to understand its potential.
The goal is to make the data reliable, consistent, and ready for calculation.
Deep Dive: A Practical Transformation Example
Let's elaborate on the common retail scenario of cleaning price data to make it usable for financial KPIs.
The Business Problem
Imagine you're a retail analyst asked to calculate the total revenue for the last quarter. You query the database and find that the PRICE column in your TRANSACTIONS table is a text field. The data is a mess:
| TRANSACTION_ID | PRICE | 
|---|---|
| 1001 | "$49.99" | 
| 1002 | "€25.00" | 
| 1003 | "1,200.50" | 
| 1004 | "FREE" | 
| 1005 | "Contact for price" | 
You cannot perform a SUM() aggregation on this column because it's text, not a number. The different currency symbols, commas, and non-numeric text make direct calculation impossible. This is where transformation is essential.
The Solution: A Step-by-Step Logic
The goal is to convert the messy PRICE string into a clean, numeric SALE_PRICE column.
TRY_TO_NUMBER(
    REGEXP_REPLACE(PRICE, '[^0-9.]', ''),
    10, 2 -- Specifies max digits and decimal places
) AS SALE_PRICE
Let's break this down:
Step 1: Isolate the Numbers with REGEXP_REPLACE
REGEXP_REPLACE(PRICE, '[^0-9.]', '')is the core cleaning function.- What it does: It uses a regular expression 
[^0-9.]to find any character that is not a digit (0-9) and not a decimal point (.). It then replaces these characters with nothing (''), effectively deleting them. - Why it's first: This sanitizes the string, stripping away currency symbols (
$,€), thousand-separators (,), and any other text, leaving only the characters needed to form a valid number."$49.99"becomes"49.99", and"1,200.50"becomes"1200.50". 
Step 2: Safely Convert to a Number with TRY_TO_NUMBER
- The 
TRY_TO_NUMBER()function attempts to convert the cleaned string into a numeric data type. - Why it's "TRY": This is a crucial safety feature. If the cleaning step results in a non-numeric string (like for the "FREE" or "Contact for price" values, which become empty strings), a standard 
TO_NUMBER()function would cause the entire query to fail with an error.TRY_TO_NUMBER()gracefully handles these errors by returningNULLinstead, ensuring your transformation process doesn't halt unexpectedly. 
The Impact: Unlocking Financial KPIs
After this transformation, your data is clean and ready for analysis. The new SALE_PRICE column enables you to reliably and accurately calculate critical business KPIs, such as:
- Total Revenue: 
SUM(SALE_PRICE * QUANTITY) - Average Transaction Value: 
AVG(SALE_PRICE) - Profit Margin: 
(SUM(SALE_PRICE) - SUM(COST_PRICE)) / SUM(SALE_PRICE) 
Data Cleaning and Standardization: A Detailed Checklist
Beyond specific transformations, applying a consistent set of cleaning rules across your entire dataset is vital for data integrity.
| Rule | Detailed Explanation | Retail Sales Example | 
|---|---|---|
| Uniform Units | Problem: Data from different sources may use different units (e.g., prices in USD vs. EUR, weights in kg vs. lbs). Aggregating these values directly leads to incorrect results. Solution: Establish a standard unit for each type of measure and convert all values to that standard during transformation.  | Convert all prices to a single currency (SALE_PRICE_USD) and all product weights to a single unit (WEIGHT_KG). | 
| Standard Date Format | Problem: Dates can be stored in many formats (MM/DD/YY, YYYY-MM-DD, Unix timestamps), making time-based analysis impossible.Solution: Convert all date and time fields to a standard TIMESTAMP format, ideally standardized to a single timezone like UTC. | A TRANSACTION_TIME column is converted to the YYYY-MM-DD HH:MM:SS format, allowing for accurate filtering by day, week, or month. | 
| Boolean Normalization | Problem: Boolean states can be represented in many ways (Y, Yes, 1, TRUE). The database treats these as distinct string values.Solution: Standardize all boolean fields to a single format, such as TRUE/FALSE or Y/N. | A field like IS_RETURNED is cleaned using a CASE statement so that all variations are converted to a consistent Y or N. | 
| Remove Redundancy | Problem: System glitches or API retries can create duplicate records (e.g., the same transaction logged twice). This inflates counts and sums. Solution: Use window functions like ROW_NUMBER() partitioned by a unique identifier (TRANSACTION_ID) to identify and remove duplicates. | A query selects only the records where ROW_NUMBER() OVER(PARTITION BY TRANSACTION_ID ORDER BY TIMESTAMP DESC) = 1 to keep only the most recent entry. | 
| Case Standardization | Problem: The database is case-sensitive. Nike, nike, and NIKE would be treated as three different brands, fragmenting your reports.Solution: Apply UPPER() or a Title Case function to all categorical text fields to ensure consistency. | Product BRAND and CATEGORY names are converted to Title Case (e.g., "Running Shoes") so that GROUP BY operations work correctly. | 
| Consistent Categories | Problem: Users may enter categorical data with typos or variations (Credit Card, CCard, Credit).Solution: Validate categorical fields against a known list of accepted values. Use CASE statements or a mapping table to standardize variations. | The PAYMENT_METHOD column is cleaned to ensure all values map to one of the approved categories: Credit Card, PayPal, Cash, or Gift Card. |