These hands-on exercises apply the principles of data transformation, including cleaning, structuring, enriching, and standardizing data. Practicing on a small sample dataset solidifies understanding of common transformation techniques. Writing code is not the primary focus; the emphasis is on applying transformation logic, though a few optional code sketches are included along the way for illustration.

Scenario: Preparing Customer Data

Imagine you have extracted the following customer data from a legacy system. The goal is to clean and standardize this data before loading it into a new marketing database.

Here is the raw data:

| CustID | FullName | RegistrationDate | Email | State | Orders |
|---|---|---|---|---|---|
| 101 | Alice Smith | 2022-05-15 | alice.s@mail.com | CA | 5 |
| 102 | Bob Johnson | 11/20/21 | bob | ny | 3 |
| 103 | | 2023-01-10 | charlie@mail.com | Texas | -2 |
| 104 | Diana Lee | March 3, 2022 | diana.lee@mail | ca | 10 |
| 105 | Evan Green | 2022-08-01 | | FL | 7 |
| 106 | Fiona Adams | 07/10/23 | fiona.a@mail.com | Nevada | 0 |
| 107 | George Miller | 2021-12-25 | george@mail | CA | none |

This data has several issues typical of raw extracts: missing values, inconsistent date formats, inconsistent state representations, invalid email formats, and potentially incorrect order counts. Our task is to apply transformations that fix these problems.

Activity 1: Data Cleaning

First, let's address the most obvious errors and missing information.

Missing values: Identify rows with missing FullName and Email. How would you handle these? A common approach is to replace them with a default value such as 'Unknown' or NULL, depending on the target system's requirements. Here we will use 'Unknown' for FullName and leave Email empty where it is truly missing (email formats are validated later).

Incorrect values: Look at the Orders column. CustID 103 has -2 orders, which is almost certainly an error, and CustID 107 has 'none'. How should these be handled? We might decide that negative order counts are invalid and should be set to 0 or flagged for investigation. Text values like 'none' should definitely be converted to a numerical representation, most likely 0.

Applying the cleaning (sketched in code after the table below):

- CustID 103: set FullName to 'Unknown'; set Orders to 0.
- CustID 105: set Email to an empty string or NULL (represented as empty here).
- CustID 107: set Orders to 0.

Our data now looks like this:

| CustID | FullName | RegistrationDate | Email | State | Orders |
|---|---|---|---|---|---|
| 101 | Alice Smith | 2022-05-15 | alice.s@mail.com | CA | 5 |
| 102 | Bob Johnson | 11/20/21 | bob | ny | 3 |
| 103 | Unknown | 2023-01-10 | charlie@mail.com | Texas | 0 |
| 104 | Diana Lee | March 3, 2022 | diana.lee@mail | ca | 10 |
| 105 | Evan Green | 2022-08-01 | | FL | 7 |
| 106 | Fiona Adams | 07/10/23 | fiona.a@mail.com | Nevada | 0 |
| 107 | George Miller | 2021-12-25 | george@mail | CA | 0 |
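Although writing code is not the focus of this exercise, it can help to see roughly how these cleaning rules might be expressed. The following is a minimal Python sketch, not a prescribed implementation: the list-of-dicts record structure and the helper name clean_record are assumptions made for illustration, and only a few of the rows are shown.

```python
# Illustrative sketch only: applies the Activity 1 cleaning rules.
# The record structure and helper name are assumptions, not a required design.
raw_records = [
    {"CustID": 101, "FullName": "Alice Smith", "RegistrationDate": "2022-05-15",
     "Email": "alice.s@mail.com", "State": "CA", "Orders": "5"},
    {"CustID": 103, "FullName": None, "RegistrationDate": "2023-01-10",
     "Email": "charlie@mail.com", "State": "Texas", "Orders": "-2"},
    {"CustID": 107, "FullName": "George Miller", "RegistrationDate": "2021-12-25",
     "Email": "george@mail", "State": "CA", "Orders": "none"},
]

def clean_record(record):
    """Apply the Activity 1 cleaning rules to a single record."""
    cleaned = dict(record)

    # Missing FullName -> 'Unknown'; missing Email -> empty string.
    cleaned["FullName"] = record.get("FullName") or "Unknown"
    cleaned["Email"] = record.get("Email") or ""

    # Orders: non-numeric text (e.g. 'none') and negative counts become 0.
    try:
        orders = int(record.get("Orders"))
    except (TypeError, ValueError):
        orders = 0
    cleaned["Orders"] = max(orders, 0)

    return cleaned

cleaned_records = [clean_record(r) for r in raw_records]
```

Applied to the full table, these rules produce the cleaned values shown above: the missing FullName becomes 'Unknown', the missing Email becomes empty, and both -2 and 'none' in Orders become 0.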
Activity 2: Data Validation and Standardization

Next, let's enforce consistency in formats and validate certain fields.

Email validation: Examine the Email column. CustID 102 ('bob'), CustID 104 ('diana.lee@mail'), and CustID 107 ('george@mail') look incomplete or invalid. A simple validation rule might check for the presence of an '@' symbol and a period '.' after the '@'. Emails failing this check could be flagged for review or set to a default invalid marker; here we set them to empty if they fail basic validation.

Date standardization: The RegistrationDate column uses multiple formats ('YYYY-MM-DD', 'MM/DD/YY', 'Month Day, YYYY'). We need to standardize it, for example to the 'YYYY-MM-DD' format:

- '11/20/21' becomes '2021-11-20'
- 'March 3, 2022' becomes '2022-03-03'
- '07/10/23' becomes '2023-07-10'

State standardization: The State column contains 'CA', 'ny', 'Texas', 'ca', 'FL', 'Nevada'. We need to standardize it to a consistent format, such as the two-letter uppercase abbreviation. This requires a mapping (lookup) for full state names and lowercase variants:

- 'ny' becomes 'NY'
- 'Texas' becomes 'TX'
- 'ca' becomes 'CA'
- 'Nevada' becomes 'NV'

Applying validation and standardization (a code sketch of these rules appears at the end of this section):

- CustID 102: Email 'bob' fails validation; set to empty. Date becomes '2021-11-20'. State 'ny' becomes 'NY'.
- CustID 103: State 'Texas' becomes 'TX'.
- CustID 104: Email 'diana.lee@mail' fails validation (no period after the '@'); set to empty. Date becomes '2022-03-03'. State 'ca' becomes 'CA'.
- CustID 106: Date '07/10/23' becomes '2023-07-10'. State 'Nevada' becomes 'NV'.
- CustID 107: Email 'george@mail' fails validation; set to empty.

The data after this stage:

| CustID | FullName | RegistrationDate | Email | State | Orders |
|---|---|---|---|---|---|
| 101 | Alice Smith | 2022-05-15 | alice.s@mail.com | CA | 5 |
| 102 | Bob Johnson | 2021-11-20 | | NY | 3 |
| 103 | Unknown | 2023-01-10 | charlie@mail.com | TX | 0 |
| 104 | Diana Lee | 2022-03-03 | | CA | 10 |
| 105 | Evan Green | 2022-08-01 | | FL | 7 |
| 106 | Fiona Adams | 2023-07-10 | fiona.a@mail.com | NV | 0 |
| 107 | George Miller | 2021-12-25 | | CA | 0 |

Activity 3: Data Structuring and Enrichment

Finally, let's restructure the data slightly to better fit the target system's schema. Assume the target database requires separate FirstName and LastName fields instead of FullName.

Splitting fields: We need to split the FullName column into FirstName and LastName. We can split the string at the first space: for 'Alice Smith', FirstName is 'Alice' and LastName is 'Smith'. What about 'Unknown'? We can set both FirstName and LastName to 'Unknown'.

Adding fields (enrichment): Let's add a simple derived field. Suppose we want to categorize customers by registration year. We can extract the year from the standardized RegistrationDate and create a new RegistrationYear column.

Applying structuring and enrichment (also covered in the sketches at the end of this section):

- Split FullName into FirstName and LastName.
- Create RegistrationYear from RegistrationDate.

The final transformed data, ready for loading:

| CustID | FirstName | LastName | RegistrationDate | RegistrationYear | Email | State | Orders |
|---|---|---|---|---|---|---|---|
| 101 | Alice | Smith | 2022-05-15 | 2022 | alice.s@mail.com | CA | 5 |
| 102 | Bob | Johnson | 2021-11-20 | 2021 | | NY | 3 |
| 103 | Unknown | Unknown | 2023-01-10 | 2023 | charlie@mail.com | TX | 0 |
| 104 | Diana | Lee | 2022-03-03 | 2022 | | CA | 10 |
| 105 | Evan | Green | 2022-08-01 | 2022 | | FL | 7 |
| 106 | Fiona | Adams | 2023-07-10 | 2023 | fiona.a@mail.com | NV | 0 |
| 107 | George | Miller | 2021-12-25 | 2021 | | CA | 0 |

Summary

Through these activities, we simulated several fundamental transformation steps:

- Cleaning: handled missing values and corrected obviously incorrect data.
- Validation: checked email formats against a simple rule.
- Standardization: converted dates and state names to consistent formats.
- Structuring: split a single field into multiple fields.
- Enrichment: derived a new field (RegistrationYear) from existing data.

This transformed dataset is now much more consistent, reliable, and appropriately structured for loading into the target marketing database. While this example used simple rules and a small dataset, the same principles apply to the larger, more complex transformation tasks encountered in ETL processes. The key is to understand the target requirements and apply the necessary cleaning, validation, standardization, structuring, and enrichment steps systematically.
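For readers who want to see Activity 2 expressed programmatically, here is a minimal Python sketch of the validation and standardization rules. It is illustrative only; the helper names, the partial STATE_MAP lookup, and the set of accepted date formats are assumptions made for this example.

```python
from datetime import datetime

# Illustrative sketch of the Activity 2 rules; names and formats are assumptions.
STATE_MAP = {"ny": "NY", "texas": "TX", "ca": "CA", "nevada": "NV", "fl": "FL"}
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%y", "%B %d, %Y")  # e.g. 2022-05-15, 11/20/21, March 3, 2022

def validate_email(email):
    """Basic check: an '@' followed later by a '.'; failures become empty strings."""
    if email and "@" in email and "." in email.split("@", 1)[1]:
        return email
    return ""

def standardize_date(raw_date):
    """Try each known format and return an ISO 'YYYY-MM-DD' string."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw_date, fmt).strftime("%Y-%m-%d")
        except ValueError:
            pass
    raise ValueError(f"Unrecognized date format: {raw_date!r}")

def standardize_state(raw_state):
    """Map full names and lowercase variants to two-letter uppercase codes."""
    return STATE_MAP.get(raw_state.strip().lower(), raw_state.strip().upper())

print(validate_email("diana.lee@mail"))   # -> '' (no period after the '@')
print(standardize_date("March 3, 2022"))  # -> '2022-03-03'
print(standardize_state("Texas"))         # -> 'TX'
```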
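Finally, the structuring and enrichment steps from Activity 3 (splitting FullName at the first space and deriving RegistrationYear) might be sketched as follows; again, the function and field names are illustrative assumptions rather than a required design.

```python
# Illustrative sketch of the Activity 3 rules; field and function names are assumptions.
def split_full_name(full_name):
    """Split at the first space; single-token names (e.g. 'Unknown') fill both fields."""
    first, _, last = full_name.partition(" ")
    return first, (last or first)

def restructure(record):
    """Return a new record shaped for the target schema, with RegistrationYear derived."""
    first, last = split_full_name(record["FullName"])
    return {
        "CustID": record["CustID"],
        "FirstName": first,
        "LastName": last,
        "RegistrationDate": record["RegistrationDate"],
        "RegistrationYear": int(record["RegistrationDate"][:4]),  # assumes YYYY-MM-DD
        "Email": record["Email"],
        "State": record["State"],
        "Orders": record["Orders"],
    }

print(restructure({"CustID": 103, "FullName": "Unknown", "RegistrationDate": "2023-01-10",
                   "Email": "charlie@mail.com", "State": "TX", "Orders": 0}))
# -> FirstName 'Unknown', LastName 'Unknown', RegistrationYear 2023
```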