System Design (Medium)
Let's explore a scenario involving data transformation. Imagine you're receiving a stream of customer data from various sources. This data includes customer IDs, names, email addresses, and purchase histories. However, the data is inconsistent: some sources use different date formats, some have missing fields, and others abbreviate state names. Your task is to design a robust and efficient system to clean and transform this data into a consistent format suitable for analysis. Specifically:

Data Cleaning: How would you handle missing values, inconsistent date formats (e.g., MM/DD/YYYY vs. YYYY-MM-DD), and variations in state names (e.g., CA vs. California)? Provide code examples (Python is preferred) demonstrating how you would address these issues.

Data Transformation: How would you transform the data to ensure consistency? For example, you might need to convert all dates to a standard format, expand state abbreviations to their full names, and ensure all customer IDs are in a uniform format.

Scalability: How would you design the system to handle a large volume of data (e.g., millions of records per day)? Consider the technologies and architectures you would use to ensure scalability and performance. Think about potential bottlenecks and how to address them.

Error Handling: Describe how you would implement error handling and logging to identify and address data quality issues. What metrics would you track to monitor the quality of the transformed data?

For instance, suppose you receive the following data snippets:

Source 1: {"customer_id": 123, "name": "Alice", "email": "alice@example.com", "purchase_date": "01/01/2023", "state": "CA"}
Source 2: {"CustomerID": 456, "Name": "Bob", "Email": "bob@example.com", "PurchaseDate": "2023-01-01", "State": "California"}

How would your system handle these variations and transform them into a unified format like this:

{"customer_id": 123, "name": "Alice", "email": "alice@example.com", "purchase_date": "2023-01-01", "state": "California"}
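One possible starting point for the cleaning and transformation parts of the question is sketched below. It is a minimal illustration, not a complete answer: the names `FIELD_ALIASES`, `STATE_NAMES`, `DATE_FORMATS`, `normalize_date`, `normalize_state`, and `normalize_record` are hypothetical, the state table covers only three states, and only the two date formats named in the question are recognized. Missing or unparseable fields are set to `None` and reported back to the caller, so a surrounding pipeline could log them or count them as a data quality metric.

```python
from datetime import datetime

# Hypothetical lookup tables; a real system would cover every source and state.
FIELD_ALIASES = {
    "customer_id": "customer_id",
    "customerid": "customer_id",
    "name": "name",
    "email": "email",
    "purchase_date": "purchase_date",
    "purchasedate": "purchase_date",
    "state": "state",
}
STATE_NAMES = {"CA": "California", "NY": "New York", "TX": "Texas"}
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")  # the two formats named in the question
REQUIRED_FIELDS = ("customer_id", "name", "email", "purchase_date", "state")


def normalize_date(value):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except (ValueError, AttributeError):
            # ValueError: wrong format; AttributeError: value is not a string.
            continue
    return None


def normalize_state(value):
    """Expand a known abbreviation; pass other names through in title case."""
    if not isinstance(value, str) or not value.strip():
        return None
    text = value.strip()
    return STATE_NAMES.get(text.upper(), text.title())


def normalize_customer_id(value):
    """Coerce the ID to an int, or return None if that is not possible."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def normalize_record(raw):
    """Map one source record onto the unified schema.

    Returns (record, issues), where issues lists the required fields that
    were missing or could not be parsed.
    """
    record = {field: None for field in REQUIRED_FIELDS}
    for key, value in raw.items():
        canonical = FIELD_ALIASES.get(key.replace(" ", "").lower())
        if canonical is not None:
            record[canonical] = value

    record["customer_id"] = normalize_customer_id(record["customer_id"])
    record["purchase_date"] = normalize_date(record["purchase_date"])
    record["state"] = normalize_state(record["state"])
    if isinstance(record["name"], str):
        record["name"] = record["name"].strip() or None
    if isinstance(record["email"], str):
        record["email"] = record["email"].strip().lower() or None

    issues = [field for field in REQUIRED_FIELDS if record[field] is None]
    return record, issues
```

Under these assumptions, passing the Source 1 record through `normalize_record` yields the unified record from the question with an empty issues list, and Source 2 yields the same shape with `customer_id` 456 and name Bob. Because `normalize_record` is a pure function of a single record, it could be applied in parallel across partitions of the input, which is one way to approach the scalability part of the question.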