For this challenge, we will use the Brazilian E-commerce Public Dataset by Olist from Kaggle. The dataset contains multiple tables, each with specific attributes related to e-commerce activities.
-
Customers Table
customer_id: Unique identifier for each customercustomer_unique_id: Unique identifier for each unique customercustomer_zip_code_prefix: Zip code prefix of the customercustomer_city: City of the customercustomer_state: State of the customer
-
Geolocation Table
geolocation_zip_code_prefix: Zip code prefix of the locationgeolocation_lat: Latitude coordinate of the locationgeolocation_lng: Longitude coordinate of the locationgeolocation_city: City of the locationgeolocation_state: State of the location
-
Order Items Table
order_id: Unique identifier for each orderorder_item_id: Identifier for each item in the orderproduct_id: Unique identifier for each productseller_id: Unique identifier for each sellershipping_limit_date: Last date for shipping the itemprice: Price of the itemfreight_value: Freight value of the item
-
Order Payments Table
order_id: Unique identifier for each orderpayment_sequential: Sequential number of the paymentpayment_type: Type of paymentpayment_installments: Number of installments for paymentpayment_value: Value of the payment
-
Order Reviews Table
review_id: Unique identifier for each revieworder_id: Unique identifier for each orderreview_score: Score given in the reviewreview_comment_title: Title of the review commentreview_comment_message: Message of the review commentreview_creation_date: Creation date of the reviewreview_answer_timestamp: Timestamp of the review answer
-
Orders Table
order_id: Unique identifier for each ordercustomer_id: Unique identifier for each customerorder_status: Status of the orderorder_purchase_timestamp: Timestamp of the order purchaseorder_approved_at: Timestamp when the order was approvedorder_delivered_carrier_date: Date when the order was delivered to the carrierorder_delivered_customer_date: Date when the order was delivered to the customerorder_estimated_delivery_date: Estimated delivery date of the order
-
Products Table
product_id: Unique identifier for each productproduct_category_name: Name of the product categoryproduct_name_lenght: Length of the product nameproduct_description_lenght: Length of the product descriptionproduct_photos_qty: Number of photos of the productproduct_weight_g: Weight of the product in gramsproduct_length_cm: Length of the product in centimetersproduct_height_cm: Height of the product in centimetersproduct_width_cm: Width of the product in centimeters
-
Sellers Table
seller_id: Unique identifier for each sellerseller_zip_code_prefix: Zip code prefix of the sellerseller_city: City of the sellerseller_state: State of the seller
-
Product Category Translation Table
product_category_name: Name of the product categoryproduct_category_name_english: Name of the product category in English
-
Data Exploration
- Load the dataset and explore the structure and relationships between tables.
- Provide a summary of each table and its attributes.
-
Customer Analysis
- Analyze the distribution of customers by city and state.
- Identify the top 10 cities with the highest number of customers.
-
Sales Analysis
- Determine the total sales value for each product category.
- Identify the top 10 best-selling products based on the number of items sold.
-
Order Analysis
- Analyze the order status distribution.
- Calculate the average delivery time for orders.
-
Review Analysis
- Analyze the distribution of review scores.
- Identify common themes in review comments using text analysis.
-
Payment Analysis
- Analyze the distribution of payment types.
- Calculate the average payment value for different payment types.
-
Geolocation Analysis
- Visualize the geolocation data to identify regions with high order density.
- Analyze the correlation between geolocation and delivery times.