Olist Store: Exploratory Data Analysis using SQL

Maulana Yusuf
8 min readJul 26, 2023
Image Credits: Kaggle

Table of Content

  1. Introduction
  2. About Dataset
  3. Data Preparation
  4. Data Exploration

Introduction

Olist is the largest department store in Brazilian marketplaces. Olist connects small businesses from all over Brazil to channels without hassle and with a single contract. Those merchants are able to sell their products through the Olist Store and ship them directly to the customers using Olist logistics partners. See more on our website: www.olist.com

After a customer purchases the product from Olist Store a seller gets notified to fulfill that order. Once the customer receives the product, or the estimated delivery date is due, the customer gets a satisfaction survey by email where he can give a note about the purchase experience and write down some comments.

About Dataset

The dataset contains information on 100k orders made at multiple marketplaces in Brazil from 2016 to 2018. Its features allow for viewing orders from various dimensions, including order status, price, payment and freight performance, customer location, product attributes, and customer reviews. Additionally, we have released a geolocation dataset that maps Brazilian zip codes to latitude/longitude coordinates. You can download the dataset here.

The data is divided into multiple datasets for better understanding and organization. Refer to the following data schema when working with it:

Image Credits: Kaggle
  • olist_customer_dataset: This dataset has information about the customer and its location. Use it to identify unique customers in the orders dataset and to find the order's delivery location.
  • olist_geolocation_dataset: This dataset has information on Brazilian zip codes and their lat/long coordinates. Use it to plot maps and find distances between sellers and customers.
  • olist_order_items_dataset: This dataset includes data about the items purchased within each order.
  • olist_order_payments_dataset: This dataset includes data about the order payment options.
  • olist_orders_dataset: This is the core dataset. From each order, you might find all the other information.
  • olist_products_dataset: This dataset includes data about the products sold by Olist.
  • olist_sellers_dataset: This dataset includes data about the sellers that fulfilled orders made at Olist. Use it to find the seller's location and to identify which seller fulfilled each product.
  • product_category_name_translation: Translates the product_category_name to english.

Data Preparation

Once we’ve delved into the data, it’s time to leverage the power of PostgreSQL in PGAdmin 4 for our analysis. Let’s kick things off by crafting our very own database!

Create Database

We need to first create a database. Right-click on ‘Databases’ and ‘Add a Database’. Give it a name and hit SAVE. If you expand Databases, you should see your new database in the list.

Import Dataset

Let’s take our data game to the next level by importing it using some slick queries. Get ready to flex your database skills: simply right-click on the databases you’ve created earlier, and select the “Query Tool” option.

Congrats, your dataset is now imported. Time for the fun stuff!

Data Exploration

We’ll dive deep into answering critical business questions.

1. How much is the total revenue, orders, items sold, and customers over time?

As we observe the data above, we have a total revenue of $20.308.135, consisting of 98.665 orders, 117.601 items sold, and 95.419 unique customers.

Interactive Dashboard Preview

Additionally, I’ve developed an interactive dashboard that facilitates in-depth analysis for each month. It enables comparisons between the current month and the previous month, the current year and the previous year, as well as the same month in the previous year.

2. What is the growth rate of the total revenue, orders, items sold, and customers over the month?

For example, we are comparing March 2018 and February 2018. The growth rate of total revenue, number of orders, number of items sold, and number of customers in March 2018 compared to February 2018 indicates a positive trend, showcasing a potential business expansion.

  • The total revenue in March 2018 increased by 13% compared to February 2018. This growth suggests that the company experienced higher sales and generated more revenue during this period.
  • In March 2018, we observed a notable 7.4% surge in the number of orders as compared to February 2018. This upward trend signifies a discernible uptick in customer demand and signifies an expanding customer base.
  • The number of items sold in March 2018 witnessed a substantial growth of 7.9% compared to February, indicating increased product popularity and higher sales volume.
  • During March 2018, we witnessed a noteworthy 8.6% rise in the number of unique customers in comparison to February 2018. This finding suggests an expanding customer base and hints at the potential implementation of enhanced customer retention strategies.

3. What is the trend in total revenue, orders, items sold, and customers across different weekdays over time?

The data provides a comprehensive analysis of various metrics for each day of the week. It is evident that customers exhibit a preference for making purchases on weekdays rather than weekends (Saturday and Sunday). Notably, Saturday consistently records the lowest values across all metrics, with Sunday following as the second lowest. In contrast, Monday consistently demonstrates the highest values across all metrics, suggesting a higher level of activity or productivity on that day.

4. How do different product categories perform in terms of total revenue, orders, items sold, and customers?

The data provided above indicates that Bed Bath Table stands out as the leading product category across all metrics, presenting a significant opportunity for further growth. However, it is crucial to note that this high-level insight alone may not suffice. Therefore, to gain a more comprehensive understanding, we should analyze the performance of the top product category on a monthly basis.

Interactive Dashboard Preview

By utilizing the dashboard I have developed, we can conduct a monthly analysis of the product category’s performance. This interactive tool facilitates comparisons between the current month and the previous month, the current year and the previous year, as well as the corresponding month in the previous year.

5. What are the preferred payment types?

The data reveals that the most popular payment method among customers is Credit Card, generating the highest total revenue, total orders, total items sold, and total number of customers. This indicates that Credit Card is the preferred choice for a majority of customers, highlighting its convenience and widespread usage.

6. How do the delivery types affect the revenue, orders, items sold and customers?

The data highlights the significant revenue, order volume, items sold, and customer base associated with inter-state delivery. This indicates the successful expansion of the business across state boundaries, tapping into a larger market and attracting customers beyond the immediate locality.

Conversely, within-state delivery demonstrates lower values across all metrics when compared to inter-state delivery. This emphasizes the significance of implementing localized strategies to foster growth and capitalize on opportunities within the immediate area.

7. How does delivery timeliness impact revenue, orders, items sold, and customers?

The data clearly demonstrates that on-time delivery has a substantial positive impact on various performance metrics. While on-time delivery plays a crucial role in enhancing overall performance, it is important not to underestimate the negative impact of late delivery.

Companies should prioritize and streamline their delivery processes to meet customer expectations and avoid potential revenue losses associated with late deliveries. Identifying and addressing the root causes of late delivery can help improve overall performance and customer satisfaction.

8. How does state performance based on customer and seller impact total revenue, number of orders, number of items sold, and number of customers?

Sao Paulo (SP) state is a key driver of customer-based performance, significantly contributing to the overall metrics of total revenue, orders, items sold, and customer engagement. Additionally, Rio de Janeiro (RJ) and Minas Gerais (MG) also play substantial roles in the overall performance.

When it comes to seller-based performance, Sao Paulo (SP) state continues to have a significant impact on the overall results. Moreover, states like Parana (PR) and Minas Gerais (MG) have proven to be noteworthy contributors to the overall performance as well.

9. How does city performance based on customer and seller impact total revenue, number of orders, number of items sold, and number of customers?

Sao Paulo city is a key driver of customer-based performance, significantly contributing to the overall metrics of total revenue, orders, items sold, and customer engagement. Additionally, Rio de Janeiro and Belo Horizonte also play substantial roles in the overall performance.

When it comes to seller-based performance, Sao Paulo city continues to have a significant impact on the overall results. Moreover, cities like Ibitinga and Curitiba have proven to be noteworthy contributors to the overall performance as well.

10. What is the relationship between the number of reviews and the average review score? Are products with a higher number of reviews receiving more average review scores?

Based on the results, it appears that the highest number of reviews did not necessarily correspond to the highest average review score. To further investigate and provide a comprehensive answer to the question, a scatter plot was created in Tableau, allowing us to visually analyze the relationship between the number of reviews and the average review score.

Interactive Dashboard Preview

This interactive scatter plot, created using Tableau, shows the relationship between the number of reviews and the average review score for May 2018. The size of each bubble represents the corresponding revenue value, offering additional insights into the data. By utilizing Tableau’s interactive features, we can easily modify the metrics and date range for further analysis.

Upon examining the scatter plot, we observe a weak positive correlation between the number of reviews and the average review score. This implies that, in general, a higher number of reviews tends to be associated with higher average review scores.

--

--