@Visnusah
Last active January 14, 2026 15:22
🚀 Data Science Final Exam Mastery Guide

Target Exam: BSc Computing (Year 2) - Data Science
Focus Areas: Statistics, Big Data (HDFS/MapReduce), Visualization, and Methodologies.


📊 Part 1: Analytics & Visualization

Tip

Visualization Rule of Thumb: Always choose the chart that makes comparison easiest for the human eye.

1. Choosing the Right Chart

  • Comparing Values: When comparing exact values across categories (e.g., Museum Attendance), use a Horizontal Bar Chart [cite: 2, 7]. Pie charts are usually a poor choice here because the eye compares angles and areas far less accurately than bar lengths [cite: 5, 44].
  • Analyzing Trends: To see how data changes over time (e.g., monthly defect rates), use a Line Chart [cite: 64, 65].
  • Distribution & Outliers: To understand the spread of data (e.g., purchase amounts) and spot anomalies, use a Box Plot [cite: 15, 68]. It shows the median, the quartiles, and any points that fall outside the whiskers.
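
As a rough illustration of what a box plot encodes, the sketch below computes the five-number summary and flags points beyond the common 1.5 × IQR whisker rule. The `purchases` list is hypothetical data, and the "inclusive" quartile method is just one of several conventions, so other tools may report slightly different quartiles.

```python
import statistics

def box_plot_summary(values):
    """Return the five-number summary plus points outside the 1.5*IQR fences."""
    data = sorted(values)
    # Quartiles via the "inclusive" method (one convention among several).
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": data[-1],
        "outliers": outliers,
    }

# Hypothetical purchase amounts with one obvious anomaly.
purchases = [12, 15, 18, 20, 22, 25, 27, 30, 250]
summary = box_plot_summary(purchases)
print(summary)
```

The single large purchase is the only value outside the fences, which is exactly the kind of point a box plot draws as an isolated dot.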

2. Analytics Maturity Model

  • Descriptive: Summarizing historical data to see "what happened" [cite: 25].
  • Diagnostic: Analyzing data to understand "why it happened" [cite: 26].
  • Predictive: Using historical data to forecast future trends (e.g., predicting hospital patient volume) [cite: 22, 23].
  • Prescriptive: Recommending specific actions based on the predictions [cite: 27].

3. The "Correlation Trap"

Warning

Correlation $\neq$ Causation
If ice cream sales and drowning incidents both rise, it does not mean ice cream causes drowning [cite: 104, 109].
Correct Answer: A confounding variable (like hot weather) causes both to increase [cite: 110].
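
The confounding effect can be sketched with made-up numbers. In the toy data below (all values hypothetical), ice cream sales and drownings are each generated from temperature alone, yet they end up strongly correlated with each other. The `pearson` helper is a hand-written Pearson correlation, included so the example needs only the standard library.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical daily temperatures (the confounder).
temps = [15, 18, 21, 24, 27, 30, 33, 36]

# Each series depends only on temperature plus a small fixed wobble;
# neither series is computed from the other.
ice_cream_sales = [3 * t + w for t, w in zip(temps, [2, -1, 3, -2, 1, -3, 2, -1])]
drownings = [0.2 * t + w for t, w in zip(temps, [0.3, -0.2, 0.1, -0.3, 0.2, 0.1, -0.1, 0.2])]

r = pearson(ice_cream_sales, drownings)
print(f"correlation between ice cream sales and drownings: {r:.2f}")
```

The high correlation comes entirely from the shared dependence on `temps`, not from any link between the two series.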


๐Ÿ—๏ธ Part 2: Big Data & Infrastructure

1. HDFS (Hadoop Distributed File System)

  • Replication Strategy: Data blocks (128 MB by default) are replicated across multiple nodes (default replication factor: 3).
    • Why? To provide Fault Tolerance [cite: 2, 16]. If one node crashes, the data can be retrieved from another replica [cite: 91].
  • Geographic Redundancy: Replicating blocks across different data centers protects against catastrophic site failures (like fire or power loss) [cite: 88, 90].
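
The fault-tolerance argument can be illustrated with a toy model. This is not how HDFS actually places blocks (its real placement policy is rack-aware); `place_replicas` and `readable_blocks` are hypothetical helpers that use a simple round-robin layout and assume the replication factor does not exceed the number of nodes.

```python
def place_replicas(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin (toy policy)."""
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

def readable_blocks(placement, failed_nodes):
    """Blocks that still have at least one replica on a healthy node."""
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    }

nodes = ["node1", "node2", "node3", "node4"]
blocks = ["blk_0", "blk_1", "blk_2", "blk_3", "blk_4"]

# With three replicas, losing one node leaves every block readable.
replicated = place_replicas(blocks, nodes, replication=3)
survivors_3x = readable_blocks(replicated, failed_nodes={"node1"})

# With a single copy, the same failure loses the blocks stored on node1.
single_copy = place_replicas(blocks, nodes, replication=1)
survivors_1x = readable_blocks(single_copy, failed_nodes={"node1"})

print(len(survivors_3x), "of", len(blocks), "blocks readable with 3 replicas")
print(len(survivors_1x), "of", len(blocks), "blocks readable with 1 replica")
```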

2. Database Architecture

  • Scaling for Growth: When a single SQL database becomes a bottleneck under millions of transactions, the standard remedy is to scale horizontally with a Distributed NoSQL architecture with sharding [cite: 77, 78].
    • Sharding: Splits data across multiple servers so that each one handles only part of the volume and write load [cite: 81].
  • Graph Databases: Best for data involving complex relationships, such as finding the shortest path between airports or social network connections [cite: 96, 98].
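
One common way to shard is to hash a record's key and take the result modulo the number of shards. The sketch below assumes that scheme; `shard_for`, the customer IDs, and the shard count of 4 are all illustrative, and real systems often use consistent hashing or range-based schemes instead.

```python
import hashlib

def shard_for(key, num_shards):
    """Map a string key to a shard index using a stable hash (illustrative scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

NUM_SHARDS = 4  # hypothetical cluster size
shards = {i: [] for i in range(NUM_SHARDS)}

# Hypothetical transaction records keyed by customer ID.
transactions = [(f"customer-{n}", n * 10) for n in range(20)]
for customer_id, amount in transactions:
    shards[shard_for(customer_id, NUM_SHARDS)].append((customer_id, amount))

for index, rows in shards.items():
    print(f"shard {index}: {len(rows)} rows")
```

Because the hash is deterministic, every read or write for a given customer is routed to the same server, and no single server has to hold the full dataset.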

3. MapReduce

  • The Mapper: The function that reads input records and emits intermediate key-value pairs, e.g., (genre, 1) for each record, which the reducer later sums into a count per genre [cite: 17].
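
A minimal in-memory sketch of the map, shuffle, and reduce steps for a genre count. It runs in a single process rather than on Hadoop, and the `title,genre` record format and the function names are assumptions made for illustration.

```python
from collections import defaultdict

def mapper(line):
    """Emit one (genre, 1) pair per input record of the form 'title,genre'."""
    _title, genre = line.split(",")
    yield genre.strip(), 1

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    """Sum the counts for a single key."""
    return key, sum(values)

# Hypothetical input records.
records = ["Alien,Sci-Fi", "Heat,Crime", "Arrival,Sci-Fi"]

intermediate = [pair for line in records for pair in mapper(line)]
genre_counts = dict(reducer(k, v) for k, v in shuffle(intermediate).items())
print(genre_counts)
```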

🧪 Part 3: Methodologies & Lifecycle

1. OSEMN Framework

  • Explore: The stage where you create visualizations and perform EDA (Exploratory Data Analysis) to understand patterns before modeling [cite: 30, 31].

2. CRISP-DM

  • Business Understanding: The initial phase where you define goals (e.g., "reduce downtime by 30%") and identify pain points [cite: 70, 71].

3. Hypothesis Testing

Important

The p-value Rule

  • Scenario: p-value = 0.08, Significance Level ($\alpha$) = 0.05 [cite: 53].
  • Conclusion: Since $0.08 > 0.05$, you fail to reject the null hypothesis [cite: 55]. The data do not provide sufficient evidence of an effect at the 5% level. This is not proof that no effect exists.
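
The decision rule is simple enough to write as a one-line comparison. The helper below is a sketch: `decide` is a hypothetical name, and it uses the common "reject when p ≤ α" convention.

```python
def decide(p_value, alpha=0.05):
    """Apply the p-value rule: reject H0 only when p is at or below alpha."""
    return "reject H0" if p_value <= alpha else "fail to reject H0"

exam_decision = decide(p_value=0.08, alpha=0.05)
print(exam_decision)
```

Note that the same p-value of 0.08 would lead to rejection at $\alpha = 0.10$, which is why the significance level must be fixed before looking at the data.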


🧮 Part 4: Mathematical Solvers (Corrected)

The calculations in the source text contained errors. Below are the corrected steps.

1. Linear Regression

Scenario: Predict Salary Increase ($Y$) based on Training Hours ($X$).
Dataset:

  • $X$: [5, 8, 10, 12, 15]
  • $Y$: [12, 18, 24, 28, 35]

Step 1: Means

  • $\bar{X} = 10$
  • $\bar{Y} = 23.4$

Step 2: Slope ($\beta_1$)

  • Formula: $\frac{\sum(X - \bar{X})(Y - \bar{Y})}{\sum(X - \bar{X})^2}$
  • Numerator: $135$
  • Denominator: $58$
  • $\beta_1 \approx 2.33$

Step 3: Intercept ($\beta_0$)

  • Formula: $\bar{Y} - \beta_1\bar{X}$
  • $\beta_0 = 23.4 - (2.33 \times 10) = 0.1$ (about $0.12$ if the slope is left unrounded)

Step 4: Prediction (for 11 hours)

  • $Y = 0.1 + 2.33(11) \approx \mathbf{25.73}$, i.e., about \$25,730 if $Y$ is measured in thousands of dollars [cite: 140, 141].
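
The four steps can be checked with a few lines of plain Python. This sketch applies the same least-squares formulas to the dataset above without rounding the slope, so the intercept comes out near 0.12 rather than 0.1, while the prediction still rounds to 25.73. Variable names are illustrative.

```python
xs = [5, 8, 10, 12, 15]    # training hours (X)
ys = [12, 18, 24, 28, 35]  # salary increase (Y)

# Step 1: means
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# Step 2: slope = sum((X - X_bar)(Y - Y_bar)) / sum((X - X_bar)^2)
numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
denominator = sum((x - x_bar) ** 2 for x in xs)
slope = numerator / denominator

# Step 3: intercept = Y_bar - slope * X_bar
intercept = y_bar - slope * x_bar

# Step 4: prediction for 11 training hours
prediction = intercept + slope * 11

print(f"slope={slope:.4f} intercept={intercept:.4f} prediction={prediction:.2f}")
```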

2. Probability (Bayes/Total Prob)

Scenario:

  • Type A: 60% production, 5% defect rate.
  • Type B: 40% production, 10% defect rate [cite: 146, 147].

Question A: Probability of ANY defect?
$$P(Defect) = (0.60 \times 0.05) + (0.40 \times 0.10)$$
$$P(Defect) = 0.03 + 0.04 = \mathbf{0.07} \text{ (or 7\%)}$$

Question B: If defective, prob it is Type A?
$$P(A | Defect) = \frac{P(A \cap Defect)}{P(Defect)}$$
$$P(A | Defect) = \frac{0.03}{0.07} \approx \mathbf{0.43} \text{ (or 43\%)}$$
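
Both answers can be verified numerically. The sketch below applies the law of total probability and then Bayes' rule to the production shares and defect rates given in the scenario. The dictionary layout is just one convenient way to organize the inputs.

```python
# Production share and defect rate for each product type (from the scenario).
share = {"A": 0.60, "B": 0.40}
defect_rate = {"A": 0.05, "B": 0.10}

# Question A: law of total probability
p_defect = sum(share[t] * defect_rate[t] for t in share)

# Question B: Bayes' rule, P(A | Defect) = P(A and Defect) / P(Defect)
p_a_given_defect = share["A"] * defect_rate["A"] / p_defect

print(f"P(Defect) = {p_defect:.2f}")
print(f"P(A | Defect) = {p_a_given_defect:.2f}")
```

Even though Type A makes up most of production, it accounts for fewer than half of the defects because its defect rate is half that of Type B.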


💡 Part 5: Rapid Review Mnemonics

The "ACID" Test (Transactions)

Note

Atomicity: "All or Nothing." If a payment system crashes mid-transaction, Atomicity ensures the whole transaction is rolled back, so no partial update (such as a debit without the matching credit) is ever left behind [cite: 119, 120].
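
Atomicity can be demonstrated with Python's built-in `sqlite3` module. In this sketch the `accounts` table, the names, and the simulated crash are all hypothetical. Using the connection as a context manager commits the transaction on success and rolls it back if an exception escapes the block.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount, crash=False):
    """Move `amount` from src to dst as one transaction; optionally fail halfway."""
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
            if crash:
                raise RuntimeError("simulated crash mid-transaction")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
    except RuntimeError:
        pass  # the debit above has already been rolled back

transfer(conn, "alice", "bob", 30, crash=True)
balances_after_crash = dict(conn.execute("SELECT name, balance FROM accounts"))

transfer(conn, "alice", "bob", 30)
balances_after_success = dict(conn.execute("SELECT name, balance FROM accounts"))

print(balances_after_crash)
print(balances_after_success)
```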

The "CAP" Theorem

Note

Availability + Partition Tolerance (AP): If a system must stay operational (accepting orders) during a network partition, it sacrifices Consistency for Availability: nodes keep answering, but they may temporarily return different data [cite: 132, 134].
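
A toy model of the AP trade-off, under heavy simplification: two replicas keep accepting writes while they cannot reach each other, so they briefly disagree, and they are reconciled once the partition heals. The `Replica` class and the "highest version wins" merge policy are assumptions chosen for illustration, not a description of any particular database.

```python
class Replica:
    """One node holding order records as order_id -> (version, status)."""

    def __init__(self, name):
        self.name = name
        self.orders = {}

    def write(self, order_id, status, version):
        # Always accepted, even when the other replica is unreachable (availability).
        self.orders[order_id] = (version, status)

def sync(a, b):
    """Reconcile two replicas after the partition heals: highest version wins."""
    for order_id in set(a.orders) | set(b.orders):
        candidates = [r.orders[order_id] for r in (a, b) if order_id in r.orders]
        newest = max(candidates, key=lambda record: record[0])
        a.orders[order_id] = newest
        b.orders[order_id] = newest

east, west = Replica("east"), Replica("west")

# During the partition, each side accepts writes independently.
east.write("order-1", "placed", version=1)
west.write("order-1", "cancelled", version=2)
west.write("order-2", "placed", version=1)
diverged = east.orders != west.orders  # replicas disagree: consistency was given up

# After the partition heals, the replicas converge again.
sync(east, west)
converged = east.orders == west.orders
print(diverged, converged, east.orders)
```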

Anscombe's Quartet

Note

Lesson: "Never trust summary statistics alone." Different datasets can have nearly identical means, variances, and correlations but look completely different on a scatter plot [cite: 37, 38].
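
The quartet's published values make this easy to check. The sketch below computes the mean of $y$ and the Pearson correlation for each of the four datasets. All four come out at roughly 7.50 and 0.816, even though only the first looks like an ordinary noisy linear relationship when plotted. The `pearson` helper is hand-written so the example needs only the standard library.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared x for datasets I-III
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

stats = {}
for name, (xs, ys) in quartet.items():
    stats[name] = (sum(ys) / len(ys), pearson(xs, ys))
    print(f"{name}: mean(y)={stats[name][0]:.2f}, r={stats[name][1]:.3f}")
```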
