Visnusah/Data science

Last active January 14, 2026 15:22

Data Science Final Exam Mastery Guide

Visnusah commented Jan 14, 2026

🚀 Data Science Final Exam Mastery Guide

Target Exam: BSc Computing (Year 2) - Data Science
Focus Areas: Statistics, Big Data (HDFS/MapReduce), Visualization, and Methodologies.

📊 Part 1: Analytics & Visualization

Tip

Visualization Rule of Thumb: Always choose the chart that makes comparison easiest for the human eye.

1. Choosing the Right Chart

Comparing Values: When comparing exact values across categories (e.g., museum attendance), use a Horizontal Bar Chart. Pie charts are often poor choices because it is difficult to compare angles accurately.
Analyzing Trends: To see how data changes over time (e.g., monthly defect rates), use a Line Chart.
Distribution & Outliers: To understand the spread of data (e.g., purchase amounts) and spot anomalies, use a Box Plot.
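
The box plot recommendation can be made concrete by computing the numbers a box plot draws. This is a minimal sketch: the purchase amounts are hypothetical, and the 1.5 × IQR outlier fence is the common plotting convention, assumed here rather than taken from the exam material.

```python
import statistics

# Hypothetical purchase amounts; the last value is a deliberate anomaly.
purchases = [12, 15, 18, 20, 22, 25, 28, 30, 150]

# Quartiles form the edges and the centre line of the box.
q1, median, q3 = statistics.quantiles(purchases, n=4, method="inclusive")
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn individually as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [p for p in purchases if p < lower_fence or p > upper_fence]

print(q1, median, q3, outliers)
```

Only the value 150 falls outside the fences, which is the kind of anomaly a box plot makes visible at a glance.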

2. Analytics Maturity Model

Descriptive: Summarizing historical data to see "what happened".
Diagnostic: Analyzing data to understand "why it happened".
Predictive: Using historical data to forecast future trends (e.g., predicting hospital patient volume).
Prescriptive: Recommending specific actions based on the predictions.

3. The "Correlation Trap"

Warning

Correlation $\neq$ Causation: If ice cream sales and drowning incidents both rise, it does not mean ice cream causes drowning.
Correct Answer: A confounding variable (like hot weather) causes both to increase.
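
The confounding effect can be reproduced with simulated data. The sketch below assumes a toy model in which temperature drives both series and neither influences the other; all coefficients, noise levels, and the seed are made up for illustration. It compares the raw correlation with the partial correlation after controlling for temperature.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Assumed toy model: temperature drives BOTH series; neither affects the other.
temperature = [rng.uniform(15, 35) for _ in range(200)]
ice_cream = [10 * t + rng.gauss(0, 20) for t in temperature]
drownings = [0.3 * t + rng.gauss(0, 1) for t in temperature]

r_raw = pearson(ice_cream, drownings)

# Partial correlation: the association left after controlling for temperature.
r_it = pearson(ice_cream, temperature)
r_dt = pearson(drownings, temperature)
r_partial = (r_raw - r_it * r_dt) / math.sqrt((1 - r_it**2) * (1 - r_dt**2))

print(f"raw r = {r_raw:.2f}, partial r = {r_partial:.2f}")
```

The raw correlation is strong, while the partial correlation should sit close to zero, which is what a confounder looks like numerically.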

🏗️ Part 2: Big Data & Infrastructure

1. HDFS (Hadoop Distributed File System)

Replication Strategy: Data blocks (usually 128 MB) are replicated across multiple nodes (typically 3x).
- Why? To provide Fault Tolerance. If one node crashes, the data can be retrieved from another.
Geographic Redundancy: Replicating blocks across different data centers protects against catastrophic site failures (like fire or power loss).
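
The block and replication figures above translate directly into a storage estimate. A small sketch, assuming the 128 MB block size and 3x replication quoted above and a hypothetical 1000 MB file:

```python
import math

BLOCK_SIZE_MB = 128      # block size quoted above
REPLICATION_FACTOR = 3   # replication factor quoted above

def hdfs_footprint(file_size_mb):
    """Return (blocks, block replicas, raw storage in MB) for one file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    replicas = blocks * REPLICATION_FACTOR
    # The final block only occupies the bytes it holds, so raw usage is
    # the file size times the replication factor, not blocks * 128 MB.
    raw_storage_mb = file_size_mb * REPLICATION_FACTOR
    return blocks, replicas, raw_storage_mb

print(hdfs_footprint(1000))  # hypothetical 1000 MB file
```

The file splits into 8 blocks, stored as 24 block replicas, using about 3000 MB of raw disk.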

2. Database Architecture

Scaling for Growth: When a single SQL database becomes a bottleneck under millions of transactions, the usual recommendation is a Distributed NoSQL architecture with sharding.
- Sharding: Splits data across multiple servers to handle high velocity and volume.
Graph Databases: Best for data involving complex relationships, such as finding the shortest path between airports or social network connections.
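
The sharding bullet above can be illustrated with hash-based routing, one of several possible sharding strategies. This is a sketch only: the shard count, the key format, and the `shard_for` helper are hypothetical, not any particular database's API.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size

def shard_for(key, num_shards=NUM_SHARDS):
    """Map a record key to a shard index with a stable hash."""
    # hashlib gives the same digest in every process, unlike built-in hash().
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Hypothetical customer IDs routed to shards.
shards = {i: [] for i in range(NUM_SHARDS)}
for customer_id in ["cust-001", "cust-002", "cust-003", "cust-004", "cust-005"]:
    shards[shard_for(customer_id)].append(customer_id)

print(shards)
```

Plain modulo hashing moves most keys when the shard count changes, which is why production systems often prefer consistent hashing or range-based partitioning.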

3. MapReduce

The Mapper: The function that reads input files and produces intermediate key-value pairs (e.g., (genre, count)).
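
The mapper's role is easier to see next to the shuffle and reduce steps. The following is a plain-Python imitation of the pattern, not Hadoop code; the `title,genre` record format and the sample records are assumptions made for illustration.

```python
from collections import defaultdict

def mapper(record):
    """Emit one (genre, 1) pair per input record."""
    title, genre = record.split(",")
    yield genre.strip(), 1

def reducer(genre, counts):
    """Sum all counts that were grouped under one genre."""
    return genre, sum(counts)

# Hypothetical input lines in "title,genre" form.
records = ["Alien,SciFi", "Heat,Crime", "Arrival,SciFi", "Up,Animation"]

# Shuffle step: group intermediate values by key (the framework does this in Hadoop).
grouped = defaultdict(list)
for record in records:
    for key, value in mapper(record):
        grouped[key].append(value)

genre_counts = dict(reducer(k, v) for k, v in grouped.items())
print(genre_counts)
```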

🧪 Part 3: Methodologies & Lifecycle

1. OSEMN Framework

Explore: The third stage of OSEMN (Obtain, Scrub, Explore, Model, iNterpret), where you create visualizations and perform EDA (Exploratory Data Analysis) to understand patterns before modeling.

2. CRISP-DM

Business Understanding: The initial phase where you define goals (e.g., "reduce downtime by 30%") and identify pain points.

3. Hypothesis Testing

Important

The p-value Rule
* Scenario: p-value = 0.08, Significance Level ($\alpha$) = 0.05.
* Conclusion: Since $0.08 > 0.05$, you fail to reject the null hypothesis. There is insufficient evidence to conclude that the effect exists; this does not prove that the null hypothesis is true.
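
The decision rule is mechanical enough to write down as a function. A minimal sketch; the `decide` helper and the second p-value are hypothetical.

```python
def decide(p_value, alpha=0.05):
    """Apply the standard decision rule for a significance test."""
    if p_value <= alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"

print(decide(0.08))  # the scenario above
print(decide(0.03))  # hypothetical smaller p-value for contrast
```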

🧮 Part 4: Mathematical Solvers (Corrected)

The calculations in the source text contained errors. Below are the corrected steps.

1. Linear Regression

Scenario: Predict Salary Increase ($Y$) based on Training Hours ($X$).
Dataset:

$X$: [5, 8, 10, 12, 15]
$Y$: [12, 18, 24, 28, 35]

Step 1: Means

$\bar{X} = 10$
$\bar{Y} = 23.4$

Step 2: Slope ($\beta_1$)

Formula: $\frac{\sum(X - \bar{X})(Y - \bar{Y})}{\sum(X - \bar{X})^2}$
Numerator: $135$
Denominator: $58$
$\beta_1 = 135 / 58 \approx 2.33$

Step 3: Intercept ($\beta_0$)

Formula: $\bar{Y} - \beta_1\bar{X}$
$\beta_0 = 23.4 - (2.33 \times 10) \approx 0.1$

Step 4: Prediction (for 11 hours)

$Y = 0.1 + 2.33(11) \approx \mathbf{25.73}$ (that is, \$25,730 with $Y$ measured in thousands of dollars).
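
The four steps can be checked numerically. This sketch recomputes the least-squares fit from the dataset above with full precision; the variable names are illustrative.

```python
hours = [5, 8, 10, 12, 15]      # X from the dataset above
raises = [12, 18, 24, 28, 35]   # Y from the dataset above

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(raises) / n

# Slope: covariance-style sum over the squared deviations of X.
numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, raises))
denominator = sum((x - x_bar) ** 2 for x in hours)

slope = numerator / denominator
intercept = y_bar - slope * x_bar
prediction = intercept + slope * 11  # 11 training hours

print(round(slope, 2), round(intercept, 2), round(prediction, 2))
```

Without intermediate rounding the intercept comes out near 0.12 rather than 0.1, because the hand calculation uses the rounded slope 2.33. The prediction still rounds to 25.73.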

2. Probability (Bayes/Total Prob)

Scenario:

Type A: 60% production, 5% defect rate.
Type B: 40% production, 10% defect rate.

Question A: Probability of ANY defect?
$$P(\text{Defect}) = (0.60 \times 0.05) + (0.40 \times 0.10)$$
$$P(\text{Defect}) = 0.03 + 0.04 = \mathbf{0.07} \text{ (or 7\%)}$$

Question B: If an item is defective, what is the probability that it is Type A?
$$P(A \mid \text{Defect}) = \frac{P(A \cap \text{Defect})}{P(\text{Defect})}$$
$$P(A \mid \text{Defect}) = \frac{0.03}{0.07} \approx \mathbf{0.43} \text{ (or 43\%)}$$
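
Both answers can be verified in a few lines. A sketch using the scenario's figures; the dictionary layout is just one convenient way to organize them.

```python
prior = {"A": 0.60, "B": 0.40}          # share of production
defect_rate = {"A": 0.05, "B": 0.10}    # P(Defect | type)

# Law of total probability: sum over product types.
p_defect = sum(prior[t] * defect_rate[t] for t in prior)

# Bayes' theorem: P(type | Defect) for each type.
posterior = {t: prior[t] * defect_rate[t] / p_defect for t in prior}

print(round(p_defect, 2), {t: round(p, 2) for t, p in posterior.items()})
```

The posteriors for Type A and Type B sum to 1, which is a quick sanity check on the arithmetic.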

💡 Part 5: Rapid Review Mnemonics

The "ACID" Test (Transactions)

Note

Atomicity: "All or Nothing." If a payment system crashes mid-transaction, Atomicity ensures the partial changes are rolled back, so the transaction either completes fully or has no effect.
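
Atomicity can be observed with Python's built-in `sqlite3` module. In this sketch the `accounts` table, the names, and the simulated crash are all hypothetical; it relies on the documented behaviour that using a connection as a context manager commits on success and rolls back when an exception escapes the block.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount, crash=False):
    """Move money between accounts as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            if crash:
                raise RuntimeError("simulated crash mid-transaction")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
    except RuntimeError:
        pass  # the debit above has already been rolled back

transfer(conn, "alice", "bob", 30, crash=True)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)
```

After the simulated crash both balances are unchanged: the debit from the first account never becomes visible on its own.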

The "CAP" Theorem

Note

Availability + Partition Tolerance (AP): If a system must stay operational (e.g., keep accepting orders) during a network partition, it sacrifices Consistency for Availability.

Anscombe's Quartet

Note

Lesson: "Never trust summary statistics alone." Different datasets can have nearly identical means, variances, and correlations but look completely different on a scatter plot.
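
The lesson can be checked on two of the quartet's four datasets. The values below are datasets I and II as commonly tabulated; the `summarize` helper is illustrative. Dataset I is roughly linear and dataset II is a smooth curve, yet the summary numbers agree.

```python
import math

# Anscombe's datasets I and II share the same x values.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def summarize(xs, ys):
    """Return (mean of y, Pearson correlation of x and y)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    return mean_y, sxy / math.sqrt(sxx * syy)

for name, ys in [("I", y1), ("II", y2)]:
    mean_y, r = summarize(x, ys)
    print(f"Dataset {name}: mean(y) = {mean_y:.2f}, r = {r:.3f}")
```

Both datasets report a mean near 7.50 and a correlation near 0.816, so only a plot reveals that they have different shapes.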
