Data Science with Python Cheatsheet (Beginner Friendly)
This document is written for absolute beginners. Every concept is explained in simple language, with context of why it is needed in real projects.
1. What is Data Science (Very Simple Explanation)
Data Science means using data to solve real business problems.
Companies collect huge amounts of data every day: customer details, purchases, clicks, reviews, sensor readings, etc. Raw data alone is useless. Data Science converts this raw data into useful information.
A Data Scientist:
Collects data
Cleans messy data
Finds patterns
Builds models
Makes predictions
Helps managers make decisions
Example business questions:
Which customers may leave?
Which product sells most?
What will be next month’s sales?
Python example:
print("Data Science = Data + Python + Statistics + Machine Learning")
2. Complete Real‑World Data Science Workflow
Every professional project follows these steps:
Understand the business problem
Collect data (CSV, database, API)
Clean data (missing values, duplicates)
Explore data (EDA)
Create features
Train model
Evaluate model
Deploy model
Beginners often jump straight to ML. This is a mistake: in real projects, roughly 70% of the work is cleaning and understanding data.
3. Python Basics for Data Science
Python is easy to read and widely used in Data Science.
Variables
age = 25
salary = 50000
Lists
numbers = [1, 2, 3, 4]
Loop
for n in numbers:
    print(n)
Function
def add(a, b):
    return a + b
print(add(3, 4))
Functions help reuse code.
4. NumPy (Numerical Calculations)
NumPy is used for fast mathematical operations.
Why NumPy:
Faster than Python lists
Used inside ML algorithms
import numpy as np
arr = np.array([10, 20, 30])
print(arr.mean())
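A minimal sketch of why NumPy is fast: arithmetic applies to the whole array at once (vectorization), with no explicit loop. The numbers here are made up for illustration.

```python
import numpy as np

# Element-wise arithmetic works on the whole array at once (vectorization),
# which is why NumPy is much faster than looping over a Python list.
arr = np.array([10, 20, 30])
doubled = arr * 2        # array([20, 40, 60]) -- no loop needed

print(arr.mean())        # 20.0
print(doubled.sum())     # 120
```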
5. Pandas (Working With Tables)
Pandas works like Excel but with programming.
import pandas as pd
df = pd.read_csv("data.csv")
print(df.head())
Common operations:
df["age"]
df[df["age"] > 30]
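A runnable version of the operations above. Since `data.csv` is not included here, this sketch builds a tiny in-memory table instead; the column names and values are made up.

```python
import pandas as pd

# A tiny in-memory table standing in for data.csv (hypothetical data).
df = pd.DataFrame({
    "name": ["Asha", "Ben", "Cara"],
    "age":  [25, 34, 41],
})

ages = df["age"]               # select one column (returns a Series)
over_30 = df[df["age"] > 30]   # boolean filter: keep rows where age > 30
print(over_30)
```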
6. Handling Missing Values
Missing values confuse ML models.
Check missing:
df.isnull().sum()
Fill missing:
df["age"] = df["age"].fillna(df["age"].mean())
(Assigning back is safer than inplace=True, which modern pandas discourages.)
Always understand why data is missing before filling.
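The check-then-fill steps above, end to end, on a made-up column with two missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries.
df = pd.DataFrame({"age": [25, np.nan, 41, np.nan]})

print(df.isnull().sum())      # age: 2 missing

# Fill with the column mean (33.0 here); assigning back avoids
# pandas' chained-assignment pitfalls with inplace=True.
df["age"] = df["age"].fillna(df["age"].mean())
print(df["age"].tolist())     # [25.0, 33.0, 41.0, 33.0]
```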
7. Descriptive Statistics
Used to summarize data.
Mean = average
Median = middle
Mode = most frequent
print(df.describe())
Median is better for salary data because of outliers.
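A quick demonstration of that last point, using invented salary figures: one extreme value drags the mean far away while the median barely moves.

```python
import pandas as pd

# Four typical salaries plus one extreme outlier (hypothetical numbers).
salaries = pd.Series([30000, 35000, 40000, 45000, 1000000])

print(salaries.mean())    # 230000.0 -- dragged up by the outlier
print(salaries.median())  # 40000.0  -- barely affected
```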
8. Exploratory Data Analysis (EDA)
EDA helps you understand patterns before ML.
You check:
Distribution
Relationships
Outliers
df.corr(numeric_only=True)
Never skip EDA.
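A small correlation check as a sketch, on invented data where income rises perfectly with age. `numeric_only=True` skips any text columns, which recent pandas versions would otherwise complain about.

```python
import pandas as pd

# Hypothetical data with a perfect linear relationship.
df = pd.DataFrame({
    "age":    [25, 30, 35, 40],
    "income": [30000, 40000, 50000, 60000],
})

# numeric_only=True ensures only number columns enter the correlation.
print(df.corr(numeric_only=True))   # age/income correlation is 1.0
```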
9. Feature Engineering
Creating new useful columns from existing data.
df["income_per_age"] = df["income"] / df["age"]
Better features = better model.
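A runnable version of the ratio feature above, on made-up income and age values:

```python
import pandas as pd

# Hypothetical data for the income-per-age ratio feature.
df = pd.DataFrame({"income": [50000, 90000], "age": [25, 30]})

# A ratio feature can capture information neither column holds alone.
df["income_per_age"] = df["income"] / df["age"]
print(df["income_per_age"].tolist())   # [2000.0, 3000.0]
```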
10. Train Test Split
We never evaluate a model on the same data it was trained on, so we hold part of the data back for testing.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
80% training
20% testing
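A self-contained sketch of the split, using synthetic arrays; `random_state=42` (an arbitrary choice) makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 rows, 2 feature columns
y = np.arange(10)

# test_size=0.2 holds out 2 of the 10 rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)   # (8, 2) (2, 2)
```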
11. Supervised Learning
Data has answers (labels).
Examples:
House price
Spam detection
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
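The fit-then-predict idea end to end, on synthetic data that follows y = 2x + 1 exactly, so the learned line is easy to check:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data following y = 2x + 1 exactly.
X = np.array([[1], [2], [3], [4]])
y = np.array([3, 5, 7, 9])

model = LinearRegression()
model.fit(X, y)
print(model.predict([[5]]))   # close to 11, since 2*5 + 1 = 11
```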
12. Classification
Predict categories.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
13. Prediction
pred = model.predict(X_test)
14. Model Evaluation
from sklearn.metrics import accuracy_score
accuracy_score(y_test, pred)
Accuracy alone is not enough for imbalanced data.
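Why accuracy misleads on imbalanced data, as a sketch with invented labels: a model that always predicts the majority class still scores 95%.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# 95 negatives, 5 positives: a model that always predicts 0
# scores 95% accuracy while catching zero positives.
y_test = np.array([0] * 95 + [1] * 5)
pred = np.zeros(100, dtype=int)

print(accuracy_score(y_test, pred))   # 0.95
```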
15. Confusion Matrix
Shows correct vs wrong predictions.
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, pred)
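A tiny worked example with hand-picked labels, so each cell of the matrix can be verified by eye:

```python
from sklearn.metrics import confusion_matrix

y_test = [0, 0, 1, 1]
pred   = [0, 1, 1, 1]

# Rows = true class, columns = predicted class:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_test, pred)
print(cm)   # [[1 1], [0 2]]: one false positive, no false negatives
```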
16. Overfitting vs Underfitting
Overfitting: model memorizes.
Underfitting: model too simple.
Solutions:
More data
Regularization
Better features
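One regularization option is ridge regression, sketched here on the same kind of synthetic data as before. The `alpha` value is an arbitrary choice for illustration: larger alpha shrinks coefficients toward zero, trading a little training fit for less overfitting.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data where plain least squares would learn slope 2.
X = np.array([[1], [2], [3], [4]])
y = np.array([3, 5, 7, 9])

# alpha controls regularization strength: larger alpha shrinks
# the coefficient toward zero, which limits overfitting.
model = Ridge(alpha=1.0)
model.fit(X, y)
print(model.coef_)   # slightly below 2 because of the shrinkage
```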
17. Cross Validation
More reliable evaluation.
from sklearn.model_selection import cross_val_score
cross_val_score(model, X, y, cv=5)
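A self-contained sketch on synthetic, well-separated classes (the data and seeds are made up), showing that `cv=5` returns one score per fold:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Two well-separated synthetic classes, 50 samples each.
rng0 = np.random.RandomState(0)
rng1 = np.random.RandomState(1)
X = np.vstack([rng0.normal(0, 1, (50, 2)),
               rng1.normal(5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# cv=5 trains and tests the model 5 times on different folds.
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores)          # five accuracy values, one per fold
print(scores.mean())   # average performance across folds
```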
18. Feature Scaling
Important for distance-based models.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
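What the scaler actually does, shown on a tiny made-up column: after transforming, the column has mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data.
X = np.array([[1.0], [2.0], [3.0], [4.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each column now has mean 0 and standard deviation 1.
print(X_scaled.mean(), X_scaled.std())
```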
19. Unsupervised Learning
No labels.
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
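A sketch on invented 2-D points that form two obvious groups; KMeans recovers them with no labels. `n_init` and `random_state` (arbitrary values here) make the run reproducible across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of points; KMeans should find them without labels.
X = np.array([[1, 1], [1.5, 2], [2, 1.5],
              [8, 8], [8.5, 9], [9, 8.5]])

# n_init and random_state make the result reproducible.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print(labels)   # first three points share one label, last three the other
```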
20. Pipelines (Professional Practice)
A pipeline chains preprocessing and modeling into one object, so the same steps run on training and new data. This prevents data leakage.
from sklearn.pipeline import Pipeline
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression())
])
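The pipeline above, made runnable on a tiny invented dataset: calling fit() scales using the training data only and then trains the model, so test data never influences the scaler.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical, easily separable data.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# fit() runs the scaler on training data only, then trains the model,
# so no information leaks from future/test data into preprocessing.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])
pipe.fit(X, y)
print(pipe.predict([[1.5], [11.5]]))   # [0 1]
```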
Final Beginner Advice
Learn in this order:
Python → Pandas → EDA → Statistics → Machine Learning → Projects
Build projects. Explain your thinking. Practice daily.