Complete Data Analysis Course with Detailed Answers
Definition of Data Analysis
Data analysis is a systematic process in which raw data is collected, cleaned, analyzed, and converted into meaningful insights. This process helps businesses and organizations make better decisions.
Main stages of data analysis:
Data Collection: Gathering data from various sources
Data Cleaning: Handling errors, duplicates, and missing values
Data Processing: Converting data into a format suitable for analysis
Data Analysis: Finding patterns using statistical methods and algorithms
Data Interpretation: Understanding results in a business context
Reporting: Sharing findings with stakeholders
Importance and Applications
Applications in Business:
Sales Analysis: Identifying which products sell the most
Customer Segmentation: Identifying different customer groups
Market Research: Understanding market trends and consumer behavior
Financial Analysis: Revenue, profit, and cost optimization
Risk Assessment: Identifying and mitigating potential risks
Use Across Industries:
Healthcare: Patient data analysis, drug effectiveness studies
Finance: Credit scoring, fraud detection, investment strategies
E-commerce: Recommendation systems, price optimization
Manufacturing: Quality control, supply chain optimization
Education: Student performance analysis, curriculum improvement
Government: Policy making, resource allocation
Career Opportunities
Job Roles:
Data Analyst: Entry-level position, basic analysis and reporting
Average Salary: ₹3-6 lakh per annum
Skills Required: Excel, SQL, basic statistics
Business Analyst: Solving business problems with data
Average Salary: ₹5-10 lakh per annum
Skills Required: Domain knowledge, communication skills
Data Scientist: Advanced analytics and machine learning
Average Salary: ₹8-20 lakh per annum
Skills Required: Python/R, ML algorithms, statistics
Data Engineer: Data infrastructure and pipelines
Average Salary: ₹6-15 lakh per annum
Skills Required: Programming, databases, cloud platforms
Data vs Information vs Knowledge
Data: Raw facts and figures
Example: Sales figure "₹50,000"
This is an isolated fact with no context.
Information: Processed data with context
Example: "Last month's sales were ₹50,000, a 20% increase over the previous month"
This presents the data in a meaningful context.
Knowledge: Information + experience + insight
Example: "The sales increase is due to festive-season demand; we should increase inventory for next month"
This provides actionable insights.
Quantitative Data
Continuous Data:
Can take any value within a range
May include decimal points
Examples: Height (170.5 cm), Weight (65.3 kg), Temperature (25.7°C)
Measurement: Precise measurements are possible
Analysis: Mean, median, and standard deviation can be calculated
Discrete Data:
Can take only specific values
Usually whole numbers
Examples: Number of students (25), number of cars (5), number of orders (150)
Characteristics: Countable, finite values
Analysis: Frequency distributions, percentages
Qualitative Data
Nominal Data:
Categories with no inherent order
Examples: Gender (Male, Female), Colors (Red, Blue, Green), Marital Status (Married, Single)
Analysis: Frequency counts, percentages, mode
Visualization: Pie charts, bar charts
Ordinal Data:
Categories with a specific order
Examples: Education Level (High School, Bachelor's, Master's), Satisfaction Rating (Poor, Average, Good, Excellent)
Analysis: Median, percentiles, correlation
Visualization: Ordered bar charts, stacked charts
Structured vs Unstructured Data
Structured Data:
Stored in a well-organized format
Easily searchable and analyzable
Examples: Database tables, Excel spreadsheets, CSV files
Characteristics: Organized in rows and columns
Analysis: SQL queries, statistical analysis
Unstructured Data:
No predefined format
Text, images, videos, audio files
Examples: Social media posts, emails, documents, images
Challenges: Requires special processing techniques
Analysis: Natural Language Processing, image recognition
Primary vs Secondary Data
Primary Data:
Directly collected for specific research purpose
Collection Methods: Surveys, interviews, observations, experiments
Advantages: Highly relevant, current, reliable
Disadvantages: Time-consuming, expensive
Examples: Customer survey data, sales transactions, website analytics
Secondary Data:
Already collected by someone else for different purpose
Sources: Government reports, industry studies, academic research
Advantages: Quick access, cost-effective, large datasets
Disadvantages: May not be exactly relevant, could be outdated
Examples: Census data, market research reports, financial statements
Data Collection
Data Collection Methods:
1. Surveys और Questionnaires:
Online surveys (Google Forms, SurveyMonkey)
Phone interviews
Face-to-face interviews
Best Practices: Clear questions, unbiased language, proper sampling
2. Observational Studies:
Direct observation
Hidden observation
Participant observation
Applications: Customer behavior, usability testing
3. Experiments:
Controlled experiments
A/B testing
Randomized controlled trials
Applications: Product testing, website optimization
4. Secondary Data Sources:
Government databases
Industry reports
Academic research
Examples: Census data, economic indicators
Data Cleaning
Common Data Quality Issues:
1. Missing Values:
Identification: Check for blank cells, null values, "N/A" entries
Handling Strategies:
Delete rows with missing values (if small percentage)
Fill with mean/median (for numerical data)
Fill with mode (for categorical data)
Use interpolation for time series data
2. Duplicate Records:
Identification: Look for identical rows or similar entries
Handling: Remove duplicates; keep the most recent or most complete record
3. Inconsistent Data:
Examples: "Male" vs "M" vs "male"
Solutions: Standardize formats, create data dictionaries
4. Outliers:
Identification: Values that are significantly different from others
Detection Methods: Box plots, Z-scores, IQR method
Handling: Remove, transform, or analyze separately
Data Cleaning Process:
Data Profiling: Understand data structure and quality
Data Validation: Check for completeness and accuracy
Data Standardization: Consistent formats and units
Data Transformation: Convert to suitable formats
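The cleaning steps above can be sketched with pandas; the dataset, column names (age, gender, city), and thresholds here are hypothetical, chosen only to show each step:

```python
# A minimal data-cleaning sketch with pandas (hypothetical data).
import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 30, 25, 200],                 # a missing value and an outlier
    "gender": ["Male", "M", "Female", "Male", "male"],  # inconsistent labels
    "city": ["Mumbai", "Delhi", None, "Mumbai", "Delhi"],
})

# 1. Missing values: fill numeric with the median, categorical with the mode
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# 2. Inconsistent data: standardize labels ("Male" vs "M" vs "male")
df["gender"] = df["gender"].str.lower().map({"male": "Male", "m": "Male", "female": "Female"})

# 3. Duplicates: drop fully identical rows
df = df.drop_duplicates()

# 4. Outliers: flag values outside 1.5 * IQR
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]
```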
Data Exploration
Exploratory Data Analysis (EDA) Steps:
1. Univariate Analysis:
Analyze each variable individually
For Numerical Variables: Histograms, box plots, summary statistics
For Categorical Variables: Frequency tables, bar charts
2. Bivariate Analysis:
Relationship between two variables
Numerical vs Numerical: Scatter plots, correlation analysis
Categorical vs Numerical: Box plots, group comparisons
Categorical vs Categorical: Cross-tabulation, chi-square tests
3. Multivariate Analysis:
Analyze multiple variables together
Techniques: Correlation matrices, heatmaps, parallel coordinates
Key Questions to Answer:
What is the distribution of each variable?
Are there any patterns or trends?
What are the relationships between variables?
Are there any anomalies or outliers?
Data Analysis Techniques
Descriptive Analysis:
Summarize and describe data characteristics
Measures: Mean, median, mode, standard deviation
Purpose: Understanding the current state
Diagnostic Analysis:
Understand why something happened
Techniques: Drill-down analysis, root cause analysis
Purpose: Identify causes of problems or trends
Predictive Analysis:
Forecast future outcomes
Techniques: Regression analysis, time series forecasting
Purpose: Planning and decision making
Prescriptive Analysis:
Recommend actions to take
Techniques: Optimization algorithms, simulation
Purpose: Actionable insights and recommendations
Data Visualization
Chart Selection Guidelines:
For Comparisons:
Bar charts: Compare categories
Column charts: Compare over time
Horizontal bar charts: Long category names
For Relationships:
Scatter plots: Correlation between two variables
Line charts: Trends over time
Bubble charts: Three-dimensional relationships
For Compositions:
Pie charts: Parts of a whole (limited categories)
Stacked bar charts: Components over time
Area charts: Cumulative values
For Distributions:
Histograms: Frequency distribution
Box plots: Outliers and quartiles
Violin plots: Distribution shape
Visualization Best Practices:
Choose appropriate chart type
Use clear and descriptive titles
Label axes properly
Use consistent colors
Avoid 3D effects unnecessarily
Consider color-blind accessibility
Interpretation and Reporting
Key Elements of Good Analysis:
1. Context Setting:
Business background
Objectives and goals
Data sources and limitations
2. Methodology:
Analysis approach
Tools and techniques used
Assumptions made
3. Findings:
Key insights discovered
Statistical significance
Practical significance
4. Recommendations:
Actionable steps
Expected outcomes
Implementation timeline
5. Limitations:
Data quality issues
Methodological constraints
Scope limitations
Reporting Formats:
Executive summary for leadership
Detailed technical report for analysts
Dashboard for ongoing monitoring
Presentation for stakeholders
Measures of Central Tendency
Mean:
Definition: Sum of all values divided by the number of values
Formula:
Mean = (Sum of all values) / (Number of values)
Mean = (x₁ + x₂ + x₃ + ... + xₙ) / n
Example: Marks of 5 students: 85, 90, 78, 92, 85
Mean = (85 + 90 + 78 + 92 + 85) / 5 = 430 / 5 = 86
Types of Mean:
Arithmetic Mean: Regular average
Geometric Mean: Used for growth rates
Formula: ⁿ√(x₁ × x₂ × ... × xₙ)
Harmonic Mean: Used for rates
Formula: n / (1/x₁ + 1/x₂ + ... + 1/xₙ)
When to Use Mean:
Data is normally distributed
No extreme outliers present
Need to use all data points
Advantages:
Uses all data points
Algebraically defined
Unique value
Disadvantages:
Affected by outliers
May not represent actual data point
Median:
Definition: The middle value when data is arranged in ascending order
Calculation Steps:
Arrange the data in ascending order
If n is odd: Median = middle value
If n is even: Median = average of the two middle values
Example 1 (odd number of values):
Data: 12, 15, 18, 20, 25
Median = 18 (middle value)
Example 2 (even number of values):
Data: 10, 15, 20, 25, 30, 35
Median = (20 + 25) / 2 = 22.5
When to Use Median:
Data has outliers
Skewed distribution
Ordinal data
Want typical value
Advantages:
Not affected by outliers
Easy to understand
Works with ordinal data
Disadvantages:
Ignores extreme values
Less stable than mean
Mode:
Definition: Most frequently occurring value in dataset
Examples:
Data: 2, 3, 3, 4, 5, 5, 5, 6
Mode = 5 (appears 3 times)
Types:
Unimodal: One mode
Bimodal: Two modes
Multimodal: More than two modes
No Mode: All values appear once
When to Use Mode:
Categorical data
Want most common value
Discrete data
Quality control
Advantages:
Easy to identify
Not affected by outliers
Useful for categorical data
Disadvantages:
May not exist
May not be unique
Ignores other values
Comparison and Selection:
Normal Distribution: Mean = Median = Mode
Right Skewed: Mean > Median > Mode
Left Skewed: Mean < Median < Mode
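The worked examples for mean, median, and mode above can be verified with Python's built-in statistics module:

```python
# Verifying the mean, median, and mode examples with the statistics module.
import statistics

marks = [85, 90, 78, 92, 85]
mean = statistics.mean(marks)               # 86

odd_data = [12, 15, 18, 20, 25]
even_data = [10, 15, 20, 25, 30, 35]
median_odd = statistics.median(odd_data)    # 18 (middle value)
median_even = statistics.median(even_data)  # 22.5 (average of the two middle values)

mode_data = [2, 3, 3, 4, 5, 5, 5, 6]
mode = statistics.mode(mode_data)           # 5 (appears 3 times)
```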
Measures of Variability
Range:
Definition: Difference between the maximum and minimum values
Formula: Range = Maximum value - Minimum value
Example:
Data: 10, 15, 20, 25, 30
Range = 30 - 10 = 20
Advantages:
Easy to calculate
Quick measure of spread
Intuitive understanding
Disadvantages:
Only uses two values
Affected by outliers
Doesn't show distribution shape
Variance:
Definition: Average of squared differences from mean
Population Variance Formula: σ² = Σ(xi - μ)² / N
Sample Variance Formula: s² = Σ(xi - x̄)² / (n-1)
Calculation Steps:
Calculate mean
Find difference of each value from mean
Square each difference
Sum all squared differences
Divide by N (population) or n-1 (sample)
Example (population variance):
Data: 2, 4, 6, 8, 10
Mean = 6
Value | Difference | Squared Difference
2     | -4         | 16
4     | -2         | 4
6     |  0         | 0
8     |  2         | 4
10    |  4         | 16
Population variance = (16 + 4 + 0 + 4 + 16) / 5 = 40 / 5 = 8
Standard Deviation:
Definition: Square root of variance
Formula:
Population: σ = √σ²
Sample: s = √s²
From the example above: Standard Deviation = √8 ≈ 2.83
Interpretation:
About 68% values within 1 SD of mean
About 95% values within 2 SD of mean
About 99.7% values within 3 SD of mean
When to Use Standard Deviation:
Normal distribution
Comparing variability
Quality control
Risk assessment
Coefficient of Variation:
Definition: Relative measure of variability
Formula: CV = (Standard Deviation / Mean) × 100%
Example: If Mean = 50, SD = 10 CV = (10/50) × 100% = 20%
Uses:
Compare variability between different datasets
Different units or scales
Relative risk assessment
Interpretation:
CV < 15%: Low variability
CV 15-35%: Moderate variability
CV > 35%: High variability
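The variance and standard deviation example above (data 2, 4, 6, 8, 10) can be checked in Python. Note that pvariance divides by N, matching the worked example, while variance divides by n-1:

```python
# Checking the variance, standard deviation, and CV calculations.
import statistics

data = [2, 4, 6, 8, 10]
mean = statistics.mean(data)                 # 6
pop_variance = statistics.pvariance(data)    # 8   (divides by N)
pop_sd = statistics.pstdev(data)             # √8 ≈ 2.83
sample_variance = statistics.variance(data)  # 10  (divides by n-1)

# Coefficient of variation: relative spread as a percentage
cv = pop_sd / mean * 100                     # ≈ 47.14%
```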
Definition of Probability
Definition: Probability is a measure of the likelihood of an event occurring
Mathematical Definition: P(Event) = Number of favorable outcomes / Total number of possible outcomes
Range: 0 ≤ P(Event) ≤ 1
P = 0: Impossible event
P = 1: Certain event
P = 0.5: Equally likely
Example: Rolling a die, the probability of getting a 3: P(3) = 1/6 ≈ 0.167, or 16.7%
Types of Events
Independent Events:
Outcome of one doesn't affect the other
Example: Coin flips, die rolls
P(A and B) = P(A) × P(B)
Dependent Events:
Outcome of one affects the other
Example: Drawing cards without replacement
P(A and B) = P(A) × P(B|A)
Mutually Exclusive Events:
Cannot occur simultaneously
Example: Getting heads or tails
P(A and B) = 0
P(A or B) = P(A) + P(B)
Complementary Events:
One or the other must occur
Example: Pass or fail
P(A) + P(A') = 1
P(A') = 1 - P(A)
Probability Rules
Addition Rule:
For Mutually Exclusive Events: P(A or B) = P(A) + P(B)
For Non-Mutually Exclusive Events: P(A or B) = P(A) + P(B) - P(A and B)
Example: Drawing a card, P(King or Heart):
P(King) = 4/52, P(Heart) = 13/52, P(King of Hearts) = 1/52
P(King or Heart) = 4/52 + 13/52 - 1/52 = 16/52
Multiplication Rule:
For Independent Events: P(A and B) = P(A) × P(B)
For Dependent Events: P(A and B) = P(A) × P(B|A)
Example: Two coin flips, P(Heads and Heads):
P(H and H) = P(H) × P(H) = 0.5 × 0.5 = 0.25
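Both rules can be verified with exact fractions, avoiding any rounding:

```python
# Addition rule (card example) and multiplication rule (coin example)
# checked with exact rational arithmetic.
from fractions import Fraction

# Addition rule for non-mutually exclusive events
p_king = Fraction(4, 52)
p_heart = Fraction(13, 52)
p_king_of_hearts = Fraction(1, 52)
p_king_or_heart = p_king + p_heart - p_king_of_hearts   # 16/52 = 4/13

# Multiplication rule for independent events: two coin flips
p_heads = Fraction(1, 2)
p_two_heads = p_heads * p_heads                          # 1/4
```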
Conditional Probability
Definition: Probability of event A given that event B has occurred
Formula: P(A|B) = P(A and B) / P(B)
Example: In a class, 60% of students pass math and 40% pass both math and science.
P(Science|Math) = P(Math and Science) / P(Math) = 0.4 / 0.6 ≈ 0.67
Tree Diagrams:
Visual representation of conditional probabilities
Branches represent different outcomes
Probabilities multiply along branches
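The classroom example above reduces to a two-line calculation:

```python
# Conditional probability: P(Science | Math) from the classroom example.
p_math = 0.6               # P(Math)
p_math_and_science = 0.4   # P(Math and Science)

p_science_given_math = p_math_and_science / p_math   # 0.4 / 0.6 ≈ 0.67
```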
Bayes' Theorem
Formula: P(A|B) = P(B|A) × P(A) / P(B)
Components:
P(A|B): Posterior probability
P(B|A): Likelihood
P(A): Prior probability
P(B): Marginal probability
Medical Test Example:
Disease prevalence: 1% (prior)
Test accuracy: 95% (sensitivity)
False positive rate: 5%
Question: If test is positive, what's probability of having disease?
Solution: P(Disease|Positive) = P(Positive|Disease) × P(Disease) / P(Positive)
P(Positive) = P(Positive|Disease) × P(Disease) + P(Positive|No Disease) × P(No Disease)
P(Positive) = 0.95 × 0.01 + 0.05 × 0.99 = 0.0095 + 0.0495 = 0.059
P(Disease|Positive) = 0.95 × 0.01 / 0.059 ≈ 0.161, or 16.1%
Applications:
Medical diagnosis
Spam filtering
Machine learning
Business decision making
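The medical-test example can be reproduced step by step:

```python
# Bayes' theorem: the medical-test example from the text.
p_disease = 0.01               # prior: disease prevalence
p_pos_given_disease = 0.95     # likelihood: test sensitivity
p_pos_given_no_disease = 0.05  # false positive rate

# Marginal probability of a positive test (law of total probability)
p_positive = (p_pos_given_disease * p_disease
              + p_pos_given_no_disease * (1 - p_disease))     # 0.059

# Posterior: probability of disease given a positive test
p_disease_given_positive = (p_pos_given_disease * p_disease
                            / p_positive)                      # ≈ 0.161
```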
Normal Distribution
Characteristics:
Bell-shaped curve
Symmetrical around mean
Mean = Median = Mode
68-95-99.7 rule
Parameters:
μ (mu): Population mean
σ (sigma): Population standard deviation
Standard Normal Distribution:
Mean = 0, Standard Deviation = 1
Any normal distribution can be standardized
Z-score = (X - μ) / σ
Z-Score Interpretation:
Z = 0: Value equals mean
Z = 1: Value is 1 SD above mean
Z = -1: Value is 1 SD below mean
Example: Height follows a normal distribution with μ = 170 cm, σ = 10 cm
Person's height = 185 cm
Z-score = (185 - 170) / 10 = 1.5
Applications:
Quality control
Test scores
Measurement errors
Natural phenomena
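The z-score example, plus the 68% figure of the 68-95-99.7 rule derived from the normal CDF using only the standard library:

```python
# Z-score for the height example: x = 185 with mu = 170, sigma = 10.
mu, sigma = 170, 10
x = 185
z = (x - mu) / sigma   # 1.5

# Standard normal CDF via the error function (no external libraries).
from math import erf, sqrt

def normal_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

# Share of values within 1 standard deviation of the mean
within_1_sd = normal_cdf(1) - normal_cdf(-1)   # ≈ 0.68
```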
Binomial Distribution
Definition: Probability distribution for number of successes in fixed number of trials
Conditions:
Fixed number of trials (n)
Each trial has two outcomes (success/failure)
Probability of success (p) remains constant
Trials are independent
Parameters:
n: Number of trials
p: Probability of success
q = 1-p: Probability of failure
Formula: P(X = k) = C(n,k) × p^k × q^(n-k)
where C(n,k) = n! / (k!(n-k)!)
Mean: μ = n × p
Variance: σ² = n × p × q
Example: Flip a coin 10 times; the probability of exactly 6 heads:
n = 10, p = 0.5, k = 6
P(X = 6) = C(10,6) × (0.5)^6 × (0.5)^4 = 210 × 0.015625 × 0.0625 ≈ 0.205
Applications:
Quality control
Survey sampling
Medical trials
Marketing campaigns
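The coin-flip example, computed directly from the binomial formula with math.comb:

```python
# Binomial PMF: probability of exactly 6 heads in 10 fair coin flips.
from math import comb

n, p, k = 10, 0.5, 6
p_six_heads = comb(n, k) * p**k * (1 - p)**(n - k)   # 210 / 1024 ≈ 0.205

# Mean and variance of the distribution
mean = n * p               # 5.0
variance = n * p * (1 - p) # 2.5
```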
Poisson Distribution
Definition: Probability of number of events in fixed time interval
Conditions:
Events occur randomly
Average rate is constant
Events are independent
Parameter:
λ (lambda): Average number of events
Formula: P(X = k) = (e^(-λ) × λ^k) / k!
Mean: μ = λ
Variance: σ² = λ
Example: A bank averages 3 customers per hour. Probability of exactly 5 customers in the next hour:
λ = 3, k = 5
P(X = 5) = (e^(-3) × 3^5) / 5! = (0.0498 × 243) / 120 ≈ 0.101
Applications:
Call center arrivals
Manufacturing defects
Website traffic
Natural disasters
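The bank-customer example as a direct calculation of the Poisson formula:

```python
# Poisson PMF: P(exactly 5 customers) with an average rate of 3 per hour.
from math import exp, factorial

lam, k = 3, 5
p_five = exp(-lam) * lam**k / factorial(k)   # ≈ 0.101
```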
Central Limit Theorem
Statement: Distribution of sample means approaches normal distribution as sample size increases
Key Points:
Works regardless of population distribution
Sample size ≥ 30 generally sufficient
Sampling distribution has same mean as population
Standard error = σ / √n
Implications:
Can use normal distribution for inference
Larger samples give more accurate estimates
Foundation for hypothesis testing
Example: Population with any distribution, μ = 100, σ = 15
Sample size = 36
Sampling distribution of the mean: Normal with μ = 100, standard error = 15/√36 = 2.5
Applications:
Quality control
Market research
Political polling
A/B testing
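A quick simulation illustrates the theorem: sample means drawn from a uniform (clearly non-normal) population cluster around the population mean, with spread close to σ/√n. The population (uniform on 0-100) and simulation sizes here are chosen only for illustration:

```python
# Central Limit Theorem simulation with the standard library.
import random
import statistics

random.seed(42)
population_mean = 50   # mean of uniform(0, 100)
n = 36                 # sample size

# Draw many samples of size n and record each sample's mean
sample_means = [
    statistics.mean(random.uniform(0, 100) for _ in range(n))
    for _ in range(2000)
]

observed_mean = statistics.mean(sample_means)
observed_se = statistics.stdev(sample_means)
expected_se = (100 / 12 ** 0.5) / n ** 0.5   # sigma / sqrt(n) ≈ 4.81
```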
Data Entry Best Practices
Column Headers:
Use clear, descriptive names
Avoid spaces (use underscores: first_name)
Keep headers in first row
Use consistent naming convention
Data Types:
Numbers: Right-aligned, no text mixing
Dates: Use standard date format (DD/MM/YYYY)
Text: Left-aligned, consistent spelling
Boolean: TRUE/FALSE or Yes/No
Data Organization:
One data type per column
No merged cells in data area
No blank rows or columns within data
Keep raw data separate from analysis
Common Mistakes to Avoid:
Mixing data types in same column
Using formatting instead of data
Leaving blank rows
Not backing up original data
Formatting and Organization
Table Creation:
Select data range
Insert → Table
Check "My table has headers"
Benefits: Auto-filtering, structured references
Conditional Formatting:
Highlight cells based on values
Data bars for quick visualization
Color scales for trends
Icon sets for categories
Data Validation:
Restrict input to specific values
Create dropdown lists
Set number ranges
Prevent invalid entries
Examples:
Age validation: Whole number between 0-120
Gender dropdown: Male, Female, Other
Date validation: Between specific dates
Named Ranges
Benefits:
Make formulas easier to read
Reduce errors
Easier to maintain
Creating Named Ranges:
Select range
Formulas → Define Name
Enter descriptive name
Verify scope (Workbook or Worksheet)
Example:
Instead of: =SUM(B2:B100)
Use: =SUM(Sales_Data)
Statistical Functions
AVERAGE Function:
Syntax: =AVERAGE(range)
Example: =AVERAGE(A1:A10)
Ignores text and empty cells
Alternative: =AVERAGEA() includes text as 0
MEDIAN Function:
Syntax: =MEDIAN(range)
Example: =MEDIAN(A1:A10)
Returns middle value
Better for skewed data
MODE Functions:
MODE.SNGL(): Single mode
MODE.MULT(): Multiple modes
Example: =MODE.SNGL(A1:A10)
Standard Deviation:
STDEV.S(): Sample standard deviation
STDEV.P(): Population standard deviation
Example: =STDEV.S(A1:A10)
Variance:
VAR.S(): Sample variance
VAR.P(): Population variance
Example: =VAR.S(A1:A10)
COUNT Functions:
COUNT(): Numbers only
COUNTA(): Non-empty cells
COUNTBLANK(): Empty cells
COUNTIF(): Conditional counting
COUNTIF Examples:
=COUNTIF(A1:A10,">50") - Count values > 50
=COUNTIF(A1:A10,"Male") - Count "Male" entries
=COUNTIF(A1:A10,">=100") - Count values ≥ 100
Logical Functions
IF Function:
Syntax: =IF(condition, value_if_true, value_if_false)
Example: =IF(A1>50, "Pass", "Fail")
Multiple Conditions:
AND(): All conditions must be true
OR(): At least one condition must be true
NOT(): Reverse the logic
Examples:
=IF(AND(A1>50, B1>60), "Both Pass", "Fail")
=IF(OR(A1="A", A1="B"), "Good Grade", "Poor Grade")
Nested IF:
=IF(A1>=90, "A", IF(A1>=80, "B", IF(A1>=70, "C", "F")))
IFS Function (Excel 2016+):
=IFS(A1>=90, "A", A1>=80, "B", A1>=70, "C", TRUE, "F")
Lookup Functions
VLOOKUP:
Searches in leftmost column
Returns value from specified column
Syntax: =VLOOKUP(lookup_value, table_array, col_index_num, [range_lookup])
Example: Employee table with ID, Name, Department
=VLOOKUP(101, A2:C100, 2, FALSE) returns the name for ID 101
HLOOKUP:
Searches in top row
Returns value from specified row
Useful for horizontally organized data
INDEX-MATCH Combination:
More flexible than VLOOKUP
Can look left or right
Syntax: =INDEX(return_array, MATCH(lookup_value, lookup_array, 0))
Example: =INDEX(C2:C100, MATCH(101, A2:A100, 0))
XLOOKUP (Excel 365):
Modern replacement for VLOOKUP
Can search in any direction
Better error handling
Pivot Tables
Creating Pivot Tables:
Select data range
Insert → PivotTable
Choose destination (new worksheet recommended)
Drag fields to appropriate areas
Pivot Table Areas:
Rows: Categories to group by
Columns: Sub-categories
Values: Numbers to summarize
Filters: Criteria to filter data
Common Aggregations:
Sum: Total values
Count: Number of records
Average: Mean values
Max/Min: Extreme values
Percentage: Proportion of total
Example Setup: Sales data with columns: Date, Product, Region, Sales
Rows: Product
Columns: Region
Values: Sum of Sales
Filters: Date
Grouping Data:
Dates: Group by month, quarter, year
Numbers: Create ranges (0-100, 101-200)
Text: Manual grouping
Calculated Fields:
Create new metrics
Example: Profit = Sales - Cost
Formula: =Sales-Cost
Pivot Charts:
Visual representation of pivot data
Updates automatically with pivot table
Insert → PivotChart
Data Analysis Toolpak
Enabling Toolpak:
File → Options → Add-ins
Manage Excel Add-ins → Go
Check Analysis ToolPak
Access via Data → Data Analysis
Descriptive Statistics:
Input range: Select data
Output: Summary statistics table
Includes: Mean, median, mode, standard deviation
Histogram:
Creates frequency distribution
Input range: Data values
Bin range: Intervals (optional)
Output: Frequency table and chart
Regression Analysis:
Linear relationship between variables
Input Y range: Dependent variable
Input X range: Independent variable(s)
Output: Regression statistics
Correlation Analysis:
Measures relationship strength between variables
Output: Correlation matrix
Values range from -1 to +1
Interpretation: -1 (perfect negative), 0 (no relation), +1 (perfect positive)
Sampling:
Random sampling from dataset
Periodic sampling (every nth value)
Useful for large datasets
Charts and Graphs
Chart Types and Usage:
Column Charts:
Compare categories
Best for: Discrete categories, limited number of series
Example: Sales by product category
Bar Charts:
Horizontal version of column charts
Best for: Long category names, ranking data
Example: Top 10 customers by revenue
Line Charts:
Show trends over time
Best for: Continuous data, time series
Example: Monthly sales trends
Scatter Plot:
Relationship between two variables
Best for: Correlation analysis, outlier detection
Example: Height vs Weight relationship
Pie Charts:
Parts of a whole
Best for: Few categories (≤7), percentage composition
Example: Market share by company
Area Charts:
Cumulative values over time
Best for: Multiple series, showing total and components
Example: Revenue breakdown by product lines
Chart Formatting Best Practices:
Clear, descriptive titles
Labeled axes with units
Legend placement
Consistent colors
Remove unnecessary gridlines
Appropriate chart size
Dashboard Creation:
Combine multiple charts
Use consistent formatting
Add slicers for interactivity
Include key metrics
Logical layout flow
Database Concepts
Database Definition: A database is organized collection of structured information stored electronically in a computer system.
Key Components:
Tables:
Store data in rows and columns
Each table represents an entity (customers, products, orders)
Example: Customer table with columns: ID, Name, Email, Phone
Records (Rows):
Individual entries in table
Each record represents one instance
Example: One customer's complete information
Fields (Columns):
Attributes of entity
Define data type and constraints
Example: Customer_Name field stores text data
Primary Key:
Unique identifier for each record
Cannot be NULL or a duplicate
Example: Customer_ID, Product_ID
Foreign Key:
Links tables together
References primary key of another table
Maintains referential integrity
Relationships:
One-to-One: Each record in Table A relates to one record in Table B
One-to-Many: One record in Table A relates to multiple records in Table B
Many-to-Many: Multiple records in both tables can relate to each other
SQL Syntax Basics
SQL Structure:
Structured Query Language
Declarative language (what, not how)
Case-insensitive (convention: UPPERCASE for keywords)
Basic Syntax Rules:
Statements end with semicolon (;)
String values in single quotes ('text')
Comments: -- single line, /* multi-line */
SQL Categories:
DDL (Data Definition Language): CREATE, ALTER, DROP
DML (Data Manipulation Language): INSERT, UPDATE, DELETE
DQL (Data Query Language): SELECT
DCL (Data Control Language): GRANT, REVOKE
Data Types
Numeric Types:
INT: Whole numbers (-2,147,483,648 to 2,147,483,647)
BIGINT: Large whole numbers
DECIMAL(p,s): Fixed-point numbers (p=precision, s=scale)
FLOAT: Approximate numbers
String Types:
CHAR(n): Fixed length strings
VARCHAR(n): Variable length strings
TEXT: Large text data
Date/Time Types:
DATE: Date only (YYYY-MM-DD)
TIME: Time only (HH:MM:SS)
DATETIME: Date and time combined
TIMESTAMP: Date and time with timezone
Other Types:
BOOLEAN: TRUE/FALSE values
BLOB: Binary large objects (images, files)
Database Design Principles
Normalization: Process of organizing data to reduce redundancy
First Normal Form (1NF):
Each column contains atomic values
No repeating groups
Each row is unique
Second Normal Form (2NF):
Must be in 1NF
No partial dependencies on primary key
Non-key attributes depend on entire primary key
Third Normal Form (3NF):
Must be in 2NF
No transitive dependencies
Non-key attributes depend only on primary key
Example: Instead of a single table:
| Order_ID | Customer_Name | Customer_City | Product_Name | Price |
Better design:
Customers: Customer_ID, Name, City
Products: Product_ID, Name, Price
Orders: Order_ID, Customer_ID, Product_ID
SELECT Statements
Basic SELECT:
SELECT column1, column2
FROM table_name;
Select All Columns:
SELECT *
FROM customers;
Select Specific Columns:
SELECT customer_name, email, city
FROM customers;
Column Aliases:
SELECT customer_name AS name,
email AS contact_email
FROM customers;
WHERE Clause Conditions
Comparison Operators:
-- Equal to
SELECT * FROM products WHERE price = 100;
-- Not equal to
SELECT * FROM products WHERE price != 100;
SELECT * FROM products WHERE price <> 100;
-- Greater than
SELECT * FROM products WHERE price > 100;
-- Less than or equal to
SELECT * FROM products WHERE price <= 100;
Logical Operators:
-- AND condition
SELECT * FROM customers
WHERE city = 'Mumbai' AND age > 25;
-- OR condition
SELECT * FROM customers
WHERE city = 'Delhi' OR city = 'Mumbai';
-- NOT condition
SELECT * FROM customers
WHERE NOT city = 'Chennai';
Pattern Matching with LIKE:
-- Starts with 'A'
SELECT * FROM customers WHERE customer_name LIKE 'A%';
-- Ends with 'son'
SELECT * FROM customers WHERE customer_name LIKE '%son';
-- Contains 'john'
SELECT * FROM customers WHERE customer_name LIKE '%john%';
-- Exactly 5 characters
SELECT * FROM customers WHERE customer_name LIKE '_____';
Range Conditions:
-- BETWEEN operator
SELECT * FROM products
WHERE price BETWEEN 100 AND 500;
-- IN operator
SELECT * FROM customers
WHERE city IN ('Mumbai', 'Delhi', 'Bangalore');
-- IS NULL
SELECT * FROM customers
WHERE phone IS NULL;
-- IS NOT NULL
SELECT * FROM customers
WHERE email IS NOT NULL;
Sorting with ORDER BY
Ascending Order (Default):
SELECT * FROM customers
ORDER BY customer_name;
SELECT * FROM customers
ORDER BY customer_name ASC;
Descending Order:
SELECT * FROM products
ORDER BY price DESC;
Multiple Column Sorting:
SELECT * FROM customers
ORDER BY city ASC, customer_name DESC;
Sorting by Column Position:
SELECT customer_name, city, age
FROM customers
ORDER BY 2, 1; -- Sort by city, then by name
Aggregate Functions
COUNT Function:
-- Count all rows
SELECT COUNT(*) FROM customers;
-- Count non-null values
SELECT COUNT(email) FROM customers;
-- Count distinct values
SELECT COUNT(DISTINCT city) FROM customers;
SUM Function:
-- Total sales
SELECT SUM(amount) FROM orders;
-- Sum with condition
SELECT SUM(amount) FROM orders
WHERE order_date >= '2024-01-01';
AVG Function:
-- Average price
SELECT AVG(price) FROM products;
-- Average with rounding
SELECT ROUND(AVG(price), 2) FROM products;
MIN and MAX Functions:
-- Minimum and maximum prices
SELECT MIN(price) AS lowest_price,
MAX(price) AS highest_price
FROM products;
GROUP BY Clause
Basic Grouping:
-- Count customers by city
SELECT city, COUNT(*) AS customer_count
FROM customers
GROUP BY city;
Multiple Column Grouping:
-- Sales by region and product category
SELECT region, category, SUM(sales) AS total_sales
FROM sales_data
GROUP BY region, category;
Grouping with Calculations:
-- Average order value by customer
SELECT customer_id,
COUNT(*) AS order_count,
AVG(amount) AS avg_order_value,
SUM(amount) AS total_spent
FROM orders
GROUP BY customer_id;
HAVING Clause
Difference from WHERE:
WHERE filters rows before grouping
HAVING filters groups after aggregation
Examples:
-- Cities with more than 10 customers
SELECT city, COUNT(*) AS customer_count
FROM customers
GROUP BY city
HAVING COUNT(*) > 10;
-- Customers with total purchases > 10000
SELECT customer_id, SUM(amount) AS total_purchases
FROM orders
GROUP BY customer_id
HAVING SUM(amount) > 10000;
Joins
INNER JOIN: Returns only matching records from both tables
-- Customer orders with customer names
SELECT c.customer_name, o.order_id, o.amount
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id;
LEFT JOIN (LEFT OUTER JOIN): Returns all records from left table, matching records from right
-- All customers, including those without orders
SELECT c.customer_name, o.order_id, o.amount
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id;
RIGHT JOIN (RIGHT OUTER JOIN): Returns all records from right table, matching records from left
-- All orders, including those without customer details
SELECT c.customer_name, o.order_id, o.amount
FROM customers c
RIGHT JOIN orders o ON c.customer_id = o.customer_id;
FULL OUTER JOIN: Returns all records when there's a match in either table
-- All customers and orders
SELECT c.customer_name, o.order_id, o.amount
FROM customers c
FULL OUTER JOIN orders o ON c.customer_id = o.customer_id;
Multiple Table Joins:
-- Orders with customer and product details
SELECT c.customer_name, p.product_name, o.quantity, o.amount
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id
INNER JOIN products p ON o.product_id = p.product_id;
Self Join: Join table with itself
-- Employees and their managers
SELECT e1.employee_name AS employee,
e2.employee_name AS manager
FROM employees e1
LEFT JOIN employees e2 ON e1.manager_id = e2.employee_id;
Subqueries
Single Value Subqueries:
-- Customers who spent more than average
SELECT customer_name, total_spent
FROM customers
WHERE total_spent > (SELECT AVG(total_spent) FROM customers);
Multiple Value Subqueries:
-- Products in categories with high sales
SELECT product_name, price
FROM products
WHERE category_id IN (
SELECT category_id
FROM sales_summary
WHERE total_sales > 100000
);
Correlated Subqueries: The inner query references the outer query's current row, so it is re-evaluated for each row
-- Customers with above-average orders for their city
SELECT customer_name, city, order_amount
FROM orders o1
WHERE order_amount > (
SELECT AVG(order_amount)
FROM orders o2
WHERE o2.city = o1.city
);
EXISTS Operator:
-- Customers who have placed orders
SELECT customer_name
FROM customers c
WHERE EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id
);
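These subquery patterns can be exercised end to end with sqlite3; the rows below are invented for illustration (Bob has no orders, and the average order amount works out to 575.0):

```python
import sqlite3

# Invented sample data: Bob has placed no orders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, customer_name TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    INSERT INTO orders VALUES (101, 1, 900.0), (102, 3, 250.0);
""")

# EXISTS: customers with at least one order.
with_orders = conn.execute("""
    SELECT customer_name FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id)
    ORDER BY customer_name
""").fetchall()

# Single-value subquery: orders larger than the average amount (575.0 here).
big_orders = conn.execute("""
    SELECT order_id FROM orders
    WHERE amount > (SELECT AVG(amount) FROM orders)
""").fetchall()
print(with_orders)  # [('Alice',), ('Carol',)]
print(big_orders)   # [(101,)]
conn.close()
```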
Window Functions
ROW_NUMBER(): Assigns a unique sequential number to each row
-- Rank customers by total spending
-- (RANK is a reserved keyword in MySQL 8+, so prefer a different alias)
SELECT customer_name, total_spent,
ROW_NUMBER() OVER (ORDER BY total_spent DESC) AS spending_rank
FROM customers;
RANK() and DENSE_RANK():
-- Rank with ties: RANK leaves gaps after tied values, DENSE_RANK does not
SELECT customer_name, total_spent,
RANK() OVER (ORDER BY total_spent DESC) AS spend_rank,
DENSE_RANK() OVER (ORDER BY total_spent DESC) AS dense_spend_rank
FROM customers;
PARTITION BY:
-- Rank customers within each city
SELECT customer_name, city, total_spent,
ROW_NUMBER() OVER (PARTITION BY city ORDER BY total_spent DESC) AS city_rank
FROM customers;
Aggregate Window Functions:
-- Running total of sales
SELECT order_date, daily_sales,
SUM(daily_sales) OVER (ORDER BY order_date) AS running_total
FROM daily_sales_summary;
LAG and LEAD:
-- Compare with previous month
SELECT month, sales,
LAG(sales) OVER (ORDER BY month) AS prev_month_sales,
sales - LAG(sales) OVER (ORDER BY month) AS growth
FROM monthly_sales;
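LAG and running totals can be combined in one query. A runnable sketch with sqlite3 (window functions need SQLite 3.25+, which the sqlite3 module bundled with Python 3.8+ usually provides); the monthly figures are made up:

```python
import sqlite3

# Made-up monthly sales; requires SQLite 3.25+ for window functions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE monthly_sales (month TEXT, sales INTEGER);
    INSERT INTO monthly_sales VALUES
        ('2024-01', 100), ('2024-02', 150), ('2024-03', 120);
""")

# LAG pulls the previous row's value; SUM ... OVER (ORDER BY ...) is a
# cumulative running total.
rows = conn.execute("""
    SELECT month, sales,
           LAG(sales) OVER (ORDER BY month) AS prev_sales,
           SUM(sales) OVER (ORDER BY month) AS running_total
    FROM monthly_sales
""").fetchall()
print(rows)
# [('2024-01', 100, None, 100), ('2024-02', 150, 100, 250), ('2024-03', 120, 150, 370)]
conn.close()
```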
Case Statements
Simple CASE:
-- Categorize customers by spending
SELECT customer_name, total_spent,
CASE
WHEN total_spent > 10000 THEN 'High Value'
WHEN total_spent > 5000 THEN 'Medium Value'
ELSE 'Low Value'
END AS customer_category
FROM customers;
Searched CASE:
-- Complex conditions
SELECT product_name, price, stock_quantity,
CASE
WHEN stock_quantity = 0 THEN 'Out of Stock'
WHEN stock_quantity < 10 THEN 'Low Stock'
WHEN price > 1000 AND stock_quantity > 50 THEN 'Premium Available'
ELSE 'In Stock'
END AS stock_status
FROM products;
CASE in Aggregations:
-- Conditional counting
SELECT
COUNT(CASE WHEN age < 30 THEN 1 END) AS young_customers,
COUNT(CASE WHEN age BETWEEN 30 AND 50 THEN 1 END) AS middle_age,
COUNT(CASE WHEN age > 50 THEN 1 END) AS senior_customers
FROM customers;
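The conditional-counting trick works because a CASE with no ELSE yields NULL, and COUNT ignores NULLs. A runnable sketch with sqlite3 and made-up ages:

```python
import sqlite3

# Made-up ages; CASE with no ELSE yields NULL, which COUNT skips.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("A", 22), ("B", 45), ("C", 61), ("D", 35)],
)
row = conn.execute("""
    SELECT
        COUNT(CASE WHEN age < 30 THEN 1 END) AS young_customers,
        COUNT(CASE WHEN age BETWEEN 30 AND 50 THEN 1 END) AS middle_age,
        COUNT(CASE WHEN age > 50 THEN 1 END) AS senior_customers
    FROM customers
""").fetchone()
print(row)  # (1, 2, 1)
conn.close()
```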
Python Installation and Setup
Python Installation:
Download from python.org
Choose Python 3.8+ version
Check "Add Python to PATH"
Verify installation: python --version
IDE Options:
Jupyter Notebook: Interactive development
PyCharm: Professional IDE
VS Code: Lightweight, extensible
Spyder: Scientific computing focus
Package Management:
# Install pip (usually included)
python -m pip install --upgrade pip
# Install packages
pip install pandas numpy matplotlib seaborn
# Install from requirements file
pip install -r requirements.txt
# Create virtual environment
python -m venv data_analysis_env
# Activate it (Windows): data_analysis_env\Scripts\activate
# Activate it (macOS/Linux): source data_analysis_env/bin/activate
Basic Syntax and Data Types
Variables:
# Variable assignment
name = "John"
age = 25
height = 5.8
is_student = True
# Multiple assignment
x, y, z = 1, 2, 3
Data Types:
# Numeric types
integer_num = 42
float_num = 3.14
complex_num = 2 + 3j
# String
text = "Hello, World!"
multiline = """This is
a multiline
string"""
# Boolean
is_true = True
is_false = False
# None type
empty_value = None
Type Checking:
print(type(age)) # <class 'int'>
print(isinstance(age, int)) # True
String Operations:
# String methods
name = " John Doe "
print(name.lower()) # " john doe "
print(name.upper()) # " JOHN DOE "
print(name.strip()) # "John Doe"
print(name.replace("John", "Jane")) # " Jane Doe "
# String formatting
age = 25
message = f"I am {age} years old" # f-string
message = "I am {} years old".format(age) # format method
Lists:
# List creation
numbers = [1, 2, 3, 4, 5]
mixed_list = [1, "hello", 3.14, True]
empty_list = []
# List operations
numbers.append(6) # Add element
numbers.insert(0, 0) # Insert at position
numbers.remove(3) # Remove specific value
popped = numbers.pop() # Remove and return last element
# List slicing
print(numbers[0]) # First element
print(numbers[-1]) # Last element
print(numbers[1:4]) # Elements from index 1 to 3
print(numbers[:3]) # First 3 elements
print(numbers[2:]) # From index 2 to end
Dictionaries:
# Dictionary creation
student = {
"name": "John",
"age": 25,
"grades": [85, 90, 78]
}
# Dictionary operations
print(student["name"]) # Access value
student["city"] = "Mumbai" # Add new key-value
student.update({"phone": "123"}) # Update multiple
# Dictionary methods
print(student.keys()) # All keys
print(student.values()) # All values
print(student.items()) # Key-value pairs
Control Structures
Conditional Statements:
# if-elif-else
score = 85
if score >= 90:
    grade = "A"
elif score >= 80:
    grade = "B"
elif score >= 70:
    grade = "C"
else:
    grade = "F"
# Ternary operator
grade = "Pass" if score >= 60 else "Fail"
# Multiple conditions (attendance defined here so the example runs)
attendance = 80
if score >= 80 and attendance >= 75:
    print("Eligible for scholarship")
Loops:
# for loop with range
for i in range(5):        # 0 to 4
    print(i)
for i in range(1, 11):    # 1 to 10
    print(i)
for i in range(0, 10, 2): # 0, 2, 4, 6, 8
    print(i)
# for loop with lists
fruits = ["apple", "banana", "orange"]
for fruit in fruits:
    print(fruit)
# for loop with enumerate
for index, fruit in enumerate(fruits):
    print(f"{index}: {fruit}")
# while loop
count = 0
while count < 5:
    print(count)
    count += 1
# Loop control
for i in range(10):
    if i == 3:
        continue  # Skip this iteration
    if i == 7:
        break     # Exit the loop
    print(i)
List Comprehensions:
# Basic list comprehension
squares = [x**2 for x in range(10)]
# With condition
even_squares = [x**2 for x in range(10) if x % 2 == 0]
# Nested comprehension
matrix = [[i*j for j in range(3)] for i in range(3)]
Functions and Modules
Function Definition:
# Basic function
def greet(name):
    return f"Hello, {name}!"
# Function with default parameters
def calculate_area(length, width=1):
    return length * width
# Function with multiple return values
def get_name_age():
    return "John", 25
name, age = get_name_age()
# Function with variable positional arguments
def sum_numbers(*args):
    return sum(args)
result = sum_numbers(1, 2, 3, 4, 5)
# Function with keyword arguments
def create_profile(**kwargs):
    return kwargs
profile = create_profile(name="John", age=25, city="Mumbai")
Lambda Functions:
# Lambda function
square = lambda x: x**2
print(square(5)) # 25
# Lambda with map
numbers = [1, 2, 3, 4, 5]
squared = list(map(lambda x: x**2, numbers))
# Lambda with filter
even_numbers = list(filter(lambda x: x % 2 == 0, numbers))
Modules:
# Importing modules
import math
from datetime import datetime
import pandas as pd # Alias
# Using imported functions
print(math.sqrt(16))
print(datetime.now())
# Creating custom module
# File: my_functions.py
def add(a, b):
    return a + b
# Using custom module
# from my_functions import add
# result = add(5, 3)
Array Creation and Manipulation
NumPy Installation and Import:
# Install once: pip install numpy
import numpy as np
Array Creation:
# From list
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.array([[1, 2, 3], [4, 5, 6]])
# Built-in functions
zeros = np.zeros(5) # Array of zeros
ones = np.ones((3, 4)) # 3x4 array of ones
full = np.full((2, 3), 7) # Array filled with 7
empty = np.empty((2, 2)) # Uninitialized array
# Range arrays
range_arr = np.arange(0, 10, 2) # [0, 2, 4, 6, 8]
linspace = np.linspace(0, 1, 5) # 5 evenly spaced values
# Random arrays
random_arr = np.random.random(5) # Random floats 0-1
random_int = np.random.randint(1, 10, 5) # Random integers
normal_dist = np.random.normal(0, 1, 100) # Normal distribution
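Random arrays change on every run, which makes examples and tests hard to reproduce. A short sketch of seeding, using NumPy's Generator API alongside the legacy functions shown above:

```python
import numpy as np

# Seeding makes "random" output reproducible from run to run.
rng = np.random.default_rng(42)       # modern Generator API
sample = rng.integers(1, 10, size=5)  # integers in [1, 10)
floats = rng.random(3)                # floats in [0, 1)

# The legacy np.random.* functions can be seeded too.
np.random.seed(0)
legacy = np.random.randint(1, 10, 5)
print(sample, floats, legacy)
```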
Array Properties:
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr.shape) # (2, 3)
print(arr.size) # 6
print(arr.ndim) # 2
print(arr.dtype) # int64
print(arr.itemsize) # 8 bytes per element
Mathematical Operations
Element-wise Operations:
arr1 = np.array([1, 2, 3, 4])
arr2 = np.array([5, 6, 7, 8])
# Arithmetic operations
addition = arr1 + arr2 # [6, 8, 10, 12]
subtraction = arr1 - arr2 # [-4, -4, -4, -4]
multiplication = arr1 * arr2 # [5, 12, 21, 32]
division = arr2 / arr1 # [5.0, 3.0, 2.33, 2.0]
power = arr1 ** 2 # [1, 4, 9, 16]
# Scalar operations
scalar_mult = arr1 * 3 # [3, 6, 9, 12]
scalar_add = arr1 + 10 # [11, 12, 13, 14]
Mathematical Functions:
arr = np.array([1, 4, 9, 16, 25])
# Square root
sqrt_arr = np.sqrt(arr) # [1, 2, 3, 4, 5]
# Trigonometric functions
angles = np.array([0, np.pi/2, np.pi])
sin_values = np.sin(angles) # [0, 1, 0]
cos_values = np.cos(angles) # [1, 0, -1]
# Logarithmic functions
log_values = np.log(arr) # Natural log
log10_values = np.log10(arr) # Base 10 log
# Exponential
exp_values = np.exp([1, 2, 3]) # [e, e^2, e^3]
Aggregate Functions:
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# Basic aggregations
print(np.sum(arr)) # Sum of all elements
print(np.mean(arr)) # Average
print(np.median(arr)) # Median
print(np.std(arr)) # Standard deviation
print(np.var(arr)) # Variance
print(np.min(arr)) # Minimum
print(np.max(arr)) # Maximum
# Axis-wise operations
print(np.sum(arr, axis=0)) # Sum along columns [12, 15, 18]
print(np.sum(arr, axis=1)) # Sum along rows [6, 15, 24]
print(np.mean(arr, axis=0)) # Column means [4, 5, 6]
Array Indexing and Slicing
Basic Indexing:
arr = np.array([10, 20, 30, 40, 50])
# Single element
print(arr[0]) # 10
print(arr[-1]) # 50
# Slicing
print(arr[1:4]) # [20, 30, 40]
print(arr[:3]) # [10, 20, 30]
print(arr[2:]) # [30, 40, 50]
print(arr[::2]) # [10, 30, 50] - every 2nd element
2D Array Indexing:
arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# Element access
print(arr_2d[0, 1]) # 2
print(arr_2d[1][2]) # 6
# Row/column access
print(arr_2d[0, :]) # First row [1, 2, 3]
print(arr_2d[:, 1]) # Second column [2, 5, 8]
# Subarray
print(arr_2d[0:2, 1:3]) # [[2, 3], [5, 6]]
Boolean Indexing:
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
# Boolean condition
condition = arr > 5
print(condition) # [False, False, False, False, False, True, True, True, True, True]
print(arr[condition]) # [6, 7, 8, 9, 10]
# Direct boolean indexing
print(arr[arr > 5]) # [6, 7, 8, 9, 10]
print(arr[(arr > 3) & (arr < 8)]) # [4, 5, 6, 7]
# Multiple conditions
print(arr[(arr < 3) | (arr > 8)]) # [1, 2, 9, 10]
Fancy Indexing:
arr = np.array([10, 20, 30, 40, 50])
# Index array
indices = [0, 2, 4]
print(arr[indices]) # [10, 30, 50]
# 2D fancy indexing
arr_2d = np.array([[1, 2], [3, 4], [5, 6]])
rows = [0, 2]
cols = [1, 0]
print(arr_2d[rows, cols]) # [2, 5]
Broadcasting
Broadcasting Rules: NumPy automatically stretches arrays of different shapes during arithmetic operations. Shapes are compared from the trailing dimension backwards; two dimensions are compatible when they are equal or one of them is 1.
# Scalar with array
arr = np.array([1, 2, 3, 4])
result = arr + 10 # [11, 12, 13, 14]
# 1D with 2D
arr_1d = np.array([1, 2, 3])
arr_2d = np.array([[10], [20], [30]])
result = arr_1d + arr_2d # [[11, 12, 13], [21, 22, 23], [31, 32, 33]]
# Different shapes
a = np.array([[1, 2, 3]]) # Shape: (1, 3)
b = np.array([[4], [5]]) # Shape: (2, 1)
result = a + b # Shape: (2, 3)
print(result)
# [[5, 6, 7],
# [6, 7, 8]]
Broadcasting Examples:
# Normalize columns
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
column_means = np.mean(data, axis=0) # [4, 5, 6]
normalized = data - column_means # Broadcasting subtraction
# Distance calculation
points = np.array([[1, 2], [3, 4], [5, 6]])
center = np.array([3, 4])
distances = np.sqrt(np.sum((points - center)**2, axis=1))
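The normalization idea above extends to full standardization, and broadcasting lets us verify the result in one line per property. A small self-checking sketch on made-up data:

```python
import numpy as np

# Made-up 3x3 data matrix for illustration.
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0],
                 [7.0, 8.0, 9.0]])

# Column-wise standardization: (x - mean) / std.
# col_mean and col_std have shape (3,) and broadcast across the rows.
col_mean = data.mean(axis=0)
col_std = data.std(axis=0)
standardized = (data - col_mean) / col_std

# After standardization, every column has mean 0 and std 1.
print(np.allclose(standardized.mean(axis=0), 0.0))  # True
print(np.allclose(standardized.std(axis=0), 1.0))   # True
```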
DataFrames and Series
Pandas Installation and Import:
# Install once: pip install pandas
import pandas as pd
import numpy as np
Series Creation:
# From list
series1 = pd.Series([1, 2, 3, 4, 5])
# With custom index
series2 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
# From dictionary
data_dict = {'A': 1, 'B': 2, 'C': 3}
series3 = pd.Series(data_dict)
# Series properties
print(series2.values) # Array of values
print(series2.index) # Index labels
print(series2.dtype) # Data type
print(series2.shape) # Dimensions
DataFrame Creation:
# From dictionary
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'Age': [25, 30, 35, 28],
'City': ['Mumbai', 'Delhi', 'Bangalore', 'Chennai'],
'Salary': [50000, 60000, 70000, 55000]
}
df = pd.DataFrame(data)
# From list of lists
data_list = [
['Alice', 25, 'Mumbai', 50000],
['Bob', 30, 'Delhi', 60000],
['Charlie', 35, 'Bangalore', 70000]
]
df2 = pd.DataFrame(data_list, columns=['Name', 'Age', 'City',