Complete Data Analysis Course with Detailed Answers
Definition of Data Analysis
Data analysis is a systematic process in which raw data is collected, cleaned, analyzed, and converted into meaningful insights. This process helps businesses and organizations make better decisions.
Main stages of data analysis:
Data Collection: Gathering data from various sources
Data Cleaning: Handling errors, duplicates, and missing values
Data Processing: Converting data into a format suitable for analysis
Data Analysis: Finding patterns using statistical methods and algorithms
Data Interpretation: Understanding results in a business context
Reporting: Sharing findings with stakeholders
Importance and Applications
Applications in Business:
Sales Analysis: Identifying which products sell the most
Customer Segmentation: Identifying different customer groups
Market Research: Understanding market trends and consumer behavior
Financial Analysis: Revenue, profit, and cost optimization
Risk Assessment: Identifying and mitigating potential risks
Use Across Industries:
Healthcare: Patient data analysis, drug effectiveness studies
Finance: Credit scoring, fraud detection, investment strategies
E-commerce: Recommendation systems, price optimization
Manufacturing: Quality control, supply chain optimization
Education: Student performance analysis, curriculum improvement
Government: Policy making, resource allocation
Career Opportunities
Job Roles:
Data Analyst: Entry-level position, basic analysis and reporting
Average Salary: ₹3-6 lakh per annum
Skills Required: Excel, SQL, basic statistics
Business Analyst: Solving business problems with data
Average Salary: ₹5-10 lakh per annum
Skills Required: Domain knowledge, communication skills
Data Scientist: Advanced analytics and machine learning
Average Salary: ₹8-20 lakh per annum
Skills Required: Python/R, ML algorithms, statistics
Data Engineer: Data infrastructure and pipelines
Average Salary: ₹6-15 lakh per annum
Skills Required: Programming, databases, cloud platforms
Data vs Information vs Knowledge
Data: Raw facts and figures
Example: Sales figure "₹50,000"
This is an isolated fact with no context.
Information: Processed data with context
Example: "Last month's sales were ₹50,000, a 20% increase over the previous month"
This presents the data in a meaningful context.
Knowledge: Information + experience + insight
Example: "The sales increase is due to festive-season demand; we should increase inventory for next month"
This provides actionable insights.
Quantitative Data
Continuous Data:
Can take any value within a range
May include decimal points
Examples: Height (170.5 cm), Weight (65.3 kg), Temperature (25.7°C)
Measurement: Precise measurements are possible
Analysis: Mean, median, and standard deviation can be calculated
Discrete Data:
Can take only specific values
Usually whole numbers
Examples: Number of students (25), number of cars (5), number of orders (150)
Characteristics: Countable, finite values
Analysis: Frequency distributions, percentages
Qualitative Data
Nominal Data:
Categories with no inherent order
Examples: Gender (Male, Female), Colors (Red, Blue, Green), Marital Status (Married, Single)
Analysis: Frequency counts, percentages, mode
Visualization: Pie charts, bar charts
Ordinal Data:
Categories with a specific order
Examples: Education Level (High School, Bachelor's, Master's), Satisfaction Rating (Poor, Average, Good, Excellent)
Analysis: Median, percentiles, correlation
Visualization: Ordered bar charts, stacked charts
Structured vs Unstructured Data
Structured Data:
Stored in a well-organized format
Easily searchable and analyzable
Examples: Database tables, Excel spreadsheets, CSV files
Characteristics: Organized in rows and columns
Analysis: SQL queries, statistical analysis
Unstructured Data:
No predefined format
Text, images, videos, audio files
Examples: Social media posts, emails, documents, images
Challenges: Requires special processing techniques
Analysis: Natural Language Processing, image recognition
Primary vs Secondary Data
Primary Data:
Directly collected for specific research purpose
Collection Methods: Surveys, interviews, observations, experiments
Advantages: Highly relevant, current, reliable
Disadvantages: Time-consuming, expensive
Examples: Customer survey data, sales transactions, website analytics
Secondary Data:
Already collected by someone else for different purpose
Sources: Government reports, industry studies, academic research
Advantages: Quick access, cost-effective, large datasets
Disadvantages: May not be exactly relevant, could be outdated
Examples: Census data, market research reports, financial statements
Data Collection
Data Collection Methods:
1. Surveys और Questionnaires:
Online surveys (Google Forms, SurveyMonkey)
Phone interviews
Face-to-face interviews
Best Practices: Clear questions, unbiased language, proper sampling
2. Observational Studies:
Direct observation
Hidden observation
Participant observation
Applications: Customer behavior, usability testing
3. Experiments:
Controlled experiments
A/B testing
Randomized controlled trials
Applications: Product testing, website optimization
4. Secondary Data Sources:
Government databases
Industry reports
Academic research
Examples: Census data, economic indicators
Data Cleaning
Common Data Quality Issues:
1. Missing Values:
Identification: Check for blank cells, null values, "N/A" entries
Handling Strategies:
Delete rows with missing values (if small percentage)
Fill with mean/median (for numerical data)
Fill with mode (for categorical data)
Use interpolation for time series data
2. Duplicate Records:
Identification: Look for identical rows or similar entries
Handling: Remove duplicates; keep the most recent or most complete record
3. Inconsistent Data:
Examples: "Male" vs "M" vs "male"
Solutions: Standardize formats, create data dictionaries
4. Outliers:
Identification: Values that are significantly different from others
Detection Methods: Box plots, Z-scores, IQR method
Handling: Remove, transform, or analyze separately
Data Cleaning Process:
Data Profiling: Understand data structure and quality
Data Validation: Check for completeness and accuracy
Data Standardization: Consistent formats and units
Data Transformation: Convert to suitable formats
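The cleaning steps above can be sketched with pandas; the dataset, column names (age, gender, city), and thresholds here are hypothetical, chosen only to show each step:

```python
# A minimal data-cleaning sketch with pandas (hypothetical data).
import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 30, 25, 200],                 # a missing value and an outlier
    "gender": ["Male", "M", "Female", "Male", "male"],  # inconsistent labels
    "city": ["Mumbai", "Delhi", None, "Mumbai", "Delhi"],
})

# 1. Missing values: fill numeric with the median, categorical with the mode
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# 2. Inconsistent data: standardize labels ("Male" vs "M" vs "male")
df["gender"] = df["gender"].str.lower().map({"male": "Male", "m": "Male", "female": "Female"})

# 3. Duplicates: drop fully identical rows
df = df.drop_duplicates()

# 4. Outliers: flag values outside 1.5 * IQR
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]
```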
Data Exploration
Exploratory Data Analysis (EDA) Steps:
1. Univariate Analysis:
Analyze each variable individually
For Numerical Variables: Histograms, box plots, summary statistics
For Categorical Variables: Frequency tables, bar charts
2. Bivariate Analysis:
Relationship between two variables
Numerical vs Numerical: Scatter plots, correlation analysis
Categorical vs Numerical: Box plots, group comparisons
Categorical vs Categorical: Cross-tabulation, chi-square tests
3. Multivariate Analysis:
Analyze multiple variables together
Techniques: Correlation matrices, heatmaps, parallel coordinates
Key Questions to Answer:
What is the distribution of each variable?
Are there any patterns or trends?
What are the relationships between variables?
Are there any anomalies or outliers?
Data Analysis Techniques
Descriptive Analysis:
Summarize and describe data characteristics
Measures: Mean, median, mode, standard deviation
Purpose: Understanding the current state
Diagnostic Analysis:
Understand why something happened
Techniques: Drill-down analysis, root cause analysis
Purpose: Identify causes of problems or trends
Predictive Analysis:
Forecast future outcomes
Techniques: Regression analysis, time series forecasting
Purpose: Planning and decision making
Prescriptive Analysis:
Recommend actions to take
Techniques: Optimization algorithms, simulation
Purpose: Actionable insights and recommendations
Data Visualization
Chart Selection Guidelines:
For Comparisons:
Bar charts: Compare categories
Column charts: Compare over time
Horizontal bar charts: Long category names
For Relationships:
Scatter plots: Correlation between two variables
Line charts: Trends over time
Bubble charts: Three-dimensional relationships
For Compositions:
Pie charts: Parts of a whole (limited categories)
Stacked bar charts: Components over time
Area charts: Cumulative values
For Distributions:
Histograms: Frequency distribution
Box plots: Outliers and quartiles
Violin plots: Distribution shape
Visualization Best Practices:
Choose appropriate chart type
Use clear and descriptive titles
Label axes properly
Use consistent colors
Avoid 3D effects unnecessarily
Consider color-blind accessibility
Interpretation and Reporting
Key Elements of Good Analysis:
1. Context Setting:
Business background
Objectives and goals
Data sources and limitations
2. Methodology:
Analysis approach
Tools and techniques used
Assumptions made
3. Findings:
Key insights discovered
Statistical significance
Practical significance
4. Recommendations:
Actionable steps
Expected outcomes
Implementation timeline
5. Limitations:
Data quality issues
Methodological constraints
Scope limitations
Reporting Formats:
Executive summary for leadership
Detailed technical report for analysts
Dashboard for ongoing monitoring
Presentation for stakeholders
Measures of Central Tendency
Mean:
Definition: Sum of all values divided by the number of values
Formula:
Mean = (Sum of all values) / (Number of values)
Mean = (x₁ + x₂ + x₃ + ... + xₙ) / n
Example: Marks of 5 students: 85, 90, 78, 92, 85
Mean = (85 + 90 + 78 + 92 + 85) / 5 = 430 / 5 = 86
Types of Mean:
Arithmetic Mean: Regular average
Geometric Mean: Used for growth rates
Formula: ⁿ√(x₁ × x₂ × ... × xₙ)
Harmonic Mean: Used for rates
Formula: n / (1/x₁ + 1/x₂ + ... + 1/xₙ)
When to Use Mean:
Data is normally distributed
No extreme outliers present
Need to use all data points
Advantages:
Uses all data points
Algebraically defined
Unique value
Disadvantages:
Affected by outliers
May not represent actual data point
Median:
Definition: The middle value when data is arranged in ascending order
Calculation Steps:
Arrange the data in ascending order
If n is odd: Median = middle value
If n is even: Median = average of the two middle values
Example 1 (odd number of values):
Data: 12, 15, 18, 20, 25
Median = 18 (middle value)
Example 2 (even number of values):
Data: 10, 15, 20, 25, 30, 35
Median = (20 + 25) / 2 = 22.5
When to Use Median:
Data has outliers
Skewed distribution
Ordinal data
Want typical value
Advantages:
Not affected by outliers
Easy to understand
Works with ordinal data
Disadvantages:
Ignores extreme values
Less stable than mean
Mode:
Definition: Most frequently occurring value in dataset
Examples:
Data: 2, 3, 3, 4, 5, 5, 5, 6
Mode = 5 (appears 3 times)
Types:
Unimodal: One mode
Bimodal: Two modes
Multimodal: More than two modes
No Mode: All values appear once
When to Use Mode:
Categorical data
Want most common value
Discrete data
Quality control
Advantages:
Easy to identify
Not affected by outliers
Useful for categorical data
Disadvantages:
May not exist
May not be unique
Ignores other values
Comparison and Selection:
Normal Distribution: Mean = Median = Mode
Right Skewed: Mean > Median > Mode
Left Skewed: Mean < Median < Mode
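The worked examples for mean, median, and mode above can be verified with Python's built-in statistics module:

```python
# Verifying the mean, median, and mode examples with the statistics module.
import statistics

marks = [85, 90, 78, 92, 85]
mean = statistics.mean(marks)               # 86

odd_data = [12, 15, 18, 20, 25]
even_data = [10, 15, 20, 25, 30, 35]
median_odd = statistics.median(odd_data)    # 18 (middle value)
median_even = statistics.median(even_data)  # 22.5 (average of the two middle values)

mode_data = [2, 3, 3, 4, 5, 5, 5, 6]
mode = statistics.mode(mode_data)           # 5 (appears 3 times)
```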
Measures of Variability
Range:
Definition: Difference between the maximum and minimum values
Formula: Range = Maximum value - Minimum value
Example:
Data: 10, 15, 20, 25, 30
Range = 30 - 10 = 20
Advantages:
Easy to calculate
Quick measure of spread
Intuitive understanding
Disadvantages:
Only uses two values
Affected by outliers
Doesn't show distribution shape
Variance:
Definition: Average of squared differences from mean
Population Variance Formula: σ² = Σ(xi - μ)² / N
Sample Variance Formula: s² = Σ(xi - x̄)² / (n-1)
Calculation Steps:
Calculate mean
Find difference of each value from mean
Square each difference
Sum all squared differences
Divide by N (population) or n-1 (sample)
Example (population variance):
Data: 2, 4, 6, 8, 10
Mean = 6
Value | Difference | Squared Difference
2     | -4         | 16
4     | -2         | 4
6     |  0         | 0
8     |  2         | 4
10    |  4         | 16
Population variance = (16 + 4 + 0 + 4 + 16) / 5 = 40 / 5 = 8
Standard Deviation:
Definition: Square root of variance
Formula:
Population: σ = √σ²
Sample: s = √s²
From the example above: Standard Deviation = √8 ≈ 2.83
Interpretation:
About 68% values within 1 SD of mean
About 95% values within 2 SD of mean
About 99.7% values within 3 SD of mean
When to Use Standard Deviation:
Normal distribution
Comparing variability
Quality control
Risk assessment
Coefficient of Variation:
Definition: Relative measure of variability
Formula: CV = (Standard Deviation / Mean) × 100%
Example: If Mean = 50, SD = 10 CV = (10/50) × 100% = 20%
Uses:
Compare variability between different datasets
Different units or scales
Relative risk assessment
Interpretation:
CV < 15%: Low variability
CV 15-35%: Moderate variability
CV > 35%: High variability
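The variance and standard deviation example above (data 2, 4, 6, 8, 10) can be checked in Python. Note that pvariance divides by N, matching the worked example, while variance divides by n-1:

```python
# Checking the variance, standard deviation, and CV calculations.
import statistics

data = [2, 4, 6, 8, 10]
mean = statistics.mean(data)                 # 6
pop_variance = statistics.pvariance(data)    # 8   (divides by N)
pop_sd = statistics.pstdev(data)             # √8 ≈ 2.83
sample_variance = statistics.variance(data)  # 10  (divides by n-1)

# Coefficient of variation: relative spread as a percentage
cv = pop_sd / mean * 100                     # ≈ 47.14%
```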
Definition of Probability
Definition: Probability is a measure of the likelihood of an event occurring
Mathematical Definition: P(Event) = Number of favorable outcomes / Total number of possible outcomes
Range: 0 ≤ P(Event) ≤ 1
P = 0: Impossible event
P = 1: Certain event
P = 0.5: Equally likely
Example: Rolling a die, the probability of getting a 3: P(3) = 1/6 ≈ 0.167, or 16.7%
Types of Events
Independent Events:
Outcome of one doesn't affect the other
Example: Coin flips, die rolls
P(A and B) = P(A) × P(B)
Dependent Events:
Outcome of one affects the other
Example: Drawing cards without replacement
P(A and B) = P(A) × P(B|A)
Mutually Exclusive Events:
Cannot occur simultaneously
Example: Getting heads or tails
P(A and B) = 0
P(A or B) = P(A) + P(B)
Complementary Events:
One or the other must occur
Example: Pass or fail
P(A) + P(A') = 1
P(A') = 1 - P(A)
Probability Rules
Addition Rule:
For Mutually Exclusive Events: P(A or B) = P(A) + P(B)
For Non-Mutually Exclusive Events: P(A or B) = P(A) + P(B) - P(A and B)
Example: Drawing a card, P(King or Heart):
P(King) = 4/52, P(Heart) = 13/52, P(King of Hearts) = 1/52
P(King or Heart) = 4/52 + 13/52 - 1/52 = 16/52
Multiplication Rule:
For Independent Events: P(A and B) = P(A) × P(B)
For Dependent Events: P(A and B) = P(A) × P(B|A)
Example: Two coin flips, P(Heads and Heads):
P(H and H) = P(H) × P(H) = 0.5 × 0.5 = 0.25
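Both rules can be verified with exact fractions, avoiding any rounding:

```python
# Addition rule (card example) and multiplication rule (coin example)
# checked with exact rational arithmetic.
from fractions import Fraction

# Addition rule for non-mutually exclusive events
p_king = Fraction(4, 52)
p_heart = Fraction(13, 52)
p_king_of_hearts = Fraction(1, 52)
p_king_or_heart = p_king + p_heart - p_king_of_hearts   # 16/52 = 4/13

# Multiplication rule for independent events: two coin flips
p_heads = Fraction(1, 2)
p_two_heads = p_heads * p_heads                          # 1/4
```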
Conditional Probability
Definition: Probability of event A given that event B has occurred
Formula: P(A|B) = P(A and B) / P(B)
Example: In a class, 60% of students pass math and 40% pass both math and science.
P(Science|Math) = P(Math and Science) / P(Math) = 0.4 / 0.6 ≈ 0.67
Tree Diagrams:
Visual representation of conditional probabilities
Branches represent different outcomes
Probabilities multiply along branches
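The classroom example above reduces to a two-line calculation:

```python
# Conditional probability: P(Science | Math) from the classroom example.
p_math = 0.6               # P(Math)
p_math_and_science = 0.4   # P(Math and Science)

p_science_given_math = p_math_and_science / p_math   # 0.4 / 0.6 ≈ 0.67
```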
Bayes' Theorem
Formula: P(A|B) = P(B|A) × P(A) / P(B)
Components:
P(A|B): Posterior probability
P(B|A): Likelihood
P(A): Prior probability
P(B): Marginal probability
Medical Test Example:
Disease prevalence: 1% (prior)
Test accuracy: 95% (sensitivity)
False positive rate: 5%
Question: If test is positive, what's probability of having disease?
Solution: P(Disease|Positive) = P(Positive|Disease) × P(Disease) / P(Positive)
P(Positive) = P(Positive|Disease) × P(Disease) + P(Positive|No Disease) × P(No Disease)
P(Positive) = 0.95 × 0.01 + 0.05 × 0.99 = 0.0095 + 0.0495 = 0.059
P(Disease|Positive) = 0.95 × 0.01 / 0.059 ≈ 0.161, or 16.1%
Applications:
Medical diagnosis
Spam filtering
Machine learning
Business decision making
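The medical-test example can be reproduced step by step:

```python
# Bayes' theorem: the medical-test example from the text.
p_disease = 0.01               # prior: disease prevalence
p_pos_given_disease = 0.95     # likelihood: test sensitivity
p_pos_given_no_disease = 0.05  # false positive rate

# Marginal probability of a positive test (law of total probability)
p_positive = (p_pos_given_disease * p_disease
              + p_pos_given_no_disease * (1 - p_disease))     # 0.059

# Posterior: probability of disease given a positive test
p_disease_given_positive = (p_pos_given_disease * p_disease
                            / p_positive)                      # ≈ 0.161
```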
Normal Distribution
Characteristics:
Bell-shaped curve
Symmetrical around mean
Mean = Median = Mode
68-95-99.7 rule
Parameters:
μ (mu): Population mean
σ (sigma): Population standard deviation
Standard Normal Distribution:
Mean = 0, Standard Deviation = 1
Any normal distribution can be standardized
Z-score = (X - μ) / σ
Z-Score Interpretation:
Z = 0: Value equals mean
Z = 1: Value is 1 SD above mean
Z = -1: Value is 1 SD below mean
Example: Height follows a normal distribution with μ = 170 cm, σ = 10 cm
Person's height = 185 cm
Z-score = (185 - 170) / 10 = 1.5
Applications:
Quality control
Test scores
Measurement errors
Natural phenomena
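The z-score example, plus the 68% figure of the 68-95-99.7 rule derived from the normal CDF using only the standard library:

```python
# Z-score for the height example: x = 185 with mu = 170, sigma = 10.
mu, sigma = 170, 10
x = 185
z = (x - mu) / sigma   # 1.5

# Standard normal CDF via the error function (no external libraries).
from math import erf, sqrt

def normal_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

# Share of values within 1 standard deviation of the mean
within_1_sd = normal_cdf(1) - normal_cdf(-1)   # ≈ 0.68
```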
Binomial Distribution
Definition: Probability distribution for number of successes in fixed number of trials
Conditions:
Fixed number of trials (n)
Each trial has two outcomes (success/failure)
Probability of success (p) remains constant
Trials are independent
Parameters:
n: Number of trials
p: Probability of success
q = 1-p: Probability of failure
Formula: P(X = k) = C(n,k) × p^k × q^(n-k)
where C(n,k) = n! / (k!(n-k)!)
Mean: μ = n × p
Variance: σ² = n × p × q
Example: Flip a coin 10 times; the probability of exactly 6 heads:
n = 10, p = 0.5, k = 6
P(X = 6) = C(10,6) × (0.5)^6 × (0.5)^4 = 210 × 0.015625 × 0.0625 ≈ 0.205
Applications:
Quality control
Survey sampling
Medical trials
Marketing campaigns
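The coin-flip example, computed directly from the binomial formula with math.comb:

```python
# Binomial PMF: probability of exactly 6 heads in 10 fair coin flips.
from math import comb

n, p, k = 10, 0.5, 6
p_six_heads = comb(n, k) * p**k * (1 - p)**(n - k)   # 210 / 1024 ≈ 0.205

# Mean and variance of the distribution
mean = n * p               # 5.0
variance = n * p * (1 - p) # 2.5
```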
Poisson Distribution
Definition: Probability of number of events in fixed time interval
Conditions:
Events occur randomly
Average rate is constant
Events are independent
Parameter:
λ (lambda): Average number of events
Formula: P(X = k) = (e^(-λ) × λ^k) / k!
Mean: μ = λ
Variance: σ² = λ
Example: A bank averages 3 customers per hour. Probability of exactly 5 customers in the next hour:
λ = 3, k = 5
P(X = 5) = (e^(-3) × 3^5) / 5! = (0.0498 × 243) / 120 ≈ 0.101
Applications:
Call center arrivals
Manufacturing defects
Website traffic
Natural disasters
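The bank-customer example as a direct calculation of the Poisson formula:

```python
# Poisson PMF: P(exactly 5 customers) with an average rate of 3 per hour.
from math import exp, factorial

lam, k = 3, 5
p_five = exp(-lam) * lam**k / factorial(k)   # ≈ 0.101
```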
Central Limit Theorem
Statement: Distribution of sample means approaches normal distribution as sample size increases
Key Points:
Works regardless of population distribution
Sample size ≥ 30 generally sufficient
Sampling distribution has same mean as population
Standard error = σ / √n
Implications:
Can use normal distribution for inference
Larger samples give more accurate estimates
Foundation for hypothesis testing
Example: Population with any distribution, μ = 100, σ = 15
Sample size = 36
Sampling distribution of the mean: Normal with μ = 100, standard error = 15/√36 = 2.5
Applications:
Quality control
Market research
Political polling
A/B testing
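A quick simulation illustrates the theorem: sample means drawn from a uniform (clearly non-normal) population cluster around the population mean, with spread close to σ/√n. The population (uniform on 0-100) and simulation sizes here are chosen only for illustration:

```python
# Central Limit Theorem simulation with the standard library.
import random
import statistics

random.seed(42)
population_mean = 50   # mean of uniform(0, 100)
n = 36                 # sample size

# Draw many samples of size n and record each sample's mean
sample_means = [
    statistics.mean(random.uniform(0, 100) for _ in range(n))
    for _ in range(2000)
]

observed_mean = statistics.mean(sample_means)
observed_se = statistics.stdev(sample_means)
expected_se = (100 / 12 ** 0.5) / n ** 0.5   # sigma / sqrt(n) ≈ 4.81
```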
Data Entry Best Practices
Column Headers:
Use clear, descriptive names
Avoid spaces (use underscores: first_name)
Keep headers in first row
Use consistent naming convention
Data Types:
Numbers: Right-aligned, no text mixing
Dates: Use standard date format (DD/MM/YYYY)
Text: Left-aligned, consistent spelling
Boolean: TRUE/FALSE or Yes/No
Data Organization:
One data type per column
No merged cells in data area
No blank rows or columns within data
Keep raw data separate from analysis
Common Mistakes to Avoid:
Mixing data types in same column
Using formatting instead of data
Leaving blank rows
Not backing up original data
Formatting and Organization
Table Creation:
Select data range
Insert → Table
Check "My table has headers"
Benefits: Auto-filtering, structured references
Conditional Formatting:
Highlight cells based on values
Data bars for quick visualization
Color scales for trends
Icon sets for categories
Data Validation:
Restrict input to specific values
Create dropdown lists
Set number ranges
Prevent invalid entries
Examples:
Age validation: Whole number between 0-120
Gender dropdown: Male, Female, Other
Date validation: Between specific dates
Named Ranges
Benefits:
Make formulas easier to read
Reduce errors
Easier to maintain
Creating Named Ranges:
Select range
Formulas → Define Name
Enter descriptive name
Verify scope (Workbook or Worksheet)
Example:
Instead of: =SUM(B2:B100)
Use: =SUM(Sales_Data)
Statistical Functions
AVERAGE Function:
Syntax: =AVERAGE(range)
Example: =AVERAGE(A1:A10)
Ignores text and empty cells
Alternative: =AVERAGEA() includes text as 0
MEDIAN Function:
Syntax: =MEDIAN(range)
Example: =MEDIAN(A1:A10)
Returns middle value
Better for skewed data
MODE Functions:
MODE.SNGL(): Single mode
MODE.MULT(): Multiple modes
Example: =MODE.SNGL(A1:A10)
Standard Deviation:
STDEV.S(): Sample standard deviation
STDEV.P(): Population standard deviation
Example: =STDEV.S(A1:A10)
Variance:
VAR.S(): Sample variance
VAR.P(): Population variance
Example: =VAR.S(A1:A10)
COUNT Functions:
COUNT(): Numbers only
COUNTA(): Non-empty cells
COUNTBLANK(): Empty cells
COUNTIF(): Conditional counting
COUNTIF Examples:
=COUNTIF(A1:A10,">50") - Count values > 50
=COUNTIF(A1:A10,"Male") - Count "Male" entries
=COUNTIF(A1:A10,">=100") - Count values ≥ 100
Logical Functions
IF Function:
Syntax: =IF(condition, value_if_true, value_if_false)
Example: =IF(A1>50, "Pass", "Fail")
Multiple Conditions:
AND(): All conditions must be true
OR(): At least one condition must be true
NOT(): Reverse the logic
Examples:
=IF(AND(A1>50, B1>60), "Both Pass", "Fail")
=IF(OR(A1="A", A1="B"), "Good Grade", "Poor Grade")
Nested IF:
=IF(A1>=90, "A", IF(A1>=80, "B", IF(A1>=70, "C", "F")))
IFS Function (Excel 2016+):
=IFS(A1>=90, "A", A1>=80, "B", A1>=70, "C", TRUE, "F")
Lookup Functions
VLOOKUP:
Searches in leftmost column
Returns value from specified column
Syntax: =VLOOKUP(lookup_value, table_array, col_index_num, [range_lookup])
Example: Employee table with ID, Name, Department
=VLOOKUP(101, A2:C100, 2, FALSE) returns the name for ID 101
HLOOKUP:
Searches in top row
Returns value from specified row
Useful for horizontally organized data
INDEX-MATCH Combination:
More flexible than VLOOKUP
Can look left or right
Syntax: =INDEX(return_array, MATCH(lookup_value, lookup_array, 0))
Example: =INDEX(C2:C100, MATCH(101, A2:A100, 0))
XLOOKUP (Excel 365):
Modern replacement for VLOOKUP
Can search in any direction
Better error handling
Pivot Tables
Creating Pivot Tables:
Select data range
Insert → PivotTable
Choose destination (new worksheet recommended)
Drag fields to appropriate areas
Pivot Table Areas:
Rows: Categories to group by
Columns: Sub-categories
Values: Numbers to summarize
Filters: Criteria to filter data
Common Aggregations:
Sum: Total values
Count: Number of records
Average: Mean values
Max/Min: Extreme values
Percentage: Proportion of total
Example Setup: Sales data with columns: Date, Product, Region, Sales
Rows: Product
Columns: Region
Values: Sum of Sales
Filters: Date
Grouping Data:
Dates: Group by month, quarter, year
Numbers: Create ranges (0-100, 101-200)
Text: Manual grouping
Calculated Fields:
Create new metrics
Example: Profit = Sales - Cost
Formula: =Sales-Cost
Pivot Charts:
Visual representation of pivot data
Updates automatically with pivot table
Insert → PivotChart
Data Analysis Toolpak
Enabling Toolpak:
File → Options → Add-ins
Manage Excel Add-ins → Go
Check Analysis ToolPak
Access via Data → Data Analysis
Descriptive Statistics:
Input range: Select data
Output: Summary statistics table
Includes: Mean, median, mode, standard deviation
Histogram:
Creates frequency distribution
Input range: Data values
Bin range: Intervals (optional)
Output: Frequency table and chart
Regression Analysis:
Linear relationship between variables
Input Y range: Dependent variable
Input X range: Independent variable(s)
Output: Regression statistics
Correlation Analysis:
Measures relationship strength between variables
Output: Correlation matrix
Values range from -1 to +1
Interpretation: -1 (perfect negative), 0 (no relation), +1 (perfect positive)
Sampling:
Random sampling from dataset
Periodic sampling (every nth value)
Useful for large datasets
Charts and Graphs
Chart Types and Usage:
Column Charts:
Compare categories
Best for: Discrete categories, limited number of series
Example: Sales by product category
Bar Charts:
Horizontal version of column charts
Best for: Long category names, ranking data
Example: Top 10 customers by revenue
Line Charts:
Show trends over time
Best for: Continuous data, time series
Example: Monthly sales trends
Scatter Plot:
Relationship between two variables
Best for: Correlation analysis, outlier detection
Example: Height vs Weight relationship
Pie Charts:
Parts of a whole
Best for: Few categories (≤7), percentage composition
Example: Market share by company
Area Charts:
Cumulative values over time
Best for: Multiple series, showing total and components
Example: Revenue breakdown by product lines
Chart Formatting Best Practices:
Clear, descriptive titles
Labeled axes with units
Legend placement
Consistent colors
Remove unnecessary gridlines
Appropriate chart size
Dashboard Creation:
Combine multiple charts
Use consistent formatting
Add slicers for interactivity
Include key metrics
Logical layout flow
Database Concepts
Database Definition: A database is organized collection of structured information stored electronically in a computer system.
Key Components:
Tables:
Store data in rows and columns
Each table represents an entity (customers, products, orders)
Example: Customer table with columns: ID, Name, Email, Phone
Records (Rows):
Individual entries in table
Each record represents one instance
Example: One customer's complete information
Fields (Columns):
Attributes of entity
Define data type and constraints
Example: Customer_Name field stores text data
Primary Key:
Unique identifier for each record
Cannot be NULL or a duplicate
Example: Customer_ID, Product_ID
Foreign Key:
Links tables together
References primary key of another table
Maintains referential integrity
Relationships:
One-to-One: Each record in Table A relates to one record in Table B
One-to-Many: One record in Table A relates to multiple records in Table B
Many-to-Many: Multiple records in both tables can relate to each other
SQL Syntax Basics
SQL Structure:
Structured Query Language
Declarative language (what, not how)
Case-insensitive (convention: UPPERCASE for keywords)
Basic Syntax Rules:
Statements end with semicolon (;)
String values in single quotes ('text')
Comments: -- single line, /* multi-line */
SQL Categories:
DDL (Data Definition Language): CREATE, ALTER, DROP
DML (Data Manipulation Language): INSERT, UPDATE, DELETE
DQL (Data Query Language): SELECT
DCL (Data Control Language): GRANT, REVOKE
Data Types
Numeric Types:
INT: Whole numbers (-2,147,483,648 to 2,147,483,647)
BIGINT: Large whole numbers
DECIMAL(p,s): Fixed-point numbers (p=precision, s=scale)
FLOAT: Approximate numbers
String Types:
CHAR(n): Fixed length strings
VARCHAR(n): Variable length strings
TEXT: Large text data
Date/Time Types:
DATE: Date only (YYYY-MM-DD)
TIME: Time only (HH:MM:SS)
DATETIME: Date and time combined
TIMESTAMP: Date and time with timezone
Other Types:
BOOLEAN: TRUE/FALSE values
BLOB: Binary large objects (images, files)
Database Design Principles
Normalization: Process of organizing data to reduce redundancy
First Normal Form (1NF):
Each column contains atomic values
No repeating groups
Each row is unique
Second Normal Form (2NF):
Must be in 1NF
No partial dependencies on primary key
Non-key attributes depend on entire primary key
Third Normal Form (3NF):
Must be in 2NF
No transitive dependencies
Non-key attributes depend only on primary key
Example: Instead of a single table:
| Order_ID | Customer_Name | Customer_City | Product_Name | Price |
Better design:
Customers: Customer_ID, Name, City
Products: Product_ID, Name, Price
Orders: Order_ID, Customer_ID, Product_ID
SELECT Statements
Basic SELECT:
SELECT column1, column2
FROM table_name;
Select All Columns:
SELECT *
FROM customers;
Select Specific Columns:
SELECT customer_name, email, city
FROM customers;
Column Aliases:
SELECT customer_name AS name,
email AS contact_email
FROM customers;
WHERE Clause Conditions
Comparison Operators:
-- Equal to
SELECT * FROM products WHERE price = 100;
-- Not equal to
SELECT * FROM products WHERE price != 100;
SELECT * FROM products WHERE price <> 100;
-- Greater than
SELECT * FROM products WHERE price > 100;
-- Less than or equal to
SELECT * FROM products WHERE price <= 100;
Logical Operators:
-- AND condition
SELECT * FROM customers
WHERE city = 'Mumbai' AND age > 25;
-- OR condition
SELECT * FROM customers
WHERE city = 'Delhi' OR city = 'Mumbai';
-- NOT condition
SELECT * FROM customers
WHERE NOT city = 'Chennai';
Pattern Matching with LIKE:
-- Starts with 'A'
SELECT * FROM customers WHERE customer_name LIKE 'A%';
-- Ends with 'son'
SELECT * FROM customers WHERE customer_name LIKE '%son';
-- Contains 'john'
SELECT * FROM customers WHERE customer_name LIKE '%john%';
-- Exactly 5 characters
SELECT * FROM customers WHERE customer_name LIKE '_____';
Range Conditions:
-- BETWEEN operator
SELECT * FROM products
WHERE price BETWEEN 100 AND 500;
-- IN operator
SELECT * FROM customers
WHERE city IN ('Mumbai', 'Delhi', 'Bangalore');
-- IS NULL
SELECT * FROM customers
WHERE phone IS NULL;
-- IS NOT NULL
SELECT * FROM customers
WHERE email IS NOT NULL;
Sorting with ORDER BY
Ascending Order (Default):
SELECT * FROM customers
ORDER BY customer_name;
SELECT * FROM customers
ORDER BY customer_name ASC;
Descending Order:
SELECT * FROM products
ORDER BY price DESC;
Multiple Column Sorting:
SELECT * FROM customers
ORDER BY city ASC, customer_name DESC;
Sorting by Column Position:
SELECT customer_name, city, age
FROM customers
ORDER BY 2, 1; -- Sort by city, then by name
Aggregate Functions
COUNT Function:
-- Count all rows
SELECT COUNT(*) FROM customers;
-- Count non-null values
SELECT COUNT(email) FROM customers;
-- Count distinct values
SELECT COUNT(DISTINCT city) FROM customers;
SUM Function:
-- Total sales
SELECT SUM(amount) FROM orders;
-- Sum with condition
SELECT SUM(amount) FROM orders
WHERE order_date >= '2024-01-01';
AVG Function:
-- Average price
SELECT AVG(price) FROM products;
-- Average with rounding
SELECT ROUND(AVG(price), 2) FROM products;
MIN and MAX Functions:
-- Minimum and maximum prices
SELECT MIN(price) AS lowest_price,
MAX(price) AS highest_price
FROM products;
GROUP BY Clause
Basic Grouping:
-- Count customers by city
SELECT city, COUNT(*) AS customer_count
FROM customers
GROUP BY city;
Multiple Column Grouping:
-- Sales by region and product category
SELECT region, category, SUM(sales) AS total_sales
FROM sales_data
GROUP BY region, category;
Grouping with Calculations:
-- Average order value by customer
SELECT customer_id,
COUNT(*) AS order_count,
AVG(amount) AS avg_order_value,
SUM(amount) AS total_spent
FROM orders
GROUP BY customer_id;
HAVING Clause
Difference from WHERE:
WHERE filters rows before grouping
HAVING filters groups after aggregation
Examples:
-- Cities with more than 10 customers
SELECT city, COUNT(*) AS customer_count
FROM customers
GROUP BY city
HAVING COUNT(*) > 10;
-- Customers with total purchases > 10000
SELECT customer_id, SUM(amount) AS total_purchases
FROM orders
GROUP BY customer_id
HAVING SUM(amount) > 10000;
Joins
INNER JOIN: Returns only matching records from both tables
-- Customer orders with customer names
SELECT c.customer_name, o.order_id, o.amount
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id;
LEFT JOIN (LEFT OUTER JOIN): Returns all records from left table, matching records from right
-- All customers, including those without orders
SELECT c.customer_name, o.order_id, o.amount
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id;
RIGHT JOIN (RIGHT OUTER JOIN): Returns all records from right table, matching records from left
-- All orders, including those without customer details
SELECT c.customer_name, o.order_id, o.amount
FROM customers c
RIGHT JOIN orders o ON c.customer_id = o.customer_id;
FULL OUTER JOIN: Returns all records when there's a match in either table
-- All customers and orders
SELECT c.customer_name, o.order_id, o.amount
FROM customers c
FULL OUTER JOIN orders o ON c.customer_id = o.customer_id;
Multiple Table Joins:
-- Orders with customer and product details
SELECT c.customer_name, p.product_name, o.quantity, o.amount
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id
INNER JOIN products p ON o.product_id = p.product_id;
Self Join: Join table with itself
-- Employees and their managers
SELECT e1.employee_name AS employee,
e2.employee_name AS manager
FROM employees e1
LEFT JOIN employees e2 ON e1.manager_id = e2.employee_id;
Subqueries
Single Value Subqueries:
-- Customers who spent more than average
SELECT customer_name, total_spent
FROM customers
WHERE total_spent > (SELECT AVG(total_spent) FROM customers);
Multiple Value Subqueries:
-- Products in categories with high sales
SELECT product_name, price
FROM products
WHERE category_id IN (
SELECT category_id
FROM sales_summary
WHERE total_sales > 100000
);
Correlated Subqueries: The inner query references the outer query's current row, so it is re-evaluated for each row
-- Customers with above-average orders for their city
SELECT customer_name, city, order_amount
FROM orders o1
WHERE order_amount > (
SELECT AVG(order_amount)
FROM orders o2
WHERE o2.city = o1.city
);
EXISTS Operator:
-- Customers who have placed orders
SELECT customer_name
FROM customers c
WHERE EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id
);
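These subquery patterns can be exercised end to end with sqlite3; the rows below are invented for illustration (Bob has no orders, and the average order amount works out to 575.0):

```python
import sqlite3

# Invented sample data: Bob has placed no orders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, customer_name TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    INSERT INTO orders VALUES (101, 1, 900.0), (102, 3, 250.0);
""")

# EXISTS: customers with at least one order.
with_orders = conn.execute("""
    SELECT customer_name FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id)
    ORDER BY customer_name
""").fetchall()

# Single-value subquery: orders larger than the average amount (575.0 here).
big_orders = conn.execute("""
    SELECT order_id FROM orders
    WHERE amount > (SELECT AVG(amount) FROM orders)
""").fetchall()
print(with_orders)  # [('Alice',), ('Carol',)]
print(big_orders)   # [(101,)]
conn.close()
```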
Window Functions
ROW_NUMBER(): Assigns a unique sequential number to each row
-- Rank customers by total spending
-- (RANK is a reserved keyword in MySQL 8+, so prefer a different alias)
SELECT customer_name, total_spent,
ROW_NUMBER() OVER (ORDER BY total_spent DESC) AS spending_rank
FROM customers;
RANK() and DENSE_RANK():
-- Rank with ties: RANK leaves gaps after tied values, DENSE_RANK does not
SELECT customer_name, total_spent,
RANK() OVER (ORDER BY total_spent DESC) AS spend_rank,
DENSE_RANK() OVER (ORDER BY total_spent DESC) AS dense_spend_rank
FROM customers;
PARTITION BY:
-- Rank customers within each city
SELECT customer_name, city, total_spent,
ROW_NUMBER() OVER (PARTITION BY city ORDER BY total_spent DESC) AS city_rank
FROM customers;
Aggregate Window Functions:
-- Running total of sales
SELECT order_date, daily_sales,
SUM(daily_sales) OVER (ORDER BY order_date) AS running_total
FROM daily_sales_summary;
LAG and LEAD:
-- Compare with previous month
SELECT month, sales,
LAG(sales) OVER (ORDER BY month) AS prev_month_sales,
sales - LAG(sales) OVER (ORDER BY month) AS growth
FROM monthly_sales;
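LAG and running totals can be combined in one query. A runnable sketch with sqlite3 (window functions need SQLite 3.25+, which the sqlite3 module bundled with Python 3.8+ usually provides); the monthly figures are made up:

```python
import sqlite3

# Made-up monthly sales; requires SQLite 3.25+ for window functions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE monthly_sales (month TEXT, sales INTEGER);
    INSERT INTO monthly_sales VALUES
        ('2024-01', 100), ('2024-02', 150), ('2024-03', 120);
""")

# LAG pulls the previous row's value; SUM ... OVER (ORDER BY ...) is a
# cumulative running total.
rows = conn.execute("""
    SELECT month, sales,
           LAG(sales) OVER (ORDER BY month) AS prev_sales,
           SUM(sales) OVER (ORDER BY month) AS running_total
    FROM monthly_sales
""").fetchall()
print(rows)
# [('2024-01', 100, None, 100), ('2024-02', 150, 100, 250), ('2024-03', 120, 150, 370)]
conn.close()
```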
Case Statements
Simple CASE:
-- Categorize customers by spending
SELECT customer_name, total_spent,
CASE
WHEN total_spent > 10000 THEN 'High Value'
WHEN total_spent > 5000 THEN 'Medium Value'
ELSE 'Low Value'
END AS customer_category
FROM customers;
Searched CASE:
-- Complex conditions
SELECT product_name, price, stock_quantity,
CASE
WHEN stock_quantity = 0 THEN 'Out of Stock'
WHEN stock_quantity < 10 THEN 'Low Stock'
WHEN price > 1000 AND stock_quantity > 50 THEN 'Premium Available'
ELSE 'In Stock'
END AS stock_status
FROM products;
CASE in Aggregations:
-- Conditional counting
SELECT
COUNT(CASE WHEN age < 30 THEN 1 END) AS young_customers,
COUNT(CASE WHEN age BETWEEN 30 AND 50 THEN 1 END) AS middle_age,
COUNT(CASE WHEN age > 50 THEN 1 END) AS senior_customers
FROM customers;
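The conditional-counting trick works because a CASE with no ELSE yields NULL, and COUNT ignores NULLs. A runnable sketch with sqlite3 and made-up ages:

```python
import sqlite3

# Made-up ages; CASE with no ELSE yields NULL, which COUNT skips.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("A", 22), ("B", 45), ("C", 61), ("D", 35)],
)
row = conn.execute("""
    SELECT
        COUNT(CASE WHEN age < 30 THEN 1 END) AS young_customers,
        COUNT(CASE WHEN age BETWEEN 30 AND 50 THEN 1 END) AS middle_age,
        COUNT(CASE WHEN age > 50 THEN 1 END) AS senior_customers
    FROM customers
""").fetchone()
print(row)  # (1, 2, 1)
conn.close()
```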
Python Installation and Setup
Python Installation:
Download from python.org
Choose Python 3.8+ version
Check "Add Python to PATH"
Verify installation: python --version
IDE Options:
Jupyter Notebook: Interactive development
PyCharm: Professional IDE
VS Code: Lightweight, extensible
Spyder: Scientific computing focus
Package Management:
# Install pip (usually included)
python -m pip install --upgrade pip
# Install packages
pip install pandas numpy matplotlib seaborn
# Install from requirements file
pip install -r requirements.txt
# Create virtual environment
python -m venv data_analysis_env
# Activate it (Windows): data_analysis_env\Scripts\activate
# Activate it (macOS/Linux): source data_analysis_env/bin/activate
Basic Syntax and Data Types
Variables:
# Variable assignment
name = "John"
age = 25
height = 5.8
is_student = True
# Multiple assignment
x, y, z = 1, 2, 3
Data Types:
# Numeric types
integer_num = 42
float_num = 3.14
complex_num = 2 + 3j
# String
text = "Hello, World!"
multiline = """This is
a multiline
string"""
# Boolean
is_true = True
is_false = False
# None type
empty_value = None
Type Checking:
print(type(age)) # <class 'int'>
print(isinstance(age, int)) # True
String Operations:
# String methods
name = " John Doe "
print(name.lower()) # " john doe "
print(name.upper()) # " JOHN DOE "
print(name.strip()) # "John Doe"
print(name.replace("John", "Jane")) # " Jane Doe "
# String formatting
age = 25
message = f"I am {age} years old" # f-string
message = "I am {} years old".format(age) # format method
Lists:
# List creation
numbers = [1, 2, 3, 4, 5]
mixed_list = [1, "hello", 3.14, True]
empty_list = []
# List operations
numbers.append(6) # Add element
numbers.insert(0, 0) # Insert at position
numbers.remove(3) # Remove specific value
popped = numbers.pop() # Remove and return last element
# List slicing
print(numbers[0]) # First element
print(numbers[-1]) # Last element
print(numbers[1:4]) # Elements from index 1 to 3
print(numbers[:3]) # First 3 elements
print(numbers[2:]) # From index 2 to end
Dictionaries:
# Dictionary creation
student = {
"name": "John",
"age": 25,
"grades": [85, 90, 78]
}
# Dictionary operations
print(student["name"]) # Access value
student["city"] = "Mumbai" # Add new key-value
student.update({"phone": "123"}) # Update multiple
# Dictionary methods
print(student.keys()) # All keys
print(student.values()) # All values
print(student.items()) # Key-value pairs
Control Structures
Conditional Statements:
# if-elif-else
score = 85
if score >= 90:
    grade = "A"
elif score >= 80:
    grade = "B"
elif score >= 70:
    grade = "C"
else:
    grade = "F"
# Ternary operator
grade = "Pass" if score >= 60 else "Fail"
# Multiple conditions (attendance defined here so the example runs)
attendance = 80
if score >= 80 and attendance >= 75:
    print("Eligible for scholarship")
Loops:
# for loop with range
for i in range(5):        # 0 to 4
    print(i)
for i in range(1, 11):    # 1 to 10
    print(i)
for i in range(0, 10, 2): # 0, 2, 4, 6, 8
    print(i)
# for loop with lists
fruits = ["apple", "banana", "orange"]
for fruit in fruits:
    print(fruit)
# for loop with enumerate
for index, fruit in enumerate(fruits):
    print(f"{index}: {fruit}")
# while loop
count = 0
while count < 5:
    print(count)
    count += 1
# Loop control
for i in range(10):
    if i == 3:
        continue  # Skip this iteration
    if i == 7:
        break     # Exit the loop
    print(i)
List Comprehensions:
# Basic list comprehension
squares = [x**2 for x in range(10)]
# With condition
even_squares = [x**2 for x in range(10) if x % 2 == 0]
# Nested comprehension
matrix = [[i*j for j in range(3)] for i in range(3)]
Functions and Modules
Function Definition:
# Basic function
def greet(name):
    return f"Hello, {name}!"
# Function with default parameters
def calculate_area(length, width=1):
    return length * width
# Function with multiple return values
def get_name_age():
    return "John", 25
name, age = get_name_age()
# Function with variable positional arguments
def sum_numbers(*args):
    return sum(args)
result = sum_numbers(1, 2, 3, 4, 5)
# Function with keyword arguments
def create_profile(**kwargs):
    return kwargs
profile = create_profile(name="John", age=25, city="Mumbai")
Lambda Functions:
# Lambda function
square = lambda x: x**2
print(square(5)) # 25
# Lambda with map
numbers = [1, 2, 3, 4, 5]
squared = list(map(lambda x: x**2, numbers))
# Lambda with filter
even_numbers = list(filter(lambda x: x % 2 == 0, numbers))
Modules:
# Importing modules
import math
from datetime import datetime
import pandas as pd # Alias
# Using imported functions
print(math.sqrt(16))
print(datetime.now())
# Creating custom module
# File: my_functions.py
def add(a, b):
    return a + b
# Using custom module
# from my_functions import add
# result = add(5, 3)
Array Creation and Manipulation
NumPy Installation and Import:
# Install once: pip install numpy
import numpy as np
Array Creation:
# From list
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.array([[1, 2, 3], [4, 5, 6]])
# Built-in functions
zeros = np.zeros(5) # Array of zeros
ones = np.ones((3, 4)) # 3x4 array of ones
full = np.full((2, 3), 7) # Array filled with 7
empty = np.empty((2, 2)) # Uninitialized array
# Range arrays
range_arr = np.arange(0, 10, 2) # [0, 2, 4, 6, 8]
linspace = np.linspace(0, 1, 5) # 5 evenly spaced values
# Random arrays
random_arr = np.random.random(5) # Random floats 0-1
random_int = np.random.randint(1, 10, 5) # Random integers
normal_dist = np.random.normal(0, 1, 100) # Normal distribution
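Random arrays change on every run, which makes examples and tests hard to reproduce. A short sketch of seeding, using NumPy's Generator API alongside the legacy functions shown above:

```python
import numpy as np

# Seeding makes "random" output reproducible from run to run.
rng = np.random.default_rng(42)       # modern Generator API
sample = rng.integers(1, 10, size=5)  # integers in [1, 10)
floats = rng.random(3)                # floats in [0, 1)

# The legacy np.random.* functions can be seeded too.
np.random.seed(0)
legacy = np.random.randint(1, 10, 5)
print(sample, floats, legacy)
```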
Array Properties:
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr.shape) # (2, 3)
print(arr.size) # 6
print(arr.ndim) # 2
print(arr.dtype) # int64
print(arr.itemsize) # 8 bytes per element
Mathematical Operations
Element-wise Operations:
arr1 = np.array([1, 2, 3, 4])
arr2 = np.array([5, 6, 7, 8])
# Arithmetic operations
addition = arr1 + arr2 # [6, 8, 10, 12]
subtraction = arr1 - arr2 # [-4, -4, -4, -4]
multiplication = arr1 * arr2 # [5, 12, 21, 32]
division = arr2 / arr1 # [5.0, 3.0, 2.33, 2.0]
power = arr1 ** 2 # [1, 4, 9, 16]
# Scalar operations
scalar_mult = arr1 * 3 # [3, 6, 9, 12]
scalar_add = arr1 + 10 # [11, 12, 13, 14]
Mathematical Functions:
arr = np.array([1, 4, 9, 16, 25])
# Square root
sqrt_arr = np.sqrt(arr) # [1, 2, 3, 4, 5]
# Trigonometric functions
angles = np.array([0, np.pi/2, np.pi])
sin_values = np.sin(angles) # [0, 1, 0]
cos_values = np.cos(angles) # [1, 0, -1]
# Logarithmic functions
log_values = np.log(arr) # Natural log
log10_values = np.log10(arr) # Base 10 log
# Exponential
exp_values = np.exp([1, 2, 3]) # [e, e^2, e^3]
Aggregate Functions:
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# Basic aggregations
print(np.sum(arr)) # Sum of all elements
print(np.mean(arr)) # Average
print(np.median(arr)) # Median
print(np.std(arr)) # Standard deviation
print(np.var(arr)) # Variance
print(np.min(arr)) # Minimum
print(np.max(arr)) # Maximum
# Axis-wise operations
print(np.sum(arr, axis=0)) # Sum along columns [12, 15, 18]
print(np.sum(arr, axis=1)) # Sum along rows [6, 15, 24]
print(np.mean(arr, axis=0)) # Column means [4, 5, 6]
Array Indexing and Slicing
Basic Indexing:
arr = np.array([10, 20, 30, 40, 50])
# Single element
print(arr[0]) # 10
print(arr[-1]) # 50
# Slicing
print(arr[1:4]) # [20, 30, 40]
print(arr[:3]) # [10, 20, 30]
print(arr[2:]) # [30, 40, 50]
print(arr[::2]) # [10, 30, 50] - every 2nd element
2D Array Indexing:
arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# Element access
print(arr_2d[0, 1]) # 2
print(arr_2d[1][2]) # 6
# Row/column access
print(arr_2d[0, :]) # First row [1, 2, 3]
print(arr_2d[:, 1]) # Second column [2, 5, 8]
# Subarray
print(arr_2d[0:2, 1:3]) # [[2, 3], [5, 6]]
Boolean Indexing:
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
# Boolean condition
condition = arr > 5
print(condition) # [False, False, False, False, False, True, True, True, True, True]
print(arr[condition]) # [6, 7, 8, 9, 10]
# Direct boolean indexing
print(arr[arr > 5]) # [6, 7, 8, 9, 10]
print(arr[(arr > 3) & (arr < 8)]) # [4, 5, 6, 7]
# Multiple conditions
print(arr[(arr < 3) | (arr > 8)]) # [1, 2, 9, 10]
Fancy Indexing:
arr = np.array([10, 20, 30, 40, 50])
# Index array
indices = [0, 2, 4]
print(arr[indices]) # [10, 30, 50]
# 2D fancy indexing
arr_2d = np.array([[1, 2], [3, 4], [5, 6]])
rows = [0, 2]
cols = [1, 0]
print(arr_2d[rows, cols]) # [2, 5]
Broadcasting
Broadcasting Rules: NumPy automatically stretches arrays of different shapes during arithmetic operations. Shapes are compared from the trailing dimension backwards; two dimensions are compatible when they are equal or one of them is 1.
# Scalar with array
arr = np.array([1, 2, 3, 4])
result = arr + 10 # [11, 12, 13, 14]
# 1D with 2D
arr_1d = np.array([1, 2, 3])
arr_2d = np.array([[10], [20], [30]])
result = arr_1d + arr_2d # [[11, 12, 13], [21, 22, 23], [31, 32, 33]]
# Different shapes
a = np.array([[1, 2, 3]]) # Shape: (1, 3)
b = np.array([[4], [5]]) # Shape: (2, 1)
result = a + b # Shape: (2, 3)
print(result)
# [[5, 6, 7],
# [6, 7, 8]]
Broadcasting Examples:
# Normalize columns
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
column_means = np.mean(data, axis=0) # [4, 5, 6]
normalized = data - column_means # Broadcasting subtraction
# Distance calculation
points = np.array([[1, 2], [3, 4], [5, 6]])
center = np.array([3, 4])
distances = np.sqrt(np.sum((points - center)**2, axis=1))
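The normalization idea above extends to full standardization, and broadcasting lets us verify the result in one line per property. A small self-checking sketch on made-up data:

```python
import numpy as np

# Made-up 3x3 data matrix for illustration.
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0],
                 [7.0, 8.0, 9.0]])

# Column-wise standardization: (x - mean) / std.
# col_mean and col_std have shape (3,) and broadcast across the rows.
col_mean = data.mean(axis=0)
col_std = data.std(axis=0)
standardized = (data - col_mean) / col_std

# After standardization, every column has mean 0 and std 1.
print(np.allclose(standardized.mean(axis=0), 0.0))  # True
print(np.allclose(standardized.std(axis=0), 1.0))   # True
```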
DataFrames and Series
Pandas Installation and Import:
# Install once: pip install pandas
import pandas as pd
import numpy as np
Series Creation:
# From list
series1 = pd.Series([1, 2, 3, 4, 5])
# With custom index
series2 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
# From dictionary
data_dict = {'A': 1, 'B': 2, 'C': 3}
series3 = pd.Series(data_dict)
# Series properties
print(series2.values) # Array of values
print(series2.index) # Index labels
print(series2.dtype) # Data type
print(series2.shape) # Dimensions
DataFrame Creation:
# From dictionary
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'Age': [25, 30, 35, 28],
'City': ['Mumbai', 'Delhi', 'Bangalore', 'Chennai'],
'Salary': [50000, 60000, 70000, 55000]
}
df = pd.DataFrame(data)
# From list of lists
data_list = [
['Alice', 25, 'Mumbai', 50000],
['Bob', 30, 'Delhi', 60000],
['Charlie', 35, 'Bangalore', 70000]
]
df2 = pd.DataFrame(data_list, columns=['Name', 'Age', 'City',