Teach Me Data Analysis: From Questions to Decisions

Verified Sources
Jun 18, 2026

Data analysis is the disciplined practice of transforming raw data into evidence that supports decisions. A strong analyst does not merely “make charts”; they define a problem, assess data quality, clean and transform information, explore patterns, test assumptions, communicate findings, and recommend action.

A practical data analysis workflow often follows six phases: Ask, Prepare, Process, Analyze, Share, and Act, a structure popularized in professional analytics training. More mature analytics and data mining projects often use lifecycle models such as CRISP-DM, which emphasizes business understanding, data understanding, preparation, modeling, evaluation, and deployment.

At its core, data analysis combines three capabilities:

| Capability | What it answers | Typical methods |
| --- | --- | --- |
| Descriptive analysis | What happened? | Tables, charts, averages, distributions |
| Diagnostic analysis | Why did it happen? | Segmentation, correlation, drill-downs |
| Predictive analysis | What might happen? | Regression, classification, forecasting |

Good data analysis is not only technical. It is also critical thinking: every chart, metric, and model should connect back to a decision, a hypothesis, or a measurable outcome.

Footnotes

  1. Coursera: Data Analysis Process - Explains the common Ask, Prepare, Process, Analyze, Share, and Act framework used in analytics education.

  2. IBM Documentation: CRISP-DM Overview - Describes the CRISP-DM lifecycle for structured data mining and analytics projects.

Learning Objective

By the end of this section, you should be able to frame an analytical question, prepare and clean data, perform exploratory analysis, choose appropriate charts, interpret basic statistics, and communicate evidence-based recommendations with citations and limitations.

A Practical Data Analysis Lifecycle

Ask

Phase 1

Define the business or research question, identify stakeholders, specify success metrics, and decide what action the analysis should inform. Analytics work should start with a clear decision context rather than with random exploration.

Footnotes

  1. Coursera: Data Analysis Process - Explains the common Ask, Prepare, Process, Analyze, Share, and Act framework used in analytics education.

Prepare

Phase 2

Locate, collect, and document data sources. Analysts should evaluate whether the data is relevant, representative, accessible, and ethically usable before analysis begins.

Footnotes

  1. IBM Documentation: CRISP-DM Overview - Describes the CRISP-DM lifecycle for structured data mining and analytics projects.

Process

Phase 3

Clean the data by handling missing values, duplicates, inconsistent formats, invalid categories, and outliers. Data quality checks commonly examine dimensions such as completeness, validity, consistency, accuracy, and timeliness.

Footnotes

  1. CDC: Data Quality - Discusses data quality concepts relevant to public health surveillance and analytical reliability.

Analyze

Phase 4

Use exploratory data analysis, descriptive statistics, segmentation, hypothesis testing, or models to discover patterns. Exploratory data analysis emphasizes graphs, distributions, residuals, and unexpected structure in the data.

Footnotes

  1. NIST/SEMATECH e-Handbook: Exploratory Data Analysis - Provides a technical reference on exploratory data analysis, graphical methods, and data investigation.

Share

Phase 5

Convert results into clear visuals, concise explanations, and recommendations. Effective communication should distinguish evidence, assumptions, uncertainty, and practical implications.

Act

Phase 6

Implement a decision, monitor outcomes, and compare results against the original success metrics. Analysis becomes valuable when it changes action.

The Analyst’s Mindset

A data analyst thinks in questions, evidence, and uncertainty. Before touching a spreadsheet or writing code, ask:

  1. What decision will this analysis support?
  2. Who will use the result?
  3. What metric defines success?
  4. What data is available, and what data is missing?
  5. What assumptions could distort the conclusion?

A strong analytical question is specific, measurable, and actionable. For example:

| Weak question | Strong analytical question |
| --- | --- |
| “How are sales doing?” | “Which customer segments contributed most to the 12% sales decline in Q2 compared with Q1?” |
| “Are users happy?” | “Which onboarding steps are associated with lower 7-day retention among new users?” |
| “Is the campaign good?” | “Did the email campaign increase conversion rate compared with the control group?” |

A metric must be aligned with the decision. For example, if the goal is customer retention, total website visits may be less relevant than repeat purchase rate, churn rate, or cohort retention.

Common analytical metric formulas include:

$$\text{Conversion Rate} = \frac{\text{Number of Conversions}}{\text{Number of Visitors}} \times 100$$

$$\text{Churn Rate} = \frac{\text{Customers Lost During Period}}{\text{Customers at Start of Period}} \times 100$$

$$\text{Average Order Value} = \frac{\text{Total Revenue}}{\text{Number of Orders}}$$
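As a minimal sketch, the three metric formulas can be written as small Python functions. The function names and the input numbers are hypothetical, chosen only to exercise the formulas; the guard against a zero or negative denominator is an added assumption.

```python
def conversion_rate(conversions: int, visitors: int) -> float:
    """Conversions as a percentage of visitors."""
    if visitors <= 0:
        raise ValueError("visitors must be positive")
    return 100 * conversions / visitors


def churn_rate(customers_lost: int, customers_at_start: int) -> float:
    """Customers lost during the period as a percentage of the starting base."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return 100 * customers_lost / customers_at_start


def average_order_value(total_revenue: float, orders: int) -> float:
    """Revenue per order, in the same currency unit as total_revenue."""
    if orders <= 0:
        raise ValueError("orders must be positive")
    return total_revenue / orders


# Hypothetical inputs, not taken from any real dataset.
print(conversion_rate(120, 4000))           # 3.0
print(churn_rate(45, 900))                  # 5.0
print(average_order_value(52500.0, 1050))   # 50.0
```

Keeping each metric in a named function makes the definition explicit, which helps when two teams compute "conversion" from different denominators.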

Start with the Decision, Not the Dataset

A dataset can answer many questions, but not all of them matter. Begin by writing the decision you want to influence, then choose metrics and methods that directly support that decision.

How to Perform a Complete Data Analysis

  1. Translate a vague topic into a precise analytical question. Define the population, time period, comparison group, metric, and decision. For example, instead of asking whether sales are low, ask which product categories caused the month-over-month revenue decline and whether discounting, traffic, or conversion changed.

  2. List internal and external sources such as transaction tables, survey responses, web analytics, CRM records, public datasets, logs, or experiments. Document each source, owner, update frequency, permissions, and known limitations.

  3. Review rows, columns, data types, unique identifiers, timestamps, categories, and units. A data dictionary helps prevent misinterpretation when multiple people use the same dataset.

  4. Check missing values, duplicates, invalid ranges, impossible dates, inconsistent labels, and mismatched units. Data quality assessment is essential because unreliable inputs can produce misleading conclusions even when the analysis method is correct.

    Footnotes

    1. CDC: Data Quality - Discusses data quality concepts relevant to public health surveillance and analytical reliability.

  5. Standardize formats, correct known errors, remove or flag duplicates, handle missingness, create calculated fields, aggregate records, and join tables carefully. Keep a reproducible record of every transformation.

  6. Use exploratory data analysis to understand distributions, relationships, unusual observations, and subgroup differences. EDA commonly uses visual and numerical summaries to reveal structure before formal modeling.

    Footnotes

    1. NIST/SEMATECH e-Handbook: Exploratory Data Analysis - Provides a technical reference on exploratory data analysis, graphical methods, and data investigation.

  7. Apply appropriate methods such as descriptive statistics, cohort analysis, correlation, regression, segmentation, or hypothesis testing. Interpret results in relation to the original question, not merely statistical output.

  8. Present the key message, evidence, limitations, and recommended action. Use charts when they clarify comparisons, patterns, distributions, or relationships.

  9. After action is taken, track whether the recommendation improved the target metric. This closes the loop and turns analysis into organizational learning.
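The data quality checks named in the steps above (missing values, duplicates, invalid ranges, impossible dates) can be sketched as a read-only audit. This is an illustration under stated assumptions: the `orders` records, the field names, and the `audit` function are hypothetical, and a real audit would cover more checks.

```python
from datetime import date

# Hypothetical order records; None marks a missing value.
orders = [
    {"order_id": 1, "quantity": 2, "order_date": date(2024, 3, 1)},
    {"order_id": 2, "quantity": None, "order_date": date(2024, 3, 2)},
    {"order_id": 2, "quantity": None, "order_date": date(2024, 3, 2)},  # exact duplicate
    {"order_id": 3, "quantity": -1, "order_date": date(2024, 3, 5)},    # invalid range
    {"order_id": 4, "quantity": 1, "order_date": date(2031, 1, 1)},     # impossible date
]


def audit(rows, as_of):
    """Count basic quality problems without modifying the rows."""
    seen = set()
    duplicate_rows = 0
    for row in rows:
        fingerprint = tuple(sorted(row.items()))  # hashable view of the whole row
        if fingerprint in seen:
            duplicate_rows += 1
        seen.add(fingerprint)
    quantities = [row["quantity"] for row in rows]
    return {
        "rows": len(rows),
        "missing_quantity": sum(q is None for q in quantities),
        "duplicate_rows": duplicate_rows,
        "negative_quantity": sum(q is not None and q < 0 for q in quantities),
        "future_dates": sum(row["order_date"] > as_of for row in rows),
    }


report = audit(orders, as_of=date(2024, 12, 31))
for check, count in report.items():
    print(f"{check}: {count}")
```

The audit only counts problems. Deciding whether to remove, correct, or flag each one is a separate step, so the raw data and the evidence for each decision stay available.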

Data Preparation and Cleaning

Most real-world analysis time is spent preparing data. A clean dataset is not necessarily perfect; it is suitable for the analytical purpose and its limitations are documented.

Important concepts include missing data, duplicate records, outliers, and data validation.

Common cleaning tasks:

| Problem | Example | Possible treatment |
| --- | --- | --- |
| Missing values | Age is blank for 18% of users | Impute, exclude, flag, or investigate source issue |
| Duplicate rows | Same transaction appears twice | Remove exact duplicates or deduplicate by key |
| Inconsistent labels | “USA”, “U.S.”, “United States” | Standardize categories |
| Invalid values | Negative quantity sold | Correct, remove, or flag after source verification |
| Unit mismatch | Revenue in dollars and cents mixed | Convert to common unit |
| Date problems | Text dates in multiple formats | Parse and standardize timestamps |

Tidy data, in which each variable is a column, each observation is a row, and each value is a cell, is widely used because it makes analysis, visualization, and modeling easier to automate.

Footnotes

  1. Hadley Wickham: Tidy Data - Defines tidy data principles: variables as columns, observations as rows, and values as cells.

Never Clean Data Silently

Every cleaning decision can change the conclusion. Record what you removed, changed, imputed, or flagged. Reproducibility is part of analytical credibility.

```python
import pandas as pd

df = pd.read_csv("sales.csv")

# Inspect structure (info() prints its summary itself and returns None)
df.info()
print(df.head())

# Standardize column names
df.columns = (
    df.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_")
)

# Remove exact duplicates
df = df.drop_duplicates()

# Parse dates; values that cannot be parsed become NaT
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# Create a calculated metric
df["revenue"] = df["quantity"] * df["unit_price"]

# Check missingness: share of missing values per column, highest first
missing_rate = df.isna().mean().sort_values(ascending=False)
print(missing_rate)
```
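To make the "never clean silently" principle concrete, here is a small standard-library sketch that standardizes country labels and flags negative quantities while recording what it did. The alias map, field names, and `clean_with_log` helper are hypothetical; it is one possible way to keep a cleaning log, not a prescribed method.

```python
# Hypothetical mapping from observed labels (lowercased) to one canonical label.
COUNTRY_ALIASES = {
    "usa": "United States",
    "u.s.": "United States",
    "united states": "United States",
}


def clean_with_log(rows):
    """Return cleaned copies of the rows plus a log of what changed."""
    cleaned, relabeled, flagged = [], 0, 0
    for raw in rows:
        row = dict(raw)  # copy, so the raw input stays untouched
        key = row["country"].strip().lower()
        canonical = COUNTRY_ALIASES.get(key, row["country"])
        if canonical != row["country"]:
            row["country"] = canonical
            relabeled += 1
        row["invalid_quantity"] = row["quantity"] < 0  # flag, do not delete
        flagged += row["invalid_quantity"]
        cleaned.append(row)
    log = [
        f"standardized country label on {relabeled} of {len(rows)} rows",
        f"flagged {flagged} rows with negative quantity for source verification",
    ]
    return cleaned, log


raw_rows = [
    {"country": "USA", "quantity": 2},
    {"country": "U.S.", "quantity": 1},
    {"country": "United States", "quantity": -3},
    {"country": "Canada", "quantity": 5},
]
cleaned_rows, cleaning_log = clean_with_log(raw_rows)
for entry in cleaning_log:
    print(entry)
```

Flagging invalid rows instead of deleting them keeps the decision reversible until the source system has been checked.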

Exploratory Data Analysis

Exploratory data analysis is the stage where you learn what the dataset is trying to tell you before making strong claims. John Tukey’s tradition of EDA emphasized visual thinking, resistant summaries, and discovery-oriented analysis; modern EDA remains central to analytical work.

Key EDA questions:

  1. Distribution: What values are common, rare, or impossible?
  2. Central tendency: What is typical?
  3. Variation: How spread out are the values?
  4. Relationships: Which variables move together?
  5. Segments: Do patterns differ by group?
  6. Time: Are there trends, seasonality, or sudden breaks?
  7. Outliers: Are unusual values errors, rare events, or important signals?

Core descriptive statistics:

| Statistic | Meaning | Formula or interpretation |
| --- | --- | --- |
| Mean | Arithmetic average | $\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i$ |
| Median | Middle value | Robust to extreme outliers |
| Mode | Most frequent value | Useful for categorical data |
| Range | Maximum minus minimum | Sensitive to outliers |
| Variance | Average squared deviation | $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$ |
| Standard deviation | Typical distance from mean | $s = \sqrt{s^2}$ |
| Interquartile range | Middle 50% spread | $IQR = Q_3 - Q_1$ |
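The statistics in the table can be computed with Python's standard `statistics` module. The sample values below are hypothetical and include one deliberate outlier to show why the median and IQR are described as robust. Quartile conventions differ between tools; this sketch uses the module's default "exclusive" method.

```python
import statistics

# Hypothetical order values; 95 is a deliberate outlier.
values = [12, 15, 15, 18, 20, 22, 95]

mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)
value_range = max(values) - min(values)
variance = statistics.variance(values)  # sample variance, divides by n - 1
std_dev = statistics.stdev(values)      # square root of the sample variance
q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
iqr = q3 - q1

print(f"mean={mean:.2f} median={median} mode={mode} range={value_range}")
print(f"variance={variance:.2f} std_dev={std_dev:.2f} IQR={iqr}")
```

The single outlier pulls the mean well above the median, while the IQR is unaffected by it. A comparison like this is a quick way to notice skew before choosing a summary statistic.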

Footnotes

  1. NIST/SEMATECH e-Handbook: Exploratory Data Analysis - Provides a technical reference on exploratory data analysis, graphical methods, and data investigation.

[Figure: Illustrative data quality issue counts. Example count of problems found during a fictional data audit.]

Choosing the Right Visualization

Visualization is not decoration. It is a reasoning tool. The correct chart depends on the analytical task: comparing quantities, showing change over time, displaying distributions, mapping relationships, or explaining composition. Well-designed visualizations reduce cognitive load and help audiences see patterns accurately.

| Analytical task | Good chart choices | Avoid |
| --- | --- | --- |
| Compare categories | Bar chart, dot plot | 3D bars, overloaded pie charts |
| Show trend over time | Line chart, area chart | Unsorted bars for continuous time |
| Show distribution | Histogram, box plot, density plot | Pie chart |
| Show relationship | Scatter plot, bubble chart | Dual-axis chart without justification |
| Show composition | Stacked bar, treemap, limited pie chart | Too many slices |
| Show geography | Choropleth, symbol map | Maps when location is irrelevant |

A useful rule: if the viewer needs to compare lengths or positions, use bars or dots; if the viewer needs to see movement over time, use lines.

Footnotes

  1. Tableau: What Is Data Visualization? - Introduces the role of visualization in understanding and communicating data.

Correlation Is Not Causation

Two variables can move together because of coincidence, confounding, reverse causality, or a hidden third factor. Causal claims usually require stronger designs such as experiments, natural experiments, or careful causal inference methods.

From Descriptive Analysis to Statistical Inference

Descriptive analysis summarizes the data you have. Statistical inference helps you reason beyond the observed sample.

Important concepts:

  • A population is the entire group of interest.
  • A sample is the observed data.
  • A parameter describes the population.
  • A statistic describes the sample.
  • A confidence interval communicates uncertainty.

For example, if you survey 500 customers and find that 62% are satisfied, 62% is a sample statistic. The true satisfaction rate for all customers may be higher or lower. A confidence interval expresses that uncertainty.
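For the survey example, a rough 95% confidence interval can be sketched with the normal approximation for a proportion. That choice of method is an assumption made here for illustration (other interval methods exist and may behave better for small samples or extreme proportions), and `proportion_ci` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(p_hat: float, n: int, confidence: float = 0.95):
    """Normal-approximation confidence interval for a sample proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    standard_error = sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * standard_error, p_hat + z * standard_error


low, high = proportion_ci(0.62, 500)
print(f"{low:.3f} to {high:.3f}")  # 0.577 to 0.663
```

Under these assumptions the interval runs from roughly 58% to 66%, so a report could say "about 62%, plus or minus 4 percentage points" instead of presenting 62% as exact.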

Hypothesis testing usually begins with:

$$H_0: \text{There is no effect or difference}$$

$$H_A: \text{There is an effect or difference}$$

A p-value is often used in hypothesis testing, but it should not be treated as the probability that the hypothesis is true. Interpretation must consider study design, sample size, effect size, and practical significance.

A purely descriptive question, by contrast, might be: “What was the average monthly revenue in 2024?” This can be answered with aggregation, summary statistics, and visualization.

Practical Analysis Techniques

1. Segmentation

Segmentation helps identify which groups drive an overall pattern. If revenue falls by 10%, the cause may be isolated to one region, product, acquisition channel, or customer cohort.

Example segmentation dimensions:

| Domain | Useful segments |
| --- | --- |
| E-commerce | Product category, device type, traffic source, customer tenure |
| Education | Grade level, course, attendance group, assessment type |
| Healthcare | Age group, diagnosis category, facility, treatment pathway |
| SaaS | Plan type, signup cohort, usage level, company size |
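A segment breakdown of a 10% revenue decline might look like the sketch below. The regional figures and the `segment_breakdown` helper are hypothetical; the idea is to compute each segment's own percentage change and its share of the total change. It assumes every segment had nonzero revenue in the previous period and that total revenue actually changed.

```python
# Hypothetical revenue by region for two periods (total falls from 500,000 to 450,000).
previous = {"North": 200_000, "South": 150_000, "West": 150_000}
current = {"North": 198_000, "South": 105_000, "West": 147_000}


def segment_breakdown(previous, current):
    """Per-segment percentage change and share of the overall change."""
    total_change = sum(current.values()) - sum(previous.values())
    rows = []
    for segment, before in previous.items():
        change = current[segment] - before
        rows.append({
            "segment": segment,
            "pct_change": 100 * change / before,
            "share_of_total_change": 100 * change / total_change,
        })
    # Largest contributor to the overall change first.
    return sorted(rows, key=lambda r: r["share_of_total_change"], reverse=True)


breakdown = segment_breakdown(previous, current)
for row in breakdown:
    print(f'{row["segment"]}: {row["pct_change"]:+.1f}% '
          f'({row["share_of_total_change"]:.0f}% of total change)')
```

In this made-up data one region accounts for most of the decline, which is the situation the paragraph above describes: the overall number hides a concentrated cause.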

2. Cohort Analysis

Cohort analysis is useful for retention and lifecycle questions. For example, users who joined in January can be tracked across weeks to see whether retention differs from users who joined in February.
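As an illustration, a retention table can be built from an activity log that records each user's signup cohort and the weeks in which they were active. The log format, user IDs, and `retention_table` function are assumptions made for this sketch.

```python
from collections import defaultdict

# Hypothetical activity log: (user_id, signup_cohort, weeks_since_signup).
activity = [
    ("u1", "2024-01", 0), ("u1", "2024-01", 1), ("u1", "2024-01", 2),
    ("u2", "2024-01", 0), ("u2", "2024-01", 1),
    ("u3", "2024-02", 0),
    ("u4", "2024-02", 0), ("u4", "2024-02", 2),
]


def retention_table(events):
    """Share of each cohort's users who were active in each week since signup."""
    cohort_users = defaultdict(set)  # cohort -> all users in that cohort
    active = defaultdict(set)        # (cohort, week) -> users active that week
    for user, cohort, week in events:
        cohort_users[cohort].add(user)
        active[(cohort, week)].add(user)
    max_week = max(week for _, _, week in events)
    return {
        cohort: {
            week: len(active[(cohort, week)]) / len(users)
            for week in range(max_week + 1)
        }
        for cohort, users in cohort_users.items()
    }


table = retention_table(activity)
for cohort, weeks in table.items():
    print(cohort, {week: f"{rate:.0%}" for week, rate in weeks.items()})
```

Reading across a row shows how one cohort decays over time; reading down a column compares cohorts at the same age, which is how the January and February comparison in the text would be made.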

3. Correlation and Regression

Correlation measures how strongly two variables move together. The Pearson correlation coefficient ranges from $-1$ to $1$:

$$r = \frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum (x_i-\bar{x})^2 \sum (y_i-\bar{y})^2}}$$
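The Pearson formula translates term by term into plain Python. The `ad_spend` and `sales` values are hypothetical, and `pearson_r` is an illustrative helper; in practice a statistics library would normally be used.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed directly from the definition."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / sqrt(ss_x * ss_y)


# Hypothetical paired observations.
ad_spend = [1, 2, 3, 4, 5]
sales = [2, 4, 5, 4, 5]

r = pearson_r(ad_spend, sales)
print(f"r = {r:.3f}")  # r = 0.775
```

The function raises a division error if either variable is constant, which is consistent with the formula: correlation is undefined when one variable does not vary.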

Regression estimates how an outcome changes with one or more predictors. A simple linear regression has the form:

$$y = \beta_0 + \beta_1 x + \epsilon$$

where $y$ is the outcome, $x$ is the predictor, $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\epsilon$ is the error term.
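A minimal sketch of estimating the intercept and slope by ordinary least squares, using only the standard library. Least squares is the usual estimation method but is named here as an assumption, since the text does not specify one. The data and the `fit_line` helper are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares estimates (intercept, slope) for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical paired observations.
x_values = [1, 2, 3, 4, 5]
y_values = [2, 4, 5, 4, 5]

b0, b1 = fit_line(x_values, y_values)
residuals = [y - (b0 + b1 * x) for x, y in zip(x_values, y_values)]
print(f"intercept={b0:.2f} slope={b1:.2f}")  # intercept=2.20 slope=0.60
```

The residuals are the observed counterpart of the error term. Plotting them against the predictor is a common check on whether a straight line is a reasonable description of the data.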

4. Experiments and A/B Tests

An A/B test compares variants such as two landing pages, email subject lines, or checkout designs. Randomization helps reduce bias and makes causal interpretation stronger than simple observational comparison.
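One common way to compare conversion rates between two variants is a two-proportion z-test. The sketch below assumes independent users, random assignment, and samples large enough for the normal approximation. The counts and the `two_proportion_z_test` helper are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Difference in conversion rate (B minus A), z statistic, two-sided p-value."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # rate under the null of no difference
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b - p_a, z, p_value


# Hypothetical results: variant A converts 200 of 5,000, variant B 250 of 5,000.
lift, z, p_value = two_proportion_z_test(200, 5000, 250, 5000)
print(f"lift={lift:.3%} z={z:.2f} p={p_value:.4f}")
```

A small p-value alone does not settle the decision. The size of the lift, its uncertainty, and the cost of the change still need to be weighed, in line with the earlier caution about interpreting p-values.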

Common Data Analysis Mistakes

Core Data Analysis Terms

Data cleaning: The process of correcting, standardizing, validating, and preparing data for analysis.

Mini Case Study: Analyzing a Revenue Drop

Imagine an online store reports that monthly revenue fell from \$500,000 to \$425,000.

The first calculation is:

$$\text{Revenue Change} = \frac{425{,}000 - 500{,}000}{500{,}000} \times 100 = -15\%$$

A poor analysis stops here. A better analysis decomposes revenue into drivers:

$$\text{Revenue} = \text{Visitors} \times \text{Conversion Rate} \times \text{Average Order Value}$$

Suppose the analyst finds:

| Metric | Previous month | Current month | Change |
| --- | --- | --- | --- |
| Visitors | 250,000 | 255,000 | +2% |
| Conversion rate | 4.0% | 3.3% | -17.5% |
| Average order value | \$50 | \$50.50 | +1% |
| Revenue | \$500,000 | \$425,000 | -15% |

The likely driver is not traffic or order value; it is conversion rate. The analyst should then segment conversion rate by device, channel, product category, region, and checkout step.

This decomposition turns a vague problem into targeted investigation.
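Because revenue is a product of three drivers, one way to apportion the change is to compare the logarithm of each driver's ratio: the logs add up to the log of the revenue ratio. This log decomposition is a technique introduced here for illustration and is not part of the case study itself; the numbers come from the table above.

```python
from math import exp, log

previous = {"visitors": 250_000, "conversion_rate": 0.040, "average_order_value": 50.00}
current = {"visitors": 255_000, "conversion_rate": 0.033, "average_order_value": 50.50}


def log_contributions(previous, current):
    """Log of each driver's ratio; the values sum to the log of the revenue ratio."""
    return {driver: log(current[driver] / previous[driver]) for driver in previous}


contributions = log_contributions(previous, current)
total = sum(contributions.values())

for driver, value in contributions.items():
    print(f"{driver}: {value:+.4f} ({value / total:.0%} of the log change)")
print(f"implied revenue ratio: {exp(total):.3f}")  # about 0.850, i.e. a 15% decline
```

Conversion rate accounts for more than 100% of the log change because the small gains in visitors and order value partly offset it, which supports the conclusion that conversion is the driver to investigate.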

How to Communicate an Analysis

  1. Begin with the main conclusion in one sentence. Decision-makers should understand the key message before seeing supporting details.

  2. Use a small number of charts or tables that directly support the conclusion. Each visual should have a clear title, labeled axes, units, and a takeaway.

  3. Connect the pattern to likely drivers. Distinguish confirmed evidence from hypotheses that require further validation.

  4. Identify missing data, assumptions, possible bias, measurement issues, and uncertainty. This improves trust rather than weakening the analysis.

  5. Translate the analysis into a concrete next step, such as running an experiment, fixing a data pipeline, changing targeting, or investigating a specific segment.

  6. Specify how success will be measured after the recommendation is implemented.

Use the One-Sentence Insight Test

If you cannot summarize the finding in one sentence, the analysis is probably not ready. A strong insight states what changed, where it changed, how much it changed, and why it matters.

A beginner does not need every tool at once. Start with concepts, then add tools as your questions become more complex.

| Tool category | Beginner option | Why it matters |
| --- | --- | --- |
| Spreadsheets | Excel or Google Sheets | Fast inspection, pivot tables, simple charts |
| Querying | SQL | Essential for extracting and aggregating database data |
| Programming | Python or R | Reproducible cleaning, analysis, automation, and modeling |
| Visualization | Tableau, Power BI, Python libraries, or R packages | Clear communication and dashboarding |
| Statistics | Descriptive statistics and inference | Sound interpretation of uncertainty |
| Documentation | Data dictionaries and notebooks | Reproducibility and collaboration |

Python’s pandas library is widely used for tabular data manipulation, including loading, filtering, joining, grouping, reshaping, and summarizing datasets. SQL is essential because much organizational data lives in relational databases, where analysts often need to filter, join, and aggregate records before deeper analysis.

Footnotes

  1. pandas Documentation: 10 Minutes to pandas - Provides an overview of pandas operations for tabular data manipulation in Python.
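The filter, join, and aggregate pattern mentioned above can be tried without a database server by using Python's built-in `sqlite3` module. The table names, columns, and rows below are hypothetical and exist only to make the query runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER,
    amount REAL,
    order_date TEXT
);
INSERT INTO customers VALUES (1, 'North'), (2, 'South'), (3, 'South');
INSERT INTO orders VALUES
    (10, 1, 120.0, '2024-05-03'),
    (11, 2, 80.0, '2024-05-10'),
    (12, 3, 45.0, '2024-05-21'),
    (13, 1, 60.0, '2024-04-28');
""")

# Filter to May onward, join orders to customers, aggregate by region.
query = """
SELECT c.region, COUNT(*) AS order_count, SUM(o.amount) AS revenue
FROM orders AS o
JOIN customers AS c ON c.customer_id = o.customer_id
WHERE o.order_date >= '2024-05-01'
GROUP BY c.region
ORDER BY revenue DESC;
"""
rows = conn.execute(query).fetchall()
conn.close()

for region, order_count, revenue in rows:
    print(region, order_count, revenue)
```

Storing dates as ISO 8601 text (`YYYY-MM-DD`) lets the string comparison in the `WHERE` clause behave like a date comparison, which is a common convention in SQLite.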
