What Does a Data Scientist Do? A Step-by-Step Guide

This resource demystifies the data scientist’s role, walking you through every stage of the workflow: defining a problem, gathering data, cleaning, exploring, modeling, evaluating, and communicating results. You’ll practice with SQL and Python examples, see real-world applications, and learn how Galaxy streamlines collaboration and query management for data scientists.

Learning Objectives

  • Understand the end-to-end workflow of a data scientist.
  • Identify the key skills, tools, and deliverables at each stage.
  • Write basic SQL and Python code to explore and model data.
  • Apply best practices for collaboration, versioning, and communication.
  • Leverage Galaxy to speed up data discovery, querying, and sharing.

1. The Big Picture: Why Data Science Matters

Data science converts raw data into actionable insights that improve products, reduce costs, and unlock new revenue streams. A data scientist sits at the intersection of statistics, computer science, and domain expertise, acting as:

  • Analyst – digging into past events and trends.
  • Engineer – writing robust code, SQL, and pipelines.
  • Storyteller – translating findings for stakeholders.

2. A Data Scientist’s Workflow

  1. Problem Definition
  2. Data Collection & Access
  3. Data Cleaning & Wrangling
  4. Exploratory Data Analysis (EDA)
  5. Modeling & Evaluation
  6. Deployment & Monitoring
  7. Communication & Iteration

Let’s dive into each step with hands-on examples.

3. Problem Definition

3.1 Clarify Business Goals

Example goal: “Reduce user churn by 10% in Q4.” Translate it into a data question that can be answered with the data you have: “Can we predict which users are likely to churn in the next 30 days?”
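One way to make that data question concrete is to write the label down as code. The sketch below assumes an events table with user_id and event_date columns; the function name label_churn, the cutoff date, and the definition of churn as "no events in the 30 days after the cutoff" are illustrative assumptions, not a standard definition.

```python
import pandas as pd


def label_churn(events: pd.DataFrame, cutoff: str, window_days: int = 30) -> pd.Series:
    """Return 1 for users with no events in the window after the cutoff, else 0."""
    cutoff_ts = pd.Timestamp(cutoff)
    window_end = cutoff_ts + pd.Timedelta(days=window_days)

    # Only users already seen on or before the cutoff can churn.
    known = events.loc[events["event_date"] <= cutoff_ts, "user_id"].unique()

    # Any event inside (cutoff, cutoff + window] counts as "still active".
    in_window = events["event_date"].gt(cutoff_ts) & events["event_date"].le(window_end)
    active = set(events.loc[in_window, "user_id"])

    labels = [0 if user in active else 1 for user in known]
    return pd.Series(labels, index=pd.Index(known, name="user_id"), name="is_churned")


# Hypothetical events: user 1 returns after the cutoff, user 2 does not,
# and user 3 first appears after the cutoff, so user 3 gets no label.
events = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "event_date": pd.to_datetime(["2024-01-05", "2024-02-10", "2024-01-20", "2024-02-15"]),
})
labels = label_churn(events, cutoff="2024-01-31")
```

Writing the label this way forces the team to agree on the cutoff, the window, and who counts as an eligible user before any modeling starts.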

3.2 Common Obstacles

  • Vague objectives → Ask follow-up questions.
  • Lack of metrics → Define measurable KPIs first.

4. Data Collection & Access

4.1 Querying Databases with SQL

Real-world datasets often live in relational databases. Below is a starter query you might run in Galaxy:

-- One row per user; the first recorded event is used as a proxy for signup
SELECT user_id,
       MIN(event_date) AS signup_date,
       MAX(event_date) AS last_active_date,
       COUNT(*) AS total_events
FROM analytics.events
WHERE event_date < CURRENT_DATE
GROUP BY user_id;

Galaxy Tip: Use Galaxy’s context-aware AI copilot to autocomplete table names, suggest joins, and annotate your query. Save it to a Collection named Churn_Analysis for your team.
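If you want to try the starter query before connecting to a real warehouse, a rough local sketch is to run it against an in-memory SQLite database. The sample rows, the two-column table layout, and the added ORDER BY are assumptions for illustration; your production schema and SQL dialect may differ.

```python
import sqlite3

# Hypothetical sample rows: (user_id, event_date as ISO-8601 text).
ROWS = [
    (1, "2024-01-01"),
    (1, "2024-01-15"),
    (2, "2024-01-03"),
    (2, "2024-01-04"),
    (2, "2024-02-20"),
]

QUERY = """
SELECT user_id,
       MIN(event_date) AS signup_date,
       MAX(event_date) AS last_active_date,
       COUNT(*)        AS total_events
FROM analytics.events
WHERE event_date < CURRENT_DATE
GROUP BY user_id
ORDER BY user_id;
"""


def run_starter_query(rows):
    conn = sqlite3.connect(":memory:")
    try:
        # Attach a second in-memory database so the schema-qualified
        # name analytics.events resolves just as it would in a warehouse.
        conn.execute("ATTACH DATABASE ':memory:' AS analytics")
        conn.execute("CREATE TABLE analytics.events (user_id INTEGER, event_date TEXT)")
        conn.executemany("INSERT INTO analytics.events VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


result = run_starter_query(ROWS)
for row in result:
    print(row)
```

Storing dates as ISO-8601 text keeps the comparison with CURRENT_DATE correct in SQLite, because those strings sort in date order.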

4.2 External Sources

APIs, CSVs, streaming logs, and third-party datasets often complement your internal data.

5. Data Cleaning & Wrangling

5.1 Typical Tasks

  • Handling missing values
  • Removing duplicates
  • Type conversions
  • Creating derived features
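The four tasks above can be sketched in a single pandas function. The column names (user_id, event_date, event_type), the sample rows, and the is_weekend feature are assumptions chosen for illustration; which rows to drop and which to repair depends on your data.

```python
import pandas as pd


def clean_events(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()

    # Handling missing values: events without a user_id cannot be attributed.
    df = df.dropna(subset=["user_id"])

    # Type conversions: integer ids and real timestamps.
    df["user_id"] = df["user_id"].astype("int64")
    df["event_date"] = pd.to_datetime(df["event_date"], errors="coerce")
    df = df.dropna(subset=["event_date"])  # unparseable dates became NaT

    # Removing duplicates: the same user, date, and event type recorded twice.
    df = df.drop_duplicates(subset=["user_id", "event_date", "event_type"])

    # Creating derived features: Saturday is 5 and Sunday is 6.
    df["is_weekend"] = df["event_date"].dt.dayofweek >= 5
    return df.reset_index(drop=True)


# Hypothetical raw export with a missing id, a bad date, and a duplicate row.
raw = pd.DataFrame({
    "user_id": [1, 1, None, 2, 2],
    "event_date": ["2024-01-06", "2024-01-06", "2024-01-07", "not a date", "2024-01-08"],
    "event_type": ["login", "login", "login", "click", "click"],
})
cleaned = clean_events(raw)
```

Working on a copy keeps the raw data intact, so you can always compare row counts before and after each step.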

5.2 Hands-On in Python

import pandas as pd

df = pd.read_csv("events.csv")
# Drop rows with no user_id
df = df.dropna(subset=["user_id"])
# Convert timestamps
df["event_date"] = pd.to_datetime(df["event_date"])
# Aggregate to one row per user so later steps count each user once
df = (
    df.groupby("user_id")
    .agg(last_event_date=("event_date", "max"), total_events=("event_date", "count"))
    .reset_index()
)
# Engineer feature: days_since_last_event, measured from the latest date in the data
df["days_since_last_event"] = (df["last_event_date"].max() - df["last_event_date"]).dt.days

5.3 Common Pitfalls

  • Imbalanced classes → Use resampling or balanced metrics.
  • Data leakage → Split train/test before fitting any feature transformations, and build features only from data available before the prediction date.
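Both pitfalls can be handled in a few lines of scikit-learn. The sketch below uses synthetic data and an arbitrary churn rule, both invented for illustration. It splits before fitting anything, wraps the scaler in a Pipeline so its statistics come from the training rows only, and uses class_weight="balanced" as one possible alternative to resampling.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data in which churners are the minority class (illustration only).
rng = np.random.default_rng(0)
n = 1000
days_since_last_event = rng.integers(0, 60, size=n)
total_events = rng.integers(1, 200, size=n)
X = np.column_stack([days_since_last_event, total_events]).astype(float)
# Arbitrary rule: long-inactive users are more likely to be labeled as churned.
y = (days_since_last_event + rng.normal(0, 10, size=n) > 50).astype(int)

# Split first, so nothing fitted below ever sees the test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pipeline = Pipeline([
    ("scale", StandardScaler()),  # mean and std are learned from X_train only
    ("model", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
pipeline.fit(X_train, y_train)

test_auc = roc_auc_score(y_test, pipeline.predict_proba(X_test)[:, 1])
print(f"Test AUC: {test_auc:.3f}")
```

Because the scaler lives inside the pipeline, the same training-set statistics are reused automatically at prediction time.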

6. Exploratory Data Analysis (EDA)

EDA uncovers patterns, anomalies, and relationships that guide modeling decisions.

6.1 Visual Exploration

import seaborn as sns
import matplotlib.pyplot as plt

sns.histplot(df["days_since_last_event"], bins=30)
plt.title("User Activity Recency")
plt.show()

Use Galaxy’s upcoming visualization preview to generate quick charts without leaving the SQL editor.

6.2 Statistical Summaries

SELECT AVG(days_since_last_event) AS avg_recency,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY days_since_last_event) AS median_recency
FROM user_features;
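The same two summaries can be computed in pandas once the features are in a DataFrame. The values below are made up, and the user_features name simply mirrors the table in the query above.

```python
import pandas as pd

# Hypothetical recency values in days, one per user.
user_features = pd.DataFrame({"days_since_last_event": [1, 2, 3, 10]})

recency = user_features["days_since_last_event"]
avg_recency = recency.mean()
# quantile() interpolates linearly by default, like PERCENTILE_CONT.
median_recency = recency.quantile(0.5)
print(avg_recency, median_recency)
```

Here the mean (4.0) is pulled up by the single inactive user while the median (2.5) is not, which is why it helps to report both for skewed features like recency.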

7. Modeling & Evaluation

7.1 Choosing a Model

  • Classification (churn prediction)
  • Regression (LTV estimation)
  • Clustering (user segmentation)

7.2 Quick Logistic Regression

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

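# Assumes df has one row per user, with a total_events count and a binary is_churned label
X = df[["days_since_last_event", "total_events"]]
y = df["is_churned"]

# Fix random_state so the split, and therefore the reported AUC, is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)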

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

preds = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, preds))

7.3 Evaluation Metrics

  • Classification: AUC, F1, Precision-Recall
  • Regression: RMSE, MAE, R²
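As a quick reference, the metrics listed above can all be computed with scikit-learn. The labels, scores, and the 0.5 decision threshold below are hypothetical values chosen so the results are easy to check by hand.

```python
import math

from sklearn.metrics import (
    average_precision_score,
    f1_score,
    mean_absolute_error,
    mean_squared_error,
    r2_score,
    roc_auc_score,
)

# Classification: hypothetical labels and predicted churn probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
# F1 needs hard labels, so a threshold has to be chosen (0.5 is an assumption).
y_pred = [int(p >= 0.5) for p in y_score]

classification = {
    "auc": roc_auc_score(y_true, y_score),
    "f1": f1_score(y_true, y_pred),
    # Average precision summarizes the precision-recall curve.
    "average_precision": average_precision_score(y_true, y_score),
}

# Regression: hypothetical true and predicted lifetime values.
ltv_true = [3.0, 5.0, 7.0]
ltv_pred = [2.0, 5.0, 9.0]

regression = {
    "rmse": math.sqrt(mean_squared_error(ltv_true, ltv_pred)),
    "mae": mean_absolute_error(ltv_true, ltv_pred),
    "r2": r2_score(ltv_true, ltv_pred),
}

print(classification)
print(regression)
```

AUC and average precision are computed from the probabilities, while F1 depends on the threshold you pick, so the threshold should be reported alongside it.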

7.4 Error Analysis

Dive into false positives and false negatives to see where the model fails, then use those cases to refine features. Save Jupyter notebooks in a Galaxy-linked Git repository to keep model code versioned.
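A minimal way to start error analysis is to pull the misclassified rows into their own tables and look at them. The rows, column names, and 0.5 threshold below are hypothetical.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical test-set rows with true labels and predicted probabilities.
results = pd.DataFrame({
    "user_id": [101, 102, 103, 104, 105],
    "days_since_last_event": [2, 25, 40, 5, 33],
    "is_churned": [0, 0, 1, 1, 1],
    "churn_probability": [0.2, 0.7, 0.9, 0.4, 0.6],
})
results["predicted"] = (results["churn_probability"] >= 0.5).astype(int)

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(results["is_churned"], results["predicted"]).ravel()

# Users flagged as churners who stayed, and churners the model missed.
false_positives = results[(results["predicted"] == 1) & (results["is_churned"] == 0)]
false_negatives = results[(results["predicted"] == 0) & (results["is_churned"] == 1)]

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(false_negatives[["user_id", "days_since_last_event", "churn_probability"]])
```

In this toy data the missed churner was active only 5 days ago, which hints that recency alone is not enough and another feature may be needed.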

8. Deployment & Monitoring

8.1 From Notebook to Production

  • Convert models to APIs (FastAPI, Flask, or serverless).
  • Schedule batch predictions via Airflow or dbt jobs.
  • Store results in a predictions table and expose it via a Galaxy-endorsed query.
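The "store results in a predictions table" step might look like the sketch below, which uses SQLite so it runs anywhere. The table layout, the function names, and the hand-set score_user formula are assumptions; in practice score_user would call your trained model and the connection would point at your warehouse.

```python
import math
import sqlite3
from datetime import date


def score_user(days_since_last_event: int, total_events: int) -> float:
    """Stand-in for model.predict_proba: a hand-set logistic curve, not a trained model."""
    z = 0.1 * days_since_last_event - 0.05 * total_events - 1.0
    return 1.0 / (1.0 + math.exp(-z))


def write_batch_predictions(conn, features, run_date: date) -> int:
    """Score each (user_id, days_since_last_event, total_events) row and store it."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS predictions (
               user_id INTEGER,
               run_date TEXT,
               churn_probability REAL,
               PRIMARY KEY (user_id, run_date)
           )"""
    )
    rows = [(uid, run_date.isoformat(), score_user(days, n)) for uid, days, n in features]
    # INSERT OR REPLACE keeps a rerun of the same batch from creating duplicates.
    conn.executemany("INSERT OR REPLACE INTO predictions VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
features = [(1, 2, 40), (2, 45, 3)]  # hypothetical feature rows
write_batch_predictions(conn, features, date(2024, 1, 31))
write_batch_predictions(conn, features, date(2024, 1, 31))  # rerun of the same batch
stored = conn.execute(
    "SELECT user_id, churn_probability FROM predictions ORDER BY user_id"
).fetchall()
```

Keying the table on user and run date makes scheduled jobs safe to retry, which matters once an orchestrator is rerunning failed tasks for you.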

8.2 Monitoring

Create dashboards to track model drift. In Galaxy, endorse a query that calculates weekly AUC and share it with stakeholders.
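A weekly AUC calculation could be sketched in pandas as follows, assuming a table of scored users whose true outcomes have since been observed. The column names scored_at, is_churned, and churn_probability, and all of the rows, are hypothetical.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score


def weekly_auc(scored: pd.DataFrame) -> pd.Series:
    """AUC per calendar week, indexed by the Monday that starts the week."""
    weeks = scored["scored_at"].dt.to_period("W").dt.start_time
    aucs = {}
    for week, group in scored.groupby(weeks):
        # AUC is undefined when a week contains only one class, so skip it.
        if group["is_churned"].nunique() < 2:
            continue
        aucs[week] = roc_auc_score(group["is_churned"], group["churn_probability"])
    return pd.Series(aucs, name="auc", dtype=float)


# Hypothetical history: good ranking in week one, inverted ranking in week two,
# and a third week with a single row, which cannot be scored.
scored = pd.DataFrame({
    "scored_at": pd.to_datetime([
        "2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04",
        "2024-01-08", "2024-01-09", "2024-01-10", "2024-01-11",
        "2024-01-15",
    ]),
    "is_churned": [0, 1, 0, 1, 0, 1, 0, 1, 1],
    "churn_probability": [0.1, 0.9, 0.2, 0.8, 0.6, 0.4, 0.7, 0.3, 0.5],
})
auc_by_week = weekly_auc(scored)
print(auc_by_week)
```

A sustained drop in this series is one signal of drift. Churn labels only become known after the prediction window closes, so the most recent weeks will always be missing or incomplete.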

9. Communication & Storytelling

9.1 Tailor the Message

Executives want ROI; engineers want reproducibility; product wants next steps. Use:

  • Interactive dashboards (Looker, Mode, soon Galaxy).
  • Slide decks with clear visuals.
  • Galaxy Collections for curated SQL + charts.

9.2 Tips for Effective Storytelling

  1. Lead with the “so what.”
  2. Use plain language; avoid jargon.
  3. Highlight uncertainty and assumptions.

10. Essential Tools & Skills

10.1 Programming & Querying

  • SQL (Galaxy makes this faster and collaborative)
  • Python (pandas, NumPy, scikit-learn, PyTorch)
  • R (for certain statistical analyses)

10.2 Cloud & DevOps

  • Databases: PostgreSQL, Snowflake, BigQuery
  • Version Control: GitHub (integrates with Galaxy)
  • Orchestration: Airflow, dbt, Prefect

10.3 Soft Skills

  • Stakeholder management
  • Structured communication
  • Experimentation mindset

11. Common Obstacles & Troubleshooting

11.1 Dirty Data

Symptom: Model performs poorly. Fix: Re-examine feature engineering; use data validation checks.
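Data validation checks do not need a dedicated framework to be useful. The sketch below is one minimal approach: a function that returns a list of problems found in a per-user feature table. The required columns and the specific rules are assumptions to adapt to your own data.

```python
import pandas as pd

REQUIRED_COLUMNS = ["user_id", "days_since_last_event", "total_events"]


def validate_features(df: pd.DataFrame) -> list[str]:
    """Return human-readable problems; an empty list means every check passed."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        # The remaining checks need these columns, so stop early.
        return [f"missing columns: {missing}"]

    problems = []
    if df[REQUIRED_COLUMNS].isna().any().any():
        problems.append("null values in required columns")
    if df["user_id"].duplicated().any():
        problems.append("duplicate user_id rows")
    if (df["days_since_last_event"] < 0).any():
        problems.append("negative days_since_last_event")
    if (df["total_events"] < 1).any():
        problems.append("total_events below 1")
    return problems


# Hypothetical feature tables: one clean, one with three distinct problems.
clean = pd.DataFrame({"user_id": [1, 2], "days_since_last_event": [3, 40], "total_events": [12, 1]})
dirty = pd.DataFrame({"user_id": [1, 1], "days_since_last_event": [-2, 5], "total_events": [4, 0]})

print(validate_features(clean))
print(validate_features(dirty))
```

Running a check like this before every training run turns silent data problems into an explicit list you can act on.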

11.2 Slow Queries

Symptom: SQL takes minutes to run. Fix: Use Galaxy’s AI to suggest indices, rewrite queries, and profile cost.
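Whatever tool suggests the index, it helps to confirm that the database actually uses it. The sketch below does this with SQLite's EXPLAIN QUERY PLAN on a made-up events table; the table, the index name, and the query_plan helper are illustrative, and other databases have their own EXPLAIN syntax and output.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's plan description for a query as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each plan row is a human-readable detail string.
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, event_date TEXT, event_type TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(i % 100, "2024-01-01", "login") for i in range(1000)],
)

sql = "SELECT * FROM events WHERE user_id = ?"
plan_before = query_plan(conn, sql, (42,))

# Hypothetical index on the column used in the WHERE clause.
conn.execute("CREATE INDEX idx_events_user_id ON events (user_id)")
plan_after = query_plan(conn, sql, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

Before the index exists the plan reports a full table scan; afterwards it names the index, which is the evidence to look for before and after any tuning change.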

11.3 Stakeholder Misalignment

Symptom: Analysis answers the wrong question. Fix: Revisit objectives; document assumptions in Galaxy Collections for transparency.

12. Practice Exercises

  1. SQL Challenge: Write a query to compute 7-day rolling retention by signup cohort. Run in Galaxy, then endorse it.
  2. Python EDA: Load the “Titanic” dataset and identify the three factors most correlated with survival.
  3. Model Deployment: Wrap your churn model in a FastAPI endpoint; log predictions to a database.
  4. Storytelling: Create a one-slide summary of your findings for a non-technical audience.

Key Takeaways

  • A data scientist’s work is cyclical: define, collect, clean, explore, model, deploy, communicate, repeat.
  • SQL and Python are foundational; tools like Galaxy amplify productivity and collaboration.
  • Soft skills—clarity, curiosity, communication—are as critical as technical chops.

Next Steps

  1. Set up a free Galaxy Workspace and connect a sample database.
  2. Tackle the practice exercises above, saving all SQL in Galaxy Collections.
  3. Explore advanced topics: feature stores, MLOps, causal inference.
