What Does a Data Scientist Do? A Step-by-Step Guide

This resource demystifies the data scientist’s role, walking you through every stage of the workflow: defining a problem, gathering data, cleaning, exploring, modeling, evaluating, and communicating results. You’ll practice with SQL and Python examples, see real-world applications, and learn how Galaxy streamlines collaboration and query management for data scientists.

Learning Objectives

Understand the end-to-end workflow of a data scientist.
Identify the key skills, tools, and deliverables at each stage.
Write basic SQL and Python code to explore and model data.
Apply best practices for collaboration, versioning, and communication.
Leverage Galaxy to speed up data discovery, querying, and sharing.

1. The Big Picture: Why Data Science Matters

Data science converts raw data into actionable insights that improve products, reduce costs, and unlock new revenue streams. A data scientist sits at the intersection of statistics, computer science, and domain expertise, acting as:

Analyst – digging into past events and trends.
Engineer – writing robust code, SQL, and pipelines.
Storyteller – translating findings for stakeholders.

2. A Data Scientist’s Workflow

Problem Definition
Data Collection & Access
Data Cleaning & Wrangling
Exploratory Data Analysis (EDA)
Modeling & Evaluation
Deployment & Monitoring
Communication & Iteration

Let’s dive into each step with hands-on examples.

3. Problem Definition

3.1 Clarify Business Goals

Example prompt: “Reduce user churn by 10% in Q4.” Translate this into a data question: “Can we predict which users are likely to churn in the next 30 days?”
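One way to make that data question concrete is to write down exactly how a churn label would be computed. The sketch below is only an illustration under stated assumptions: the function name `build_churn_labels`, the `cutoff` parameter, and the rule "no events in the 30 days after the cutoff means churned" are hypothetical choices, not a definition taken from this guide.

```python
import pandas as pd


def build_churn_labels(events: pd.DataFrame, cutoff: str, horizon_days: int = 30) -> pd.Series:
    """Label each user seen before `cutoff` as churned (True) if they have
    no events in the `horizon_days` starting at `cutoff`."""
    cutoff_ts = pd.Timestamp(cutoff)
    horizon_end = cutoff_ts + pd.Timedelta(days=horizon_days)

    # Only users with history before the cutoff can be scored.
    known_users = events.loc[events["event_date"] < cutoff_ts, "user_id"].unique()

    # Users with any activity inside the prediction window did not churn.
    in_window = (events["event_date"] >= cutoff_ts) & (events["event_date"] < horizon_end)
    active_later = set(events.loc[in_window, "user_id"])

    labels = [user not in active_later for user in known_users]
    return pd.Series(labels, index=pd.Index(known_users, name="user_id"), name="is_churned")


# Tiny made-up event log for illustration.
events = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 4],
    "event_date": pd.to_datetime([
        "2024-01-05", "2024-03-10",  # user 1 returns inside the window
        "2024-01-10",                # user 2 never returns
        "2024-02-01", "2024-03-25",  # user 3 returns inside the window
        "2024-03-05",                # user 4 first appears after the cutoff
    ]),
})
labels = build_churn_labels(events, cutoff="2024-03-01")
print(labels)
```

Agreeing on a rule like this with stakeholders up front tends to surface hidden disagreements about what "churn" means before any modeling starts.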

3.2 Common Obstacles

Vague objectives → Ask follow-up questions.
Lack of metrics → Define measurable KPIs first.

4. Data Collection & Access

4.1 Querying Databases with SQL

Real-world datasets often live in relational databases. Below is a starter query you might run in Galaxy:

    SELECT
        user_id,
        MIN(event_date) AS signup_date,
        MAX(event_date) AS last_active_date,
        COUNT(*) AS total_events
    FROM analytics.events
    WHERE event_date < CURRENT_DATE
    GROUP BY user_id;

Galaxy Tip: Use Galaxy’s context-aware AI copilot to autocomplete table names, suggest joins, and annotate your query. Save it to a Collection named Churn_Analysis for your team.
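To continue the analysis in Python, the result of a query like the starter query can be pulled into a pandas DataFrame. The sketch below assumes an in-memory SQLite database as a stand-in for your warehouse, so the table is named `events` rather than `analytics.events`, and the rows are made up for illustration.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for a real warehouse connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, event_date TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "2024-01-05"), (1, "2024-03-20"), (2, "2024-01-10")],
)

QUERY = """
SELECT
    user_id,
    MIN(event_date) AS signup_date,
    MAX(event_date) AS last_active_date,
    COUNT(*) AS total_events
FROM events
WHERE event_date < CURRENT_DATE
GROUP BY user_id
ORDER BY user_id;
"""

# parse_dates converts the ISO date strings into pandas timestamps.
user_summary = pd.read_sql_query(
    QUERY, conn, parse_dates=["signup_date", "last_active_date"]
)
conn.close()
print(user_summary)
```

With a production database you would swap the SQLite connection for the appropriate driver or SQLAlchemy engine; the query text itself stays largely the same.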

4.2 External Sources

APIs, CSVs, streaming logs, and third-party datasets often complement your internal data.
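As a rough illustration of combining an external file with internal data, the sketch below left-joins a small CSV onto a per-user table. The `plan` column and both datasets are hypothetical; `io.StringIO` stands in for a real file on disk.

```python
import io

import pandas as pd

# Stand-in for an external CSV export (for example, from a billing system).
plan_csv = io.StringIO("user_id,plan\n1,pro\n2,free\n")
plans = pd.read_csv(plan_csv)

# Stand-in for internal per-user data.
users = pd.DataFrame({"user_id": [1, 2, 3], "total_events": [2, 1, 5]})

# A left join keeps every internal user; validate guards against duplicate
# keys, and indicator records which rows found a match in the external file.
enriched = users.merge(
    plans, on="user_id", how="left", validate="one_to_one", indicator=True
)
print(enriched)
```

Checking the `_merge` column right after the join is a cheap way to see how much of your user base the external source actually covers.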

5. Data Cleaning & Wrangling

5.1 Typical Tasks

Handling missing values
Removing duplicates
Type conversions
Creating derived features

5.2 Hands-On in Python

    import pandas as pd

    df = pd.read_csv("events.csv")

    # Drop rows with no user_id
    df = df.dropna(subset=["user_id"])

    # Convert timestamps
    df["event_date"] = pd.to_datetime(df["event_date"])

    # Engineer feature: days_since_last_event
    # (measured from the most recent event date in the dataset)
    latest = df.groupby("user_id")["event_date"].max()
    df = df.join(latest, on="user_id", rsuffix="_last")
    df["days_since_last_event"] = (df["event_date_last"].max() - df["event_date_last"]).dt.days

5.3 Common Pitfalls

Imbalanced classes → Use resampling or balanced metrics.
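Data leakage → Split train/test before fitting any learned transformations (scalers, encoders, imputers), and compute features only from data available before the prediction date.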
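Both pitfalls can be addressed in a few lines of scikit-learn. The following is a minimal sketch on synthetic data, not a recommended production setup: the data split happens first, the scaler lives inside a pipeline so it is fit on the training rows only, and `class_weight="balanced"` plus a balanced metric compensate for the rare positive class. The data-generating rule is made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced data: roughly one positive for every nine negatives.
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 2))
y = (X[:, 0] + rng.normal(scale=0.5, size=n) > 1.4).astype(int)

# Split first, keeping the class ratio the same in both halves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The scaler is part of the pipeline, so its mean and variance are learned
# from the training rows only and merely applied to the test rows.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X_train, y_train)

score = balanced_accuracy_score(y_test, model.predict(X_test))
print("Positive rate:", y.mean())
print("Balanced accuracy:", score)
```

Keeping preprocessing inside the pipeline also means cross-validation refits the scaler on each fold, which avoids the same leak there.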

6. Exploratory Data Analysis (EDA)

EDA uncovers patterns, anomalies, and relationships that guide modeling decisions.

6.1 Visual Exploration

    import seaborn as sns
    import matplotlib.pyplot as plt

    sns.histplot(df["days_since_last_event"], bins=30)
    plt.title("User Activity Recency")
    plt.show()

Use Galaxy’s upcoming visualization preview to generate quick charts without leaving the SQL editor.

6.2 Statistical Summaries

    SELECT
        AVG(days_since_last_event) AS avg_recency,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY days_since_last_event) AS median_recency
    FROM user_features;
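The same summary can be computed in pandas once the data is in a DataFrame. This sketch uses a handful of made-up recency values; `user_features` here is a hypothetical in-memory stand-in for the table queried above.

```python
import pandas as pd

# Made-up recency values with one long-inactive user.
user_features = pd.DataFrame({"days_since_last_event": [0, 3, 7, 14, 60]})

recency = user_features["days_since_last_event"]
summary = recency.agg(["mean", "median"])

# quantile(0.5) interpolates linearly by default, as PERCENTILE_CONT does.
median_via_quantile = recency.quantile(0.5)

print(summary)
```

A mean well above the median, as in this toy data, is a sign of a long right tail, which is typical for recency metrics and worth checking before modeling.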

7. Modeling & Evaluation

7.1 Choosing a Model

Classification (churn prediction)
Regression (LTV estimation)
Clustering (user segmentation)

7.2 Quick Logistic Regression

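    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    # Assumes df is a per-user feature table (one row per user) that already
    # contains total_events and a binary is_churned label.
    X = df[["days_since_last_event", "total_events"]]
    y = df["is_churned"]

    # Stratify to preserve the churn rate; fix the seed for reproducibility.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)

    # Score with the predicted probability of the positive (churn) class.
    preds = model.predict_proba(X_test)[:, 1]
    print("AUC:", roc_auc_score(y_test, preds))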

7.3 Evaluation Metrics

Classification: AUC, F1, Precision-Recall
Regression: RMSE, MAE, R²
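The metrics listed above are all available in scikit-learn. The sketch below computes them on tiny hand-made arrays so the calls are easy to follow; the numbers are illustrative only, and the 0.5 threshold used for F1 is an assumption.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    f1_score,
    mean_absolute_error,
    mean_squared_error,
    r2_score,
    roc_auc_score,
)

# Classification: true labels and predicted churn probabilities.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

auc = roc_auc_score(y_true, y_score)
# Average precision summarizes the precision-recall curve in one number.
avg_precision = average_precision_score(y_true, y_score)
# F1 needs hard predictions, so apply a threshold first.
f1 = f1_score(y_true, (y_score >= 0.5).astype(int))

# Regression: true and predicted values (for example, LTV).
y_true_reg = np.array([3.0, 5.0, 2.0, 7.0])
y_pred_reg = np.array([2.5, 5.0, 4.0, 8.0])

rmse = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))
mae = mean_absolute_error(y_true_reg, y_pred_reg)
r2 = r2_score(y_true_reg, y_pred_reg)

print(f"AUC={auc:.3f} AP={avg_precision:.3f} F1={f1:.3f}")
print(f"RMSE={rmse:.3f} MAE={mae:.3f} R2={r2:.3f}")
```

AUC and average precision work on scores, while F1 depends on the chosen threshold, so it is worth reporting the threshold alongside it.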

7.4 Error Analysis

Dive into false positives and false negatives to refine features. Save Jupyter notebooks in a Galaxy-linked Git repository to keep model code versioned.
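A simple way to start error analysis is to put features, labels, and predictions side by side and filter for the two kinds of mistakes. The helper name `error_slices`, the 0.5 threshold, and the example rows below are all hypothetical.

```python
import pandas as pd


def error_slices(features: pd.DataFrame, y_true, y_prob, threshold: float = 0.5):
    """Return (false_positives, false_negatives) as DataFrames that keep
    the original feature columns for inspection."""
    report = features.copy()
    report["y_true"] = list(y_true)
    report["y_prob"] = list(y_prob)
    report["y_pred"] = (report["y_prob"] >= threshold).astype(int)

    false_pos = report[(report["y_pred"] == 1) & (report["y_true"] == 0)]
    false_neg = report[(report["y_pred"] == 0) & (report["y_true"] == 1)]
    return false_pos, false_neg


# Made-up users: features, actual churn, and predicted churn probability.
features = pd.DataFrame({
    "days_since_last_event": [2, 45, 30, 1],
    "total_events": [50, 3, 4, 2],
})
false_pos, false_neg = error_slices(
    features, y_true=[0, 1, 0, 1], y_prob=[0.1, 0.9, 0.7, 0.2]
)
print("False positives:\n", false_pos)
print("False negatives:\n", false_neg)
```

In this toy data the false negative is a recently active user with very few events, the kind of pattern that might suggest an additional feature.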

8. Deployment & Monitoring

8.1 From Notebook to Production

Convert models to APIs (FastAPI, Flask, or serverless).
Schedule batch predictions via Airflow or dbt jobs.
Store results in a predictions table and expose it via a Galaxy-endorsed query.
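The batch-prediction step can be sketched without any serving framework. In the example below, SQLite stands in for the production database, the table name `churn_predictions` is an assumption, and `score_user` is a hypothetical placeholder for a trained model's probability output, not a real model.

```python
import sqlite3
from datetime import date


def score_user(days_since_last_event: int, total_events: int) -> float:
    """Hypothetical stand-in for a trained model; returns a value in [0, 1]."""
    raw = 0.02 * days_since_last_event - 0.01 * total_events
    return min(1.0, max(0.0, raw))


def write_batch_predictions(conn: sqlite3.Connection, users, run_date: date) -> int:
    """Score (user_id, days_since_last_event, total_events) tuples and upsert
    one row per user and run date, so reruns do not create duplicates."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS churn_predictions (
            user_id INTEGER,
            run_date TEXT,
            churn_probability REAL,
            PRIMARY KEY (user_id, run_date)
        )
        """
    )
    rows = [
        (user_id, run_date.isoformat(), score_user(days, events))
        for user_id, days, events in users
    ]
    # INSERT OR REPLACE is SQLite syntax; other databases use their own upsert.
    conn.executemany("INSERT OR REPLACE INTO churn_predictions VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
batch = [(1, 40, 10), (2, 1, 100)]
write_batch_predictions(conn, batch, date(2024, 4, 1))
write_batch_predictions(conn, batch, date(2024, 4, 1))  # rerun is idempotent
stored = conn.execute(
    "SELECT user_id, churn_probability FROM churn_predictions ORDER BY user_id"
).fetchall()
print(stored)
```

Keying the table on user and run date makes a scheduled job safe to retry, which matters once an orchestrator is rerunning failed tasks automatically.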

8.2 Monitoring

Create dashboards to track model drift. In Galaxy, endorse a query that calculates weekly AUC and share it with stakeholders.
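A weekly AUC series like the one described can also be prototyped in pandas before it becomes a dashboard query. The sketch assumes a scored table with hypothetical columns `scored_at`, `is_churned`, and `churn_probability`, where the true outcome is already known for each row.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score


def weekly_auc(scored: pd.DataFrame) -> pd.Series:
    """Compute AUC per calendar week (Monday to Sunday), indexed by the
    week's start date. Weeks containing only one class yield NaN."""
    weeks = scored["scored_at"].dt.to_period("W")
    results = {}
    for week, group in scored.groupby(weeks):
        if group["is_churned"].nunique() < 2:
            results[week.start_time] = float("nan")  # AUC is undefined
        else:
            results[week.start_time] = roc_auc_score(
                group["is_churned"], group["churn_probability"]
            )
    return pd.Series(results, name="auc")


# Two made-up weeks of scored users; the second week ranks users less well.
scored = pd.DataFrame({
    "scored_at": pd.to_datetime([
        "2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04",
        "2024-01-08", "2024-01-09", "2024-01-10", "2024-01-11",
    ]),
    "is_churned": [0, 1, 0, 1, 0, 1, 0, 1],
    "churn_probability": [0.2, 0.9, 0.1, 0.7, 0.6, 0.4, 0.3, 0.8],
})
auc_by_week = weekly_auc(scored)
print(auc_by_week)
```

A sustained drop in this series is one possible drift signal; it only becomes available once outcomes are known, so it lags the predictions by the length of the churn window.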

9. Communication & Storytelling

9.1 Tailor the Message

Executives want ROI; engineers want reproducibility; product wants next steps. Use:

Interactive dashboards (Looker, Mode, and soon Galaxy).
Slide decks with clear visuals.
Galaxy Collections for curated SQL + charts.

9.2 Tips for Effective Storytelling

Lead with the “so what.”
Use plain language; avoid jargon.
Highlight uncertainty and assumptions.

10. Essential Tools & Skills

10.1 Programming & Querying

SQL (Galaxy makes this faster and collaborative)
Python (pandas, NumPy, scikit-learn, PyTorch)
R (for certain statistical analyses)

10.2 Cloud & DevOps

Databases: PostgreSQL, Snowflake, BigQuery
Version Control: GitHub (integrates with Galaxy)
Orchestration: Airflow, dbt, Prefect

10.3 Soft Skills

Stakeholder management
Structured communication
Experimentation mindset

11. Common Obstacles & Troubleshooting

11.1 Dirty Data

Symptom: Model performs poorly. Fix: Re-examine feature engineering; use data validation checks.
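Data validation checks can start as a small function that runs before any modeling. The sketch below is one possible shape for such checks; the function name `validate_events`, the specific rules, and the `as_of` parameter are assumptions for illustration, and dedicated validation libraries offer richer versions of the same idea.

```python
import pandas as pd


def validate_events(df: pd.DataFrame, as_of: str) -> list:
    """Return a list of human-readable problems found in an events table.
    An empty list means every check passed."""
    problems = []

    if df["user_id"].isna().any():
        problems.append("missing user_id")

    if df.duplicated().any():
        problems.append("duplicate rows")

    if not pd.api.types.is_datetime64_any_dtype(df["event_date"]):
        problems.append("event_date is not a datetime column")
    elif (df["event_date"] > pd.Timestamp(as_of)).any():
        problems.append("event_date in the future")

    return problems


# Made-up data with a missing user_id and one exact duplicate row.
dirty = pd.DataFrame({
    "user_id": [1, None, 2, 2],
    "event_date": pd.to_datetime(
        ["2024-01-05", "2024-01-06", "2024-01-07", "2024-01-07"]
    ),
})
clean = dirty.dropna(subset=["user_id"]).drop_duplicates()

dirty_problems = validate_events(dirty, as_of="2024-04-01")
clean_problems = validate_events(clean, as_of="2024-04-01")
print(dirty_problems)
print(clean_problems)
```

Running checks like these at load time turns a vague "model performs poorly" symptom into a specific, fixable data issue.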

11.2 Slow Queries

Symptom: SQL takes minutes to run. Fix: Use Galaxy’s AI to suggest indices, rewrite queries, and profile cost.

11.3 Stakeholder Misalignment

Symptom: Analysis answers the wrong question. Fix: Revisit objectives; document assumptions in Galaxy Collections for transparency.

12. Practice Exercises

SQL Challenge: Write a query to compute 7-day rolling retention by signup cohort. Run it in Galaxy, then endorse it.
Python EDA: Load the “Titanic” dataset and identify the three factors most correlated with survival.
Model Deployment: Wrap your churn model in a FastAPI endpoint; log predictions to a database.
Storytelling: Create a one-slide summary of your findings for a non-technical audience.

Key Takeaways

A data scientist’s work is cyclical: define, collect, clean, explore, model, deploy, communicate, repeat.
SQL and Python are foundational; tools like Galaxy amplify productivity and collaboration.
Soft skills—clarity, curiosity, communication—are as critical as technical chops.

Next Steps

Set up a free Galaxy Workspace and connect a sample database.
Tackle the practice exercises above, saving all SQL in Galaxy Collections.
Explore advanced topics: feature stores, MLOps, causal inference.