5 Agentic Workflows to Automate Your Data Science Pipeline
This article covers five concrete agentic workflows, one for each major stage of a data science pipeline, from automated EDA to feature engineering, with code patterns and real-world scenarios.
Source: KDnuggets
Introduction
The average data scientist spends roughly 45% of their working time on data preparation and cleaning, not on modeling, not on insight generation, not on the work that requires genuine judgment. That estimate keeps appearing across industry surveys because it keeps being true. The tasks eating up that time — profiling columns, flagging nulls, running the same exploratory data analysis (EDA) scripts, grid-searching hyperparameters, and writing the same monitoring checks — are formulaic enough to follow explicit rules.
That is precisely what makes them automatable with agents. Agentic workflows do not replace the data scientist. They absorb the procedural weight so you can focus on the evaluative weight: deciding whether a model makes sense, whether a feature is genuinely informative, whether a finding warrants a business decision. Platforms like Databricks have already started shipping agentic data science capabilities into their core infrastructure, with their Agent framework explicitly designed to "compress the time from question to insight." This is the direction production data teams are moving.
This article covers five concrete agentic workflows, one for each major stage of a data science pipeline. Each includes a real-world scenario, tested code patterns, and the design decisions that matter in production.
Prerequisites
All five workflows assume Python 3.10+ and familiarity with pandas, scikit-learn, and basic large language model (LLM) API usage. Specific package requirements are listed under each workflow. For the tool-calling patterns, you need either an OpenAI API key or a local serving endpoint (Ollama, vLLM) that exposes an OpenAI-compatible API.
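A minimal sketch of how the same client code might target either a hosted or a local backend, assuming the endpoint URL and key arrive through environment variables. The variable names `LLM_BASE_URL` and `LLM_API_KEY` and the helper `client_kwargs` are hypothetical and not part of any library; `base_url` and `api_key` are constructor arguments of the OpenAI Python client. The Ollama URL in the comment is its commonly documented default and may differ on your machine.

```python
import os


def client_kwargs(env=None):
    """Build keyword arguments for OpenAI(...) from environment variables."""
    env = os.environ if env is None else env
    kwargs = {}
    base_url = env.get("LLM_BASE_URL")  # hypothetical variable name
    if base_url:
        # Local OpenAI-compatible server, e.g. Ollama at http://localhost:11434/v1
        kwargs["base_url"] = base_url
        # Placeholder key (assumption): the client expects a non-empty string
        kwargs["api_key"] = env.get("LLM_API_KEY", "not-needed")
    # With no base_url set, the client falls back to OPENAI_API_KEY and the hosted API.
    return kwargs


# Usage (requires the openai package):
#   from openai import OpenAI
#   client = OpenAI(**client_kwargs())

local = client_kwargs({"LLM_BASE_URL": "http://localhost:11434/v1"})
hosted = client_kwargs({})
```

Keeping the switch in one helper means the workflow scripts themselves do not need to change when you move between a hosted model and a local one.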
# Core packages used across all workflows
pip install openai pandas numpy scipy scikit-learn lightgbm shap pydantic
Workflow 1: Automated Exploratory Data Analysis Agent
What it replaces: Manually loading data, computing summary statistics, visualizing distributions, inspecting nulls, detecting outliers, writing up findings. Every dataset, every time, the same script with different column names.
What the agent does instead: Loads the dataset, runs a full profile, flags issues by severity, and produces a structured Markdown report. A human reviews the findings and decides what to do about them. The agent handles everything before that review.
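The last step described above, turning flagged issues into a prioritized Markdown report, can be sketched without any model call. This is an illustrative stand-in, not the article's implementation: the `Issue` class, `SEVERITY_ORDER`, and `render_report` are hypothetical names, and the sample findings are made up for the example.

```python
from dataclasses import dataclass

# Lower rank sorts first, so high-severity findings lead the report.
SEVERITY_ORDER = {"high": 0, "medium": 1, "low": 2}


@dataclass
class Issue:
    column: str
    issue_type: str
    severity: str
    detail: str


def render_report(issues):
    """Render issues as a Markdown bullet list, most severe first."""
    ranked = sorted(
        issues,
        key=lambda i: (SEVERITY_ORDER.get(i.severity, 99), i.column),
    )
    lines = ["# EDA Findings", ""]
    if not ranked:
        lines.append("No issues detected.")
    for issue in ranked:
        lines.append(
            f"- **{issue.severity.upper()}** `{issue.column}` "
            f"({issue.issue_type}): {issue.detail}"
        )
    return "\n".join(lines)


report = render_report([
    Issue("created_at", "dtype", "medium", "date stored as string"),
    Issue("session_count", "null_rate", "high", "22% of values are missing"),
])
print(report)
```

A deterministic renderer like this is a useful fallback when the model call fails, since the reviewer still receives the ranked findings.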
// Architecture
The design follows a Reasoning and Acting (ReAct) pattern with two tools: profile_dataset produces summary statistics per column, and flag_issues classifies problems by severity. The agent then synthesizes both outputs into a structured report through a single language model call. In the code pattern below, the two tools run in a fixed order before that call rather than being chosen by the model at each step. The key design decision is how the agent handles the flag_issues output: it reasons about which issues are actionable before reporting, so the output is a prioritized list, not a raw dump.
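For readers who want to see the loop itself, here is a minimal sketch of a ReAct-style control flow in which a policy picks the next tool from the observations gathered so far. Everything in it is an assumption made for illustration: `scripted_policy` stands in for the language model, and `toy_profile` and `toy_flag` are simplified stand-ins for the two tools, operating on plain dictionaries instead of a DataFrame.

```python
def toy_profile(rows):
    """Toy profiler: null rate per column over a list of dict rows."""
    columns = rows[0].keys()
    return {c: sum(r[c] is None for r in rows) / len(rows) for c in columns}


def toy_flag(profile, threshold=0.15):
    """Toy flagger: columns whose null rate exceeds the threshold."""
    return sorted(c for c, rate in profile.items() if rate > threshold)


def scripted_policy(observations):
    """Stand-in for the LLM: choose the next action from what is known so far."""
    if "profile" not in observations:
        return {"action": "profile"}
    if "issues" not in observations:
        return {"action": "flag"}
    return {
        "action": "finish",
        "answer": f"Columns needing attention: {observations['issues']}",
    }


def run_loop(rows, policy, max_steps=5):
    """Alternate between deciding (policy) and acting (tools) until finished."""
    observations = {}
    for _ in range(max_steps):
        decision = policy(observations)
        if decision["action"] == "finish":
            return decision["answer"]
        if decision["action"] == "profile":
            observations["profile"] = toy_profile(rows)
        elif decision["action"] == "flag":
            observations["issues"] = toy_flag(observations["profile"])
        else:
            raise ValueError(f"Unknown action: {decision['action']}")
    raise RuntimeError("Step budget exhausted before the policy finished")


rows = [
    {"revenue": 10.0, "region": None},
    {"revenue": 12.5, "region": "EU"},
    {"revenue": None, "region": None},
    {"revenue": 8.0, "region": "US"},
]
answer = run_loop(rows, scripted_policy)
print(answer)
```

The `max_steps` budget is the important production detail: a model-driven policy can loop, so the controller needs a hard stop.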
// Code Pattern
# eda_agent.py
# Prerequisites: pip install openai pandas scipy
# Run: python eda_agent.py

import json
from dataclasses import dataclass

import pandas as pd
from openai import OpenAI
from scipy import stats

client = OpenAI()  # Uses OPENAI_API_KEY env var
@dataclass
class ColumnIssue:
    column: str
    issue_type: str  # null_rate | skewness | dtype | high_correlation
    severity: str    # low | medium | high
    detail: str
def profile_dataset(df: pd.DataFrame) -> dict:
    """
    Generate per-column statistics.
    In production, swap this for ydata-profiling for richer output.
    """
    profile = {}
    for col in df.columns:
        col_stats = {
            "dtype": str(df[col].dtype),
            "null_rate": df[col].isnull().mean(),
            "n_unique": df[col].nunique(),
        }
        if pd.api.types.is_numeric_dtype(df[col]):
            col_stats["skewness"] = float(df[col].skew())
            col_stats["mean"] = float(df[col].mean())
            col_stats["std"] = float(df[col].std())
        elif df[col].dtype == "object":
            # Check whether a string column is really numeric data
            non_null = df[col].dropna()
            numeric_coerced = pd.to_numeric(non_null, errors="coerce")
            col_stats["looks_numeric"] = bool(
                len(non_null) > 0 and numeric_coerced.notna().mean() > 0.9
            )
        profile[col] = col_stats
    return profile
def flag_issues(profile: dict) -> list[ColumnIssue]:
    """
    Flag data quality issues from a column profile.
    Severity tiers: high = needs immediate attention, medium = worth reviewing.
    """
    issues = []
    for col, stats_dict in profile.items():
        null_rate = stats_dict.get("null_rate", 0.0)
        if null_rate > 0.15:
            issues.append(ColumnIssue(col, "null_rate", "high", f"{null_rate:.0%} of values are missing"))
        elif null_rate > 0.05:
            issues.append(ColumnIssue(col, "null_rate", "medium", f"{null_rate:.0%} of values are missing"))

        skewness = abs(stats_dict.get("skewness", 0.0))
        if skewness > 5.0:
            issues.append(ColumnIssue(col, "skewness", "high", f"Extreme skew={skewness:.1f} -- consider log transform"))
        elif skewness > 2.0:
            issues.append(ColumnIssue(col, "skewness", "medium", f"Moderate skew={skewness:.1f}"))

        # Object columns with all-numeric values are likely miscoded
        if stats_dict["dtype"] == "object" and stats_dict.get("looks_numeric", False):
            issues.append(ColumnIssue(col, "dtype", "medium", "Numeric values stored as strings"))

    return issues
def run_eda_agent(df: pd.DataFrame, dataset_description: str) -> str:
    """
    Run the EDA workflow: profile the dataset, flag issues, then ask the
    LLM to turn both outputs into a structured report of its findings.
    """
    profile = profile_dataset(df)
    issues = flag_issues(profile)

    # Format issues for the agent
    issues_text = "\n".join(
        f"- [{i.severity.upper()}] {i.column}: {i.issue_type} -- {i.detail}"
        for i in issues
    ) or "No issues detected."

    prompt = f"""You are a senior data scientist reviewing a dataset for a data science project.

Dataset: {dataset_description}

Column profile (summary stats): {json.dumps(profile, indent=2)}

Detected issues: {issues_text}

Write a structured EDA report with these sections:
- DATASET OVERVIEW -- shape, dtypes, overall quality assessment (1-2 sentences)
- HIGH PRIORITY ISSUES -- items requiring action before modeling
- MEDIUM PRIORITY ISSUES -- items worth monitoring
- RECOMMENDED NEXT STEPS -- ordered list of 3-5 specific actions

Be direct. Prioritize actionability over completeness."""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,  # Low temperature for consistent structured output
    )
    return response.choices[0].message.content
# ── Run it ────────────────────────────────────────────────────────────────────
if __name__ == "__main__":
    # Example: retail transaction data
    import numpy as np

    np.random.seed(42)
    n = 5000
    df = pd.DataFrame({
        "revenue": np.random.exponential(scale=200, size=n),  # right-skewed
        "customer_age": np.random.normal(40, 12, n),
        "created_at": pd.date_range("2024-01-01", periods=n, freq="h").astype(str),
        "region_code": np.random.choice(["US", "EU", "APAC", None], size=n, p=[0.5, 0.3, 0.1, 0.1]),
        "session_count": np.where(np.random.rand(n)
How to run:
export OPENAI_API_KEY=your_key
python eda_agent.py
Real scenario
Retail transaction data, 5,000 rows, 8 columns. The agent flags revenue as high-priority (extreme right skew at 7.3), session_count as high-priority (22% null rate), and created_at as medium-priority (date stored as string). It recommends a log transform for revenue, a null indicator feature for session_count, and parsing created_at to extract hour-of-day and day-of-week features. All of this surfaces in under 30 seconds. A human reviews the report and acts on the recommendations, with no time spent running the diagnostics manually.
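The three recommendations in this scenario translate into a few lines of pandas. The sketch below is one possible way a reviewer might act on them, not output produced by the agent. The function name, the output column names, and the three-row sample are hypothetical, and `np.log1p` is an assumed choice of log transform so that zero revenue stays defined.

```python
import numpy as np
import pandas as pd


def apply_eda_recommendations(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the scenario's three recommendations to a copy of the data."""
    out = df.copy()
    # 1. Log transform for the right-skewed revenue column (log1p handles zeros)
    out["revenue_log"] = np.log1p(out["revenue"])
    # 2. Null indicator so a model can learn from missingness itself
    out["session_count_missing"] = out["session_count"].isna().astype(int)
    # 3. Parse the string timestamps and extract calendar features
    timestamps = pd.to_datetime(out["created_at"])
    out["created_hour"] = timestamps.dt.hour
    out["created_dayofweek"] = timestamps.dt.dayofweek  # Monday = 0
    return out


sample = pd.DataFrame({
    "revenue": [0.0, 99.0, 1500.0],
    "session_count": [3.0, np.nan, 7.0],
    "created_at": [
        "2024-01-01 00:00:00",
        "2024-01-01 13:00:00",
        "2024-01-02 05:00:00",
    ],
})
features = apply_eda_recommendations(sample)
print(features[["revenue_log", "session_count_missing", "created_hour", "created_dayofweek"]])
```

Working on a copy keeps the raw columns intact, which makes it easy to compare model performance with and without each derived feature.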
Workflow 2: Agentic Feature Engineering and Selection
What it replaces: Manually brainstorming interaction features, writing the transformation code, evaluating each candidate with a baseline model, pruning the ones that do not contribute, documenting what survived and why.
What the agent does instead: Proposes candidate features based on the data profile and domain context, generates the transformation code, evaluates each candidate against a fast baseline, and prunes features below a configurable importance threshold, with a written rationale for each decision.
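Because the agent generates transformation code, it is worth checking each proposed formula before evaluating it. The sketch below shows one possible pre-check using Python's `ast` module: it accepts only arithmetic over known column names and numeric constants. The helper `is_safe_formula` and the `ALLOWED_NODES` whitelist are hypothetical, the whitelist is deliberately narrower than what pandas can evaluate, and a check like this complements a sandbox without replacing one.

```python
import ast

# Node types permitted in a formula: names, numbers, and basic arithmetic only.
ALLOWED_NODES = (
    ast.Expression, ast.BinOp, ast.UnaryOp, ast.Name, ast.Load, ast.Constant,
    ast.Add, ast.Sub, ast.Mult, ast.Div, ast.USub,
)


def is_safe_formula(formula: str, allowed_columns: set) -> bool:
    """Return True if the formula is plain arithmetic over allowed columns."""
    try:
        tree = ast.parse(formula, mode="eval")
    except (SyntaxError, ValueError):
        return False
    for node in ast.walk(tree):
        # Rejects calls, attribute access, subscripts, lambdas, and so on
        if not isinstance(node, ALLOWED_NODES):
            return False
        # Only numeric literals are accepted as constants
        if isinstance(node, ast.Constant) and (
            isinstance(node.value, bool) or not isinstance(node.value, (int, float))
        ):
            return False
        # Every identifier must be a known column
        if isinstance(node, ast.Name) and node.id not in allowed_columns:
            return False
    return True


columns = {"revenue", "session_count", "customer_age"}
checks = {
    "revenue / (session_count + 1)": is_safe_formula("revenue / (session_count + 1)", columns),
    "__import__('os').system('ls')": is_safe_formula("__import__('os').system('ls')", columns),
    "revenue.sum()": is_safe_formula("revenue.sum()", columns),
    "unknown_col * 2": is_safe_formula("unknown_col * 2", columns),
}
print(checks)
```

Formulas that fail the check can be logged and skipped, in the same way as candidates whose evaluation raises an exception.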
// Architecture
Two phases, one agent. The generation phase uses the LLM to propose candidate features from a structured description of the dataset and the prediction task. The selection phase evaluates each candidate by training a LightGBM classifier with 5-fold cross-validation (CV) and computing feature importance using SHapley Additive exPlanations (SHAP). Features below the threshold are pruned. The code pattern below uses a lighter stand-in for this step: a single fit and LightGBM's built-in feature importances, normalized to sum to 1. The agent reasons about the importance scores before pruning, so it can catch cases where a feature looks weak globally but carries signal for a specific segment.
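The segment caveat can be made concrete with a small rule-based sketch. It is a deterministic stand-in for the agent's reasoning, not the article's method: a feature below the global threshold is kept if it clears the threshold within at least one segment. The function names, feature names, and importance values are all hypothetical, and the per-segment importances are assumed to come from separate fits on each segment.

```python
def normalize(raw):
    """Scale raw importances to sum to 1, guarding against an all-zero input."""
    total = sum(raw.values())
    if total == 0:
        return {name: 0.0 for name in raw}
    return {name: value / total for name, value in raw.items()}


def prune_with_segments(global_raw, segment_raw, threshold=0.01):
    """Prune weak features unless some segment shows them above threshold."""
    global_scores = normalize(global_raw)
    segment_scores = {seg: normalize(raw) for seg, raw in segment_raw.items()}

    kept, pruned, rescued = [], [], {}
    for name, score in global_scores.items():
        if score >= threshold:
            kept.append(name)
            continue
        # Weak globally: look for any segment where the feature matters
        strong_in = [
            seg for seg, scores in segment_scores.items()
            if scores.get(name, 0.0) >= threshold
        ]
        if strong_in:
            kept.append(name)
            rescued[name] = strong_in
        else:
            pruned.append(name)
    return kept, pruned, rescued


# Hypothetical raw importances (e.g. split counts) for illustration only
global_raw = {"revenue_per_session": 120, "age_squared": 79, "is_weekend": 1, "noise_ratio": 0}
segment_raw = {
    "APAC": {"revenue_per_session": 50, "age_squared": 40, "is_weekend": 10, "noise_ratio": 0},
}
kept, pruned, rescued = prune_with_segments(global_raw, segment_raw)
print(kept, pruned, rescued)
```

Returning the `rescued` mapping alongside the kept and pruned lists gives the written rationale something specific to cite for each borderline feature.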
// Code Pattern
# feature_agent.py
# Prerequisites: pip install openai lightgbm shap scikit-learn pandas numpy
# Run: python feature_agent.py

import json

import lightgbm as lgb
import numpy as np
import pandas as pd
from openai import OpenAI
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder

client = OpenAI()
def generate_feature_candidates(
    column_descriptions: dict[str, str],
    target: str,
    task_type: str = "classification",
    n_candidates: int = 10,
) -> list[dict]:
    """
    Ask the LLM to propose candidate features given column descriptions
    and the prediction task.
    Returns a list of dicts with 'name', 'formula', and 'rationale'.
    """
    prompt = f"""You are a senior ML engineer performing feature engineering for a {task_type} task.

Target variable: {target}

Available columns: {json.dumps(column_descriptions, indent=2)}

Propose {n_candidates} candidate engineered features that are likely to improve model performance. For each feature, provide:
- name: a snake_case feature name
- formula: how to compute it from the available columns (pandas expression)
- rationale: one sentence on why this feature might help

Return a JSON object with a single key "features" containing an array of objects, each with keys: name, formula, rationale. Return ONLY valid JSON -- no explanation outside the JSON."""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0.4,
    )
    result = json.loads(response.choices[0].message.content)
    # Accept either key in case the model deviates from the requested schema
    return result.get("features", result.get("candidates", []))
def evaluate_and_prune(
    df: pd.DataFrame,
    candidate_features: list[dict],
    target_col: str,
    importance_threshold: float = 0.01,
) -> tuple[list[str], list[str], dict[str, float]]:
    """
    Add candidate features to the dataframe, train a fast LightGBM baseline,
    extract feature importances, and prune below threshold.

    Returns (kept_features, pruned_features, importance_scores)
    """
    feature_df = df.copy()
    added = []

    for candidate in candidate_features:
        try:
            # Evaluate the formula string -- in production, use a safe eval sandbox
            feature_df[candidate["name"]] = feature_df.eval(candidate["formula"])
            added.append(candidate["name"])
        except Exception as e:
            # Formula failed -- skip this candidate
            print(f"  Skipped '{candidate['name']}': {e}")

    if not added:
        return [], [], {}

    X = feature_df[added].fillna(0)
    y = df[target_col]

    model = lgb.LGBMClassifier(n_estimators=100, random_state=42, verbose=-1)
    model.fit(X, y)

    # Normalize importances so they sum to 1 and can be compared to the threshold
    importance_scores = dict(zip(added, model.feature_importances_ / model.feature_importances_.sum()))
kept = [f for f in added if importance_scores.get(f, 0) >= importance_threshold] pruned = [f for f in added if importance_scores.get(f, 0) str: """Ask the agent to explain its selection decisions in plain language.""" prompt = f"""You are reviewing feature selection results for an ML pipeline.
Features KEPT (above importance threshold): {json.d
[truncated for AI cost control]