Spark- and Snowflake-Informed Analytics Pipeline
A big-data analytics project on Karnataka agriculture data using Python preprocessing, feature engineering, statistical exploration, and architecture comparisons with Spark and Snowflake.
Story
The problem
Yield outcomes varied by season, rainfall, irrigation, and soil type, making manual interpretation difficult and error-prone.
The approach
Used Pandas and NumPy for cleaning, scaling, encoding, and feature creation, then profiled relationships through statistical and visual analysis.
The outcome
The project revealed crop and region patterns, highlighted data quality risks, and outlined how the same workflow could scale to larger volumes with Spark or Snowflake.
Overview
This project analyzes a multi-year Karnataka agricultural dataset (3,151 records) spanning crops, climate conditions, irrigation methods, yields, prices, and seasons. The workflow includes rigorous data cleaning with Pandas and NumPy, transformation and encoding, feature engineering, statistical profiling, and visual analytics including correlation heatmaps. It also includes an engineering-focused comparison of Apache Spark and Snowflake for scalable cloud analytics, highlighting trade-offs in performance, cost, and complexity.
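The cleaning, encoding, and scaling steps described above might look roughly like the following sketch. The column names (`crop`, `season`, `rainfall_mm`, `yield_t`) and the toy rows are hypothetical and not taken from the project dataset. Median imputation and z-score scaling are assumed choices, since the write-up does not state which methods were used.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the real dataset's schema is not shown in this write-up.
raw = pd.DataFrame({
    "crop": ["Rice", "rice ", "Maize", "Maize", None],
    "season": ["Kharif", "Kharif", "Rabi", "Rabi", "Kharif"],
    "rainfall_mm": ["820", "790", "n/a", "410", "760"],
    "yield_t": [3.1, 2.9, 2.2, np.nan, 3.0],
})


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Normalise inconsistent category spellings ("rice " -> "Rice").
    out["crop"] = out["crop"].str.strip().str.title()
    # Schema fix: numeric values stored as text; unparseable entries become NaN.
    out["rainfall_mm"] = pd.to_numeric(out["rainfall_mm"], errors="coerce")
    # Drop rows that lack the categorical key.
    out = out.dropna(subset=["crop"])
    # Assumed strategy: fill numeric gaps with the column median.
    num_cols = ["rainfall_mm", "yield_t"]
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())
    # Assumed strategy: z-score scaling (population standard deviation).
    out[num_cols] = (out[num_cols] - out[num_cols].mean()) / out[num_cols].std(ddof=0)
    # One-hot encode the categorical columns.
    out = pd.get_dummies(out, columns=["crop", "season"], dtype=int)
    return out.reset_index(drop=True)


clean = preprocess(raw)
```

Keeping the steps in a single function makes the order explicit (type fixes before imputation, imputation before scaling) and lets the same logic be reapplied to new extracts.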
Capabilities
Handled missing values, corrected schema issues, removed inconsistencies, and prepared data for reliable downstream analysis.
Applied scaling for numeric features and one-hot encoding for categorical variables to standardize mixed-source data.
Created meaningful variables such as yield-per-area and interaction terms to improve interpretability and predictive potential.
Used descriptive stats and correlation heatmaps to reveal influential relationships between climate factors, yields, and prices.
Detected outliers, typographical issues, and edge-case values to improve trustworthiness before modeling.
Evaluated Pandas workflows against Spark and Snowflake paradigms to define a migration path for larger data volumes.
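As an illustration of the feature engineering, outlier detection, and correlation profiling listed above, here is a minimal sketch. The column names (`area_ha`, `production_t`, `rainfall_mm`, `irrigated`) and the sample values are hypothetical, and the 1.5 × IQR rule is an assumed outlier criterion; the project may have used different fields and thresholds.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows, not drawn from the project dataset.
df = pd.DataFrame({
    "area_ha": [2.0, 4.0, 5.0, 1.0, 2.0, 0.0],
    "production_t": [6.0, 11.6, 15.0, 3.2, 40.0, 1.0],
    "rainfall_mm": [800.0, 650.0, 900.0, 700.0, 750.0, 600.0],
    "irrigated": [1, 0, 1, 0, 1, 0],
})


def add_features(frame: pd.DataFrame) -> pd.DataFrame:
    out = frame.copy()
    # Treat a zero area as missing so the ratio becomes NaN instead of inf.
    area = out["area_ha"].replace(0, np.nan)
    out["yield_per_ha"] = out["production_t"] / area
    # Interaction term: rainfall counts only where the plot is irrigated.
    out["rain_x_irrigated"] = out["rainfall_mm"] * out["irrigated"]
    return out


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    # Flag values more than k * IQR outside the quartiles; NaN is never flagged.
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


features = add_features(df)
features["yield_outlier"] = iqr_outliers(features["yield_per_ha"])

# Pairwise Pearson correlations; this matrix is the input a heatmap would plot.
corr = features[["rainfall_mm", "production_t", "yield_per_ha"]].corr()
```

Flagging outliers in a separate boolean column, rather than dropping rows immediately, keeps the decision reviewable before any modeling step. The `corr` matrix can be passed to a plotting library to render the heatmap.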
Engineering
Results
Delivered a clean, analysis-ready Karnataka agriculture dataset
Built a reusable preprocessing and feature-engineering workflow in Python
Produced clear statistical insights into yield, climate, and pricing dynamics
Identified practical data quality and scaling constraints early
Established a foundation for Spark- or Snowflake-backed expansion