Machine Learning Process & Frameworks

The gist of Machine Learning Process frameworks: KDD, SEMMA, CRISP-DM

Pankaj Jainani
May 27, 2021 · 3 min read

Introduction

We rarely discuss, in any formal way, the data analysis process frameworks that help manage the lifecycle of our Data Analytics and Machine Learning projects. The gist of any data analysis or data mining framework is the same: data collection, data preprocessing, analysis to find hidden insights, and storytelling. In this post, I will briefly touch upon these processes and on the distinguishing features and similarities of several such frameworks.

Overview of Data Analysis Processes

Data Analysis Process Frameworks

The KDD Process

KDD stands for Knowledge Discovery from Data, or Knowledge Discovery in Databases. The main objective of this process framework is to extract or discover hidden patterns from large databases, such as a Data Warehouse or Data Mart, for descriptive or prescriptive analysis. It has 7 phases:

  1. Data Cleaning: Remove noise, handle missing values, and detect outliers.
  2. Data Integration: Combine data from different sources and perform ETL.
  3. Data Selection: Select the data relevant to the specific analysis task.
  4. Data Transformation: Perform data engineering and feature engineering specific to the analysis task.
  5. Data Mining: Discover useful and previously unknown patterns using Data Mining techniques.
  6. Pattern Evaluation: Evaluate the extracted patterns against the business analysis task.
  7. Knowledge Representation: Visualize the findings and present them to business people for decision making.
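
To make the seven phases concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the toy `orders` and `customers` data, the function names, the choice of "mean scaled amount per segment" as the pattern, and the 0.5 threshold. A real KDD project would swap each function for proper tooling.

```python
from statistics import mean

# Hypothetical toy data standing in for two source systems.
orders = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 2, "amount": None},  # a missing value
    {"customer_id": 3, "amount": 80.0},
    {"customer_id": 1, "amount": 200.0},
    {"customer_id": 3, "amount": 95.0},
]
customers = {
    1: {"segment": "retail"},
    2: {"segment": "retail"},
    3: {"segment": "wholesale"},
}

def clean(rows):
    """1. Data Cleaning: drop rows whose amount is missing."""
    return [r for r in rows if r["amount"] is not None]

def integrate(rows, lookup):
    """2. Data Integration: join each order with its customer attributes."""
    return [{**r, **lookup[r["customer_id"]]} for r in rows]

def select(rows):
    """3. Data Selection: keep only the fields this analysis needs."""
    return [{"segment": r["segment"], "amount": r["amount"]} for r in rows]

def transform(rows):
    """4. Data Transformation: min-max scale the amount to [0, 1]."""
    lo = min(r["amount"] for r in rows)
    hi = max(r["amount"] for r in rows)
    return [{**r, "amount_scaled": (r["amount"] - lo) / (hi - lo)} for r in rows]

def mine(rows):
    """5. Data Mining: a trivial 'pattern', the mean scaled amount per segment."""
    groups = {}
    for r in rows:
        groups.setdefault(r["segment"], []).append(r["amount_scaled"])
    return {segment: mean(values) for segment, values in groups.items()}

def evaluate(patterns, threshold=0.5):
    """6. Pattern Evaluation: keep only segments above an assumed threshold."""
    return {s: m for s, m in patterns.items() if m >= threshold}

def represent(patterns):
    """7. Knowledge Representation: format the findings for a report."""
    return [f"{s}: mean scaled amount {m:.2f}" for s, m in sorted(patterns.items())]

prepared = transform(select(integrate(clean(orders), customers)))
patterns = mine(prepared)
report = represent(evaluate(patterns))
print(report)
```

Each phase is a small function that takes the previous phase's output, which mirrors how KDD moves from raw data to presented knowledge one step at a time.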

SEMMA

SEMMA is an acronym for Sample, Explore, Modify, Model, and Assess. Accordingly, it comprises 5 phases:

  1. Sample: Identify the different databases and merge them; select a sample that is sufficient for the modeling process.
  2. Explore: Understand the data, discover the relationships among variables, visualize the data, and get initial interpretations.
  3. Modify: Deal with missing values, detect outliers, transform features, and create new features.
  4. Model: Apply different modeling techniques.
  5. Assess: Evaluate the performance of the models against baseline metrics.

SEMMA emphasizes model building and assessment.
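
The five SEMMA phases can be sketched end to end in a few lines of plain Python. This is only an illustration under my own assumptions: the synthetic `population` (roughly y = 2x + 1 with some missing x values), the sample size of 60, mean imputation, a one-feature least-squares line as the model, and a predict-the-mean baseline for the assessment.

```python
import random
from math import sqrt
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: y is roughly 2*x + 1; every 25th row has x missing.
population = []
for i in range(200):
    y = 2.0 * i + 1.0 + random.uniform(-1.0, 1.0)
    x = None if i % 25 == 0 else float(i)
    population.append((x, y))

# Sample: draw a subset that is large enough for modeling.
sample = random.sample(population, 60)

# Explore: summarize the sample before changing anything.
known_x = [x for x, _ in sample if x is not None]
summary = {
    "rows": len(sample),
    "missing_x": len(sample) - len(known_x),
    "mean_x": mean(known_x),
}

# Modify: impute missing x values with the mean of the known ones,
# then hold out part of the sample for the assessment step.
modified = [(summary["mean_x"] if x is None else x, y) for x, y in sample]
train, holdout = modified[:45], modified[45:]

# Model: fit y = slope * x + intercept by ordinary least squares.
def fit_line(rows):
    mx = mean(x for x, _ in rows)
    my = mean(y for _, y in rows)
    slope = (sum((x - mx) * (y - my) for x, y in rows)
             / sum((x - mx) ** 2 for x, _ in rows))
    return slope, my - slope * mx

slope, intercept = fit_line(train)

# Assess: compare holdout RMSE against a predict-the-mean baseline.
def rmse(pairs):
    return sqrt(mean((pred - actual) ** 2 for pred, actual in pairs))

baseline = mean(y for _, y in train)
model_rmse = rmse([(slope * x + intercept, y) for x, y in holdout])
baseline_rmse = rmse([(baseline, y) for _, y in holdout])
print(summary, model_rmse, baseline_rmse)
```

The comparison against a baseline in the last step is the point of Assess: a model is only worth keeping if it beats the simplest alternative on data it has not seen.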

CRISP-DM

CRoss-Industry Standard Process for Data Mining (CRISP-DM) is a well-defined and proven machine learning and data mining framework. It is a practical, cyclic, and flexible way to solve business problems. CRISP-DM comprises 6 stages:

  1. Business Understanding: Understand the business scenario and the requirements to comply with; draw up an initial action plan.
  2. Data Understanding: Understand the data and its collection process, perform data quality checks, and gain initial insights.
  3. Data Preparation: Prepare data for analytics by remediating missing values, detecting and handling outliers, and performing overall feature engineering. This is the most important phase.
  4. Modeling: Design and validate models using various algorithms and techniques.
  5. Evaluation: Assess and test the model’s performance on validation and test data using model evaluation measures such as MSE, RMSE, and R-squared for regression, and accuracy, precision, recall, and the F1-measure for classification.
  6. Deployment: The final phase. The model chosen in the previous step is deployed to the production environment. Implement MLOps for complete lifecycle management.
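
The measures named in the Evaluation stage are simple to compute by hand. The sketch below uses plain Python and the standard textbook definitions; the function names and the small example arrays are my own, chosen only for illustration. In practice you would more likely reach for a library such as scikit-learn.

```python
from math import sqrt

def regression_metrics(y_true, y_pred):
    """MSE, RMSE, and R-squared for a regression model."""
    n = len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return {"mse": mse, "rmse": sqrt(mse), "r2": 1.0 - ss_res / ss_tot}

def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical predictions, for illustration only.
reg = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
clf = classification_metrics([1, 0, 1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 0, 1, 1, 0])
print(reg)  # mse 0.5, rmse about 0.707, r2 0.9
print(clf)  # accuracy, precision, recall, and f1 are all 0.75
```

Regression measures summarize how far predictions fall from the true values, while the classification measures are all derived from counts of true positives, false positives, and false negatives.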

The following diagram shows the complete cycle for CRISP-DM:

Block Diagram for CRISP-DM framework for Machine Learning process

Summary

This post briefly explained several Data Analysis, Data Mining, and Machine Learning frameworks. KDD focuses mainly on data-driven pattern discovery. SEMMA focuses mainly on model-building tasks. Lastly, the core strength of CRISP-DM is its focus on business understanding and on carrying the model through deployment to production.

Pankaj Jainani

Here to learn about Artificial Intelligence & Cloud Computing | LinkedIn: http://www.linkedin.com/in/p-jainani | Twitter: @pankajjainani