This is meant to be a simple introduction to the CRISP-DM framework, which is just one of many artificial intelligence and machine learning lifecycles. There are numerous sources for deeper understanding.
The CRISP-DM framework, the CRoss-Industry Standard Process for Data Mining, was created in 1996. The process consists of six major phases:
- Business Understanding
- Data Understanding
- Data Preparation
- Modeling
- Evaluation
- Deployment
The sequence of the phases is not strict, and moving back and forth between phases is common. The outer circle of the CRISP-DM diagram symbolizes the cyclical nature of data mining itself: the process continues after the initial problem is solved, often leading to more focused business questions or refinement of the data.
CRISP-DM Phase 1 – Business Understanding
Business understanding consists of four main steps:
- Understanding business requirements
- Analyzing supporting information
- Converting to a data mining problem
- Preparing a preliminary plan
The business question needs to be both specific and measurable. It can then be turned into a machine learning question. For example, the business question “Which customers should we target for a new product?” can be turned into the machine learning question “Would this customer buy the product or not?”. It is important to weigh the cost of creating a data mining solution against the business value of answering the question. As with all business projects, proper planning is essential, covering risks, goals, dependencies, tools and techniques, and project duration.
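As a minimal sketch of that conversion, the question “Would this customer buy the product or not?” can be framed as a binary label attached to each customer record. The `Customer` fields and the sample values below are hypothetical and only illustrate the shape of the problem.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    # All fields are hypothetical; a real project would take them
    # from the sources identified during data understanding.
    customer_id: str
    age: int
    past_purchases: int
    bought_product: bool  # outcome observed in an earlier campaign


def to_labeled_example(customer):
    """Turn one customer record into (features, label) for a binary classifier."""
    features = [customer.age, customer.past_purchases]
    label = 1 if customer.bought_product else 0
    return features, label


customers = [
    Customer("c1", 34, 5, True),
    Customer("c2", 52, 0, False),
]
examples = [to_labeled_example(c) for c in customers]
print(examples)
```

Framing the target as 0 or 1 is what makes the business question measurable: a model's predictions can later be compared against these labels.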
CRISP-DM Phase 2 – Data Understanding
Data understanding has three primary steps:
- Data collection
- Data properties
- Data quality
The data collection step entails listing the data sources and the data to extract from each, analyzing the data for additional requirements, and determining whether any additional data sources are needed. The data properties step involves understanding the metadata, the size of the data set, its key features, and the relationships between data elements, including correlations. The data quality step involves determining whether any data elements are missing and, if so, whether they can be removed or substituted.
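The property and quality checks described above can be sketched with the Python standard library alone. The column names and values are hypothetical, and `None` is assumed to mark a missing element; in practice a library such as pandas would usually do this work.

```python
import math

# Hypothetical extract; None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "purchases": 5},
    {"age": 45, "income": None, "purchases": 7},
    {"age": 29, "income": 48000, "purchases": 3},
    {"age": None, "income": 61000, "purchases": 8},
    {"age": 52, "income": 75000, "purchases": 9},
]


def missing_counts(rows):
    """Data quality: count missing values per column."""
    columns = rows[0].keys()
    return {col: sum(1 for r in rows if r[col] is None) for col in columns}


def pearson(rows, col_a, col_b):
    """Data properties: Pearson correlation over rows where both values exist."""
    pairs = [
        (r[col_a], r[col_b])
        for r in rows
        if r[col_a] is not None and r[col_b] is not None
    ]
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    return cov / math.sqrt(var_a * var_b)


print("records:", len(rows))
print("missing per column:", missing_counts(rows))
print("corr(income, purchases):", round(pearson(rows, "income", "purchases"), 3))
```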
CRISP-DM Phase 3 – Data Preparation
Data preparation includes selecting the final data set and preparing the data. Selecting the final data set should take into account constraints such as total size, which columns to include and exclude, record selection, and element data types. Preparation may involve cleaning, transforming, merging, normalizing, or formatting the data. The number of records matters when the data set is small: rather than discarding records with missing elements, those elements can be filled in with default values or estimated using statistical methods. It may be useful to revisit the data understanding phase after this phase is completed.
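Two of the preparation tasks mentioned above, filling in missing elements with a statistical estimate and normalizing, might look like the following sketch. Mean imputation and min-max scaling are just one possible choice, and the `incomes` values are made up for illustration.

```python
from statistics import mean


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


incomes = [52000, None, 48000, 61000, 75000]  # hypothetical column
filled = impute_mean(incomes)
scaled = min_max_normalize(filled)
print(filled)
print(scaled)
```

Imputing keeps all five records, which matters most when the data set is small.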
CRISP-DM Phase 4 – Modeling
Modeling is arguably the most fun phase. It consists of three main steps:
- Model selection and creation
- Creating a model testing plan
- Parameter testing and tuning
The modeling phase is closely tied to the data preparation phase, because model selection influences data preparation and vice versa. Testing may reveal that the data does not fit the chosen modeling algorithm well, in which case Phase 3 must be revisited. The first step is to choose a modeling algorithm and the tools needed to implement it. For testing, the data is generally split into a training set and a test set. The split can vary depending on the data set and algorithm; a common split is 70% training and 30% test. An evaluation criterion should also be chosen at this time. The actual training can involve tuning hyperparameters to adjust the accuracy or speed of training.
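A 70/30 split can be sketched in a few lines. The helper name `train_test_split`, the fixed seed, and the toy records are assumptions made for this illustration; machine learning libraries ship their own, more capable versions.

```python
import random


def train_test_split(records, test_fraction=0.3, seed=42):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = list(records)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical records: (past_purchases, bought_product)
data = [(x, 1 if x >= 4 else 0) for x in range(20)]
train, test = train_test_split(data)
print(len(train), len(test))  # 14 6
```

Shuffling before splitting guards against any ordering in the source data leaking into the test set.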
CRISP-DM Phase 5 – Evaluation
Evaluation is where you assess how the model performs against the business goals defined in the business understanding phase and decide whether the model should be deployed. The assessment relies on the evaluation criteria chosen in the modeling phase. Keep in mind business considerations such as the cost of false positives and false negatives, execution speed, and cost. Review the steps taken throughout the process to verify that all criteria are met before making the final deployment decision.
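To illustrate why false positives and false negatives deserve separate attention, the sketch below counts each outcome and attaches a cost to it. The labels, predictions, and cost figures are invented for the example.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (0 or 1)."""
    pairs = list(zip(actual, predicted))
    return {
        "tp": sum(1 for a, p in pairs if a == 1 and p == 1),
        "fp": sum(1 for a, p in pairs if a == 0 and p == 1),
        "fn": sum(1 for a, p in pairs if a == 1 and p == 0),
        "tn": sum(1 for a, p in pairs if a == 0 and p == 0),
    }


def total_error_cost(counts, cost_fp, cost_fn):
    """Business cost of the model's mistakes under the given unit costs."""
    return counts["fp"] * cost_fp + counts["fn"] * cost_fn


# Hypothetical test-set outcomes
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

counts = confusion_counts(actual, predicted)
accuracy = (counts["tp"] + counts["tn"]) / len(actual)

# Assumed costs: a wasted offer costs 5, a missed buyer costs 50.
cost = total_error_cost(counts, cost_fp=5, cost_fn=50)
print(counts, accuracy, cost)
```

Here the model makes one mistake of each kind, yet the missed buyer accounts for most of the cost, which accuracy alone would not reveal.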
CRISP-DM Phase 6 – Deployment
Deployment has four steps:
- Planning deployment
- Maintenance and monitoring
- Final report
- Project review
First, you need to determine where the model will be deployed. On AWS, for example, the many options include Amazon EC2, Amazon Elastic Container Service, and AWS Lambda. Then you need to decide how the model will be deployed and managed; options on AWS include AWS CodeDeploy, AWS CloudFormation, AWS OpsWorks, and AWS Elastic Beanstalk. As with all well-architected systems, monitoring system health is important. Examples on AWS include Amazon CloudWatch, AWS CloudTrail, and AWS Elastic Beanstalk. A final report is delivered to stakeholders, highlighting the processes used, whether the project goals were met, and any findings, and explaining the model used and the reasoning behind choosing it. The project review assesses what went wrong and what went right, and determines whether any parts of the process can be reused.