COMM 392 - GROUP PROJECT: PREDICTIVE ANALYTICS AND SUPERVISED MACHINE LEARNING
1. Project Overview and Deliverables
The group project activities contribute 40% of each student’s final mark in the course (35% for the project analysis and report and 5% for the predictive analytics competition, which counts in the participation grade). Placing in the top three of the competition raises the final grade by 3% (first place), 2% (second place), or 1% (third place).
Your task in this project is to work with your team of 4-5 students to solve a business problem based on a given dataset for an organization. The business context, dataset, and detailed project tasks are described in the next sections.
For the five days of competition (leading up to the project submission deadline), teams have the opportunity to compete by developing increasingly accurate predictive models. Submitting a new model on each day of the competition is worth 1%. The best predictive model will be determined based on the F2 score of the champion model at the testing stage. To participate on a given day, a team must upload an improved version of their KNIME workflow to the appropriate assignment folder before midnight. The instructor reveals the results of each competition day on the following day.
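As a rough illustration of the competition metric, the sketch below computes an F-beta score from confusion-matrix counts, with beta = 2 so that recall weighs more heavily than precision. The counts are hypothetical and the function name `f_beta` is an illustrative choice; the graded F2 score must still come from your KNIME workflow.

```python
def f_beta(tp, fp, fn, beta=2.0):
    """F-beta score from confusion-matrix counts.

    tp, fp, fn: true positives, false positives, false negatives,
    with "fraud" treated as the positive class.
    beta=2 gives the F2 score, which favors recall over precision.
    """
    b2 = beta * beta
    denominator = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denominator if denominator else 0.0


# Hypothetical counts, for illustration only.
tp, fp, fn = 300, 200, 100
precision = tp / (tp + fp)   # share of flagged claims that are fraud
recall = tp / (tp + fn)      # share of fraud cases that were flagged
f2 = f_beta(tp, fp, fn)

print(f"precision={precision:.3f} recall={recall:.3f} F2={f2:.3f}")
```

Because missing a fraudulent claim (a false negative) is penalized more than a false alarm, a model can raise its F2 score by flagging more claims, at the price of more investigations.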
You will also submit a Project Report that summarizes your work on the following Project Steps, which follow the CRISP-DM process:
1. Business Understanding
2. Data Understanding and Preparation
3. Predictive Modeling/Classification
a. Classification using Logistic Regression
b. Classification using Decision Tree
c. Classification using Ensemble methods: Gradient Boosting & Random Forest
d. Compare the results of the different techniques and evaluate the performance of the selected champion model
4. Post-prediction Analysis: Score the model on a new dataset and evaluate its prediction performance
5. Conclusions and Recommendations
All the analytics steps must be conducted in KNIME.
The project steps are further explained in the sections below.
Here are the deliverables for this project:
Deliverable | Submission Mode | Deadline |
Competition Submissions: Make a daily submission over a period of 5 days. The KNIME file uploaded must contain the latest predictive workflow of the team, and the F2 score of the champion model at the testing stage. Make sure to always properly embed the data file into the workflow as practiced in class. The KNIME files with the workflow for each day should be renamed “Group Project_Day X_Team XX”. It should include the workflow you will develop to train, validate, test, and score the algorithms. | .knar/.knwf files uploaded to the appropriate assignment folders on the course website | The last 5 days leading up to the project submission deadline (one submission each day) |
Project Report: The report should include the sections outlined below. Your report should clearly describe what you did, any assumptions you made, the results you obtained, what you learned, and any gleaned insights as appropriate. Use the provided report template. 1. Executive Summary 2. Business Understanding 3. Data Understanding and Preparation 4. Predictive Modelling/Classification 5. Post-prediction Analysis 6. Conclusion and Recommendations 7. Appendix: Tables, Figures, Screenshots of KNIME output, and any other information that supports your analysis | MS Word file (DOC, DOCX) uploaded to the appropriate assignment folder on the course website | See syllabus |
Overview of ML process for group project. The KNIME file with the final workflow should be renamed “Group Project_Final_Team XX”. It should include the full workflow you will develop to train, validate, test, and score the algorithms. Make sure to always properly embed the data file into the workflow as practiced in class. | .knar/.knwf file uploaded to the assignment folder on the course website | |
KPI Spreadsheet: Submit the KPI spreadsheet used to evaluate the financial impact of your AI solution. | Excel file (XLSX) uploaded to the assignment folder on the course website | |
2. Background
You and your team form the advanced analytics department of a U.S. nationwide property and casualty insurance company. Your team specializes in analyzing the insurance fraud occurring in auto accident claims.
According to a 2012 Insurance Research Council (IRC) study, insurance fraud accounted for up to 17% of total claims payments, and up to $7.7 billion was fraudulently added to paid claims for auto insurance bodily injury payments. These figures do not cover all forms of auto insurance fraud.
Your company’s senior management is well aware of these fraud statistics, and your company can cite similar statistics from its own data: a 2017 internal study found a 20% fraud rate!
Meanwhile, senior management has also been watching the Property and Casualty (P&C) industry become disrupted by new players, and they have examined the reasons behind the company's own slowly weakening market position. In response, they are now seeking to reinvent the firm by leveraging AI in a way similar to Lemonade Insurance.
While the expected benefits include a reduction in costs by reserving fraud investigation resources (i.e., human workers manually investigating the claims) only for the most suspicious claims, management sees other benefits to modernization, such as being able to market the company to new segments of the population. For example, the company’s latest data showed that the average age of its current customers is about 44. Management wishes to target younger adults who demand fast claims payouts, service by smartphone, and lower premiums.
This modernization partly hinges on management’s objective that a real-time, AI-enabled predictive model is (i) trustable and (ii) demonstrable to be better than the current state “legacy” process as measured by an anticipated positive change in a specific key performance indicator (KPI) (more on this below). Management wants to know that the idea is viable in these two ways.
To this end, management has asked your team to build an AI proof of concept (POC) that reduces the net costs of fraud by catching fraud more efficiently when it happens. In sum, your AI solution aims at automatically identifying potentially fraudulent claims that will then be investigated by human experts.
2.1. Trustable Results
To demonstrate trust in the results of the AI POC, management has requested evidence that the data flagged as fraudulent will likely contain at least 25% more cases that are actually fraudulent than a random sample of the same data. As shown later, you can use the Lift table and graph in KNIME to implement this constraint.
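One way to read this requirement, offered here as an interpretation rather than the official definition, is as a lift of at least 1.25: the fraud rate among flagged claims divided by the fraud rate in the data as a whole. The sketch below uses the 20% base rate stated in this document; the flagged counts are hypothetical.

```python
def lift(flagged_fraud, flagged_total, base_rate):
    """Fraud rate among flagged claims relative to the overall fraud rate."""
    return (flagged_fraud / flagged_total) / base_rate


BASE_RATE = 0.20       # proportion of fraudulent claims in the dataset
REQUIRED_LIFT = 1.25   # assumed reading of "at least 25% more" fraud cases

# Hypothetical result: 1,000 claims flagged, 300 of them truly fraudulent.
value = lift(flagged_fraud=300, flagged_total=1000, base_rate=BASE_RATE)
meets_requirement = value >= REQUIRED_LIFT

print(f"lift={value:.2f} meets requirement: {meets_requirement}")
```

With these made-up numbers, 30% of flagged claims are fraudulent against a 20% base rate, so the lift is about 1.5 and the constraint is satisfied.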
2.2. Key Performance Indicator (KPI)
The ultimate KPI of interest to senior management is “Net Fraud Savings.”
This KPI is calculated as the dollar value of suspected fraudulent claims that are flagged for investigation and found to be truly fraudulent MINUS the Total Cost of fraud investigations.
The Total Cost of Investigations (TCI) = Number of Investigations (NI) * Cost per Investigation (CPI)
For simplicity, the Cost per Investigation (CPI) of claims has been reduced to a simple formula:
• $1,000 per claim investigated for the first 600 claims
• $2,500 per claim investigated for every claim after the first 600 claims investigated.
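The tiered cost can be written as a small function. This is a minimal sketch of one reading of the two bullets above (the first 600 investigations at $1,000 each, every further one at $2,500); the function and constant names are illustrative and do not come from the KPI spreadsheet.

```python
TIER_SIZE = 600         # number of claims billed at the lower rate
LOW_RATE = 1_000        # dollars per investigation for the first 600 claims
HIGH_RATE = 2_500       # dollars per investigation beyond the first 600


def total_cost_of_investigations(num_investigations):
    """Total Cost of Investigations (TCI) under the two-tier cost rule."""
    at_low_rate = min(num_investigations, TIER_SIZE)
    at_high_rate = max(num_investigations - TIER_SIZE, 0)
    return at_low_rate * LOW_RATE + at_high_rate * HIGH_RATE


for n in (0, 600, 1000):
    print(n, total_cost_of_investigations(n))
```

Investigating 1,000 claims therefore costs 600 x $1,000 plus 400 x $2,500, which is why each additional investigation past the 600th has to recover more money to be worthwhile.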
The business case for a predictive AI model is illustrated by a simple calculation used in the Figures below. An AI model that does nothing except correctly identify fraud better than chance (20%, which is the proportion of fraudulent cases in the dataset) increases the KPI. See below how that one KPI can help tell the story!
An optimal point exists for this KPI, and it is a function of the model’s predictive capability and the number of investigated claims. For example, if your AI model were able to predict fraud and it correctly identified 75% of the fraud cases (i.e., 75% success rate), you could then calculate the “money not paid out on fraudulent claims” for a given number of these suspected fraudulent claims you decide to investigate. This is because, in the primary dataset, each claim has an estimated claim payment value (in the “claim_est_payout” column). Deducting the total cost of fraud investigations from this amount will give you the KPI metric of “net fraud savings”.
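To see why an optimal point exists, consider the sketch below. It assumes the claims are already ranked by the model's predicted fraud probability (most suspicious first) and that the true label is known, as it would be on labeled data. It then tracks net fraud savings as more claims are investigated. The payouts, labels, and function names are hypothetical, and the provided spreadsheet may organize the calculation differently.

```python
def investigation_cost(n, tier_size=600, low_rate=1_000, high_rate=2_500):
    """Two-tier total cost of investigating n claims."""
    return min(n, tier_size) * low_rate + max(n - tier_size, 0) * high_rate


def best_cutoff(ranked_claims, cost_fn=investigation_cost):
    """Find the number of investigations that maximizes net fraud savings.

    ranked_claims: list of (estimated_payout, is_fraud) tuples, sorted by
    the model's fraud score in descending order.
    Returns (best_number_of_investigations, best_net_fraud_savings).
    """
    best_n, best_kpi = 0, 0.0
    recovered = 0.0  # payouts avoided on claims confirmed as fraudulent
    for n, (payout, is_fraud) in enumerate(ranked_claims, start=1):
        if is_fraud:
            recovered += payout
        kpi = recovered - cost_fn(n)
        if kpi > best_kpi:
            best_n, best_kpi = n, kpi
    return best_n, best_kpi


# Hypothetical ranked claims: (claim_est_payout, truly fraudulent?)
ranked = [(5000, True), (4000, True), (3000, False),
          (6000, True), (2000, False), (1500, False)]

best_n, best_kpi = best_cutoff(ranked)
print(f"investigate the top {best_n} claims for net savings of ${best_kpi:,.0f}")
```

In this toy ranking the KPI peaks after four investigations: beyond that point each further investigation adds cost without uncovering more fraud, so net savings decline.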
In the KPI Spreadsheet provided (auto_claims_kpi.xlsx), you can try these calculations yourself. In the first worksheet (“KPI Calculation”), simply enter a success rate in the "% successfully identified" column (cells B6:B11) in the top table (don’t touch the bottom table at this point), and then observe how the KPI of “net fraud savings” increases and decreases based on the inputted success rate as well as the number of investigated claims. In the Figures below, you can see the KPI values for three scenarios: 20% success rate (the legacy model, or current state), a model based on 75% success rate, and a model based on 45% success rate.
Note: These scenarios only serve to demonstrate the value of a hypothetical AI model to management. After you obtain your KNIME results, you will need to use the bottom table of the KPI spreadsheet with the actual results from your analysis so that you can calculate an optimal KPI. More on this in the Post-prediction Analysis section.
Figure 1: KPI Based on Current State (20%)
Figure 2: KPI - Future State Using 75% Success Rate for Illustration
Figure 3: KPI - Future State Using 45% Success Rate for Illustration
3. Project Files and Data Sets
In addition to this document (the project description), the following files and data sets are included with the project.
3.1. Project Report Files and the Primary Data Set
For the project report, the following files are included:
The Primary Data Set
Management had previously commissioned the investigations department to do a deep dive into fraudulent cases of auto claims. This resulted in a thorough internal study on fraud where 30,000 records of auto insurance claims were classified as fraudulent or non-fraudulent. Having this ground truth was essential to management's objective of reducing fraud costs. That study also confirmed management's expectation that 20% of cases were fraudulent and 80% were non-fraudulent. This is the dataset you will use for your AI POC.
Senior management has split the primary data set containing 30,000 records and provided you with two files: (1) a file named auto_claims.csv, which contains 20,000 records to be used for developing (training, validating, and testing) your AI proof-of-concept, and (2) a file named auto_claims_score.csv, which contains the remaining stratified subset of 10,000 records as a scoring data set. That scoring data set will be used later for scoring your champion AI model. The scoring data set will also be the basis for calculating the optimal KPI in the KPI spreadsheet provided to you.
You will use the auto_claims.csv file to train and compare your supervised learning AI models. First, you will need to decide how to split the data so that you end up with three subsets:
1. Training Subset: The training subset will be used to develop your AI models. Your team will create several models using different classification algorithms that you learned in the course (logistic regression, decision tree, gradient boosting, and random forest).
2. Validation Subset: The validation subset will be used to validate your AI models. Using this subset, you will obtain the relevant performance measure for each algorithm (the F2 score, as explained later), compare the models against one another, and select a champion model.
3. Test Subset: The test subset measures the performance of your champion model.
We recommend starting your iterative ML journey with a split of 60%-30%-10%, but you may choose to try out other ratios (or even use the “cross-validation” technique) to see whether your trained results improve.
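The idea behind a stratified three-way split can be sketched in plain Python. The example assumes the 60%-30%-10% ratios map to training, validation, and test in that order, and it uses a synthetic label list with the 20% fraud rate described above rather than the real auto_claims.csv file. In the project itself the split must be built in KNIME; this sketch only shows what stratification preserves.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.6, 0.3, 0.1), seed=42):
    """Split record indices into train/validation/test, keeping class ratios.

    labels: sequence of class labels, one per record.
    fractions: assumed order is (training, validation, test).
    """
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    train, valid, test = [], [], []
    for indices in indices_by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_valid = round(len(indices) * fractions[1])
        train += indices[:n_train]
        valid += indices[n_train:n_train + n_valid]
        test += indices[n_train + n_valid:]   # remainder goes to test
    return train, valid, test


# Synthetic labels: 20,000 records, 20% fraud (1) and 80% non-fraud (0).
labels = [1] * 4000 + [0] * 16000
train, valid, test = stratified_split(labels)

for name, subset in (("train", train), ("valid", valid), ("test", test)):
    fraud_share = sum(labels[i] for i in subset) / len(subset)
    print(f"{name}: {len(subset)} records, fraud share {fraud_share:.2f}")
```

Each subset keeps the same 20% fraud share as the full file, so performance measured on the validation and test subsets is not distorted by an unrepresentative class mix.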
The data file you are given contains information on more than 20 attributes that could be used to predict fraudulent auto claims. The final column in the file, “fraud”, represents the class target, which indicates whether a claim was fraudulent or not. It can be presumed to be ground truth.
Here is the data dictionary for the data file:
Table 1: Data Dictionary - Auto Insurance Claims
The second data file you receive from management, auto_claims_score.csv, represents the scoring data set that will be used for scoring the champion AI model. It has all the data EXCEPT that the class variable has been removed. Thus, it cannot be used for training or developing models. Remember that this scoring dataset should only be used later on in the project, for scoring.