 
  
                  Lab 6&7&8
Professor Julien Maitre, Ph.D. Winter 2023
This lab is graded. It is worth 100 points and counts for 20% of the final grade for this course. You must form groups of 7 or 8 people to complete this lab.
1. General Description
The general objective of lab 6 is to manipulate and process data. This is the step that precedes applying machine learning algorithms for knowledge extraction or classification. Thus, lab 6 will allow you to learn the Python programming language and its libraries for data science (e.g., NumPy, Pandas, Matplotlib, SciPy).
In addition, in this lab, you will have to produce a scientific report that analyzes/explores the data and describes the processing steps you have applied. The page limit for the scientific report is 10 pages. The names of all students in the group should appear on the first page.
2. Formalities
The deadline for submitting your work is March 16th, 2023, at 11:59 p.m. (China time). After this deadline, a penalty of 10% per day of delay will apply.
You will email me a WeTransfer link with the scientific report and code.
3. What is expected?
The scientific report should include:
• the description of your dataset
o For example:
▪ what the variables are;
▪ the meaning of each variable;
▪ the number of instances;
▪ the number of classes (if applicable);
▪ the values (e.g., min-max interval) that each variable can take.
• data checking and pre-processing
o For example:
▪ how many missing values your dataset has;
▪ what method(s) you used to manage these missing values;
▪ a summary of the number of instances per class (if applicable);
▪ the statistics (e.g., mean, variance, standard deviation) for each variable.
o In this part, you could use data visualization tools.
• a statistical study of the data and analyses/interpretations of this statistical study
o For example:
▪ statistical study for each variable with respect to each class;
▪ statistical study for each variable with respect to each other variable;
▪ hypothesis tests;
▪ correlation between two or more variables;
▪ Chi-square tests.
o In this part, do not hesitate to use data visualization tools.
• a conclusion of the study
o summarize the essential information of your data analysis/exploration. What did you learn about/thanks to the data?
• a general conclusion
o summarize what you appreciated, what you learned, and what you appreciated less in this lab.
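As an illustration only, the checks listed above could be sketched in Python as follows. The toy DataFrame, its column names (`height`, `weight`, `smoker`, `label`), and the choice of median imputation, a Welch t-test, and Pearson correlation are all assumptions made for this example; they are not requirements of the lab, and your own dataset and methods will differ.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy dataset, used only to make the sketch runnable.
# Replace it with your own data, e.g. pd.read_csv("your_file.csv").
df = pd.DataFrame({
    "height": [1.62, 1.75, np.nan, 1.80, 1.68, 1.71],
    "weight": [55.0, 72.0, 64.0, np.nan, 60.0, 70.0],
    "smoker": ["no", "yes", "no", "yes", "no", "yes"],
    "label": ["A", "B", "A", "B", "A", "B"],
})
numeric_cols = list(df.select_dtypes(include="number").columns)

# Description of the dataset
n_instances, n_variables = df.shape
n_classes = df["label"].nunique()
value_ranges = df[numeric_cols].agg(["min", "max"])

# Data checking
missing_per_variable = df.isna().sum()
instances_per_class = df["label"].value_counts()

# One possible way to manage missing values (an assumption, not the
# only option): replace them with the median of each numeric variable.
df_clean = df.copy()
df_clean[numeric_cols] = df_clean[numeric_cols].fillna(
    df_clean[numeric_cols].median()
)

# Statistics for each variable, overall and per class
summary = df_clean[numeric_cols].agg(["mean", "var", "std"])
per_class_means = df_clean.groupby("label")[numeric_cols].mean()

# Correlation between numeric variables
correlation = df_clean[numeric_cols].corr(method="pearson")

# Chi-square test of independence between two categorical variables.
# With so few rows the expected counts are too small for a valid
# conclusion; the call is shown only to illustrate the mechanics.
contingency = pd.crosstab(df_clean["smoker"], df_clean["label"])
chi2, chi2_p_value, dof, expected = stats.chi2_contingency(contingency)

# Hypothesis test: does mean height differ between classes A and B?
height_a = df_clean.loc[df_clean["label"] == "A", "height"]
height_b = df_clean.loc[df_clean["label"] == "B", "height"]
t_stat, t_p_value = stats.ttest_ind(height_a, height_b, equal_var=False)

print(missing_per_variable)
print(summary)
print(correlation)
```

Each intermediate result (for example `summary` or `correlation`) is a pandas object, so it can be printed, exported to a table for the report, or passed to Matplotlib for visualization.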
4. Details
Regarding the dataset, there is only one restriction: the number of variables should be more than 7 and fewer than 12. I also recommend that you select a dataset that has classes. If the dataset has more than 12 variables, you can remove variables to reach the maximum number. Finally, search the Web for a dataset in a field that interests you, for more "fun" (e.g., bioinformatics, marketing, commerce). Here is a sample of web links that provide access to datasets:
• https://www.data.gov/
• https://www.reddit.com/r/datasets/
• https://www.reddit.com/r/data/
• https://registry.opendata.aws/
• https://rs.io/100-interesting-data-sets-for-statistics/
• https://www.kaggle.com/datasets
• https://archive.ics.uci.edu/ml/datasets.php
• https://datasetsearch.research.google.com/
• etc.
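If the dataset you find has too many variables, removing some of them could look like the sketch below. The helper name `keep_variables`, the default bounds (8 to 11, a strict reading of "more than 7 and fewer than 12"), and the example column names are assumptions made for illustration; check with the professor how the class column should be counted.

```python
import pandas as pd


def keep_variables(df, columns_to_keep, min_vars=8, max_vars=11):
    """Return a copy of df restricted to the chosen columns.

    min_vars and max_vars are assumed bounds, not values confirmed
    by the lab statement.
    """
    unknown = [c for c in columns_to_keep if c not in df.columns]
    if unknown:
        raise KeyError(f"Columns not found in the dataset: {unknown}")
    if not min_vars <= len(columns_to_keep) <= max_vars:
        raise ValueError(
            f"Expected between {min_vars} and {max_vars} variables, "
            f"got {len(columns_to_keep)}"
        )
    return df[columns_to_keep].copy()


# Hypothetical dataset with 15 variables named v0 ... v14.
wide = pd.DataFrame({f"v{i}": [i, i + 1, i + 2] for i in range(15)})
reduced = keep_variables(wide, [f"v{i}" for i in range(10)])
print(reduced.shape)
```

Which variables to drop is a modelling decision; the report is a natural place to explain why the removed variables were judged less relevant.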
Annex A
Content of your report

Scientific report
Description of your dataset, Data checking and pre-processing, Presentation of statistics, Advanced statistical study, Analyses and interpretations
10/100, 20/100, 10/100, 10/100, 20/100, 20/100
Total: 90/100

Python scripts
Code: 5/100
Comments: 5/100
Total: 10/100