联系方式

您当前位置:首页 >> Python编程Python编程

日期:2024-09-20 11:07

Deakin University Trimester 2, 2024

School of IT Assignment 2

Unit Team: SIT742

SIT742: Modern Data Science

Extension Request Students with difffculty in meeting the deadline because of various

reasons, must apply for an assignment extension no later than 5:30pm on 20/09/2024

(Friday). Apply via ‘CloudDeakin’, the menu item ‘Extension Request’ under the

‘Assessment’ drop-down menu.

Academic Integrity All assignment will be checked for plagiarism, and any academic

misconduct will be reported to unit chair and university.

Generative AI Deakin’s Policy and advices on responsible usage of Generative

AI in your studies: https://www.deakin.edu.au/students/study-support/

study-resources/artificial-intelligence

Instructions

Assignment Questions

There are total 2 parts in the assessment task 2:

Part 1 The ffrst part will focus on the data manipulation and pyspark skills which includes the Data

Acquisition, the Data Wrangling, the EDA and Spark, the modules and library from M03,

M04.

Part 2 The second part focus on more advanced data science skills with particular scenario. This part

will require the knowledge covered in M05.

What to Submit?

There is no optional part for assignment 2. You (your group) are required to submit the following completed

ffles to the corresponding Assignment (Dropbox) in CloudDeakin:

SIT742Task2.ipynb The completed notebook with all the run-able code for all requirements (part 1

and part2).

In general, you (your group) need to complete, save the results of running, download/export the

notebook as a local ffle, and submit your notebook from Python platform such as Google Colab.

You need to clearly list the answer for each question, and the expected format from your notebook

will be like in Figure 1 (One notebook for each group).

1Figure 1: Notebook Format

SIT742Task2report.pdf You (group) are also required to write a report with your answer (code) and

running results from SIT742Task2.ipynb for all the questions (Part 1 and Part 2). You could

make screenshot on your answer (code) and running results from SIT742Task2.ipynb and paste

into the report. Please try to include the code comments, and results including plot images as well

in the report, and make sure the code format such as Indentation keeps same as the ipynb notebook.

In this report (one for each group), you will also need to provide a clear explanation on your

logic for solving each question (you could write explanation below your solution and results in the

report). In the explanation, you will need to cover below parts: 1). why you decide to choose your

solution; 2). are there any other solutions that could solve the question; 3). whether your solution is

the optimal or not? why? The length of the explanation part for each question is limited below 100

words.

In the end of your report, you (group) also need to discuss below three points:

• How you and your team member collaborate on this assignment?

• What you have learned with your team member from the second assignment.

• What is the contribution of each the team member for ffnishing the second assignment.

SIT742Task2video.avi A video demonstration between 10 and 15 minutes, and the ffle format can be

any common video format, such as ‘MKV’, ‘WMV’, ‘MOV’ etc.

For your group, one important submission is a short video in which each of You (group members)

orally present the solutions that you provide in the notebook and illustrate the running of code with

the used logic. In the video, your group need to work together to discuss below three points:

• Which question(s) you have worked on and how did you collaborate with other team members.

• What is the logic behind the your solution on the question(s)? and is there any alternative

optimized ways to resolve the question?

• What is your understanding of Code collaboration? How do you collaborate with your group

in coding? What are the common tools/platform to support the Code collaboration?

2Part I

Data Acquisition and Manipulation

There are 10 questions in this part, totalling 60 marks. Each of question is worth 5 marks. Additionally,

the quality of your explanation in both the report and video will collectively be worth 10 marks.

You are recommended to use Google Colab to ffnish all the coding in the code block cell, and provide

sufffcient coding comments, and also save the result of running as well.

The (transactionrecord.zip) data used for this part could be found in here. You will need to use

spark to read the unzipped (csv) data for starting. You could ffnd the code on reading csv data with

Spark from M04G.

Question 1.1

Using PySpark to do some of the data wrangling process, so that:

1.1.1 For the ’NA’ in CustomerNo columns, change it to ’-1’.

1.1.2 Process the text in productName column, only alphabet characters left, and save the processed

result to a new column productName_process and show the ffrst 5 rows.

Question 1.2

Find out the revenue on each transaction date. In order to achieve the above, some wrangling work is

required to be done:

1.2.1 Using pyspark to calculate the revenue (price * Quantity) and save as ffoat format in

pyspark dataframe to show the top 5 rows.

1.2.2 Transform the pyspark dataframe to pandas dataframe (named as df) and create the column

transaction_date with date format according to Date. Print your df pandas dataframe with

top 5 rows after creating the column transaction_date.

1.2.3 Plot the sum of revenue on transaction_date in a line plot and ffnd out any immediate

pattern / insight?

Question 1.3

Let’s continue to analyse on the transaction_date vs revenue.

1.3.1 Determine which workday (day of the week), generates the most sales (plotting the results in

a line chart with workday on averaged revenues).

1.3.2 Identify the name of product (column productName_process) that contributes the highest

revenue on ‘that workday’ (you need to ffnd out from 1.3.1) and the name of product (column

productName_process) that has the highest sales volume (sum of the Quantity), no need to

remove negative quantity transactions.) on ‘that workday’ (you need to ffnd out from 1.3.1).

1.3.3 Please provide two plots showing the top 5 products that contribute the highest revenues in

general and top 5 products that have the highest sales volumes in general.

Question 1.4

3Which country generates the highest revenue? Additionally, identify the month in that country that

has the highest revenue.

Question 1.5

Let’s do some analysis on the CustomerNo and their transactions. Determine the shopping frequency of

customers to identify who shops most frequently (ffnd out the highest distinct count of transactionNo

on customer level, be careful with those transactions that is not for shopping – filter those

transaction quantity <= 0). Also, ffnd out what products (column productName_process) ‘this

customer’ typically buys based on the Quantity of products purchased.

Question 1.6

As the data scientist, you would like to build a basket-level analysis on the product customer buying

(fflter the ‘df’ dataframe with df[’Quantity’]>0). In this task, you need to:

1.6.1 Group by the transactionNo and aggregate the category of product (column

product_category) into list on transactionNo level. Similarly, group and aggregate name of

product (column productName_process) into list on transactionNo level.

1.6.2 Removing duplicates on adjacent elements in the list from product_category you obtained

from 1.6.1, such as [product category 1, product category 1, product category

2, ...] will be processed as [product category 1, product category 2,....]. After this

processing, there will be no duplicates on on adjacent elements in the list. Please save your

processed dataframe as ‘df_1’ and print the top 10 rows.

Question 1.7

Continue work on the results of question 1.6, now for each of the transaction, you will have a list of

product categories. To further conduct the analysis, you need to ffnish below by using dataframe

‘df_1’:

1.7.1 Create new column prod_len to ffnd out the length of the list from product_category on

each transaction. Print the ffrst ffve rows of dataframe ‘df_1’.

1.7.2 Transform the list in product_category from [productcategory1, productcategory2...]

to ‘start > productcategory1 > productcategory2 > ... > conversion’ with new column

path. You need to add ‘start’ as the ffrst element, and ‘conversion’ as the last. Also

you need to use ‘ > ’ to connect each of the transition on products (there is a space between

the elements and the transition symbol >). The ffnal format after the transition is given in

example as below ffg. 2. Deffne the function data_processing to achieve above with three

arguments: df which is the dataframe name, maxlength with default value of 3 for ffltering the

dataframe with prod_len" <=maxlength and minlength with default value of 1 for ffltering

the dataframe with prod_len >=minlength. The function data_processing will return the

new dataframe ‘df_2’. Run your deffned function with dataframe ‘df_1’, maxlength = 5 and

minlength = 2, print the dataframe ‘df_2’ with top 10 rows.

Figure 2: Example of the transformation on 1.7.2, left column is before the transformation, right

column is after the transformation. After transformation, it is not list anymore

4Hint: you might consider to use str.replace() syntax from default python 3.

Question 1.8

Continue to work on the results of question 1.7, the dataframe ‘df_2’, we would like to build the

transition matrix together, but before we actually conduct the programming, we will need to

ffnish few questions for exploration:

1.8.1 Check on your transaction level basket with results from question 1.7, could you please ffnd

out respectively how many transactions ended with pattern ‘... > 0ca > conversion’ / ‘...

> 1ca > conversion’ / ‘... > 2ca > conversion’ / ‘... > 3ca > conversion’ / ‘... >

4ca > conversion’ (1 result for each pattern, total 5 results are expected).

1.8.2 Check on your transaction level basket with results from question 1.7, could you please ffnd

out respectively how many times the transactions contains ‘0ca > 0ca’ / ‘0ca > 1ca’ / ‘0ca >

2ca’ / ‘0ca > 3ca’ / ‘0ca > 4ca’ / ‘0ca > conversion’ in the whole data (1 result for each

pattern, total 6 results are expected and each transaction could contain those patterns multiple

times, such as ‘start > 0ca > 1ca > 0ca > 1ca > conversion’ will count ‘two’ times with

pattern ‘0ca > 1ca’, if there is not any, then return 0, you need to sum the counts from each

transaction to return the ffnal value).

1.8.3 Check on your transaction level basket with results from task question 1.7, could you please

ffnd out how many times the transactions contains ‘...> 0ca > ...’ in the whole data (1

result is expected and each transaction could contain the pattern multiple times, such as ‘start

> 0ca > 1ca > 0ca > 1ca > conversion’ will count ‘two’ times, you need to sum the counts

from each transaction to return the ffnal value).

1.8.4 Use the 6 results from 1.8.2 to divide the result from 1.8.3 and then sum all of them and return

the value.

Hint: you might consider to use endswith and count functions from default python 3.

Question 1.9

Let’s now look at the question 1.6 again, you have the list of product and list of product category

for each transaction. We will use the transactionNo and productName_process to conduct the

Association rule learning.

1.9.1 Work on the dataframe df from question 1.2 (fflter out the transaction with negative quantity

value and also only keep those top 100 products by ranking the sum of quantity) and

build the transaction level product dataframe (each row represents transactionNo and

productName_process become the columns, the value in the column is the Quantity).

Hint: you might consider to use pivot function in pandas.

1.9.2 Run the apriori algorithm to identify items with minimum support of 1.5% (only looking at

baskets with 4 or more items).

Hint: you might consider to use mlxtend.frequent_patterns to run apriori rules.

1.9.3 Run the apriori algorithm to ffnd the items with support >= 1.0% and lift > 10.

1.9.4 Please explore three more examples with different support / conffdence / lift measurements

(you could leverage your rule mining with one of the three measurements or all of them) to

ffnd out any of the interesting patterns from the Association rule learning. Save your code and

results in a clean and tidy format and writing down your insights.

Question 1.10

5After we finished the Association rule learning, it is a time for us to consider to do customer analysis

based on their shopping behaviours.

1.10.1 Work on the dataframe df from question 1.2 and build the customer product dataframe

(each row represents single customerNo and productName_process become as the columns,

the value in the columns is the aggregated Quantity value from all transactions and the result

is a N by M matrix where N is the number of distinct customerNo and M is the number of

distinct productName_process. Please filter out the transaction with negative quantity value

and also only keep those top 100 product by ranking the sum of quantity).

1.10.2 Use the customer-product dataframe, let’s calculate the Pairwise Euclidean distance on

customer level (you will need to use the product Quantity information on each customer to

calculate the Euclidean distance for all other customers and the result is a N by N matrix where

N is the number of distinct customerNo).

1.10.3 Use the customer Pairwise Euclidean distance to find out the top 3 most similar customer to

CustomerNo == 13069 and CustomerNo == 17490.

1.10.4 For the customer CustomerNo == 13069, you could see there are some products that this

customer has never shopped before, could you please give some suggestions on how to recommend

these product to this customer? please write down your suggestions and provide a coding logic

(steps on how to achieve, not actual code).

Part II

Sales Prediction

There are 3 questions in this part, totaling 40 marks. Each question is worth 10 marks. Additionally, the

quality of your explanation in both the report and video will collectively be worth 10 marks.

You are required to use Google Colab to finish all the coding in the code block cell, and provide sufficient

coding comments, and also save the result of running as well.

In this part, we will focus only on two columns revenue with transaction_date to form the revenue

time series based on transaction_date. We will use the dataframe df from question 1.2 (without any

filtering on transactions) to finish below sub-tasks:

Question 2.1

You are required to explore the revenue time series. There are some days not available in the

revenue time series such as 2019-01-01. Please add those days into the revenue time series with

default revenue value with the mean value of the revenue in the whole data (without any filtering on

transactions). After that, decompose the revenue time series with addictive mode and analyses on

the results to find if there is any seasonality pattern (you could leverage the M05A material from lab

session with default setting in seasonal_decompose function).

Question 2.2

We will try to use time series model ARIMA for forecasting the future. you need to find the best model

with different parameters on ARIMA model. The parameter range for p,d,q are all from [0, 1, 2]. In

total, you need to find out the best model with lowest Mean Absolute Error from 27 choices based

on the time from ”Jan-01-2019” to ”Nov-01-2019” (you might need to split the time series to train

and test with grid search according to the M05B material).

Question 2.3

6There are many deep learning time series forecasting methods, could you please explore those methods

and write down the necessary data wrangling and modeling steps (steps on how to achieve, not actual

code). Also please give the reference of the deep learning time series forecasting models you are using.

7


版权所有:留学生编程辅导网 2020 All Rights Reserved 联系方式:QQ:821613408 微信:horysk8 电子信箱:[email protected]
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。 站长地图

python代写
微信客服:horysk8