
Date: 2024-12-12 09:29

CP1407 Assignment 2


Note: This is an individual assignment. While it is expected that students will discuss their ideas with one another, students need to be aware of their responsibilities in ensuring that they do not deliberately or inadvertently plagiarise the work of others.

Assignment 2 – Practice on various Machine Learning algorithms

1. [Data Pre-Processing, Clustering] [10 marks]

Why is attribute scaling of data important? The following table contains sample records showing the number of employees and the total revenue generated by particular stores of a supermarket. Use the table as an example to discuss the necessity of normalisation in any proximity measurement for clustering purposes.

Supermarket ID    Employee Count    Revenue
001               38                $5,500,000
002               29                $5,000,000
003               24                $5,000,000
004               10                $890,000
005               40                $2,500,000
006               31                $3,200,000
007               14                $678,000
008               35                $5,200,000
009               30                $5,300,000
010               22                $5,500,000
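To make the scale problem concrete, the following is a minimal sketch rather than a model answer. It assumes Euclidean distance as the proximity measure and min-max scaling to [0, 1] as the normalisation; neither choice is prescribed by the question, and the names `STORES`, `euclidean`, and `min_max_scale` are illustrative only.

```python
import math

# (employee count, revenue in dollars) per store, copied from the table above.
STORES = {
    "001": (38, 5_500_000),
    "002": (29, 5_000_000),
    "003": (24, 5_000_000),
    "004": (10, 890_000),
    "005": (40, 2_500_000),
    "006": (31, 3_200_000),
    "007": (14, 678_000),
    "008": (35, 5_200_000),
    "009": (30, 5_300_000),
    "010": (22, 5_500_000),
}


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples (assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def min_max_scale(stores):
    """Rescale every attribute to the range [0, 1] (assumed normalisation)."""
    columns = list(zip(*stores.values()))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return {
        store_id: tuple(
            (value - low) / (high - low)
            for value, low, high in zip(row, lows, highs)
        )
        for store_id, row in stores.items()
    }


SCALED = min_max_scale(STORES)

# Stores 001 and 010 differ by 16 employees but have equal revenue;
# stores 002 and 009 differ by 1 employee and $300,000 in revenue.
for a, b in [("001", "010"), ("002", "009")]:
    raw = euclidean(STORES[a], STORES[b])
    scaled = euclidean(SCALED[a], SCALED[b])
    print(f"{a} vs {b}: raw = {raw:,.1f}, scaled = {scaled:.3f}")
```

On the raw values the revenue attribute decides almost every distance, because its numeric range is several orders of magnitude larger than that of the employee count. After scaling, both attributes contribute on a comparable footing, and the relative order of the two pairs changes.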

2. [Classification – Decision Tree algorithm] [20 marks]

Use the soybean dataset (diabetes.arff) to perform decision tree induction in Weka using three different decision tree induction algorithms: J48, REPTree, and RandomTree. Investigate different options, particularly the differences between pruned and unpruned trees. In discussing your results, consider the following questions.

a) What are the effects of pruning on the results for the soybean dataset?

b) Are there differences in the performances of the three decision tree algorithms?

c) What impacts do other parameters of the algorithms have on the results?

3. [Classification – Naïve Bayes algorithm] [30 marks]

Suppose we have data on a few individuals randomly examined for a basic health check. The following table gives the data on these individuals’ health-related attributes.

Body Weight    Body Height    Blood Pressure    Blood Sugar Level    Habit        Class
Heavy          Tall           High              3                    Smoker       P
Heavy          Short          High              1                    Nonsmoker    P
Normal         Tall           Normal            3                    Nonsmoker    N
Heavy          Tall           Normal            2                    Smoker       N
Low            Medium         Normal            2                    Nonsmoker    N
Low            Tall           Normal            1                    Nonsmoker    P
Normal         Medium         High              3                    Smoker       P
Low            Short          High              2                    Smoker       P
Heavy          Tall           High              2                    Nonsmoker    P
Low            Medium         Normal            3                    Smoker       P
Heavy          Medium         Normal            3                    Smoker       N

Use the data together with the Naïve Bayes classifier to classify the following new instance. Create and use the classifier by hand, not with Weka, and show all your working.

Body Weight    Body Height    Blood Pressure    Blood Sugar Level    Habit        Class
Low            Tall           High              2                    Smoker       ?
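The question asks for hand working, so the following is only a sketch for checking that working, not a substitute for it. It assumes every attribute, including Blood Sugar Level, is treated as categorical, and it scores each class as the class prior multiplied by the conditional relative frequencies. The `alpha` parameter is an optional Laplace-style correction that the question does not mention; it is included only to show what happens when a conditional count is zero. The names `ROWS`, `NEW_INSTANCE`, and `naive_bayes_scores` are illustrative.

```python
from collections import Counter
from fractions import Fraction

# Training rows copied from the table above:
# (weight, height, pressure, sugar level, habit, class)
ROWS = [
    ("Heavy", "Tall", "High", "3", "Smoker", "P"),
    ("Heavy", "Short", "High", "1", "Nonsmoker", "P"),
    ("Normal", "Tall", "Normal", "3", "Nonsmoker", "N"),
    ("Heavy", "Tall", "Normal", "2", "Smoker", "N"),
    ("Low", "Medium", "Normal", "2", "Nonsmoker", "N"),
    ("Low", "Tall", "Normal", "1", "Nonsmoker", "P"),
    ("Normal", "Medium", "High", "3", "Smoker", "P"),
    ("Low", "Short", "High", "2", "Smoker", "P"),
    ("Heavy", "Tall", "High", "2", "Nonsmoker", "P"),
    ("Low", "Medium", "Normal", "3", "Smoker", "P"),
    ("Heavy", "Medium", "Normal", "3", "Smoker", "N"),
]

NEW_INSTANCE = ("Low", "Tall", "High", "2", "Smoker")


def naive_bayes_scores(rows, instance, alpha=0):
    """Return the unnormalised score P(class) * prod P(value | class) per class.

    alpha = 0 uses plain relative frequencies; alpha = 1 applies a
    Laplace-style correction to the conditional probabilities only
    (an assumption, not something the question asks for).
    """
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for label, n_label in class_counts.items():
        score = Fraction(n_label, len(rows))  # class prior
        for i, value in enumerate(instance):
            matches = sum(
                1 for row in rows if row[-1] == label and row[i] == value
            )
            n_values = len({row[i] for row in rows})  # distinct attribute values
            score *= Fraction(matches + alpha, n_label + alpha * n_values)
        scores[label] = score
    return scores


for label, score in naive_bayes_scores(ROWS, NEW_INSTANCE).items():
    print(f"class {label}: {score} = {float(score):.5f}")
```

Exact fractions are used so that each factor can be compared directly with the counts obtained by hand. The predicted class is the one with the larger score; dividing each score by their sum gives normalised posterior probabilities.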

4. [Association Rules Mining] [20 marks]

The following table shows the film-watching histories of several viewers of an on-demand service.

User Id Items

001 Airplane!, Downfall, Evita, Idiocracy, Jurassic Park

002 Casablanca, Downfall, Evita, Flubber, Jurassic Park

003 Airplane!, Downfall, Half Baked, Jurassic Park

004 Airplane!, Downfall

005 Casablanca, Downfall, Flubber, Jurassic Park, Zoolander

006 Casablanca, Downfall, Half Baked, Idiocracy, Zoolander

007 Evita, Idiocracy, Jurassic Park

008 Downfall, Jurassic Park, Zoolander

009 Casablanca, Downfall, Evita, Half Baked, Jurassic Park, Zoolander

a) Follow the steps outlined in Practical 07 and conduct a mining task for Boolean association rules using the Apriori algorithm in Weka.

b) Set different parameters and observe the association rules discovered.

c) Weka provides association evaluation parameters other than support and confidence. Note the evaluation results by those evaluation parameters of example rules.
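As a way to sanity-check individual rules reported by Weka, the measures can be recomputed directly from the table. The sketch below uses the standard textbook definitions of support, confidence, and lift; it does not run Apriori and does not call Weka, and the rule Downfall => Jurassic Park is picked purely as an illustration. The names `TRANSACTIONS`, `support`, `confidence`, and `lift` are illustrative.

```python
from fractions import Fraction

# Viewing histories copied from the table above, one set per user.
TRANSACTIONS = {
    "001": {"Airplane!", "Downfall", "Evita", "Idiocracy", "Jurassic Park"},
    "002": {"Casablanca", "Downfall", "Evita", "Flubber", "Jurassic Park"},
    "003": {"Airplane!", "Downfall", "Half Baked", "Jurassic Park"},
    "004": {"Airplane!", "Downfall"},
    "005": {"Casablanca", "Downfall", "Flubber", "Jurassic Park", "Zoolander"},
    "006": {"Casablanca", "Downfall", "Half Baked", "Idiocracy", "Zoolander"},
    "007": {"Evita", "Idiocracy", "Jurassic Park"},
    "008": {"Downfall", "Jurassic Park", "Zoolander"},
    "009": {"Casablanca", "Downfall", "Evita", "Half Baked", "Jurassic Park",
            "Zoolander"},
}


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for items in TRANSACTIONS.values() if itemset <= items)
    return Fraction(hits, len(TRANSACTIONS))


def confidence(antecedent, consequent):
    """support(A and B) / support(A) for the rule A => B."""
    return support(antecedent | consequent) / support(antecedent)


def lift(antecedent, consequent):
    """confidence(A => B) / support(B); 1 means A and B look independent."""
    return confidence(antecedent, consequent) / support(consequent)


A, B = {"Downfall"}, {"Jurassic Park"}
print("support   :", support(A | B))
print("confidence:", confidence(A, B))
print("lift      :", lift(A, B))
```

A lift below 1 for a rule with high confidence is a useful reminder that confidence alone can be misleading when the consequent is already very frequent.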


5. [Clustering] [20 marks]

Consider the following 2-dimensional point data set presented in (x,y) coordinates:

P1(1,1), P2(1,3), P3(4,3), P4(5,4), P5(9,4), P6(9,6).

Apply the hierarchical clustering method by hand (using the agglomerative algorithm) to obtain the final two clusters. Use the Manhattan distance function to measure the distance between points and use the single-linkage scheme to do the clustering. Show all your working.
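The procedure described above can be sketched in a few lines for checking hand working; it is not a replacement for showing that working. The sketch is a naive implementation that recomputes all cluster distances at each step, and it breaks ties by taking the first closest pair it encounters, which is an assumption since the question does not specify a tie rule. The names `POINTS`, `manhattan`, `single_link`, and `agglomerate` are illustrative.

```python
from itertools import combinations

# Points copied from the question.
POINTS = {
    "P1": (1, 1), "P2": (1, 3), "P3": (4, 3),
    "P4": (5, 4), "P5": (9, 4), "P6": (9, 6),
}


def manhattan(a, b):
    """Manhattan (city-block) distance between two 2-D points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def single_link(c1, c2, points):
    """Single linkage: the smallest distance between any cross-cluster pair."""
    return min(manhattan(points[p], points[q]) for p in c1 for q in c2)


def agglomerate(points, target_clusters):
    """Merge the two closest clusters until target_clusters remain.

    Returns the final clusters and the merge history as
    (cluster_a, cluster_b, distance) tuples. Ties go to the first
    closest pair found (an assumed tie rule).
    """
    clusters = [frozenset([name]) for name in points]
    merges = []
    while len(clusters) > target_clusters:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: single_link(pair[0], pair[1], points),
        )
        merges.append((sorted(a), sorted(b), single_link(a, b, points)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters, merges


final_clusters, history = agglomerate(POINTS, target_clusters=2)
for left, right, distance in history:
    print(f"merge {left} + {right} at distance {distance}")
print("final:", [sorted(c) for c in final_clusters])
```

The printed merge history corresponds to the levels of the dendrogram, so each line can be compared against one step of the hand-built distance matrix.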

Rubric

For each question:

Exemplary (90-100%): Answer demonstrates excellent knowledge of machine learning and data science, is well-written, and very well-justified.

Good (70-80%): Exhibits aspects of exemplary (above) and satisfactory (below).

Satisfactory (50-60%): Answer demonstrates sound knowledge of machine learning and data science and provides justification.

Limited (30-40%): Exhibits aspects of satisfactory (above) and very limited (below).

Very Limited (0-20%): Answer demonstrates flawed knowledge of machine learning and/or provides incoherent justification. Or answer is absent or negligible.

