
Date: 2024-04-29 08:54

Assignment 6


5/1/2024

10 Points Possible


Unlimited Attempts Allowed

4/22/2024


Details

For this assignment, you will submit a README.md with your answers to the questions below, along with the code you used to produce your answers (including all boto3 scripts necessary to reproduce your cloud infrastructure, where relevant). You should commit your Assignment 6 file(s) to your private “a6” GitHub repository (click here (https://classroom.github.com/a/jXPdPm3s) to accept the GitHub Classroom invitation to access this repository) and submit a link to your repository here on Canvas (clicking the “Submit Assignment” button to make your submission). You must work alone on this assignment. Before submitting your assignment, please take a look at the tips that one of the previous TAs for the course (Jinfei Zhu) compiled for writing a grader-friendly README file and organizing your assignment GitHub repository (https://github.com/lsc4ss-a21/assignment-submission?template) if you have not already done so.

1. (6 Points Total) This first prompt builds on the survey submission pipeline you have been working on in Assignments 4 and 5. As a final step in your survey submission pipeline, you will write a Python function that can be invoked on a survey participant’s mobile device when they complete a survey to send their survey submission into an SQS queue, which should then trigger the Lambda function you wrote in Assignments 4 and 5.

Note that each survey submission is initially saved as a JSON file (on the mobile device) when a participant completes a survey via the mobile app (see the example files provided with the assignment). For the purposes of this prompt, you do not need to worry about the implementation of the mobile app or the creation of these JSON files. Your job is to write a Python function that will send a string representation of this JSON data (an individual survey) into an AWS SQS queue (your function will then be incorporated into the mobile app by another researcher). The SQS queue should then trigger your AWS Lambda function from Assignment 5, which will take this survey submission data and perform the necessary processing and storage operations in the cloud. You should accomplish all of these tasks programmatically (using boto3) to ensure reproducibility of your architecture. Specifically, you should complete the following tasks:

a. (1 Point) Write a Python function send_survey (which you can assume will be installed with the mobile app and will automatically be invoked after a survey is saved as a JSON file on the device) that has the following signature:

def send_survey(survey_path, sqs_url):
    '''
    Input:  survey_path (str): path to JSON survey data
                (e.g. './survey.json')
            sqs_url (str): URL for SQS queue
    Output: StatusCode (int): indicating whether the survey
                was successfully sent into the SQS queue (200) or not (400)
    '''

In the function body, you should use boto3 to send the data from a survey (a JSON file on the mobile device, converted into a string representation) into an AWS SQS queue.
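One way the function body could look is sketched below. This is a minimal sketch and not a reference solution: the optional `sqs_client` argument is a hypothetical addition (it is not part of the required signature) that allows a stub client to be passed in for offline testing, and mapping every failure to 400 is an assumption about how errors should be reported.

```python
import json


def send_survey(survey_path, sqs_url, sqs_client=None):
    '''
    Sketch of task (a). `sqs_client` is a hypothetical optional argument
    (not in the required signature) used to inject a stub for testing.
    Returns 200 if SQS accepted the message, 400 otherwise.
    '''
    if sqs_client is None:
        # Imported lazily so the sketch can be exercised without boto3
        import boto3
        sqs_client = boto3.client('sqs')

    try:
        # Load the survey and re-serialize it as a JSON string
        with open(survey_path, 'r') as f:
            body = json.dumps(json.load(f))

        response = sqs_client.send_message(QueueUrl=sqs_url, MessageBody=body)
        status = response['ResponseMetadata']['HTTPStatusCode']
    except Exception:
        # Assumption: any failure (missing file, bad JSON, AWS error) maps to 400
        return 400

    return 200 if status == 200 else 400
```

Parsing and re-serializing the file, rather than sending its raw contents, means malformed JSON is rejected on the device before it reaches the queue.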

b. (2 Points) Create an SQS queue and configure it to act as a trigger for your Lambda function from Assignment 5 (which will process your data and write it to storage).
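A minimal sketch of the queue and trigger setup follows, assuming the Lambda function from Assignment 5 already exists. The helper name `create_queue_trigger`, the queue and function names in the commented usage, and the batch size of 10 are illustrative assumptions. The boto3 calls used are `create_queue`, `get_queue_attributes`, and `create_event_source_mapping`.

```python
def create_queue_trigger(sqs_client, lambda_client, queue_name, function_name):
    '''
    Create an SQS queue and register it as an event source (trigger)
    for an existing Lambda function. Returns (queue_url, queue_arn).
    '''
    # Create the queue (or fetch it if one with identical attributes exists)
    queue_url = sqs_client.create_queue(QueueName=queue_name)['QueueUrl']

    # The event source mapping needs the queue ARN, not its URL
    attrs = sqs_client.get_queue_attributes(
        QueueUrl=queue_url, AttributeNames=['QueueArn'])
    queue_arn = attrs['Attributes']['QueueArn']

    # Have Lambda poll the queue and invoke the function with message batches
    lambda_client.create_event_source_mapping(
        EventSourceArn=queue_arn,
        FunctionName=function_name,
        Enabled=True,
        BatchSize=10,  # assumed value
    )
    return queue_url, queue_arn


# Usage (hypothetical names; requires AWS credentials):
# import boto3
# url, arn = create_queue_trigger(boto3.client('sqs'), boto3.client('lambda'),
#                                 'a6-survey-queue', 'a5-survey-lambda')
```

Two details are worth checking when adapting this. The function's execution role needs permission to receive and delete messages on the queue, and the queue's visibility timeout should be at least as long as the function's timeout. A Lambda function triggered by SQS also receives each message as `event['Records'][i]['body']`, so the Assignment 5 handler may need to read the survey from there.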

Note that if you test your full survey submission pipeline using the example JSON files provided above (in a loop, using time.sleep(10) in between survey submissions, as in Assignment 5), you should see the following keys in your S3 bucket:

['0001092821120000.json', '0001092921120000.json', '0001093021120300.json',
'0002092821120000.json', '0003092821120001.json', '0004092821120002.json',
'0005092821122000.json']

You should also see the following records if you query your DynamoDB table:

{'q1': Decimal('1'), 'q2': Decimal('1'), 'user_id': '0001',
'q3': Decimal('2'), 'q4': Decimal('2'), 'q5': Decimal('2'),
'num_submission': Decimal('3'),
'freetext': "I lost my car keys this afternoon at lunch, so I'm more stressed than normal"}

{'q1': Decimal('4'), 'q2': Decimal('1'), 'user_id': '0002',
'q3': Decimal('1'), 'q4': Decimal('1'), 'q5': Decimal('3'),
'num_submission': Decimal('1'),
'freetext': "I'm having a great day!"}

{'q1': Decimal('1'), 'q2': Decimal('3'), 'user_id': '0003',
'q3': Decimal('3'), 'q4': Decimal('1'), 'q5': Decimal('4'),
'num_submission': Decimal('1'),
'freetext': 'It was a beautiful, sunny day today.'}

{'q1': Decimal('1'), 'q2': Decimal('1'), 'user_id': '0004',
'q3': Decimal('1'), 'q4': Decimal('1'), 'q5': Decimal('1'),
'num_submission': Decimal('1'),
'freetext': 'I had a very bad day today...'}

{'q1': Decimal('3'), 'q2': Decimal('3'), 'user_id': '0005',
'q3': Decimal('3'), 'q4': Decimal('3'), 'q5': Decimal('3'),
'num_submission': Decimal('1'),
'freetext': "I'm feeling okay, but not spectacular"}
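The end-to-end check described above (submitting the example surveys in a loop with a pause between them, then listing the bucket keys) could be sketched as follows. The helper names `submit_surveys` and `list_bucket_keys` are hypothetical, and the sending function is passed in as an argument so the loop can be tried without AWS access. The only boto3 call used is `list_objects_v2` on an S3 client.

```python
import time


def submit_surveys(survey_paths, sqs_url, send_fn, delay=10):
    '''
    Call send_fn(path, sqs_url) for each survey, sleeping `delay` seconds
    between consecutive submissions. Returns the list of status codes.
    '''
    statuses = []
    for i, path in enumerate(survey_paths):
        if i > 0:
            time.sleep(delay)  # pause between submissions, as in Assignment 5
        statuses.append(send_fn(path, sqs_url))
    return statuses


def list_bucket_keys(s3_client, bucket):
    '''Return the sorted object keys in `bucket` (first page of results only).'''
    response = s3_client.list_objects_v2(Bucket=bucket)
    # 'Contents' is absent from the response when the bucket is empty
    return sorted(obj['Key'] for obj in response.get('Contents', []))
```

With a real `send_survey` passed as `send_fn`, every returned status should be 200, and the sorted keys can be compared directly against the list of seven keys shown above.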

c. (3 Points) Your PI, who is overseeing this project, is worried that if all of the participants in the study (potentially thousands) submit surveys at the same time in the day, this might cause the system to crash and your lab might lose data (this happened to your PI when they ran a similar digital survey via on-premises servers in the early 2000s). How would you reassure your PI that your architecture is scalable and will be able to handle such spikes in demand? Your response should be at least 200 words and discuss the scalability of each of the cloud services you used in your pipeline in detail.

2. (4 Points) For this prompt, we ask you to declare whether you will complete a Final Project or a Final Exam as your capstone assignment for the course. You are welcome to meet with course staff and discuss your options and ideas with us before making your election and submitting your answer to this prompt.

If you wish to complete a Final Project, you should additionally write a ~250-word proposal in your README for this assignment, detailing your plan for the project (see expectations and sample projects on the Final Exam/Final Project Assignment page). You should explain why your project idea helps to solve a social science research problem using large-scale computing methods and outline a schedule for completing the project by the deadline. If you are working in a group, you should also write down the names of your group members and describe how you are going to split up the work amongst yourselves.

If you wish to take a Final Exam, you should instead write one question for possible inclusion in the Final Exam and submit it in your README for this assignment. The better the question you submit, the higher the likelihood you will see the question (or a closely related one) on the exam. We will additionally post the best questions to the Final Exam page on Canvas so that you can use them as study material for the exam. Note: YOU WILL NEED TO PROVIDE THE SOLUTION FOR YOUR QUESTIONS. A good question is one that goes beyond memorization and asks the student to apply a concept in a way that is similar to what we do in our in-class activities and conceptual questions in assignments (we will not ask implementation questions that involve writing code from scratch). Specifically, we plan to include questions of the following types (for additional examples, you can take a look at past questions used on the exam on the Final Exam/Final Project Assignment page):

Applied Conceptual Questions, such as:

You are conducting a large digital experiment, in which you have designed an online music sharing application and recruited participants to use the platform over the course of a month. During the experiment, you will manipulate features of the website in order to test your research hypotheses. In order to run the experiment, you need to be able to collect/record thousands of data points per second; for instance: tracking the songs that participants download, the treatments that they were exposed to (by you the researcher), as well as all of the things that participants click on. When the experiment is over, you would like to perform a statistical analysis on a subset of the data to identify experimental interventions that caused participants to change their clicking/downloading behavior. Ultimately, when your work is published, you would also like to have your (de-identified) data publicly accessible, so that future scholars can replicate your statistical analysis.

What databases and/or storage solutions would you use to solve these problems (storing data while you run the experiment, as well as afterwards) in the AWS cloud ecosystem? Why? How about if you scaled the experiment up by several orders of magnitude to include millions of participants? Would this change your data storage/management solution?

Code Interpretation Questions, such as:

Below is a serial version of a Monte Carlo simulation to estimate π that is written in Python. Identify parts of this code that could be accelerated using a GPU, as well as those that would best be run on a CPU – attempting to accelerate the estimation of π as much as possible. For each section of code, you should explain why your answers are the best hardware options for optimal performance (e.g. thinking in terms of some of the key bottlenecks and hardware limitations for CPUs vs. GPUs).

# NumPy Pi Estimation with Monte Carlo Simulation
import numpy as np
import time

t0 = time.time()
n_runs = 10 ** 8

# Simulate Random Coordinates in the Square [-1, 1] x [-1, 1]
ran = np.random.uniform(low=-1, high=1, size=(2, n_runs))

# Identify Random Coordinates that fall within Unit Circle and count them
result = ran[0] ** 2 + ran[1] ** 2 <= 1
n_in_circle = np.sum(result)

# Estimate Pi
print("Pi Estimate: ", 4 * n_in_circle / n_runs)
print("Time Elapsed: ", time.time() - t0)
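As background for reasoning about memory bottlenecks in code like this, note that the array of random coordinates alone holds 2 × 10^8 float64 values, roughly 1.6 GB, before any temporaries are created. The sketch below is an illustrative CPU-side variant, not part of the exam question and not its answer: it computes the same estimate in fixed-size chunks so that peak memory stays bounded. The function name `estimate_pi_chunked` and the default chunk size are assumptions.

```python
import numpy as np


def estimate_pi_chunked(n_runs, chunk_size=10**6, seed=None):
    '''Monte Carlo estimate of pi, drawing at most chunk_size points at a time.'''
    rng = np.random.default_rng(seed)
    n_in_circle = 0
    remaining = n_runs
    while remaining > 0:
        m = min(chunk_size, remaining)
        # Draw m points uniformly in the square [-1, 1] x [-1, 1]
        xy = rng.uniform(low=-1, high=1, size=(2, m))
        # Count the points that fall inside the unit circle
        n_in_circle += int(np.count_nonzero(xy[0] ** 2 + xy[1] ** 2 <= 1))
        remaining -= m
    return 4 * n_in_circle / n_runs


# Bytes held by the single (2, 10**8) float64 array in the serial version
bytes_needed = 2 * 10**8 * np.dtype(np.float64).itemsize
```

Chunking trades one large allocation for many small ones. The same consideration applies when deciding how much data to move to a GPU at once.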

Troubleshooting Questions, such as:

You are training a linear regression model to predict the price of an AirBnB listing given a variety of text features derived from the listing’s description on AirBnB (note that AirBnB publishes this data in CSV format for listings across the world and the data is updated on a monthly basis).

You have written a machine learning workflow in PySpark that does the following on an AWS EMR cluster composed of 3 m5.xlarge EC2 instances (1 resource manager and 2 core instances), with 10 GB in EBS storage available on each instance:

1. Cleans the description text data (e.g. drops stop words and punctuation) from all AirBnB listings around the world from the past month (prior to the current month).

2. Engineers features based on the clean description data (such as categorical and binary features indicating whether the description contains certain types of words).

3. Uses MLlib’s CrossValidator to identify the optimal hyperparameters for your linear regression model given a grid of possible values used to tune the model (i.e. a grid search).

4. Trains the regression model using the optimal hyperparameters from (3) and makes predictions on the prices of AirBnB listings from the current month.

Having successfully run this workflow on one previous month of data, you want to increase your training data size to several years’ worth of data. As you increase the amount of data entering into your pipeline, though, you begin to observe unexpected (i.e. nonlinear) diminishing performance (in terms of speed) and beyond a certain data size, your job will not complete at all – it keeps running indefinitely.

Describe at least two possible root causes of this slowdown (considering both hardware and software). Why would these be concerns? Is it possible to remedy them? What would be your solutions?

Some hints for writing good questions:

You shouldn’t make the question needlessly complicated or overly verbose.

Try to be clear about what you’re asking and what you’re looking for.

Try to cover multiple topics from the course – i.e. a question that touches on the memory hierarchy, GPU vs. CPU parallelism, and Spark’s execution model would be better than one that is narrowly relevant to invoking a Lambda function.

Anything we’ve covered in the class is fair game (and you’re welcome to continue submitting relevant questions through Wednesday of Week 9 related to material we cover after this assignment – you just will not receive additional credit).

Clear social science tie-ins are preferred.

