
    Date: 2019-06-08 09:09

    You use a subset (see below) of the dataset in the file “HousePrices.txt”, which consists of 11
    columns, with measurements for each of 585 Belgian municipalities. The response variable is the
    median price of a regular house in the municipality (in thousands of euros).
    Region (x1): The administrative region: Flanders, Walloon, Brussels-capital.
    Province (x2): The name of the province (there are officially 10 provinces in Belgium), plus the
    Brussels-capital region, which is treated here as a separate province. Hence this variable has 11
    categories.
    Municipality: The name of the municipality (this identifies the different observations and is
    provided just for the curious ones).
    PriceHouse (y): Median price of a regular house in the municipality (in thousands of euros).
    Shops (x3): The number of officially registered shops in the municipality exceeding a certain
    number of square meters.
    Bankruptcies (x4): Number of bankruptcies in the municipality in one year; this includes all types
    of enterprises (from one-person companies to big firms).
    MeanIncome (x5): The average of the taxable incomes of all tax forms of the municipality (in
    thousands of euros).
    TaxForms (x6): The number of tax declarations for the municipality that were submitted to the tax
    office.
    HotelRestaurant (x7): The number of hotels and restaurants (added together) in the municipality.
    Industries (x8): Number of industrial firms in the municipality.
    HealthSocial (x9): The number of health care and social service facilities in the municipality.
    Each of you will study a subset of these data, and use the following code to get your sub-dataset.
    Note that the provided code serves only as a hint; you will need to make changes to it.
    Constructing your own dataset:
    code = 753031
    fulldata = read.csv("HousePrices.txt", sep = " ", header = TRUE)
    # Sum of the decimal digits of x
    digitsum = function(x) sum(floor(x / 10^(0:(nchar(x) - 1))) %% 10)
    set.seed(code)
    mysum = digitsum(code)
    if ((mysum %% 2) == 0) { # digit sum is even
      rownumbers = sample(1:327, 150, replace = FALSE)
    } else { # digit sum is odd
      rownumbers = sample(309:585, 150, replace = FALSE)
    }
    mydata = fulldata[rownumbers, ]
    This way you have taken a sample of 150 municipalities, either from the Flanders region +
    Brussels capital area, or from the Walloon region + Brussels capital area. Now, based on your own
    sub-dataset, answer the following questions one by one.
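
As a quick check of the construction above, the following sketch evaluates the digit sum for the example code 753031 and shows which row range is sampled. It needs no data file. The helper name `sample_range` is hypothetical and introduced only for illustration; the row ranges are those given in the hint code.

```r
# Digit sum exactly as defined in the assignment's hint code.
digitsum <- function(x) sum(floor(x / 10^(0:(nchar(x) - 1))) %% 10)

# Hypothetical helper: row range used for a given student code.
sample_range <- function(code) {
  if (digitsum(code) %% 2 == 0) 1:327 else 309:585
}

code <- 753031
mysum <- digitsum(code)            # 7 + 5 + 3 + 0 + 3 + 1 = 19, which is odd
rng <- range(sample_range(code))   # so the rows come from 309 to 585

# Draw the 150 row numbers without replacement, seeded by the code.
set.seed(code)
rownumbers <- sample(sample_range(code), 150, replace = FALSE)
```

With a different code the parity of the digit sum, and hence the sampled region, may change.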
    Questions to be answered:
    1) Q1: Use semiparametric flexible modelling to construct a model for the median house price.
    Use AIC as a method to select a final model and report on which (type of) models were
    included in the search. Only for the components of the selected model that are modeled in a
    nonlinear way, provide graphs. The models in this question should not treat covariates as
    random effects. Give the model that you have selected in correct notation. It is alright to use
    a general notation (e.g. f(x2)) for a smooth function, but you have to state which (spline)
    functions you have used, and how the smoothing parameter was selected. If you want to
    use the function gam from library(mgcv), the provided AIC value is compatible with
    parametric AIC values when using the default option for setting the smoothing parameter.
    Notes for Q1:
    a) Explore all variables of “mydata”, and clearly state the distribution and the link function
    used in the models.
    b) Clearly state how many knots you choose (and why), and explain in detail how you choose
    the smoothing parameter.
    c) Treat all variables as fixed.
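
A minimal sketch of the kind of comparison Q1 asks for, assuming library(mgcv) is installed. It uses simulated stand-in data rather than the HousePrices data, and the column names `income`, `shops` and `price` are illustrative only. The spline basis (`bs = "cr"`, a cubic regression spline) and the basis dimension `k = 10` are arbitrary choices for the example, not recommendations; the smoothing parameter is left to gam's default method (`"GCV.Cp"`).

```r
library(mgcv)

set.seed(1)
n <- 150
# Simulated stand-in data (NOT the HousePrices data); names are illustrative.
sim <- data.frame(
  income = runif(n, 10, 40),
  shops  = rpois(n, 20)
)
sim$price <- 100 + 40 * sin(sim$income / 5) + 0.5 * sim$shops + rnorm(n, sd = 10)

# Parametric candidate: Gaussian distribution, identity link, linear effects.
fit_lin <- gam(price ~ income + shops,
               family = gaussian(link = "identity"), data = sim)

# Semiparametric candidate: cubic regression spline for income,
# smoothing parameter chosen by gam's default method.
fit_smooth <- gam(price ~ s(income, bs = "cr", k = 10) + shops,
                  family = gaussian(link = "identity"), data = sim)

# Compare the candidates by AIC and pick the smallest value.
aic_table <- AIC(fit_lin, fit_smooth)
best <- rownames(aic_table)[which.min(aic_table$AIC)]

# To graph only the nonlinear component of the selected model:
# plot(fit_smooth, select = 1, shade = TRUE)
```

In an actual answer the search would cover more candidate models (other covariates, other families and links), each fitted the same way and added to the AIC table.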
    2) For this question you use the response and only the covariates x6 (number of tax forms) and
    x9 (number of health care and social service facilities). State the null hypothesis of a
    parametric additive model for the median house price with quadratic effects for both
    covariates. Test this hypothesis using an order selection test against a nonparametric
    alternative hypothesis, report the hypotheses, the construction of the test statistic, its value,
    as well as the corresponding p-value and draw the correct conclusion.
    Notes for Q2:
    a) Test whether an additive model with quadratic effects in the two covariates (x6 and x9)
    fits the data.
    b) Clearly state how to do a proper test, including all steps of hypothesis testing and how
    they lead to the conclusion.
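
The ingredients of such a test can be sketched as follows, on simulated stand-in data (not the real x6 and x9). The null model is the additive quadratic model; the alternatives are a nested sequence of higher-order additive polynomial models, which is only one possible choice of series expansion. The sketch stops at the log-likelihood gains: how these are normalised into the order selection statistic, and which limiting distribution gives the p-value, should be taken from the course material. The function name `fit_poly` is hypothetical.

```r
set.seed(2)
n <- 150
# Simulated stand-in covariates (NOT the real x6 and x9).
sim <- data.frame(x6 = runif(n), x9 = runif(n))
sim$y <- 200 + 30 * sim$x6 - 20 * sim$x6^2 + 10 * sim$x9 + rnorm(n, sd = 5)

# Additive polynomial model of a given degree in both covariates
# (orthogonal polynomials span the same space as raw powers).
fit_poly <- function(degree) {
  lm(y ~ poly(x6, degree) + poly(x9, degree), data = sim)
}

fit_null <- fit_poly(2)             # H0: quadratic effects in x6 and x9
fits_alt <- lapply(3:6, fit_poly)   # nested alternatives of growing order

# Twice the log-likelihood gain of each alternative over the null model.
lr_gain <- sapply(fits_alt, function(f) {
  2 * (as.numeric(logLik(f)) - as.numeric(logLik(fit_null)))
})
names(lr_gain) <- paste0("degree_", 3:6)
```

Because the models are nested, the gains are non-negative and do not decrease with the degree.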
    3) In this question a parametric (generalized) linear mixed effect model should be constructed.
    (i) Make a graphical presentation that supports why you suggest a certain mixed effect structure
    using x2 Province as the grouping variable. Construct the plot illustrating whether there is an
    effect of Province when regressing y on x6 the number of tax forms. For the plot you may
    ignore all other covariates.
    (ii) Construct a parametric (generalized) linear mixed effect model using your suggestion from (i).
    Leave out variable x1 for this part; other covariates may be included in the model in a
    parametric way. Your model should include x2 and x6. The inclusion of other covariates in
    your model may be based on your answer to question 1; no fixed-effect model selection
    should be done for this question. Provide the model using correct notation, and give a
    summary of the output. Briefly discuss whether the output supports your suggestion from (i).
    Note: library(hglm) contains both hglm and hglm2, which may be used for fitting; glmmPQL
    is also a possibility. If one of these functions gives problems for your dataset, try one of
    the other ones.
    Notes for Q3:
    a) Among Q1-Q5, only Q3 takes random effects into consideration.
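
A minimal sketch of both parts of Q3 on simulated stand-in data (not the HousePrices data), assuming library(MASS) is available for glmmPQL. Part (i) is represented by separate least-squares fits per province, whose intercepts and slopes could be drawn as lines in a scatter plot of price against tax forms. Part (ii) fits a random-intercept model; whether a random intercept, a random slope, or both is appropriate for a real sub-dataset is exactly what the plot in (i) should motivate. The Gaussian family and identity link are assumptions made for this example.

```r
library(MASS)   # provides glmmPQL, which relies on nlme

set.seed(3)
# Simulated stand-in data (NOT the HousePrices data): 6 groups of 25 rows.
prov <- factor(rep(paste0("P", 1:6), each = 25))
prov_shift <- rnorm(6, sd = 30)   # province-specific intercept shifts
sim <- data.frame(Province = prov, TaxForms = runif(150, 1, 20))
sim$PriceHouse <- 150 + 2 * sim$TaxForms + prov_shift[as.integer(prov)] +
  rnorm(150, sd = 10)

# (i) One least-squares line per province; clearly different intercepts or
# slopes across columns suggest a random intercept or random slope.
per_prov <- sapply(split(sim, sim$Province),
                   function(d) coef(lm(PriceHouse ~ TaxForms, data = d)))

# (ii) Random-intercept model with Province as the grouping variable.
fit_mm <- glmmPQL(PriceHouse ~ TaxForms, random = ~ 1 | Province,
                  family = gaussian(link = "identity"), data = sim,
                  verbose = FALSE)
fixed_table <- summary(fit_mm)$tTable   # fixed-effect estimates and tests
```

The columns of `per_prov` give the intercept and slope for each province, and `fixed_table` holds the fixed-effect part of the model summary.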
    4) In this question you start from a large parametric model (no random effects, no interactions)
    and perform a focused search over all sub-models of the large model, for each of two
    focuses:
    (i) the median price of a regular house for one municipality of your choice from your
    dataset where there is a low (though not the lowest) number of industrial firms,
    (ii) the median price of a regular house for one municipality of your choice from your
    dataset where there is a large (though not the largest) number of hotels and restaurants.
    Write the selected model for each focus using correct notation and provide the
    estimated values of the focuses for both cases. Briefly discuss.
    Notes for Q4:
    a) Look through the 150 rows of your dataset and pick one municipality with a low number of
    industrial firms and another one with a large number of hotels and restaurants. Then search
    for the best model for estimating the house price in each of those two municipalities.
    b) Use correct notation and clearly state the distribution, the link function and the coefficients.
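
The search space of Q4 can be sketched as follows, on simulated stand-in data with three illustrative covariates. The sketch only enumerates all sub-models of a large linear model and computes the focus estimate (the estimated mean price for one chosen row) under each of them. It does not compute the focused information criterion itself; that score, which ranks the sub-models for the given focus, has to be added following the course material. The Gaussian linear model is an assumption made for the example.

```r
set.seed(4)
n <- 150
# Simulated stand-in data (NOT the HousePrices data); names are illustrative.
sim <- data.frame(x3 = rnorm(n), x4 = rnorm(n), x5 = rnorm(n))
sim$y <- 200 + 15 * sim$x5 + 5 * sim$x3 + rnorm(n, sd = 10)

covars <- c("x3", "x4", "x5")
focus_row <- sim[1, ]   # the municipality whose mean price is the focus

# All 2^3 = 8 subsets of the covariates, including the intercept-only model.
subsets <- expand.grid(rep(list(c(FALSE, TRUE)), length(covars)))
names(subsets) <- covars

# Fit each sub-model and estimate the focus under it.
focus_est <- apply(subsets, 1, function(keep) {
  rhs <- if (any(keep)) paste(covars[keep], collapse = " + ") else "1"
  fit <- lm(as.formula(paste("y ~", rhs)), data = sim)
  predict(fit, newdata = focus_row)
})
result <- cbind(subsets, focus_est = as.numeric(focus_est))
```

Repeating this with a second `focus_row` gives the candidate estimates for the second focus; in general the two focuses need not select the same sub-model.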
    5) In this question you use the same large parametric model (no random effects, no interactions)
    as you started with in question 4.
    (i) Construct a table containing the vector of estimated coefficients of the regression model
    using four methods:
    (a) maximum likelihood estimation in the large model
    (b) Ridge regression
    (c) Lasso estimation
    (d) An elastic net estimator, different from the ridge and lasso one.
    For (b), (c) and (d) you use the software’s default value for the penalty parameter λ.
    (ii) Using the four estimation methods from (i), give in a table the predictions for the median
    price of a regular house for the same two municipalities as in question 4. Briefly discuss.
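
To illustrate what the tables in Q5 look like, the sketch below compares maximum likelihood (least squares) with a ridge estimator written out in closed form, on simulated stand-in data. This is not how glmnet computes its estimates, and its penalty is scaled differently, so the value `lambda = 50` is an arbitrary illustration and not a software default. Lasso and elastic net have no closed form and are omitted here. The function name `ridge_coef` is hypothetical.

```r
set.seed(5)
n <- 150
# Simulated stand-in data (NOT the HousePrices data); names are illustrative.
X <- matrix(rnorm(n * 4), n, 4, dimnames = list(NULL, paste0("x", 3:6)))
y <- as.numeric(200 + X %*% c(10, 0, -5, 3) + rnorm(n, sd = 8))

# Standardise the covariates and centre the response, so that the
# intercept (the mean of y) is left unpenalised.
Xs <- scale(X)
yc <- y - mean(y)

# Closed-form penalised least squares: solve (X'X + lambda I) b = X'y.
ridge_coef <- function(lambda) {
  drop(solve(crossprod(Xs) + lambda * diag(ncol(Xs)), crossprod(Xs, yc)))
}

lambda <- 50   # arbitrary illustrative value, not a software default
coef_table <- cbind(ML = ridge_coef(0), Ridge = ridge_coef(lambda))

# Predictions for two rows (here rows 1 and 2) under both estimators.
pred_table <- mean(y) + Xs[1:2, ] %*% coef_table
```

The ridge column of `coef_table` is shrunk towards zero relative to the ML column, which is the pattern to discuss when comparing the four estimators.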
    Note: If you would like to use a function other than glmnet for penalized estimation, here is an
    alternative with a few more options. Since the syntax is quite a bit different, you might want to
    adjust the lines below to your setting, if you want to use this.
    library(h2o)
    h2o.init()
    mydat2=as.h2o(mydata)
    mydat2$Region <- as.factor(mydat2$Region)
    mydat2$Province <- as.factor(mydat2$Province)
    y="PriceHouse"
    X = c("Province", "Shops") # add here the variables that you wish to put in X.
    alpha0 <- h2o.glm(family= "something", link="something", x= X, y=y, alpha=0,
    lambda_search=TRUE, training_frame=mydat2, nfolds=0)
    # indicate the same rows as in question 4:
    Xeval = as.h2o(as.data.frame(mydat2[c(1,2),]))
    h2o.predict(alpha0, newdata=Xeval)
