class: center, middle, inverse, title-slide # STA 225 2.0 Design and Analysis of Experiments ## Lecture 7 - 9 ### Dr Thiyanga S. Talagala ### 2021-11-16/ 2021-11-19/ 2021-11-26 --- <style> .center2 { margin: 0; position: absolute; top: 50%; left: 50%; -ms-transform: translate(-50%, -50%); transform: translate(-50%, -50%); } </style> <style type="text/css"> .remark-slide-content { font-size: 27px; } </style> # Completely randomized design ## Advantages of CRD - Easy to design - Analysis of data is simple and straight forward - Even if some values are missing the analysis can be done ## Disadvantages of CRD - The design requires homogeneous set of experimental units - If the experimental units are not homogeneous, error component will be large and this will make the treatment comparison less efficient. --- # Randomized Complete Block Design (RCBD) - Suppose we want to compare `\(r\)` treatments - To run a CRD, we need to find `\(n_1 + n_2 + .... + n_r\)` **homogeneous** experimental units to apply treatment. - Often it is difficult to find enough experimental homomogeneous units. - However, it may be possible to find **several blocks of experimental units, with enough homogeneous experimental units** to apply `\(r\)` treatments. - RCBD can be used to compare treatment means in such situations even though there are differences between blocks. --- ## Nuisance factor A factor that has an effect on the response, but we are not interested in that effect. - **Unknown and uncontrollable: ** *Randomization* to balance out its effect - **Known and uncontrollable but measurable:** *Analysis of covariance (ANCOVA)* (at the time of the analysis) - Nuisance source of variability is **Known and controllable**: use *Blocking* to systematically eliminate its effect on the statistical comparison among treatment means. --- ## Example We want to test the difference between different fertilizers (A, B, C, D, E, F) on strawberry yield. The experimenter has decided to obtain four observations for each fertilizer type. -- How many experimental units are required to compare the treatments? --- ![](farm.jpeg) --- ## Example (cont.) The experiment is run at 4 different locations having 6 different plots of land each. Hence, a block is given by a location and an experimental unit by a plot of land. --- ## Example (cont.) - randomized complete block design (RCBD) - the experimental units are homogeneous within a block - within each block, the 6 treatments are randomly assigned to 6 experimental units - each block (farms) contains all the treatments (fertilizer) - the design is called **complete** because we see the complete set of treatments within every block --- ## RCBD In-class diagram --- ## RCBD ### Randomization: - Number the `\(a\)` treatments `\(1,2,…,a\)`. - First form the homogeneous blocks of the experimental units. Then allocate each treatment randomly in each block. - Number the units in each block as `\(1, 2,...,a\)`. - Randomly allocate the `\(a\)` treatments to `\(a\)` experimental units in each block --- ## RCBD ### Replication Since each block contains all the treatments, so every treatment will appear in all the blocks. So each treatment can be considered as if replicated the number of times as the number of blocks. Hence, in RCBD, the number of blocks and the number of replications are same. --- ## Your turn Question: How the 4 diets (A, B, C, and D) affect the coagulation of rabbits? Treatment: Diet Factor levels: A, B, C, D Response: Time in seconds that it takes for a cut to stop bleeding (coagulation rate). Experimental unit: 16 rabits Replicates: 4 There are 16 rabbits (same age, weight, height) --- background-image: url('bunnies.jpeg') background-position: center background-size: cover ## Your turn <span style="color: white;">Which approach do you use to analyze the data? CRD or RCBD</span> --- Have a jar with the letters A, B, C, D written on separate slips. Catch a rabbit, pick a slip at random from the bowl and assign the rabbit to the diet letter on the slip. Do not replace the slip. Catch the second rabbit and select another slip from the remaining three slips. Assign that treatment to the second rabbit. Continue until the first four rabbits are assigned one of the four diets. Replace the slips and repeat the procedure until all rabbits are assigned to a diet. #### Which approach do you use to analyze the data? CRD or RCBD source: [here](http://www.worldcolleges.info/sites/default/files/enggnotes/completely_randomized_design.pdf) --- ## Notations | Block 1 | Block 2 | ... | Block b | |:---: | :-----: | :---: | :---: | | `\(y_{11}\)` | `\(y_{12}\)` | ... | `\(y_{1b}\)` | | `\(y_{21}\)` | `\(y_{22}\)` | ... | `\(y_{2b}\)` | | . | | | | | . | | | | | . | | | | | `\(y_{a1}\)` | `\(y_{a2}\)` | ... | `\(y_{ab}\)` | Number of treatments: `\(a\)` Number of blocks: `\(b\)` --- "There is one observation per treatment in each block, and the order in which the treatments are run within each block is determined randomly. Because the only, randomization of treatment is within the blocks, we often say that the blocks represent **a restriction on randomization**". (Montgomery, Design and Analysis of Experiments, 2001) --- ## Statistical Model for the RCBD **1. Mean model** `$$Y_{ij}= \mu_{ij} + \epsilon_{ij} \begin{cases} i=1, 2, ..., a \\ j=1, 2, ..., b \end{cases}$$` `\(a\)` - number of treatments `\(b\)` - number of blocks `\(\mu_{ij}\)` - mean of the `\(i\)`th factor level or treatment and `\(j\)`th block `\(\epsilon_{ij}\)` - random error (It is assumed `\(\epsilon_{ij}\)`'s are independent and `\(N(0, \sigma^2)\)`) --- Let `\(\mu\)` - overall mean, `\(\tau_i\)` - effect of the `\(i\)`th treatment, `\(\beta_j\)` - effect of the `\(j\)`th block, Then, `\(\mu_{ij} = \mu + \tau_i + \beta_j\)`. --- ## Statistical Model for the RCBD **2. Effects model** `$$Y_{ij}= \mu + \tau_i + \beta_j + \epsilon_{ij} \begin{cases} i=1, 2, ..., a \\ j=1, 2, ..., b \end{cases}$$` .pull-left[ `\(a\)` - number of treatments `\(b\)` - number of blocks `\(\mu\)` - overall mean ] .pull-right[ `\(\tau_i\)` - effect of the `\(i\)`th treatment `\(\beta_j\)` - effect of the `\(j\)`th block `\(\epsilon_{ij}\)` - random error with `\(NID(0, \sigma^2)\)` ] --- ## Two constraints Treatment effect and block effects are the deviations from the overall mean. Hence `$$\sum_{i=1}^a\tau_i=0$$` and `$$\sum_{j=1}^b\beta_j = 0$$` --- ## In-class In CRD `$$Y_{ij}= \mu + \tau_i + \epsilon_{ij} \begin{cases} i=1, 2, ..., a \\ j=1, 2, ..., n \end{cases}$$` `$$\tau_i = \mu + \mu_i, \text{ } i=1, 2, ...a$$` `$$\frac{\sum_{i=1}^a \mu_i}{a} = \mu$$` This definition implies `$$\sum_{i=1}^a\tau_i = 0$$` --- # In-class --- class: center, middle ## Hypothesis - RCBD --- We want to test the **equality of the treatment means** `$$H_0: \mu_1 = \mu_2 = ... = \mu_a$$` `$$H_1: \text{at least one }\mu_i \neq \mu_j$$` The `\(i\)`th treatment mean can be written as `$$\mu_i = \frac{\sum_{j=1}^b(\mu + \tau_i + \beta_j)}{b} = \mu + \tau_i.$$` Hence, an equivalent way of writing the above hypothesis `$$H_0: \tau_1 = \tau_2 = ... = \tau_a = 0$$` `$$H_1: \tau_i \neq 0\text{ at least one } i$$` --- # Notations `\(y_{i.}\)` - total of all observations taken under treatment `\(i\)` > *Write the mathematical equation* `\(y_{.j}\)` - total of all observations in block `\(j\)` > *Write the mathematical equation* `\(y_{..}\)` - the grand total of all observations > *Write the mathematical equation* `\(N=ab\)` be the total number of observations --- ## Notations `\(\bar{y}_{i.}\)` - the average of the observations taken under treatment `\(i\)` `\(\bar{y}_{.j}\)` - the average of the observations in block `\(j\)` `\(\bar{y}_{..}\)` - the grand average of all observations `$$\bar{y}_{i.} = \frac{y_{i.}}{b}$$` `$$\bar{y}_{.j} = \frac{y_{.j}}{a}$$` `$$\bar{y}_{..} = \frac{y_{..}}{N}$$` --- We express the total corrected sum of squares `$$\sum_{i=1}^a\sum_{j=1}^b (y_{ij} - \bar{y}_{..})^2 = \sum_{i=1}^a\sum_{j=1}^b[(\bar{y}_{i.} - \bar{y}_{..}) + (\bar{y}_{.j} - \bar{y}_{..}) + (y_{ij} - \bar{y}_{i.} - \bar{y}_{.j} + \bar{y}_{..})]^2$$` Note: - The **sum of squares** is the **sum of the squared values of a variable**. - For example the total corrected sum of squares is the sum of the squared values after subtracting (i.e. correcting for) their mean. --- ## Your turn `$$\sum_{i=1}^a\sum_{j=1}^b (y_{ij} - \bar{y}_{..})^2 = \sum_{i=1}^a\sum_{j=1}^b[(\bar{y}_{i.} - \bar{y}_{..}) + (\bar{y}_{.j} - \bar{y}_{..}) + (y_{ij} - \bar{y}_{i.} - \bar{y}_{.j} + \bar{y}_{..})]^2$$` Show that the above can be simplified into `$$\sum_{i=1}^a\sum_{j=1}^b (y_{ij} - \bar{y}_{..})^2 = b\sum_{i=1}^a(\bar{y}_{i.} - \bar{y}_{..})^2 + a\sum_{j=1}^b(\bar{y}_{.j} - \bar{y}_{..})^2 + \\ \sum_{i=1}^a\sum_{j=1}^b (y_{ij} - \bar{y}_{.j} - \bar{y}_{i.} + \bar{y}_{..})$$` `$$SS_T = SS_{Treatment} + SS_{Block} + SS_{E}$$` --- ## ANOVA Table Source of variation | Sum of squares (SS) | DF| Mean Square (MS) | F| p-value | ---:---:---:---:---:---| Treatments | `\(SS_{Treatments}\)` | `\(a-1\)` | `\(MS_{Treatments}\)` | `\(F_0=\frac{MS_{Treatments}}{MS_E}\)` | `\(P(F \geq F_0)\)`| Blocks | `\(SS_{Blocks}\)` | `\(b-1\)` | `\(MS_{Blocks}\)` | | | Error | `\(SS_E\)` | `\((a-1)(b-1)\)` | `\(MS_{E}\)` | | Total | `\(SS_T\)` | `\(N-1\)`| | | | --- ### Example Section 4.1, Montgomery, D. C. (2017). Design and analysis of experiments. John wiley & sons. Experiment: Hardness testing experiment - We wish to determine whether 4 different tips produce different (mean) hardness reading on a hardness tester. - A hardness testing machine operates by pressing a tip into a metal test “coupon.” - The hardness of the coupon is measured from the depth of the resulting depression. - **Response variable** - Four tip types are being tested to see if they produce significantly different readings. - **Treatment** --- ### Example (cont.) - The coupons might differ slightly in their hardness (for example, if they are taken from ingots produced in different heats). - **Block** - Within each coupon (block) the order in which the four tips were tested was randomly determined. References: https://web.ma.utexas.edu/users/mks/384Esp08/rcbdexample.pdf --- ### Example (cont.) ```r tip <- c(1, 2, 3, 4) coupon1 <- c(9.3, 9.4, 9.2, 9.7) coupon2 <- c(9.4, 9.3, 9.4, 9.6) coupon3 <- c(9.6, 9.8, 9.5, 10.0) coupon4 <- c(10.0, 9.9, 9.7, 10.2) df <- data.frame(tip=tip, coupon1=coupon1, coupon2=coupon2, coupon3=coupon3, coupon4=coupon4) df ``` ``` tip coupon1 coupon2 coupon3 coupon4 1 1 9.3 9.4 9.6 10.0 2 2 9.4 9.3 9.8 9.9 3 3 9.2 9.4 9.5 9.7 4 4 9.7 9.6 10.0 10.2 ``` --- ```r summary(df[, 2:5]) ``` ``` coupon1 coupon2 coupon3 coupon4 Min. :9.200 Min. :9.300 Min. : 9.500 Min. : 9.70 1st Qu.:9.275 1st Qu.:9.375 1st Qu.: 9.575 1st Qu.: 9.85 Median :9.350 Median :9.400 Median : 9.700 Median : 9.95 Mean :9.400 Mean :9.425 Mean : 9.725 Mean : 9.95 3rd Qu.:9.475 3rd Qu.:9.450 3rd Qu.: 9.850 3rd Qu.:10.05 Max. :9.700 Max. :9.600 Max. :10.000 Max. :10.20 ``` --- ```r library(tidyverse) df.pl <- df %>% pivot_longer(2:5, "Block", "value") df.pl ``` ``` # A tibble: 16 × 3 tip Block value <dbl> <chr> <dbl> 1 1 coupon1 9.3 2 1 coupon2 9.4 3 1 coupon3 9.6 4 1 coupon4 10 5 2 coupon1 9.4 6 2 coupon2 9.3 7 2 coupon3 9.8 8 2 coupon4 9.9 9 3 coupon1 9.2 10 3 coupon2 9.4 11 3 coupon3 9.5 12 3 coupon4 9.7 13 4 coupon1 9.7 14 4 coupon2 9.6 15 4 coupon3 10 16 4 coupon4 10.2 ``` --- ```r library(tidyverse) df.pl$tip <- as.factor(df.pl$tip) df.block <- df.pl %>% group_by(Block) %>% summarize(mean = mean(value)) df.block ``` ``` # A tibble: 4 × 2 Block mean <chr> <dbl> 1 coupon1 9.4 2 coupon2 9.43 3 coupon3 9.72 4 coupon4 9.95 ``` ```r df.treatment <- df.pl %>% group_by(tip) %>% summarize(mean = mean(value)) df.treatment ``` ``` # A tibble: 4 × 2 tip mean <fct> <dbl> 1 1 9.57 2 2 9.6 3 3 9.45 4 4 9.88 ``` --- ## Block .pull-left[ ![](L6_DOE_files/figure-html/unnamed-chunk-5-1.png)<!-- --> ] .pull-right[ ![](L6_DOE_files/figure-html/unnamed-chunk-6-1.png)<!-- --> ] --- ## Treatment .pull-left[ ![](L6_DOE_files/figure-html/unnamed-chunk-7-1.png)<!-- --> ] .pull-right[ ![](L6_DOE_files/figure-html/unnamed-chunk-8-1.png)<!-- --> ] --- ## ANOVA ```r two.way <- aov(value~ tip + Block, data = df.pl) summary(two.way) ``` ``` Df Sum Sq Mean Sq F value Pr(>F) tip 3 0.385 0.12833 14.44 0.000871 *** Block 3 0.825 0.27500 30.94 4.52e-05 *** Residuals 9 0.080 0.00889 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` --- class: center, middle # Model Adequacy Checking --- ## Residuals ```r library(broom) residdf <- augment(two.way) ``` ``` Warning: Tidiers for objects of class aov are not maintained by the broom team, and are only supported through the lm tidier method. Please be cautious in interpreting and reporting broom output. ``` ```r residdf ``` ``` # A tibble: 16 × 9 value tip Block .fitted .resid .hat .sigma .cooksd .std.resid <dbl> <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 9.3 1 coupon1 9.35 -0.0500 0.438 0.0972 5.56e- 2 -0.707 2 9.4 1 coupon2 9.38 0.0250 0.438 0.0993 1.39e- 2 0.354 3 9.6 1 coupon3 9.67 -0.0750 0.438 0.0935 1.25e- 1 -1.06 4 10 1 coupon4 9.9 0.100 0.438 0.0882 2.22e- 1 1.41 5 9.4 2 coupon1 9.38 0.0250 0.437 0.0993 1.39e- 2 0.354 6 9.3 2 coupon2 9.4 -0.100 0.438 0.0882 2.22e- 1 -1.41 7 9.8 2 coupon3 9.7 0.100 0.437 0.0882 2.22e- 1 1.41 8 9.9 2 coupon4 9.93 -0.0250 0.437 0.0993 1.39e- 2 -0.354 9 9.2 3 coupon1 9.22 -0.0250 0.438 0.0993 1.39e- 2 -0.354 10 9.4 3 coupon2 9.25 0.150 0.438 0.0707 5.00e- 1 2.12 11 9.5 3 coupon3 9.55 -0.0500 0.438 0.0972 5.56e- 2 -0.707 12 9.7 3 coupon4 9.78 -0.0750 0.438 0.0935 1.25e- 1 -1.06 13 9.7 4 coupon1 9.65 0.0500 0.438 0.0972 5.56e- 2 0.707 14 9.6 4 coupon2 9.68 -0.0750 0.438 0.0935 1.25e- 1 -1.06 15 10 4 coupon3 9.98 0.0250 0.437 0.0993 1.39e- 2 0.354 16 10.2 4 coupon4 10.2 0 0.438 0.100 2.81e-30 0 ``` --- # The normality assumption .pull-left[ ```r ggplot(residdf, aes(x=.resid))+ geom_histogram(colour="white")+ggtitle("Distribution of Residuals") ``` ``` `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. ``` ![](L6_DOE_files/figure-html/unnamed-chunk-11-1.png)<!-- --> ] .pull-right[ ```r ggplot(residdf, aes(sample=.resid))+ stat_qq() + stat_qq_line()+labs(x="Theoretical Quantiles", y="Sample Quantiles") ``` ![](L6_DOE_files/figure-html/unnamed-chunk-12-1.png)<!-- --> ] --- # The normality assumption (cont.) ```r shapiro.test(residdf$.resid) ``` ``` Shapiro-Wilk normality test data: residdf$.resid W = 0.93957, p-value = 0.3438 ``` --- # Plot of the residuals vs type ```r ggplot(data=residdf, aes(x=tip, y=.resid)) + geom_point() ``` ![](L6_DOE_files/figure-html/unnamed-chunk-14-1.png)<!-- --> --- # Plot of the residuals vs block ```r ggplot(data=residdf, aes(x=Block, y=.resid)) + geom_point() ``` ![](L6_DOE_files/figure-html/unnamed-chunk-15-1.png)<!-- --> --- # Plot of residuals versus fitted values - To check the assumption of constant variance - Residuals should be structureless. Residuals should not contain any obvious patterns ```r ggplot(data=residdf, aes(x=.fitted, y=.resid)) + geom_point() ``` ![](L6_DOE_files/figure-html/unnamed-chunk-16-1.png)<!-- --> --- ## ANOVA - Computing formulas `$$SS_T = \sum_{i=1}^{a}\sum_{j=1}^b y_{ij}^2 - \frac{y_{..}^2}{N}$$` `$$SS_{Treatments} = \frac{1}{b}\sum_{i=1}^a y_{i.}^2 - \frac{y_{..}^2}{N}$$` `$$SS_{Blocks} = \frac{1}{a}\sum_{i=j=1}^b y_{.j}^2 - \frac{y_{..}^2}{N}$$` `$$SS_{E} = SS_{T} - SS_{Treatments} - SS_{Blocks}$$` --- ## Expected value of mean squares, if treatments and blocks are fixed `$$E(MS_{Treatments}) = \sigma^2 + \frac{b\sum_{i=1}^a \tau_i^2}{a-1}$$` `$$E(MS_{Blocks}) = \sigma^2 + \frac{a\sum_{j=1}^b \beta_j^2}{b-1}$$` $$E(MS_{E}) = \sigma^2 $$ --- ### Test the equality of treatment means `$$F_0 = \frac{MS_{Treatments}}{MS_E}$$` which is distributed as `\(F_{a-1, (a-1)(b-1)}\)` ### Comparing block treatment means `$$F_0 = \frac{MS_{Blocks}}{MS_E}$$` which is distributed as `\(F_{b-1, (a-1)(b-1)}\)` --- ## RCBD - Multiple Comparisons - The rejection of the null hypothesis indicates a significant difference in treatment means. - Any of the multiple comparison methods can be used for detecting which means are significantly different. In this course we use, Tukey's method can be used. - In general we are not interested to compare block means. --- .pull-left[ ```r TukeyHSD(two.way, "tip") ``` ``` Tukey multiple comparisons of means 95% family-wise confidence level Fit: aov(formula = value ~ tip + Block, data = df.pl) $tip diff lwr upr p adj 2-1 0.025 -0.18311992 0.23311992 0.9809005 3-1 -0.125 -0.33311992 0.08311992 0.3027563 4-1 0.300 0.09188008 0.50811992 0.0066583 3-2 -0.150 -0.35811992 0.05811992 0.1815907 4-2 0.275 0.06688008 0.48311992 0.0113284 4-3 0.425 0.21688008 0.63311992 0.0006061 ``` ] .pull-right[ ```r plot(TukeyHSD(two.way, "tip")) ``` ![](L6_DOE_files/figure-html/unnamed-chunk-18-1.png)<!-- --> ] --- ## Randomized Incomplete Block Designs Every treatment is not present in every block ## Balanced Incomplete Block Design (BIBD) All pairs of treatments occur together within a block an equal number of times --- # Acknowledgement Some of the slide content is based on Montgomery, D. C. (2017). Design and analysis of experiments. John wiley & sons.