class: center, middle, inverse, title-slide

# STA 225 2.0 Design and Analysis of Experiments
## Lecture 3
### Dr Thiyanga S. Talagala
### 2021-10-29

---

<style>
.center2 {
  margin: 0;
  position: absolute;
  top: 50%;
  left: 50%;
  -ms-transform: translate(-50%, -50%);
  transform: translate(-50%, -50%);
}
</style>

<style type="text/css">
.remark-slide-content {
    font-size: 30px;
}
</style>

## Completely Randomized Design (CRD)

- The simplest experimental design

- In a CRD, the treatments are assigned completely at random, so that every experimental unit has the same chance of receiving any one treatment.

- Appropriate only for experiments with **homogeneous** experimental units

---
## Example

**Objective of the study:** Effect of fertilizer dose on the yield of chilli

--

**Experimental design:** CRD

--

**Independent variable:** Fertilizer dose

--

**Treatment:** Different doses of fertilizer (A, B, C, D, E)

--

**Dependent variable:** Number of chillies after one month

--

**Number of replicates:** 5 per treatment

--

**Number of experimental units:** `\(\sum_{i=1}^{t}r_i = 25\)`, where `\(t = 5\)` is the number of treatments and `\(r_i = 5\)` is the number of replicates under treatment `\(i\)` ( `\(i = 1, 2, 3, 4, 5\)` )

---
## Step 1: Assign treatments at random

---
## Step 2: Measure the response variable

---
## Step 3: Organize data

<table style="width:50%; margin-left: auto; margin-right: auto;" class="table">
 <thead>
  <tr>
   <th style="text-align:center;"> A </th>
   <th style="text-align:center;"> B </th>
   <th style="text-align:center;"> C </th>
   <th style="text-align:center;"> D </th>
   <th style="text-align:center;"> E </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:center;background-color: green !important;"> 8 </td>
   <td style="text-align:center;background-color: green !important;"> 11 </td>
   <td style="text-align:center;background-color: green !important;"> 14 </td>
   <td style="text-align:center;background-color: green !important;"> 20 </td>
   <td style="text-align:center;background-color: green !important;">
8 </td>
  </tr>
  <tr>
   <td style="text-align:center;background-color: green !important;"> 8 </td>
   <td style="text-align:center;background-color: green !important;"> 17 </td>
   <td style="text-align:center;background-color: green !important;"> 16 </td>
   <td style="text-align:center;background-color: green !important;"> 24 </td>
   <td style="text-align:center;background-color: green !important;"> 10 </td>
  </tr>
  <tr>
   <td style="text-align:center;background-color: green !important;"> 15 </td>
   <td style="text-align:center;background-color: green !important;"> 13 </td>
   <td style="text-align:center;background-color: green !important;"> 18 </td>
   <td style="text-align:center;background-color: green !important;"> 22 </td>
   <td style="text-align:center;background-color: green !important;"> 12 </td>
  </tr>
  <tr>
   <td style="text-align:center;background-color: green !important;"> 11 </td>
   <td style="text-align:center;background-color: green !important;"> 18 </td>
   <td style="text-align:center;background-color: green !important;"> 20 </td>
   <td style="text-align:center;background-color: green !important;"> 19 </td>
   <td style="text-align:center;background-color: green !important;"> 15 </td>
  </tr>
  <tr>
   <td style="text-align:center;background-color: green !important;"> 7 </td>
   <td style="text-align:center;background-color: green !important;"> 17 </td>
   <td style="text-align:center;background-color: green !important;"> 20 </td>
   <td style="text-align:center;background-color: green !important;"> 23 </td>
   <td style="text-align:center;background-color: green !important;"> 12 </td>
  </tr>
</tbody>
</table>

---
## Step 4: Exploratory data analysis

.pull-left[
![](L3_DOE_files/figure-html/unnamed-chunk-2-1.png)<!-- -->
]

--

.pull-right[
![](L3_DOE_files/figure-html/unnamed-chunk-3-1.png)<!-- -->
]

---
## Step 4: Exploratory data analysis

.pull-left[
![](L3_DOE_files/figure-html/unnamed-chunk-4-1.png)<!-- -->
]

.pull-right[
![](L3_DOE_files/figure-html/unnamed-chunk-5-1.png)<!-- -->
]

---

.pull-left[

```r
library(tidyverse)

# Number of chillies per plant, one vector per fertilizer dose
A <- c(8, 8, 15, 11, 7)
B <- c(11, 17, 13, 18, 17)
C <- c(14, 16, 18, 20, 20)
D <- c(20, 24, 22, 19, 23)
E <- c(8, 10, 12, 15, 12)
df <- data.frame(A = A, B = B, C = C, D = D, E = E)

# Reshape to long format: one row per experimental unit
df <- df %>%
  pivot_longer(cols = 1:5, names_to = "Treatment", values_to = "value")
df$Treatment <- as.factor(df$Treatment)

# Summary statistics for each treatment
df %>%
  group_by(Treatment) %>%
  summarise(
    mean = mean(value),
    sd = sd(value),
    median = median(value),
    min = min(value),
    max = max(value))
```

```
# A tibble: 5 × 6
  Treatment  mean    sd median   min   max
  <fct>     <dbl> <dbl>  <dbl> <dbl> <dbl>
1 A           9.8  3.27      8     7    15
2 B          15.2  3.03     17    11    18
3 C          17.6  2.61     18    14    20
4 D          21.6  2.07     22    19    24
5 E          11.4  2.61     12     8    15
```
]

.pull-right[
![](L3_DOE_files/figure-html/unnamed-chunk-7-1.png)<!-- -->
]

---
## Sources of Variation

- Source of variation due to **treatments**

- Source of variation due to **experimental error**

## Partition of total variability

<style>
div.blue pre { background-color:lightblue; }
div.blue pre.r { background-color:red; }
</style>

<div class = "red">

`$$\textbf{Variability}_\text{total} = \textbf{Variability}_\text{treatment} + \textbf{Variability}_\text{experimental error}$$`

</div>

---
## Notations

| Treatment (level) | R1 | R2 | ... | Rn | Totals | Averages |
|:---: | :-----: | :---: | :---: | :---: | :---: | :---: |
| 1 | `\(y_{11}\)` | `\(y_{12}\)` | ... | `\(y_{1n}\)` | `\(y_{1.}\)` | `\(\bar{y}_{1.}\)` |
| 2 | `\(y_{21}\)` | `\(y_{22}\)` | ... | `\(y_{2n}\)` | `\(y_{2.}\)` | `\(\bar{y}_{2.}\)` |
| . | | | | | | |
| . | | | | | | |
| . | | | | | | |
| a | `\(y_{a1}\)` | `\(y_{a2}\)` | ... | `\(y_{an}\)` | `\(y_{a.}\)` | `\(\bar{y}_{a.}\)` |
| | | | | | `\(y_{..}\)` | `\(\bar{y}_{..}\)` |

---

| Treatment (level) | R1 | R2 | R3 | R4 | R5 | Totals | Averages |
|:---: | :-----: | :---: | :---: | :---: | :---: | :---: | :---: |
| A | `\(y_{11}\)` <br> 8 | `\(y_{12}\)` <br> 8 | `\(y_{13}\)` <br> 15 | `\(y_{14}\)` <br> 11 | `\(y_{15}\)` <br> 7 | `\(y_{1.}\)` <br> 49 | `\(\bar{y}_{1.}\)` <br> 9.8 |
| B | `\(y_{21}\)` <br> 11 | `\(y_{22}\)` <br> 17 | `\(y_{23}\)` <br> 13 | `\(y_{24}\)` <br> 18 | `\(y_{25}\)` <br> 17 | `\(y_{2.}\)` <br> 76 | `\(\bar{y}_{2.}\)` <br> 15.2 |
| C | `\(y_{31}\)` <br> 14 | `\(y_{32}\)` <br> 16 | `\(y_{33}\)` <br> 18 | `\(y_{34}\)` <br> 20 | `\(y_{35}\)` <br> 20 | `\(y_{3.}\)` <br> 88 | `\(\bar{y}_{3.}\)` <br> 17.6 |
| D | `\(y_{41}\)` <br> 20 | `\(y_{42}\)` <br> 24 | `\(y_{43}\)` <br> 22 | `\(y_{44}\)` <br> 19 | `\(y_{45}\)` <br> 23 | `\(y_{4.}\)` <br> 108 | `\(\bar{y}_{4.}\)` <br> 21.6 |
| E | `\(y_{51}\)` <br> 8 | `\(y_{52}\)` <br> 10 | `\(y_{53}\)` <br> 12 | `\(y_{54}\)` <br> 15 | `\(y_{55}\)` <br> 12 | `\(y_{5.}\)` <br> 57 | `\(\bar{y}_{5.}\)` <br> 11.4 |
| | | | | | | `\(y_{..} = 378\)` | `\(\frac{378}{25} = \frac{75.6}{5} = 15.12\)` |

---
## Notations (cont.)

`\(a\)` - number of different treatments

`\(y_{ij}\)` - `\(j\)`th observation taken under factor level or treatment `\(i\)`

`\(n\)` - number of observations under each treatment

`\(y_{i.}\)` - total of the observations under the `\(i\)`th treatment

`\(\bar{y}_{i.}\)` - average of the observations under the `\(i\)`th treatment

`\(y_{i.} = \sum_{j=1}^n y_{ij},\)` `\(\bar{y}_{i.} = \frac{y_{i.}}{n},\)` `\(i = 1, 2, ..., a\)`

---
## Notations (cont.)

`\(y_{..}\)` - grand total of all the observations

`\(\bar{y}_{..}\)` - grand average of all the observations

`\(y_{..} = \sum_{i=1}^a\sum_{j=1}^n y_{ij}\)`

`\(\bar{y}_{..} = \frac{y_{..}}{N},\)` where `\(N = an\)`

Note: the "dot" subscript notation implies summation over the subscript that it replaces.
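
---

The dot notation above can be checked numerically on the chilli data. The following is a minimal base R sketch, not part of the original analysis; the object names (`y`, `y_i_dot`, `ybar_i_dot`, `y_dot_dot`, `ybar_dot_dot`) are chosen here only to mirror the symbols.

```r
# Chilli data: rows are replicates (j), columns are treatments (i)
y <- cbind(
  A = c(8, 8, 15, 11, 7),
  B = c(11, 17, 13, 18, 17),
  C = c(14, 16, 18, 20, 20),
  D = c(20, 24, 22, 19, 23),
  E = c(8, 10, 12, 15, 12)
)

a <- ncol(y)  # number of treatments
n <- nrow(y)  # number of observations per treatment
N <- a * n    # total number of observations

y_i_dot      <- colSums(y)    # treatment totals, y_{i.}
ybar_i_dot   <- y_i_dot / n   # treatment averages, ybar_{i.}
y_dot_dot    <- sum(y)        # grand total, y_{..}
ybar_dot_dot <- y_dot_dot / N # grand average, ybar_{..}

y_i_dot       # 49 76 88 108 57
ybar_i_dot    # 9.8 15.2 17.6 21.6 11.4
y_dot_dot     # 378
ybar_dot_dot  # 15.12
```

Storing the data as a matrix keeps the row and column roles aligned with the subscripts `\(j\)` and `\(i\)` in the notation table.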
---
class: inverse, center, middle

# **One-way or single-factor analysis of variance**

---
## Model for the data

We can describe the observations from an experiment using a model. There are two ways to write the model.

**Method 1: Means model**

`$$Y_{ij}= \mu_i + \epsilon_{ij} \begin{cases} i=1, 2, ..., a \\ j=1, 2, ..., n \end{cases}$$`

where `\(Y_{ij}\)` is the `\(ij\)`th observation, `\(\mu_i\)` is the mean of the `\(i\)`th factor level or treatment, and `\(\epsilon_{ij}\)` is the **random error** component, which captures all other sources of variability such as background noise, differences between experimental units, and variability arising from uncontrolled factors.

We assume the errors have mean zero; hence `\(E(Y_{ij}) = \mu_{i}\)`.

---

**Method 2: Effects model (an alternative way to write the means model)**

`$$\mu_i = \mu + \tau_i, \text{ } i=1, 2, ..., a$$`

Hence the means model

`$$Y_{ij}= \mu_i + \epsilon_{ij} \begin{cases} i=1, 2, ..., a \\ j=1, 2, ..., n \end{cases}$$`

can be written as

`$$Y_{ij}= \mu + \tau_i + \epsilon_{ij} \begin{cases} i=1, 2, ..., a \\ j=1, 2, ..., n \end{cases}$$`

where `\(\mu\)` is a parameter common to all treatments, called the **overall mean**, and `\(\tau_i\)` is a parameter unique to the `\(i\)`th treatment, called the `\(i\)`**th treatment effect.**

---
## Model assumptions

The model errors are assumed to be normally and independently distributed random variables with mean zero and variance `\(\sigma^2\)`.

The variance `\(\sigma^2\)` is assumed to be constant for all levels of the factor.

This implies that the observations satisfy

`$$Y_{ij} \sim N(\mu+\tau_i, \sigma^2)$$`

and that the observations are mutually independent.

---
## In-class

`$$Y_{ij}= \mu + \tau_i + \epsilon_{ij} \begin{cases} i=1, 2, ..., a \\ j=1, 2, ..., n \end{cases}$$`

When the `\(a\)` treatments are a random sample from a larger population of treatments, this is called the **random effects model** or **components of variance model**. When the treatments are specifically chosen by the experimenter, it is called the **fixed effects model**.
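
---

To make the effects model and its assumptions concrete, here is a minimal simulation sketch in base R. All parameter values (`mu`, `tau`, `sigma`) and the seed are hypothetical and chosen only for illustration; they are not estimates from the chilli data. The treatment effects are fixed constants here, and they are set to sum to zero, which is a common convention rather than something stated on the slides.

```r
set.seed(225)  # arbitrary seed, for reproducibility only

a <- 5  # number of treatments
n <- 5  # replicates per treatment

# Hypothetical parameter values (assumptions, not estimates)
mu    <- 15                  # overall mean
tau   <- c(-5, 0, 2, 6, -3)  # treatment effects, chosen to sum to zero
sigma <- 2.5                 # common error standard deviation

# Y_ij = mu + tau_i + epsilon_ij, with epsilon_ij ~ N(0, sigma^2) independently
sim <- data.frame(
  Treatment = factor(rep(LETTERS[1:a], each = n)),
  value     = mu + rep(tau, each = n) + rnorm(a * n, mean = 0, sd = sigma)
)

# The sample treatment means should scatter around mu + tau_i
tapply(sim$value, sim$Treatment, mean)
```

Re-running with a different seed gives different observations but the same underlying treatment means `\(\mu + \tau_i\)`, which is the distinction between the parameters and the data.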
---
## Our goal

- Test appropriate hypotheses about the treatment means

- Estimate treatment means

---
## Hypotheses

We want to test the equality of the `\(a\)` treatment means, that is, `\(E(Y_{ij}) = \mu + \tau_i = \mu_i, i=1, 2, ..., a.\)`

In terms of `\(\mu_i\)`

`$$H_0: \mu_1 = \mu_2=...=\mu_a$$`

`$$H_1: \mu_i \neq \mu_j \text{ for at least one pair (i, j)}$$`

In terms of `\(\tau_i\)`

`$$H_0: \tau_1 = \tau_2=...=\tau_a=0$$`

`$$H_1: \tau_i \neq 0 \text{ for at least one }i$$`

---
### Analysis of Variance (ANOVA)

`$$\textbf{Variability}_\text{total} = \textbf{Variability}_\text{treatment} + \textbf{Variability}_\text{experimental error}$$`

ANOVA is derived from decomposing the total sum of squares (partitioning the total variability into its components).

The total sum of squares ( `\(SS_T\)` )

`$$SS_T = \sum_{i=1}^a\sum_{j=1}^n (y_{ij} - \bar{y}_{..})^2.$$`

`\(SS_T\)` measures the overall variability in the data.

Number of degrees of freedom = `\(an-1 = N-1\)`

---
## Decomposition of the Total Sum of Squares

`$$\sum_{i=1}^a\sum_{j=1}^n (y_{ij} - \bar{y}_{..})^2 = \sum_{i=1}^a\sum_{j=1}^n (y_{ij} - \bar{y}_{..} + \bar{y}_{i.} - \bar{y}_{i.})^2$$`

`$$\sum_{i=1}^a\sum_{j=1}^n (y_{ij} - \bar{y}_{..})^2 = \sum_{i=1}^a\sum_{j=1}^n [(\bar{y}_{i.} - \bar{y}_{..} ) + (y_{ij} - \bar{y}_{i.})]^2$$`

`\(\sum_{i=1}^a\sum_{j=1}^n (y_{ij} - \bar{y}_{..})^2 = n \sum_{i=1}^a(\bar{y}_{i.} - \bar{y}_{..})^2 + \sum_{i=1}^a\sum_{j=1}^n (y_{ij} - \bar{y}_{i.})^2\)`
`\(\text{ }+ 2\sum_{i=1}^a\sum_{j=1}^n(\bar{y}_{i.} - \bar{y}_{..})(y_{ij}-\bar{y}_{i.})\)`

The cross product term is zero.
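
A short worked step showing why the cross product vanishes, using only the definitions `\(y_{i.} = \sum_{j=1}^n y_{ij}\)` and `\(\bar{y}_{i.} = \frac{y_{i.}}{n}\)` from the notation slides. The factor `\((\bar{y}_{i.} - \bar{y}_{..})\)` does not depend on `\(j\)`, so it can be taken outside the inner sum:

`$$\sum_{i=1}^a\sum_{j=1}^n(\bar{y}_{i.} - \bar{y}_{..})(y_{ij}-\bar{y}_{i.}) = \sum_{i=1}^a(\bar{y}_{i.} - \bar{y}_{..})\sum_{j=1}^n(y_{ij}-\bar{y}_{i.})$$`

and within each treatment the deviations from the treatment average sum to zero:

`$$\sum_{j=1}^n(y_{ij}-\bar{y}_{i.}) = y_{i.} - n\bar{y}_{i.} = y_{i.} - n\frac{y_{i.}}{n} = 0$$`
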
Hence,

`\(\sum_{i=1}^a\sum_{j=1}^n (y_{ij} - \bar{y}_{..})^2 = n \sum_{i=1}^a(\bar{y}_{i.} - \bar{y}_{..})^2 + \sum_{i=1}^a\sum_{j=1}^n (y_{ij} - \bar{y}_{i.})^2\)`

---

`\(\sum_{i=1}^a\sum_{j=1}^n (y_{ij} - \bar{y}_{..})^2 = n \sum_{i=1}^a(\bar{y}_{i.} - \bar{y}_{..})^2 + \sum_{i=1}^a\sum_{j=1}^n (y_{ij} - \bar{y}_{i.})^2\)`

`$$SS_T = SS_{Treatments} + SS_{E}$$`

`\(SS_{Treatments}\)` is called the sum of squares due to treatments (i.e., between treatments)

`\(SS_{E}\)` is called the sum of squares due to error (i.e., within treatments)

---
## ANOVA Table

| Source of variation | Sum of squares (SS) | DF | Mean Square (MS) | F | p-value |
|:---: | :---: | :---: | :---: | :---: | :---: |
| Between treatments | `\(SS_{Treatments}\)` | `\(a-1\)` | `\(MS_{Treatments}\)` | `\(F_0=\frac{MS_{Treatments}}{MS_E}\)` | `\(P(F \geq F_0)\)` |
| Error (within treatments) | `\(SS_E\)` | `\(N-a\)` | `\(MS_{E}\)` | | |
| Total | `\(SS_T\)` | `\(N-1\)` | | | |

---

`$$SS_E = SS_T - SS_{Treatments}$$`

`$$MS_{Treatments} = \frac{SS_{Treatments}}{a-1}$$`

`$$MS_{E} = \frac{SS_{E}}{N-a}$$`

`$$F_0 = \frac{MS_{Treatments}}{MS_E}$$`

If the null hypothesis of no difference in treatment means is true, the ratio

`$$F_0 = \frac{MS_{Treatments}}{MS_E}$$`

follows an `\(F\)` distribution with `\(a-1\)` and `\(N-a\)` degrees of freedom.

If `\(F_0 > F_{\alpha, a-1, N-a }\)`, we reject `\(H_0\)` at significance level `\(\alpha\)` and conclude that there are differences in the treatment means.

---

.pull-left[

```
# A tibble: 5 × 6
  Treatment  mean    sd median   min   max
  <fct>     <dbl> <dbl>  <dbl> <dbl> <dbl>
1 A           9.8  3.27      8     7    15
2 B          15.2  3.03     17    11    18
3 C          17.6  2.61     18    14    20
4 D          21.6  2.07     22    19    24
5 E          11.4  2.61     12     8    15
```

What is your guess? Reject `\(H_0\)` or do not reject `\(H_0\)`?

`$$H_0: \mu_1 = \mu_2=...=\mu_5$$`

`$$H_1: \mu_i \neq \mu_j \text{ for at least one pair (i, j)}$$`

]

.pull-right[
![](L3_DOE_files/figure-html/unnamed-chunk-9-1.png)<!-- -->
]

---
# ANOVA

```r
# Fit the one-way ANOVA model: response ~ treatment factor
one.way <- aov(value ~ Treatment, data = df)
summary(one.way)
```

```
            Df Sum Sq Mean Sq F value   Pr(>F)    
Treatment    4  451.4  112.86   14.93 8.39e-06 ***
Residuals   20  151.2    7.56                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---
# Acknowledgement

Some of the slide content is based on

Montgomery, D. C. (2017). Design and analysis of experiments. John Wiley & Sons.
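
---

As a cross-check on the `aov()` output in the ANOVA slide, the same table entries can be rebuilt directly from the sum-of-squares formulas. This is a minimal base R sketch under the balanced-design setup of the slides (equal `n` per treatment); the object names are chosen here to mirror the symbols, and the 5% level used for the critical value is an assumption for illustration.

```r
# Chilli data: rows are replicates, columns are treatments
y <- cbind(
  A = c(8, 8, 15, 11, 7),
  B = c(11, 17, 13, 18, 17),
  C = c(14, 16, 18, 20, 20),
  D = c(20, 24, 22, 19, 23),
  E = c(8, 10, 12, 15, 12)
)
a <- ncol(y); n <- nrow(y); N <- a * n

grand_mean <- mean(y)      # ybar_{..}
trt_means  <- colMeans(y)  # ybar_{i.}

SS_T          <- sum((y - grand_mean)^2)              # total
SS_Treatments <- n * sum((trt_means - grand_mean)^2)  # between treatments
SS_E          <- SS_T - SS_Treatments                 # within treatments

MS_Treatments <- SS_Treatments / (a - 1)
MS_E          <- SS_E / (N - a)
F0            <- MS_Treatments / MS_E

# p-value P(F >= F0) and the critical value at an assumed alpha = 0.05
p_value <- pf(F0, df1 = a - 1, df2 = N - a, lower.tail = FALSE)
F_crit  <- qf(0.95, df1 = a - 1, df2 = N - a)

round(c(SS_Treatments = SS_Treatments, SS_E = SS_E, F0 = F0), 2)
# SS_Treatments 451.44, SS_E 151.20, F0 14.93
F0 > F_crit  # TRUE, so H_0 is rejected at the 5% level
```

The rounded values agree with the `Sum Sq` and `F value` columns reported by `summary(one.way)`.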