Setting the Range of Fold Change and P-Values
Microarray studies often aim to identify genes that are differentially regulated across different classes of samples, for example the genes affected by a treatment or marker genes that discriminate diseased from healthy subjects. Since the data set contains a column named ‘class’ with binary values, we can identify genes that are differentially expressed between those two classes by performing statistical tests.
Comparing two conditions
A simple microarray experiment may be carried out to detect the differences in expression between two conditions. Each condition may be represented by one or more RNA samples. Using two-color cDNA microarrays, samples can be compared directly on the same microarray or indirectly by hybridizing each sample with a common reference sample. The null hypothesis being tested is that there is no difference in expression between the conditions; when conditions are compared directly, this implies that the true ratio between the expression of each gene in the two samples should be one. When samples are compared indirectly, the ratios between the test sample and the reference sample should not differ between the two conditions.
The simplest method for identifying differentially expressed genes is to evaluate the log-ratio between two conditions (or the average of ratios when there are replicates) and consider all genes that differ by more than an arbitrary cut-off value to be differentially expressed. For example, if the chosen cut-off value is a two-fold difference, a gene is taken to be differentially expressed if its expression under one condition is more than twice, or less than half, its expression under the other condition.
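The fold-change rule above can be sketched in a few lines of Python. This is only an illustration, not GeNet's implementation: the function names, the pairing of replicates, and the default two-fold cut-off are assumptions made for the example.

```python
import math
from statistics import mean


def mean_log2_ratio(cond_a, cond_b):
    """Average log2 ratio (cond_b / cond_a) over paired replicates.

    Assumption for this sketch: replicate i of cond_a is paired with
    replicate i of cond_b, and all expression values are positive.
    """
    if len(cond_a) != len(cond_b) or not cond_a:
        raise ValueError("need the same non-zero number of replicates")
    return mean(math.log2(b / a) for a, b in zip(cond_a, cond_b))


def exceeds_fold_cutoff(log2_ratio, fold_cutoff=2.0):
    """True if the change is larger than fold_cutoff in either direction."""
    # A two-fold cut-off corresponds to |log2 ratio| > 1.
    return abs(log2_ratio) > math.log2(fold_cutoff)


# Hypothetical expression values for three genes, two replicates each.
up = mean_log2_ratio([10.0, 20.0], [40.0, 80.0])     # four-fold increase
down = mean_log2_ratio([40.0, 80.0], [10.0, 20.0])   # four-fold decrease
flat = mean_log2_ratio([10.0, 10.0], [15.0, 15.0])   # 1.5-fold increase
```

Working on the log2 scale makes the cut-off symmetric: a two-fold increase and a two-fold decrease sit at +1 and -1, so a single absolute-value comparison covers both directions.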
In statistics, the p-value is the probability of obtaining results at least as extreme as the observed results of a statistical hypothesis test, assuming that the null hypothesis is correct. The p-value is used as an alternative to rejection points: it gives the smallest level of significance at which the null hypothesis would be rejected. A smaller p-value means that there is stronger evidence in favor of the alternative hypothesis. In GeNet, the p-value is calculated with a t-test.
The t-test is a simple, statistically based method for detecting differentially expressed genes. In replicated experiments, the error variance can be estimated for each gene from the log ratios, and a standard t-test can be conducted for each gene; the resulting t statistic can be used to determine which genes are significantly differentially expressed.
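As a rough illustration of the per-gene test described above, the following sketch computes a one-sample t statistic from a gene's replicate log ratios, testing whether their mean differs from zero (that is, a true ratio of one). It is not GeNet's code; the function name and the example values are hypothetical, and only the Python standard library is used.

```python
import math
from statistics import mean, stdev


def t_statistic(log_ratios):
    """One-sample t statistic for H0: the mean log ratio is zero.

    The error variance is estimated from the gene's own replicates,
    so at least two replicate log ratios are required.
    """
    n = len(log_ratios)
    if n < 2:
        raise ValueError("at least two replicates are needed")
    sd = stdev(log_ratios)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("zero variance: t statistic is undefined")
    standard_error = sd / math.sqrt(n)
    return mean(log_ratios) / standard_error


# Hypothetical log2 ratios for one gene across three replicates.
t = t_statistic([1.0, 1.2, 0.8])
```

Turning the statistic into a p-value requires the t distribution with n - 1 degrees of freedom, which the standard library does not provide; a statistics package would normally be used for that step.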
The 'volcano plot' is an effective and easy-to-interpret graph that summarizes both fold-change and t-test criteria. It is a scatter-plot of the negative log10-transformed p-values from the t-test against the log2 fold change. Genes with statistically significant differential expression according to the gene-specific t-test will lie above a horizontal threshold line. Genes with large fold-change values will lie outside a pair of vertical threshold lines. Genes that satisfy both criteria will therefore be located in the upper left or upper right parts of the plot.
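The way the two threshold lines divide the plot can be sketched as follows. The function names, region labels, and default thresholds (a two-fold change and a p-value of 0.05) are assumptions chosen for the example, not values taken from GeNet.

```python
import math


def volcano_point(log2_fc, p_value):
    """Coordinates of a gene on the volcano plot: (log2 FC, -log10 p)."""
    # Assumes 0 < p_value <= 1; a p-value of exactly 0 has no finite -log10.
    return (log2_fc, -math.log10(p_value))


def volcano_region(log2_fc, p_value, fold_cutoff=2.0, p_cutoff=0.05):
    """Name the region of the plot in which a gene falls."""
    x, y = volcano_point(log2_fc, p_value)
    above_line = y > -math.log10(p_cutoff)    # horizontal threshold line
    vertical = math.log2(fold_cutoff)         # vertical lines at +/- this x
    if above_line and x > vertical:
        return "upper right"
    if above_line and x < -vertical:
        return "upper left"
    return "not selected"


# Hypothetical genes: (log2 fold change, p-value).
regions = [
    volcano_region(2.0, 0.001),   # large increase, significant
    volcano_region(-1.5, 0.01),   # large decrease, significant
    volcano_region(0.3, 0.001),   # significant but small change
    volcano_region(3.0, 0.2),     # large change but not significant
]
```

With these example inputs, only the first two genes pass both criteria; the third fails the fold-change threshold and the fourth fails the p-value threshold.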