Zhengdeng Lei, PhD

2009 - Present Research Fellow at Duke-NUS, Singapore
2007 - 2009 High Throughput Computational Analyst, Memorial Sloan-Kettering Cancer Center, New York
2003 - 2007 PhD, Bioinformatics, University of Illinois at Chicago

Tuesday, May 29, 2012

http://smd.stanford.edu/gp/pages/protocols/ClassDiscovery_consensus.html

Best practice is to normalize the data being clustered.
ZL: this is usually done by standardizing on rows (genes) and then on columns (arrays), or by the adjust cycle below:




  • Adjust Cycle 1) log transform ##### skip this step if the data are RMA-processed (RMA output is already on the log2 scale)

  • Adjust Cycle 2) median center genes and arrays 
  • repeat (2) five to ten times  #### like median polish 

  • Adjust Cycle 3) normalize genes and arrays
  • repeat (3) five to ten times

  • see cluster3 (http://db.tt/fKAluEip treeview) documentation.
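ZL-style sketch of the adjust cycle above in plain Python, for illustration only. The function names and the `rounds` parameter are hypothetical, not Cluster 3.0 calls. It assumes that "normalize" means scaling each row or column so that its sum of squares is 1, which is how the Cluster 3.0 manual describes it; check the manual before relying on this.

```python
import math
from statistics import median


def transpose(data):
    """Swap rows (genes) and columns (arrays)."""
    return [list(col) for col in zip(*data)]


def log_transform(data):
    """Cycle 1: log2 of every value (skip for data already on the log2 scale)."""
    return [[math.log2(v) for v in row] for row in data]


def median_center_rows(data):
    """Subtract each row's median from that row."""
    centered = []
    for row in data:
        m = median(row)
        centered.append([v - m for v in row])
    return centered


def median_center(data, rounds=5):
    """Cycle 2: alternately median center genes (rows) and arrays (columns)."""
    for _ in range(rounds):
        data = median_center_rows(data)
        data = transpose(median_center_rows(transpose(data)))
    return data


def normalize_rows(data):
    """Scale each row so that its sum of squares is 1 (all-zero rows stay zero)."""
    scaled = []
    for row in data:
        norm = math.sqrt(sum(v * v for v in row))
        scaled.append([v / norm if norm > 0 else 0.0 for v in row])
    return scaled


def normalize(data, rounds=5):
    """Cycle 3: alternately normalize genes (rows) and arrays (columns)."""
    for _ in range(rounds):
        data = normalize_rows(data)
        data = transpose(normalize_rows(transpose(data)))
    return data


# Hypothetical 3 genes x 3 arrays of raw intensities.
raw = [[1, 2, 4], [8, 16, 32], [2, 2, 8]]
adjusted = normalize(median_center(log_transform(raw)))
```

Because each round ends with the column (array) step, the columns are exactly centered or normalized at the end, while the rows are only approximately so. That is why the step is repeated several times.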



    http://bonsai.hgc.jp/~mdehoon/software/cluster/manual/Data.html#Data




    Consensus Clustering

    protocols
    Determine an optimal number of clusters by repeatedly running a selected clustering algorithm. Examine the resulting consensus matrix to assess the stability of the resulting clusters.

    Before you begin

    Gene expression data must be in a GCT or RES file.
    Example file: all_aml_test.gct.
    learn more:
    file formats

    Step 1: PreprocessDataset

    Preprocess gene expression data to remove platform noise and genes that have little variation. Researchers generally preprocess data before clustering; however, if doing so would remove relevant biological information, skip this step.
    CONSIDERATIONS
    • PreprocessDataset can preprocess the data in one or more ways (in this order):
      1. Set threshold and ceiling values. Any value lower than the threshold or higher than the ceiling is reset to the threshold or ceiling value, respectively.
      2. Convert each expression value to the log base 2 of the value.
      3. Remove a gene (row) if a given number of its sample values are less than a given threshold.
      4. Remove genes (rows) that do not have a minimum fold change or expression variation.
      5. Discretize or normalize the data.
    • When using ratios to compare gene expression between samples, convert values to log base 2 of the value to bring up- and down-regulated genes to the same scale. For example, ratios of 2 and .5 indicating two-fold changes for up- and down-regulated expression, respectively, are converted to +1 and -1.
    • If you did not generate the expression data, check whether preprocessing steps have already been taken before running the PreprocessDataset module.
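The first four preprocessing steps listed above can be sketched in plain Python as follows. This is not the PreprocessDataset module itself: the function and parameter names (`floor`, `ceiling`, `low_cutoff`, `max_low_count`, `min_range`) are hypothetical, the cutoffs are arbitrary illustration values, and the variation filter is assumed to be a simple max minus min on the log scale. Step 5 (discretize or normalize) is left out.

```python
import math


def clamp(value, floor, ceiling):
    """Step 1: reset values below the floor or above the ceiling."""
    return max(floor, min(ceiling, value))


def preprocess(rows, floor, ceiling, low_cutoff, max_low_count, min_range):
    """Apply steps 1 to 4 to a dict of {gene name: list of sample values}.

    floor must be positive so that log2 is defined.
    low_cutoff and min_range are on the log2 scale (an assumption of this sketch).
    """
    kept = {}
    for gene, values in rows.items():
        clamped = [clamp(v, floor, ceiling) for v in values]
        logged = [math.log2(v) for v in clamped]          # step 2
        low_count = sum(1 for v in logged if v < low_cutoff)
        if low_count >= max_low_count:                    # step 3
            continue
        if max(logged) - min(logged) < min_range:         # step 4
            continue
        kept[gene] = logged
    return kept


# Hypothetical ratio data: 2 and 0.5 are two-fold up and down.
ratios = {
    "g1": [0.5, 2.0, 1.0],      # kept: varies by two log2 units
    "g2": [1.0, 1.0, 1.0],      # dropped: no variation
    "g3": [0.001, 0.001, 4.0],  # dropped: too many low values
}
filtered = preprocess(ratios, floor=0.25, ceiling=4.0,
                      low_cutoff=-1.5, max_low_count=2, min_range=1.0)
```

With these inputs only `g1` survives, and its ratios 0.5 and 2.0 end up at -1 and +1, which is the symmetric scale described in the log base 2 consideration above.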
    learn more:
    PreprocessDataset

    Step 2: ConsensusClustering

    ConsensusClustering runs a selected clustering algorithm (by default, hierarchical clustering) against perturbations of the gene expression data a selected number of times (by default, 20). It assesses the stability of the resulting clusters by creating a consensus matrix.
    For every pair of objects, the matrix records the number of times both are assigned to the same cluster divided by the number of times both are in the perturbed data set. A consensus matrix where all values are 0 or 1 corresponds to perfect consensus.
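The consensus matrix definition above can be written out as a small Python sketch. It assumes the clustering runs have already been done and that each run is given as a dict mapping an item index to its cluster label, with only the items that were in that perturbed data set present. The function name and input layout are hypothetical, and setting never co-sampled pairs to 0 is a convention chosen here, not something the protocol specifies.

```python
from itertools import combinations


def consensus_matrix(n_items, runs):
    """Entry (i, j) = (# runs with i and j in the same cluster) /
    (# runs in which both i and j were sampled)."""
    together = [[0] * n_items for _ in range(n_items)]
    sampled = [[0] * n_items for _ in range(n_items)]
    for labels in runs:
        for i, j in combinations(sorted(labels), 2):
            sampled[i][j] += 1
            sampled[j][i] += 1
            if labels[i] == labels[j]:
                together[i][j] += 1
                together[j][i] += 1

    matrix = [[0.0] * n_items for _ in range(n_items)]
    for i in range(n_items):
        for j in range(n_items):
            if i == j:
                matrix[i][j] = 1.0  # an item always co-clusters with itself
            elif sampled[i][j] > 0:
                matrix[i][j] = together[i][j] / sampled[i][j]
    return matrix


# Three hypothetical runs over four items; each run leaves one item out.
runs = [
    {0: 0, 1: 0, 2: 1},
    {0: 1, 1: 1, 3: 0},
    {0: 0, 2: 0, 3: 1},
]
m = consensus_matrix(4, runs)
```

Here items 0 and 1 are together in both runs that contain them, so `m[0][1]` is 1.0, while items 0 and 2 are together in one of two shared runs, so `m[0][2]` is 0.5. Cluster labels are only compared within a run, so label numbering does not need to match across runs.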
    CONSIDERATIONS
    • ConsensusClustering clusters genes or samples, not both.
    • ConsensusClustering groups objects (genes or samples) into k clusters. It groups objects into two clusters, then three clusters, and so on up to the maximum number of clusters specified by the kmax parameter (by default, 5). The module builds a separate consensus matrix for each value of k.
    • Best practice is to normalize the data being clustered (normalize type parameter).
    learn more:
    ConsensusClustering

    Step 3: HeatMapViewer

    Run the HeatMapViewer module to view the consensus matrices. The consensus matrix is formatted as a GCT file. The HeatMapViewer displays the consensus matrix as a heat map. A consensus matrix where all values are dark blue (0) or dark red (1) corresponds to perfect consensus.
    CONSIDERATIONS
    • For more about the consensus matrix and its interpretation, see Monti et al., 2003.
    • ConsensusClustering also creates a text file (*.clu) listing the items belonging to each cluster, text files (*.clsdist, *.stdev) listing the cluster statistics, and a .pdf file showing statistical plots (Lorenz curve, Gini index, Consensus CDF) that can be used to determine the best number of clusters. To display any of these files, click the file.
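As a rough illustration of the Consensus CDF plot mentioned above, the following sketch computes the empirical CDF of the off-diagonal consensus values for one matrix. The function names are hypothetical and this is not the code behind the module's .pdf output. It assumes a symmetric matrix with at least two items, given as a list of lists.

```python
def pairwise_values(matrix):
    """Upper-triangle (i < j) entries of a symmetric consensus matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def consensus_cdf(matrix, c):
    """Fraction of item pairs whose consensus value is <= c."""
    values = pairwise_values(matrix)
    return sum(1 for v in values if v <= c) / len(values)


# Hypothetical perfect-consensus matrix: items 0 and 1 always together,
# item 2 always apart from both.
perfect = [
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
levels = [consensus_cdf(perfect, c) for c in (0.0, 0.5, 1.0)]
```

For a perfect consensus matrix the CDF jumps at 0, stays flat, and reaches 1 only at c = 1. Values strictly between 0 and 1 show up as a rise in the middle of the curve, so comparing the curves for k = 2 up to kmax gives a sense of which k yields the most stable clusters.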
