Zhengdeng Lei, PhD

2009 - Present Research Fellow at Duke-NUS, Singapore
2007 - 2009 High Throughput Computational Analyst, Memorial Sloan-Kettering Cancer Center, New York
2003 - 2007 PhD, Bioinformatics, University of Illinois at Chicago

Friday, March 22, 2013

eval(parse(text = command))


setwd("E:\\Projects\\8.ComBAT\\ComBat201\\NormalMarker")
data <- read.table("ComBat201T_orderbyDIM.txt", header = TRUE, sep = "\t", row.names = 1)
# Keep only samples with a T stage of 1 to 4. The data frame is passed to lm() and
# plot() explicitly (data = data) instead of attach()ing it, so the filter is actually used.
data <- subset(data, T.Stage >= 1 & T.Stage <= 4 & T.Stage != "" & T.Stage != " ")

vars <- colnames(data)
res.out <- vector()
res.out[1] <- paste("model", "r.squared", "regression.coef", "coef.pval", sep = ", ")
for (i in 6:26)  # columns 6 to 26 are used as response variables
{
  # Build the call as text, then evaluate it
  command <- paste("lm(", vars[i], " ~ T.Stage, data = data)", sep = "")
  lm.res <- eval(parse(text = command))
  coefs <- summary(lm.res)$coefficients  # row 1: (Intercept), row 2: T.Stage
  res.out[i - 4] <- paste(command, summary(lm.res)$r.squared,
                          coefs[2, "Estimate"],   # regression.coef
                          coefs[2, "Pr(>|t|)"],   # coef.pval
                          sep = ", ")
}
write.table(res.out, file = "linear.model.T.Stage.txt", sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)

pdf(file = "plot.T.Stage.pdf")
op <- par(mfrow = c(3, 2),  # 3 x 2 pictures on one plot
          mar = c(4, 0.1, 0.1, 0.1),
          pty = "s")        # square plotting region, independent of device size
for (i in 6:26)
{
  command <- paste("plot(", vars[i], " ~ T.Stage, data = data, col = 'green')", sep = "")
  eval(parse(text = command))
}
## At end of plotting, reset to previous settings, then close the PDF device:
par(op)
dev.off()
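
The loops above build each model call as a string and evaluate it with eval(parse(text = ...)). One possible alternative, shown here only as a sketch, is to build the formula with reformulate() and pass the data frame explicitly. The data frame toy, its marker columns, and the helper fit_one() are made up for illustration; they stand in for the real table, which is not available here.

```r
# Synthetic stand-in for the real table (MarkerA and MarkerB are hypothetical columns).
set.seed(1)
toy <- data.frame(T.Stage = rep(1:4, each = 5))
toy$MarkerA <- 2 * toy$T.Stage + rnorm(20, sd = 0.1)  # strongly related to T.Stage
toy$MarkerB <- rnorm(20)                              # unrelated noise

# Fit response ~ T.Stage for one column and return the summary values as one row.
fit_one <- function(response, data) {
  # reformulate() builds the formula object directly, without parsing text
  fit <- lm(reformulate("T.Stage", response = response), data = data)
  s <- summary(fit)
  data.frame(model = paste(response, "~ T.Stage"),
             r.squared = s$r.squared,
             regression.coef = s$coefficients[2, "Estimate"],
             coef.pval = s$coefficients[2, "Pr(>|t|)"],
             stringsAsFactors = FALSE)
}

# One row per response column, stacked into a single data frame.
res <- do.call(rbind, lapply(c("MarkerA", "MarkerB"), fit_one, data = toy))
print(res)
```

Collecting the results in a data frame rather than pasted strings means write.table() can add the header row itself.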

Running linear regression multiple times


setwd("E:\\Projects\\8.ComBAT\\ComBat201\\NormalMarker")
data <- read.table("ComBat201T_orderbyDIM.txt", header = TRUE, sep = "\t", row.names = 1)

vars <- colnames(data)

res.out <- vector()
res.out[1] <- paste("model", "r.squared", "coef.Subtype.Mes", "coef.T.Stage", "coef.Subtype.Meta", "coef.Subtype.Mes:T.Stage", "coef.T.Stage:Subtype.Meta", "pval.Subtype.Mes", "pval.T.Stage", "pval.Subtype.Meta", "pval.Subtype.Mes:T.Stage", "pval.T.Stage:Subtype.Meta", sep = ", ")
for (i in 6:26)
{
  # The data frame is passed explicitly (data = data), so attach() is not needed
  command <- paste("lm(", vars[i], " ~ Subtype.Mes*T.Stage + Subtype.Meta*T.Stage, data = data)", sep = "")
  lm.res <- eval(parse(text = command))

  # Coefficient table: row 1 is the intercept, rows 2 to 6 are the five model terms
  # (this assumes every predictor is numeric, so each term has exactly one row)
  coefs <- summary(lm.res)$coefficients

  res.out[i - 4] <- paste(command, summary(lm.res)$r.squared,
                          paste(coefs[2:6, "Estimate"], collapse = ", "),   # coefficients
                          paste(coefs[2:6, "Pr(>|t|)"], collapse = ", "),   # p-values
                          sep = ", ")
}

write.table(res.out, file = "linear.model.TStage_subtype.txt", sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)

Friday, March 1, 2013

NGS Glossary from blueseq.com



16S rRNA: A component of the 30S subunit of prokaryotic ribosomes. It is approximately 1,500 nt in length (1,542 nt in E. coli). Multiple copies of the 16S rRNA gene, with differing sequences, can exist within a single bacterium. 16S rDNA sequences contain hypervariable regions which can provide species-specific signature sequences useful for bacterial identification to species level, particularly in metagenomic studies.

Adapter: A short oligonucleotide of known sequence that is typically attached to longer nucleic acids. In the context of next generation sequencing, an adapter can provide a priming site for both amplification and sequencing of the adjoining, unknown nucleic acid.

Amplicons: Pieces of DNA formed as the products of natural or artificial amplification events.

Assembly: The aligning and merging of fragments of DNA in order to reconstruct the original sequence.

Bacterial artificial chromosome (BAC): A DNA construct, based on a functional fertility plasmid (or F-plasmid), used for transforming and cloning in bacteria, usually E. coli.

Base pair: Two nucleotides on opposite complementary DNA or RNA strands that are connected via hydrogen bonds (often abbreviated bp).

Bridge Amplification: Pre-sequencing fragment amplification step that is used exclusively on the Illumina platform. Fragments to be sequenced are immobilized on a flow cell via hybridization involving two adapters attached at opposing ends of the fragment and slide-anchored primers with sequence complementary to the aforementioned adapters. Through the use of standard PCR reagents, thermal cycling, and the surrounding "primer lawn", many copies of the original fragment are produced and localized in a tight cluster.

cDNA: See "complementary DNA"

Chromosome: An organized structure of DNA and protein found in cells. It is a single piece of coiled DNA containing many genes, regulatory elements and other nucleotide sequences. Chromosomes also contain DNA-bound proteins, which serve to package the DNA and control its functions.

CNV: See "copy number variation"

Complementary DNA (cDNA): DNA synthesized from a mature mRNA template in a reaction catalyzed by reverse transcriptase and DNA polymerase.

Consensus Accuracy: The accuracy of the aggregate of multiple aligned reads when compared to a reference sequence. This is in contrast to "raw read accuracy" which is a direct measure of the sequencing error rate. As the "consensus accuracy" improves with read depth, it is substantially higher than "raw read accuracy" and is particularly relevant for resequencing applications.
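
The claim that consensus accuracy improves with read depth can be illustrated with a simplified model. The sketch below assumes that errors in different reads are independent, that the consensus base is chosen by majority vote over an odd number of reads, and (as a worst case) that all erroneous reads agree on the same wrong base. Under those assumptions the consensus is correct whenever fewer than half of the reads are wrong, which is a binomial probability. The function name consensus_accuracy and the 1% raw error rate are illustrative choices, not values from any platform.

```r
# Probability that a majority vote over `depth` reads gives the correct base,
# assuming independent errors with per-read error rate `raw_error`.
consensus_accuracy <- function(depth, raw_error) {
  stopifnot(depth %% 2 == 1)  # odd depths only, so there are no ties
  # The consensus is correct when at most (depth - 1) / 2 reads are wrong.
  pbinom((depth - 1) / 2, size = depth, prob = raw_error)
}

depths <- c(1, 3, 5, 11)
acc <- consensus_accuracy(depths, raw_error = 0.01)
print(data.frame(depth = depths, consensus.accuracy = acc))
```

At depth 1 the consensus accuracy equals the raw read accuracy (0.99); at depth 3 it is already 0.999702 under this model.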

Contig: A set of overlapping DNA segments derived from a single genetic source which can be used to deduce the original DNA sequence of the source.

Copy Number Variation (CNV): A segment of DNA whose number of copies differs between genomes in a given population. Methods for measuring copy number variation include microarrays and depth-of-coverage analysis of NGS data.

Cosmid: A hybrid plasmid that contains cos sites at each end. Cos sites are recognized during head filling of lambda phages. Cosmids are useful for cloning large segments of foreign DNA (up to 50 kb).

Coverage Depth: The number of nucleotides from reads that are mapped to a given position.
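
As a small worked example of coverage depth, the sketch below counts how many reads overlap each position of a 10 bp reference. The read coordinates are invented for illustration, and positions are assumed to be 1-based and inclusive.

```r
# Hypothetical mapped reads, given as 1-based inclusive start/end positions.
reads <- data.frame(start = c(1, 3, 4, 8), end = c(5, 7, 6, 10))
ref_length <- 10

# Add 1 to the depth of every position covered by each read.
depth <- integer(ref_length)
for (k in seq_len(nrow(reads))) {
  idx <- reads$start[k]:reads$end[k]
  depth[idx] <- depth[idx] + 1L
}

print(depth)        # per-position depth
print(mean(depth))  # mean depth across the reference
```

The mean depth equals the total number of mapped bases (16 here) divided by the reference length (10).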

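CpG Island: A genomic region with a high density of CpG dinucleotides (a cytosine followed by a guanine on the same strand). CpG islands are often found near gene promoters and are typically unmethylated; methylation of a CpG island is associated with reduced expression of the nearby gene.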

DGE: See "digital gene expression".

Digital Gene Expression (DGE): Experimental methodology that aims to quantify gene expression through an absolute count. One such type of experimentation that leverages next generation sequencing is SuperSAGE, in which 26 bp fragments representative of specific mRNA transcripts are sequenced. Each sequenced fragment acts as a "count" for its associated mRNA transcript, allowing for high throughput comparative analysis of gene expression under various conditions.

Diploid: Condition in which a cell or organism contains two sets of homologous chromosomes and hence two copies of each gene or genetic locus.

DNA: deoxyribonucleic acid; double-stranded polynucleotide formed from two separate chains of covalently linked deoxyribonucleotide units. It serves as the cell's store of genetic information that is transmitted from generation to generation.

DNA Sequence: The linear order of nucleotides in a DNA molecule.

emPCR: See "emulsion PCR"

Emulsion PCR: A method for bead-based amplification of a library. A single adapter-bound fragment is attached to the surface of a bead, and an oil emulsion containing necessary amplification reagents is formed around the bead/fragment component. Parallel amplification of millions of beads with millions of single strand fragments produces a sequencer-ready library.

Epigenetics: The study of changes in phenotype and gene expression caused by mechanisms other than changes to the primary DNA sequence itself, mainly cytosine methylation and DNA-protein interactions.

EST: See "expressed sequence tag"

Expressed Sequence Tag (EST): A short sub-sequence of a transcribed cDNA sequence. They may be used to identify gene transcripts and are instrumental in gene discovery and gene sequence determination.

Fosmid: An F-factor cosmid, which is like a plasmid (circular DNA) but can carry much larger pieces of DNA: up to 50 kb, compared to about 10 kb in a plasmid. Unlike plasmids, a fosmid is maintained at only about one copy per E. coli cell, which yields a much lower copy number when cloning. Fosmids can therefore hold larger pieces of DNA than plasmids, but fewer copies of them.

GC-rich region: A region of the DNA sequence containing a high percentage of G's and C's, which often indicates a gene-rich region.

Genome: The entirety of an organism's hereditary information, including both the genes and non-coding sequence. The human genome consists of over 3 billion base pairs.

Genomics: The study of genes and their function, via mapping genes and sequencing DNA, with the aim of understanding the structure of the genome.

Haploid: Condition in which a cell or organism contains one set of homologous chromosomes and hence one copy of each gene or genetic locus.

Haplotype: A way of denoting the collective genotype of a number of closely linked loci on a chromosome.

HiTS-FLIP: High-throughput sequencing-fluorescent ligand interaction profiling; see initial description here: http://www.nature.com/nbt/journal/v29/n7/full/nbt.1882.html

Homopolymer: Uninterrupted stretch of a single nucleotide type (e.g., AAAA or GGGG).
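
Homopolymer runs can be located with run-length encoding. The following is a minimal sketch; the function name find_homopolymers, the minimum run length of 3, and the example sequence are illustrative choices.

```r
# Report every run of a single base that is at least `min_len` long.
find_homopolymers <- function(seq, min_len = 3) {
  runs <- rle(strsplit(seq, "")[[1]])  # run-length encode the bases
  ends <- cumsum(runs$lengths)         # 1-based end position of each run
  starts <- ends - runs$lengths + 1    # 1-based start position of each run
  keep <- runs$lengths >= min_len
  data.frame(base = runs$values[keep],
             start = starts[keep],
             length = runs$lengths[keep],
             stringsAsFactors = FALSE)
}

hp <- find_homopolymers("ACGGGGTTAAAAAC")
print(hp)
```

For this sequence the sketch reports a run of four G's starting at position 3 and a run of five A's starting at position 9.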

InDel: A form of structural variation in which a DNA segment is either deleted or inserted. Related to copy number variations, but typically the lengths are shorter. Often detected through paired end mapping techniques that utilize NGS data.

Intron: A DNA region within a gene that is not translated into a protein. These non-coding sections are transcribed into precursor mRNA (pre-mRNA) and other RNAs (such as long non-coding RNAs). They are subsequently removed by a process called splicing during RNA maturation. After intron splicing (i.e. removal), the mRNA consists only of exon-derived sequences, which are then translated into proteins.

Junk DNA: Components of an organism's DNA sequences that do not encode protein sequences. In many eukaryotes, a large percentage of an organism's total genome size is noncoding DNA, although the amount of noncoding DNA and the proportion of coding versus noncoding DNA vary greatly between species.

Library: A collection of DNA fragments with adapters ligated to each end. There are different types of DNA libraries, including cDNA libraries (formed from reverse-transcribed RNA) and genomic libraries (formed from genomic DNA).

Long Reads: Sequence reads that generally are at least 400 bp long in a single direction. Long reads make nucleic acid scaffold construction from subsequences an easier bioinformatic task. Currently offered by platforms such as 454 and Pacific Biosciences.

MeDIP: Methylated DNA immunoprecipitation. See "Methylation Sequencing" section for more details.

Metagenomics: The study of genetic material recovered directly from environmental samples. See our "metagenomics application" section for more details.

MethylCap: Methylated DNA capture by affinity purification. See "Methylation Sequencing" section for more details.

Multiplexing: Pooling of multiple adapter-barcoded libraries into a single sequencing run. Ideal for leveraging NGS capacity with multiple samples of small sequence size.
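
After a multiplexed run, reads have to be assigned back to their samples by barcode. The sketch below assumes, purely for illustration, that each barcode sits at the start of the read and must match exactly; real platforms may read the barcode separately and tolerate mismatches. The function demultiplex, the barcodes, and the reads are all made up.

```r
# Assign each read to a sample by exact match of its leading barcode.
demultiplex <- function(reads, barcodes) {
  bc_len <- nchar(barcodes[[1]])  # assumes all barcodes share one length
  prefix <- substr(reads, 1, bc_len)
  sample <- names(barcodes)[match(prefix, barcodes)]  # NA when no barcode matches
  data.frame(sample = sample,
             insert = substr(reads, bc_len + 1, nchar(reads)),  # read without barcode
             stringsAsFactors = FALSE)
}

barcodes <- c(sampleA = "ACGT", sampleB = "TTAG")
reads <- c("ACGTGGGCCC", "TTAGAAATTT", "ACGTTTTTTT", "CCCCGGGGAA")
demux <- demultiplex(reads, barcodes)
print(demux)
```

The last read carries no known barcode, so its sample is NA and it would normally be set aside as undetermined.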

N50: The contig length such that contigs of that length or longer together contain half of the bases of the entire genome (or assembly). The N50 size is computed by sorting all contigs from largest to smallest, finding the smallest set of the largest contigs whose sizes total at least 50% of the entire genome, and taking the length of the shortest contig in that set.
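
The N50 computation can be sketched in a few lines. This version uses the total length of the contigs as the denominator, which is an assumption; some tools use an estimated genome size instead. The function name n50 and the contig lengths are illustrative.

```r
# N50: length of the contig at which the running total of the sorted
# contig lengths first reaches half of the total assembly length.
n50 <- function(contig_lengths) {
  sorted <- sort(contig_lengths, decreasing = TRUE)
  running <- cumsum(sorted)
  sorted[which(running >= sum(sorted) / 2)[1]]
}

lens <- c(80, 70, 50, 40, 30, 20, 10)  # total = 300, half = 150
print(n50(lens))
```

Here the two largest contigs (80 + 70 = 150) reach half of the total, so the N50 is 70.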

Non-coding RNA: An RNA molecule that is not translated into a protein. This includes transfer RNA and ribosomal RNA (the primary constituent of ribosomes) as well as other classes of RNA.

Oligonucleotide: A short nucleic acid polymer, typically with fifty or fewer bases.

PAC: P1-artificial chromosome; a bacterially-propagated phagemid vector system suitable for cloning genomic inserts up to several hundred kilobases in size. PACs (as well as BACs) were key tools in the original work of sequencing the human genome.

Paired-end tags (PET): Short sequences at the 5' and 3' ends of the DNA fragment of interest (e.g., genomic DNA or cDNA) which are generally between 200 and 1000 bases apart. The use of paired-end tags can dramatically improve the ability to align reads against a reference genome as it both increases the effective read length and can span repeat regions which would have otherwise prevented unique alignment.

Pharmacogenomics: The study of the interaction of an individual's genetic makeup and response to a drug.

Pyrosequencing: A method of DNA sequencing based on the "sequencing by synthesis" principle. It differs from Sanger sequencing in that it relies on the detection of pyrophosphate release on nucleotide incorporation, rather than chain termination with dideoxynucleotides.

RAD-Seq: Restriction site associated DNA sequencing; a form of reduced complexity DNA sequencing which forgoes whole genome sequencing for the added benefit of reduced cost and increased speed.

Raw Read Accuracy: The accuracy of individual reads as measured directly from the sequencer. This is in contrast to "consensus accuracy" which measures the error rate after aligning multiple reads to a reference sequence. "Raw read accuracy" is a more direct measure of the performance of the instrument and is especially important for de novo sequencing applications where there is no consensus sequence against which to measure.

Repeat Regions: Genomic spans that are characterized by the repetition of a subsequence.

SBL: See "sequencing-by-ligation".

SBS: See "sequencing-by-synthesis".

Scaffold: A series of contigs that are in the right order but not necessarily connected in one contiguous stretch of sequence.

Sequencing-By-Ligation (SBL): Method of DNA sequencing that relies on enzymatic ligation of oligonucleotides that are adjacent through local complementarity on a template DNA strand. Detection of the oligonucleotide is normally through the release of a fluorescently labelled probe that corresponds to a known nucleotide at a known position along the oligo. This method is primarily used by Life Technologies' SOLiD sequencers.

Sequencing-By-Synthesis (SBS): Method of DNA sequencing that relies on incorporation of nucleotides by a DNA polymerase. The signal of nucleotide incorporation varies by platform: fluorescently labelled nucleotides, phosphate-driven light reactions, and hydrogen ion sensing have all been used.

Serial Analysis of Gene Expression (SAGE): A technique used to produce a snapshot of the messenger RNA population in a sample of interest in the form of small tags that correspond to fragments of those transcripts.

Short Reads: Sequence reads that are normally 150 bp long or shorter in one direction. They normally have higher accuracy than long reads. Currently offered by platforms such as Illumina, SOLiD and Ion Torrent.

Shotgun sequencing: A method used for sequencing long DNA strands. In shotgun sequencing, DNA is broken up randomly into numerous small fragments, which are sequenced using the chain termination method to obtain reads. Multiple overlapping reads for the target DNA are obtained by performing several rounds of this fragmentation and sequencing. Computer programs then use the overlapping ends of different reads to assemble them into a continuous sequence.

Single nucleotide polymorphism (SNP): A difference in one pair of nucleotides in the DNA sequence as compared to a reference sequence at the same locus.
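
As a toy illustration of the definition, single-base differences between a sample and a reference can be listed by comparing the two sequences position by position. The sketch assumes the sequences are already aligned and of equal length, with no insertions or deletions; the function name find_mismatches and both sequences are invented.

```r
# List positions where an aligned sample sequence differs from the reference.
find_mismatches <- function(reference, sample) {
  stopifnot(nchar(reference) == nchar(sample))  # no indels in this sketch
  ref <- strsplit(reference, "")[[1]]
  alt <- strsplit(sample, "")[[1]]
  pos <- which(ref != alt)
  data.frame(position = pos, ref = ref[pos], alt = alt[pos],
             stringsAsFactors = FALSE)
}

snps <- find_mismatches("ACGTACGT", "ACGAACGT")
print(snps)
```

The single difference is at position 4, where the reference T is replaced by A. Real variant calling works from many aligned reads and their quality scores rather than a single sequence.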

Splice Junction: A boundary between exon and intron sequences in the genome. In most eukaryote cases, splice junctions are bordered by two dinucleotide sequences, GT-AG.

Structural Variant: Local mutations that affect stretches of DNA, such as InDels and inversions.

Transcript: An RNA species created from a genomic template. Transcripts may be translated into functional proteins, act in an expression-regulating fashion, and even act as a DNA cleavage unit.

Transcriptome: The set of all RNA molecules, including mRNA, rRNA, tRNA, and other non-coding RNA produced in one or a population of cells. The term can be applied to the total set of transcripts in a given organism, or to the specific subset of transcripts present in a particular cell type.

Ultra-deep sequencing: Using the amplicon library procedure, targeted genes and chromosomal regions can be selectively amplified and sequenced to a very high depth. This will allow the detection of variants with frequency as low as 0.5% in a pool of heterogeneous variants. This method can be used to find rare mutations or other events such as rare transcripts or splice junctions.

Methylation Sequencing: The methylation state of DNA (specifically that of the base cytosine) has been shown to influence the expression of genes. For example, in mammalian cells higher levels of methyl CpG around the transcription start site (TSS) have been associated with transcriptional silencing, although more complex patterns in other regions of the genome are being revealed. The rise of next generation sequencing has enabled the transition from studying the methylation patterns of just a few genes or small regions to that of truly genome-wide studies, and this is leading to a far richer view of the methylome.
There are a variety of methods for monitoring the methylation status of the genome, but they can generally be placed into one of two categories - they either rely on bisulfite conversion or they employ a form of methylated DNA enrichment or pulldown. Bisulfite treatment of DNA converts all unmethylated cytosines to uracil (which is read as thymine) while 5-methylcytosine is left unchanged. Currently the most complete picture of the methylome is generated via whole genome bisulfite sequencing (WGBS). Unfortunately, it is also the most expensive at around 1.5X the cost of standard whole genome sequencing. In an effort to reduce the cost and to increase the sample throughput, several methods have been developed which limit sequencing to only a portion of the genome. One such example is "reduced representation bisulfite sequencing" (RRBS) which uses restriction enzymes and size selection to reduce the overall complexity of the genome while enriching for CpG islands (regions of high CpG density). It should be noted, however, that this enrichment introduces some biases as to which CpG sites end up being sequenced.
Another strategy for lowering the costs and increasing the throughput of methylation sequencing is to select for methylated DNA prior to the sequencing step. One common method is "methylated-DNA immunoprecipitation" (MeDIP-Seq). Similar to ChIP-Seq, MeDIP-Seq is performed by immunoprecipitating methylated DNA with an antibody raised against 5-methylcytosine. The unmethylated DNA is washed away, leaving the material highly enriched for methylated DNA. The presence or absence of a particular sequence gives an estimate of the level of methylation in that region of the genome. A similar method, called "methyl CpG immunoprecipitation" (MCIp), uses a methyl-CpG binding domain (MBD) protein to isolate the methylated regions of the genome. While offering genome-wide coverage, both of these methods introduce some level of bias and data interpretation can be tricky.
In general, methylation sequencing applications are most suitable for those platforms which generate a large amount of sequence per run. For example, WGBS requires about 1.5X the amount of sequence needed for standard genome sequencing. The reduced complexity methods require less sequencing, but the demands are still fairly high, with human samples requiring approximately 5-10 Gb per sample. For bisulfite converted samples, it should be noted that the reduced genomic complexity (with all unmethylated C's being converted to T's) can create alignment challenges. Finally, the ideal platform would be one that is able to read the methylation status of the DNA directly (without the need for bisulfite conversion). While preliminary proof of concept studies along these lines have been performed on the PacBio RS, it has not yet been transformed into a commercially viable application.
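
The effect of bisulfite conversion on a read, and the alignment challenge it creates, can be sketched in silico. The example below models only one strand, treats conversion as 100% efficient, and takes the methylated positions as given; the function name bisulfite_convert and the sequences are illustrative.

```r
# Convert every unmethylated C to T; C's at `methylated_pos` are protected.
bisulfite_convert <- function(seq, methylated_pos = integer(0)) {
  bases <- strsplit(seq, "")[[1]]
  convert <- bases == "C" & !(seq_along(bases) %in% methylated_pos)
  bases[convert] <- "T"
  paste(bases, collapse = "")
}

# The C at position 2 is methylated and survives; the C's at 5 and 6 become T.
converted <- bisulfite_convert("ACGTCCGA", methylated_pos = 2)
print(converted)

# Two different unmethylated sequences can become identical after conversion,
# which is why converted reads are harder to place uniquely.
same_after <- bisulfite_convert("ACGT") == bisulfite_convert("ATGT")
print(same_after)
```

Comparing a converted read with the unconverted reference then reveals methylation: a C that is still read as C was protected, while a reference C read as T was unmethylated.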