## Machine Learning for Functional Genomics II

Matt Hibbs
http://cbfg.jax.org

**Functional Genomics**
Identify the roles played by genes/proteins (Sealfon et al., 2006).

**Promise of Computational Functional Genomics**
[Figure labels: Data & Existing Knowledge; Laboratory Experiments; Computational Approaches; Predictions]

**Computational Solutions**
• Machine learning & data mining
  • Use existing data to make new predictions
    • Similarity search algorithms
    • Bayesian networks
    • Support vector machines
    • etc.
  • Validate predictions with follow-up lab work
• Visualization & exploratory analysis
  • Seeing and interacting with data is important
  • Show data so that questions can be answered
  • Scalability, incorporate statistics, etc.

**Bayesian Networks**
[Figure labels: Raining?; Jim brought umbrella; Cloudy this morning; Rain in forecast]
Encodes dependence relationships between observed and unobserved events

**Bayesian Network Overview**
• Graphical representation of relationships
• Probabilistic information from data to concepts

Query: P(FR | CE, AP, Y2H)

P(FR | CE=yes, AP=yes, Y2H=yes) = α P(FR) P(CE=yes | FR) Σ_PI P(PI | FR) P(AP=yes | PI) P(Y2H=yes | PI)

Bayes’ Rule: P(A | B) ∝ P(A) P(B | A)

Normalization: P(FR=yes | data) + P(FR=no | data) = 0.0105α + 0.0216α = 1
Result: P(FR=yes | data) = 0.0105 / (0.0105 + 0.0216) ≈ 0.327 (up from the prior of 0.10)

**Naïve Bayes**
• No internal hidden nodes
• Greatly simplifies the problem; reduces computational complexity and time
• Imposes an independence assumption

P(FR | D1, D2, D3, D4) = α P(FR) P(D1 | FR) P(D2 | FR) P(D3 | FR) P(D4 | FR)

Bayes’ Rule: P(A | B) ∝ P(A) P(B | A)
Assumes that all measures are conditionally independent given FR

**Steps for Bayesian network integration**
1. Construct a gold standard
2. Convert data to pair-wise format
3. Count positive/negative pairs in each dataset
4. Create CPTs to define Bayes net
5. Inference to calculate all pair-wise probabilities
6. Evaluate performance
7. Predict functions given network

**Gold Standard Construction**
• Gene Ontology annotations used to define known functional relationships
[Figure labels: threshold for positive relationships; threshold for negative relationships (Myers et al., 2006)]

**Gold Standard Used For Training**
[Figure labels: positive relationships; negative relationships; Global Gold Standard]

**Gene-Gene Scores**
• Binary data
  • PPI, co-localization, synthetic lethality
  • Can use binary scores
  • Can use profiles to generate scores (dot product)
• Continuous data
  • Profile distance metrics
• Binning results
  • Converts everything to discrete case

**Distance Metrics**
Euclidean Distance, Pearson Correlation, Spearman Correlation
• Choice of distance measure is important for quantifying relationships in datasets
• Pair-wise metrics: compare vectors of numbers
  • e.g., genes x and y, each with n measurements

**Sensible Binning**
• The commonly used Pearson correlation yields greatly different distributions of correlation from one dataset to another
• These differences complicate comparisons
[Figure: histograms of Pearson correlations between all pairs of genes (DeRisi et al., 97; Primig et al., 00)]

**Sensible Binning**
• Fisher Z-transform, Z-score equalize distributions
• Increases comparability between datasets
[Figure: histograms of Z-scores between all pairs of genes]

**Pre-calculation and Storage**
• Pair-wise distances only need to be calculated once, even if using different binnings
• Typical mouse microarray: ~5-20k genes
• 16M pair-wise distances
• ~50-700 MB of storage for one dataset
• ~800 datasets in GEO
• ~200 GB for all datasets

**Counting & Learning**
• Conceptually straightforward
• Counting
  • Look at every pair in each dataset, see which bin it falls into, and increment a counter
  • But… you need to do this 16M times/dataset
  • “Dumb” parallelization: each dataset is independent
• Learning CPTs
  • Fractions based on counts

**Inference**
• Also pretty straightforward
• For all pairs of genes…
  • For each dataset
    • Look up value from pre-calculated distances
    • Determine bin and value from CPT
    • Multiply probability into product
  • Do this for FR=yes and FR=no
  • Normalize out α
  • Store result
• 1.5 GB result file

**Evaluation Metrics**
• TPs, FPs, TNs, FNs
• Agnostic to pairs not appearing in standard
• ROC curves: Sensitivity-Specificity
• PR curves: Precision-Recall

**Precision Recall Curves**
• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
[Figure: precision (y-axis, 0 to 1) against recall (x-axis, 0 to 1), computed over ordered predictions]

**Summary Statistics**
• AUC: area under the (ROC) curve
  • equivalent to the normalized Mann-Whitney U statistic
• Average Precision: average of the precisions calculated at each true positive
  • quantized version of area under precision recall curve (AUPRC)
• Precision @ n% recall

**Graph Analysis for Predictions**
g_i
• c_i = confidence of function
• S = set of genes in function
• G = set of all genes
• w_i,j = weight of edge

**Steps for Our Evaluation**
1. Construct a gold standard
2. Convert data to pair-wise format
3. Count positive/negative pairs in each dataset
4. Create CPTs to define Bayes net
5. Inference to calculate all pair-wise probabilities
6. Evaluate performance
7. Predict functions given network

**Bayesian Network Integration**
[Figure: integration pipeline (Myers et al., 2005; Huttenhower et al., 2006; Guan et al., 2008)]
• Inputs
  • Gene expression: gene expression datasets 1, 2, …, N
  • Physical interactions: yeast two-hybrid dataset 1, co-precipitation dataset 1
  • Genetic interactions: synthetic lethality dataset, synthetic rescue dataset
  • Other: transcription factor binding sites, localization, curated literature
• Data integration via a Bayesian network
• Probabilistic, weighted networks of gene function
• User-selected query focuses search
• Results displayed: new genes predicted to interact with known mitochondrial genes

**Basic Approach Applied Several Times**
[Figure citations: Huttenhower et al., 2009; Myers et al., 2005; 2007; Guan et al., 2008; Huttenhower et al., 2007]

**Limitations and Improvements**
• Original work designed for yeast, and for a general notion of "functionally related"
  • Ignores reality that some genes are related only under certain conditions
  • Treats multi-cellular organisms as big single-celled organisms
• Increased specificity can be used to improve results
  • 2nd iteration of bioPIXIE included biological processes in gold standards
  • Currently working on 2nd generation mouseNET to account for tissue and developmental stages

**Global Gold Standard**
[Figure labels: positive relationships; negative relationships; Global Gold Standard]

**Specific Gold Standards**
• Not all datasets capture all functional relationships
• Process/Pathway specific
  • Functionally related genes aren’t always functionally related
• Tissue specific
• Developmental stage specific

**Specific Gold Standard Construction**
[Figure labels: positive relationships; negative relationships; Global Gold Standard; Specific Gold Standard]

**Tissue/Stage Gold Standards**
• Based on data from GXD
• Cross reference Theiler stages with mammalian anatomy hierarchy
• 729 total intersections
  • ranging from 50 to ~3500 genes
  • not including post-natal stages

**Preliminary Results**
[Figure panels: training evaluation; test evaluation]
Running 4-fold cross validation using tissue/stage specific GO-based gold standards

**Preliminary Results**
[Figure panels: training evaluation; test evaluation]
Accounting for developmental stage helps

**Preliminary Results**
[Figure panels: training evaluation; test evaluation]
Many specific tissue/stage combinations are overfitting

**Preliminary Results**
Folds were randomly generated and are biased; need to balance positives and negatives

**New Visualization Interface**
Graphle

**Simple Things Long Times**
• No single step is too complicated
• Mostly O(G²D)
  • 16M * 800 * 4
• Evaluating one fold ~7 hours
• So far have results for ~200 tissue/stages
  • Should take ~3 days on the cluster
  • Actually took ~15 days
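
The counting, CPT-learning, and inference steps from the "Counting & Learning" and "Inference" slides can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code behind the talk: the function names (`learn_cpts`, `posterior`, `normalize`), the dictionary-based data layout, and the `pseudocount` smoothing are all hypothetical choices made for the example.

```python
from collections import Counter

def normalize(unnorm_yes, unnorm_no):
    """Normalize out alpha: turn two unnormalized scores into P(FR=yes | data)."""
    return unnorm_yes / (unnorm_yes + unnorm_no)

def learn_cpts(datasets, gold, n_bins, pseudocount=1.0):
    """Count gold-standard pairs per bin, then convert counts to P(bin | FR).

    datasets: list of {pair: bin_index} dicts (already binned scores).
    gold:     {pair: True/False} for known related / unrelated pairs.
    pseudocount is an assumption of this sketch; it keeps empty bins nonzero.
    """
    cpts = []
    for binned in datasets:
        counts = {True: Counter(), False: Counter()}
        for pair, related in gold.items():
            if pair in binned:
                counts[related][binned[pair]] += 1
        cpt = {}
        for related in (True, False):
            total = sum(counts[related].values()) + pseudocount * n_bins
            cpt[related] = [(counts[related][b] + pseudocount) / total
                            for b in range(n_bins)]
        cpts.append(cpt)
    return cpts

def posterior(pair, datasets, cpts, prior):
    """Naive Bayes posterior P(FR=yes | data) for one gene pair."""
    score = {True: prior, False: 1.0 - prior}
    for binned, cpt in zip(datasets, cpts):
        if pair not in binned:
            continue  # dataset has no measurement for this pair
        for related in (True, False):
            score[related] *= cpt[related][binned[pair]]
    return normalize(score[True], score[False])
```

With the unnormalized values from the "Bayesian Network Overview" slide, `normalize(0.0105, 0.0216)` gives roughly 0.327, matching the posterior quoted there. Because each dataset is counted independently, the outer loop in `learn_cpts` is the part that could be split across workers.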
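
The "Distance Metrics" and "Sensible Binning" slides describe turning expression profiles into comparable, discretized pair-wise scores. The sketch below shows one plausible reading of that pipeline using only the Python standard library: Pearson correlation, Fisher Z-transform, Z-scoring within a dataset, then binning. The helper names, the clamping constant `eps`, and the bin edges are assumptions for illustration; the slides do not specify them.

```python
import bisect
import math
from statistics import mean, pstdev

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (undefined if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def fisher_z(r, eps=1e-6):
    """Fisher Z-transform, atanh(r); r is clamped so that |r| = 1 stays finite."""
    r = max(-1.0 + eps, min(1.0 - eps, r))
    return math.atanh(r)

def z_scores(values):
    """Standardize one dataset's transformed correlations to mean 0, std 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def to_bin(z, edges):
    """Bin index for z given sorted edges; yields len(edges) + 1 possible bins."""
    return bisect.bisect_right(edges, z)
```

For example, `to_bin(0.3, [-1.0, 0.0, 1.0])` places a Z-score of 0.3 in bin 2. Since the distances are computed once and binned afterwards, changing `edges` does not require recomputing correlations.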
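
The summary statistics named on the "Summary Statistics" slide can be computed directly from their definitions. The following is a small sketch, with hypothetical function names, that takes average precision as the mean of the precisions at each true positive and AUC as the normalized Mann-Whitney U statistic. The AUC helper compares every positive with every negative, so it is quadratic and only suitable for small examples; a rank-based version would be needed at the 16M-pair scale discussed in the slides.

```python
def average_precision(ranked_labels):
    """Mean of precision values taken at each true positive.

    ranked_labels: booleans ordered from most to least confident prediction.
    """
    hits, total = 0, 0.0
    for rank, is_positive in enumerate(ranked_labels, start=1):
        if is_positive:
            hits += 1
            total += hits / rank  # precision at this true positive
    return total / hits if hits else 0.0

def auc(scores, labels):
    """ROC AUC as the normalized Mann-Whitney U statistic (ties count as 1/2)."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For the ranking `[True, False, True, False]`, the precisions at the two true positives are 1/1 and 2/3, so `average_precision` returns about 0.833.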