WO2013024173A1

WO2013024173A1 - Computer implemented method for identifying regulatory regions or regulatory variations

Info

Publication number: WO2013024173A1
Application number: PCT/EP2012/066144
Authority: WO
Inventors: Melina Christine CLAUSSNITZER; Helmut LAUMEN; Hans Hauner
Original assignee: Technische Universität München
Priority date: 2011-08-17
Filing date: 2012-08-17
Publication date: 2013-02-21
Also published as: WO2013024175A2; WO2013024175A3

Abstract

The present invention relates in one aspect to a method and a computer program of identifying a regulatory region or a regulatory variation in a reference genome of a species, comprising obtaining sequence data of DNA sequences of the reference genome, defining reference regions of interest in the sequence data, identifying orthologous regions of at least one further species corresponding to the reference regions, analyzing the identified orthologous regions with regard to common patterns of regulatory elements, classifying the reference regions based on the analysis of the orthologous regions, and rating each reference region in accordance with its corresponding classification. With the method according to the first aspect of the invention, a phylogenetic module complexity analysis (PMC analysis) based on modular organization of genomic regions analyzed in several vertebrate species can be established as a reliable tool to predict functionality for example of non-coding SNP at transcriptional level and particularly to identify regulatory regions in the genome.

Description

COMPUTER IMPLEMENTED METHOD FOR IDENTIFYING REGULATORY REGIONS OR REGULATORY VARIATIONS

The present invention relates to a method according to the preamble of independent claim 1 and more particularly to a computer program according to independent claim 25. Such methods and computer programs can be used in genomic research and development for identifying a regulatory region or a regulatory variation in a reference genome of a species.

Recent advances in genome-wide association studies (GWASs) have led to an abundance of loci associated with diverse human traits ranging from body height and body mass index (BMI) to complex multifactorial diseases, such as neurological disorders, inflammatory diseases, cancer, cardiovascular diseases, and metabolic disorders (Hindorff, A Catalog of Published Genome-Wide Association Studies, Available at: www.genome.gov/gwastudies). Although the recent genetic technologies have provided important new information, signals emerging from GWASs, which are merely markers for linkage disequilibrium (LD) blocks, have rarely been traced to phenotypically causal variants. GWASs show that the majority of human genetic variations conferring common disease risk are located in non-coding regions (Hindorff (2009) Proceedings of the National Academy of Sciences 106: 9362-9367). Gene expression is highly heritable (Cheung (2005) Nature 437: 1365-1369; Dixon (2007) Nat. Genet 39: 1202-1207; Stranger (2007) Nat. Genet 39: 1217-1224) and trait-associated loci are enriched for expression quantitative trait loci (eQTLs) (Nicolae (2010) PLoS Genet 6: e1000888; Nica (2010) PLoS Genet 6: e1000895EP; Nica (2011) PLoS Genet 7: e1002003EP), suggesting that regulatory variants are major contributors to common disease susceptibility. Thus, the discovery of regulatory variants is central for delineating disease-promoting mechanisms underlying a genetic predisposition. In genomic research and development, for example in the field of pharmacy or biotechnology, the importance of regulatory elements in a genome is widely recognized and acknowledged. Type 2 diabetes (T2D) is a leading health problem, with a dramatically increasing prevalence worldwide (Danaei (2011) Lancet 378: 31-40; Doria (2008) Cell Metabolism 8: 186-200), resulting from a complex interaction of environmental factors acting on a susceptible genetic background (Prokopenko (2008) Trends Genet 24: 613-621).
To date, GWASs have revealed 48 established T2D susceptibility loci (Bonnefond (2010) Trends Mol.Med 16: 407-416; Dupuis (2010) Nat Genet 42: 105-116; Voight (2010) Nat Genet 42: 579-589). However, only two functional cis-regulatory SNPs could be mapped at the TCF7L2 (Gaulton (2010) Nat.Genet 42: 255-259; Stitzel (2010) Cell Metab 12, 443-455) and WFS (Stitzel (2010) Cell Metab 12, 443-455) loci by next generation sequencing approaches. Such experimental approaches are limited by the access to appropriate human tissue and further hampered by the spatial, temporal, environmental and epigenetic complexity of gene regulation (Nica (2011) PLoS Genet 7: e1002003EP; Dimas (2009) Science 325: 1246-1250). These limitations increase the demand for feasible bioinformatic approaches, which generally assume functionality for a candidate variant when it is located in a functional gene regulatory region.

Recent studies imply evolutionary constraint as a tool in the search for regulatory elements in the genome (Pennacchio (2006) Nature 444: 499-502). Thereby, sequences conserved over greater evolutionary distances are supposed to be more likely functional than those conserved over lesser distances. So far, phylogenetic conservation has been a common tool to search for these regulatory regions in the non-coding genome. However, genome-wide comparisons revealed that some highly conserved non-coding sequences do not exhibit the expected pattern for the known types of transcriptional regulatory elements (Siepel (2005) Genome Res 15: 1034-1050; Maston (2006) Annu. Rev. Genomics Hum. Genet 7: 29-59) and several unconserved regions have been shown to be functionally relevant in driving specific gene expression (Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project (2007) Nature 447: 799-816). Also in disease genetics, novel powerful approaches (Lindblad-Toh (2011) Nature 478: 476-482) used constraint to predict regulatory variants in the non-coding genome for further follow-up. However, inter-individual and inter-species differences in gene expression often occur at the level of TF binding (Kasowski (2010) Science 328: 232-235), and rapid evolutionary turnover of transcription factor binding sites (TFBSs) results in regulatory regions with functional conservation (Sosinsky (2007) Proceedings of the National Academy of Sciences 104: 6305-6310; Ludwig (2000) Nature 403: 564-567; He (2011) PLoS Genet 7: e1002053EP; Taher (2011) Genome Research 21: 1139-1149; Blow (2010) Nat Genet 42: 806-810; Kunarso (2010) Nat Genet 42: 631-634; Schmidt (2010) Science 328: 1036-1040) (Birney (2007) Nature 447: 799-816), making the use of sequence conservation particularly challenging for the detection of disease-conferring regulatory variants in the genome. In eukaryotes, gene expression is generally controlled by functional gene regulatory regions.
These cis-regulatory modules (CRMs) comprise complex patterns of co-occurring TFBSs (here referred to as TFBS modules) for the coordinated binding of transcription factors (TFs) (Pennacchio (2006) Nature 444: 499-502; Arnone (1997) Development 124: 1851-1864). TFBS modules integrate a variety of upstream signals and translate this information into the level of gene expression, making them a likely source for phenotypic change and adaptive evolution (Junion (2012) Cell 148: 473-486; Kantorovitz (2009) Developmental Cell 17: 568-579; Kim (1997) Molecular Cell 1: 119-129).

Increasing evidence suggests that computational methods for the identification of cis-regulatory sequences based on sequence similarities reveal relatively high false-positive rates and that sequence alignment does not provide a reliable approach to identify cis-regulatory regions (Maston (2006) Annu. Rev. Genomics Hum. Genet 7: 29-59; Balhoff (2005) Proc.Natl.Acad.Sci.U.S.A 102: 8591-8596; Cameron (2005) Proc.Natl.Acad.Sci.U.S.A 102: 11769-11774; Fisher (2006) Science 312: 276-279; Narlikar (2009) Brief.Funct.Genomic.Proteomic 8: 215-230; van Loo (2009) Brief. Bioinform 10: 509-524; Visel (2009) Nature 461: 199-205).

Given the low predictive value of approaches relying on conservation analysis, the identification of regulatory regions and regulatory variation in the human genome has turned out to be a very challenging problem for genomic research.

Further, clustering and modular organization of cis-regulatory elements, necessary for interaction of transcription factors, is a hallmark of regulatory regions (Arnone (1997) Development 124: 1851-1864; Pennacchio (2006) Nature 444: 499-502). Transcription factor binding sites (TFBS) being part of a module are separated by sequences with variable length and identity, thereby allowing certain degeneracy within an orthologous regulatory region (van Loo (2009) Brief. Bioinform 10: 509-524). Consequently, a regulatory module might be disregarded by sequence similarity approaches despite being completely functional, which calls for a refinement of computational approaches.

Therefore, there is a need for a reproducible method or system allowing an efficient and reliable identification of a regulatory region or a regulatory variation in a genome.

According to the invention this need is settled by a method as it is defined by the features of independent claim 1. In particular, by a method, preferably a computer implemented method, of identifying a regulatory region or a regulatory variation in a reference genome of a species, comprising: obtaining sequence data of DNA sequences of the reference genome; defining reference regions of interest in the sequence data; identifying orthologous regions of at least one further species corresponding to the reference regions; analyzing the identified orthologous regions with regard to common patterns of regulatory elements; classifying the reference regions based on the analysis of the orthologous regions; and rating each reference region in accordance with its corresponding classification. Preferably, the reference regions and the identified orthologous regions are analyzed with regard to common patterns of regulatory elements. Analyzing the identified orthologous regions with regard to common patterns of regulatory elements preferably comprises analysing a set of orthologous sequences for each orthologous region. Classifying the reference regions may be further performed based on the analysis of the reference regions and the orthologous regions. Classifying the reference regions based on the analysis of the orthologous regions preferably comprises the step of summarizing the regulatory elements in modules. According to the invention this need is further settled by a computer program as it is defined by the features of independent claim 25, i.e., by a computer program comprising code means being adapted to implement the method according to the invention when being executed. Preferred embodiments are subject of the dependent claims. GWASs have revealed numerous risk loci associated with a diverse range of diseases. However, prioritizing the disease-causing variants remains a major challenge in medical genetics.
Divergence in gene expression influenced by cis-regulatory variation is central to human disease susceptibility. The invention provides the phylogenetic module complexity analysis (PMCA), a complementary bioinformatic approach for delineating cis-regulatory variants within GWAS loci. For example, PMCA can be used to prioritize cis-regulatory variants within disease susceptibility loci from GWASs. PMCA integrates evolutionary conservation with a complexity assessment of co-occurring regulatory elements, such as transcription factor binding sites (TFBSs). In accordance with the invention it was demonstrated that the evolutionarily conserved complexity of co-occurring TFBSs (TFBS modularity) in a candidate variant-surrounding region is predictive for its cis-regulatory functionality. Thus, one aspect of the invention, i.e., the assessment of conserved TFBS patterns within variant-surrounding genomic regions, may help to translate genetic association signals to disease-underlying molecular mechanisms.

In particular, the gist of the invention is the following: A method of identifying a regulatory region or a regulatory variation in a reference genome of a species comprises obtaining sequence data of DNA sequences of the reference genome, defining reference regions of interest in the sequence data, identifying orthologous regions of at least one further species corresponding to the reference regions, analyzing the identified orthologous regions with regard to common patterns of regulatory elements, classifying the reference regions based on the analysis of the orthologous regions, and rating each reference region in accordance with its corresponding classification. Preferably, classifying the reference regions based on the analysis of the orthologous regions comprises summarizing the regulatory elements in modules (classification strategy 1). In this context, the term "orthologous regions" relates to regions of sequences of genomes of diverse separate species being separated by a speciation event. In other words, orthologs are genomic regions in different species that are similar to each other because they originated by vertical descent from a single gene of the last common ancestor. The identified reference regions can be of interest for a variety of purposes such as the described identification of regulatory regions in the genome and the understanding of evolution. The method according to the invention can particularly be a computer implemented method and the reference genome can particularly be a human reference genome. Within the method according to the invention orthologous regions can be searched and identified in the genomes of several species for reference sequence data of interest from the reference genome.
In particular, by combining the identification of the orthologous regions with the analysis thereof regarding the common patterns and by classifying the reference regions using the results of said analysis, the method, particularly when being implemented on a computer, allows for an efficient identification of regulatory regions in a genome, and based thereon regulatory variations can also be identified efficiently. In this context, "regulatory variation" relates to a mutation in the genome that affects the regulatory activity of a regulatory region.

In particular, the method according to the invention can allow refining the computational search for functionally relevant sequences which might for example be affected by single nucleotide polymorphisms (SNPs). In this context, the term "single nucleotide polymorphism" relates to DNA sequence variations occurring if a single nucleotide in the genome differs between members of one single species or paired chromosomes in an individual. Accordingly, a "regulatory variation" can be, e.g., a SNP or another mutation. Preferably, the "regulatory variation" is a SNP. For example, PMCA can separate candidate SNPs that are located in complex regions (predicted cis-regulatory) from SNPs located in non-complex regions.

Further, by means of the method according to the invention a sizable set of functionally conserved but non-orthologous elements in the human genome that might be unconstrained across mammals can be supposed which can improve the identification of regulatory regions compared to the assumption of the presence of conserved function encoded by conserved orthologous bases. With the method according to the invention, a phylogenetic module complexity analysis (PMC analysis) based on modular organization of genomic regions analyzed in several vertebrate species can be established as a reliable tool to predict functionality for example of a non-coding SNP at transcriptional level and particularly to identify regulatory regions in the genome.

When identifying the orthologous regions of at least one further species corresponding to the reference regions within the method according to the invention, an orthologous region of each further species can be searched for each reference region, and each orthologous region found in a further species can be associated with its corresponding reference region.

Preferably, defining the reference regions of interest in the sequence data and identifying the orthologous regions of the at least one further species corresponding to the reference regions comprise determining a single nucleotide polymorphism (SNP) and identifying the reference regions spanning the SNP. Thereby, the reference region can also span a plurality of SNPs. In this way, orthologous regions can be searched and identified in the genomes of several species for sequence data of interest from the reference genome, with a SNP in the reference sequence. Thus, the method can allow refining the computational search for functionally relevant sequences which might for example be affected by the SNP.

Preferably, identifying the orthologous regions of the at least one further species corresponding to the reference regions comprises aligning the reference regions with the orthologous regions. Thereby, the aligning of the reference regions with the orthologous regions preferably comprises obtaining further sequence data of DNA sequences of the at least one further species and aligning the sequence data with the further sequence data. Aligning the sequence data with the further sequence data preferably comprises providing a specific data set of input sequences for each reference region and its orthologous regions. Therein, the method preferably further comprises assessing a minimum number of the input sequences for each reference region and determining a modular structure of each reference region in a stepwise manner up to the minimum number. In this context, the term "modular structure" relates to the structure with regard to the occurrence of the regulatory elements. In this way, the number of input sequences can be extended stepwise in order to assess a modular pattern of the reference region. Thereby, the minimum number of the input sequences preferably is assessed as a percentage of the total number of the input sequences. This can allow for a convenient handling and efficient assessment of the minimum number of input sequences.

Preferably, the minimum number of the input sequences is the minimum number of input sequences to contain a common module, i.e. a module that is present, for example, in different species and comprises at least one regulatory element. Thereby, the common module preferably comprises a plurality of regulatory elements in a specific order and/or at a specific distance from each other, wherein the specific distance can be defined as a distance range. In this regard, analysing the identified orthologous regions of the sequence data with regard to the common regulatory elements preferably comprises predefining a maximum distance variance between two regulatory elements within the common module, a range of the distance between two regulatory elements within the common module and a range of the number of regulatory elements within the common module.

Preferably, aligning the sequence data with the further sequence data comprises aligning a plurality of base pair sequences of the sequence data with a corresponding plurality of base pair sequences of the further sequence data. In case the SNP is determined within the method, the base pair sequences of the sequence data preferably comprise a base pair sequence having the single nucleotide polymorphism essentially in the middle.
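The extraction of a base pair sequence having the SNP essentially in the middle might be sketched as follows. This is only an illustrative sketch: the function name `snp_window`, the 0-based index convention and the clipping behaviour at the sequence ends are assumptions that are not part of the description; the default length of 120 bp corresponds to the SNP-surrounding sequence length mentioned for the figures.

```python
def snp_window(sequence: str, snp_index: int, length: int = 120) -> str:
    """Return up to `length` bases of `sequence` with the SNP roughly centred.

    `snp_index` is the 0-based position of the SNP (an assumed convention).
    Near the sequence ends the window is shifted so that it stays inside
    the sequence, in which case the SNP is no longer exactly central.
    """
    if not 0 <= snp_index < len(sequence):
        raise ValueError("SNP position lies outside the sequence")
    half = length // 2
    start = max(0, snp_index - half)
    end = min(len(sequence), start + length)
    start = max(0, end - length)  # shift left if clipped at the right end
    return sequence[start:end]


if __name__ == "__main__":
    reference = "A" * 100 + "G" + "T" * 100  # toy sequence, SNP at index 100
    window = snp_window(reference, 100, length=20)
    print(len(window), window)
```

The same window definition would then be applied to the reference sequence before searching for the orthologous sequences of the further species.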

Preferably, analysing the identified orthologous regions of the sequence data with regard to the common regulatory elements comprises extracting a common framework of regulatory elements from the specific data set of the input sequences.

As mentioned above, classifying the reference regions based on the analysis of the orthologous regions preferably comprises summarizing the regulatory elements in modules (classification strategy 1). Thereby, the regulatory elements preferably are summarized in modules according to the formula:

\[ \sum_{x=i}^{\max} \sum_{y=k}^{j} \text{sites in } y\text{-element modules in } \frac{\min x \text{ sequences}}{\text{total input sequences}} \]

Alternatively or in addition thereto, the regulatory elements preferably are summarized in modules according to the formula:

\[ \sum_{x=i}^{\max} \sum_{y=k}^{j} \text{sites in } y\text{-element modules in } \min x \text{ sequences} \]

Preferably, classifying the reference regions based on the analysis of the orthologous regions comprises summarizing all common regulatory elements (classification strategy 2). Thereby, all common regulatory elements preferably are summarized according to the formula:

\[ \sum_{x=i}^{\max} \text{sites in } \frac{\min x \text{ sequences}}{\text{total input sequences}} \]

Alternatively or in addition thereto, all common regulatory elements preferably are summarized according to the formula:

\[ \sum_{x=i}^{\max} \text{sites in } \min x \text{ sequences} \]

Preferably, classifying the reference regions based on the analysis of the orthologous regions comprises summarizing the number of modules (classification strategy 3). Thereby, the number of modules preferably is summarized according to the formula:

\[ \sum_{x=i}^{\max} \sum_{y=k}^{j} y\text{-element modules in } \frac{\min x \text{ sequences}}{\text{total input sequences}} \]

Alternatively or in addition thereto, the number of modules preferably is summarized according to the formula:

\[ \sum_{x=i}^{\max} \sum_{y=k}^{j} y\text{-element modules in } \min x \text{ sequences} \]

In the formulas provided herein, above and below, the following definitions apply: x or ζ is the minimum number of sequences per total input sequences that has to contain a common framework, j is the number of input sequences.

The terms "site(s)", "element(s)" and "regulatory element(s)" are used interchangeably herein. Preferably, the regulatory elements are transcription factor binding sites (TFBS), methylation sites, miRNA seats and/or regions of open chromatin. Most preferably, the regulatory elements are TFBS. Preferably, rating each reference region in accordance with its corresponding classification comprises determining that a region is a regulatory region when a condition with regard to the classification is met. Meeting the condition with regard to the classification can, e.g., be that one or several values calculated as described above are above or below a specific threshold. In particular, a region or a SNP-spanning region meeting each criterion can be rated as regulatory. As described herein below, in accordance with the present invention, a PMCA score can be identified for three different classification strategies.

Another aspect of the invention relates to a computer program comprising code means being adapted to implement the method defined above when being executed. Such a computer program allows for efficiently and conveniently implementing and distributing the method according to the invention as well as preferred embodiments thereof.

Still another aspect of the invention relates to a computing system such as, e.g., a processor-based desktop computer or a processor based computing apparatus integrated into another device (embedded system). The system comprises computing means (such as a processor) and a memory which are arranged to perform at least part of the steps of the methods according to the invention. In particular, the computing system comprises means for identifying a regulatory region or a regulatory variation in a reference genome of a species.

Preferably, the computing means and the memory of the computing system are arranged to perform further steps of the methods according to the invention as described above. Also, the computing system can comprise an interface for interacting with other devices, such as input/output devices. For instance, the computing system may comprise an interface for interacting with a device for obtaining sequence data of DNA sequences.

As used herein, the term "computing system" refers to a device/system that computes, especially a programmable electronic machine that performs high-speed mathematical or logical operations or that assembles, stores, correlates, or otherwise processes information. Examples include, without limitation, mainframe computers, personal computers (desktop computers or laptops) and handheld devices (e.g. tablet computers).

The computer system may further include a user interface device which may include, but is not limited to, a computer, a television, a portable media device, a keyboard and/or a web-enabled device, such as a cellular phone, a personal data assistant, and the like. The computer system may also include a display device for receiving a signal from the electronic memory and for displaying the results of the algorithm according to the present invention.

Definitions

According to the invention, the PMCA analysis can be used to assess the modularity of regulatory elements (such as TFBS), i.e. to assess functional conservation of regulatory elements (such as TFBS).

Therefore, for each reference region (e.g. each SNP-surrounding region), the identified orthologous sequences (ortholog sets) can be analysed for the occurrence of complex modules (e.g. complex TFBS modules) common to a defined subset x of input sequences. It is noted that x and ζ are used interchangeably herein. In addition, y and γ are also used interchangeably herein.

1. module

A module describes the occurrence of two or more regulatory elements (e.g. TFBSs) in a defined orientation and distance range in all or a subset of the input sequences. The terms "module" and "framework" are used interchangeably herein. The detection of conserved modules (e.g. TFBS modules) relies on the Quorum constraint, Element constraint and Distance constraints (see below).

2. Quorum constraint

The minimum number x of input sequences to contain (a) common module(s). The quorum constraint can be assessed by both, sequence number constraint (absolute number of input sequences) and percent sequence constraint (percentage of input sequences) (see Materials and Methods 2.1 , below). The threshold x of sequences that are required to share the common module(s) (e.g. TFBS module(s)) can be stepwise raised.

3. Element constraint

The number γ of regulatory elements (e.g. TFBSs) in the module. The threshold γ of regulatory elements (e.g. TFBSs) that are required to fulfil the percent sequence constraint and sequence number constraint can be stepwise raised.

4. Distance constraints

As used herein, the "distance constraint" is a parameter which sets the minimum and maximum possible distance between the anchors of the regulatory elements. The "distance variation" is defined as a parameter which sets the maximum possible variation of distances between the anchors of the regulatory elements. An "anchor" of a regulatory element is the center position of a matrix. A "distance variation" is a particular range of base pairs, and a module satisfies the distance variation parameter if the distances between the anchors of the regulatory elements in the input sequences do not differ more than said particular range of base pairs. For example, the maximum distance variance between two TFBSs within a TFBS module may be set to 10 bp.
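The two distance parameters defined above can be illustrated with a short check for a pair of regulatory elements. This is a sketch under stated assumptions: the function name, the representation of each input sequence as a pair of anchor positions and the example values are hypothetical; only the 10 bp default for the distance variation is taken from the example given in the text.

```python
from typing import List, Tuple


def satisfies_distance_constraints(
    anchor_pairs: List[Tuple[int, int]],
    min_distance: int,
    max_distance: int,
    max_variation: int = 10,
) -> bool:
    """Check the distance constraint and the distance variation for one element pair.

    `anchor_pairs` holds, for every input sequence that contains both elements,
    the anchor (matrix centre) positions of the two elements in that sequence.
    """
    if not anchor_pairs:
        return False
    distances = [abs(second - first) for first, second in anchor_pairs]
    # Distance constraint: every anchor distance lies in [min_distance, max_distance].
    in_range = all(min_distance <= d <= max_distance for d in distances)
    # Distance variation: the distances differ by at most `max_variation` bp.
    variation_ok = max(distances) - min(distances) <= max_variation
    return in_range and variation_ok


if __name__ == "__main__":
    # Anchor positions of two elements in three orthologous sequences (toy values).
    pairs = [(10, 30), (12, 35), (50, 75)]
    print(satisfies_distance_constraints(pairs, min_distance=5, max_distance=200))
```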

5. Common elements

A regulatory element (e.g. a TFBS) was considered a "common element" if the number x of input sequences within the ortholog sequence set containing the element on any strand (sense or antisense) was above quorum (percent sequence constraint and sequence number constraint). These common elements are used to build the potential element modules.
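A minimal sketch of this quorum test is given below. The input format (one set of element names per input sequence, with hits on both strands already merged), the function name and the parameter names are assumptions made for illustration; "above quorum" is interpreted here as "at least the minimum number x", which is likewise an assumption.

```python
from collections import Counter
from typing import Iterable, List, Set


def common_elements(
    hits_per_sequence: List[Iterable[str]],
    min_sequences: int,
    min_fraction: float,
) -> Set[str]:
    """Return the elements that meet both quorum constraints.

    An element is counted once per input sequence, regardless of strand or of
    how often it occurs in that sequence.
    """
    total = len(hits_per_sequence)
    if total == 0:
        return set()
    counts = Counter()
    for hits in hits_per_sequence:
        counts.update(set(hits))
    return {
        element
        for element, n in counts.items()
        # sequence number constraint and percent sequence constraint
        if n >= min_sequences and n / total >= min_fraction
    }


if __name__ == "__main__":
    ortholog_hits = [{"SP1", "NFKB"}, {"SP1"}, {"SP1", "NFKB"}, {"GATA"}]
    print(sorted(common_elements(ortholog_hits, min_sequences=2, min_fraction=0.5)))
```

Raising `min_sequences` step by step corresponds to the stepwise raising of the threshold x described for the quorum constraint.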

6. Classification strategies

As described herein above and below, complexity assessment of element modularity (e.g. TFBS modularity) in a candidate variant-surrounding region can be performed based on three different measures for a defined quorum:

6.1 Number of elements in y-element modules (classification strategy 1)

6.2 Number of common elements (classification strategy 2)

6.3 Number of y-element modules (classification strategy 3)

7. P-Estimates and scoring criteria

The classification strategies 1 to 3 are used to rank reference regions (e.g. SNP-surrounding regions). Scoring criteria for separating complex regions from non-complex regions are preferably obtained by simulations with random sequences derived from the orthologous sets. From each orthologous set, random sets are derived by shuffling the sequences. Each sequence is traversed with a particular window (e.g. a 10 bp window) in steps (e.g. steps of 10 bp). Within the window, nucleotide positions are exchanged randomly, leaving the local nucleotide distribution nearly untouched while changing the exact sequence. Preferably, each of these sets may be used for each classification strategy. The observations from the analysis of the original ortholog sets are preferably compared, for each classification strategy, to the random data. Thus, an estimate for random occurrence is obtained for each classification strategy:

\[ \text{p-est.} = \frac{\#\,\text{occurrences}(\text{random} \geq \text{original})}{\#\,\text{random sets}} \]

i.e. the number of occurrences where the value from a random set is equal to or larger than that obtained from analysis of the original ortholog set, divided by the number of random sets. This can be taken as an estimate for the probability (p-estimate, further on called p-est.) to obtain the classification value of the orthologous set by chance.
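The shuffling of a sequence within windows and the resulting p-estimate might be sketched as follows. The function names and the optional random generator argument are assumptions introduced for illustration; the window size of 10 bp follows the example in the text, and the classification values passed to `p_estimate` would come from whichever classification strategy is being evaluated.

```python
import random
from typing import Optional, Sequence


def shuffle_in_windows(sequence: str, window: int = 10,
                       rng: Optional[random.Random] = None) -> str:
    """Shuffle nucleotide positions inside consecutive, non-overlapping windows.

    The base composition of every window is preserved, so the local nucleotide
    distribution stays nearly untouched while the exact sequence changes.
    """
    rng = rng or random.Random()
    pieces = []
    for start in range(0, len(sequence), window):
        chunk = list(sequence[start:start + window])
        rng.shuffle(chunk)
        pieces.append("".join(chunk))
    return "".join(pieces)


def p_estimate(original_value: float, random_values: Sequence[float]) -> float:
    """Fraction of random sets whose value is equal to or larger than the original."""
    if not random_values:
        raise ValueError("at least one random set is required")
    hits = sum(1 for value in random_values if value >= original_value)
    return hits / len(random_values)


if __name__ == "__main__":
    seq = "ACGTACGTACGGTTTTAAAACCC"
    print(shuffle_in_windows(seq, window=10, rng=random.Random(0)))
    print(p_estimate(5, [1, 5, 7, 2]))
```

Passing a seeded `random.Random` instance makes the random sets reproducible, which may be convenient when results need to be compared between runs.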

An additional score (here referred to as combined score) is preferably obtained by multiplying the p-est. of each classification strategy and taking the negative decadic logarithm of the product, resulting in a scale ranging from 0 (completely random in all classifications) to 9 (one or fewer random occurrences for each of the three classification strategies):

\[ \text{combined score} = -\log_{10}\left[(\text{p-est. common TFBS}) \cdot (\text{p-est. TFBS modules}) \cdot (\text{p-est. TFBS in modules})\right] \]
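As a worked illustration of the combined score, the following sketch multiplies the three p-estimates and takes the negative decadic logarithm. It assumes 1,000 random sets per ortholog set and treats zero random occurrences like a single occurrence, so that the score is capped at 9; the function and parameter names are hypothetical.

```python
import math


def combined_score(p_common: float, p_modules: float, p_in_modules: float,
                   n_random_sets: int = 1000) -> float:
    """Negative log10 of the product of the three p-estimates.

    Each p-estimate is floored at 1 / n_random_sets (an assumption), which
    avoids log10(0) and reflects "one or fewer random occurrences".
    """
    floor = 1.0 / n_random_sets
    product = 1.0
    for p in (p_common, p_modules, p_in_modules):
        product *= max(p, floor)
    return -math.log10(product)


if __name__ == "__main__":
    # p-estimates for common TFBS, TFBS modules and TFBS in modules (toy values).
    print(combined_score(0.001, 0.01, 0.1))
```

With three p-estimates of 0.001 the product is 10^-9, which gives the upper end of the scale, while three p-estimates of 1 give a score of 0.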

In one embodiment of the invention, to separate complex versus non-complex regions, the following cut-off criteria can be used:

# common TFBS (percent sequence constraint) < 0.15; # common TFBS (sequence number constraint) < 0.075; combined score (sequence number constraint) > 6.5;
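Applying these cut-off criteria could look like the following sketch. It assumes that the first two criteria refer to the p-estimates of the number of common TFBS under the percent sequence constraint and the sequence number constraint, and that a region is rated complex only if all three criteria are met; both points are interpretations, and the function and parameter names are hypothetical.

```python
def is_complex_region(p_common_percent: float,
                      p_common_number: float,
                      combined_score_number: float) -> bool:
    """Rate a region as complex when all three cut-off criteria are met.

    p_common_percent: p-est. of # common TFBS, percent sequence constraint (assumed).
    p_common_number: p-est. of # common TFBS, sequence number constraint (assumed).
    combined_score_number: combined score, sequence number constraint.
    """
    return (
        p_common_percent < 0.15
        and p_common_number < 0.075
        and combined_score_number > 6.5
    )


if __name__ == "__main__":
    print(is_complex_region(0.01, 0.01, 7.2))
    print(is_complex_region(0.20, 0.01, 7.2))
```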

These and other aspects of the invention will be apparent from and elucidated with reference to the embodiment(s) described hereinafter.

The method and computer program according to the invention are described in more detail hereinbelow by way of exemplary embodiments and with reference to the attached drawings, in which:

Fig. 1 shows a flowchart providing a general overview of an embodiment of the invention;

Fig. 2 shows a flowchart of the embodiment of Fig. 1 providing further information about the involved parameters, tools and results;

Fig. 3 shows a flowchart providing information about the determination of single nucleotide polymorphisms within the embodiment of Fig. 2;

Fig. 4 shows a flowchart providing information about the identification of reference regions spanning the single nucleotide polymorphisms resulting from the embodiment of Fig. 2;

Fig. 5 shows a flowchart providing information about the definition of common complex modules depending on quorum constraint and element constraint within the analysis of the orthologous regions of the embodiment of Fig. 2;

Fig. 6 shows a flowchart providing information about the classification of the reference regions based on the analysis of the orthologous regions within the embodiment of Fig. 2;

Fig. 7 shows a plot of r² values and recombination rates as a function of genomic position within the embodiment of Fig. 2;

Fig. 8 shows a plot of the classification strategy 1 assessing sum of all elements which are part of common modules within the embodiment of Fig. 2;

Fig. 9 shows another plot of the classification strategy 2 within the embodiment of Fig. 2;

Fig. 10 shows a plot of the classification strategy 3 assessing the number of modules within the embodiment of Fig. 2;

Fig. 11 shows plots regarding the responsibility of rs10830956 for genotype and cell type specific difference in transcriptional activity within the embodiment of Fig. 2;

Fig. 12 shows plots regarding the analysis of SNP-adjacent regions for modular complexity;

Fig. 13 shows plots relating to EMSA with Cy5-labeled probes matching the respective major and minor allele of regulatory and non-regulatory SNP;

Fig. 14 shows plots regarding firefly luciferase from constructs transfected into different cell lines;

Fig. 15 shows a plot documenting the assessment of the degree of conservation in the respective SNP-spanning regions (1000 bp, UCSC multiz alignment);

Fig. 16 shows a further plot documenting the assessment of the degree of conservation in the respective SNP-spanning regions (1000 bp, UCSC multiz alignment);

Fig. 17 shows schemata and plots providing information about the discovery and validation of cis-regulatory SNPs; a,b, Analysis workflow of PMCA. a, This figure illustrates the discovery of cis-regulatory variants among multiple candidate SNPs derived from GWAS-identified LD blocks. PMCA uses functional conservation of a SNP-surrounding genomic region to predict SNPs that influence gene expression at the cis-regulatory level. At the end, SNPs that are located in regions with a conserved complex TFBS modularity (complex SNP regions, predicted cis-regulatory) were separated from non-complex SNP regions. b, Main features of PMCA performance (schematic). PMCA combines phylogenetic conservation with a complexity assessment of TFBS modules, represented by TFBS module constraint scores. The genomic sequence (120 bp) around a candidate SNP was extracted (human sequence, GRCh37 / hg19) and used to search for orthologous sequences in 16 vertebrate species (ortholog set, upper panel). The resulting ortholog sets were analysed for the presence of TFBSs (squares) that are common among several species, and conserved TFBS modules (patterns of TFBSs (colored squares) with at least two co-occurring TFBSs in a defined orientation and distance range, grey shading). The graphic displays the three classifications based on number (#) of common TFBS, # of conserved TFBS modules, and # of TFBS in conserved TFBS modules. For phylogenetic comparison, the measures of all three strategies were assessed for sequence number constraint and percent sequence constraint. Random occurrence of all three measures was estimated by applying the same analysis to 1,000 randomly shuffled sequences per ortholog set (lower panel). Random sequences were generated by shuffling the nucleotide positions for each ortholog sequence within 10 bp windows. Details on PMCA are given in the Materials and Methods section (see below).
Performance of PMCA for eight T2D susceptibility loci comprising 200 candidate SNP-surrounding regions (1000 Genomes Pilot 1 data (CEU), LD block settings: R² > 0.7, distance limit 500 kb). c-e, Box-and-whisker plots of the numbers obtained for each classification strategy in the analysis based on sequence number constraint. Plots show the distributions for # common TFBSs (c), # conserved modules (d) and # TFBSs in conserved modules (e), including the median (horizontal bars), the interquartile region (IQR) representing the middle 50% range (boxes), extreme values (whiskers) and outliers (dots). Data points covered by the IQR and the whisker values were explicitly added as rug at the sides of the plot. The median for complex regions (highlighted in red) was higher than for the non-complex regions for each classification. f,g, Density histograms for the distribution of common TFBS and combined TFBS module constraint scores. PMCA data were related to the random occurrence of measures in 1,000 shuffled sequences per ortholog set. Results are depicted by estimated random probabilities (p-est.). The p-est. distributions derived from the analysis based on sequence number constraint are shown as -log10(p-est.) for common TFBS (f) and for the combined scores obtained over all classification strategies (g). The blue curve illustrates the empirical density function of the histogram data. The red vertical dashed line indicates the cut-off separating complex from non-complex regions (SNP-surrounding regions with a value left of this were defined as non-complex). The isolated peak at the right (low p-est. / high score data) refers to data points that hit the lower limit of p-est. calculations. h-j, Validation of predicted cis-regulatory effects for complex SNP regions. h,i, Cis-regulatory predictions for complex SNP regions (red dots) were validated at the level of DNA-binding activity and transcriptional activity.
Non-complex SNP regions were included as a control (half of the control regions matched or exceeded the median of common TFBS density of the complex regions, median = 88). h, Electrophoretic mobility shift assays (EMSAs) were performed with Cy5-labelled probes matching the risk and non-risk alleles of the respective SNP-surrounding regions, reflecting the allele-specific change in DNA-binding activity. The quantified change in fluorescence comparing the risk and non-risk alleles is shown for each SNP as mean and standard deviation of repeated measurements (|risk / non-risk| > 1), logarithmic scale. P-values are derived from paired t-tests, error bars show s.d., n = 4. i, Reporter assays were performed with luciferase promoter constructs matching the risk and non-risk alleles of the respective SNP-surrounding region, reflecting the allele-specific change in transcriptional activity. The change in relative luciferase expression comparing the risk and non-risk alleles is shown for each SNP as the mean of repeated measurements (|risk / non-risk| > 1), logarithmic scale. P-values are derived from paired t-tests, error bars show s.d., n = 9. j, Cell type-specific cis-regulatory effects of complex SNP regions. Luciferase constructs of the respective complex SNP regions were transfected in INS1 pancreatic β-cells (insulin secretory cell line), and 3T3-L1 adipocytes, C2C12 myocytes and Huh7 cells (insulin responsive cell lines), respectively. The allele-dependent fold change in relative luciferase activity comparing the risk and non-risk alleles is shown for each SNP, representing an activating or a repressing effect of the risk allele on transcriptional activity ((risk allele / non-risk allele) > 1 or < 1, respectively), logarithmic scale, n = 9.

Fig. 18 shows plots showing positional bias of distinct homeobox TFBSs at SNP position in T2D complex SNP regions; a,b,e,f, Distribution of TFBS matrices relative to SNP position (denoted by grey dashed lines) within complex SNP regions at eight T2D loci (a,b), eight asthma loci (d,e) and an extended set of 47 T2D susceptibility loci (i,j), assessed by positional bias analysis. Positional bias was calculated from TFBS match occurrence over 1,000 bp SNP regions for 192 TFBS matrix families (Genomatix Matrix Library version 8.4) within sliding 50 bp windows under a binomial distribution model (detailed in the Materials and Methods section, below). a,e, Positional bias profiles are shown for TFBS matrix family distribution in complex SNP regions matching the criteria of central SNP position and -log10(P) > 6. Positional bias analysis within complex SNP regions reveals specific clustering at SNP position of five distinct homeobox TFBS matrix families for T2D (a) and EBOX and EGRF for asthma susceptibility loci (e). b,f, Positional bias profiles within T2D and asthma complex SNP regions do not show a peak at SNP position for all, but the limited TFBS matrix set shown in (a) and (e) (exemplarily represented by a subset of analysed TFBS families). c-h,l,m, Comparison of bias profiles for the same TFBS matrices as in (a), (e) and (i) versus 100 sets of size-adjusted randomly selected NCBI dbSNP regions (grey lines) (see also Materials and Methods section, below; the P-value distribution of positional bias is shown in Supplementary Figure 25). c-h, The specific clustering of distinct homeobox TFBS families at SNP position in T2D complex regions versus the specific clustering of EGRF and EBOX matrices at asthma complex SNP regions is a distinguishing feature of the respective traits. i-k, Distribution of TFBS matrices relative to SNP position within complex and non-complex SNP regions calculated for an extended set of 47 T2D risk loci.
i,j, Bias profiles for TFBS matrix family distribution in complex SNP regions (i,j) and non-complex SNP regions (k) calculated for an extended set of 47 T2D risk loci. i, Specific clustering at T2D complex SNP region position for the same set of homeobox TFBS matrix families as shown for (a) and additionally six further homeobox TFBS matrix families (-log10(P) > 6) is shown. j, Positional bias profiles do not show a peak at SNP position for all, but the limited TFBS matrix set shown in (i) (exemplarily represented by a subset of analysed TFBS matrix families). k, All analysed TFBS matrix families displayed equal distributions within non-complex SNP regions (exemplarily represented by a subset of TFBS families). l, Random distribution and frequencies of positional bias profiles for the same homeodomain TFBS matrix families as described in (i) obtained from 100 sets of 487 randomly selected NCBI dbSNP regions (see Materials and Methods section, below). The plot displays the relative frequencies of the maximum bias positions, i.e., the bias in randomly chosen SNP regions is not focused on the SNP position. m, Comparison of positional bias profiles for the same TFBS matrices as in (i) versus in total 48,700 random NCBI dbSNP region sets (grey lines). The specific clustering of distinct homeobox TFBS matrix families at SNP position in T2D complex regions distinguishes them from randomly selected SNPs (P-value distribution in Figure 25).

Fig. 19 shows plots providing information about correlations of PMCA results with evolutionary constraint elements; a, The occurrences of 487 complex and 978 non-complex T2D-associated SNP regions within constraint elements according to the SiPhy-π algorithm described in Lindblad-Toh (2011) Nature 478: 476-482 are shown (± 500 bp from the midpoints of constraint elements) (see Materials and Methods section, below). Complex SNP regions are enriched nearby evolutionary constraint regions in contrast to non-complex SNP regions. For localization of SNPs within the analysed 47 T2D risk loci relative to TSS see Figure 27. b, The Venn diagram illustrates the number of complex and non-complex SNP regions that map directly to a constraint element (overlap). Complex and non-complex SNP regions do not directly overlap with constraint elements. c, Experimentally validated cis-regulatory complex SNP regions at the PPARG gene locus map nearby constraint elements, though not directly matching the constraint regions (for experimental validation see Figure 20); Zoom out: the cis-regulatory rs4684847-surrounding region, located 393 bp upstream of the nearest constraint element, is shown. The TFBS modularity at the rs4684847-surrounding region is exemplarily illustrated by one TFBS module that is conserved across five vertebrate species (the sequence logo for the rs4684847-surrounding region was constructed from 5 orthologous regions, visually showing the conservation at each of the alignment positions; for a given position in the matrix, the combined height of the bases represents the information content at that position, whereas the relative heights of the individual bases represent the frequency of that base at that position).
The representative TFBS module harbours the homeobox TFBS matrix PRRX1, whose T2D-distinct matrix family clustering at complex SNP position was found by positional bias analysis (Figure 18 c,g,m) and whose regulatory function on endogenous PPARG2 gene expression was validated by siRNA knockdown experiments (Figure 20).

Fig. 20 shows plots and schemata providing validation of genotype-dependent cis-regulatory predictions at the PPARG diabetes risk locus and identification of the homeobox TF PRRX1 regulating endogenous PPARγ2 expression; a, Regional LD plot for the PPARG gene locus at 3p25.2, associated with T2D. The lead SNP rs1801282 (Pro12Ala) is plotted with SNPs in LD (minor allele frequency > 1%) against genomic position in a 200 kb interval (NCBI GRCh37/hg19). SNPs are shown as diamonds and coloured according to their pair-wise correlation (pair-wise LD based on 1000 Genomes Pilot 1 CEU data using the SNAP Proxy Tool, Broad Institute). The dashed lines indicate the location of SNPs which are in strong LD (R² > 0.7) with rs1801282. Predicted cis-regulatory SNPs are denoted with red lines, the PPARG gene and exons are denoted in blue. Plots were prepared using R version 2.15. Zoom-in, schematic: The human PPARG gene, PPARγ1-3 mRNA isoforms (coding exons: boxes; untranslated exons: dashed boxes; introns: lines; promoters: arrows) and predicted cis-regulatory SNPs (red). b,c, Genotype-dependent mRNA levels in samples of primary human adipose tissue cells, homozygous and heterozygous for the risk allele (genotyped for Pro12Ala and rs4684847, R² = 1.0). P-values from Mann-Whitney U test, error bars show s.d. b,c, Mean mRNA levels of PPARγ2 and PPARγ1 isoforms were assessed by qRT-PCR in homozygous risk allele carriers (n = 9) and heterozygous subjects (n = 5), normalized to mean levels in homozygotes.
d,f,k,l, Reporter assays were performed with luciferase promoter constructs matching the risk and non-risk alleles of the respective SNP-surrounding region, reflecting the allele-specific change in transcriptional activity. P-values were derived from paired t-tests, error bars show s.d. d, Validation of cis-regulatory predictions for discovered complex SNP regions (red) at the PPARG locus (LD block settings: R² > 0.7, distance limit 500 kb). Non-complex SNP regions (black) were included as control. The allele-dependent fold change in relative luciferase activity in differentiated 3T3-L1 adipocytes comparing the risk and non-risk alleles is shown on the log2 scale for each SNP region, representing an activating or repressing effect of the risk allele on transcriptional activity, n = 3-14 repeated measurements per SNP. e, Luciferase assays with constructs harbouring the rs4684847-surrounding region in 5'-, 3'-, forward and reverse orientation (arrows) transfected in differentiated 3T3-L1 adipocytes, n = 9. f, Density histogram of the positional bias distribution for the homeobox CART matrix family obtained from 48,700 random NCBI dbSNPs versus complex SNP regions calculated from 47 T2D susceptibility loci (denoted by a red arrow). g, The CART matrix family member PRRX1 matches the rs4684847 (C->T) variant. The TFBS modularity at the complex region surrounding rs4684847 is illustrated by one conserved TFBS module comprising the putative binding sequences for TEF, LHXF (grey shading) and PRRX1 (in a consistent orientation and distance range across several species) as an example. The motif logos for the PRRX1, TEF and LHXF positional weight matrices⁶⁵ that match a putative binding site were rendered by WebLogo⁶⁶. h,i, EMSAs with Cy5-labelled probes matching the risk and non-risk alleles of rs4684847. h, Adipocyte protein extracts reveal allele-specific protein binding. i, Increased PRRX1 binding at the risk allele.
Competition assays were performed with increasing excess of cold PRRX1 probe, and PRRX1 antibody was used to shift the protein-DNA complex in EMSAs with PRRX1 protein ectopically expressed in 293T cells. j, Luciferase assays with truncated constructs (deletion of predicted PRRX matrix without affecting the rs4684847 position) in differentiated 3T3-L1 adipocytes show abrogated allelic cis-regulatory activity, n = 9. k, Luciferase assays with or without concomitant ectopic expression of PRRX1 in 3T3-L1 adipocytes reveal inhibition of luciferase activity at the rs4684847 risk allele. Luciferase activity normalized to pCMV control, n = 9. l,m,n, Silencing of endogenous PRRX1 expression in samples of primary human adipose tissue cells (m,n) and human SGBS preadipocytes (o), all homozygous for the risk allele at rs4684847 (siPRRX1, siNT = non-targeting siRNA control). l,m, Mean mRNA levels of PRRX1, PPARγ1, PPARγ2 (m) and PPARγ2 target genes (n) assessed by qRT-PCR (calculated relative to IPO8 mRNA and normalized to the mean expression level in siNT treated cells) 3-4 days after induction of adipogenic differentiation. P-values were derived from Wilcoxon signed rank test, error bars show s.d. n, Oil Red O lipid staining in human SGBS preadipocytes treated with siPRRX1 or siNT for six days after induction of adipogenic differentiation. o, Correlation of PRRX1 mRNA levels with ln-transformed BMI, HOMA-IR, and TG/HDL ratio. PRRX1 mRNA levels were assessed by qRT-PCR in mature human adipocytes obtained from patients undergoing elective surgery (n = 22). Partial regression plots adjusting for age and sex are shown.

Fig. 21 shows plots showing that there is no apparent positional bias of TFBS matrices at non-complex regions; Distribution of TFBS matrices relative to SNP position (denoted by grey dashed lines) within non-complex regions calculated for the set of eight T2D loci (a) and eight asthma loci (b) by positional bias analysis. Positional bias was calculated from TFBS match occurrence over 1,000 bp SNP regions for 192 TFBS matrix families (Genomatix Matrix Library version 8.4) within sliding 50 bp windows under a binomial distribution model (detailed in the Materials and Methods section, below). Bias profiles are exemplarily presented for a subset of analysed TFBS matrix families including the matrix families which matched the selection criteria of central SNP position and -log10(P) > 6 in the complex SNP regions (Figure 18). At the end, positional bias analysis within non-complex SNP regions reveals no apparent clustering at SNP position for any of the analyzed TFBS matrix families in either T2D-associated or asthma-associated risk loci. TFBS matrix family distribution within non-complex SNP regions for the extended set of 47 T2D risk loci is shown in Figure 18k.

Fig. 22 shows plots showing random distribution and frequencies of positional bias profiles in eight T2D and eight asthma risk loci; Positional bias analysis within complex SNP regions revealed specific clustering at SNP position for a set of five distinct homeobox TFBS matrix families (DLXF, PAXH, LHXF, CART, PDX) in eight T2D susceptibility loci and for two TFBS matrix families (EBOX, EGRF) in eight asthma susceptibility loci (Figure 18a,e). Random distribution and frequencies of positional bias profiles for both sets of TFBS matrix families were obtained from 100 sets of 487 randomly selected NCBI dbSNP regions (see Materials and Methods section, below). The plot displays the relative frequencies of the maximum bias positions for the set of T2D- (a) and of asthma- (b) specific TFBS matrix families. The results show that the bias in randomly chosen SNP regions is not focused on the SNP position for both sets of TFBS matrix families.

Fig. 23 shows plots showing performance of PMCA for eight asthma susceptibility loci;

PMCA results are shown for eight asthma susceptibility loci comprising 208 candidate SNP-surrounding regions (1000 Genomes Pilot 1 data CEU, LD block settings: R² > 0.7, distance limit 500 kb). PMCA results are illustrated for number (#) of PMCA measures (a-c) and for estimated random probability constraint scores (d,e). a-c, Box-and-whisker plots of the numbers obtained for each classification strategy in the analysis based on sequence number constraint. Plots show the distributions for # common TFBSs (a), # conserved modules (b) and # TFBSs in conserved modules (c), including the median (horizontal bars), the interquartile region (IQR) representing the middle 50% range (boxes), extreme values (whiskers) and outliers (dots). Data points covered by the IQR and the whisker values were explicitly added as rug at the sides of the plot. The median for complex regions (highlighted in red) was higher than for the non-complex regions for each classification. d,e, Density histograms for the distribution of common TFBS and combined TFBS module constraint scores. PMCA data were related to the random occurrence of measures in 1,000 shuffled sequences per ortholog set. Results are depicted by estimated random probabilities (p-est.). The p-est. distributions derived from the analysis based on sequence number constraint are shown as -log10(p-est.) for common TFBS (d) and for the combined scores obtained over all classifications (e). The blue curve illustrates the empirical density function of the histogram data. The red vertical dashed line indicates the cut-off separating complex from non-complex regions (SNP-surrounding regions with a value left of this were defined as non-complex). The isolated peak at the right (low p-est. / high score data) refers to data points that hit the lower limit of p-est. calculations.

Fig. 24 shows plots showing performance of PMCA for 47 T2D susceptibility loci;

PMCA results are shown for 47 T2D susceptibility loci comprising 1,453 candidate SNP-surrounding regions (1000 Genomes Pilot 1 data CEU, LD block settings: R² > 0.7, distance limit 500 kb). PMCA results are illustrated for number (#) of PMCA measures (a-c) and for estimated random probability constraint scores (d,e). a-c, Box-and-whisker plots of the numbers obtained for each classification strategy in the analysis based on sequence number constraint. Plots show the distributions for # common TFBSs (a), # conserved modules (b) and # TFBSs in conserved modules (c), including the median (horizontal bars), the interquartile region (IQR) representing the middle 50% range (boxes), extreme values (whiskers) and outliers (dots). Data points covered by the IQR and the whisker values were explicitly added as rug at the sides of the plot. The median for complex regions (highlighted in red) was higher than for the non-complex regions for each classification. d,e, Density histograms for the distribution of common TFBS and combined TFBS module constraint scores. PMCA data were related to the random occurrence of measures in 1,000 shuffled sequences per ortholog set. Results are depicted by estimated random probabilities (p-est.). The p-est. distributions derived from the analysis based on sequence number constraint are shown as -log10(p-est.) for common TFBS (d) and for the combined scores obtained over all classifications (e). The blue curve illustrates the empirical density function of the histogram data. The red vertical dashed line indicates the cut-off separating complex from non-complex regions (SNP-surrounding regions with a value left of this were defined as non-complex). The isolated peak at the right (low p-est. / high score data) refers to data points that hit the lower limit of p-est. calculations.

Fig. 25 shows plots providing information about the distribution of P-values for positional bias in NCBI dbSNP random sets versus complex SNP regions calculated from 47 T2D susceptibility loci; A specific set of homeobox TFBS matrices revealed a central positional bias at SNP position in the complex SNP regions of 47 T2D loci (Figure 18). Each density histogram shows the positional bias distribution obtained from a random set of SNPs (48,700) for the distinct set of homeobox TFBS matrices (a) and for eight representative TFBS matrices without positional bias (b). The dashed grey vertical lines indicate the 95% and 99% quantiles of the P-value distribution for the set of randomly selected SNPs. The P-value obtained within the complex SNP regions (predicted cis-regulatory) is indicated by a red arrow. At the end, the P-value position for complex SNP regions is significantly higher than the 99% quantile in the random SNP set for the homeobox TFBS matrices (a), as opposed to all other TFBS matrices (b).

Fig. 26 shows a plot providing information about the frequency distribution for fractions of complex SNP regions obtained for 47 analysed T2D LD blocks; PMCA separates the SNPs at susceptibility loci into complex and non-complex SNP regions. The frequency histogram (bin of LD block sizes = 0.05) displays the fractions of complex SNP regions in the 47 analysed T2D susceptibility LD blocks. The frequency distribution illustrates that the number of complex SNP regions identified per LD block spreads over a large range (median = 29%, average = 34.2% (vertical dashed line), s.d. = 22.6 (horizontal arrow)).

Fig. 27 shows plots providing information about the distance to transcriptional start sites (TSSs) for complex and non-complex SNP regions obtained for 47 analysed T2D LD-blocks;

Density histograms show all distances (bin size 500 bp) between SNPs and TSSs (TSS annotated within 30,000 bp downstream of SNP position). The distance distribution is shown for 474 complex SNP regions (a) and 976 non-complex SNP regions (b) identified by PMCA within the set of 47 T2D loci. The histogram shapes of (a) and (b) illustrate the equal positioning of both PMCA categories, complex and non-complex SNP regions, relative to downstream TSSs.

Fig. 28 shows a plot providing the validation of cis-regulatory predictions at the PPARG T2D risk locus;

Cis-regulatory predictions for complex SNP regions were validated at the level of transcriptional activity. Non-complex SNP regions were included as a control. Reporter assays were performed with luciferase promoter constructs matching the risk and non-risk alleles of the respective SNP-surrounding regions, reflecting the allele-specific change in transcriptional activity. The fold change in relative luciferase expression comparing the risk and non-risk alleles is shown for each SNP as the mean of repeated measurements (|risk / non-risk| > 1), logarithmic scale. P-values are derived from paired t-tests, error bars show s.d., n = 9.

Fig. 29 shows a plot providing information about the allele-dependent repression of reporter gene activity in C2C12 myocytes and Huh7 hepatocytes;

Luciferase assays in 3T3-L1 adipocytes, Huh7 hepatoma cells, C2C12 muscle cells, INS1 pancreatic β-cells and 293T cells reveal cell type-specific cis-regulatory activity of rs4684847. Reporter assays were performed with luciferase promoter constructs matching the risk and non-risk alleles of the rs4684847 SNP-adjacent region, reflecting the allele-specific change in transcriptional activity. P-values were derived from paired t-tests, error bars show s.d., n = 9.

Fig. 30 shows a plot providing information about the regulation of PRRX1, PPARγ1 and PPARγ2 mRNA expression by siRNA knockdown of PRRX1 in human SGBS adipocytes; PRRX1 siRNA (siPRRX1) and non-targeting siRNA (siNT) were transfected in human SGBS preadipocytes (homozygous for the risk allele, genotyped for rs4684847 and Pro12Ala, R² = 1.0). The mean mRNA levels of PRRX1, PPARγ1 and PPARγ2 assessed by qRT-PCR (normalized to the mean expression levels in siNT treated cells) are shown, 3-4 days after induction of adipogenic differentiation. P-values were derived from Mann-Whitney U test, error bars show s.d., n = 3; and

Fig. 31 shows plots providing information about the association of PRRX1 mRNA levels with ln-transformed BMI, HOMA-IR, and TG/HDL ratio. PRRX1 mRNA levels were assessed by qRT-PCR in mature human adipocytes (n = 22). Partial regression plots adjusting for age, sex and BMI are shown.

Mode(s) for Carrying Out the Invention

In the Figs., an embodiment of the method according to the invention is shown which is also referred to as phylogenetic module complexity analysis (PMC analysis). Within this embodiment, the complexity of modules can be proven as a predictor for the identification of functional regulatory regions spanning a single nucleotide polymorphism (SNP). Based on this, the PMC analysis is established for the precise bioinformatic identification of regulatory regions or regulatory mutations, respectively.

In the PMC analysis shown in Fig. 1 as a bioinformatic approach, defining reference regions of interest in sequence data of DNA sequences of a reference genome and identifying orthologous regions of at least one further species corresponding to the reference regions, i.e. a phylogenetic analysis (quorum constraint), is combined with analyzing the identified orthologous regions with regard to common patterns of regulatory elements, i.e. a module complexity analysis (element constraint).

The detailed workflow of the PMC analysis is as follows:

First Exemplified Mode for Carrying Out the Invention

Firstly, the phylogenetic analysis, i.e. an analysis of module conservation for each SNP spanning sequence over 13 vertebrate species, is performed. Thereby, as a first step the genomic context of each SNP is determined and orthologous regions are searched. As shown in Fig. 4, using the subtask 'search for orthologous regions' of the software RegionMiner of the company Genomatix, Munich, Germany, 120 base pair (bp) sequences of the human genome comprising the respective SNP in the middle are aligned with 12 closely related and distant vertebrate species. The reference genome is human (Homo sapiens) and the aligned genomes are rhesus macaque (Macaca mulatta), common chimpanzee (Pan troglodytes), mouse (Mus musculus), rat (Rattus norvegicus), horse (Equus caballus), dog (Canis lupus familiaris), cow (Bos taurus), pig (Sus scrofa), opossum (Monodelphis domestica), platypus (Ornithorhynchus anatinus), zebrafish (Danio rerio) and chicken (Gallus gallus).
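The definition of a 120 bp reference region with the SNP in the middle can be sketched as follows. This is only a minimal illustration and not part of RegionMiner: the function name `snp_window` and the use of 1-based inclusive coordinates are assumptions, and because a window of even length cannot be centred exactly, the sketch arbitrarily places the SNP at the 61st base.

```python
def snp_window(snp_pos: int, length: int = 120) -> tuple[int, int]:
    """Return 1-based inclusive (start, end) of a window spanning the SNP.

    Assumption: for an even window length the SNP is placed at base
    length // 2 + 1 of the window (the 61st base for 120 bp).
    """
    half = length // 2
    start = snp_pos - half
    end = start + length - 1
    if start < 1:
        raise ValueError("window would start before the first base")
    return start, end


# Hypothetical SNP at chromosomal position 1000
start, end = snp_window(1000)
window_length = end - start + 1
```

The resulting coordinates would then be handed to whatever tool performs the search for orthologous regions; that step is not modelled here.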

This initial identification of orthologous regions by RegionMiner yields a specific data set of sequences for each SNP spanning reference region. As shown in Table 1, variable numbers of orthologous regions are found for each input region, posing problems of comparability.

Table 1: Orthologous regions of the FAT3_MTNR1B linkage disequilibrium (LD) locus, total and quorum constraint input numbers

In Table 1 the quorum constraints are shown for the MTNR1B locus. In particular, the minimum number x of input sequences that must contain a common module is shown in percent of total input sequences, and the modular structure of a given region is determined stepwise for 50%-100% of total input sequences, i.e. the quorum constraint. To meet the mentioned problems of comparability, for all further analyses the minimum number x of input sequences to contain a common module is assessed in percent of total input sequences, and the modular structure of a given region is determined stepwise for 50%-100% of total input sequences. This approach might introduce a bias towards low-conserved regions, meaning regions that do not exceed 5 orthologous regions. For example, the variant of reference SNP (RS) rs7112766 located in the FAT3_MTNR1B linkage disequilibrium (LD) block implicates such a bias, as the quorum constraint of 50-60% of input sequences targets exclusively primate orthologous regions (see Table 1), as does the variant rs11523890, which per se has 3 orthologous regions. To meet these concerns, results are counterchecked for input sequences per se, in order to avoid systematic effects of the drawback explained above. As a result, the minimum number x of input sequences to contain a common module is defined, and the quorum constraint is set as the percent of total input sequences and as sequences per se, respectively.
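How a percentage quorum constraint translates into a minimum number x of sequences can be illustrated with a short sketch. The step width of 10%, the rounding up to the next whole sequence and the function name `quorum_thresholds` are assumptions made for illustration; the text above only states that 50%-100% of total input sequences are evaluated stepwise.

```python
def quorum_thresholds(total_inputs: int,
                      percents=(50, 60, 70, 80, 90, 100)) -> dict[int, int]:
    """Map each percentage quorum to the minimum number x of sequences.

    Assumption: fractions are rounded up, computed with integer
    arithmetic to avoid floating point artefacts.
    """
    return {p: (total_inputs * p + 99) // 100 for p in percents}


# A region with only 3 orthologous regions (as stated for rs11523890)
low_conserved = quorum_thresholds(3)
# A region found in all 13 species
well_conserved = quorum_thresholds(13)
```

Under these assumptions a 50% quorum is already met by 2 sequences for a region with 3 orthologous regions, whereas it requires 7 sequences for a region found in all 13 species, which is the comparability problem that the countercheck on sequences per se is meant to address.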

Secondly, the module complexity analysis is performed, wherein sets of the orthologous sequences are analyzed for common patterns of regulatory elements, such as transcription factor binding sites (TFBS) patterns, to identify phylogenetically conserved modules, i.e. regulatory structures. Thereby, the software marked as "FrameWorker" of the company Genomatix Company, Munich, Germany may be used which is a complex software tool allowing users to extract a common framework of elements or modules from a set of input DNA sequences.

In one exemplified embodiment of the invention, the FrameWorker library is selected as follows: library version is matrix library 8.1 , matrix group is vertebrates as the general core promoter elements, matrix filters are ubiquitous, core similarity is 0.75 and matrix similarity is optimized. Further, as shown in Fig. 7, Fig. 8, Fig. 9 and Fig. 10 as well as Fig. 5, for the definition of common modules, certain FrameWorker parameter settings are defined as follows: quorum constraint is the minimum number of input sequences to contain a framework, sequence constraints is none, distance constraints are 10 as maximum distance variance between two elements, distance between two elements is minimum 5 and maximum 200 and element constraints are minimum 2 and maximum 10 as number of elements in models.

The distance range between two elements is prescribed between minimum 5 and maximum 200 base pairs. The quorum constraint, meaning the lower limit of sequences within the input set that has to contain the common framework, is set as the percent of total input sequences and sequences per se as described above. Modules are defined by all regulatory elements, such as TFBS, that occur in the same order and in a certain distance range in a defined subset of input sequences. Depending on the element constraint a module consists in the following of a distinct number y of elements (y-element module).
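The module definition above can be sketched for the simplest case of a 2-element module. The following Python sketch is a simplified interpretation for illustration and is not the algorithm of the FrameWorker software: it assumes one position per element per sequence, interprets the distance variance as the allowed spread of the element distance across sequences, and uses hypothetical element names.

```python
def supports_two_element_module(sequences, first, second, quorum,
                                min_dist=5, max_dist=200, max_variance=10):
    """Check whether `first` followed by `second` forms a common module.

    sequences: list of dicts mapping element name -> position (bp), one
               dict per orthologous sequence (a simplifying assumption).
    quorum:    minimum number of sequences that must contain the module.
    """
    distances = []
    for sites in sequences:
        if first in sites and second in sites:
            d = sites[second] - sites[first]  # negative if order is reversed
            if min_dist <= d <= max_dist:
                distances.append(d)
    distances.sort()
    # Largest group of sequences whose distances differ by <= max_variance.
    best = 0
    left = 0
    for right, d in enumerate(distances):
        while d - distances[left] > max_variance:
            left += 1
        best = max(best, right - left + 1)
    return best >= quorum


# Hypothetical element names "A" and "B" in four orthologous sequences.
orthologs = [
    {"A": 10, "B": 40},   # distance 30
    {"A": 12, "B": 45},   # distance 33
    {"A": 8, "B": 90},    # distance 82, too far from the other two
    {"B": 5, "A": 30},    # wrong order
]
print(supports_two_element_module(orthologs, "A", "B", quorum=2))  # True
print(supports_two_element_module(orthologs, "A", "B", quorum=3))  # False
```

Raising the quorum from 2 to 3 sequences removes the module in this toy example, which mirrors how stricter quorum constraints reduce the number of common modules found.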

The modular complexity of a region is assessed by classifying the reference regions based on the analysis of the orthologous regions and rating each reference region in accordance with its corresponding classification. In particular, this is performed based on three different classification strategies for each defined quorum constraint, i.e. the minimum number of orthologous regions considering phylogenetics, as follows:

In a classification strategy 1 (elements in modules), the number of regulatory elements, such as TFBS, in y-element modules (2 ≤ y ≤ 10) is set equal to the sum of all elements which are part of common modules for a defined quorum constraint, as percentage of total input sequences with the formula

$$\sum_{x=50\%}^{100\%}\ \sum_{y=2}^{10} \text{sites in } y\text{-element modules in } \frac{\text{min } x \text{ sequences}}{\text{total input sequences}} \qquad (1)$$

and as sequences per se with the formula

$$\sum_{x=2}^{\max 13}\ \sum_{y=2}^{10} \text{sites in } y\text{-element modules in min } x \text{ sequences}. \qquad (2)$$

In a classification strategy 2 (all common elements), the number of regulatory elements, such as TFBS, in 1-element modules (all common sites) is calculated for a defined quorum constraint, as percentage of total input sequences with the formula

$$\sum_{x=50\%}^{100\%} \text{sites in } \frac{\text{min } x \text{ sequences}}{\text{total input sequences}} \qquad (3)$$

and as sequences per se with the formula

$$\sum_{x=2}^{\max 13} \text{sites in min } x \text{ sequences}. \qquad (4)$$

In a classification strategy 3 (number of modules), the number of y-element modules (2 ≤ y ≤ 10) is set equal to the number of modules for a defined quorum constraint, as percentage of total input sequences with the formula

$$\sum_{x=50\%}^{100\%}\ \sum_{y=2}^{10} y\text{-element modules in } \frac{\text{min } x \text{ sequences}}{\text{total input sequences}} \qquad (5)$$

and as sequences per se with the formula

$$\sum_{x=2}^{\max 13}\ \sum_{y=2}^{10} y\text{-element modules in min } x \text{ sequences}. \qquad (6)$$
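The sums of formulas (1) to (6) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the input data structures and the function name are hypothetical, and elements shared between overlapping modules are simply counted once per module, which the described embodiment does not specify.

```python
def classification_scores(modules_per_quorum, common_sites_per_quorum):
    """Compute the three classification strategy scores for one region.

    modules_per_quorum:      {quorum x: {y: number of y-element modules}}
    common_sites_per_quorum: {quorum x: number of common elements (sites)}

    The quorum keys may be percent steps (formulas 1, 3, 5) or absolute
    sequence numbers (formulas 2, 4, 6); the summation is the same.
    """
    cs1 = 0  # strategy 1: elements in modules
    cs3 = 0  # strategy 3: number of modules
    for by_y in modules_per_quorum.values():
        for y, n_modules in by_y.items():
            if 2 <= y <= 10:
                cs1 += y * n_modules  # assumption: y sites per module
                cs3 += n_modules
    cs2 = sum(common_sites_per_quorum.values())  # strategy 2: all common sites
    return cs1, cs2, cs3


# Hypothetical counts for two percental quorum steps.
modules = {50: {2: 3, 3: 1}, 60: {2: 1}}
sites = {50: 12, 60: 7}
print(classification_scores(modules, sites))  # (11, 19, 5)
```

The same function can be called once with percental quorum keys and once with absolute quorum keys to obtain the two values per strategy that are compared in the counter-check.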

The classification and rating approach is explained in more detail with Fig. 7, Fig. 8, Fig. 9 and Fig. 10 for the SNP-spanning sequences of the FAT3_MTNR1B locus. Initially, SNP in close linkage disequilibrium with the GWAS-identified tagSNP are defined (r² > 0.7, 500 kb, HapMap CEU). The SNP rs10830963 and rs1387153 tag seven SNP in close linkage, any of which is a putative causal candidate for the association signal with FPG and T2D (see Fig. 7).

Figs. 7 to 10 further show that the FAT3_MTNR1B LD block harbors the functional regulatory SNP rs10830956. In particular, Fig. 7 shows estimated r² values and recombination rates from HapMap CEU with a cut-off of r² > 0.7, plotted as a function of genomic position (NCBI build Hg18). In linkage disequilibrium (LD), r² is equal to D² divided by the product of the four allele frequencies at the two loci, wherein D is a measure for the disequilibrium and D = P_AB - P_A × P_B, with A and B being two loci with two alleles each (A, a and B, b), P_AB being the frequency of the haplotype that consists of alleles A and B, P_A being the frequency of allele A at the first locus and P_B being the frequency of allele B at the second locus. The absolute value of D' is determined by dividing D by its maximum possible value, given the allele frequencies at the two loci. The GWAS-identified SNP rs1387153 and rs10830963, which are denoted in Fig. 7 by circles, tag 7 highly linked SNP (r² > 0.7, 500 kb). The newly identified regulatory SNP rs10830956, which is denoted in Fig. 7 by stars, is located in the intergenic region between the FAT3 and MTNR1B genes. The tag rs10830963 is found to be non-functional. The arrows in Fig. 7 indicate the direction of gene transcription. As shown in Figs. 8 to 10, the SNP-adjacent regions (120 bp) of the FAT3_MTNR1B LD block SNP (r² > 0.7; 500 kb) are analyzed for modular complexity based on the classification strategies. Fig. 8 shows classification strategy 1, assessing the sum of all elements which are part of common modules. Fig. 9 shows classification strategy 2, assessing all common elements. Fig. 10 shows classification strategy 3, assessing the number of modules in a defined percental and absolute subset of input sequences, respectively. For each strategy, the analysis is performed with a defined subset of input sequences (percental and absolute). For each region, the elements and modules, respectively, are summed up.
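The LD measures mentioned above can be computed from the haplotype and allele frequencies as in the following Python sketch. It uses the standard definitions of D, r² and D'; the function name and the example frequencies are chosen for illustration only.

```python
def linkage_disequilibrium(p_ab: float, p_a: float, p_b: float):
    """Return (D, r2, D') for two biallelic loci.

    p_ab: frequency of the haplotype carrying alleles A and B
    p_a:  frequency of allele A at the first locus
    p_b:  frequency of allele B at the second locus
    """
    if not (0 < p_a < 1 and 0 < p_b < 1):
        raise ValueError("allele frequencies must lie strictly between 0 and 1")
    d = p_ab - p_a * p_b
    # r2: D squared divided by the product of all four allele frequencies.
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    # Maximum possible |D| given the allele frequencies and the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    return d, r2, d_prime


# Perfect LD: allele A always occurs together with allele B.
print(linkage_disequilibrium(0.5, 0.5, 0.5))   # (0.25, 1.0, 1.0)
# Linkage equilibrium: haplotype frequency equals the product.
print(linkage_disequilibrium(0.25, 0.5, 0.5))  # (0.0, 0.0, 0.0)
```

A cut-off such as r² > 0.7 can then be applied to the returned r² value to select the SNP regarded as closely linked with a tagSNP.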

In particular, within classification strategy 1 all elements (e.g. TFBS) in y-element modules are summed up for each percental quorum constraint, i.e. the minimum number of input sequences per total input sequences that has to contain a common framework. As shown in the upper panel of Fig. 8, this is performed for each element constraint (2 ≤ y ≤ 10) for all 8 FAT3_MTNR1B regions comprising the SNP indicated by rs-number. Additionally, as shown in the lower panel of Fig. 8, the same regions are counterchecked for their complexity in absolute terms of sequences. Only the regions reaching the highest PMCA scores under both criteria are classified as regulatory in the subsequent classification of complexity.

Within classification strategy 2, if the number of sequences with a regulatory element (such as a TFBS) on one strand (sense or antisense) is above quorum, then the regulatory element (e.g. the TFBS) is considered a "common element" and can be used to build potential frameworks. In order to avoid a false estimation due to hard constraints, all common elements are assessed for each region. As shown in the upper panel of Fig. 9, the total number of all common elements found in a sequence is summed up for all quorum constraints and, as shown in the lower panel of Fig. 9, the same sequences or regions are again counterchecked for their complexity in absolute terms of sequences. For this analysis, again the highest PMCA scoring hits are classified as regulatory.

Within classification strategy 3, as shown in Fig. 10, the number of y-element modules is summed up for the different element constraints (2 ≤ y ≤ 10) depending on each quorum constraint. The results of this classification strategy illustrate the described necessity for correction to total input sequences. As shown in the upper panel of Fig. 10, without correction a region for which only primates could be included as orthologous (rs11523890) shows a much higher, possibly overestimated, score compared to the score after correction as shown in the lower panel of Fig. 10.

A region that reaches the highest PMCA scores for all three classification strategies is assigned or rated to category I, i.e. a regulatory region. All remaining, less complex regions are assigned or rated to category II, i.e. not a regulatory region. As can be seen in Table 2, for the exemplified FAT3_MTNR1B LD block comprising 8 highly linked SNP, 3 SNP are rated as category I based on their high modular complexity and 5 are rated as category II. Notably, the GWAS-identified tagSNP is classified as not regulatory.
In Table 2 the abbreviation CS represents the classification strategy, % the percental quorum constraint and n the absolute quorum constraint.

Table 2: PMC analysis driven categorization of SNP highly linked with the tagSNP rs1387153 (r² > 0.7, 500 kb, HapMap CEU)

SNP          category   CS1 %   CS1 n   CS2 %   CS2 n   CS3 %   CS3 n
rs10830956   1          224     1091    217     425     594     61912
rs1387153    1          49      684     51      130     58      78296
rs7936247    1          59      564     58      129     113     115535
rs10765573   2          0       165     17      54      0       11442
rs10830963   2          0       368     6       51      4       82362
rs11020124   2          16      286     24      59      4       36168
rs11523890   2          172     174     32      32      21461   23458
rs7112766    2          119     262     30      54      1999    10066

The described implementation of modular complexity analysis with different classification strategies and counterchecking for input sequences addresses a problem of conventional conservation approaches, namely that phylogenetics relying exclusively on sequence alignment per se disregards the complex modular composition of regulatory regions. Thereby, it is made possible to discriminate regulatory regions and regulatory SNP, respectively, from non-regulatory linkage block regions.

As shown in Fig. 15 and in Fig. 16, in order to dissect the implemented PMC analysis approach, the degree of conservation in the respective SNP-spanning regions (1000 bp, UCSC multiz alignment) is assessed. Representatively, the experimentally proven regulatory category I region rs10830956 is less conserved than the control region rs10830963, highlighting the importance of the module analysis within the described method or bioinformatic approach. Likewise, the quantified level of conservation for each genomic region surrounding a SNP does not reflect the experimentally verified functional results. This finding is in line with the under-representation of associated SNP in conserved non-coding sequences and confirms the approach based on the analysis of modular organization of genomic regions in several species.

In Fig. 11 it is shown that rs10830956 is responsible for genotype- and cell type-specific differences in transcriptional activity. Section a of Fig. 11 shows the FAT3_MTNR1B locus regulatory rs10830956 and non-regulatory rs10830963 being analyzed for influence on DNA-binding activity by EMSA (electrophoretic mobility shift assay) and on transcriptional activity by reporter assays. Both the major and the minor allele of each SNP in the adjacent 273 bp or 41 bp regions are subcloned into a basal firefly luciferase construct with the TK-promoter. Section b of Fig. 11 shows EMSA with Cy5-labeled probes matching the rs10830956 minor and major allele with INS-1 insulinoma (left panel of section b of Fig. 11) and Huh7 hepatoma cell (right panel of section b of Fig. 11) protein. Section c of Fig. 11 shows relative firefly luciferase expression from constructs transfected into INS-1 cells (left panel) or Huh7 cells (right panel). Constructs matching either the major or the minor allele of the causal regulatory rs10830956 in forward or reverse orientation (indicated by arrows) are transfected. Section d of Fig. 11 shows EMSA with Cy5-labeled probes matching the rs10830963 minor and major allele using INS-1 insulinoma cell protein. Section e of Fig. 11 shows relative firefly luciferase expression from constructs transfected into INS-1 cells. Constructs matching both the major and the minor allele of the control rs10830963 are transfected. For all transfections, ratios of firefly luciferase expression to Renilla luciferase expression (expressed from cotransfected plasmid) are shown, measured 48 h after transfection and normalized to the TK-promoter construct. P values are derived from paired t-tests, error bars show s.e.m., n = 9. For EMSA, one representative experiment is shown, n = 3. Section f of Fig. 11 shows expression of the endogenous FAT3 gene in EBV-LCL cells (standardized to HPRT gene expression) analyzed by qRT-PCR.
Genotype and haplotype-specific effects on gene-expression are shown for the indicated genotypes of category 1 rs10830956 and category 2 rs10830963 (2 = homozygous minor allele, 0 = heterozygous major allele). P values derived from paired t-test, error bars show s.e.m., n=8.
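The normalization of the reporter assay read-outs described above can be sketched as follows. This Python sketch is an illustration only: the function names and the example luminescence values are hypothetical, and normalizing to the mean ratio of the TK-promoter control is an assumption about how the normalization is carried out.

```python
from statistics import mean


def relative_luciferase(firefly, renilla, tk_firefly, tk_renilla):
    """Firefly/Renilla ratios normalized to the TK-promoter control.

    Each argument is a list of replicate luminescence readings; firefly and
    renilla readings are paired per transfected well.
    """
    ratios = [f / r for f, r in zip(firefly, renilla)]
    tk_mean = mean([f / r for f, r in zip(tk_firefly, tk_renilla)])
    return [ratio / tk_mean for ratio in ratios]


def allelic_fold_change(minor_relative, major_relative):
    """Fold change of mean relative expression, minor versus major allele."""
    return mean(minor_relative) / mean(major_relative)


# Hypothetical readings for two replicates per construct.
tk_f, tk_r = [100.0, 100.0], [10.0, 10.0]
major = relative_luciferase([200.0, 200.0], [10.0, 10.0], tk_f, tk_r)
minor = relative_luciferase([400.0, 400.0], [10.0, 10.0], tk_f, tk_r)
print(major, minor, allelic_fold_change(minor, major))
```

Dividing by the Renilla signal corrects for transfection efficiency per well, and dividing by the TK-promoter control expresses each construct relative to the basal promoter activity.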

With regard to the exemplary embodiment of the method according to the invention described hereinbefore, rs10830956 is the causal regulatory SNP in the FAT3_MTNR1B locus. Subsequent to the steps described above, the PMC-based prediction of regulatory SNP for the FAT3_MTNR1B locus is experimentally validated. As shown in Fig. 11, rs10830956 is analyzed as a representative of category I regarding allele-specific regulatory function. As shown in section b of Fig. 11, in EMSA experiments with allele-specific probes, a distinct difference in protein-DNA interaction is observed in INS-1 β-cells, but not in Huh7 hepatocytes. Next, allele-specific reporter constructs are created and transcriptional activity is measured in both INS-1 β-cells and hepatic Huh7 cells. A genotype-specific, twofold change (p = 10⁻⁴) in reporter gene activity is observed in INS-1 β-cells, independent of forward (p < 10⁻⁴) or reverse (p = 0.0028) sequence orientation, indicating the enhancer function of the analyzed regions. As shown in section c of Fig. 11, in line with the EMSA results, allele-specific differences are not observed in hepatic Huh7 cells. The PMC-identified category I SNP rs10830956 therefore modulates protein-DNA binding and reporter gene expression in both an allele-specific and a cell type-specific manner. The cell type specificity reflects the reported association of the FAT3_MTNR1B locus with β-cell function and insulin secretion, rather than insulin sensitivity (Staiger (2008) PLoS.One 3: e3962; Prokopenko (2009) Nat.Genet 41: 77-81). The achieved results highlight the importance of analyzing allelic effects on transcriptional regulation in adequate cell types (Dimas (2009) Science 325: 1246-1250). Indeed, no allelic effect on expression of nearby candidate genes is observed in diverse tissue QTL datasets, which however lack β-cells (Prokopenko (2009) Nat.Genet 41: 77-81).
Notably, the identified regulatory SNP rs10830956 is located in an open chromatin region in primary human β-cells (Stitzel (2010) Cell Metab 12: 443-455), distinctly supporting the results of the PMC approach. Allele-specific regulation of endogenous gene expression is analyzed in genotyped EBV-LCL (Epstein Barr Virus infected lymphoblast cell lines). Using quantitative RT-PCR, FAT3 mRNA expression levels are determined, whereas MTNR1B is not expressed in this cell type. As shown in section f of Fig. 11, a threefold difference in FAT3 mRNA expression levels is observed depending on the genotype comprising both alleles in the minor form.

Further genotyping of EBV-LCL cells for the PMC-identified causal category I SNP rs10830956 and the category II SNP rs10830963 in close LD (r² = 0.705) makes it possible to compare the genotype effect of the respective SNP in distinct haplotype combinations. Strikingly, the significant difference in FAT3 mRNA expression solely depends on the genotype of the category I SNP rs10830956, whereas the genotype of the category II SNP rs10830963 does not affect gene expression. Regulation of endogenous FAT3 expression is therefore specifically targeted by the PMC-identified category I SNP rs10830956. As shown in Fig. 7, unlike the tagSNP rs10830963, which maps to the first intron of MTNR1B, the causal SNP rs10830956 is located in the intergenic region near the FAT3 and MTNR1B genes. The identified SNP might regulate both endogenous FAT3 and MTNR1B expression, or even long-distance acting enhancers. Notably, the FAT3_MTNR1B and FAT1_MTNR1A loci are described as paralogous regions in the human genome (Katoh (2006) Int.J.Mol.Med 18: 523-528). MTNR1B is implicated in T2D pathogenesis in several reports suggesting a possible link between circadian rhythm and glucose homeostasis (Bouatia-Naji (2009) Nat.Genet 41: 89-94). Little is known about the biological function of FAT3. However, members of the fat cadherin superfamily are involved in the regulation of Frizzled receptors (Yang (2002) Cell 108: 675-688) and interact with β-catenin (Nelson (2004) Science 303: 1483-1487), both key components of the WNT signaling pathway. The WNT pathway is strongly implicated in T2D pathogenesis, since variants in the TCF7L2 gene are reproducibly associated with T2D in several studies (Grant (2006) Nat Genet 38: 320-323). A dual effect of the identified SNP rs10830956 on both regulation of circadian insulin secretion and WNT-TCF7L2 signaling might be possible.

The genotype- and haplotype-dependent regulation of FAT3 mRNA expression presented here clearly proves that PMC analysis of SNP in an LD block is able to identify a precise causal SNP which is responsible for genotype-dependent endogenous gene expression. Unravelling genotype-dependent gene expression and identifying causal SNP is essential to dissect the pathophysiological mechanisms of genetic predisposition, and is therefore currently an important challenge in genetics. Identification of the precise causal variant allows analysis of upstream signalling, which in turn is an important step towards personalized medicine. Also for diagnostic purposes, the identification of a regulatory SNP in LD of 0.7 is essential for individual prediction of risk. Strikingly, regarding category II control SNP, the tagSNP rs10830963 in the FAT3_MTNR1B locus, which is located in a non-complex genomic region, lacks an allele-specific effect on cis-acting transcriptional activity. As shown in sections d and e of Fig. 11, neither allele-specific changes in relative luciferase expression nor differential DNA-protein interaction is observed for the tagSNP rs10830963 in INS-1 β-cells, as is shown in sections b and c of Fig. 11 for rs10830956 in hepatic Huh7 cells. Using the PMC analysis, a causal variant reflecting the β-cell phenotype of the MTNR1B locus LD block is presented which is different from any tagSNP described so far.

Fig. 12, Fig. 13 and Fig. 14 show the PMC-analysis driven identification and functional validation of regulatory SNP in six diabetes LD blocks. As can be seen in Table 3, in total six GWAS-identified LD blocks associated with T2D or FPG (Prokopenko (2008) Trends Genet 24: 613-621; Dupuis (2010) Nat. Genet 42: 105-116; Voight (2010) Nat. Genet 42: 579-589) are analyzed to validate the predictive value of the PMC approach for further GWAS-identified loci, as shown in Fig. 12. The 6 selected tagSNP are located in strong LD with 84 non-coding SNP. In search of the causal regulatory SNP, the PMC analysis is performed for all 90 SNP within these six LD blocks as described for the MTNR1B FAT3 locus. Fig. 12 shows SNP-adjacent regions of 90 SNP in 6 different T2D loci being analyzed for modular complexity based on three classification strategies for both the percental (left panel) and the absolute number of input sequences (right panel). The sum of elements in modules, the elements and the number of modules are shown for all 90 regions in hierarchical order. Fig. 13 shows EMSA with Cy5-labeled probes matching the respective major and minor allele of 6 representative category I regulatory and four category II non-regulatory SNP. The quantified change in fluorescence comparing the major and minor allele of each SNP is shown, reflecting the genotype-dependent difference of DNA-binding activity for each of the ten analyzed SNP (n = 3). Fig. 14 shows firefly luciferase expression from constructs transfected into different cell lines. The major and minor allele of each representative category I regulatory and category II non-regulatory SNP are subcloned into a basal firefly luciferase construct with the TK-promoter. Ratios of firefly luciferase expression to Renilla luciferase expression (expressed from cotransfected plasmid), measured 24 h after transfection, are calculated and normalized to the mean ratio from the TK-promoter construct.
The difference of major and minor allele luciferase activity for each SNP is shown, reflecting genotype specific transcriptional activity, n=9 for each analyzed SNP and P values derived from t-test.

Table 3: Initially selected tag SNPs for bioinformatical analysis

tagSNP                  Chr.   Hg18        gene locus
rs10830963, rs1387153   11     92348358    MTNR1B FAT3
rs7903146               10     114748339   TCF7L2
rs1801282               3      12368125    PPARγ
rs1552224               11     72110746    CENTD2 FCHSD2 STARD10
rs972283                7      130117394   upstream KLF14
rs1121980               16     52366748    FTO
                                           JAZF1
                                           CAM 1 B

For experimental validation of the PMC-analysis based prediction, one representative chromosomal region is selected for each category and LD block and investigated for allele-specific cis-regulatory potential. The allele-dependent change is quantified in fluorescence intensity and relative luciferase activity for EMSA (electrophoretic mobility shift assay) and reporter gene assay, respectively, for each category I and each category II SNP. Overall, category I SNP clearly differ from category II SNP in their regulatory function for both measures. The predicted regulatory category I SNP alter DNA-protein binding activity, with a 3- to 72-fold allele-dependent change in fluorescence (p = 0.0035, Fig. 13), and transcriptional rate, with a 1.2- to 3.6-fold change (p = 0.0047, Fig. 14). On the other hand, none of the four analyzed control category II SNPs, predicted as non-regulatory, reveals any allele-dependent regulatory potential on the level of protein-DNA interaction and transcriptional activity, respectively. Notably, the experimentally verified category I SNP rs7903146 in the TCF7L2 locus shows excellent concordance with previous studies reporting this variant as regulatory functional in primary human β-cells based on FAIRE sequencing (Gaulton (2010) Nat. Genet 42: 255-259) and on a genome-wide open chromatin approach (Stitzel (2010) Cell Metab 12: 443-455). All in all, for each of the 6 T2D-conferring loci analyzed, a causal variant is identified, ascribing a strong predictive potential to the established PMC analysis. The established PMC approach combining phylogenetic classification and module complexity analysis therefore allows the precise detection of functional SNP located in widespread LD blocks and thereby the prediction of complex regulatory genomic regions.

The bioinformatic PMC approach according to the invention in the presented embodiments, combining phylogenetic classification and module complexity analysis, allows the precise detection of functional SNP located in widespread LD blocks. Addressing the problem of de novo identification of the functional variants in the human genome, the PMC approach contributes substantially to the understanding of the pathogenesis of complex diseases governed by the allelic architecture of SNP affecting cis-regulation. Particularly with regard to the currently emerging discussion of missing heritability of common, complex disorders, the demand for mapping the causal regulatory variants takes on significance (Select: GWAS Gets Functional (2010) Cell 143: 177; Baker (2010) Nature 467: 1135-1138; Eichler (2010) Nat.Rev.Genet 11: 446-450; Manolio (2010) N.Engl.J.Med 363: 166-176). Given that effect sizes are determined based on LD with tagSNP that are often imperfectly linked with the true causal variants, the explainable amount of heritability is likely to be underestimated. In addition, proposed epistatic and environmental interactions and the contribution of rare variants to disease risk add to the pressing demand for a feasible strategy to identify the functional ones within the human genome (Maher (2008) Nature 456: 18-21; Phillips (2008) Nat.Rev.Genet 9: 855-867; Manolio (2009) Nature 461: 747-753; Cirulli (2010) Nat.Rev.Genet 11: 415-425; Ritchie (2011) Ann.Hum.Genet 75: 172-182). Whole genome sequencing is essential to cope with these concerns (Cirulli (2010) Nat.Rev.Genet 11: 415-425), and newly discovered variants will further increase the demand for a feasible strategy like the one introduced here.
In the rising era of personalized medicine (Malandrino (2011) Clin.Chem 57: 231-240), a methodological approach for the highly effective identification of cis-regulatory SNP is thereby provided herein, making it possible to define a map of causal variation in the human genome and to detail the biological upstream and downstream pathways and processes critical for a disorder.

Second Exemplified Mode for Carrying Out the Invention

1. Detailed description of the Figures

Fig. 17 shows schemata and plots providing information about the discovery and validation of cis-regulatory SNPs; a, b, Analysis workflow of PMCA. a, This figure illustrates the discovery of cis-regulatory variants among multiple candidate SNPs derived from GWAS-identified LD blocks. PMCA uses functional conservation of a SNP-surrounding genomic region to predict SNPs that influence gene expression at cis-regulatory level. At the end, SNPs that are located in regions with a conserved complex TFBS modularity (complex SNP regions, predicted cis-regulatory) were sorted out from non-complex SNP regions. b, Main features of PMCA performance (schematic). PMCA combines phylogenetic conservation with a complexity assessment of TFBS modules, represented by TFBS module constraint scores. The genomic sequence (120 bp) around a candidate SNP was extracted (human sequence, GRCh37 / hg19) and used to search for orthologous sequences in 16 vertebrate species (ortholog set, upper panel). The resulting ortholog sets were analysed for the presence of TFBSs (squares) that are common among several species, and conserved TFBS modules (patterns of TFBSs (colored squares) with at least two co-occurring TFBSs in a defined orientation and distance range, grey shading). The graphic displays the three classifications based on number (#) of common TFBS, # of conserved TFBS modules, and # of TFBS in conserved TFBS modules. For phylogenetic comparison, the measures of all three strategies were assessed for sequence number constraint and percent sequence constraint. Random occurrence of all three measures was estimated by applying the same analysis to 1,000 randomly shuffled sequences per ortholog set (lower panel). Random sequences were generated by shuffling the nucleotide positions for each ortholog sequence within 10 bp windows. Details on PMCA are given in the Materials and Methods section (see below).
Performance of PMCA for eight T2D susceptibility loci comprising 200 candidate SNP-surrounding regions (1000 Genomes Pilot 1 data (CEU), LD block settings: R² > 0.7, distance limit 500 kb). c-e, Box-and-whisker plots of the numbers obtained for each classification strategy in the analysis based on sequence number constraint. Plots show the distributions for #common TFBSs (c), #conserved modules (d) and #TFBSs in conserved modules (e), including the median (horizontal bars), the interquartile region (IQR) representing the middle 50% range (boxes), extreme values (whiskers) and outliers (dots). Data points covered by the IQR and the whisker values were explicitly added as a rug at the sides of the plot. The median for complex regions (highlighted in red) was higher than for the non-complex regions for each classification. f,g, Density histograms for the distribution of common TFBS and combined TFBS module constraint scores. PMCA data were related to the random occurrence of measures in 1,000 shuffled sequences per ortholog set. Results are depicted by estimated random probabilities (p-est.). The p-est. distributions derived from the analysis based on sequence number constraint are shown as -log10(p-est.) for common TFBS (f) and for the combined scores obtained over all classification strategies (g). The blue curve illustrates the empirical density function of the histogram data. The red vertical dashed line indicates the cutoff separating complex from non-complex regions (SNP-surrounding regions with a value left of this were defined as non-complex). The isolated peak at the right (low p-est. / high score data) refers to data points that hit the lower limit of p-est. calculations. h-j, Validation of predicted cis-regulatory effects for complex SNP regions. h,i, Cis-regulatory predictions for complex SNP regions (red dots) were validated at the level of DNA-binding activity and transcriptional activity.
Non-complex SNP regions were included as a control (half of the control regions matched or exceeded the median of common TFBS density of complex regions, median = 88). h, Electrophoretic mobility shift assays (EMSAs) were performed with Cy5-labelled probes matching the risk and non-risk alleles of the respective SNP-surrounding regions, reflecting the allele-specific change in DNA-binding activity. The quantified change in fluorescence comparing the risk and non-risk alleles is shown for each SNP as mean and standard deviation of repeated measurements (|risk / non-risk| > 1), logarithmic scale. P-values are derived from paired t-tests, error bars show s.d., n = 4. i, Reporter assays were performed with luciferase promoter constructs matching the risk and non-risk alleles of the respective SNP-surrounding region, reflecting the allele-specific change in transcriptional activity. The change in relative luciferase expression comparing the risk and non-risk alleles is shown for each SNP as mean of repeated measurements (|risk / non-risk| > 1), logarithmic scale. P-values are derived from paired t-tests, error bars show s.d., n = 9. j, Cell type-specific cis-regulatory effects of complex SNP regions. Luciferase constructs of the respective complex SNP regions were transfected in INS1 pancreatic β-cells (insulin secretory cell line), and 3T3-L1 adipocytes, C2C12 myocytes and Huh7 cells (insulin responsive cell lines), respectively. The allele-dependent fold change in relative luciferase activity comparing the risk allele and non-risk alleles is shown for each SNP, representing an activating or a repressing effect of the risk allele on transcriptional activity (risk allele / non-risk allele > 1 or < 1, respectively), logarithmic scale, n = 9.

Fig. 18 shows plots showing positional bias of distinct homeobox TFBSs at SNP position in T2D complex SNP regions; a,b,e,f, Distribution of TFBS matrices relative to SNP position (denoted by grey dashed lines) within complex SNP regions at eight T2D loci (a,b), eight asthma loci (d,e) and an extended set of 47 T2D susceptibility loci (i,j), assessed by positional bias analysis. Positional bias was calculated from TFBS match occurrence over 1,000 bp SNP regions for 92 TFBS matrix families (Genomatix Matrix Library version 8.4) within sliding 50 bp windows under a binomial distribution model (detailed in the Materials and Methods section, below). a,e, Positional bias profiles are shown for TFBS matrix family distribution in complex SNP regions matching the criteria of central SNP position and -log10(P) > 6. Positional bias analysis within complex SNP regions reveals specific clustering at SNP position of five distinct homeobox TFBS matrix families for T2D (a) and of EBOX and EGRF for asthma susceptibility loci (e). b,f, Positional bias profiles within T2D and asthma complex SNP regions do not show a peak at SNP position for all, but the limited TFBS matrix set shown in (a) and (e) (exemplarily represented by a subset of analysed TFBS families). c-h,l,m, Comparison of bias profiles for the same TFBS matrices as in (a), (e) and (i) versus 100 sets of size-adjusted randomly selected NCBI dbSNP regions (grey lines) (see also Materials and Methods section, below; the P-value distribution of positional bias is shown in Supplementary Figure 25). c-h, The specific clustering of distinct homeobox TFBS families at SNP position in T2D complex regions versus the specific clustering of EGRF and EBOX matrices at asthma complex SNP regions is a distinguishing feature of the respective traits. i-k, Distribution of TFBS matrices relative to SNP position within complex and non-complex SNP regions calculated for an extended set of 47 T2D risk loci.
i,j, Bias profiles for TFBS matrix family distribution in complex SNP regions (i, j) and non-complex SNP regions (k) calculated for an extended set of 47 T2D risk loci, i, Specific clustering at T2D complex SNP region position for the same set of homeobox TFBS matrix families as shown for (a) and additionally six further homeobox TFBS matrix families (- log 10 (P) > 6) is shown, j, Positional bias profiles do not show a peak at SNP position for all, but the limited TFBS matrix set shown in (i) (exemplarily represented by a subset of analysed TFBS matrix families), k, All analysed TFBS matrix families displayed equal distributions within non-complex SNP regions (exemplarily represented by a subset of TFBS families). I, Random distribution and frequencies of positional bias profiles for the same homeodomain TFBS matrix families as described in (i) obtained from 100 sets of 487 randomly selected NCBI dbSNP regions (see Materials and Methods section, below). The plot displays the relative frequencies of the maximum bias positions, i.e., the bias in randomly chosen SNP regions is not focused on the SNP position, m, Comparison of positional bias profiles for the same TFBS matrices as in (i) versus in total 48,700 random NCBI dbSNP region sets (grey lines). The specific clustering of distinct homeobox TFBS matrix families at SNP position in T2D complex regions distinguish from randomly selected SNPs (P-value distribution in Figure 25). shows plots providing information about correlations of PMCA results with evolutionary constraint elements; a, The occurrences of 487 complex and 978 non-complex T2D-associated SNP regions within constraint elements according to the SiPhy-π algorithm described in Lindbiad-Toh (201 1 ) Nature 478: 476-482 is shown (± 500 bp from the midpoints of constraint elements) (see Materials and Methods section, below). Complex SNP regions are enriched nearby evolutionary constraint regions in contrast to non-complex SNP regions. 
For localization of SNPs within the analysed 47 T2D risk loci relative to TSSs see Figure 27. b, The Venn diagram illustrates the number of complex and non-complex SNP regions that directly map to a constraint element (overlap). Complex and non-complex SNP regions do not directly overlap with constraint elements. c, Experimentally validated cis-regulatory complex SNP regions at the PPARG gene locus map near constraint elements, though not directly matching the constraint regions (for experimental validation see Figure 19); Zoom-out: the cis-regulatory rs4684847-surrounding region, located 393 bp upstream of the nearest constraint element, is shown. The TFBS modularity at the rs4684847-surrounding region is exemplarily illustrated by one TFBS module that is conserved across five vertebrate species (the sequence logo for the rs4684847-surrounding region was constructed from 5 orthologous regions, visually showing the conservation at each of the alignment positions; for a given position in the matrix, the combined height of the bases represents the information content at that position, whereas the relative heights of the individual bases represent the frequency of that base at that position). The representative TFBS module harbours the homeobox TFBS matrix PRRX1, whose T2D-distinct matrix family clustering at complex SNP position was found by positional bias analysis (Figure 18c,g,m) and whose regulatory function on endogenous PPARG2 gene expression was validated by siRNA knockdown experiments (Figure 20).

Fig. 20 shows plots and schemata providing validation of genotype-dependent cis-regulatory predictions at the PPARG diabetes risk locus and identification of the homeobox TF PRRX1 regulating endogenous PPARγ2 expression; a, Regional LD plot for the PPARG gene locus at 3p25.2, associated with T2D. The lead SNP rs1801282 (Pro12Ala) is plotted with SNPs in LD (minor allele frequency > 1%) against genomic position in a 200 kb interval (NCBI GRCh37/hg19). SNPs are shown as diamonds and coloured according to their pair-wise correlation (pair-wise LD based on 1000 Genomes Pilot 1 CEU data using the SNAP Proxy Tool, Broad Institute). The dashed lines indicate the location of SNPs which are in strong LD (R² > 0.7) with rs1801282. Predicted cis-regulatory SNPs are denoted with red lines, the PPARG gene and exons are denoted in blue. Plots were prepared using R version 2.15. Zoom-in, schematic: the human PPARG gene, PPARγ1-3 mRNA isoforms (coding exons: boxes; untranslated exons: dashed boxes; introns: lines; promoters: arrows) and predicted cis-regulatory SNPs (red). b,c, Genotype-dependent mRNA levels in samples of primary human adipose tissue cells, homozygous and heterozygous for the risk allele (genotyped for Pro12Ala and rs4684847, R² = 1.0). P-values from Mann-Whitney U test, error bars show s.d. b,c, Mean mRNA levels of PPARγ2 and PPARγ1 isoforms were assessed by qRT-PCR in homozygous risk allele carriers (n = 9) and heterozygous subjects (n = 5), normalized to mean levels in homozygotes. d,f,k,l, Reporter assays were performed with luciferase promoter constructs matching the risk and non-risk alleles of the respective SNP-surrounding region, reflecting the allele-specific change in transcriptional activity. P-values were derived from paired t-tests, error bars show s.d. d, Validation of cis-regulatory predictions for discovered complex SNP regions (red) at the PPARG locus (LD block settings: R² > 0.7, distance limit 500 kb).
Non-complex SNP regions (black) were included as control. The allele-dependent fold change in relative luciferase activity in differentiated 3T3-L1 adipocytes comparing the risk and non-risk alleles is shown on the log2 scale for each SNP region, representing an activating or repressing effect of the risk allele on transcriptional activity; n = 3-14 repeated measurements per SNP. e, Luciferase assays with constructs harbouring the rs4684847-surrounding region in 5'-, 3'-, forward and reverse orientation (arrows) transfected in differentiated 3T3-L1 adipocytes, n = 9. f, Density histogram of the positional bias distribution for the homeobox CART matrix family obtained from 48,700 random NCBI dbSNPs versus complex SNP regions calculated from 47 T2D susceptibility loci (denoted by a red arrow). g, The CART matrix family member PRRX1 matches the rs4684847 (C-T) variant. The TFBS modularity at the complex region surrounding rs4684847 is illustrated by one conserved TFBS module comprising the putative binding sequences for TEF, LHXF (grey shading) and PRRX1 (in a consistent orientation and distance range across several species) as an example. The motif logos for the PRRX1, TEF and LHXF positional weight matrices⁶⁵ that match a putative binding site were rendered by WebLogo⁶⁶. h,i, EMSAs with Cy5-labelled probes matching the risk and non-risk alleles of rs4684847. h, Adipocyte protein extracts reveal allele-specific protein binding. i, Increased PRRX1 binding at the risk allele. Competition assays were performed with increasing excess of cold PRRX1 probe, and PRRX1 antibody was used to shift the protein-DNA complex in EMSAs with PRRX1 protein ectopically expressed in 293T cells. j, Luciferase assays with truncated constructs (deletion of the predicted PRRX matrix without affecting the rs4684847 position) in differentiated 3T3-L1 adipocytes show abrogated allelic cis-regulatory activity, n = 9.
k, Luciferase assays with or without concomitant ectopic expression of PRRX1 in 3T3-L1 adipocytes reveal inhibition of luciferase activity at the rs4684847 risk allele. Luciferase activity normalized to pCMV control, n = 9. l,m,n, Silencing of endogenous PRRX1 expression in samples of primary human adipose tissue cells (m,n) and human SGBS preadipocytes (o), all homozygous for the risk allele at rs4684847 (siPRRX1; siNT = non-targeting siRNA control). l,m, Mean mRNA levels of PRRX1, PPARγ1, PPARγ2 (m) and PPARγ2 target genes (n) assessed by qRT-PCR (calculated relative to IPO8 mRNA and normalized to the mean expression level in siNT-treated cells) 3-4 days after induction of adipogenic differentiation. P-values were derived from Wilcoxon signed rank test, error bars show s.d. n, Oil Red O lipid staining in human SGBS preadipocytes treated with siPRRX1 or siNT for six days after induction of adipogenic differentiation. o, Correlation of PRRX1 mRNA levels with ln-transformed BMI, HOMA-IR, and TG/HDL ratio. PRRX1 mRNA levels were assessed by qRT-PCR in mature human adipocytes obtained from patients undergoing elective surgery (n = 22). Partial regression plots adjusting for age and sex are shown.

Fig. 21 shows plots showing that there is no apparent positional bias of TFBS matrices at non-complex regions; Distribution of TFBS matrices relative to SNP position (denoted by grey dashed lines) within non-complex regions calculated for the set of eight T2D loci (a) and eight asthma loci (b) by positional bias analysis. Positional bias was calculated from TFBS match occurrence over 1,000 bp SNP regions for 192 TFBS matrix families (Genomatix Matrix Library version 8.4) within sliding 50 bp windows under a binomial distribution model (detailed in the Materials and Methods section, below). Bias profiles are exemplarily presented for a subset of analysed TFBS matrix families including the matrix families which matched the selection criteria of central SNP position and -log10(P) > 6 in the complex SNP regions (Figure 18). In summary, positional bias analysis within non-complex SNP regions reveals no apparent clustering at SNP position for any of the analysed TFBS matrix families in either T2D-associated or asthma-associated risk loci. TFBS matrix family distribution within non-complex SNP regions for the extended set of 47 T2D risk loci is shown in Figure 18k.

Fig. 22 shows plots showing random distribution and frequencies of positional bias profiles in eight T2D and eight asthma risk loci; Positional bias analysis within complex SNP regions revealed specific clustering at SNP position for a set of five distinct homeobox TFBS matrix families (DLXF, PAXH, LHXF, CART, PDX) in eight T2D susceptibility loci and for two TFBS matrix families (EBOX, EGRF) in eight asthma susceptibility loci (Figure 19a,e). Random distribution and frequencies of positional bias profiles for both sets of TFBS matrix families were obtained from 100 sets of 487 randomly selected NCBI dbSNP regions (see Materials and Methods section, below). The plot displays the relative frequencies of the maximum bias positions for the set of T2D- (a) and of asthma- (b) specific TFBS matrix families. The results show that, for both sets of TFBS matrix families, the bias in randomly chosen SNP regions is not focused on the SNP position.

Fig. 23 shows plots showing performance of PMCA for eight asthma susceptibility loci;

PMCA results are shown for eight asthma susceptibility loci comprising 208 candidate SNP-surrounding regions (1000 Genomes Pilot 1 data, CEU; LD block settings: R² > 0.7, distance limit 500 kb). PMCA results are illustrated for number (#) of PMCA measures (a-c) and for estimated random probability constraint scores (d,e). a-c, Box-and-whisker plots of the numbers obtained for each classification strategy in the analysis based on sequence number constraint. Plots show the distributions for # common TFBSs (a), # conserved modules (b) and # TFBSs in conserved modules (c), including the median (horizontal bars), the interquartile range (IQR) representing the middle 50% of the data (boxes), extreme values (whiskers) and outliers (dots). Data points covered by the IQR and the whisker values were explicitly added as a rug at the sides of the plot. The median for complex regions (highlighted in red) was higher than for the non-complex regions for each classification. d,e, Density histograms for the distribution of common TFBS and combined TFBS module constraint scores. PMCA data were related to the random occurrence of measures in 1000 shuffled sequences per ortholog set. Results are depicted by estimated random probabilities (p-est.). The p-est. distributions derived from the analysis based on sequence number constraint are shown as -log10(p-est.) for common TFBSs (d) and for the combined scores obtained over all classifications (e). The blue curve illustrates the empirical density function of the histogram data. The red vertical dashed line indicates the cut-off separating complex from non-complex regions (SNP-surrounding regions with a value left of this line were defined as non-complex). The isolated peak at the right (low p-est. / high score data) refers to data points that hit the lower limit of the p-est. calculations.

Fig. 24 shows plots showing performance of PMCA for 47 T2D susceptibility loci;

Fig. 25 shows plots providing information about the distribution of P-values for positional bias in NCBI dbSNP random sets versus complex SNP regions calculated from 47 T2D susceptibility loci;

A specific set of homeobox TFBS matrices revealed a central positional bias at SNP position in the complex SNP regions of 47 T2D loci (Figure 18). Each density histogram shows the positional bias distribution obtained from a random set of SNPs (48,700) for the distinct set of homeobox TFBS matrices (a) and for eight representative TFBS matrices without positional bias (b). The dashed grey vertical lines indicate the 95% and 99% quantiles of the P-value distribution for the set of randomly selected SNPs. The P-value obtained within the complex SNP regions (predicted cis-regulatory) is indicated by a red arrow. In summary, the P-value position for complex SNP regions is significantly higher than the 99% quantile in the random SNP set for the homeobox TFBS matrices (a), as opposed to all other TFBS matrices (b).

Fig. 27 shows plots providing information about the distance to transcriptional start sites (TSSs) for complex and non-complex SNP regions obtained for 47 analysed T2D LD blocks; Density histograms show all distances (bin size 500 bp) between SNPs and TSSs (TSS annotated within 30,000 bp downstream of SNP position). The distance distribution is shown for 474 complex SNP regions (a) and 976 non-complex SNP regions (b) identified by PMCA within the set of 47 T2D loci. The histogram shapes of (a) and (b) illustrate the equal positioning of both PMCA categories, complex and non-complex SNP regions, relative to downstream TSSs.

Fig. 28 shows a plot providing the validation of cis-regulatory predictions at the PPARG T2D risk locus;

Cis-regulatory predictions for complex SNP regions were validated at the level of transcriptional activity. Non-complex SNP regions were included as a control. Reporter assays were performed with luciferase promoter constructs matching the risk and non-risk alleles of the respective SNP-surrounding regions, reflecting the allele-specific change in transcriptional activity. The fold change in relative luciferase expression comparing the risk and non-risk alleles is shown for each SNP as the mean of repeated measurements (|risk/non-risk| > 1), logarithmic scale. P-values are derived from paired t-tests, error bars show s.d., n = 9.

Fig. 30 shows a plot providing information about the regulation of PRRX1, PPARγ1 and PPARγ2 mRNA expression by siRNA knockdown of PRRX1 in human SGBS adipocytes; PRRX1 siRNA (siPRRX1) and non-targeting siRNA (siNT) were transfected in human SGBS preadipocytes (homozygous for the risk allele, genotyped for rs4684847 and Pro12Ala, R² = 1.0). The mean mRNA levels of PRRX1, PPARγ1 and PPARγ2 assessed by qRT-PCR (normalized to the mean expression levels in siNT-treated cells) are shown, 3-4 days after induction of adipogenic differentiation. P-values were derived from Mann-Whitney U test, error bars show s.d., n = 3; and

Fig. 31 shows plots providing information about the association of PRRX1 mRNA levels with ln-transformed BMI, HOMA-IR, and TG/HDL ratio. PRRX1 mRNA levels were assessed by qRT-PCR in mature human adipocytes (n = 22). Partial regression plots adjusting for age, sex and BMI are shown.

2. Discovery and characterization of cis-regulatory SNPs

It was chosen to limit the primary analysis to eight reported T2D susceptibility loci (MTNR1B, TCF7L2, PPARG, CENTD2, FTO, GCK, CAMK1D, KLF14) (see Dupuis (2010) Nat Genet 42: 105-116; Voight (2010) Nat Genet 42: 579-589; and Zeggini (2008) Nat Genet 40: 638-645) comprising 200 SNPs in strong LD with the respective lead SNPs (R² > 0.7, 1000 Genomes CEU). The set of 200 SNP-adjacent regions was subjected to PMCA. The approach relies on identifying orthologous regions in 16 vertebrate species, determining the number of species-common TFBSs, TFBS modules, and co-occurring TFBSs in these modules, and testing for random occurrence of all three features (Figure 17a,b; see also Materials and Methods section, below). In effect, PMCA assesses functional conservation, prioritizing SNPs that are located in genomic regions with a conserved, complex modular build-up of TFBSs (complex SNP regions). 64 SNPs in such complex regions (predicted cis-regulatory) were separated from 136 SNPs in non-complex regions (predicted non-cis-regulatory) (Figure 17c-g). In an initial analysis, the 64 complex SNP regions were ranked according to the number of species-common co-occurring TFBSs in modules, and the cis-regulatory potential of SNPs within the upper 25% of the list was tested. Non-complex SNP regions were included as a control (50% of controls were selected by similar TFBS density relative to complex SNP regions, to evaluate the importance of TFBS modularity beyond TFBS enrichment per se). To validate the predictions, electrophoretic mobility shift assays (EMSAs) and luciferase assays were used, and the allele-dependent change in fluorescence intensity and relative luciferase activity was quantified. As predicted, SNPs located in complex regions significantly differ in their regulatory function from non-complex SNP regions for both measures of allele-dependent cis-regulatory activity (Figure 17h, P = 2.82 x 10⁻⁴; Figure 17i, P = 1.45 x 10⁻⁴).
Complex SNP regions revealed confirmatory allele-specific effects, with a mean 3.1- to 101-fold change in DNA-protein binding and a 1.3- to 3.5-fold change in transcriptional rate. Corroborating the finding that common regulatory variants operate in a cell type-specific manner (Nica (2011) PLoS Genet 7: e1002003; Dimas (2009) Science 325: 1246-1250; Ernst (2011) Nature 473: 43-49), repressing and activating cell type-specific allelic effects were found for the complex SNP regions in the herein described study (Figure 17j). For rs1421085 at the FTO locus the most pronounced effect was observed in adipocytes, confirming the reported association primarily with obesity (Frayling (2007) Science 316: 889-894). By contrast, the observed allelic effects in pancreatic β-cells for rs7903146, rs10830956 and rs2908289 located at the TCF7L2, MTNR1B and GCK loci support the established association with β-cell function and glucose-stimulated insulin secretion (Dupuis (2010) Nat Genet 42: 105-116; Prokopenko (2009) Nat Genet 41: 77-81). The rs7903146 allelic effect in adipocytes is in accordance with the genotype-dependent TCF7L2 expression in human adipose tissue (Mondal (2010) Journal of Clinical Endocrinology & Metabolism 95: 1450-1457). The herein presented analysis is further supported by regulatory inferences from chromatin profiling, as rs10830956 at the MTNR1B locus (Gaulton (2010) Nat Genet 42: 255-259) and rs7903146 at the TCF7L2 locus (Gaulton (2010) Nat Genet 42: 255-259; Stitzel (2010) Cell Metab 12: 443-455) have recently been mapped in either FAIRE sequencing or genome-wide open chromatin approaches. The other discovered cis-regulatory SNPs are hidden in the GWAS-identified FTO and GCK gene loci. Functionality of these SNPs has not been suspected so far. To check whether these cis-regulatory variants associate with susceptibility in man, lookups in large population-based cohorts of the MAGIC and DIAGRAM consortia were performed.
Note that varying numbers of complex SNP regions per LD block were discovered (Figure 26), any of which may contribute to an association signal in human genetic studies. Although the use of human genetic approaches to prove the predicted SNP functionality might therefore be hampered, a similar or even stronger association was observed compared to the initially reported signal inferred from the lead SNP for continuous glycaemic measures such as fasting glucose and insulin and for T2D, further supporting the functionality of the predicted variants (Table 4). In the case of the TCF7L2 diabetes risk locus, the GWAS lead SNP is the same as the one that has been discovered in this study.
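
By way of illustration only, the test for random occurrence used in PMCA (comparison of an observed measure with the same measure in shuffled sequences, reported as an estimated random probability, p-est.) may be sketched in Python as follows. This is a minimal sketch and not the implementation used in the study: the substring search `count_common_motifs` is a hypothetical stand-in for a TFBS matrix scan, the function names are invented for this example, and the pseudo-count of 1 in `estimate_p` is an assumption chosen so that p-est. has a lower limit of 1/(number of shuffles + 1).

```python
import math
import random


def count_common_motifs(seqs, motifs):
    """Toy stand-in for a TFBS scan: number of motifs present in every sequence."""
    return sum(all(m in s for s in seqs) for m in motifs)


def shuffle_seq(seq, rng):
    """Return a shuffled copy of seq that preserves its base composition."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


def estimate_p(observed, null_values):
    """Empirical probability that a shuffled set scores >= observed.

    The +1 pseudo-count is an assumption of this sketch, not taken from the text.
    """
    hits = sum(v >= observed for v in null_values)
    return (hits + 1) / (len(null_values) + 1)


def pmca_like_score(seqs, motifs, n_shuffles=1000, seed=0):
    """Return (observed measure, p-est., -log10(p-est.)) for one ortholog set."""
    rng = random.Random(seed)
    observed = count_common_motifs(seqs, motifs)
    null = [
        count_common_motifs([shuffle_seq(s, rng) for s in seqs], motifs)
        for _ in range(n_shuffles)
    ]
    p_est = estimate_p(observed, null)
    return observed, p_est, -math.log10(p_est)
```

A region would then be called complex when its score exceeds a chosen cut-off; the cut-off itself is not specified in this section and would have to be taken from the Materials and Methods section.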

3. Clustering of distinct homeobox TFBSs as distinguishing feature of T2D complex SNP regions

Second, evidence for a distinguishing functional feature shared among T2D complex SNP regions was sought. Noting that TFBS clustering relative to TSSs indicates biological significance (FitzGerald (2004) Genome Research 14: 1562-1574; Kranz (2011) Nucleic Acids Research 39: 8689-8702), the TFBS distribution with respect to SNP position was examined. Given a SNP-surrounding genomic region, positional bias analysis scanning 1,000 bp for the occurrence of TFBS matrix families was used to visualize TFBS distribution profiles (see Materials and Methods section, below). Applying positional bias analysis to the eight diabetes risk loci analysed above (64 complex SNP regions versus 136 non-complex SNP regions), a significant positional bias for five TFBS families (-log10(P) > 6) was observed exactly at SNP position for T2D complex regions (Figure 18a), contrary to non-complex regions (Figure 21a). TFBS matrices other than the small subset of homeobox TFBS families in complex SNP regions did not show a peak over the 1,000 bp sequences (Figure 18b). The striking SNP-directed clustering in T2D complex regions was restricted to distinct TFBSs, all belonging to the superfamily of homeobox TFBSs, including DLXF, PAXH, LHXF, CART and PDX1. The maximum central positional bias of these homeobox matrices exactly covered the SNP position, and a second, weaker bias was found 200 bp upstream of the SNP position. The positional bias analysis on commensurate control sets of randomly gathered NCBI dbSNP regions (in total 3,600 sequences) revealed equal distributions of positions for bias maxima (Figure 22a). Thus, the positional bias profile of the identified homeobox matrices exhibited a central maximum peak, which clearly distinguishes these profiles from corresponding profiles obtained with random selections (Figure 18c).
It has been reported that the TFBS combination coupled with the TFs recruited to a CRM determines its regulatory function (Kantorovitz (2009) Developmental Cell 17: 568-579; Zinzen (2009) Nature 462: 65-70), suggesting that regulatory regions with similar TFBS patterns might display functional similarity. It was therefore hypothesized that TFBS matrices with specific bias at SNP position are likely to be biologically meaningful for the studied trait, here T2D. To evaluate the specificity of the identified homeobox matrices for T2D susceptibility loci, PMCA was performed for a second trait unrelated to T2D, asthma. Analysing a set of eight asthma risk loci (Moffatt (2010) N Engl J Med 363: 1211-1221) with an appropriate number of SNPs in high LD (Figure 23), it was found that none of the T2D-overrepresented homeobox matrices displayed a significant positional bias at SNP position in complex asthma regions (Figure 18f,g). By contrast, a positional clustering which exactly covered SNP positions in the asthma complex regions (-log10(P) > 6) was found for two matrix families, the EGRF and EBOX families (Figure 18e,h, Figure 22b). Notably, in the asthma-associated GLCCI1 locus an EBOX matrix was targeted by a cis-regulatory SNP (Tantisira (2011) N Engl J Med 365: 1173-1183), and the EGRF matrix binding factor EGR1 regulates IL13-induced inflammation (Cho (2006) Journal of Biological Chemistry 281: 8161-8168). For T2D, the EGRF and EBOX matrices lacked a positional bias at SNP position (Figure 18b,d). Thus, the overrepresented homeobox TFBS matrices clearly distinguished T2D susceptibility loci from the EGRF and EBOX matrices overrepresented in asthma. To test whether these findings can be reproduced in a larger set of T2D-associated SNPs, the analysis was extended to all but one of the established T2D susceptibility loci. PMCA was carried out with the total of 47 loci comprising 1,453 SNPs in high LD, separating 486 complex from 978 non-complex SNP regions (Figure 24).
Positional bias analysis revealed confirmatory positional clustering at SNP position exclusively in T2D complex regions, with SNPs again co-localising with distinct homeobox TFBS matrices (Figure 18i,l,m), in strong contrast to the majority of TFBS matrices, which do not show a peak in the bias profile over the 1,000 bp sequences (Figure 18j). The matrices initially found in the eight diabetes risk loci made up a part of the overrepresented homeobox TFBS matrices found in the larger set of 47 T2D risk loci. Note that the peak 200 bp upstream of the SNP position observed in the initial analysis declined in the extended analysis, in contrast to the maximum peak at SNP position. Non-complex regions show no apparent positional overrepresentation of any analysed TFBS (Figure 18k). Overall, these results corroborate that the observed positional clustering of distinct homeobox TFBSs relative to SNP position might be a functional feature of T2D complex SNP regions. Of note, homeobox factors, which are known to be involved in embryonic and tissue developmental processes and pattern specification, have so far not been implicated in genetic predisposition to T2D. However, several homeobox TFs which may bind to the discovered overrepresented homeobox TFBS matrices can be functionally linked to T2D biology, such as PDX-1 (Jonsson (1994) Nature 371: 606-609) and NKX (Iype (2004) Molecular Endocrinology 18: 1363-1375) in β-cell biology, and numerous homeobox TFs that are differentially expressed in adipose tissue depending on body fat distribution (Gesta (2006) Proceedings of the National Academy of Sciences 103: 6676-6681; Dankel (2010) PLoS ONE 5: e11033). In the present study, striking evidence is provided that the homeobox TF PRRX1, whose predicted TFBS matrix belongs to the overrepresented CART family, is a TF at play in endogenous PPARG2 regulation via the rs4684847-surrounding complex region at the PPARG diabetes risk locus.
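
By way of illustration only, the positional bias analysis referred to above (TFBS match occurrence over 1,000 bp SNP regions within sliding 50 bp windows under a binomial distribution model) may be sketched in Python as follows. This is a minimal sketch under stated assumptions: the null model assumes that matches are uniformly distributed over the region, the step size of 10 bp and the function names are hypothetical, and the exact procedure is the one given in the Materials and Methods section.

```python
import math


def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed exactly over the upper tail."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def positional_bias(positions, region_len=1000, window=50, step=10):
    """Return (window centre, -log10(P)) pairs for one TFBS matrix family.

    positions: match positions (0-based, relative to the region start) pooled
    over all SNP regions; the SNP sits at region_len // 2.
    """
    n = len(positions)
    p_window = window / region_len  # chance of a match falling in a window under uniformity
    profile = []
    for start in range(0, region_len - window + 1, step):
        k = sum(start <= pos < start + window for pos in positions)
        p_val = min(1.0, binom_tail(k, n, p_window))
        score = -math.log10(max(p_val, 1e-300))  # guard against log10(0)
        profile.append((start + window // 2, score))
    return profile


# Toy data: an evenly spread background plus a cluster at the SNP position (500).
background = list(range(5, 1000, 20))
cluster = [500] * 30
profile = positional_bias(background + cluster)
best_centre, best_score = max(profile, key=lambda item: item[1])
```

With such toy data the highest score is obtained for a window covering the central position, which corresponds to the kind of central peak described for the homeobox TFBS matrix families.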

4. Sequence conservation versus functional conservation

In contrast to commonly used approaches that rely on sequence alignment, PMCA extends sequence conservation to functional conservation based on the analysis of TFBS modularity across several species, allowing for sequence variability. 487 complex and 978 non-complex SNP regions of the analysed set of 47 T2D susceptibility loci were tested for correlations with evolutionary constraint elements detected by the SiPhy-π method (Lindblad-Toh (2011) Nature 478: 476-482) (Figure 19a). It was observed that non-complex regions are depleted near constraint elements (relative to constraint element mid-positions as anchor). Conversely, complex SNP regions were found to be enriched near constraint elements, consistent with the 1.37-fold enrichment of disease-associated SNPs relative to HapMap SNPs observed in the original paper (Lindblad-Toh (2011) Nature 478: 476-482). Compared to non-complex SNP regions (35 of 978), a 1.88-fold enriched direct overlap of constraint elements with complex SNP regions was found (58 of 487 SNPs, hypergeometric distribution, right-sided, P = 2.4 x 10⁻⁹, Figure 19b). However, though enriched near constraint elements, the majority of complex regions did not directly overlap with constrained regions. Note that the complex SNP regions that have been experimentally characterised as cis-regulatory above and the cis-regulatory complex SNP regions at the PPARG gene locus (for experimental validation see below, Figure 20d) accumulate near constrained regions, though lacking a direct overlap. These findings emphasize the value of combining TFBS module analysis and evolutionary constraint for the computational identification of cis-regulatory variants located in non-coding genomic regions.
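
By way of illustration only, the right-sided hypergeometric test mentioned above may be sketched in Python as follows, using the counts stated in the text (58 of 487 complex and 35 of 978 non-complex SNP regions overlapping a constraint element). Two points are assumptions of this sketch rather than statements of the source: the fold enrichment is computed relative to the pooled overlap rate of all 1,465 regions, which is one reading that yields 1.88, and the population used for the test is taken to be these same 1,465 regions. Whether this set-up reproduces the reported P-value exactly is not asserted here.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when n items are drawn without replacement from N items, K of which are 'successes'."""
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return favourable / total  # exact integers; Python rounds correctly on division


complex_total, complex_overlap = 487, 58
noncomplex_total, noncomplex_overlap = 978, 35

N = complex_total + noncomplex_total      # all analysed SNP regions
K = complex_overlap + noncomplex_overlap  # regions overlapping a constraint element

# Fold enrichment of overlap among complex regions relative to the pooled rate (assumed baseline).
fold = (complex_overlap / complex_total) / (K / N)

# Right-sided test: probability of at least 58 overlaps among 487 regions drawn at random.
p_value = hypergeom_upper_tail(complex_overlap, N, K, complex_total)
```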

If the underlying hypothesis is correct, then one would expect variants with known genotype-dependent regulatory activity to be located in complex regions. PMCA was tested on recently reported cis-regulatory SNPs associated with diverse disease-related traits, including different forms of cancer, myocardial infarction, thyroid hormone resistance, hypercholesterolemia and plasma adiponectin levels (MYC (Pomerantz (2009) Nat Genet 41: 882-884), MDM2 (Post (2010) Cancer Cell 18: 220-230), PSMA6 (Ozaki (2006) Nat Genet 38: 921-925), THRB (Alberobello (2011) Journal of Translational Medicine 9: 144), SORT1 (Musunuru (2010) Nature 466: 714-719), and APM2/ACDC (Laumen (2009) Diabetes 58: 984-991)). In agreement with the functional proof of genotypic effects from the original publications, the herein presented analysis predicts all but one variant to be cis-regulatory. The best PMCA measures were obtained for the reported myocardial infarction risk variant, which has recently been shown to regulate hepatic SORT1 expression (Musunuru (2010) Nature 466: 714-719).

5. Detailed analysis of the PPARG

The PPARG gene gives rise to PPARγ1, PPARγ3, and PPARγ4 mRNA isoforms encoding the PPARγ1 protein, while PPARγ2 mRNA encodes the PPARγ2 protein (Figure 20a). PPARγ1 is widely expressed in different tissues, contrary to PPARγ2, whose expression is restricted almost exclusively to adipose tissue. PPARγ, the molecular target of the thiazolidinedione (TZD) class of anti-diabetic drugs, is considered the master regulator of adipocyte differentiation and glucose homeostasis (Miles (2003) American Journal of Physiology - Endocrinology And Metabolism 284: E618; Zhang (2004) Proceedings of the National Academy of Sciences of the United States of America 101: 10703-10708; Medina-Gomez (2005) Diabetes 54: 1706-1716). Indeed, several human genetic studies support a pivotal role of PPARγ in T2D (Dupuis (2010) Nat Genet 42: 105-116; Voight (2010) Nat Genet 42: 579-589; Zeggini (2008) Nat Genet 40: 638-645; Deeb (1998) Nat Genet 20: 284-287; Altshuler (2000) Nat Genet 26: 76-80; Knouff (2004) Endocrine Reviews 25: 899-918; Barroso (1999) Nature 402: 880-883). The common Pro12Ala coding variant, located in the PPARγ2-specific exon B, has been reproducibly associated with decreased risk of T2D (risk allele frequency = 0.93, CEU) (Dupuis (2010) Nat Genet 42: 105-116; Voight (2010) Nat Genet 42: 579-589; Zeggini (2008) Nat Genet 40: 638-645; Deeb (1998) Nat Genet 20: 284-287; Altshuler (2000) Nat Genet 26: 76-80). Paradoxically, the protective alanine allele results in decreased transcriptional activity of the PPARγ2 protein (Deeb (1998) Nat Genet 20: 284-287), though it is associated with increased insulin sensitivity. This may partially be explained by cis-regulatory variants in LD with the reported coding Pro12Ala variant, provoking allele-dependent regulation of endogenous PPARγ2 expression. Indeed, Pro12Ala is located in strong LD with 23 non-coding variants spanning all PPARG transcriptional start site regions (Figure 20a).
To narrow down potential cis-regulatory SNPs, PMCA was applied and six complex regions with R² > 0.7 were found (analysis of 47 T2D loci yielded different fractions of complex regions per LD block, mean = 34.2%, median = 29%, Figure 26). It was sought to evaluate the sensitivity of the herein described analysis and to examine predicted complex SNPs at the PPARG diabetes risk locus, while recognizing that the herein presented LD block analysis is not comprehensive. Overall, the allele-specific cis-regulatory activity of complex SNP regions significantly differed from that of non-complex regions (P = 0.02, Figure 28). To validate the herein described approach, the genotype-dependent regulation of endogenous PPARγ1 and PPARγ2 mRNA expression was evaluated. Isoform-specific qRT-PCR was performed in isolated human adipocyte progenitor cells derived from homozygous and heterozygous risk allele carriers. A 3.8-fold decrease of PPARγ2 transcript levels in risk allele carriers was found (P = 1.00 x 10⁻³, Figure 20b), whereas PPARγ1 expression was not affected by the genotype (Figure 20c). The decrease of PPARγ2 and the increase of total PPARγ expression levels might be explained by the coincidence of activating and repressing risk variants within the analysed haplotype. Indeed, either activating or repressing risk allele-dependent cis-regulatory effects on luciferase expression were found for all but one (rs35000407) predicted complex SNP region (Figure 20d). It is notable that each single variant showing decreased cis-regulatory activity, or a combined effect of several variants, could contribute to the decreased PPARγ2 expression that was observed in subjects at risk.

To prioritize the identified cis-regulatory SNPs for further follow-up, the LD block structure at the PPARG locus was compared in Asians and Europeans, since patterns of LD might differ between different ethnic populations. Among the identified cis-regulatory SNPs, evidence for different pairwise LD of the reported Pro12Ala variant and rs4684847 was obtained between the two populations (pairwise LD CEU R² = 1.0, CHB+JPT R² = 0.487). In line with the established ethnic disparities in diabetes prevalence, the inventors looked for an independent association signal for rs4684847 in the Singapore Diabetes Cohort Study (SDCS). Yet, due to the limited power of the study (917 T2D cases and 936 controls), the established association signal at the PPARG locus was not detected, neither for the reported GWAS Pro12Ala variant nor for the discovered cis-regulatory SNP rs4684847 (0.0348% and 0.0345% MAF, Pro12Ala and rs4684847, respectively). Awaiting the results of fine-mapping studies, the inventors sought to evaluate possible regulatory mechanisms underlying rs4684847 and the genotype-dependent down-regulation of PPARγ2 mRNA in subjects carrying the PPARG risk genotype. First, in reporter assays a corroborative 3.1-fold decrease of transcriptional activity was observed for the rs4684847 risk allele (Figure 20e, P = 1.81 x 10⁻³). This effect was independent of both the 5'- or 3'-orientation relative to the luciferase gene (P = 0.03) and the forward or reverse sequence orientation (P = 0.03), suggesting an enhancer function of this region which might be repressed for the risk allele. Cell type specificity was assessed, and the most pronounced allele-dependent transcriptional repression was found in 3T3-L1 adipocytes, with a weaker effect in C2C12 myocytes and Huh7 hepatocytes (Figure 29).
Pancreatic INS-1 β-cells and 293T cells lacked allele-dependent activity, which is consistent with the association of risk alleles at the PPARG locus in genetic studies with reduced insulin sensitivity, rather than insulin secretion (Voight (2010) Nat Genet 42: 579-589; Zeggini (2008) Nat Genet 40: 638-645). In EMSAs, when nuclear extract from mouse 3T3-L1 was incubated, risk allele-specific DNA-protein binding, which was almost completely abolished with rs4684847 non-risk allele sequence, was observed (Figure 20h).

6. The homeobox factor PRRX1 regulates endogenous PPARγ2 expression

To relate the in silico inferred distinct homeobox TFBS clustering at T2D complex SNPs to the functional factor that governs the repression of transcriptional activity and that might perturb PPARγ2 gene regulation, the rs4684847-surrounding TFBS modules were scrutinized. Consistent with the strong positional bias of homeobox CART matrices that was discovered for the predicted cis-regulatory complex regions among the analysed subset and the total set of 47 T2D susceptibility loci (-log(P) = 13.0; Figure 20f, Figure 18i), the consensus binding site of the CART matrix family member paired-related homeobox protein-1 (PRRX1) overlapping rs4684847 is shown (Figure 20g). To explore the functional relevance of the PRRX1 consensus sequence at rs4684847, affinity chromatography and LC-MS/MS were performed and 1.5-fold allele-dependent binding of the PRRX1 TF to the risk allele sequence of rs4684847 was observed (see Materials and Methods section, below). To test PRRX1 as a TF that is involved in allele-specific DNA-protein binding in EMSA experiments, nuclear extract of cells ectopically expressing PRRX1 protein was used. A 1.5-fold increased binding of PRRX1 to the risk allele compared to the non-risk allele sequence was observed (Figure 20i). Competition with PRRX1 consensus sequence confirmed an increased binding affinity of PRRX1 to the risk allele compared to the non-risk allele sequence. The addition of a PRRX1 antibody provided evidence that PRRX1 binds to the identified rs4684847-adjacent cis-regulatory region, with a 1.5-fold increased supershift of the risk allelic DNA-PRRX1 complex compared to the non-risk allele. To investigate whether the allele-specific recruitment of PRRX1 modulates gene expression, a series of reporter assays was carried out.
Targeting the predicted PRRX1 consensus binding site with a 5'-deletion, which perturbed the PRRX1 consensus sequence but did not affect the SNP position itself, was sufficient to abolish the allele-specific change in transcriptional activity and completely abrogated the risk-allelic repressing effect (Figure 20j). Further, reporter assays in which PRRX1 was overexpressed in 3T3-L1 adipocytes were performed, and a significantly increased repression of relative luciferase activity was found for the risk allele (P = 1.54 x 10⁻³), whereas the non-risk allele was not affected (Figure 20k). The inventors concluded that the diabetes-risk allele of rs4684847 (in perfect LD with the reported Pro12Ala lead SNP, CEU) is targeted by PRRX1, which may function as a transcriptional repressor at the identified cis-regulatory region, located 6.5 kb upstream of the PPARγ2-specific promoter (for correlation with evolutionary constraint elements (Lindblad-Toh (2011) Nature 478: 476-482) and PMCA performance on the rs4684847-surrounding region, see Figure 19c).

The inventors next sought to establish PRRX1 as the repressor responsible for the decreased PPARγ2 expression observed in risk allele carriers. To model the functional effects on endogenous PPARγ2 expression, PRRX1 siRNA knockdown experiments were performed. Silencing of PRRX1 in human adipose tissue cells homozygous for the risk allele significantly increased PPARγ2 mRNA expression for each individual, with a mean 2.5-fold increase (P = 0.043, Figure 20l), whereas PPARγ1 expression was not regulated by PRRX1. The isoform-specific inhibitory effect of PRRX1 on PPARγ2 expression was confirmed in the human adipocyte cell line SGBS (P = 0.0156, Figure 30), the homozygous risk allele genotype of which was confirmed by sequencing. In keeping with the fact that PPARγ2 stimulates adipocyte differentiation (Rosen (1999) Molecular Cell 4: 611-617), the effect of PRRX1 silencing on differentiation-induced lipid accumulation was assessed in SGBS adipocytes, and a clear promotion of adipogenic differentiation, assessed by Oil Red O lipid staining, was observed (Figure 20n). As PPARγ2 acts as a transcriptional activator of several adipocyte-specific target genes involved in lipid synthesis, lipid storage, insulin signalling and lipolysis, PPARγ2-specific target gene expression was evaluated in PRRX1 siRNA-transfected human adipose tissue cells of risk allele carriers, and it was found that target gene expression was significantly increased in PRRX1-silenced cells (Figure 20m). In human genetic studies, the PPARG risk genotype appears to be associated with higher fasting insulin and decreased insulin sensitivity (Voight (2010) Nat Genet 42: 579-589; Deeb (1998) Nat Genet 20: 284-287; Altshuler (2000) Nat Genet 26: 76-80). The inventors speculated that PRRX1 expression might associate with the reported metabolic phenotypes.
Recent findings highlight the importance of the metabolic context, especially BMI, in modulating the genotype effect at the PPARG locus (Heikkinen (2009) Cell Metabolism 9: 88-98). Indeed, a strong association between BMI and PRRX1 mRNA levels in mature adipocyte samples was found (Figure 20q), suggesting that the environmental factor BMI might partially modulate the susceptibility at the PPARG locus via PRRX1. Further studies are required to evaluate PRRX1 as a factor acting at the PPARG gene-environment interface. Importantly, the inventors also found a significant negative association between PRRX1 mRNA and insulin sensitivity, assessed by either HOMA-IR or TG/HDL ratio, both indices of insulin sensitivity (Figure 20o). These associations were partially driven by an effect of BMI; the association with TG/HDL ratio nevertheless remained significant after adjusting for BMI (Figure 31), corroborating a role of PRRX1 in T2D-related phenotypes. To conclude, the specific homeobox TFBS clustering for T2D risk loci inferred at the genetic level was taken one step forward at the cis-regulatory level, validating the homeobox TF PRRX1 as a repressor of PPARγ2 gene expression at the PPARG diabetes risk locus.

7. Discussion

In this study, an unbiased bioinformatic approach, namely PMCA, for the discovery of cis-regulatory SNPs in disease-associated loci is described. Central to this complementary analysis is a complexity assessment of TFBS modularity, represented by TFBS module constraint scores in candidate variant-surrounding regions. Through several lines of evidence, the predictive value of the herein described analysis is validated in terms of cis-regulatory functionality, leveraging GWAS-identified T2D susceptibility loci. It is of note that a varying number of complex SNP regions was found at a given locus, implying that combined effects of several cis-regulatory SNPs might govern disease association signals. The results of the currently ongoing transethnic fine-mapping studies in large worldwide consortia will help to tease apart the effect of LD, providing stronger genetic proof of causality in man for identified cis-regulatory variants. One important finding emerging from this work is the strong clustering of specific homeobox TFBSs within T2D complex regions at the SNP position. This distinctive positional bias exceeds chance expectation and distinguishes T2D from a different etiological trait, asthma. The inventors support this causal inference drawn from the herein described T2D analysis with knockdown of the homeobox factor PRRX1, identified at rs4684847, and link this unexpected TF to endogenous PPARγ2 expression. In the rising era of personalised medicine, identifying cis-regulatory SNPs and factors that control gene regulation lays the groundwork for capitalising on the huge amount of available data by detailing the biological upstream and downstream pathways (Green (2011) Nature 470: 204-213).
As gene expression is primarily controlled through the integration of signalling and regulatory networks converging on CRMs (Pennacchio (2006) Nature 444: 499-502; Arnone (1997) Development 124: 1851-1864), it is interesting to ask for common upstream and downstream mechanisms of the homeobox factors discovered here and their detailed role converging at T2D risk loci. However, certain issues for the herein described analysis will require careful consideration, for instance validation at genome-wide scale and mapping with tissue-specific eQTL and chromatin state data. Furthermore, the next step should be extending the herein described analysis to comprehensive haplotypes including indels, CNVs, structural variants and further SNPs, which should further inform on the genetic underpinnings of phenotypic diversity in humans. Whole genome and exome sequencing is required to cope with this concern (Cirulli (2010) Nat. Rev. Genet 11: 415-425), and it is expected that the usefulness of the analysis presented here will further increase as more sequencing data become available. The results presented herein indicate that the extension of sequence analysis to functional conservation may help integrate biological information with statistical signals in the discovery of both rare and common allelic players in Mendelian and common human diseases.

Table 4: Association of lead SNPs and predicted cis-regulatory SNPs with glucose traits in MAGIC.

Reported gene locus                 MTNR1B         MTNR1B         FTO            FTO            GCK            GCK            TCF7L2
Lead SNP                            rs1387153      -              rs9939609      -              rs4607517      -              rs7903146
Predicted cis-regulatory proxy SNP  -              rs10830956     -              rs1421085      -              rs2908289      *
Chr.                                11             11             16             16             7              7              10
Position (bp)                       92,681,013     92,673,828     53,820,527     53,800,954     44,235,668     44,223,942     114,758,349
R²                                  -              1.000          -              0.901          -              1.000          -
D'                                  -              1.000          -              0.965          -              1.000          -
EA                                  t              t              a              c              a              a              t
OA                                  c              c              t              t              g              g              c
EA frequency                        0.27           0.28           0.46           0.46           0.2            0.2            0.28
FG β                                0.060          0.062          0.006          0.007          0.062          0.063          0.023
FG SD                               0.004          0.004          0.004          0.004          0.005          0.005          0.004
FG P-value                          6.60 x 10⁻⁴⁵   8.63 x 10⁻⁵¹   9.53 x 10⁻²    5.29 x 10⁻²    4.56 x 10⁻³⁶   1.06 x 10⁻³⁶   2.80 x 10⁻⁸
FI β                                -0.003         -0.002         0.014          0.015          0.006          0.005          -0.012
FI SD                               0.005          0.004          0.004          0.004          0.005          0.005          0.004
FI P-value                          0.439          0.715          1.89 x 10⁻⁴    7.48 x 10⁻⁵    0.225          0.383          4.64 x 10⁻³
HOMA-IR β                           0.006          0.008          0.015          0.016          0.015          0.013          -0.010
HOMA-IR SD                          0.005          0.005          0.004          0.004          0.006          0.005          0.005
HOMA-IR P-value                     0.219          0.075          3.18 x 10⁻⁴    9.55 x 10⁻⁵    6.32 x 10⁻³    1.36 x 10⁻²    3.36 x 10⁻²
HOMA-B β                            -0.029         -0.03          0.008          0.008          -0.025         -0.026         -0.020
HOMA-B SD                           0.004          0.004          0.003          0.003          0.005          0.005          0.004
HOMA-B P-value                      1.13 x 10⁻¹⁴   2.90 x 10⁻¹²   2.35 x 10⁻²    1.97 x 10⁻²    4.84 x 10⁻⁸    6.00 x 10⁻⁷    1.39 x 10⁻⁹

PMCA results for complexity classification and confirmed cis-regulatory activity. Chromosomal position on GRCh37/hg19, R² and D' are displayed referring to the indicated lead SNP at each locus based on the 1000 Genomes Pilot dataset. Data were derived from publicly available MAGIC results (Dupuis (2010) Nat Genet 42: 105-116): EA = effect allele, OA = other allele, EAF = effect allele frequency, FG = fasting glucose, FI = fasting insulin, HOMA-IR = homeostasis model assessment of insulin resistance, HOMA-B = homeostasis model assessment of β-cell function. *Lead SNP is predicted cis-regulatory proxy SNP.

Materials and Methods

1. Definition of LD blocks

Lead SNPs were derived from reported T2D or asthma GWASs. For each lead SNP, LD blocks were defined based on 1000 Genomes Pilot 1 CEU data (A map of human genome variation from population-scale sequencing (2010) Nature 467: 1061-1073) (R² > 0.7, distance limit 500 kb, NCBI build HG19) using the SNAP viewer tool (Johnson (2008) Bioinformatics 24: 2938-2939), Broad Institute. Analysis was performed with SNP sets for 8 (initial analysis) and 47 (extended analysis) GWAS-derived T2D susceptibility loci and 8 asthma susceptibility loci.
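The LD block definition above can be illustrated with a minimal sketch. It assumes that pairwise R² values and positions of candidate proxies are already available (for example, as exported from an LD lookup tool); the `Proxy` record, the function name and the SNP labels are hypothetical and are not part of the SNAP tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proxy:
    snp_id: str     # hypothetical identifier
    position: int   # bp, on the same chromosome as the lead SNP
    r2: float       # pairwise R² with the lead SNP


def ld_block(lead_position, proxies, min_r2=0.7, max_distance=500_000):
    """Keep proxies with R² > min_r2 that lie within max_distance bp of the lead SNP."""
    return [
        p for p in proxies
        if p.r2 > min_r2 and abs(p.position - lead_position) <= max_distance
    ]


# Toy data: only snp_a passes both the R² and the distance criterion.
candidates = [
    Proxy("snp_a", 1_000_100, 0.95),
    Proxy("snp_b", 1_200_000, 0.65),   # R² too low
    Proxy("snp_c", 1_600_000, 0.90),   # more than 500 kb away
]
block = ld_block(1_000_000, candidates)
```

The thresholds default to the values stated in the text (R² > 0.7, 500 kb); whether the distance limit is inclusive is an assumption of this sketch.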

2. Phylogenetic Module Complexity Analysis

PMCA is a complementary approach that combines phylogenetic conservation with a complexity assessment of TFBS modularity within a candidate variant-surrounding region, as represented by transcription factor binding site module constraint scores (reflecting functional conservation).

2.1 Search for orthologous sequences

The human genomic sequence (120 bp with the SNP at central position, GRCh37 / hg19) around each SNP was extracted. Orthologous sequences were searched for each 120 bp SNP-surrounding region of the human founder sequence in 16 closely and distantly related vertebrate species (RegionMiner, Genomatix, Munich, Germany). A proprietary algorithm was used for identification of orthologous regions in a target species. First, homologous loci were searched in the target organisms. In case no homologous loci could be identified, the flanking genes (up to 20 gene loci in both directions) were considered to identify a syntenic region in the target species. To be assigned as syntenic region, two homologous genes in the target organism need to be on the same contig and must show the same relative strand orientation as the genes in the source organism. Second, the input sequence is aligned to the syntenic region using a Smith-Waterman alignment. The regions had to fulfil the following alignment criteria: the alignment contained a highly conserved 50 bp stretch; the alignment had to be shorter than 1.5-fold the length of the input region and a sufficient overall alignment quality had to be reached.

Reference genome: Human (Homo sapiens)

Aligned genomes: Rhesus macaque (Macaca mulatta)

Common chimpanzee (Pan troglodytes)

Mouse (Mus musculus)

Rat (Rattus norvegicus)

Rabbit (Oryctolagus cuniculus)

Horse (Equus caballus)

Dog (Canis lupus familiaris)

Cow (Bos taurus)

Pig (Sus scrofa)

Opossum (Monodelphis domestica)

Platypus (Ornithorhynchus anatinus)

Zebrafish (Danio rerio)

Chicken (Gallus gallus)

Western clawed frog (Xenopus tropicalis)

Zebra finch (Taeniopygia guttata)

The search for orthologous regions yields a specific data set of sequences for each SNP-surrounding human founder sequence (ortholog set). The variable number of input sequences within an ortholog set (orthologous regions) found for a human reference region raises the problem of comparability. To address this problem, the inventors assessed the minimum number ζ of input sequences that contain a common module in a certain percentage of the total input sequences by increasing the quorum constraint from 50 % of input sequences to 100 % in 10 % increments (percent sequence constraint). It is noted that the variable "x" which is used herein (e.g. in the First Exemplified Mode for Carrying Out the Invention, supra) is identical to the variable "ζ" which is used in the Materials and Methods section. Similarly, the variable "y" which is used herein (e.g. in the First Exemplified Mode for Carrying Out the Invention, supra) is identical to the variable "γ" which is used in the Materials and Methods section.

This approach may increase bias towards poorly conserved regions, i.e., regions with ζ < 6. Therefore, the same sequence sets were analysed in parallel, and a fixed number of sequences were set as a quorum constraint (sequence number constraint).
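To make the two quorum schemes concrete, the following sketch lists the quorum levels for an ortholog set of a given size. Rounding the percentage up to the next whole sequence and enforcing a minimum of two sequences are assumptions of this illustration; the text does not state how fractional sequence counts are handled.

```python
def percent_quorums(n_sequences, start=50, stop=100, step=10):
    """Map each percent level (50, 60, ..., 100) to a minimum sequence count ζ.

    Assumption: fractional counts are rounded up, and ζ is never below 2.
    """
    return {
        pct: max(2, (n_sequences * pct + 99) // 100)   # integer ceiling
        for pct in range(start, stop + 1, step)
    }


def number_quorums(n_sequences):
    """Fixed sequence-number quorums ζ = 2 .. n_sequences."""
    return list(range(2, n_sequences + 1))


# Example: an ortholog set of seven sequences (founder plus six orthologs).
by_percent = percent_quorums(7)
by_number = number_quorums(7)
```

For a set of seven sequences, the 60 % and 70 % levels map to the same count (five sequences), which illustrates why small ortholog sets are less finely resolved under the percent sequence constraint.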

2.2. Assessment of TFBS modularity (Functional Conservation)

For each SNP-surrounding region, the identified orthologous sequences (ortholog sets) were analysed for the occurrence of complex TFBS modules common to a defined subset ζ of input sequences using the FrameWorker tool (Genomatix, Munich, Germany). A TFBS module describes the occurrence of two or more TFBSs in a defined orientation and distance range. The detection of conserved TFBS modules relies on certain FrameWorker parameter settings.

Quorum constraint: The minimum number ζ of input sequences required to contain a common TFBS module (2 ≤ ζ ≤ 16), assessed by both sequence number constraint (absolute number of input sequences) and percent sequence constraint (percentage of input sequences) (see Materials and Methods 2.1). The number ζ of sequences required to share the common features (TFBSs, TFBS modules, TFBSs in modules) was raised stepwise.

Element constraint: The number γ of TFBSs in the module (1 < γ ≤ 10). The threshold γ of TFBSs required to fulfil the percent sequence constraint and sequence number constraint was raised stepwise.

Distance constraint: The maximum distance variance between two TFBSs within a TFBS module was set to 10 bp.

PMCA is performed over all possible sequence number constraints, the corresponding percent sequence constraints and, for each of these quorum constraints, all possible element constraints. In an initial step, common TFBSs were identified. A TFBS was considered a 'common TFBS' if the number ζ of input sequences within the ortholog set containing that TFBS on either strand (sense or antisense) reached the quorum (percent sequence constraint and sequence number constraint). These common TFBSs were used to build the potential TFBS modules. The number of common TFBSs per module was increased stepwise (2 ≤ γ ≤ 10), and each resulting TFBS module was checked for occurrence in at least ζ sequences (fulfilling the quorum constraint and the distance constraint). This step was repeated until the TFBS modules could not be extended further or the maximum number of TFBSs γ was reached.
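The quorum logic described above can be sketched in simplified form. This is not the FrameWorker algorithm: each sequence is reduced to a set of TFBS family labels, and strand, order and the 10 bp distance constraint are ignored. The labels "A", "B" and "C" and both function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def common_tfbs(ortholog_set, quorum):
    """TFBS families present in at least `quorum` sequences of the ortholog set."""
    counts = Counter(family for sites in ortholog_set for family in set(sites))
    return {family for family, n in counts.items() if n >= quorum}


def common_modules(ortholog_set, quorum, gamma):
    """Combinations of `gamma` common families that co-occur in >= `quorum` sequences.

    Simplification: co-occurrence only; orientation and distance are not checked.
    """
    candidates = sorted(common_tfbs(ortholog_set, quorum))
    modules = []
    for combo in combinations(candidates, gamma):
        support = sum(1 for sites in ortholog_set if set(combo) <= set(sites))
        if support >= quorum:
            modules.append(combo)
    return modules


# Toy ortholog set: four sequences, each given as its set of TFBS families.
toy_set = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B"}]
pairs_at_quorum_2 = common_modules(toy_set, quorum=2, gamma=2)
```

Restricting module candidates to families that are already common at the given quorum mirrors the two-step procedure in the text (common TFBSs first, then module extension) and keeps the number of combinations small.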

Complexity assessment of TFBS modularity in a candidate variant-surrounding region was performed based on three different measures for a defined quorum:

Common TFBSs (Classification Strategy 2)

The total counts of TFBSs in human founder sequences that are common to a defined quorum constraint (minimum number ζ of input sequences per total input sequences within ortholog set) were summed up both for increasing percentage quorum constraints within ortholog set (percent sequence constraint) from 50% of the input sequences to 100% in 10% increments,

$$\sum_{\zeta=2}^{16} \text{TFBSs in } \frac{\min\ \zeta \text{ sequences}}{\text{total input sequences}},$$

and for increasing fixed numbers of input sequences within ortholog set (sequence number constraint),

$$\sum_{\zeta=2}^{\max 16} \text{TFBSs in } \min\ \zeta \text{ sequences}.$$

γ-TFBS modules (Classification Strategy 3)

The total counts of γ-TFBS modules (2 ≤ γ ≤ 10) in human founder sequences that are common to a defined quorum constraint were summed up both for increasing percentage quorum constraints within ortholog set (percent sequence constraint) from 50% of the input sequences to 100% in 10% increments,

$$\sum_{\zeta=2}^{16} \sum_{\gamma=2}^{10} \gamma\text{-element modules in } \frac{\min\ \zeta \text{ sequences}}{\text{total input sequences}},$$

and for increasing fixed numbers of input sequences within ortholog set (sequence number constraint),

$$\sum_{\zeta=2}^{\max 16} \sum_{\gamma=2}^{10} \gamma\text{-element modules in } \min\ \zeta \text{ sequences}.$$

TFBSs in γ-TFBS modules (Classification Strategy 1)

The total counts of TFBSs in human founder sequences forming γ-TFBS modules (2 ≤ γ ≤ 10) that are common to a defined quorum constraint were summed up both for increasing percentage quorum constraints within ortholog set (percent sequence constraint) from 50% of the input sequences to 100% in 10% increments,

$$\sum_{\zeta=2}^{16} \sum_{\gamma=2}^{10} \text{TFBSs in } \gamma\text{-element modules in } \frac{\min\ \zeta \text{ sequences}}{\text{total input sequences}},$$

and for increasing fixed numbers of input sequences within ortholog set (sequence number constraint),

$$\sum_{\zeta=2}^{\max 16} \sum_{\gamma=2}^{10} \text{TFBSs in } \gamma\text{-element modules in } \min\ \zeta \text{ sequences}.$$
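As an illustration of how such a summed measure accumulates over quorum levels, the following sketch computes a simplified version of the common-TFBS count under the sequence number constraint. It counts TFBS families rather than individual matrix matches and ignores positions, which are assumptions made for brevity; the function name and labels are hypothetical.

```python
def common_tfbs_score(founder, orthologs):
    """Sum, over fixed quorums ζ = 2..N, the founder TFBS families shared by >= ζ sequences.

    `founder` and each element of `orthologs` are collections of TFBS family labels.
    N is the size of the ortholog set including the founder sequence.
    """
    sequences = [set(founder)] + [set(o) for o in orthologs]
    total = 0
    for zeta in range(2, len(sequences) + 1):
        for family in set(founder):
            support = sum(1 for s in sequences if family in s)
            if support >= zeta:
                total += 1
    return total


# Toy example: "A" is shared by all three sequences, "B" by two of them.
score = common_tfbs_score({"A", "B"}, [{"A", "B"}, {"A"}])
```

In the toy example, both families count at ζ = 2 and only "A" counts at ζ = 3, so features conserved across more sequences contribute to more terms of the sum.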

The TFBS modularity assessment raises the following issues:

- Ortholog set: different ortholog set sizes for candidate variants might introduce an artificial bias. For instance, in contrast to larger sets, a set of only three sequences allows only two combinations of sequences that contain the founder sequence and fulfil the 50% quorum. Conversely, a region with only primate sequences as orthologs shows a much higher, probably overestimated, score.

- Nucleotide composition of human founder sequence: certain TFBSs might be favoured merely due to the sequence's nucleotide composition, e.g. high GC content may predict additional SP1 matches, which might lead to overestimation of the variant-surrounding sequence.

In order to cope with these concerns, the inventors used simulation on highly similar sequence sets generated by shuffling, obtaining an estimate of the probabilities associated with the random occurrence of common TFBSs, TFBS modules and TFBSs in modules in the analysed ortholog sequence sets (for details of the p-estimate simulations, see below).

2.3. Assessment by simulations (p-estimates) and scoring criteria

The classification criteria were used to rank SNP-surrounding regions. Scoring criteria for separating complex SNP regions from non-complex SNP regions were obtained by simulations with random sequences derived from the orthologous sets. From each orthologous set, 1000 random sets were derived by shuffling the sequences. Each sequence is traversed with a 10 bp window in steps of 10 bp. Within the 10 bp window, nucleotide positions are exchanged randomly, leaving the local nucleotide distribution nearly untouched while changing the exact sequence. Each of these 1000 sets was subjected to FrameWorker and the numbers for the described PMCA measures were obtained. The corresponding observations from analysis of the original ortholog sets were compared with the random data for each classification. Thus, an estimate for random occurrence was obtained for each classification,

$$\text{p-est.} = \frac{\#\,\text{occurrences}(\text{random} \geq \text{original})}{\#\,\text{random sets}},$$

i.e. the number of occurrences where the value from a random set is equal to or larger than that obtained from analysis of the original ortholog set, divided by the number of random sets (1000). This was taken as an estimate for the probability (p-estimate, further on called p-est.) of obtaining the classification value of the orthologous set by chance.
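The shuffling scheme and the p-estimate can be sketched as follows. The scoring of each shuffled set (performed with FrameWorker in the text) is not reproduced here; the sketch only shows the window-wise shuffling and the ratio of random values that reach the original value. Function names are hypothetical.

```python
import random


def shuffle_in_windows(sequence, window=10, rng=None):
    """Shuffle nucleotides within consecutive non-overlapping windows.

    Each window keeps its nucleotide composition; only the order changes.
    """
    rng = rng or random.Random()
    chunks = []
    for start in range(0, len(sequence), window):
        chunk = list(sequence[start:start + window])
        rng.shuffle(chunk)
        chunks.append("".join(chunk))
    return "".join(chunks)


def p_estimate(original_value, random_values):
    """Fraction of random sets whose value is equal to or larger than the original."""
    hits = sum(1 for value in random_values if value >= original_value)
    return hits / len(random_values)


# A seeded generator makes the shuffle reproducible for illustration.
example = shuffle_in_windows("ACGTACGTAAGGCCTTACGTAC", rng=random.Random(0))
```

With 1000 random sets, `p_estimate` takes values in steps of 0.001, which sets the resolution of the scoring criteria derived from it.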

An additional score (here referred to as combined score) was obtained by multiplying the p-est. of each classification, resulting in a scale ranging from 0 (completely random in all classifications) to nine (one or fewer random occurrences in each of the three classifications):

combined score = -log10[(p-est. common TFBS) * (p-est. TFBS modules) * (p-est. TFBS in modules)].

Using this combined scoring approach (under the assumption of independence), SNPs that have a high p-est. in any classification were eliminated.
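The stated range of the combined score follows from the 1000 random sets: if a p-est. of zero is treated as one occurrence in 1000 (an assumption of this sketch, suggested by the phrase "one or less random occurrences"), the smallest product is 0.001³ and the score is 9. A minimal sketch, with a hypothetical function name:

```python
import math


def combined_score(p_common, p_modules, p_in_modules, n_random=1000):
    """-log10 of the product of the three p-estimates.

    Assumption: each p-estimate is floored at 1 / n_random so that a value
    of zero does not produce an infinite score.
    """
    floor = 1.0 / n_random
    return -sum(
        math.log10(max(p, floor))
        for p in (p_common, p_modules, p_in_modules)
    )


best = combined_score(0.0, 0.001, 0.0005)   # all three at or below the floor
worst = combined_score(1.0, 1.0, 1.0)       # completely random in all classifications
```

Summing logarithms is equivalent to taking the logarithm of the product and avoids underflow for very small values.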

To separate complex versus non-complex SNP-surrounding regions for further experimental validation, the following cut-off criteria were used:

# common TFBS (percent sequence constraint) < 0.15;

# common TFBS (sequence number constraint) < 0.075;

combined score (sequence number constraint) > 6.5.

To rank the list of the resulting complex SNP regions, the # TFBS in modules (sequence number constraint) was used.
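The cut-off criteria can be expressed as a simple predicate. This sketch assumes that the two "# common TFBS" thresholds apply to the corresponding p-estimates and that all three criteria must hold simultaneously; neither assumption is stated explicitly in the text. Names and toy values are hypothetical.

```python
def is_complex(p_common_percent, p_common_number, combined_score_number):
    """Apply the three cut-off criteria jointly (assumed logical AND)."""
    return (
        p_common_percent < 0.15
        and p_common_number < 0.075
        and combined_score_number > 6.5
    )


def rank_complex(regions):
    """Sort (region_id, tfbs_in_modules) pairs by descending TFBS-in-modules count."""
    return sorted(regions, key=lambda item: item[1], reverse=True)


# Toy values for three hypothetical regions.
ranked = rank_complex([("region_1", 12), ("region_2", 30), ("region_3", 21)])
```

Ranking in descending order is also an assumption; it reflects the idea that a higher count of TFBSs in modules indicates higher complexity.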

3. Positional Bias Analysis

3.1 Calculation of the TFBS positional bias

The positional bias of a TFBS was calculated as outlined for the assessment of de novo detected motifs (Hughes (2000) Journal of Molecular Biology 296: 1205-1214). Sequences of 1,000 bp with the respective SNP at central position were extracted from the human genome build (NCBI Build 37, hg19) for all complex SNP regions and non-complex SNP regions. The sequences were scanned by MatInspector (Quandt (1995) Nucleic Acids Research 23: 4878-4884; Cartharius (2005) Bioinformatics 21: 2933-2942) (Genomatix, Munich, Germany) for the presence of TFBS matrix family matches with respect to SNP position (192 TFBS matrix families; 182 vertebrate families plus 10 other general families, Genomatix Matrix Library version 8.4). Match positions on the sequences were scanned using overlapping 50 bp sliding windows in steps of 10 bp. Given that the total number of matches for a given TFBS matrix family is regarded as a series of independent individual trials, each of which may find a match anywhere in the sequence, the positional bias for a scan window becomes the cumulative binomial probability of obtaining the exact number of matches found there up to the total number of matches in the sequence. The probability for the occurrence of a single match within a scan window, independent of any sequence constraints, is given as the ratio of the window length to the sequence length. The positional bias (P) was calculated for each matrix family and each window. For graphical visualisation, -log(P) was plotted over the mid-positions of the scan windows.
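The cumulative binomial probability described above can be sketched for one matrix family as follows. Match positions are assumed to be pooled over all sequences of a region set and given as 0-based offsets within the 1,000 bp sequence; the function names are hypothetical, and the match detection itself (MatInspector in the text) is not reproduced.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): probability of k up to n matches in a window."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def positional_bias(match_positions, seq_len=1000, window=50, step=10):
    """Return (window mid-position, P) for each overlapping scan window."""
    n = len(match_positions)
    p_single = window / seq_len   # chance that one match falls into a given window
    profile = []
    for start in range(0, seq_len - window + 1, step):
        k = sum(1 for pos in match_positions if start <= pos < start + window)
        profile.append((start + window // 2, binomial_tail(k, n, p_single)))
    return profile


# Toy example: three matches that all coincide at the central (SNP) position.
profile = positional_bias([500, 500, 500])
strongest = min(p for _, p in profile)
```

Windows without any match yield P = 1, so only windows in which matches accumulate produce a peak in the -log(P) profile.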

3.2. Simulation of the TFBS positional bias with random SNP regions and bias distribution

As the basis of the positional bias calculation, the binomial model assumes that the probability of finding a TFBS match in the scan window is independent of the sequence composition. To exclude random occurrence of bias, 100 sets of SNPs with the same set size as the respective complex SNP region set were randomly selected from NCBI dbSNP, and 1,000 bp of genomic sequence with centred SNP position were extracted as described for the complex SNP regions. The sequences were scanned with MatInspector, and the TFBS positional bias was calculated for each random set of SNPs. For visualisation, the positional bias profiles were plotted as described, and the result for complex SNP regions was overlaid for individual TFBS families.

Density distribution of positional bias P-values was assessed given that the binomial model for the positional bias calculation does not consider whether the underlying sequence composition favours or disfavours the detection of individual TFBS families. To address whether such an effect may occur, the -log(P) obtained for all considered TFBS matrix families is shown as a histogram plot with an additional explicit marking of the positions of (a) the -log(P) in complex SNP regions, (b) the 95 % and (c) the 99 % quantiles of the -log(P) distribution for the random SNPs.

3. Correlation of PMCA identified regions with evolutionary constraint regions.

The 487 SNPs classified as complex and 978 SNPs classified as non-complex from 47 T2D susceptibility loci were correlated to evolutionary constraint regions according to the method and data from Lindblad-Toh (2011) Nature 478: 476-482. The RegionMiner-GenomeInspector tool (Genomatix, Munich) was used for this task. From the mid position (anchor position; 0 on the x axis of the plot) of each constraint region (determined by the SiPhy-π method (Lindblad-Toh (2011) Nature 478: 476-482)), 500 bp in upstream and downstream direction were scanned for positions overlapping with the 120 bp of the analysed SNP regions. For each position relative to the anchor, the overlaps are counted (correlations), and these correlations are plotted against the position relative to the anchor. A preferred distance of complex or non-complex SNPs to constraint elements would be visible as an enrichment at defined positions relative to the anchor position. The 120 bp extended SNP regions were used in this analysis since PMCA also used these regions in determining the TFBS module complexity. The use of 120 bp regions further has the effect of smoothing the correlation graph, which would more closely resemble a bar graph if exact SNP positions were used, since accumulation of overlaps is more likely for extended regions than for single positions. The midpoint of constraint regions was chosen as the anchor since constraint regions do not all have the same size.
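The overlap counting described above can be sketched for a single chromosome. This is not the GenomeInspector implementation; it assumes that the 120 bp SNP region spans 60 bp upstream to 59 bp downstream of the SNP, and all names are hypothetical.

```python
def overlap_profile(anchors, snp_positions, flank=500, region=120):
    """Count, per offset from -flank..+flank, how often a SNP region covers that offset.

    `anchors` are midpoints of constraint regions; `snp_positions` are SNP
    coordinates on the same chromosome. Index 0 of the result is offset -flank.
    """
    half = region // 2
    counts = [0] * (2 * flank + 1)
    for anchor in anchors:
        for snp in snp_positions:
            first = max(snp - half, anchor - flank)
            last = min(snp + half - 1, anchor + flank)
            for pos in range(first, last + 1):
                counts[pos - anchor + flank] += 1
    return counts


# Toy example: one constraint region whose midpoint coincides with one SNP.
counts = overlap_profile([1000], [1000])
```

Because every SNP contributes a run of 120 consecutive offsets rather than a single one, the resulting profile is smoothed in the way the text describes.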

4. Assessment of SNP to TSS distance annotations

SNPs were analysed by the Annotation and Statistics task of the RegionMiner tool (Genomatix, Munich) with the option next neighbour analysis. This yields the transcript starts nearest to each SNP, upstream and downstream and on either strand of the DNA. For visualisation, all distances were used where a transcript start was annotated within 30,000 bp downstream of a SNP. To directly compare these distances for complex and non-complex SNPs, density histograms with a bin size of 500 bp were used (Figure 27).
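A density histogram of this kind can be sketched as follows, assuming that the SNP-to-TSS distances (in bp, downstream) have already been obtained from the annotation step. Placing a distance of exactly 30,000 bp into the last bin is an assumption of this sketch; the function name is hypothetical.

```python
def tss_distance_histogram(distances, max_distance=30_000, bin_size=500):
    """Density histogram of SNP-to-TSS distances in bins of `bin_size` bp.

    Distances outside 0..max_distance are ignored. Densities are scaled so
    that density * bin_size sums to 1 over all bins.
    """
    n_bins = max_distance // bin_size
    counts = [0] * n_bins
    kept = 0
    for d in distances:
        if 0 <= d <= max_distance:
            counts[min(d // bin_size, n_bins - 1)] += 1
            kept += 1
    if kept == 0:
        return [0.0] * n_bins
    return [c / (kept * bin_size) for c in counts]


# Toy distances: the last value lies beyond 30,000 bp and is dropped.
density = tss_distance_histogram([0, 499, 500, 30_000, 40_000])
```

Normalising to densities rather than raw counts allows the complex and non-complex SNP sets, which differ in size, to be compared on the same scale.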

5. Culture of cell lines

The insulinoma cell line INS-1 was cultured in RPMI medium (supplemented with 10 % FBS (fetal bovine serum), 100 mM sodium pyruvate, penicillin/streptomycin and 50 μM 2-mercaptoethanol). Huh7 hepatoma, C2C12 myoblast and 3T3-L1 preadipocyte cell lines were cultured in DMEM medium (supplemented with penicillin/streptomycin and 10 % FBS). The SGBS (Simpson-Golabi-Behmel syndrome) cell strain was cultured as previously described (Fischer-Posovszky (2008) Obes Facts 1: 184-189) in DMEM/Ham's F12 (1:1) medium (supplemented with 10% FCS, 17 μM biotin, 33 μM pantothenic acid and 1 % penicillin/streptomycin). All cells were maintained at 37°C and 5% CO2. To promote adipose differentiation of the mouse preadipocyte cell line 3T3-L1, medium was additionally supplemented with 250 nM dexamethasone and 0.5 mM isobutyl-methylxanthine for the first three days and 66 nM insulin throughout the entire differentiation period. C2C12 myoblasts were cultured in DMEM medium containing 10% horse serum to induce differentiation. The SGBS preadipocyte cell strain was grown to confluence. For induction of adipocyte differentiation, cells were cultured in serum-free MCDB-131/DMEM/Ham's F12 (1:2) medium supplemented with 11 μM biotin, 22 μM pantothenic acid, 1 % penicillin/streptomycin, 10 μg/ml human transferrin, 66 nM insulin, 100 nM cortisol, 1 nM triiodothyronine, 20 nM dexamethasone, 500 μM 3-isobutyl-1-methyl-xanthine (Serva, Germany) and 2 μM rosiglitazone (Alexis, Germany). 72 hours after induction of differentiation, the cells were harvested in TRIzol reagent (Invitrogen, Germany). Unless other suppliers are mentioned, all cell culture materials were obtained from Invitrogen (Germany) and all chemicals from Sigma-Aldrich (Germany).

6. Isolation, culture and differentiation of primary human adipose tissue cells

Primary human preadipocytes were obtained from lipoaspiration or surgical excision of subcutaneous tissue, and were isolated and cultured as previously described (Skurk (Humana Press 2012) edited by R. R. Mitry & R. D. Hughes, pp. 215-226) with some modification. Briefly, after isolation cells were cultured in 6-well plates in DMEM/F12 (1:1) medium (supplemented with 10% FCS and 1 % penicillin/streptomycin) for 18 h, followed by expansion in DMEM/F12 medium (supplemented with 2.5% FCS, 1 % penicillin/streptomycin, 17 μM biotin, 33 μM pantothenic acid, 132 nM insulin (Sigma, Germany), 10 ng/ml EGF (R&D, Germany), and 1 ng/ml FGF (R&D, Germany)) until confluence. The cells were harvested in TRIzol reagent (Invitrogen, Germany). Primary human preadipocytes for siRNA experiments were cultured in 12-well plates (200,000 cells / well) with DMEM GlutaMax (supplemented with 10% FCS and 1 % penicillin/streptomycin). For induction of adipocyte differentiation, the medium was supplemented with 100 nM cortisol, 66 nM insulin, 10 μg/ml transferrin, 33 μM biotin, 17 μM pantothenate, 1 nM T3 and 10 μM rosiglitazone the day after seeding (day 0). Cells were harvested in buffer RLT (Qiagen, Germany). Mature human adipocytes were isolated by fractionation of surgically excised subcutaneous adipose tissue as previously described (Veum (2011) Int J Obes). A total of 22 subjects undergoing elective surgeries (hernia repairs or laparoscopic sleeve gastrectomy) were included. The patients (14 women and eight men) were between 25 and 69 years of age, had body mass index (BMI, kg/m²) ranging from 18 to 52, homeostatic model assessment of insulin resistance (HOMA-IR, fasting glucose mmol/L x fasting insulin mIU/L / 22.5) ranging from 0.41 to 8.71, and triglyceride (TG) / high density lipoprotein (HDL) ratio ranging from 0.20 to 8.98. All cell culture material was obtained from Invitrogen, chemicals from Sigma-Aldrich, unless otherwise stated.
Primary human preadipocyte cells and SVF cells were genotyped for rs1801282 and rs4684847 with a concordance rate of > 99.5% using the MassARRAY system with iPLEX™ chemistry (Sequenom, USA), as previously described (Holzapfel (2008) European Journal of Endocrinology 159: 407-416). Informed consent was obtained from all patients before the surgical procedure. The study protocol was approved by the ethical committee of the Faculty of Medicine of the Technical University of Munich, Germany or the Regional Committee for Medical Research Ethics (REK), Norway.
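The two insulin sensitivity indices used for the patient characterisation above follow directly from the stated definitions. A minimal sketch with hypothetical function names and toy input values (not patient data):

```python
def homa_ir(fasting_glucose_mmol_l, fasting_insulin_miu_l):
    """HOMA-IR = fasting glucose (mmol/L) x fasting insulin (mIU/L) / 22.5."""
    return fasting_glucose_mmol_l * fasting_insulin_miu_l / 22.5


def tg_hdl_ratio(triglycerides, hdl):
    """Ratio of triglycerides to HDL; both values must share the same unit."""
    return triglycerides / hdl


# Toy values chosen for easy arithmetic: 5.0 x 9.0 / 22.5 = 2.0
example_homa_ir = homa_ir(5.0, 9.0)
```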

7. Gene knock-down by siRNA

Primary human preadipocytes in 12-well plates were cultured with differentiation medium as described above. On the first day after plating (day 1) or the following day (day 2), cells were treated with 25nM non-targeting (NT) control siRNA or ON-TARGETplus human siRNA SMARTpool targeting PRRX1 (Dharmacon, USA) using HiPerFect (Qiagen, Germany) according to the manufacturer's protocol. After 72 hours, the cells were harvested in buffer RLT (Qiagen, Germany) and frozen at -80°C. For SGBS cells, confluent 6-well plates (day 0) were treated to induce adipocyte differentiation and simultaneously transfected with the same siRNA and protocol used for primary human preadipocytes. 72 hours after induction of differentiation, the cells were harvested in TRIzol reagent (Invitrogen, Germany) and frozen at -80°C.

8. Quantitative RT-PCR

RNA from SGBS cells and primary human preadipocytes was isolated with TRIzol reagent (Invitrogen, Germany), followed by the NucleoSpin Kit (Macherey-Nagel, Germany). The High Capacity cDNA Reverse Transcription kit (Applied Biosystems, Germany) was used for transcription of total RNA into cDNA. qPCR analysis of the human PPARγ1 and PPARγ2 isoform transcripts (NCBI Accession: NM_138712, NM_015869) was performed with the quantitative PCR SYBR-Green ROX Mix (ABgene, Germany) on the Mastercycler Realplex system (Eppendorf, Germany), with an initial activation of 15 min at 95°C followed by 40 cycles of 15 sec at 95°C, 30 sec at 60°C and 30 sec at 72°C. Mean target mRNA level was calculated relative to the concentration of hypoxanthine phosphoribosyltransferase (HPRT) based on technical duplicates.

RNA of cultured primary human adipose tissue cells for siRNA experiments and from isolated mature human adipocytes was extracted using the RNeasy Lipid Tissue Mini Kit (Qiagen, Germany). cDNA was synthesized from 250 ng total RNA per sample using the Superscript® VILO™ cDNA Synthesis Kit (Invitrogen, Germany). cDNA for standard curves was synthesized from RNA extracted from in vitro differentiated human preadipocytes. qPCR was performed using the LightCycler480 Probes Master kit and the LightCycler480 rapid thermal cycler system (Roche, Germany). cDNA was amplified using target specific primers and Universal ProbeLibrary (UPL) probes. qPCR analysis of the human PPARγ1 and PPARγ2 isoform transcripts was performed with the SYBRGreen PCR mix (Biorad, Germany). mRNA concentrations were calculated based on the amplification efficiency for each specific primer/probe set, determined by standard curves of 1:5 cDNA dilutions. Mean target mRNA level was calculated relative to the concentration of IPO8 (primary cell culture) or TBP mRNA (primary human adipocytes) based on technical triplicates. For the isolated adipocytes, mRNA for PRRX1 and TBP was measured in duplex.
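The standard-curve quantification described above can be sketched as follows. This is a minimal illustration, not the instrument software actually used: the function names, the synthetic Cq values and the reuse of a single curve for target and reference are assumptions made for brevity (the text specifies a separate curve per primer/probe set).

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def standard_curve(relative_conc, cq):
    """Fit Cq against log10(relative concentration) of a dilution series."""
    logs = [math.log10(c) for c in relative_conc]
    slope, intercept = fit_line(logs, cq)
    # Amplification factor per cycle; 2.0 corresponds to perfect doubling.
    efficiency = 10 ** (-1.0 / slope)
    return slope, intercept, efficiency

def mean_quantity(cqs, slope, intercept):
    """Convert technical replicate Cq values to quantities and average them."""
    quantities = [10 ** ((cq - intercept) / slope) for cq in cqs]
    return sum(quantities) / len(quantities)

# Synthetic 1:5 dilution series with ideal doubling (hypothetical values).
dilutions = [1.0, 0.2, 0.04, 0.008]
cq_values = [20.0 + k * math.log2(5) for k in range(4)]
slope, intercept, efficiency = standard_curve(dilutions, cq_values)

# Technical triplicates for a target and a reference gene (hypothetical values).
target_cq = [22.0, 22.0, 22.0]
reference_cq = [20.0, 20.0, 20.0]
relative_level = (mean_quantity(target_cq, slope, intercept)
                  / mean_quantity(reference_cq, slope, intercept))
```

In practice each primer/probe set would get its own fitted slope and intercept, so that differences in amplification efficiency between target and reference are accounted for.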

For allele specific qPCR analysis of the human PPARγ2 isoform transcript in primary human preadipocytes (heterozygous for rs1801282 and rs4684847), mRNA was reverse transcribed into cDNA using random hexamers. Next, the region surrounding the SNP rs1801282 was amplified using the cDNA primers, with the cDNA primer as reverse primer. Genomic DNA regions surrounding the SNP rs1801282 were amplified using the genomic DNA primers. Annealing temperatures for genomic DNA PCR and RT-PCR were 59°C and 60°C, respectively. PCR products were analyzed on an agarose gel and purified by gel extraction using the Wizard SV Gel and PCR Clean-Up System (Promega, Germany). Using equal amounts of amplicons from cDNA and genomic DNA, primer extension assays were carried out with Snapshot forward (51°C annealing temperature) and reverse (54°C annealing temperature) primers using the ABI Prism SNaPshot Kit. cDNA synthesis and primer extension assays were performed with kits from Applied Biosystems. For amplification the GoTaq DNA Polymerase Kit (Promega, Germany) was used. The reaction products were analyzed by capillary gel electrophoresis on an ABI 3700 DNA Analyzer and the electropherograms were analyzed with the Gene Mapper 4.0 software. Allelic genomic DNA ratios were used to normalize the cDNA ratios. Means and SD were calculated with JMP7 (SAS, USA).
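The normalisation of cDNA allelic ratios by genomic DNA allelic ratios can be illustrated with a short sketch. The peak heights are made-up numbers and the function names are hypothetical; the source only states that genomic DNA ratios were used to normalise the cDNA ratios and that means and SD were calculated.

```python
from statistics import mean, stdev

def allelic_ratio(peak_allele_1, peak_allele_2):
    """Ratio of the two allele peaks from one primer extension reaction."""
    if peak_allele_1 <= 0 or peak_allele_2 <= 0:
        raise ValueError("peak heights must be positive")
    return peak_allele_1 / peak_allele_2

def normalised_cdna_ratio(cdna_peaks, gdna_peaks):
    """cDNA allelic ratio divided by the genomic DNA allelic ratio.

    In a heterozygous sample the genomic ratio is expected to be 1:1, so it
    corrects for assay bias between the two alleles.
    """
    return allelic_ratio(*cdna_peaks) / allelic_ratio(*gdna_peaks)

def summarise(replicates):
    """Mean and sample SD of normalised ratios over replicate experiments."""
    ratios = [normalised_cdna_ratio(cdna, gdna) for cdna, gdna in replicates]
    return mean(ratios), stdev(ratios)

# Hypothetical peak heights: ((cDNA allele 1, cDNA allele 2), (gDNA allele 1, gDNA allele 2))
replicates = [
    ((1200.0, 800.0), (1000.0, 1000.0)),
    ((500.0, 1000.0), (1000.0, 1000.0)),
]
mean_ratio, sd_ratio = summarise(replicates)
```

A normalised ratio that departs from 1 suggests allele-specific expression of the transcript.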

Primers for UPL probe PCRs were designed at https://www.roche-applied-science.com/sis/rtpcr/upl/index.jsp?id=UP020300 according to the optimized protocol (Roche, Germany). Isoform specific primers for PPARG mRNA (MWG, Germany) were designed using NCBI primer blast software (http://www.ncbi.nlm.nih.gov/tools/primer-blast/) and optimised for secondary structures using the net primer analysis software (http://www.premierbiosoft.com/netprimer/).

mRNA expression P-values in the siRNA experiments were calculated using the Wilcoxon signed rank test. In all other mRNA expression experiments, P-values were calculated using a paired t-test. Homeostasis model assessment (HOMA-IR) was used as an index of insulin resistance, calculated by the following formula: fasting serum insulin (mIU/L) x fasting glucose (mmol/L) / 22.5. The distribution of PRRX1 mRNA expression was assessed using the Shapiro-Wilk test and considered normal. Association of PRRX1 mRNA expression with BMI, HOMA-IR, and TG/HDL ratio was investigated using linear regression, using natural log-transformed BMI, HOMA-IR, and TG/HDL ratio and adjusting for age and sex (Figure 20o). In a second step, BMI was additionally included as a covariate in the HOMA-IR and TG/HDL analyses (Figure 31). All statistical analyses were done using the Statistical Software R, version 2.14.2.
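As a worked illustration of the HOMA-IR formula and the natural log transformation applied before regression, a minimal sketch is given below. The function names and the example values are assumptions; the regression itself (performed in R according to the text) is not reproduced here.

```python
import math

def homa_ir(fasting_insulin_miu_per_l, fasting_glucose_mmol_per_l):
    """HOMA-IR = fasting insulin (mIU/L) x fasting glucose (mmol/L) / 22.5."""
    return fasting_insulin_miu_per_l * fasting_glucose_mmol_per_l / 22.5

def log_transformed_traits(bmi, insulin, glucose, triglycerides, hdl):
    """Natural log transformation of the three traits used as regression variables."""
    return {
        "ln_bmi": math.log(bmi),
        "ln_homa_ir": math.log(homa_ir(insulin, glucose)),
        "ln_tg_hdl": math.log(triglycerides / hdl),
    }

# Hypothetical subject: insulin 10 mIU/L and glucose 4.5 mmol/L give HOMA-IR 2.0.
example_homa = homa_ir(10.0, 4.5)
example_traits = log_transformed_traits(25.0, 10.0, 4.5, 1.5, 1.5)
```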

Table 19: Primers and probes used for qPCR.

Target gene | Forward primer (SEQ ID NO) | Reverse primer (SEQ ID NO) | UPL probe
ADRB3 | tggctgggacagctagaga (31) | gacagcaaggcatgagagc (32) | 64
ATGL | ctgccgggagaagatcac (33) | agagggtggtcagcaggtc (34) | 1
CD36 | cctccttggcctgatagaaa (35) | gtttgtgcttgagccaggtt (36) | 9
CIDEC | gaggacctcctcctcaaggt (37) | agggcttggaagtactcttctg (38) | 65
DGAT1 | actaccgtggcatcctgaac (39) | ataaccgggcattgctca (40) | 9
FABP4 | cctttaaaaatactgagatttccttca (41) | ggacacccccatctaaggtt (42) | 72
FASN | caggcacacacgatggac (43) | cggagtgaatctgggttgat (44) | 11
GLUT4 | ctgtgccatcctgatgactg (45) | ccagggccaatctcaaaa (46) | 18
IPO8 | cggattatagtctctgaccatgtg (47) | tgtgtcaccatgttcttcagg (48) | 48
LIPE | agaagatgtcggagcccata (49) | ggtcaggttcttgagggaatc (50) | 23
LPL | Tgctcgtgctgactctgg (51) | gggcaaatttactttcgatgtc (52) | 70
PLIN1 | tttctgcctgaggagacactc (53) | ccatcctcgctcctcaagt (54) | 64
PRRX1 | gtggagcagcccatcgta (55) | tgggagggacgaggatct (56) | 15
SCD | cctagaagctgagaaactggtga (57) | acatcatcagcaagccaggt (58) | 82
TBP | Universal ProbeLibrary Human TBP Gene Assay (Roche, Germany)

Isoform | Forward primer (SEQ ID NO) | Reverse primer (SEQ ID NO) | Dye
PPARγ1 | cgtggccgcagatttga (59) | agtgggagtggtcttccattac (60) | SYBR Green
PPARγ2 | gaaagcgattccttcactgat (61) | tcaaaggagtgggagtggtc (62) | SYBR Green
PRRX1 | gtggagcagcccatcgta (63) | tgggagggacgaggatct (64) | SYBR Green
HPRT | TGAAAAGGACCCCACGAAG (65) | AAGCAGATGGCCACAGAACTAG (66) | SYBR Green

Allelic PCR PPARγ2 | Forward primer (SEQ ID NO) | Reverse primer (SEQ ID NO)
genomic DNA | TCCATGCTGTTATGGGTGAA (67) | GGAGCCATGCACAGAGATAA (68)
cDNA | TCCATGCTGTTATGGGTGAA (69) | GATGCAGGCTCCCATTTGAT (70)
snapshot | CTCTGGGAGATTCTCCTATTGAC (71) | TATCAGTGAAGGAATCGCTTTCTG (72)
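Primer sequences such as those in Table 19 are commonly sanity-checked for length and GC content before use. The sketch below shows such a check on the PRRX1 primer pair from the table; the helper names are hypothetical and the check is an illustration, not part of the described protocol.

```python
COMPLEMENT = {"a": "t", "t": "a", "g": "c", "c": "g",
              "A": "T", "T": "A", "G": "C", "C": "G"}

def gc_fraction(sequence):
    """Fraction of G and C bases in a primer sequence (case-insensitive)."""
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)

def reverse_complement(sequence):
    """Reverse complement of a DNA sequence, preserving letter case."""
    return "".join(COMPLEMENT[base] for base in reversed(sequence))

# PRRX1 primer pair as listed in Table 19.
prrx1_forward = "gtggagcagcccatcgta"
prrx1_reverse = "tgggagggacgaggatct"

summary = {
    name: {"length": len(seq), "gc": gc_fraction(seq)}
    for name, seq in (("forward", prrx1_forward), ("reverse", prrx1_reverse))
}
```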

9. Luciferase expression constructs

To characterise the SNP-adjacent regions for allele-specific transcriptional activity, genomic sequences surrounding the respective SNPs were cloned into a basal pGL4.22 promoter vector. For the promoter construct, a 752 bp thymidine kinase (TK) promoter was cloned upstream of the firefly luciferase gene into the EcoRV and BglII sites of the pGL4.22 firefly luciferase reporter vector (Promega, Germany). SNP-adjacent regions were extracted from the human genome build (NCBI Build 36, hg18). SNP-adjacent regions were commercially synthesised as plasmid vectors (Mr. Gene, Germany) and as double-stranded oligonucleotides (MWG, Germany). Complementary oligonucleotides were annealed and purified on a 12% polyacrylamide gel. SNP-adjacent regions were subcloned either upstream of the TK promoter into the KpnI and SacI sites of the pGL4.22-TK vector or downstream of the luciferase gene into the BamHI site of the pGL4.22-TK vector. To further test for enhancer activity, SNP-adjacent regions were subcloned downstream of the luciferase gene in both 5'-to-3' and 3'-to-5' orientations into the BamHI site. The QuickChange Site-Directed Mutagenesis Kit (Stratagene, Germany) was used to alter single nucleotides (for the respective SNP, NCBI dbSNP). The orientation and integrity of each luciferase vector was confirmed by sequencing (MWG, Germany).

10. Luciferase expression assays

Huh7 cells (96-well plate, 1.1 x 10⁴ / well) were transfected one day after plating at approximately 90% confluence, INS-1 cells (12-well plate, 8 x 10⁴ / well) were transfected three days after plating at approximately 70% confluence, 3T3-L1 cells (12-well plate, 8 x 10⁴ / well) were transfected at day eight after the induction of differentiation at approximately 80% confluence, and C2C12 cells (12-well plate, 2 x 10⁵ / well) were transfected at day four after induction of differentiation at approximately 90% confluence. Huh7 cells were transfected with 0.5 μg of the respective firefly luciferase reporter vector and 1 μl Lipofectamine 2000 transfection reagent (Invitrogen, Germany), differentiated C2C12 myocytes were transfected with 1 μg of the respective pGL4.22-TK construct and 2 μl Lipofectamine reagent, and both INS-1 β-cells and differentiated 3T3-L1 adipocytes were transfected with 2 μg of the respective pGL4.22-TK construct and 2 μl Lipofectamine reagent. The firefly luciferase constructs were co-transfected with the ubiquitin promoter-driven renilla luciferase reporter vector pRL-CMV (Promega, Germany) to normalise the transfection efficiency (Hughes (2000) Journal of Molecular Biology 296: 1205-1214). Twenty-four hours after transfection, the cells were washed with PBS and lysed in 1x passive lysis buffer (Promega, Germany) on a rocking platform for 30 min at room temperature. Firefly and renilla luciferase activity were measured (substrates D-luciferin and Coelenterazine from PJK, Germany) using a Luminoscan Ascent microplate luminometer (Thermo, Germany) and a Sirius tube luminometer (Berthold, Germany), respectively. The ratios of firefly luciferase expression to renilla luciferase expression were calculated and normalised to the TK promoter control vector.
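The two-step normalisation in the last sentence above (firefly to renilla, then to the TK promoter control) can be written as a short sketch. The function name and the luminometer readings are hypothetical and serve only to illustrate the arithmetic.

```python
def relative_luciferase_activity(firefly, renilla, control_firefly, control_renilla):
    """Firefly/renilla ratio of a construct, normalised to the TK control vector.

    Step 1 corrects for transfection efficiency (firefly divided by renilla).
    Step 2 expresses the result relative to the TK promoter control.
    """
    if min(firefly, renilla, control_firefly, control_renilla) <= 0:
        raise ValueError("luminescence readings must be positive")
    construct_ratio = firefly / renilla
    control_ratio = control_firefly / control_renilla
    return construct_ratio / control_ratio

# Hypothetical readings in relative light units.
activity = relative_luciferase_activity(
    firefly=5000.0, renilla=1000.0,
    control_firefly=2000.0, control_renilla=800.0,
)
```

A value above 1 indicates that the SNP-adjacent region increases reporter activity relative to the TK promoter alone, and a value below 1 indicates a decrease.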

In the validation experiments for both the PMCA-predicted sorting of cis-regulatory from non-cis-regulatory SNPs at the transcriptional level for four T2D and for the comprehensive analysis of the PPARG gene locus (Figure 17i, Figure 28, respectively), the allele-dependent change in reporter gene activity was calculated from 3-10 independent experiments for each analysed SNP (ratio of the respective allelic activities). The quantified change in luciferase activity comparing the risk and non-risk alleles ( | risk / non-risk or non-risk / risk ratio | > 1 ) was calculated for each SNP as mean and standard deviation. P-values are derived from linear mixed models comparing the binary logarithm of the quantified ratios in allelic luciferase activity between SNPs in complex versus SNPs in non-complex regions.
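The allele-dependent change described above is a fold change oriented so that it is at least 1, followed by a binary logarithm. The sketch below illustrates this with made-up activities; the function names are assumptions, and the linear mixed model used for the P-values is not reproduced.

```python
import math
from statistics import mean, stdev

def allelic_fold_change(risk_activity, non_risk_activity):
    """Return risk/non-risk or non-risk/risk, whichever is >= 1."""
    if risk_activity <= 0 or non_risk_activity <= 0:
        raise ValueError("activities must be positive")
    ratio = risk_activity / non_risk_activity
    return ratio if ratio >= 1 else 1 / ratio

def summarise_snp(experiments):
    """Mean, sample SD and log2 values of the fold change over experiments.

    `experiments` is a list of (risk activity, non-risk activity) pairs,
    one pair per independent experiment.
    """
    changes = [allelic_fold_change(risk, non_risk) for risk, non_risk in experiments]
    return mean(changes), stdev(changes), [math.log2(c) for c in changes]

# Three hypothetical independent experiments for one SNP.
mean_change, sd_change, log2_changes = summarise_snp(
    [(2.0, 1.0), (1.0, 4.0), (3.0, 1.0)]
)
```

The same orientation of the ratio applies to the EMSA quantification, where the decadic rather than the binary logarithm is used.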

11. Electrophoretic mobility shift assay (EMSA)

EMSA was performed with Cy5-labelled oligonucleotide probes. Respective SNP-adjacent region oligonucleotides were commercially synthesised containing either the major or the minor variant (MWG, Germany). Cy5-labelled forward strands were annealed with non-labelled reverse strands, and the double-stranded probes were separated from single-stranded oligonucleotides on a 12% polyacrylamide gel. Complete separation was visualised by DNA shading. The efficiency of the labelling was tested by a dot plot, which confirmed that all of the primers were labelled similarly. For analysis of overexpressed PRRX1 protein in EMSA, a PRRX1 expression vector (pCMV-PRRX1-flag, provided by M. Kern) and the empty expression vector as control were transiently transfected into 293T cells using Lipofectamine 2000 (Invitrogen, Germany). 24 h after transfection, the transfected cells were harvested as total native protein. Nuclear protein extracts from each analyzed cell line were prepared with adapted protocols based on the method described by Schreiber ((1989) Nucleic Acids Research 17: 6419). The supernatant was recovered and stored at -80°C. DNA-protein binding reactions were conducted in 50 mM Tris-HCl, 250 mM NaCl, 5 mM MgCl₂, 2.5 mM EDTA, 2.5 mM DTT, 20% v/v glycerol and the appropriate concentrations of poly(dI-dC). For DNA-protein interactions, 3-5 μg of nuclear protein extract from the respective cell line was incubated for 10 min on ice, and Cy5-labelled genotype-specific DNA probe was added for another 20 min. For competition experiments, an 11-, 33- and 100-fold molar excess of unlabelled probe as competitor was included with the reaction prior to addition of Cy5-labelled DNA probes. Binding reactions were incubated for 20 min at 4°C. For supershift experiments, cell extracts were pre-incubated with 1 μl of antibody (αPRRX1, provided by M. Kern) or 0.4 μg of control IgG (Santa Cruz Biotechnology, USA) for 20 min at 4°C.
The DNA-protein complexes were resolved on a non-denaturing 5.3% polyacrylamide gel in 0.5x Tris/borate/EDTA buffer. All EMSAs were performed in triplicate or more, and fluorescence was visualised with a Typhoon TRIO+ imager (GE Healthcare, Germany). For comparison of genotype-specific DNA-binding activity, the intensity of the DNA-protein complexes was quantified for both the major and minor allelic DNA-protein interactions using ImageJ Software (http://rsbweb.nih.gov/ij/). Quantification was performed in quintuplicate for each single EMSA, and the change in quantified allele-dependent fluorescence intensity was calculated (ratio of the respective allelic activity). To validate the PMCA-predicted sorting of cis-regulatory from non-cis-regulatory SNPs (Figure 17h), the quantified change in fluorescence comparing the risk and non-risk alleles ( | non-risk / risk or risk / non-risk ratio | > 1 ) is calculated for each SNP as mean and standard deviation. 3-10 independent EMSA experiments were conducted per SNP and P-values are derived from linear mixed models using the decadic log of the quantified change in fluorescence comparing the major and minor alleles.

12. DNA-Protein affinity chromatography, LC-MS/MS and label free quantification.

To identify DNA-binding proteins interacting with the cis-regulatory SNP rs4684847 at the PPARG gene locus, DNA-protein affinity chromatography, LC-MS/MS and label free quantification were performed. Affinity chromatography. Streptavidin magnetic beads (Dynabeads M-280, Invitrogen) were coupled with allele-specific biotinylated DNA probes (the risk and non-risk allele, respectively, of rs4684847 at central position in a 42 bp sequence probe) overnight, washed, equilibrated with 1x binding buffer (10 mM Tris-HCl, 1 mM MgCl₂, 0.5 mM EDTA, 0.5 mM DTT, 4% v/v glycerol) and incubated with nuclear extracts for 20 min (binding buffer with additional 50 mM NaCl and 0.01% CHAPS), and poly(dI-dC) was added. Supernatant was recovered and beads were washed three times in binding buffer without CHAPS, followed by stepwise elution of bound protein from the magnetic beads using increasing concentrations of NaCl. All steps were performed at 4°C. Input protein, wash supernatants and eluates were assayed in EMSA to confirm the binding activity. Mass spectrometry. Eluates revealing allele-specific DNA-protein binding activity were subjected to tryptic digest and mass spectrometry was performed as described before (Hauck (2010) Molecular & Cellular Proteomics 9: 2292-305; Merl et al., 2012, in press). Briefly, eluted samples were precipitated and protein pellets were dissolved in ammonium bicarbonate followed by tryptic digestion. LC-MS/MS analysis was performed on an Ultimate3000 nano HPLC system (Dionex, USA) coupled online to an LTQ OrbitrapXL mass spectrometer (Thermo Fisher Scientific, Germany) by a nano spray ion source. Peptides were automatically injected and loaded onto the trap column at a flow rate of 30 μl/min in 5% buffer B (98% ACN/0.1% formic acid in HPLC-grade water) and 95% buffer A (2% ACN/0.1% FA in HPLC-grade water).
After 5 min, the peptides were eluted from the trap column and separated on the analytical column by a 120 min gradient from 5 to 31% of buffer B at 300 nl/min flow rate followed by a short gradient from 31 to 95% buffer B in 5 min. Between each sample, the gradient was set back to 5% buffer B and left to equilibrate for 20 minutes. From the MS prescan, the 10 most abundant peptide ions were selected for fragmentation in the linear ion trap if they exceeded an intensity of at least 200 counts and if they were at least doubly charged. During fragment analysis a high-resolution (60,000 full-width half maximum) MS spectrum was acquired in the Orbitrap with a mass range from 200 to 1500 Da. Label-free quantification. The mass spectrometry data were analyzed using the software environment MaxQuant (version 1.2.0.13) (Cox (2008) Nat Biotech 26: 1367-1372). Proteins were identified by searching MS and MS/MS data of peptides against the Ensembl mouse protein database (Version NCBI m37; 56410 sequences; 26202967 residues) in a regular and a decoy mode. Carbamidomethylation of cysteines was set as fixed modification. The minimum peptide length was specified to be 6 amino acids. Mass tolerance in MS mode was set to 10 ppm and fragment mass tolerance was set to 0.5 Da. The maximum false peptide discovery rate was specified as 0.01. Label free quantification was carried out in MaxQuant as described previously (Merl et al., 2012 in press). Briefly, peptides were matched across different sample runs based on high mass accuracy and nonlinearly remapped retention times. Feature-matching between raw files was enabled, using a retention time window of two minutes. 'Multiplicity' was set to one and 'Discard unmodified counterpart peptides' was unchecked. Data were filtered for reverse identifications (false positives), contaminants and 'only identified by site', and only identifications based on more than one peptide and present in all samples were included.
Averaged LF quantification (LFQ) intensity values were used to calculate protein risk versus non-risk allele ratios. At the end, the analysis revealed an allele-specific increased binding of the homeobox TF PRRX1 at the risk-allele of the rs4684847-adjacent region.
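The filtering and ratio calculation described above can be sketched as follows. The record layout, field names and intensity values are hypothetical and do not correspond to MaxQuant's actual output columns; the sketch only mirrors the stated rules (no reverse hits, contaminants or 'only identified by site' entries, more than one peptide, present in all samples, ratio of averaged LFQ intensities).

```python
from statistics import mean

def keep_protein(entry):
    """Apply the filters described in the text to one protein record."""
    if entry["reverse"] or entry["contaminant"] or entry["only_by_site"]:
        return False
    if entry["peptides"] <= 1:
        return False
    # "Present in all samples" is modelled as a non-zero intensity everywhere.
    return all(value > 0 for value in entry["lfq_risk"] + entry["lfq_non_risk"])

def allele_ratios(entries):
    """Risk versus non-risk ratio of averaged LFQ intensities per protein."""
    return {
        entry["protein"]: mean(entry["lfq_risk"]) / mean(entry["lfq_non_risk"])
        for entry in entries
        if keep_protein(entry)
    }

def record(protein, peptides, lfq_risk, lfq_non_risk, reverse=False):
    """Build one hypothetical protein record."""
    return {"protein": protein, "peptides": peptides, "reverse": reverse,
            "contaminant": False, "only_by_site": False,
            "lfq_risk": lfq_risk, "lfq_non_risk": lfq_non_risk}

# Made-up intensities for illustration only.
example = [
    record("PRRX1", 3, [8.0e6, 1.0e7], [2.0e6, 4.0e6]),
    record("single_peptide_hit", 1, [5.0e6, 5.0e6], [5.0e6, 5.0e6]),
    record("missing_in_one_sample", 4, [5.0e6, 0.0], [5.0e6, 5.0e6]),
    record("decoy_hit", 2, [5.0e6, 5.0e6], [5.0e6, 5.0e6], reverse=True),
]
ratios = allele_ratios(example)
```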

While the invention has been illustrated and described in detail in the drawings and foregoing description, such illustration and description are to be considered illustrative or exemplary and not restrictive. It will be understood that changes and modifications may be made by those of ordinary skill within the scope and spirit of the following claims. In particular, the present invention covers further embodiments with any combination of features from different embodiments described above and below.

The invention also covers all further features shown in the Figs. individually, although they may not have been described in the foregoing or following description. Also, single alternatives of the embodiments described in the figures and the description and single alternatives of features thereof can be disclaimed from the subject matter of the invention.

Furthermore, in the claims the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. A single unit may fulfill the functions of several features recited in the claims. The terms "essentially", "about", "approximately" and the like in connection with an attribute or a value particularly also define exactly the attribute or exactly the value, respectively. A computer program may be stored/distributed on a suitable medium, such as an optical storage medium or a solid-state medium supplied together with or as part of other hardware, but may also be distributed in other forms, such as via the Internet or other wired or wireless telecommunication systems. In particular, e.g., a computer program can be a computer program product stored on a computer readable medium which computer program product can have computer executable program code adapted to be executed to implement a specific method such as the method according to the invention. Any reference signs in the claims should not be construed as limiting the scope.

Identification of a Patient Having an Increased Risk or Being Prone to Develop Type 2 Diabetes or Having Type 2 Diabetes (T2D)

The present invention in another aspect relates to methods for the identification of a patient having an increased risk or being prone to develop type 2 diabetes or having type 2 diabetes (T2D). Such methods may particularly comprise the determination of the presence of at least one polynucleotide contained in a biological sample of the patient to be identified, said polynucleotide containing a specific mutation. The other aspect of the present invention also relates to means such as kits, probes or primers which are useful for performing the methods described and provided herein. These methods are based on findings of the inventors when applying the methods of the first aspect of the invention to genes related to type 2 diabetes.

The recent advance in genome-wide association studies (GWAS) brought along an explosion of compelling single nucleotide polymorphism (SNP) associations with diverse human traits, from body height and body mass index (BMI) through to complex multifactorial diseases like cancer, Alzheimer disease and metabolic disorders (Manolio (2010) N.Engl.J.Med 363: 166-176). Type 2 diabetes (T2D) is a common metabolic disorder resulting from the complex interaction of environmental factors acting on a susceptible genetic background. Although the heritability of T2D is one of the best deciphered among common diseases, with many identified susceptibility loci reproducibly associated with fasting plasma glucose (FPG), impaired early insulin secretion and risk for T2D (Prokopenko (2008) Trends Genet 24: 613-621; Dupuis (2010) Nat.Genet 42: 105-116; Voight (2010) Nat.Genet 42: 579-589; Bonnefond (2010) Trends Mol.Med 16: 407-416), the important causal SNP conferring the increased T2D risk remains unknown, as for the majority of complex diseases (Select: GWAS Gets Functional (2010) Cell 143: 177; Baker (2010) Nature 467: 1135-1138). Signals emerging from GWAS scans tag sizeable genomic regions, implicating a complex correlation structure of variants, often spread over several genes, within which the causal variants must reside (Musunuru (2010) Nature 466: 714-719). A primary challenge in the genomic era remains the identification of the precise etiological SNP and the corresponding regulated, causative nucleotide position that mediates trait susceptibility.

The level of gene expression is a highly heritable trait (Stranger (2007) Nat.Genet 39: 1217-1224) and genetic variation affecting gene expression is one of the major contributors to phenotypic variation in human populations and therefore to mediating disease susceptibility and adaptive evolution (Visel (2009) Nature 461: 199-205; Cookson (2009) Nat.Rev.Genet 10: 184-194).

Recent studies imply evolutionary constraint as a tool in search of regulatory elements in the genome (Pennacchio (2006) Nature 444: 499-502). However, genome-wide comparisons revealed that some highly conserved non-coding sequences do not exhibit the expected pattern for the known types of transcriptional regulatory elements (Siepel (2005) Genome Res 15: 1034-1050; Maston (2006) Annu.Rev.Genomics Hum.Genet 7: 29-59) and several non-conserved regions have been shown to be functionally relevant in driving specific gene expression (Balhoff (2005) Proc.Natl.Acad.Sci.U.S.A 102: 8591-8596; Fisher (2006) Science 312: 276-279; Birney (2007) Nature 447: 799-816; Roh (2007) Genome Res 17: 74-81). Indeed, increasing evidence suggests that sequence alignment does not provide a reliable approach to identify cis-regulatory regions (Maston (2006) Annu.Rev.Genomics Hum.Genet 7: 29-59; Balhoff (2005) Proc.Natl.Acad.Sci.U.S.A 102: 8591-8596; Cameron (2005) Proc.Natl.Acad.Sci.U.S.A 102: 11769-11774; Fisher (2006) Science 312: 276-279; Narlikar (2009) Brief.Funct.Genomic.Proteomic 8: 215-230; van Loo (2009) Brief.Bioinform 10: 509-524; Visel (2009) Nature 461: 199-205; Roh (2007) Genome Res 17: 74-81). Clustering and complex modular organization of cis-regulatory elements, necessary for interaction of transcription factors, is a hallmark of regulatory regions (Arnone (1997) Development 124: 1851-1864). TFBSs (transcription factor binding sites) that are part of a module are separated by sequences of variable length and identity, thereby allowing a certain degeneracy within an orthologous regulatory region (van Loo (2009) Brief.Bioinform 10: 509-524).

However, although millions of SNPs are listed in current databases, causal SNPs associated with diseases such as type 2 diabetes (T2D) are not identifiable. Accordingly, the identification of gene-regulatory etiological variants revealing the susceptibility architecture of major biomedical traits like T2D would be highly beneficial for diagnostic purposes. The present invention provides surprisingly efficacious means and methods for the elucidation of diagnostic tools, such as the provision of markers for disorders and diseases. As non-limiting examples, the present invention provides means and methods for the elucidation of specific mutations (e.g. single nucleotide polymorphisms (SNPs)) that are useful as diagnostic tools in the determination of metabolic disorders, like diabetes, in particular type 2 diabetes (T2D).

Accordingly, with the means and methods provided herein specific mutations (i.e. single nucleotide polymorphisms (SNP)) have been identified having a regulatory and, thus, a causal role in the development of type 2 diabetes (T2D). Accordingly, the other aspect of the present invention relates to a method for identifying a patient having an increased risk or being prone to develop type 2 diabetes or having type 2 diabetes, said method comprising determining such a specific mutation (i.e. SNP) in a patient sample. The other aspect of the present invention also relates to means such as kits, primers and probes which are useful for detecting said specific mutations (i.e. SNP) in a patient sample for identifying a patient having an increased risk or being prone to develop type 2 diabetes or having type 2 diabetes. Also, methods for identifying specific mutations (i.e. SNP) having a regulatory and, thus, a causal role in the development of type 2 diabetes (T2D) are described and provided herein. Accordingly, herein disclosed is also a method for the identification of a patient having an increased risk or being prone to develop type 2 diabetes or having type 2 diabetes or additional chronic diseases as umbrella term (criteria of the metabolic syndrome: fasting plasma glucose, levels of other metabolites, insulin resistance and obesity and potentially traits that are associated with diabetes like immune dysfunctions, atherosclerosis etc.), said method comprising the determination of the presence of at least one polynucleotide contained in a biological sample of said patient, said polynucleotide being selected from the group consisting of

(a) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 1 (comprising rs10203174), wherein the nucleotide T at position 60 is substituted;

(a') the polynucleotide of (a), wherein said nucleotide T at position 60 is substituted with C;

(b) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 2 (comprising rs11605166), wherein the nucleotide C at position 60 is substituted;

(b') the polynucleotide of (b), wherein said nucleotide C at position 60 is substituted with T;

(c) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 3 (comprising rs10830956), wherein the nucleotide C at position 60 is substituted;

(c') the polynucleotide of (c), wherein said nucleotide C at position 60 is substituted with T;

(d) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 4 (comprising rs17244499), wherein the nucleotide G at position 60 is substituted;

(d') the polynucleotide of (d), wherein said nucleotide G at position 60 is substituted with A;

(e) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 5 (comprising rs7202116), wherein the nucleotide A at position 60 is substituted;

(e') the polynucleotide of (e), wherein said nucleotide A at position 60 is substituted with G;

(f) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 6 (comprising rs291 171 1 ), wherein the nucleotide A at position 60 is substituted; the polynucleotide of (f), wherein said nucleotide A at position 60 is substituted with T;

a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 7 (comprising rs2908289), wherein the nucleotide G at position 60 is substituted; the polynucleotide of (g), wherein said nucleotide G at position 60 is substituted with A;

a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 8 (comprising rs2042587), wherein the nucleotide C at position 60 is substituted; or

a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 31 (comprising rs2042587, alternative), wherein the nucleotide T at position 60 is substituted;

the polynucleotide of (h), wherein said nucleotide C or T at position 60 is substituted with G;

a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 9 (comprising rs3751812), wherein the nucleotide G at position 60 is substituted; the polynucleotide of (i), wherein said nucleotide G at position 60 is substituted with T;

a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 10 (comprising rs13241 165), wherein the nucleotide A at position 60 is substituted;

the polynucleotide of (j), wherein said nucleotide A at position 60 is substituted with T;

a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 1 1 (comprising rs1 1602858), wherein the nucleotide G at position 60 is substituted;

the polynucleotide of (k), wherein said nucleotide G at position 60 is substituted with C;

a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 12 (comprising rs2288232), wherein the nucleotide A at position 60 is substituted; the polynucleotide of (I), wherein said nucleotide A at position 60 is substituted with G; I a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 13 (comprising rs7936247), wherein the nucleotide G at position 60 is substituted;

) the polynucleotide of (m), wherein said nucleotide G at position 60 is substituted with T;

a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 14 (comprising rs132301 1 1 ), wherein the nucleotide A at position 60 is substituted;

the polynucleotide of (n), wherein said nucleotide A at position 60 is substituted with G;

a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 15 (comprising rs4684847), wherein the nucleotide T at position 60 is substituted; the polynucleotide of (o), wherein said nucleotide T at position 60 is substituted with C;

a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 16 (comprising rs7201850), wherein the nucleotide C at position 60 is substituted; the polynucleotide of (p), wherein said nucleotide C at position 60 is substituted with T;

a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 17 (comprising rs2881654), wherein the nucleotide A at position 60 is substituted; the polynucleotide of (q), wherein said nucleotide A at position 60 is substituted with G;

a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 18 (comprising rs4504165), wherein the nucleotide C at position 60 is substituted; the polynucleotide of (r), wherein said nucleotide C at position 60 is substituted with T;

(s) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 19 (comprising rs13083375), wherein the nucleotide T at position 60 is substituted;

(s') the polynucleotide of (s), wherein said nucleotide T at position 60 is substituted with G;

(t) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 20 (comprising rs13234269), wherein the nucleotide T at position 60 is substituted;

(t') the polynucleotide of (t), wherein said nucleotide T at position 60 is substituted with A;

(u) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 21 (comprising rs7649970), wherein the nucleotide T at position 60 is substituted;

(u') the polynucleotide of (u), wherein said nucleotide T at position 60 is substituted with C;

(v) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 22 (comprising rs11979110), wherein the nucleotide C at position 60 is substituted;

(v') the polynucleotide of (v), wherein said nucleotide C at position 60 is substituted with T;

(w) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 23 (comprising rs3996352), wherein the nucleotide A at position 60 is substituted;

(w') the polynucleotide of (w), wherein said nucleotide A at position 60 is substituted with G;

(x) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 24 (comprising rs11128603), wherein the nucleotide G at position 60 is substituted;

(x') the polynucleotide of (x), wherein said nucleotide G at position 60 is substituted with A;

(y) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 25 (comprising rs10882102), wherein the nucleotide G at position 60 is substituted;

(y') the polynucleotide of (y), wherein said nucleotide G at position 60 is substituted with C;

(z) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 26 (comprising rs11709077), wherein the nucleotide A at position 60 is substituted;

(z') the polynucleotide of (z), wherein said nucleotide A at position 60 is substituted with G;

(aa) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 27 (comprising rs849133), wherein the nucleotide T at position 60 is substituted;

(aa') the polynucleotide of (aa), wherein said nucleotide T at position 60 is substituted with C;

(bb) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 28 (comprising rs12243326), wherein the nucleotide T at position 60 is substituted;

(bb') the polynucleotide of (bb), wherein said nucleotide T at position 60 is substituted with C;

(cc) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 29 (comprising rs11600585), wherein the nucleotide G at position 60 is substituted;

(cc') the polynucleotide of (cc), wherein said nucleotide G at position 60 is substituted with A;

(dd) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 30 (comprising rs7638389), wherein the nucleotide G at position 60 is substituted; and

(dd') the polynucleotide of (dd), wherein said nucleotide G at position 60 is substituted with A;

wherein the presence of at least one polynucleotide (a) to (dd') is indicative for said patient to have an increased risk or to be prone to develop type 2 diabetes or to have type 2 diabetes (T2D). The presence of said at least one polynucleotide may also be indicative for said patient to have T2D at a severe state.

Disclosed is also a method for the identification of a patient having an increased risk or being prone to develop type 2 diabetes or having type 2 diabetes, said method comprising the determination of the presence of at least one polynucleotide contained in a biological sample of said patient, said polynucleotide being selected from the group consisting of

(1) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 15 (comprising rs4684847), wherein the nucleotide T at position 60 is substituted;

(1') the polynucleotide of (1), wherein said nucleotide T at position 60 is substituted with C;

(2) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 7 (comprising rs2908289), wherein the nucleotide G at position 60 is substituted;

(2') the polynucleotide of (2), wherein said nucleotide G at position 60 is substituted with A;

(3) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 6 (comprising rs2911711), wherein the nucleotide A at position 60 is substituted;

(3') the polynucleotide of (3), wherein said nucleotide A at position 60 is substituted with T;

(4) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 10 (comprising rs13241165), wherein the nucleotide A at position 60 is substituted;

(4') the polynucleotide of (4), wherein said nucleotide A at position 60 is substituted with T;

(5') the polynucleotide of (5), wherein said nucleotide C or T at position 60 is substituted with G;

(6) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 11 (comprising rs11602858), wherein the nucleotide G at position 60 is substituted;

(6') the polynucleotide of (6), wherein said nucleotide G at position 60 is substituted with C;

(7) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 1 (comprising rs10203174), wherein the nucleotide T at position 60 is substituted;

(7') the polynucleotide of (7), wherein said nucleotide T at position 60 is substituted with C;

(8) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 18 (comprising rs4504165), wherein the nucleotide C at position 60 is substituted;

(8') the polynucleotide of (8), wherein said nucleotide C at position 60 is substituted with T;

(9) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 5 (comprising rs7202116), wherein the nucleotide A at position 60 is substituted;

(9') the polynucleotide of (9), wherein said nucleotide A at position 60 is substituted with G;

(10) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 16 (comprising rs7201850), wherein the nucleotide C at position 60 is substituted;

(10') the polynucleotide of (10), wherein said nucleotide C at position 60 is substituted with T;

(11) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 17 (comprising rs2881654), wherein the nucleotide A at position 60 is substituted; and

(11') the polynucleotide of (11), wherein said nucleotide A at position 60 is substituted with G;

wherein the presence of at least one polynucleotide (1) to (11') is indicative for said patient to have an increased risk or to be prone to develop type 2 diabetes or to have type 2 diabetes.

The method for the identification of a patient having an increased risk or being prone to develop type 2 diabetes or having type 2 diabetes as provided herein comprises the determination of the presence of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, or 30 polynucleotides selected from the group of sequences as defined above, wherein the presence of at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, or 30 of the polynucleotides (a) to (dd') is indicative for said patient to have an increased risk or to be prone to develop type 2 diabetes or to have type 2 diabetes (T2D). For example, the method may comprise the determination of the presence of at least 5 polynucleotides selected from the group of sequences as defined above, wherein the presence of at least 5 of the polynucleotides (a) to (dd') is indicative for said patient to have an increased risk or to be prone to develop type 2 diabetes or to have type 2 diabetes (T2D).

In accordance with the other aspect, the methods illustrated herein comprise the determination of the presence of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, or 30 polynucleotides selected from the group of sequences as defined above, wherein the presence of at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or 11 of the polynucleotides (1) to (11') is indicative for said patient to have an increased risk or to be prone to develop type 2 diabetes or to have type 2 diabetes (T2D). For example, the method may comprise the determination of the presence of at least 2 polynucleotides selected from the group of sequences as defined above, wherein the presence of at least 2 of the polynucleotides (1) to (11') is indicative for said patient to have an increased risk or to be prone to develop type 2 diabetes or to have type 2 diabetes (T2D).

In accordance with the other aspect, the substitution comprised by a polynucleotide to be determined as described herein is also referred to as a single nucleotide polymorphism (SNP) as commonly understood in the art and refers to a nucleotide polymorphism which may exist as different alleles. Basically, in context of the other aspect of the present invention, the presence of such a substitution (i.e. the presence of a SNP) in a polynucleotide is indicative for a patient to have an increased risk or to be prone to develop T2D or to have T2D. In the other aspect of the present invention, the SNP associated with T2D may refer to a SNP associated with the onset of T2D or a SNP associated with the progression of T2D.

In context with the other aspect, the specific substitutions (i.e. the SNP) as shown in (a'), (b'), (c'), (d'), (e'), (f'), (g'), (h'), (i'), (j'), (k'), (l'), (m'), (n'), (o'), (p'), (q'), (r'), (s'), (t'), (u'), (v'), (w'), (x'), (y'), (z'), (aa'), (bb'), (cc') and (dd'), preferably as shown in (1), (1'), (2), (2'), (3), (3'), (4), (4'), (5), (5'), (6), (6'), (7), (7'), (8), (8'), (9), (9'), (10), (10'), (11) and (11'), particularly preferred as shown in (1), (1'), (2) and (2'), and most preferred as shown in (1) and (1'), are also referred to herein as high-risk alleles. If such a high-risk allele is detected in both alleles, i.e. if the patient to be identified is determined to be homozygous for a high-risk allele, the risk to have T2D or to be prone to develop T2D is preferably even higher compared to a patient having only one high-risk allele (i.e. a patient being heterozygous for said allele).
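The allele logic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the invention: the `RISK_SNPS` table copies three of the substitutions listed above (rs4684847 T to C, rs2908289 G to A, rs2911711 A to T), whereas the function names, the encoding of a genotype as a two-letter string (one letter per allele), and the example patient are hypothetical. Strand orientation and genotyping errors are ignored.

```python
# Substitutions at position 60 taken from items (1'), (2') and (3') above.
# Mapping: rs identifier -> (reference nucleotide, high-risk nucleotide).
RISK_SNPS = {
    "rs4684847": ("T", "C"),
    "rs2908289": ("G", "A"),
    "rs2911711": ("A", "T"),
}


def zygosity(rs_id, genotype):
    """Classify a two-letter genotype (hypothetical encoding), e.g. "TC"."""
    _, risk = RISK_SNPS[rs_id]
    copies = genotype.upper().count(risk)
    return {0: "no high-risk allele", 1: "heterozygous", 2: "homozygous"}[copies]


def count_risk_snps(genotypes):
    """Count the SNPs for which at least one high-risk allele is present."""
    return sum(
        1
        for rs_id, genotype in genotypes.items()
        if rs_id in RISK_SNPS and RISK_SNPS[rs_id][1] in genotype.upper()
    )


def meets_threshold(genotypes, minimum):
    """True if at least `minimum` of the listed substitutions are present."""
    return count_risk_snps(genotypes) >= minimum


# Hypothetical patient genotypes, for illustration only.
patient = {"rs4684847": "CC", "rs2908289": "GA", "rs2911711": "AA"}
print(zygosity("rs4684847", patient["rs4684847"]))
print(count_risk_snps(patient), meets_threshold(patient, 2))
```

The threshold check corresponds to the "at least 2 of the polynucleotides" style of criterion; the zygosity function separates patients carrying one copy of a high-risk allele from those carrying two.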

In context with the other aspect, the biological sample containing the at least one polynucleotide which is analyzed in accordance with the methods described and provided herein may be a body fluid sample, a blood sample or a tissue sample. The body fluid sample may be selected from the group consisting of serum, plasma, whole blood, saliva, semen, vaginal secretion, synovial fluid, spinal fluid, cerebrospinal fluid, tears, stool, urine, mucus and the like. "Whole blood" is a venous, arterial or capillary blood sample in which the concentrations and properties of cellular and extra-cellular constituents remain relatively unaltered when compared with their in-vivo state. Anticoagulation in-vitro stabilizes the constituents in a whole blood sample for a certain period of time. In one embodiment, the biological sample is whole blood including blood cells or serum. Preferably, the sample is whole blood.

In accordance with the other aspect, the presence of the at least one polynucleotide contained in the biological sample may be determined by methods commonly known in the art and described herein which are suitable to determine the presence of a certain polynucleotide comprising a specific substitution. Such methods may inter alia be PCR techniques, restriction digestion, chain-termination based sequencing (e.g., sequencing after Sanger (Nature (1977), 265: 687-695)), high-throughput sequencing (e.g., 454 Sequencing® from Roche or SOLiD® Sequencing from ABI) or hybridization techniques such as, e.g., microarray, dot blot or Southern blot. Hybridization assays for the characterization of nucleic acid sequences are well known in the art; see e.g. Sambrook, Russell "Molecular Cloning, A Laboratory Manual", Cold Spring Harbor Laboratory, N.Y. (2001); Ausubel, "Current Protocols in Molecular Biology", Green Publishing Associates and Wiley Interscience, N.Y. (1989). The term "hybridization" or "hybridizes" as used herein may relate to hybridizations under stringent or non-stringent conditions. If not further specified, the conditions are preferably non-stringent. Said hybridization conditions may be established according to conventional protocols described, e.g., in Sambrook (2001) loc. cit.; Ausubel (1989) loc. cit., or Higgins and Hames (Eds.) "Nucleic acid hybridization, a practical approach" IRL Press Oxford, Washington DC, (1985). The setting of conditions is well within the skill of the artisan and can be determined according to protocols described in the art. Thus, the detection of only specifically hybridizing sequences will usually require stringent hybridization and washing conditions such as, for example, the highly stringent hybridization conditions of 0.1 x SSC, 0.1 % SDS at 65 °C or 2 x SSC, 60 °C, 0.1 % SDS.
Low stringent hybridization conditions for the detection of homologous or not exactly complementary sequences may, for example, be set at 6 x SSC, 1 % SDS at 65 °C. As is well known, the length of the probe and the composition of the nucleic acid to be determined constitute further parameters of the hybridization conditions. Preferably, in accordance with the other aspect of the present invention, the presence of the at least one polynucleotide is determined by HPLC-based genotyping or microarray or polymerase chain reaction (PCR).

In order to determine whether a nucleotide residue in a nucleic acid sequence corresponds to a certain position in the nucleotide sequence of, e.g., SEQ ID NOs: 1 to 31, the skilled person may use means and methods well known in the art, e.g., alignments, either manually or by using computer programs such as those mentioned herein. For example, BLAST 2.0, which stands for Basic Local Alignment Search Tool (Nucl Acids Res (1997), 25: 3389-3402; J Mol Evol (1993), 36: 290-300; J Mol Biol (1990), 215: 403-410), can be used to search for local sequence alignments. BLAST, as discussed above, produces alignments of nucleotide sequences to determine sequence similarity. Because of the local nature of the alignments, BLAST is especially useful in determining exact matches or in identifying similar sequences. The fundamental unit of BLAST algorithm output is the High-scoring Segment Pair (HSP). An HSP consists of two sequence fragments of arbitrary but equal lengths whose alignment is locally maximal and for which the alignment score meets or exceeds a threshold or cut-off score set by the user. The BLAST approach is to look for HSPs between a query sequence and a database sequence, to evaluate the statistical significance of any matches found, and to report only those matches which satisfy the user-selected threshold of significance. The parameter E establishes the statistically significant threshold for reporting database sequence matches. E is interpreted as the upper bound of the expected frequency of chance occurrence of an HSP (or set of HSPs) within the context of the entire database search. Any database sequence whose match satisfies E is reported in the program output.
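The notion of an HSP, i.e. two equal-length fragments whose ungapped alignment is locally maximal and whose score meets a cut-off, can be illustrated with a short sketch. This is not the BLAST implementation, which uses word seeding, extension heuristics and statistical evaluation; the code below simply scans every diagonal exhaustively. The scoring values (+1 for a match, -3 for a mismatch), the function names and the toy sequences are assumptions chosen for illustration.

```python
def best_ungapped_hsp(query, subject, match=1, mismatch=-3):
    """Return (score, query_start, subject_start, length) of the best
    ungapped segment pair, found by a maximum-subarray scan per diagonal."""
    best = (0, 0, 0, 0)
    n, m = len(query), len(subject)
    # A diagonal is identified by d = subject index - query index.
    for d in range(-(n - 1), m):
        q = max(0, -d)
        s = q + d
        run_score = 0
        run_start = q
        while q < n and s < m:
            if run_score <= 0:
                # A non-positive prefix cannot improve a segment: restart here.
                run_score = 0
                run_start = q
            run_score += match if query[q] == subject[s] else mismatch
            if run_score > best[0]:
                best = (run_score, run_start, run_start + d, q - run_start + 1)
            q += 1
            s += 1
    return best


def report_hsp(query, subject, threshold):
    """Report the best segment pair only if its score meets the cut-off."""
    hsp = best_ungapped_hsp(query, subject)
    return hsp if hsp[0] >= threshold else None


# Toy sequences (hypothetical): the query occurs exactly inside the subject.
print(report_hsp("ACGTACGT", "TTACGTACGTTT", threshold=6))
```

The returned start coordinates are zero-based; with such coordinates, the subject position corresponding to a given query position on the reported segment is obtained by adding the difference between the two start values.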

Prior to detection of the nucleic acid sequences, a genetic sample may be amplified. DNA can be amplified by a number of methods, many of which employ PCR. See, for example, PCR Technology: Principles and Applications for DNA Amplification (Ed. H. A. Erlich, Freeman Press, NY, N.Y., 1992); PCR Protocols: A Guide to Methods and Applications (Eds. Innis, et al., Academic Press, San Diego, Calif., 1990); Mattila et al., Nucleic Acids Res. 19, 4967 (1991); Eckert et al., PCR Methods and Applications 1, 17 (1991); PCR (Eds. McPherson et al., IRL Press, Oxford); and U.S. Pat. Nos. 4,683,202, 4,683,195, 4,800,159, 4,965,188, and 5,333,675.

Other suitable amplification methods include the ligase chain reaction (LCR) (for example, Wu and Wallace, Genomics 4, 560 (1989), Landegren et al., Science 241, 1077 (1988) and Barringer et al. Gene 89:117 (1990)), transcription amplification (Kwoh et al., Proc. Natl. Acad. Sci. USA 86:1173-1177 (1989) and WO88/10315), self-sustained sequence replication (Guatelli et al., Proc. Nat. Acad. Sci. USA, 87:1874-1878 (1990) and WO90/06995), selective amplification of target polynucleotide sequences (U.S. Pat. No. 6,410,276), consensus sequence primed polymerase chain reaction (CP-PCR) (U.S. Pat. No. 4,437,975), arbitrarily primed polymerase chain reaction (AP-PCR) (U.S. Pat. Nos. 5,413,909, 5,861,245), nucleic acid based sequence amplification (NABSA), rolling circle amplification (RCA), multiple displacement amplification (MDA) (U.S. Pat. Nos. 6,124,120 and 6,323,009) and circle-to-circle amplification (C2CA) (Dahl et al. Proc. Natl. Acad. Sci 101:4548-4553 (2004)). (See U.S. Pat. Nos. 5,409,818, 5,554,517, and 6,063,603). Other amplification methods that may be used are described in U.S. Pat. Nos. 5,242,794, 5,494,810, 5,409,818, 4,988,617, 6,063,603 and 5,554,517 and in U.S. Ser. No. 09/854,317.

As outlined above, microarrays may be used for detecting the SNPs. In one embodiment, a high density DNA array is used for SNP identification. Such arrays are commercially available e.g. from Affymetrix and Illumina (see Affymetrix GeneChip® 500K Assay Manual, Affymetrix, Santa Clara, CA; Sentrix® humanHap650Y genotyping beadchip, Illumina, San Diego, CA).

DNA microarrays can be fabricated by any number of known methods including photolithography, pipette, drop-touch, piezoelectric, spotting and electric procedures. The DNA microarrays generally have probes that are supported by a substrate so that a target sample is bound or hybridized with the probes. In use, the microarray surface is contacted with one or more target samples under conditions that promote specific, high-affinity binding of the target to one or more of the probes of the invention. A sample solution containing the target sample may contain radioactively, chemoluminescently or fluorescently labeled molecules that are detectable. The hybridized targets and probes can also be detected by voltage, current, or electronic means known in the art.

Optionally, a plurality of microarrays may be formed on a larger array substrate. The substrate can be diced into a plurality of individual microarray dies in order to optimize use of the substrate. Possible substrate materials include siliceous compositions where a siliceous substrate is generally defined as any material largely comprised of silicon dioxide. Natural or synthetic assemblies can also be employed. The substrate can be hydrophobic or hydrophilic or capable of being rendered hydrophobic or hydrophilic and includes inorganic powders such as silica, magnesium sulfate, and alumina; natural polymeric materials, particularly cellulosic materials and materials derived from cellulose, such as fiber-containing papers, e.g., filter paper, chromatographic paper, etc.; synthetic or modified naturally occurring polymers, such as nitrocellulose, cellulose acetate, poly(vinyl chloride), polyacrylamide, cross linked dextran, agarose, polyacrylate, polyethylene, polypropylene, poly(4-methylbutene), polystyrene, polymethacrylate, poly(ethylene terephthalate), nylon, poly(vinyl butyrate), etc.; either used by themselves or in conjunction with other materials; glass available as Bioglass, ceramics, metals, and the like. The surface of the substrate is then chemically prepared or derivatized to enable or facilitate the attachment of the molecular species to the surface of the array substrate. Surface derivatizations can differ for immobilization of prepared biological material and in situ synthesis of the biological material on the microarray substrate. Surface treatment or derivatization techniques are well known in the art. The surface of the substrate can have any number of shapes, such as strip, plate, disk, rod, particle, including bead, and the like.
In modifying siliceous or metal oxide surfaces, one technique that has been used is derivatization with bifunctional silanes, i.e., silanes having a first functional group enabling covalent binding to the surface and a second functional group that can impart the desired chemical and/or physical modifications to the surface to covalently or non-covalently attach ligands and/or the polymers or monomers for the biological probe array. Adsorbed polymer surfaces are used on siliceous substrates for attaching nucleic acids to the substrate surface. Since a microarray die may be quite small and difficult to handle for processing, an individual microarray die can also be packaged for further handling and processing. For example, the microarray may be processed by subjecting the microarray to a hybridization assay while retained in a package.

Various techniques can be employed for preparing an oligonucleotide for use in a microarray. In situ synthesis of oligonucleotide or polynucleotide probes on a substrate is performed in accordance with well-known chemical processes, such as sequential addition of nucleotide phosphoramidites to surface-linked hydroxyl groups. Indirect synthesis may also be performed in accordance with biosynthetic techniques such as Polymerase Chain Reaction ("PCR"). Other methods of oligonucleotide synthesis include phosphotriester and phosphodiester methods and synthesis on a support, as well as phosphoramidate techniques. Chemical synthesis via a photolithographic method of spatially addressable arrays of oligonucleotides bound to a substrate made of glass can also be employed. The probes or oligonucleotides, themselves, can be obtained by biological synthesis or by chemical synthesis. Chemical synthesis provides a convenient way of incorporating low molecular weight compounds and/or modified bases during specific synthesis steps. Furthermore, chemical synthesis is very flexible in the choice of length and region of the target polynucleotide binding sequence. The oligonucleotide can be synthesized by standard methods such as those used in commercial automated nucleic acid synthesizers. Immobilization of probes or oligonucleotides on a substrate or surface may be accomplished by well-known techniques. One type of technology makes use of a bead-array of randomly or non-randomly arranged beads. A specific oligonucleotide or probe sequence is assigned to each bead type, which is replicated any number of times on an array. A series of decoding hybridizations is then used to identify each bead on the array. The concept of these assays is very similar to that of DNA chip based assays. However, oligonucleotides are attached to small microspheres rather than to a fixed surface of DNA chips.
Bead-based systems can be combined with most of the allele-discrimination chemistry used in DNA chip based array assays, such as single-base extension and oligonucleotide ligation assays. The bead-based format has flexibility for multiplexing and SNP combination. In bead-based assays, the identity of each bead needs to be determined, and that information is combined with the genotype signal from the bead to assign a "genotype call" to each SNP and individual.
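How bead identity and genotype signal might be combined into a "genotype call" can be sketched as follows. Everything in this example is an assumption made for illustration: the two allele channels are denoted A and B, the thresholds `min_total` and `homo_cut` are arbitrary, replicate beads of the same type are averaged, and the bead-to-SNP mapping and signal values are invented. Commercial platforms use their own clustering and quality models.

```python
from collections import defaultdict


def call_genotype(signal_a, signal_b, min_total=100.0, homo_cut=0.8):
    """Call AA, AB or BB from two allele-specific signals (hypothetical rule)."""
    total = signal_a + signal_b
    if total < min_total:
        return "no call"  # too little signal to decide
    frac_a = signal_a / total
    if frac_a >= homo_cut:
        return "AA"
    if frac_a <= 1.0 - homo_cut:
        return "BB"
    return "AB"


def assign_calls(readings, bead_to_snp):
    """Combine decoded bead identity with signal to get one call per SNP.

    readings: iterable of (bead_type, signal_a, signal_b) for single beads.
    bead_to_snp: mapping from decoded bead type to SNP identifier.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for bead_type, signal_a, signal_b in readings:
        if bead_type not in bead_to_snp:
            continue  # bead could not be decoded to a known probe
        acc = sums[bead_to_snp[bead_type]]
        acc[0] += signal_a
        acc[1] += signal_b
        acc[2] += 1
    # Average over replicate beads before calling.
    return {snp: call_genotype(a / n, b / n) for snp, (a, b, n) in sums.items()}


# Invented example data: two replicate beads of type 7, one bead of type 12.
bead_to_snp = {7: "rs4684847", 12: "rs2908289"}
readings = [(7, 900.0, 50.0), (7, 1000.0, 40.0), (12, 480.0, 520.0), (99, 5.0, 5.0)]
calls = assign_calls(readings, bead_to_snp)
print(calls)
```

Averaging replicates before calling reflects the point that replication of each bead type makes the measurement more robust; other summary statistics, such as the median, would be equally plausible choices.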

One bead-based genotyping technology uses fluorescently coded microspheres developed by Luminex. Fulton R, McDade R, Smith P, Kienker L, Kettman J. J. Advanced multiplexed analysis with the FlowMetrix system, Clin. Chem. 1997; 43: 1749-1756. These beads are coated with two different dyes (red and orange), and can be identified and separated using flow cytometry, based on the amount of these two dyes on the surface. By having a hundred types of microspheres with a different red:orange signal ratio, a hundred-plex detection reaction can be performed in a single tube. After the reaction, these microspheres are distinguished using a flow fluorimeter where a genotyping signal (green) from each group of microspheres is measured separately. This bead-based platform is useful in allele-specific hybridization, single-base extension, allele-specific primer extension, and oligonucleotide ligation assay. In a different bead-based platform commercialized by Illumina, microspheres are captured in solid wells created from optical fibers. Michael K., Taylor L, Schultz S, Walt D. Randomly ordered addressable high-density optical sensor arrays, Anal. Chem., 1998; 70: 1242-1248; Steemers F., Ferguson J, Walt D., Screening unlabeled DNA targets with randomly ordered fiber-optic gene arrays, Nat. Biotechnol., 2000; 18: 91-94. The diameter of each well is similar to that of the spheres, allowing only a single sphere to fit in one well. Once the microspheres are set in these wells, all of the spheres can be treated like a high-density microarray. The high degree of replication in DNA microarray technology makes robust measurements for each bead type possible. Bead-array technology is particularly useful in SNP genotyping. Software used to process raw data from a DNA microarray or chip is well known in the art and employs various known methods for image processing, background correction and normalization.
Many public and proprietary software packages are available for such processing, whereby a quality assessment of the raw data can be carried out, and the data then summarized and stored in a format which can be used by other software to perform additional analyses.

As an alternative to or in addition to DNA microarray analysis, genetic variations such as SNPs and mutations can be detected by DNA sequencing. DNA sequencing may also be used to sequence a substantial portion, or the entire, genomic sequence of an individual. Traditionally, common DNA sequencing has been based on polyacrylamide gel fractionation to resolve a population of chain-terminated fragments (Sanger et al., Proc. Natl. Acad. Sci. USA 74:5463-5467 (1977)). Alternative methods have been and continue to be developed to increase the speed and ease of DNA sequencing. For example, high throughput and single molecule sequencing platforms are commercially available or under development from 454 Life Sciences (Branford, CT) (Margulies et al., Nature 437:376-380 (2005)); Solexa (Hayward, CA); Helicos Biosciences Corporation (Cambridge, MA) (U.S. application Ser. No. 11/167046, filed June 23, 2005), and Li-Cor Biosciences (Lincoln, NE) (U.S. application Ser. No. 11/118031, filed April 29, 2005).

As mentioned, in the other aspect of the present invention, specific mutations (i.e. single nucleotide polymorphisms (SNP)) have been identified having a regulatory and, thus, a causal role in the development of type 2 diabetes (T2D). That is, a patient carrying one, two, three, four, five or more of these mutations in the genome is considered to have a higher risk to have or to develop T2D. The more of these SNP are present in the patient's genome, the higher the risk to have or to develop T2D. Also, as mentioned, homozygosity for a particular SNP preferably increases the risk to have or to develop T2D. The specific regulatory SNP identified in context with the other aspect of the present invention have been found by phylogenetic module complexity analysis (PMC analysis) based on modular organization of genomic regions analyzed in several vertebrate species as a reliable tool to predict functionality of non-coding SNP at transcriptional level. As a starting point in search of regulatory SNP, linkage blocks were selected which are thought to be associated with T2D levels. As a result, precise regulatory variants were identified in diabetes associated loci as shown in Table 20.

In Tables 20 and 21, the abbreviation rs represents reference SNP; Major/Minor represents the more/less frequent allele variant; MAF represents the minor allele frequency of a variant; Hg18 represents human genome version 18; and r² as well as D' are as defined above.
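For orientation, the quantities MAF, r² and D' can be computed as in the following sketch, which assumes the conventional population-genetics definitions for two biallelic loci (D = p_AB - p_A·p_B, r² = D² / (p_A(1 - p_A)·p_B(1 - p_B)), and D' = |D| / D_max). The function names and the example frequencies are hypothetical and are not taken from Tables 20 and 21.

```python
def minor_allele_frequency(count_allele_1, count_allele_2):
    """MAF: frequency of the less common of two alleles."""
    total = count_allele_1 + count_allele_2
    if total == 0:
        raise ValueError("no alleles observed")
    return min(count_allele_1, count_allele_2) / total


def ld_measures(p_ab, p_a, p_b):
    """Return (r2, d_prime) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and
          allele B at locus 2; p_a and p_b: the two allele frequencies.
    """
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b
    r2 = d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))
    if d >= 0:
        d_max = min(p_a * (1.0 - p_b), (1.0 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1.0 - p_a) * (1.0 - p_b))
    return r2, abs(d) / d_max


# Invented frequencies: two loci whose alleles always occur together.
print(ld_measures(p_ab=0.5, p_a=0.5, p_b=0.5))
print(minor_allele_frequency(30, 70))
```

A pair of loci in complete linkage disequilibrium with equal allele frequencies yields r² = 1 and D' = 1, while independent loci yield values of 0 for both measures.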

Table 20: SNP having regulatory/causal role for T2D; SNAP (Broad Institute) analysis with >0.7, max. 500kb, HapMap22, Hg18

Table 21: Preferred SNP having regulatory/causal role for T2D; SNAP (Broad Institute) analysis with >0.7, max. 500kb, HapMap22, Hg18

Another aspect relates to (a) probe(s) and/or (a) primer(s) for use in the identification of a patient having an increased risk or being prone to develop type 2 diabetes or having type 2 diabetes. In context with the other aspect of the present invention, said probe(s) and/or primer(s) is/are suitable to detect at least one of the specific substitutions (i.e. SNP) identified and provided herein. In particular, the probe(s) and/or primer(s) provided and to be employed as described herein are suitable to detect at least one polynucleotide selected from the group consisting of

(d) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 4 (comprising rs17244499), wherein the nucleotide G at position 60 is substituted;

(e) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 5 (comprising rs7202116), wherein the nucleotide A at position 60 is substituted;

(f) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 6 (comprising rs2911711), wherein the nucleotide A at position 60 is substituted;

(f') the polynucleotide of (f), wherein said nucleotide A at position 60 is substituted with T;

(g) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 7 (comprising rs2908289), wherein the nucleotide G at position 60 is substituted;

(g') the polynucleotide of (g), wherein said nucleotide G at position 60 is substituted with A;

(h) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 8 (comprising rs2042587), wherein the nucleotide C at position 60 is substituted; or a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 31 (comprising rs2042587, alternative), wherein the nucleotide T at position 60 is substituted;

(j) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 10 (comprising rs13241165), wherein the nucleotide A at position 60 is substituted;

(j') the polynucleotide of (j), wherein said nucleotide A at position 60 is substituted with T;

(k) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 11 (comprising rs11602858), wherein the nucleotide G at position 60 is substituted;

(k') the polynucleotide of (k), wherein said nucleotide G at position 60 is substituted with C;

(l) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 12 (comprising rs2288232), wherein the nucleotide A at position 60 is substituted;

(l') the polynucleotide of (l), wherein said nucleotide A at position 60 is substituted with G;

(m) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 13 (comprising rs7936247), wherein the nucleotide G at position 60 is substituted;

(m') the polynucleotide of (m), wherein said nucleotide G at position 60 is substituted with T;

(n) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 14 (comprising rs13230111), wherein the nucleotide A at position 60 is substituted;

(n') the polynucleotide of (n), wherein said nucleotide A at position 60 is substituted with G;

(o) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 15 (comprising rs4684847), wherein the nucleotide T at position 60 is substituted;

(o') the polynucleotide of (o), wherein said nucleotide T at position 60 is substituted with C;

(s) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 19 (comprising rs13083375), wherein the nucleotide T at position 60 is substituted;

(s') the polynucleotide of (s), wherein said nucleotide T at position 60 is substituted with G;

(v) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 22 (comprising rs11979110), wherein the nucleotide C at position 60 is substituted;

(v') the polynucleotide of (v), wherein said nucleotide C at position 60 is substituted with T;

(w) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 23 (comprising rs3996352), wherein the nucleotide A at position 60 is substituted;

(w') the polynucleotide of (w), wherein said nucleotide A at position 60 is substituted with G;

(x) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 24 (comprising rs11128603), wherein the nucleotide G at position 60 is substituted;

(x') the polynucleotide of (x), wherein said nucleotide G at position 60 is substituted with A;

(y) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 25 (comprising rs10882102), wherein the nucleotide G at position 60 is substituted;

(y') the polynucleotide of (y), wherein said nucleotide G at position 60 is substituted with C;

(z) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 26 (comprising rs11709077), wherein the nucleotide A at position 60 is substituted;

(z') the polynucleotide of (z), wherein said nucleotide A at position 60 is substituted with G;

(aa) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 27 (comprising rs849133), wherein the nucleotide T at position 60 is substituted;

(aa') the polynucleotide of (aa), wherein said nucleotide T at position 60 is substituted with C;

(dd') the polynucleotide of (dd), wherein said nucleotide G at position 60 is substituted with A.

Another aspect relates to the above described probe and/or primer of the invention, wherein said probe and/or primer is suitable to detect at least one polynucleotide selected from the group consisting of

(1) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 15 (comprising rs4684847), wherein the nucleotide T at position 60 is substituted;

(1') the polynucleotide of (1), wherein said nucleotide T at position 60 is substituted with C;

(2) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 7 (comprising rs2908289), wherein the nucleotide G at position 60 is substituted;

(2') the polynucleotide of (2), wherein said nucleotide G at position 60 is substituted with A;

(3) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 6 (comprising rs2911711), wherein the nucleotide A at position 60 is substituted;

(3') the polynucleotide of (3), wherein said nucleotide A at position 60 is substituted with T;

(4) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 10 (comprising rs13241165), wherein the nucleotide A at position 60 is substituted;

(4') the polynucleotide of (4), wherein said nucleotide A at position 60 is substituted with T;

(5) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 8 (comprising rs2042587), wherein the nucleotide C at position 60 is substituted; or

a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 31 (comprising rs2042587, alternative), wherein the nucleotide T at position 60 is substituted;

(5') the polynucleotide of (5), wherein said nucleotide C or T at position 60 is substituted with G;

(6) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 11 (comprising rs11602858), wherein the nucleotide G at position 60 is substituted;

(6') the polynucleotide of (6), wherein said nucleotide G at position 60 is substituted with C;

(7) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 1 (comprising rs10203174), wherein the nucleotide T at position 60 is substituted;

(7') the polynucleotide of (7), wherein said nucleotide T at position 60 is substituted with C;

(8) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 18 (comprising rs4504165), wherein the nucleotide C at position 60 is substituted;

(9) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 5

(10) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 16 (comprising rs7201850), wherein the nucleotide C at position 60 is substituted;

(11) a polynucleotide comprising the nucleotide sequence of SEQ ID NO: 17

(11') the polynucleotide of (11), wherein said nucleotide A at position 60 is substituted with G.

In accordance with the other aspect, the probe(s) and/or primer(s) may be used in the methods described and provided herein. Probes may be nucleotide molecules which are able to hybridize to the polynucleotide to be detected in the patient's sample in accordance with the other aspect of the present invention. Such probes may be, inter alia, nucleic acid analogues such as DNA molecules, RNA molecules, oligonucleotide thiophosphates, substituted ribo-oligonucleotides, LNA molecules, PNA molecules, GNA (glycol nucleic acid) molecules, TNA (threose nucleic acid) molecules, morpholino polynucleotides, or any modification thereof as known in the art (see, e.g., US 5,525,711, US 4,711,955, US 5,792,608 or EP 302175 for examples of modifications). Such probes may be of any length, but preferably comprise not less than 10 nucleotides, more preferably not less than 15 nucleotides, more preferably not less than 20 nucleotides, and most preferably not less than 25 nucleotides. The probes are preferably not longer than 150 nucleotides, more preferably not longer than 100 nucleotides, most preferably not longer than 75 nucleotides. In context with the other aspect of the present invention, the probes preferably hybridize to the polynucleotide to be detected in the patient's sample under stringent conditions. Said hybridization conditions may be established according to conventional protocols described, for example, in Sambrook, Russell "Molecular Cloning, A Laboratory Manual", Cold Spring Harbor Laboratory, N.Y. (2001); Ausubel, "Current Protocols in Molecular Biology", Greene Publishing Associates and Wiley Interscience, N.Y. (1989), or Higgins and Hames (Eds.) "Nucleic acid hybridization, a practical approach" IRL Press Oxford, Washington DC, (1985). Stringent hybridization and washing conditions may be 0.1 x SSC, 0.1% SDS at 65 °C.

Probes are particularly useful for assays such as microarrays or blot assays (e.g., dot blot, Southern blot, northern blot) as described and exemplified herein. The sequence of the probe does not have to be 100% complementary to the respective sequence of the polynucleotide to be detected as long as it is still capable of hybridizing to the polynucleotide, preferably under stringent conditions. Furthermore, probes may be conjugated to marker molecules or tagging molecules as described herein such as fluorescent dyes excited and emitting at UV/VIS or infrared wavelengths like FITC, TRITC, Texas Red, Cy-dyes, Alexa dyes (Bioprobes), or the like.

Primers can be used particularly in PCR techniques as described herein. Preferably, as known in the art, one primer is complementary to a sequence of the 5'-end of one strand of the polynucleotide to be detected ("forward primer"), while the other primer is complementary to a sequence of the 3'-end of the other strand of the polynucleotide to be detected ("reverse primer"). Primers for PCR techniques and the like should usually have a length of about 12 to 30 nucleotides, but they may also comprise more or fewer nucleotides if appropriate. The primers need not be 100% complementary to the respective sequences of the polynucleotide to be detected as long as they are still capable of hybridizing to the polynucleotide, preferably under stringent conditions. The primers described herein may also hybridize upstream or downstream of the nucleotide sequences shown in SEQ ID NOs. 1 to 31, or may hybridize upstream or downstream of the respective specific substitutions (i.e. SNPs) described herein. Furthermore, primers may be conjugated to marker molecules or tagging molecules as described herein such as fluorescent dyes excited and emitting at UV/VIS or infrared wavelengths like FITC, TRITC, Texas Red, Cy-dyes, Alexa dyes (Bioprobes), or the like.
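The primer description above can be illustrated with a minimal Python sketch. It follows the common laboratory convention in which the forward primer copies the first bases of the given strand and the reverse primer is the reverse complement of its last bases. The helper names `reverse_complement` and `pick_primer_pair`, the toy target sequence, and the choice of simply taking the two ends of the target are assumptions made for illustration only; practical primer design would additionally weigh melting temperature, GC content and specificity.

```python
from typing import Tuple

# Translation table for Watson-Crick complements (DNA only).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def pick_primer_pair(target: str, length: int = 20) -> Tuple[str, str]:
    """Pick a naive forward/reverse primer pair from the ends of `target`.

    Hypothetical helper: the 12-30 nt bound mirrors the usual primer
    length mentioned in the text; everything else is a simplification.
    """
    if not 12 <= length <= 30:
        raise ValueError("primer length outside the usual 12-30 nt range")
    if len(target) < 2 * length:
        raise ValueError("target too short for two non-overlapping primers")
    forward = target[:length]                        # same sense as the given strand
    reverse = reverse_complement(target[-length:])   # anneals to the given strand
    return forward, reverse


# Toy target sequence (invented, not one of SEQ ID NOs. 1 to 31).
toy_target = "ATGCGTACCGGTTAGCATCGATCGGATCCTAGG"
fwd, rev = pick_primer_pair(toy_target, length=12)
print(fwd, rev)
```

Both primers are written 5' to 3', which is why the reverse primer is reverse-complemented rather than merely complemented.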

Another aspect further relates to kits comprising (a) probe(s) and/or (a) primer(s) described herein and to be employed in context of the other aspect of the present invention. Preferably, the kit is suitable to identify a patient having an increased risk or being prone to develop type 2 diabetes (T2D) or having T2D. The kit may be employed in a method for determining the presence of at least one polynucleotide as described herein, e.g., PCR techniques, restriction digestion, chain-termination based sequencing (e.g., sequencing after Sanger (Nature (1977), 265: 687-695)), high-throughput sequencing (e.g., 454 Sequencing® from Roche or SOLiD® Sequencing from ABI) or hybridization techniques such as, e.g., microarray, dot blot or Southern blot. Preferably, the kit is employed in a method according to the other aspect of the present invention for identifying a patient having increased risk or being prone to develop T2D or having T2D.

While the other aspect of the invention has been illustrated and described in detail in the drawings and foregoing description, such illustration and description are to be considered illustrative or exemplary and not restrictive. It will be understood that changes and modifications may be made by those of ordinary skill within the scope and spirit of the following claims. In particular, the other aspect of the present invention covers further embodiments with any combination of features from different embodiments described above and below.

The other aspect of the invention also covers all further features shown in the Figs. individually, although they may not have been described in the foregoing or following description. Also, single alternatives of the embodiments described in the figures and the description and single alternatives of features thereof can be disclaimed from the subject matter of the other aspect of the invention.

Furthermore, in the claims the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. A single unit may fulfil the functions of several features recited in the claims. The terms "essentially", "about", "approximately" and the like in connection with an attribute or a value particularly also define exactly the attribute or exactly the value, respectively.

The following sequences/SNPs are referred to in this application, wherein the specific substituted nucleotides are indicated in bold and are underlined. These SNPs are indicative of an increased risk of developing, or of being prone to develop, type 2 diabetes, or of having type 2 diabetes. The SNPs have been identified by the method of identifying a regulatory region or a regulatory variation in a reference genome as provided herein.

SEQ ID NO: 1

rs10203174 ± ~60 bp

DNA

Homo sapiens

AGGCATTTCCAGTTCAAAATATAAAGGTAAACACTCTCACAGAGACTCAATTATTTG

GGTAGAGACATGTGGTTAACATTTCCTAGGGAAGAAGCAAATGAGAAAAGGTCAAA

GGCAAAA

SEQ ID NO: 2

rs11605166 ± ~60 bp

DNA

Homo sapiens

TATTAAAAATCTACCTTCAGAGAAAGGTCTCTTATCTCCTTCCCAAGTTTTATCTATA GCCACTTAAAAGTAATCAGATTTAAGAAAAAAAGTTATCTATTTTAACCAAATCTCCC TAAG

SEQ ID NO: 3

rs10830956 ± ~60 bp

DNA

Homo sapiens

ATTGGCATCAGCTAATTTTAAAGAGGTAAAACATTGAATTAATTCATCTTTAGTGGA

GGCAGCAATTAGTATATCATCCATATAATGAAGAATGTAATTATTTTTAAAAGTCTGT

TGGAT

SEQ ID NO: 4

rs17244499 ± ~60 bp

DNA

Homo sapiens

ATAAAGGCCAGTAGAGCTACCTTCAGAGTTACTATCCTAGAGAAAATACCTATGAC AACGTCTGCTTCAGTTACATATACAAAGCATTTCATAAAATATTAAGAAGAAAAGCA AGCTTCA

SEQ ID NO: 5

rs7202116 ± ~60 bp

DNA

Homo sapiens

TGAGGCCTAATGTTGAAATCTCATCTGTAAGTCTGGTATATCTAACTAATCATATAA ACATCTTTCATCTTAGACTGTGTAGCATGTCCCTAGTAGACCAGGTGGGCCAAATG ACATTAT

SEQ ID NO: 6

rs2911711 ± ~60 bp

DNA

Homo sapiens

TTGAAAAATTTCAGGTTTACAGAAAAGTTACAAAGATAGTACAGCAAATTCCCATAT

ATATTTCACCTAGTTTCACCTAACATTAACATCTTACATAAGTTTGGTACGTTTGTCA

AAACT

SEQ ID NO: 7

rs2908289 ± ~60 bp

DNA

Homo sapiens

CAGAAACACCATCATTTCTTTGTAATCAGCTGAAATCACCACTCATTAGTGGGCCTC TCGTGGGCTATAACTGATTTTACAGCCACCTGTCATTTCTTTTTTCTTTACAACTTTG TGCGT

SEQ ID NO: 8

rs2042587 ± ~60 bp

DNA

Homo sapiens

AAGTGCAGGGCAATGGAAAAGACTAGAGAAAGCTCTGGAAATGGAGAACCCTTAA TGAACAACTGCTGCAACAGTGGTCAGTTCCTGCAAGTTATCACTGATTAAAGGGAA AGCAAGCAC

SEQ ID NO: 9

rs3751812 ± ~60 bp

DNA

Homo sapiens

CTTGGTAGAAAGATAGCCAGGCATAGCTCAGCAGACCTGAAAATAGGTGAGCTGT CAAGGTGTTGGCAGGGAGAGGCTCCTCTGGGTGGGACTTGGGGCATCCTACCAG CGAAAAAGAGA

SEQ ID NO: 10

rs13241165 ± ~60 bp

DNA

Homo sapiens

ATTTAGAAACAGCCATTTATATGTGAGAGTTCAGTATATGATAAAGGTGACATTTCA

AAAGAGTGGGGGAAATGATAGACAATTTATTAAAGAGTGTTCAACAGTTACTTAATC

ATTTTG

SEQ ID NO: 11

rs11602858 ± ~60 bp

DNA

Homo sapiens

ATAGATGAATAAGACAAACAAACAAGGTTAGTATCTGCAAAGAATTTATAATCTGGG AAGTGGGCGAACCAAACAGACTACATAATTAAAATACACCACGGAGTGTGTAAAGG GTACGAG

SEQ ID NO: 12

rs2288232 ± ~60 bp

DNA

Homo sapiens

GGCTTCCTCTGACACCCTCTTGCCCTTCCCAGACTGCTGACGCCTACACATGGGC AGACAACCTACCTGGTCCCTTCACTTACCTCTGGTAGTGTATCCTCCAATGACATA AATTTTTCC

SEQ ID NO: 13

rs7936247 ± ~60 bp

DNA

Homo sapiens

ATTTTATCTATAAGTATTAAATATCTTAGCATGTATCTCTGAAAGATCAAGACTTTTT

AGATACATAACCGCCATACCATTGGTCACACCTAAAAAATTAACAATACTTTTCTTTT

TCTT

SEQ ID NO: 14

rs13230111 ± ~60 bp

DNA

Homo sapiens

AAATGCTTCCAAATAGGGAACTGGTGAATTATATTATGATACCTATTTATTTACAGT

GAACCTCTACACAGCCAGGAAATTGCTAATGTAGAAAACATATAATGACCTGCAAA

TGTTACA

SEQ ID NO: 15

rs4684847 ± ~60 bp

DNA

Homo sapiens

ATAATTTTGAAAGCCTACTCTGTTCCAGACGCTGATTTATTTAAATCATCTCTAATTC TTACAACTCCGAAAAGATAAGAAAACAGAGATGTGACAAGGCTTGAATGCAAACCC AGGTCT

SEQ ID NO: 16

rs7201850 ± ~60 bp

DNA

Homo sapiens

ACCCAGATATGTGTTATGCTGATTCTGTGTGATTGGCTACTTCCAGGATTGACCCT

GTTCTCCTGTCCTGCTCAAGTCTCCTTCCTTCACTTAACCATATATTGTGGGTACTC

AGAAATG

SEQ ID NO: 17

rs2881654 ± ~60 bp

DNA

Homo sapiens

AAGTGCTGGGATTACAGGCGTGAGCCATCGCAGCCTGCCCCTGAATTCCATAATTT

CACATGTAATTAGAATTCAAGACCTTTTTCTTGAATATTACTCTTGTCCATATTAATT

TTTTTC

SEQ ID NO: 18

rs4504165 ± ~60 bp

DNA

Homo sapiens

ATTCTTAACTGAACAAAGTACTTTATTTTCTACTTTAGTGTGAGTTGACCTGAAACTT TCGGCCCCAGCCAAGTTACAAATATACCCTACTAGGACTTGGCCTTGTTCACAAAT TCCTAA

SEQ ID NO: 19

rs13083375 ± ~60 bp

DNA

Homo sapiens

TCAAGAAGAATAAAAAAAAAAAGAGTAAGGCCATTAGAATGTTCAAGGAGAAGATC ATTTTTTAAGATTATAAAAGCGTAGTGGTAGGAATATTTCAGCATTTAGAAGGTAAT AGTAAAT

SEQ ID NO: 20

rs13234269 ± ~60 bp

DNA

Homo sapiens

AAAATCTGAAGCAAATTAGGTAAATGAGATTTGTCAATATGAGCCTAGGTACAGTG

CTGTTTGCCATATTATTCTCTGTACTTTTCTGTGTGTGTGAAATTTTTTTTTTTAAATC

GCAGA

SEQ ID NO: 21

rs7649970 ± ~60 bp

DNA

Homo sapiens

GGTGGTGTGTTATTCTTCTCATAGAGAACTCCATTTTTTCATTATGACATAGCACTT ATTGTTTAAACATCAATTGATGTTCAAACATCAGCTGGTGTAACATTGCTGCAGTTG CTATTG

SEQ ID NO: 22

rs11979110 ± ~60 bp

DNA

Homo sapiens

CTTAAAGTAGGGGATCATGAAGGAAAGGTACACATTTAACTCAACAAAATTTGGTCT

TACTGGAACTTTAAAAATGCAAATGTGTTTATGTTTTGTTTTATTTTGCTTAACAAAT

ATCTT

SEQ ID NO: 23

rs3996352 ± ~60 bp

DNA

Homo sapiens

CCATTGATATTTAGATTGTGAGAAATATAGAAATTTATTTGGGGAGAGTTTATATCTT TATAATATTAAATTAAGTCTTGTATCTAGACATGTAGTAGGTCTCACCATCCATTTGA ACCT

SEQ ID NO: 24

rs11128603 ± ~60 bp

DNA

Homo sapiens

AGCTAGTTTATATTTTCAAGTCTTCTTACTTTTACTAATGCTGCTTTAAGATATATTCT

GTAAGTTCAAATGTAAAGTTCTAGATTGGGAAAGAATTTATAATGTGTCTAATAAAG

ATTG

SEQ ID NO: 25

rs10882102 ± ~60 bp

DNA

Homo sapiens

AAAATGCTTGGGATTACAAGCATGAGCTACCCTTTGGGCACATAATGTATTATTAG GTAGAGATGGCAAAATTTCTTAATTCTCTTGTTGGGTTAATGTTGTTTTGTTGTTATA AAATTT

SEQ ID NO: 26

rs11709077 ± ~60 bp

DNA

Homo sapiens

ATGGCAAGCCAGTGCATCTCTGTTAGAAATAGAAACTTAGTAGCCACCTCAAAAGC AAGATCATAATTCTGTGAATCAGCTAATAGGTGGTATTTAGCAGTTAGGTATGGGC TACCCTCG

SEQ ID NO: 27

rs849133 ± ~60 bp

DNA

Homo sapiens

AACAAATGAATCCTGACTTACTATTCCTTTGTAAAAGAAAAAGATAAAAATTAAAAAT TTGACTCAGGAAATACAAGAATAGTAAGTAAATACAGCGAGTTAATGAAGAATTACA GATAA

SEQ ID NO: 28

rs12243326 ± ~60 bp

DNA

Homo sapiens

AATAGCAAATCTTAGCTGCCTTGGACCTGATATAATTATTTGTCTTCATTTACATGGT

TTATCCTTCAAGGTTGAATAAATGATGTGGGAGCTAGTCAAGGGGCTTTAGGTATG

TGATTT

SEQ ID NO: 29

rs11600585 ± ~60 bp

DNA

Homo sapiens

TTCTTTTTAAATGATACCATGACACAAGTCACTTCAAACAACAAAATAAACTGTTTTA CGAAGAAATTTGCCCACTTACTATTATCTACACAACACAAATCTGGATAAAACAGAA

SEQ ID NO: 30

rs7638389 ± ~60 bp

DNA

Homo sapiens

AGAGTTAAACAATCAACATCATTAACTCTGTGGCATTTCCTTAAGATATCATGAATTA TGAGACATCCTTTTCTTAACATCAGGTAAAGGAAAATCTTTCATTAAATGTGTATAAT GATT

SEQ ID NO: 31

rs2042587, alternative ± ~60 bp

DNA

Homo sapiens

TGCTTGCTTTCCCTTTAATCAGTGATAACTTGCAGGAACTGACCACTGTTGCAGCA GTTTGTTCATTAAGGTTCTCCATTTCCAGAGCTTTCTCTAGTCTTTTCCATTGCCCT GCACTTT

Corresponding substituted nucleotides to these SNPs are provided herein above.
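As an illustration of how a listed sequence relates to its substituted variant, the following minimal Python sketch takes SEQ ID NO: 1 (rs10203174) as printed in the listing, checks that the nucleotide at position 60 is T, and replaces it with C as stated for this SNP in the description. The function name `substitute` and the 1-based position handling are assumptions for illustration; they are not part of the application.

```python
# SEQ ID NO: 1 (rs10203174), joined from the three lines printed in the listing.
SEQ_ID_NO_1 = (
    "AGGCATTTCCAGTTCAAAATATAAAGGTAAACACTCTCACAGAGACTCAATTATTTG"
    "GGTAGAGACATGTGGTTAACATTTCCTAGGGAAGAAGCAAATGAGAAAAGGTCAAA"
    "GGCAAAA"
)


def substitute(seq: str, position: int, expected_ref: str, alt: str) -> str:
    """Return `seq` with the base at 1-based `position` replaced by `alt`.

    Hypothetical helper: raises if the base found at `position` is not
    the expected reference nucleotide, so a mis-numbered position is caught.
    """
    idx = position - 1  # convert the 1-based position to a 0-based index
    if seq[idx] != expected_ref:
        raise ValueError(
            f"expected {expected_ref} at position {position}, found {seq[idx]}"
        )
    return seq[:idx] + alt + seq[idx + 1:]


# T at position 60 substituted with C, as described for rs10203174.
variant = substitute(SEQ_ID_NO_1, 60, "T", "C")
print(SEQ_ID_NO_1[54:65])
print(variant[54:65])
```

The reference check is the useful part of the sketch: because each listed sequence spans roughly 60 bp on either side of the SNP, an off-by-one error in the position would otherwise go unnoticed.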

Claims

1. A computer implemented method of identifying a regulatory region or a regulatory variation in a reference genome of a species, comprising:

obtaining sequence data of DNA sequences of the reference genome;

defining reference regions of interest in the sequence data;

identifying orthologous regions of at least one further species corresponding to the reference regions;

analyzing the identified orthologous regions with regard to common patterns of regulatory elements;

classifying the reference regions based on the analysis of the orthologous regions; and

rating each reference region in accordance with its corresponding classification,

wherein classifying the reference regions based on the analysis of the orthologous regions comprises summarizing the regulatory elements in modules.

2. The method according to claim 1, wherein analyzing the identified orthologous regions with regard to common patterns of regulatory elements comprises analysing a set of orthologous sequences for each orthologous region.

3. The method according to claim 1 or 2, wherein the regulatory elements are summarized in modules according to the formula:

$$\sum_{x=i}^{\max j}\ \sum_{y=k}^{l} \text{sites in } y\text{-element module in } \frac{\min x\ \text{sequences}}{\text{total input sequences}}.$$

4. The method according to claim 1, 2 or 3, wherein the regulatory elements are summarized in modules according to the formula:

$$\sum_{x=i}^{\max j}\ \sum_{y=k}^{l} \text{sites in } y\text{-element module in } \min x\ \text{sequences}.$$

5. The method according to any one of the preceding claims, wherein classifying the reference regions based on the analysis of the orthologous regions comprises summarizing all common regulatory elements.

6. The method according to claim 5, wherein all common regulatory elements are summarized according to the formula:

$$\sum_{x=i}^{\max j} \text{sites in } \frac{\min x\ \text{sequences}}{\text{total input sequences}}.$$

7. The method according to claim 5 or 6, wherein all common regulatory elements are summarized according to the formula:

$$\sum_{x=i}^{\max j} \text{sites in } \min x\ \text{sequences}.$$

8. The method according to any one of the preceding claims, wherein classifying the reference regions based on the analysis of the orthologous regions comprises summarizing the number of modules.

9. The method according to claim 8, wherein the number of modules is summarized according to the formula:

$$\sum_{x=i}^{\max j}\ \sum_{y=k}^{l} y\text{-element modules in } \frac{\min x\ \text{sequences}}{\text{total input sequences}}.$$

10. The method according to claim 8 or 9, wherein the number of modules is summarized according to the formula:

$$\sum_{x=i}^{\max j}\ \sum_{y=k}^{l} y\text{-element modules in } \min x\ \text{sequences}.$$

11. The method according to any one of the preceding claims, wherein the regulatory elements are transcription factor binding sites, methylation sites, miRNA seats and/or regions of open chromatin.

12. The method according to any one of the preceding claims, wherein rating each reference region in accordance with its corresponding classification comprises determining that a region is a regulatory region when a condition with regard to the classification is met.

13. The method according to any of claims 1 to 12, wherein defining the reference regions of interest in the sequence data and identifying the orthologous regions of the at least one further species corresponding to the reference regions comprise determining a single nucleotide polymorphism and identifying the reference regions spanning the single nucleotide polymorphism.

14. The method according to any of claims 1 to 13, wherein identifying the orthologous regions of the at least one further species corresponding to the reference regions comprises aligning the reference regions with the orthologous regions.

15. The method according to claim 14, wherein aligning the reference regions with the orthologous regions comprises obtaining further sequence data of DNA sequences of the at least one further species and aligning the sequence data with the further sequence data.

16. The method according to claim 15, wherein aligning the sequence data with the further sequence data comprises providing a specific data set of input sequences for each reference region and orthologous regions.

17. The method according to claim 16, comprising assessing a minimum number of the input sequences for each reference region and determining a modular structure of each reference region in stepwise manner up to the minimum number.

18. The method according to claim 17, wherein the minimum number of the input sequences is assessed as a percentage of the total number of the input sequences.

19. The method according to claim 17 or 18, wherein the minimum number of the input sequences is the minimum number of input sequences to contain a common module comprising at least one regulatory element.

20. The method according to claim 19, wherein the common module comprises a plurality of regulatory elements in a specific order and/or in a specific distance from each other.

21. The method according to claim 20, wherein analysing the identified orthologous regions of the sequence data with regard to the common regulatory elements comprises predefining a maximum distance variance between two regulatory elements within the common module, a range of the distance between two regulatory elements within the common module and a range of number of the regulatory elements within the common module.

22. The method according to any one of claims 15 to 21, wherein aligning the sequence data with the further sequence data comprises aligning a plurality of base pair sequences of the sequence data with a corresponding plurality of base pair sequences of the further sequence data.

23. The method according to claims 13 and 22, wherein the base pair sequences of the sequence data comprise a base pair sequence having the single nucleotide polymorphism essentially in the middle.

24. The method according to any one of the claims 16 to 23, wherein analysing the identified orthologous regions of the sequence data with regard to the common regulatory elements comprises extracting a common framework of regulatory elements from the specific data set of the input sequences.

25. A computer program comprising code means being adapted to implement the method according to any one of the preceding claims when being executed.

26. A computer program product directly loadable into the internal memory of a digital computer, comprising software code portions for performing the steps of any of claims 1 to 24 when said product is run on a computer.
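The un-normalized summations recited in the claims (sites in y-element modules in min x sequences, sites in min x sequences, and y-element modules in min x sequences) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the input layout (a mapping from `(x, y)` or from `x` to a pre-computed count), the function names `double_sum` and `single_sum`, and the toy counts are all hypothetical and chosen only to make the index ranges x = i ... max j and y = k ... l concrete. The variants normalized by the total number of input sequences are not shown.

```python
from typing import Mapping, Tuple


def double_sum(
    counts: Mapping[Tuple[int, int], int], i: int, max_j: int, k: int, l: int
) -> int:
    """Sum counts over x = i..max_j and y = k..l (both bounds inclusive).

    counts[(x, y)] is assumed to hold a pre-computed quantity for
    y-element modules found in at least x input sequences, e.g. the
    number of sites or the number of modules. Missing keys count as 0.
    """
    return sum(
        counts.get((x, y), 0)
        for x in range(i, max_j + 1)
        for y in range(k, l + 1)
    )


def single_sum(counts: Mapping[int, int], i: int, max_j: int) -> int:
    """Sum counts over x = i..max_j, e.g. sites common to at least x sequences."""
    return sum(counts.get(x, 0) for x in range(i, max_j + 1))


# Toy counts (invented for illustration, not data from the application).
sites_in_modules = {(2, 2): 4, (2, 3): 3, (3, 2): 2, (3, 3): 0, (4, 2): 2}
sites_common = {2: 10, 3: 6, 4: 3}

module_score = double_sum(sites_in_modules, i=2, max_j=4, k=2, l=3)
site_score = single_sum(sites_common, i=2, max_j=4)
print(module_score, site_score)
```

Treating absent `(x, y)` combinations as zero keeps the sums well defined when no module of a given size is found at a given sequence threshold.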