The information below relates to the initial phase of the edikt project, which ended in May 2005. Information on the current phase is available via the edikt portal.


 

edikt::emTree

 

The Hard Disk Bottleneck

• technological advance (cf. Moore’s Law) is delivering exponential increases in CPU speeds, hard disk data densities and scientific data volumes, but NOT in hard disk speeds:

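1975 Typical disk drive: IBM 3330 (data capacity: c. 100 MB, average access time: c. 100 msec)
2002 Typical disk drive: IBM DeskStar (data capacity: c. 100 GB, average access time: c. 10 msec)

That is, over 27 years capacity grew roughly 1000-fold while access time improved only about 10-fold.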

• large enterprises have long had to deal with large data volumes (currently in the 10-100 TB size range – see, for example, http://www.wintercorp.com ). Commercial DBMSs have evolved largely to meet their needs.
• scientific data processing requires unorthodox and advanced data analysis that is not found in commercial database products (for example, searching genomes for genetic sequences). Scientists need to identify (index) any occurrence of a sequence anywhere in the genome, whereas standard commercial indexing techniques (B-trees) index only on string prefixes.
• traditionally scientists circumvent these problems by reading the whole data file into memory and building in-storage data structures, or searching by scanning the whole data file sequentially
• BUT this is becoming infeasible because data volumes are outpacing memory sizes and hard disk access speeds
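
To illustrate why a prefix-only index is not enough for "find this sequence anywhere" queries, here is a minimal sketch of a suffix-based index. It uses a suffix array (a simpler relative of the suffix tree named in the objectives below) and builds it entirely in memory, which is exactly the approach that stops scaling for large genomes. The function names and the sample string are illustrative assumptions, not part of emTree.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in sorted order.

    Naive construction that copies every suffix; for illustration only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return the sorted start positions of every occurrence of pattern."""
    m = len(pattern)
    if m == 0:
        return []

    # Every occurrence of pattern is a prefix of some suffix, and all such
    # suffixes are contiguous in the sorted order, so two binary searches
    # are enough to locate them.
    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    hi = len(suffix_array)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


genome = "GATTACAGATTACA"  # made-up sample sequence
index = build_suffix_array(genome)
print(find_occurrences(genome, index, "TTA"))
```

Indexing every suffix is what lets a query match at any position rather than only at the start of a stored key. The cost is an index at least as large as the data itself, which is why such structures need to live on disk once the data outgrows memory.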


emTree proposed objectives
- enable faster access to large scientific data sets using specialised indexes
- produce generic software components to enable production of a variety of external memory data structures (indexes) not currently available in commercial databases, for example:
  - suffix trees
  - k-d trees
  - string B-trees
- have good scalability characteristics
- build on a commercial DBMS substrate
- use parallelism
- use hard disks well
- use the emTree components to build a prototype large suffix tree for genome searching

emTree proposed design
- the emTree design uses ideas from previous research projects:
  - PJama (Glasgow University)
  - TPIE (Duke University)
  - GiST (University of California)
- emTree will be a tree-structure of virtual memory pages (each page will be a sub-tree containing many tree nodes)
- emTree will be part of a layered structure:
  1) Scientific application
  2) Index Reader/Writer (node level)
  3) EM Reader/Writer (block level)
  4) DBMS/file system
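
The layering above can be sketched as follows. This is a rough illustration under stated assumptions, not the emTree implementation: the class names, the 4096-byte block size and the fixed node layout (a key plus two child ids) are all hypothetical, and an ordinary temporary file stands in for the DBMS/file system layer.

```python
import struct
import tempfile

BLOCK_SIZE = 4096                          # assumed page size
NODE_FORMAT = "<qqq"                       # assumed node layout: key, left id, right id
NODE_SIZE = struct.calcsize(NODE_FORMAT)   # 24 bytes
NODES_PER_BLOCK = BLOCK_SIZE // NODE_SIZE  # many tree nodes share one block


class EMReaderWriter:
    """Layer 3 (block level): moves whole blocks to and from layer 4."""

    def __init__(self, file):
        self._file = file  # layer 4 stand-in: a plain binary file

    def read_block(self, block_no):
        self._file.seek(block_no * BLOCK_SIZE)
        data = self._file.read(BLOCK_SIZE)
        # Blocks never written before read back as all zeros.
        return bytearray(data.ljust(BLOCK_SIZE, b"\0"))

    def write_block(self, block_no, block):
        self._file.seek(block_no * BLOCK_SIZE)
        self._file.write(block)


class IndexReaderWriter:
    """Layer 2 (node level): maps node ids onto slots inside blocks."""

    def __init__(self, em):
        self._em = em

    def write_node(self, node_id, key, left, right):
        block_no, slot = divmod(node_id, NODES_PER_BLOCK)
        block = self._em.read_block(block_no)
        struct.pack_into(NODE_FORMAT, block, slot * NODE_SIZE, key, left, right)
        self._em.write_block(block_no, block)

    def read_node(self, node_id):
        block_no, slot = divmod(node_id, NODES_PER_BLOCK)
        block = self._em.read_block(block_no)
        return struct.unpack_from(NODE_FORMAT, block, slot * NODE_SIZE)


# Layer 1 (scientific application): works purely in terms of nodes.
with tempfile.TemporaryFile() as f:
    index = IndexReaderWriter(EMReaderWriter(f))
    index.write_node(0, 42, 1, 200)     # lands in block 0
    index.write_node(200, 7, -1, -1)    # lands in block 1
    print(index.read_node(0), index.read_node(200))
```

The application never sees block numbers and the block layer never sees node contents, so either layer could be swapped (for example, replacing the file with DBMS-managed storage) without changing the other.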

 

 This project is not currently active.