Database Creation


Note: These steps fall into sections, and each section depends on the previous one finishing. The first section is steps 1 and 2. The second section (steps 3, 4, 5, and 6) depends on the first section being complete, but all of its steps can run in parallel.


  1. Download core database files for build
    ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz
    ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.dat.gz
    ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
    ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz
    ftp://ftp.ebi.ac.uk/pub/databases/interpro/match_complete.xml.gz
    ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/idmapping.dat.gz
  2. gunzip the files
    gunzip *.gz
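Steps 1 and 2 can be scripted so the downloads are repeatable. The following is a minimal sketch, assuming `wget` is available on the build host; `core_urls` and `fetch_core_files` are hypothetical helper names, and the URLs are the ones listed in step 1.

```shell
# Print the six core download URLs from step 1, one per line.
core_urls() {
  base=ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase
  for name in uniprot_sprot.dat uniprot_trembl.dat uniprot_sprot.fasta uniprot_trembl.fasta; do
    echo "$base/complete/$name.gz"
  done
  echo "ftp://ftp.ebi.ac.uk/pub/databases/interpro/match_complete.xml.gz"
  echo "$base/idmapping/idmapping.dat.gz"
}

# Hypothetical helper: download every URL, resuming partial files (-c).
# Stops at the first failed download.
fetch_core_files() {
  for url in $(core_urls); do
    wget -c "$url" || return 1
  done
}
```

To run both steps in one go: `fetch_core_files && gunzip ./*.gz`.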
  3. Copy trembl files to create beginning of new combined files
    cp uniprot_trembl.fasta combined.fasta
    cp uniprot_trembl.dat combined.dat
    Format combined.fasta with formatdb -p T -o T
  4. Concatenate the Swiss-Prot files to the combined files
    cat uniprot_sprot.fasta >> combined.fasta
    cat uniprot_sprot.dat >> combined.dat
  5. Pull out the GI mappings (idmapping.dat also contains other ID types)
    grep -P "\tGI\t" idmapping.dat > gionly.dat
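Where `grep -P` is not available, the same filter can be written with `awk`. This is a sketch that assumes idmapping.dat has three tab-separated columns (accession, ID type, ID); `extract_gi` is a hypothetical helper name.

```shell
# Keep only rows whose second tab-separated column is exactly "GI".
extract_gi() {
  awk -F '\t' '$2 == "GI"' "$1"
}
```

Usage: `extract_gi idmapping.dat > gionly.dat`.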
  6. Gather the tab files, then run the following commands to ensure each one has Unix line endings.
    mac2unix gdna.tab
    dos2unix gdna.tab
    mac2unix phylo.tab
    dos2unix phylo.tab
    mac2unix efi-accession.tab
    dos2unix efi-accession.tab 
    mac2unix hmp.tab 
    dos2unix hmp.tab
  7. Strip all spaces and tabs from gdna.tab
    tr -d ' \t' < gdna.tab > gdna.new.tab
    rm gdna.tab
    mv gdna.new.tab gdna.tab
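After the conversions in steps 6 and 7, it can be worth confirming that no carriage returns survive. The following sketch does that check; `check_line_endings` is a hypothetical helper name and is not part of the EFI tools.

```shell
# Report any listed file that still contains a carriage return.
# Returns non-zero if at least one such file is found.
check_line_endings() {
  cr=$(printf '\r')
  status=0
  for f in "$@"; do
    if grep -q "$cr" "$f"; then
      echo "carriage returns remain in: $f"
      status=1
    fi
  done
  return $status
}
```

Usage: `check_line_endings gdna.tab phylo.tab efi-accession.tab hmp.tab`.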
  8. If a new ENA database is needed, mirror the files. Replace NUM with the release version (Release_123 as of Apr 6, 2015).
      rsync -auv rsync://rsync.ebi.ac.uk:/pub/databases/ena/sequence/release/ /home/mirrors/embl/Release_NUM/
  9. Decompress the necessary files (observed run times are noted in the comments)
    cd /home/mirrors/embl/Release_NUM/
    gunzip std/*pro*gz std/*fun*gz std/*env*gz   # 1126.977 seconds
    gunzip con/*env*gz con/*fun*gz con/*pro*gz   # 250 seconds
    for D in wgs/*; do [ -d "${D}" ] && gunzip "$D"/*pro*gz; done   # 5931.368 seconds
    for D in wgs/*; do [ -d "${D}" ] && gunzip "$D"/*env*gz; done   # 1253 seconds
    for D in wgs/*; do [ -d "${D}" ] && gunzip "$D"/*fun*gz; done   # 751.435 seconds
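The wgs loops in step 9 pass an unexpanded pattern to gunzip whenever a subdirectory has no matching file, which produces an error for that directory. A possible variant that skips such directories is sketched below; `gunzip_wgs` is a hypothetical helper name, and the `wgs/<subdir>/` layout is taken from the loops in step 9.

```shell
# Decompress every file matching wgs/*/*<tag>*gz, skipping subdirectories
# that contain no match for the tag.
gunzip_wgs() {
  tag=$1
  for dir in wgs/*/; do
    [ -d "$dir" ] || continue
    for f in "$dir"*"$tag"*gz; do
      [ -e "$f" ] || continue   # an unmatched glob stays literal; skip it
      gunzip "$f"
    done
  done
}
```

Usage, from the release directory: `for t in pro env fun; do gunzip_wgs "$t"; done`. Calling gunzip once per file is slower than one call per directory, so the timings noted in step 9 would not carry over.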
  10. Create the master struct.tab file. This step does not require the ENA database, so the ENA download can continue while it runs.
    First check that all of the following input files are populated correctly, then run the command
    /home/groups/efi/database_tools/formatdat.pl -dat combined.dat -struct struct.tab -uniprotgi gionly.dat -efitid efi-accession.tab -gdna gdna.tab -hmp hmp.tab -phylo phylo.tab
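Part of the input check in step 10 can be automated. The sketch below only catches files that are missing or empty, and says nothing about whether their contents are correct; `require_nonempty` is a hypothetical helper name.

```shell
# Return non-zero and name the offenders if any given file is missing or empty.
require_nonempty() {
  missing=0
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f" >&2
      missing=1
    fi
  done
  return $missing
}
```

Usage: run `require_nonempty combined.dat gionly.dat efi-accession.tab gdna.tab hmp.tab phylo.tab` and start formatdat.pl only if it succeeds.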
  11. Split the combined.fasta file into many (100) parts and then BLAST each part (steps 12 and 13). Point the BLAST database at the latest mirror on biocluster. DO NOT USE /latest/; use the exact path in your command. First change into the database directory.
  12. Load the EFI-EST module and split the FASTA file
    module load efiest/devel
    splitfasta.pl -parts 100 -tmp pdb/fractions/ -source combined.fasta
  13. Submit the following cluster job, replacing the database_dir variable with the path to the EST-database directory. Point pdb_database to the latest database, using the full path.
    #!/bin/bash
    #PBS -t 1-100
    #PBS -j oe
    #PBS -S /bin/bash
    #PBS -q efi
    #PBS -N PDB_BLAST
    #PBS -l nodes=1:ppn=1
    module load efiest/devel
    # Path to the EST-database directory for this build
    database_dir='/home/groups/efi/databases/20150403'
    # Full path to the PDB BLAST database (do not use /latest/)
    pdb_database='/home/mirrors/NCBI/BLAST_DBS/20150215/pdbaa'
    blastall -p blastp -i "$database_dir/pdb/fractions/fracfile-${PBS_ARRAYID}.fa" -d "$pdb_database" -m 8 -e 1e-20 -b 1 -o "$database_dir/pdb/fractions/blastout-${PBS_ARRAYID}.fa.tab"
  14. Combine all PDB BLAST results into one file. You may want to use a job dependency so that this runs only after the array job has finished.
    cat pdb/fractions/blastout-* > pdb.tab
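Before concatenating, it may help to confirm that every array task wrote its output file. The following sketch lists the task indices with no output; `missing_blast_parts` is a hypothetical helper name, and the file naming follows the `-o` argument in step 13.

```shell
# Print each index in 1..parts whose BLAST output file does not exist.
# An empty file still counts as present, since a fraction may have no hits.
missing_blast_parts() {
  dir=$1
  parts=$2
  i=1
  while [ "$i" -le "$parts" ]; do
    [ -e "$dir/blastout-$i.fa.tab" ] || echo "$i"
    i=$((i + 1))
  done
}
```

Usage: `[ -z "$(missing_blast_parts pdb/fractions 100)" ] && cat pdb/fractions/blastout-* > pdb.tab`.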
  15. Simplify the pdb blast results for the database
    /home/groups/efi/database_tools/pdbblasttotab.pl -in pdb.tab -out pdbhits.tab
  16. Split the match_complete file into parsable chunks, parse them into tables for each database type (SSF, INTERPRO, GENE3D, PFAM), and then create the tab files.
    module load perl
    mkdir match_complete
    /home/groups/efi/database_tools/chopxml.pl match_complete.xml match_complete
    /home/groups/efi/database_tools/formatdatfromxml.pl match_complete/*.xml
  17. Create the pro.tab, fun.tab, env.tab, and com.tab files for the combined table by running something similar to the following:
    /home/groups/efi/database_tools/createdb.pl -embl /home/mirrors/embl/Release_123/ -pro pro.tab -fun fun.tab -env env.tab -com com.tab -pfam /home/groups/efi/databases/20150403/PFAM.tab
  18. Concatenate pro.tab, fun.tab, and env.tab into distance.tab
    cat pro.tab fun.tab env.tab > distance.tab
  19. Make the Pfam description tab file
    create_pfam_info.pl -short pfam_short_name.txt -long pfam_long_name.txt -out pfam_info.txt
  20. Copy the colors.tab file from the previous database to this one. This file does not change.
  21. Create the MySQL database
    mysql -u efignn -p -P 3307 -h 10.1.1.3
    create database efi_YYYYMMDD;
  22. Populate the tables
    mysql -u efignn -p efi_YYYYMMDD -P 3307 -h 10.1.1.3 < /home/groups/efi/database_tools/mysqlcreatedatabase.sql
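The YYYYMMDD placeholder in steps 21 and 22 is easy to mistype. The sketch below builds the database name from a date tag and rejects anything that is not eight digits. `db_name_for` is a hypothetical helper name, and using the same tag as the database directory (for example 20150403) is an assumption.

```shell
# Print the database name for an 8-digit date tag, or fail on anything else.
# The efi_ prefix comes from step 21.
db_name_for() {
  case $1 in
    [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) echo "efi_$1" ;;
    *) echo "expected YYYYMMDD, got: $1" >&2; return 1 ;;
  esac
}
```

Usage: `db=$(db_name_for 20150403)` and then pass `"$db"` to the mysql command in place of efi_YYYYMMDD.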