Database Creation

Note: This program can be grouped into sections that are dependent on other sections finishing. The first section is points 1 and 2. The second section (points 3, 4, 5, and 6) depends on the first section being complete, but these points can otherwise run in parallel (point 4 appends to the files that point 3 creates, so it must follow point 3).
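
The parallelism in the second section can be sketched in shell as follows. This is a minimal sketch, assuming all input files sit in one directory; build_section_two is a hypothetical name, the formatdb call and the tab-file check are left out, and the GI filter uses a plain grep with a literal tab rather than grep -P. Within the FASTA and DAT chains the copy still has to finish before the append, so only the three chains run side by side.

```shell
#!/bin/bash
# Sketch only: run the three independent chains from points 3 to 5 in parallel.
# build_section_two is a hypothetical helper name, not part of the EFI tools.
build_section_two() {
    local dir=$1
    local tab
    tab=$(printf '\t')

    (
        cd "$dir" || exit 1

        # Chain 1: FASTA. The copy must finish before the append starts.
        ( cp uniprot_trembl.fasta combined.fasta &&
          cat uniprot_sprot.fasta >> combined.fasta ) &
        pid_fasta=$!

        # Chain 2: DAT, with the same ordering constraint.
        ( cp uniprot_trembl.dat combined.dat &&
          cat uniprot_sprot.dat >> combined.dat ) &
        pid_dat=$!

        # Chain 3: keep only the lines whose ID type column is GI.
        # grep exits non-zero when nothing matches, which is reported as a failure.
        grep "${tab}GI${tab}" idmapping.dat > gionly.dat &
        pid_gi=$!

        # Wait for each chain separately so that any failure is reported.
        status=0
        wait "$pid_fasta" || status=1
        wait "$pid_dat" || status=1
        wait "$pid_gi" || status=1
        exit "$status"
    )
}
```

It could be called as build_section_two /path/to/build/dir once the gunzip in point 2 has finished; a non-zero return means at least one chain failed.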

  1. Download core database files for build
  2. gunzip the files
    gunzip *.gz
  3. Copy the TrEMBL files to create the beginning of the new combined files
    cp uniprot_trembl.fasta combined.fasta
    cp uniprot_trembl.dat combined.dat
    Format combined.fasta with formatdb -pT -oT
  4. Concatenate the Swiss-Prot files to the combined files
    cat uniprot_sprot.fasta >> combined.fasta
    cat uniprot_sprot.dat >> combined.dat
  5. Pull out the GI information (idmapping.dat also contains other information)
    grep -P "\tGI\t" idmapping.dat > gionly.dat
  6. Gather the tab files, then run the following command on each one to ensure it is in the proper format. Instructions are here
  7. tr -d ' \t' < >
  8. If we need a new ENA database, mirror the files. Change NUM to the release version (Release_123 as of Apr 6, 2015).
      rsync -auv rsync:// /home/mirrors/embl/Release_NUM/
  9. Unzip the necessary files (elapsed times from one run are noted as comments)
    cd /home/mirrors/embl/Release_NUM/
    gunzip std/*pro*gz std/*fun*gz std/*env*gz  # 1126.977 seconds
    gunzip con/*env*gz con/*fun*gz con/*pro*gz  # 250 seconds
    for D in wgs/*; do [ -d "${D}" ] && gunzip "${D}"/*pro*gz; done  # 5931.368 seconds
    for D in wgs/*; do [ -d "${D}" ] && gunzip "${D}"/*env*gz; done  # 1253 seconds
    for D in wgs/*; do [ -d "${D}" ] && gunzip "${D}"/*fun*gz; done  # 751.435 seconds
  10. Create the master file. This does not require the ENA database, so the ENA download can keep running in the meantime.
    First check that all of the following files are populated correctly, then run the command
    /home/groups/efi/database_tools/ -dat combined.dat -struct -uniprotgi gionly.dat -efitid -gdna -hmp -phylo
  11. Split the combined.fasta file into many (100) parts and then BLAST them. Update the BLAST database to the latest mirror on biocluster. DO NOT USE /latest/; use the exact path in your command. The first step is to change into the database directory.
  12. module load efiest/devel -parts 100 -tmp pdb/fractions/ -source combined.fasta
  13. Submit the following cluster job, replacing the database_dir variable with the path to the EST-database directory and pointing pdb_database to the latest database, using the full path.
    #!/bin/bash
    #PBS -t 1-100
    #PBS -j oe
    #PBS -S /bin/bash
    #PBS -q efi
    #PBS -l nodes=1:ppn=1
    module load efiest/devel
    blastall -p blastp -i $database_dir/pdb/fractions/fracfile-${PBS_ARRAYID}.fa -d $pdb_database -m 8 -e 1e-20 -b 1 -o $database_dir/pdb/fractions/blastout-${PBS_ARRAYID}
  14. Combine all PDB BLAST results into one file. You may want to use a job dependency so that this runs only after the BLAST job has finished.
    cat pdb/fractions/blastout-* >
  15. Simplify the pdb blast results for the database
    /home/groups/efi/database_tools/ -in -out
  16. Split the match_complete file into parsable chunks, parse the chunks into tables for each database type (SSF, INTERPRO, GENE3D, PFAM), and then create tab files.
    module load perl
    mkdir match_complete
    /home/groups/efi/database_tools/ match_complete.xml match_complete
    /home/groups/efi/database_tools/ match_complete/*.xml
  17. create [ , ,, ] .tab files for combined table by running something similar to the following:
    /home/groups/efi/database_tools/ -embl /home/mirrors/embl/Release_123/ -pro -fun -env -com -pfam /home/groups/efi/databases/20150403/
  18. cat,, and into
    cat >
  19. make the pfam description tab file -short pfam_short_name.txt -long pfam_long_name.txt -out pfam_info.txt
  20. copy the file from the last database to this one.  This file does not change.
  21. Create the MySQL database
    mysql -u efignn -p -P 3307 -h
    create database efi_YYYYMMDD;
  22. Populate the tables
    mysql -u efignn -p efi_YYYYMMDD -P 3307 -h < /home/groups/efi/database_tools/mysqlcreatedatabase.sql
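
Points 21 and 22 must use the same database name, so it can help to generate both commands from one date stamp. The following is a minimal sketch rather than part of the EFI tools: print_db_commands is a hypothetical helper, the host is passed as an argument because this page does not show it, and the create statement is passed with mysql -e instead of being typed at the prompt. The helper only prints the commands; it does not run them.

```shell
#!/bin/bash
# Sketch only: print the two mysql commands from points 21 and 22 for one
# date stamp, so that the same database name is used in both.
# print_db_commands is a hypothetical helper; the host argument is an
# assumption because the page does not show the host name.
print_db_commands() {
    local stamp=$1 host=$2

    # Accept exactly eight digits (YYYYMMDD).
    case $stamp in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
        *) echo "expected YYYYMMDD, got '$stamp'" >&2; return 1 ;;
    esac
    if [ -z "$host" ]; then
        echo "missing host" >&2
        return 1
    fi

    local db="efi_${stamp}"
    # User, port, and SQL file path are the ones given in points 21 and 22.
    echo "mysql -u efignn -p -P 3307 -h ${host} -e 'CREATE DATABASE ${db};'"
    echo "mysql -u efignn -p ${db} -P 3307 -h ${host} < /home/groups/efi/database_tools/mysqlcreatedatabase.sql"
}
```

For example, print_db_commands "$(date +%Y%m%d)" db.example.org prints both commands for today's date, where db.example.org stands in for the real host.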