Database Creation2

Download and Format Datafiles

This step downloads the match_complete.xml file from InterPro and the UniProt database files. Check that both have been updated upstream before updating the database.

  1. Create a directory YYYYMMDD
  2. Download core database files for build
  3. gunzip the files
    gunzip *.gz
  4. Copy trembl files to create beginning of new combined files
    cp uniprot_trembl.fasta combined.fasta
    cp uniprot_trembl.dat combined.dat
    Format combined.fasta with formatdb -pT -oT
  5. Concatenate the swissprot files to the combined files
    cat uniprot_sprot.fasta >> combined.fasta
    cat uniprot_sprot.dat >> combined.dat
  6. Pull out GI information (other information is in this file also)
    grep -P "\tGI\t" idmapping.dat > gionly.dat
  7. Gather the tab files, then run the following command on each one to ensure it is in the proper format. Instructions are here
    tr -d ' \t' < >
  8. Create the master file. First check that all of the following files are populated correctly, then run the command
    /home/groups/efi/database_tools/ -dat combined.dat -struct -uniprotgi gionly.dat -efitid -gdna -hmp -phylo
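
The copy-and-concatenate steps above are easy to get wrong silently (for example, by appending the swissprot file twice). The following is a minimal sketch of a sanity check, not part of the EFI tools: the function names are hypothetical, and it assumes the files sit in the current directory and that idmapping.dat has three tab-separated columns (accession, ID type, ID). Under that assumption the awk filter should select the same rows as the grep in the GI step.

```shell
#!/bin/bash
# Sketch only: these helper names are hypothetical, not EFI scripts.

# Count FASTA records (header lines start with '>').
count_fasta_records() {
    grep -c '^>' "$1"
}

# Succeed only if combined.fasta holds exactly the trembl plus swissprot records.
check_combined_fasta() {
    local trembl sprot combined
    trembl=$(count_fasta_records uniprot_trembl.fasta)
    sprot=$(count_fasta_records uniprot_sprot.fasta)
    combined=$(count_fasta_records combined.fasta)
    echo "trembl=$trembl sprot=$sprot combined=$combined"
    [ "$combined" -eq $((trembl + sprot)) ]
}

# Print only the rows whose second tab-separated column is exactly "GI".
# Assumes the three-column idmapping.dat layout.
extract_gi_rows() {
    awk -F'\t' '$2 == "GI"' "$1"
}
```

One possible use, run from the YYYYMMDD directory: `check_combined_fasta && extract_gi_rows idmapping.dat > gionly.dat`.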

PDB Blast



  1. If we need a new ENA database, mirror the files. Change NUM to the release version (Release_123 as of Apr 6, 2015). Otherwise, skip to step 3.
      rsync -auv rsync:// /home/mirrors/embl/Release_NUM/
  2. Unzip the necessary files (approximate run times are in parentheses)
    cd /home/mirrors/embl/Release_NUM/
    gunzip std/*pro*gz std/*fun*gz std/*env*gz (1126.977 seconds)
    gunzip con/*env*gz con/*fun*gz con/*pro*gz (250 seconds)
    for D in wgs/*; do [ -d "${D}" ] && gunzip "${D}"/*pro*gz; done (5931.368 seconds)
    for D in wgs/*; do [ -d "${D}" ] && gunzip "${D}"/*env*gz; done (1253 seconds)
    for D in wgs/*; do [ -d "${D}" ] && gunzip "${D}"/*fun*gz; done (751.435 seconds)
  3. Split the combined.fasta file into many (100) parts and then BLAST it. Update the BLAST database to the latest mirror on biocluster. DO NOT USE /latest/; use the exact path in your command. The first step is to change into the database directory.
    module load efiest/devel
    -parts 100 -tmp pdb/fractions/ -source combined.fasta
# Submit the following cluster job, replacing the database_dir variable with the path to the EST-database directory
# Point the pdb_database to the latest database, using the full path
#PBS -t 1-100
#PBS -j oe
#PBS -S /bin/bash
#PBS -q efi
#PBS -l nodes=1:ppn=1
module load efiest/devel
blastall -p blastp -i $database_dir/pdb/fractions/fracfile-${PBS_ARRAYID}.fa -d $pdb_database -m 8 -e 1e-20 -b 1 -o $database_dir/pdb/fractions/blastout-${PBS_ARRAYID}
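
The name of the splitting tool is not shown above, so as an illustration only, here is a minimal sketch of what splitting a FASTA file into numbered parts could look like. The function `split_fasta` is hypothetical and is not the EFI script. It deals records out round-robin and, as an assumption, names the outputs fracfile-N.fa to match the paths the array job reads.

```shell
#!/bin/bash
# Hypothetical stand-in for the splitting step; not the actual EFI tool.
# Usage: split_fasta SOURCE OUTDIR PARTS
split_fasta() {
    local src=$1 outdir=${2%/} parts=$3
    mkdir -p "$outdir"
    awk -v n="$parts" -v dir="$outdir" '
        # A header line starts a new record; pick its output file round-robin.
        /^>/ { part = (count++ % n) + 1; out = dir "/fracfile-" part ".fa" }
        # Write every line of the record to the current part.
        # Lines before the first header are skipped.
        out != "" { print > out }
    ' "$src"
}
```

For example, `split_fasta combined.fasta pdb/fractions/ 100` would write pdb/fractions/fracfile-1.fa through fracfile-100.fa, provided the input has at least 100 records. Round-robin assignment keeps the parts close to equal in record count, which helps the array tasks finish at similar times.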
  1. Combine all PDB BLAST results into one file. You may want to use a job dependency so that this runs only after every array task has finished.
    cat pdb/fractions/blastout-* >
  2. Simplify the pdb blast results for the database
    /home/groups/efi/database_tools/ -in -out
  3. Split the match_complete.xml file into parsable chunks, parse them into tables for each database type (SSF, INTERPRO, GENE3D, PFAM), and then create the tab files.
    module load perl
    mkdir match_complete
    /home/groups/efi/database_tools/ match_complete.xml match_complete
    /home/groups/efi/database_tools/ match_complete/*.xml
  4. create [ , ,, ] .tab files for combined table by running something similar to the following:
    /home/groups/efi/database_tools/ -embl /home/mirrors/embl/Release_123/ -pro -fun -env -com -pfam /home/groups/efi/databases/20150403/
  5. cat,, and into for this database release
    cat >
  6. Make the Pfam description tab file. This was created once and is reused from the previous databases.
    -short pfam_short_name.txt -long pfam_long_name.txt -out pfam_info.txt
  7. Copy the file from the last database to this one. This file does not change.
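
Before combining the BLAST output in the steps above, it may be worth confirming that every array task actually wrote its file. This is a minimal sketch with a hypothetical helper name. It assumes the blastout-N naming used by the cluster job and only reports missing files, since an empty file could be a legitimate result for a fraction with no hits.

```shell
#!/bin/bash
# Hypothetical helper: list array task numbers (1..PARTS) with no output file.
# Usage: missing_blast_outputs DIR PARTS
missing_blast_outputs() {
    local dir=${1%/} parts=$2 i
    for ((i = 1; i <= parts; i++)); do
        # Existence check only; empty files are treated as present.
        [ -e "$dir/blastout-$i" ] || echo "$i"
    done
}
```

A possible use: `missing=$(missing_blast_outputs pdb/fractions 100)`, then rerun any task numbers it prints before running the cat command.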


Create MySQL Database

mysql -u efignn -p -P 3307 -h

create database efi_YYYYMMDD;

Populate tables
mysql -u efignn -p efi_YYYYMMDD -P 3307 -h < /home/groups/efi/database_tools/mysqlcreatedatabase.sql
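
Because the release date appears in both the database name and the directory name, a typo here is easy to make. The sketch below is a hypothetical helper (not part of the EFI tools) that builds the name and rejects anything that is not eight digits. `DB_HOST` in the comment is a placeholder for the host, which is not shown above.

```shell
#!/bin/bash
# Hypothetical helper: print efi_YYYYMMDD for a release date, or fail.
efi_db_name() {
    local release=$1
    if [[ ! $release =~ ^[0-9]{8}$ ]]; then
        echo "expected YYYYMMDD, got '$release'" >&2
        return 1
    fi
    echo "efi_$release"
}

# Possible use (DB_HOST is an assumed placeholder, set it yourself):
#   db=$(efi_db_name 20150403) &&
#     mysql -u efignn -p -P 3307 -h "$DB_HOST" -e "CREATE DATABASE $db"
```

The check only validates the shape of the date, not whether it is a real calendar date.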