Database Creation2

Download and Format Datafiles

download_and_format.sh

This step downloads the match_complete.xml file from InterPro and the UniProt database files. Make sure both sources have been updated before updating the database.

  1. Create a directory YYYYMMDD
  2. Download core database files for build
    ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz
    ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.dat.gz
    ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
    ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz
    ftp://ftp.ebi.ac.uk/pub/databases/interpro/match_complete.xml.gz
    ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/idmapping.dat.gz
  3. gunzip the files
    gunzip *.gz
  4. Copy trembl files to create beginning of new combined files
    cp uniprot_trembl.fasta combined.fasta
    cp uniprot_trembl.dat combined.dat
    Format combined.fasta with formatdb -pT -oT
  5. Concatenate the swissprot files to the combined files
    cat uniprot_sprot.fasta >> combined.fasta
    cat uniprot_sprot.dat >>combined.dat
  6. Pull out GI information (other information is in this file also)
    grep -P "\tGI\t" idmapping.dat > gionly.dat
  7. Gather the tab files, then run the following commands to ensure each is in the proper format. Instructions are here
    mac2unix gdna.tab
    dos2unix gdna.tab
    mac2unix phylo.tab
    dos2unix phylo.tab
    mac2unix efi-accession.tab
    dos2unix efi-accession.tab 
    mac2unix hmp.tab 
    dos2unix hmp.tab
  8. Strip all spaces and tabs from gdna.tab
    tr -d ' \t' <gdna.tab > gdna.new.tab
    rm gdna.tab
    mv gdna.new.tab gdna.tab
  9. Create the master struct.tab file. First check that all of the input files named in the command are populated correctly, then run it
    /home/groups/efi/database_tools/formatdat.pl -dat combined.dat -struct struct.tab -uniprotgi gionly.dat -efitid efi-accession.tab -gdna gdna.tab -hmp hmp.tab -phylo phylo.tab
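The "populated correctly" check before formatdat.pl could be sketched roughly as follows. The function name check_populated is hypothetical and not part of the EFI database tools, and it only confirms that each file exists and is non-empty; it says nothing about whether the contents are well formed.

```shell
# Hypothetical helper (not part of the EFI tools): report any input file
# that is missing or has zero size. Returns non-zero if any file fails.
check_populated() {
    local status=0 f
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            status=1
        fi
    done
    return "$status"
}

# Possible usage, from inside the YYYYMMDD build directory:
#   check_populated combined.dat gionly.dat efi-accession.tab gdna.tab hmp.tab phylo.tab \
#       && echo "all inputs present and non-empty"
```

Stopping on a non-zero return keeps formatdat.pl from running against an incomplete set of inputs.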

PDB Blast

Pdb update.sh

Pdb blast.sh

  1. If a new ENA database is needed, mirror the files, changing NUM to the release version (Release_123 as of Apr 6, 2015). Otherwise, skip to step 3
      rsync -auv rsync://rsync.ebi.ac.uk:/pub/databases/ena/sequence/release/ /home/mirrors/embl/Release_NUM/
  2. Unzip the necessary files (observed run times are noted as comments)
    cd /home/mirrors/embl/Release_NUM/
    gunzip std/*pro*gz std/*fun*gz std/*env*gz  # 1126.977 seconds
    gunzip con/*env*gz con/*fun*gz con/*pro*gz  # 250 seconds
    for D in wgs/*; do [ -d "${D}" ] && gunzip "${D}"/*pro*gz; done  # 5931.368 seconds
    for D in wgs/*; do [ -d "${D}" ] && gunzip "${D}"/*env*gz; done  # 1253 seconds
    for D in wgs/*; do [ -d "${D}" ] && gunzip "${D}"/*fun*gz; done  # 751.435 seconds
  3. Split the combined.fasta file into many (100) parts and then BLAST it. Update the BLAST database to the latest mirror on biocluster. DO NOT USE /latest/; use the exact path in your command. The first step is to change into the database directory.
    module load efiest/devel
    splitfasta.pl -parts 100 -tmp pdb/fractions/ -source combined.fasta
#!/bin/bash
# Submit the following cluster job, replacing the database_dir variable with the path to the EST-database directory
# Point the pdb_database to the latest database, using the full path
#PBS -t 1-100
#PBS -j oe
#PBS -S /bin/bash
#PBS -q efi
#PBS -N PDB_BLAST
#PBS -l nodes=1:ppn=1
module load efiest/devel
database_dir='/home/groups/efi/databases/20150403'
pdb_database='/home/mirrors/NCBI/BLAST_DBS/20150215/pdbaa'
blastall -p blastp -i "$database_dir/pdb/fractions/fracfile-${PBS_ARRAYID}.fa" -d "$pdb_database" -m 8 -e 1e-20 -b 1 -o "$database_dir/pdb/fractions/blastout-${PBS_ARRAYID}.fa.tab"
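Before the results are combined, it may be worth confirming that every array task wrote an output file. The sketch below assumes the blastout-N.fa.tab naming used by the job script above; the function name missing_blast_parts is hypothetical. It only detects tasks that never created their output file and cannot tell whether a task was cut off partway through.

```shell
# Hypothetical helper: print the array indices (1..PARTS) whose BLAST
# output file does not exist in DIR. Prints nothing when all are present.
# Usage: missing_blast_parts DIR [PARTS]   (PARTS defaults to 100)
missing_blast_parts() {
    local dir=$1 parts=${2:-100} i=1
    while [ "$i" -le "$parts" ]; do
        [ -e "$dir/blastout-$i.fa.tab" ] || echo "$i"
        i=$((i + 1))
    done
}

# Possible usage, from inside the database directory:
#   missing_blast_parts pdb/fractions 100
```

Any index that is printed identifies an array task that could be resubmitted on its own.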
  1. Combine all PDB BLAST results into one file. You may want to use a job dependency so that this step runs only after all array tasks have finished
    cat pdb/fractions/blastout-* > pdb.tab
  2. Simplify the pdb blast results for the database
    /home/groups/efi/database_tools/pdbblasttotab.pl -in pdb.tab -out pdbhits.tab
  3. Split the match_complete file into parsable chunks, parse them into tables for each database type (SSF, INTERPRO, GENE3D, PFAM), and then create the tab files.
    module load perl
    mkdir match_complete
    /home/groups/efi/database_tools/chopxml.pl match_complete.xml match_complete
    /home/groups/efi/database_tools/formatdatfromxml.pl match_complete/*.xml
  4. Create the pro.tab, fun.tab, env.tab, and com.tab files for the combined table by running something similar to the following:
    /home/groups/efi/database_tools/createdb.pl -embl /home/mirrors/embl/Release_123/ -pro pro.tab -fun fun.tab -env env.tab -com com.tab -pfam /home/groups/efi/databases/20150403/PFAM.tab
  5. Concatenate pro.tab, fun.tab, and env.tab into distance.tab for this database release
    cat pro.tab fun.tab env.tab >distance.tab
  6. Make the Pfam description tab file. This file was created once and is reused from previous databases.
    create_pfam_info.pl -short pfam_short_name.txt -long pfam_long_name.txt -out pfam_info.txt
  7. Copy the colors.tab file from the last database to this one. This file does not change.
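After the concatenation in step 5, a quick line-count comparison can catch a truncated distance.tab (for example, from a full disk). This is only a sketch; the function name lines_match is hypothetical, and the check compares line counts without inspecting the contents.

```shell
# Hypothetical helper: succeed only if OUTPUT has exactly as many lines
# as all INPUT files combined.
# Usage: lines_match OUTPUT INPUT...
lines_match() {
    local out=$1 expected actual
    shift
    # Arithmetic expansion strips any padding that wc may emit.
    expected=$(( $(cat "$@" | wc -l) ))
    actual=$(( $(wc -l < "$out") ))
    [ "$expected" -eq "$actual" ]
}

# Possible usage:
#   lines_match distance.tab pro.tab fun.tab env.tab || echo "distance.tab looks truncated" >&2
```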

MYSQL

create mysql database

mysql -u efignn -p -P 3307 -h 10.1.1.3

create database efi_YYYYMMDD;

populate tables
mysql -u efignn -p efi_YYYYMMDD -P 3307 -h 10.1.1.3 < /home/groups/efi/database_tools/mysqlcreatedatabase.sql
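Because the database name follows the build directory's date, the name can be derived instead of typed by hand. The sketch below assumes the build directory is named YYYYMMDD, as in step 1 of the download section; the function name efi_db_name is hypothetical.

```shell
# Hypothetical helper: derive the MySQL database name from a build
# directory named YYYYMMDD; reject anything that is not eight digits.
efi_db_name() {
    local stamp
    stamp=$(basename "$1")
    case $stamp in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9])
            echo "efi_$stamp" ;;
        *)
            echo "not a YYYYMMDD directory: $1" >&2
            return 1 ;;
    esac
}

# Possible usage with the commands above:
#   db=$(efi_db_name /home/groups/efi/databases/20150403) || exit 1
#   mysql -u efignn -p -P 3307 -h 10.1.1.3 -e "create database $db"
#   mysql -u efignn -p "$db" -P 3307 -h 10.1.1.3 < /home/groups/efi/database_tools/mysqlcreatedatabase.sql
```

Rejecting names that are not eight digits guards against accidentally passing a path such as /latest/.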