hmmsim man page on DragonFly

hmmsim(1)			 HMMER Manual			     hmmsim(1)

NAME
       hmmsim - collect score distributions on random sequences

SYNOPSIS
       hmmsim [options] hmmfile

DESCRIPTION
       The  hmmsim  program  generates	random sequences, scores them with the
       model(s) in hmmfile, and outputs various sorts  of  histograms,	plots,
       and fitted distributions for the resulting scores.

       hmmsim  is not a mainstream part of the HMMER package. Most users would
       have no reason to use it. It is used to develop and test the  statisti‐
       cal  methods  used  to  determine  P-values and E-values in HMMER3. For
       example, it was used to generate most of the results in a 2008 paper on
       H3's  local  alignment  statistics  (PLoS  Comp	Bio  4:e1000069, 2008;
       http://www.ploscompbiol.org/doi/pcbi.1000069).

       Because it is a research testbed, you should not expect	it  to	be  as
       robust  as  other  programs  in	the  package. For example, options may
       interact in weird ways; we haven't tested nor tried to  anticipate  all
       different possible combinations.

       The main task is to fit a maximum likelihood Gumbel distribution to
       Viterbi scores or a maximum likelihood exponential tail to high-scoring
       Forward scores, and to test that these fitted distributions obey the
       conjecture that lambda ~ log_2 (the natural log of 2, 0.693) for both
       the Viterbi Gumbel and the Forward exponential tail.

       The  output is a table of numbers, one row for each model. Four differ‐
       ent parametric fits to the score data are tested: (1)  maximum  likeli‐
       hood  fits to both location (mu/tau) and slope (lambda) parameters; (2)
       assuming lambda=log_2, maximum likelihood fit to the location parameter
       only;  (3)  same	 but  assuming an edge-corrected lambda, using current
       procedures in H3 [Eddy, 2008]; and (4) using both parameters determined
       by  H3's	 current  procedures.  The  standard  simple,  quick and dirty
       statistic for goodness-of-fit is 'E@10', the calculated E-value of  the
       10th ranked top hit, which we expect to be about 10.
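
       The E@10 check can be sketched in Python for the Gumbel (Viterbi)
       case. This is a minimal illustration, not hmmsim's own code: it
       assumes the E-value is N * P(S > x) with the standard Gumbel survival
       function, and the function names gumbel_survival and e_value_at_rank
       are hypothetical.

```python
import math


def gumbel_survival(x, mu, lam):
    """P(S > x) for a Gumbel with location mu and slope lam (assumed form)."""
    # -expm1(-y) equals 1 - exp(-y) but stays accurate when y is tiny.
    return -math.expm1(-math.exp(-lam * (x - mu)))


def e_value_at_rank(scores, mu, lam, rank=10):
    """E-value of the rank-th highest score, taken as N * P(S > x)."""
    ranked = sorted(scores, reverse=True)
    x = ranked[rank - 1]
    return len(scores) * gumbel_survival(x, mu, lam)
```

       If the fitted mu and lambda describe the scores well, the value
       returned for rank 10 should be close to 10.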

       In detail, the columns of the output are:

       name   Name of the model.

       tailp  Fraction of the highest scores used to fit the distribution. For
	      Viterbi, MSV, and Hybrid scores, this defaults to 1.0 (a	Gumbel
	      distribution  is	fitted	to  all the data). For Forward scores,
	      this defaults to 0.02 (an exponential  tail  is  fitted  to  the
	      highest 2% scores).

       mu/tau Location parameter for the maximum likelihood fit to the data.

       lambda Slope parameter for the maximum likelihood fit to the data.

       E@10   The E-value calculated for the 10th ranked high score ('E@10')
              using the ML mu/tau and lambda. By definition, this is expected
              to be about 10 if E-value estimation is accurate.

       mufix  Location	parameter,  for	 a maximum likelihood fit with a known
	      (fixed) slope parameter lambda of log_2 (0.693).

       E@10fix
	      The E-value calculated for the 10th ranked score using mufix and
	      the expected lambda = log_2 = 0.693.

       mufix2 Location	parameter,  for a maximum likelihood fit with an edge-
	      effect-corrected lambda.

       E@10fix2
	      The E-value calculated for the 10th ranked  score	 using	mufix2
	      and the edge-effect-corrected lambda.

       pmu    Location parameter as determined by H3's estimation procedures.

       plambda
	      Slope parameter as determined by H3's estimation procedures.

       pE@10  The  E-value  calculated	for  the  10th ranked score using pmu,
	      plambda.

       At the end of this table, one more line is printed, starting with # and
       summarizing the overall CPU time used by the simulations.

       Some  of the optional output files are in xmgrace xy format. xmgrace is
       powerful and freely available graph-plotting software.

MISCELLANEOUS OPTIONS
       -h     Help; print a brief reminder  of	command	 line  usage  and  all
	      available options.

       -a     Collect  expected	 Viterbi alignment length statistics from each
	      simulated sequence. This only works  with	 Viterbi  scores  (the
	      default;	see  --vit).  Two additional fields are printed in the
	      output table for each model: the mean length of  Viterbi	align‐
	      ments, and the standard deviation.

       -v     (Verbose). Print the scores too, one score per line.

       -L <n> Set the length of the randomly sampled (nonhomologous) sequences
	      to <n>.  The default is 100.

       -N <n> Set the number  of  randomly  sampled  sequences	to  <n>.   The
	      default is 1000.

       --mpi  Run  in  MPI parallel mode, under mpirun.	 It is parallelized at
	      the level of sending one profile at a  time  to  an  MPI	worker
	      process, so parallelization only helps if you have more than one
	      profile in the <hmmfile>, and you want to have at least as  many
	      profiles	as  MPI worker processes.  (Only available if optional
	      MPI support was enabled at compile-time.)

OPTIONS CONTROLLING OUTPUT
       -o <f> Save the main output table to a file <f> rather than sending  it
	      to stdout.

       --afile <f>
	      When  collecting	Viterbi	 alignment statistics (the -a option),
	      for each sampled sequence, output two fields per line to a  file
	      <f>:  the	 length	 of the optimal alignment, and the Viterbi bit
	      score.  Requires that the -a option is also used.

       --efile <f>
	      Output a rank vs. E-value plot in XMGRACE xy format to file <f>.
	      The  x-axis  is the rank of this sequence, from highest score to
	      lowest; the y-axis is the E-value calculated for this  sequence.
	      E-values	are calculated using H3's default procedures (i.e. the
	      pmu, plambda parameters in the output table). You expect a rough
	      match  between rank and E-value if E-values are accurately esti‐
	      mated.

       --ffile <f>
	      Output a "filter power" file to <f>: for each model, a line with
	      three  fields:  model  name,  number of sequences passing the P-
	      value threshold, and fraction of sequences passing  the  P-value
	      threshold.  See  --pthresh  for  setting	the P-value threshold,
	      which defaults to 0.02 (the default MSV filter threshold in H3).
	      The  P-values  are as determined by H3's default procedures (the
	      pmu,plambda parameters in the output table).  If	all  is	 well,
	      you  expect  to  see filter power equal to the predicted P-value
	      setting of the threshold.

       --pfile <f>
	      Output cumulative survival plots (P(S>x)) to file <f> in XMGRACE
	      xy format. There are three plots: (1) the observed score distri‐
	      bution; (2) the maximum likelihood fitted	 distribution;	(3)  a
              maximum likelihood fit to the location parameter (mu/tau) while
              assuming lambda=log_2.

       --xfile <f>
	      Output  the  bit	scores	as  a binary array of double-precision
	      floats (8 bytes per score) to file <f>.  Programs	 like  Easel's
	      esl-histplot  can	 read  such  binary files. This is useful when
	      generating extremely large sample sizes.
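
       A file written by --xfile can also be read back with a few lines of
       Python. This sketch assumes the doubles are stored in the native byte
       order of the machine that wrote them and that the file has no header;
       read_xfile_scores is a hypothetical helper name, not part of HMMER or
       Easel.

```python
import array


def read_xfile_scores(path):
    """Return all scores from a raw file of 8-byte doubles (native order)."""
    scores = array.array("d")
    with open(path, "rb") as handle:
        data = handle.read()
    # frombytes raises ValueError if the size is not a multiple of 8 bytes.
    scores.frombytes(data)
    return list(scores)
```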

OPTIONS CONTROLLING MODEL CONFIGURATION (MODE)
       H3 only uses multihit local alignment (--fs mode), and this is the mode
       for which we believe the statistical fits hold. Unihit local alignment
       scores (Smith/Waterman; --sw mode) also obey our statistical conjectures.
       Glocal alignment statistics (either multihit or unihit) are still
       neither adequately understood nor adequately fitted.

       --fs   Collect multihit local alignment scores. This is the default.
              'fs' comes from HMMER2's historical terminology for multihit
              local alignment as 'fragment search mode'.

       --sw   Collect unihit local alignment scores. The H3 J state is
              disabled. 'sw' comes from HMMER2's historical terminology for
              unihit local alignment as 'Smith/Waterman search mode'.

       --ls   Collect multihit glocal alignment scores. In glocal
              (global/local) alignment, the entire model must align to a
              subsequence of the target. The H3 local entry/exit transition
              probabilities are disabled. 'ls' comes from HMMER2's historical
              terminology for multihit glocal alignment as 'local search
              mode'.

       --s    Collect unihit glocal alignment scores.  Both the H3 J state and
	      local  entry/exit	 transition  probabilities  are	 disabled. 's'
	      comes from HMMER2's historical  terminology  for	unihit	glocal
	      alignment.

OPTIONS CONTROLLING SCORING ALGORITHM
       --vit  Collect Viterbi maximum likelihood alignment scores. This is the
	      default.

       --fwd  Collect Forward log-odds likelihood scores, summed  over	align‐
	      ment ensemble.

       --hyb  Collect 'Hybrid' scores, as described in papers by Yu and Hwa
              (for instance, Bioinformatics 18:864, 2002). These involve
              calculating a Forward matrix and taking the maximum cell value.
              The number itself is statistically somewhat unmotivated, but the
              distribution is expected to be a well-behaved extreme value
              distribution (Gumbel).

       --msv  Collect MSV (multiple ungapped segment  Viterbi)	scores,	 using
	      H3's main acceleration heuristic.

       --fast For any of the above options, use H3's optimized production
              implementation (using SIMD vectorization). The default is to use
              the slower, non-vectorized implementations. The optimized
              implementations sacrifice a small amount of numerical precision.
              This can introduce confounding noise into statistical
              simulations and fits, so when exact details matter, it is better
              to be able to factor that source of noise out.

OPTIONS CONTROLLING FITTED TAIL MASSES FOR FORWARD
       In  some experiments, it was useful to fit Forward scores to a range of
       different tail masses, rather than just one. These  options  provide  a
       mechanism  for fitting an evenly-spaced range of different tail masses.
       For each different tail mass, a line is generated in the output.

       --tmin <x>
	      Set the lower bound on the tail mass distribution. (The  default
	      is 0.02 for the default single tail mass.)

       --tmax <x>
	      Set  the upper bound on the tail mass distribution. (The default
	      is 0.02 for the default single tail mass.)

       --tpoints <n>
	      Set the number of tail masses to sample,	starting  from	--tmin
	      and  ending  at --tmax.  (The default is 1, for the default 0.02
	      single tail mass.)

       --tlinear
	      Sample a range of tail masses with uniform linear	 spacing.  The
	      default is to use uniform logarithmic spacing.
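
       The range of tail masses selected by --tmin, --tmax, --tpoints, and
       --tlinear can be sketched in Python as below. The exact spacing
       formula used inside hmmsim is an assumption here (endpoints included,
       even steps in either linear or log space), and tail_masses is a
       hypothetical function name.

```python
import math


def tail_masses(tmin, tmax, npoints, linear=False):
    """Return npoints tail masses from tmin to tmax, evenly spaced."""
    if npoints < 1:
        raise ValueError("npoints must be at least 1")
    if npoints == 1:
        return [tmin]  # single tail mass, as in the default 0.02 case
    masses = []
    for i in range(npoints):
        frac = i / (npoints - 1)
        if linear:
            # Uniform linear spacing (as with --tlinear).
            masses.append(tmin + frac * (tmax - tmin))
        else:
            # Uniform logarithmic spacing (the stated default).
            log_mass = math.log(tmin) + frac * (math.log(tmax) - math.log(tmin))
            masses.append(math.exp(log_mass))
    return masses
```

       With three points between 0.01 and 0.1, linear spacing puts the middle
       point at 0.055, whereas logarithmic spacing puts it near 0.0316.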

OPTIONS CONTROLLING H3 PARAMETER ESTIMATION METHODS
       H3 uses three short random sequence simulations to estimate the
       location parameters for the expected score distributions for MSV
       scores, Viterbi scores, and Forward scores. These options allow these
       simulations to be modified.

       --EmL <n>
	      Sets the sequence length in simulation that estimates the	 loca‐
	      tion parameter mu for MSV E-values. Default is 200.

       --EmN <n>
	      Sets  the	 number	 of sequences in simulation that estimates the
	      location parameter mu for MSV E-values. Default is 200.

       --EvL <n>
	      Sets the sequence length in simulation that estimates the	 loca‐
	      tion parameter mu for Viterbi E-values. Default is 200.

       --EvN <n>
	      Sets  the	 number	 of sequences in simulation that estimates the
	      location parameter mu for Viterbi E-values. Default is 200.

       --EfL <n>
	      Sets the sequence length in simulation that estimates the	 loca‐
	      tion parameter tau for Forward E-values. Default is 100.

       --EfN <n>
	      Sets  the	 number	 of sequences in simulation that estimates the
	      location parameter tau for Forward E-values. Default is 200.

       --Eft <x>
              Sets the tail mass fraction to fit in the simulation that
              estimates the location parameter tau for Forward E-values.
              Default is 0.04.

DEBUGGING OPTIONS
       --stall
	      For debugging the MPI master/worker version: pause after	start,
	      to  enable the developer to attach debuggers to the running mas‐
	      ter and worker(s) processes. Send SIGCONT signal to release  the
	      pause.   (Under  gdb:  (gdb)  signal SIGCONT) (Only available if
	      optional MPI support was enabled at compile-time.)

       --seed <n>
	      Set the random number seed to <n>.   The	default	 is  0,	 which
	      makes the random number generator use an arbitrary seed, so that
	      different runs of hmmsim will almost certainly generate  a  dif‐
	      ferent statistical sample.  For debugging, it is useful to force
	      reproducible results, by fixing a random number seed.

EXPERIMENTAL OPTIONS
       These options were used in a small  variety  of	different  exploratory
       experiments.

       --bgflat
	      Set  the	background residue distribution to a uniform distribu‐
	      tion, both for purposes of the null model	 used  in  calculating
	      scores,  and for generating the random sequences. The default is
	      to use a standard amino acid background frequency distribution.

       --bgcomp
	      Set the background residue distribution to the mean  composition
	      of  the  profile. This was used in exploring some of the effects
	      of biased composition.

       --x-no-lengthmodel
	      Turn the H3 target sequence length model off. Set the self-tran‐
	      sitions  for  N,C,J  and the null model to 350/351 instead; this
	      emulates HMMER2.	Not a good idea in general. This was  used  to
	      demonstrate one of the main H2 vs. H3 differences.

       --nu <x>
	      Set  the nu parameter for the MSV algorithm -- the expected num‐
	      ber of  ungapped	local  alignments  per	target	sequence.  The
	      default  is  2.0, corresponding to a E->J transition probability
	      of 0.5. This was used to test whether varying nu has significant
	      effect  on  result  (it  doesn't	seem to, within reason).  This
	      option only works if --msv is selected (it  only	affects	 MSV),
	      and  it  will not work with --fast (because the optimized imple‐
	      mentations are hardwired to assume nu=2.0).

       --pthresh <x>
	      Set the filter P-value threshold to  use	in  generating	filter
	      power  files  with --ffile.  The default is 0.02 (which would be
	      appropriate for testing MSV scores, since this  is  the  default
	      MSV  filter  threshold  in  H3's	acceleration  pipeline.) Other
	      appropriate choices (matching defaults in the acceleration pipe‐
	      line) would be 0.001 for Viterbi, and 1e-5 for Forward.
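
       The filter power written by --ffile, with the threshold set by
       --pthresh, can be sketched in Python for Gumbel-distributed scores
       (for example MSV or Viterbi). This is an illustration only: the
       Gumbel survival form, the use of "<=" for passing the threshold, and
       the function name filter_power are all assumptions, not taken from the
       hmmsim source.

```python
import math


def filter_power(scores, mu, lam, pthresh=0.02):
    """Return (count, fraction) of scores whose P-value is <= pthresh."""
    passing = 0
    for score in scores:
        # Gumbel survival P(S > score), computed stably via expm1.
        pvalue = -math.expm1(-math.exp(-lam * (score - mu)))
        if pvalue <= pthresh:
            passing += 1
    return passing, passing / len(scores)
```

       If P-values are accurate, the returned fraction should be close to the
       threshold itself, for example about 0.02 at the default setting.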

SEE ALSO
       See  hmmer(1)  for  a master man page with a list of all the individual
       man pages for programs in the HMMER package.

       For complete documentation, see the user	 guide	that  came  with  your
       HMMER   distribution   (Userguide.pdf);	or  see	 the  HMMER  web  page
       (@HMMER_URL@).

COPYRIGHT
       @HMMER_COPYRIGHT@
       @HMMER_LICENSE@

       For additional information on copyright and  licensing,	see  the  file
       called  COPYRIGHT  in  your HMMER source distribution, or see the HMMER
       web page (@HMMER_URL@).

AUTHOR
       Eddy/Rivas Laboratory
       Janelia Farm Research Campus
       19700 Helix Drive
       Ashburn VA 20147 USA
       http://eddylab.org

HMMER @HMMER_VERSION@		 @HMMER_DATE@			     hmmsim(1)