Creating Data Files

velociraptor includes functionality for creating data files. To use it, fill a velociraptor.observations.objects.ObservationalData instance and then call its velociraptor.observations.objects.ObservationalData.write() method to save it to file. Along the way you will need to call several associate_* methods to register the data and its metadata. An example converting a TSV file to a velociraptor-compatible file is shown below.

General Rules

To ensure consistency between data files, follow these rules:

  • Never use ‘log’ data on either axis; e.g. always include mass, not log(mass), even if the original paper used logarithmic quantities.
  • Always include all requested metadata (it does not take long, and is very useful)!
  • Always remove all h-factors. Every file should be h-free.
  • Include a comment describing important aspects of the data, e.g. for a stellar mass function include the assumed IMF.
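
The first and third rules come down to a small amount of arithmetic. As a minimal sketch, assuming the source tabulates log10 of a mass in h^-1 Msun and log10 of a number density in (h^-1 Mpc)^-3 (the same scalings applied in the example below), the helper here undoes the logarithm and the h-scaling. The function name is hypothetical and is not part of velociraptor; other unit conventions need different powers of h.

```python
def to_linear_h_free(log_mass, log_phi, h):
    """Return linear, h-free masses and number densities.

    Assumes log_mass is log10(M / (h^-1 Msun)) and log_phi is
    log10(Phi / (h^-1 Mpc)^-3); other conventions need other powers of h.
    """
    # M [h^-1 Msun] -> M [Msun]: divide by h.
    mass = [10.0 ** m / h for m in log_mass]
    # Phi [(h^-1 Mpc)^-3] = Phi [h^3 Mpc^-3] -> Phi [Mpc^-3]: multiply by h^3.
    phi = [10.0 ** p * h ** 3 for p in log_phi]
    return mass, phi


# Illustrative inputs only; use the h of the cosmology you record in the file.
mass, phi = to_linear_h_free([10.0, 11.0], [-2.0, -3.0], h=0.7)
```

The returned lists can then be wrapped in unyt arrays with the appropriate units before being associated with the data object.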

Example

The input TSV file:

#Crain et al 2009 (GIMIC)
#Assuming Chabrier IMF and cosmology of
# Omega_l = 0.75, Omega_0 = 0.045, h = 0.73
#
#GSMF weighted over all density regions
#log(M) [Msun] Phi [(h^-1 Mpc)^-3]
7.513259     -0.266989
7.765294     -0.297700
8.028524     -0.493567
8.280272     -0.704415
8.531949     -0.960297
8.783625     -1.216179
9.035397     -1.412015
9.298460     -1.712962
9.527172     -1.998804
9.767462     -2.209621
10.019115    -2.480514
10.271149    -2.511225
10.523422    -2.391822
10.775695    -2.272418
11.004742    -2.348101
11.267685    -2.724105
11.508166    -2.814830
11.771062    -3.220858
12.011042    -3.626822
12.251666    -3.627479
12.490953    -4.468774

Conversion file:

from velociraptor.observations.objects import ObservationalData
from astropy.cosmology import WMAP7 as cosmology
import unyt
import numpy as np
import os

input_filename = "Crain2009_GSMF.txt"
delimiter = "\t"

output_filename = "crain_2009.hdf5"
output_directory = "gsmf"

if not os.path.exists(output_directory):
   os.mkdir(output_directory)

processed = ObservationalData()
raw = np.loadtxt(input_filename, delimiter=delimiter)

comment = f"Assuming Chabrier IMF. h-corrected for SWIFT using cosmology: {cosmology.name}."
citation = "Crain et al. 2009 (GIMIC)"
bibcode = "2009MNRAS.399.1773C"
name = "GSMF from GIMIC"
plot_as = "line"
redshift = 0.0
redshift_lower = 0.0
redshift_upper = 0.2
h = cosmology.h

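# Column 0 is log10(mass); undo the logarithm and remove the h-factor.
log_M = raw.T[0]
M = 10 ** (log_M) * unyt.Solar_Mass / h
# Column 1 is log10(Phi) in (h^-1 Mpc)^-3; convert to linear, h-free Mpc^-3.
Phi = (10**raw.T[1] * (h ** 3)) * unyt.Mpc ** (-3)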

processed.associate_x(M, scatter=None, comoving=True, description="Galaxy Stellar Mass")
processed.associate_y(Phi, scatter=None, comoving=True, description="Phi (GSMF)")
processed.associate_citation(citation, bibcode)
processed.associate_name(name)
processed.associate_comment(comment)
processed.associate_redshift(redshift, redshift_lower, redshift_upper)
processed.associate_plot_as(plot_as)
processed.associate_cosmology(cosmology)

output_path = f"{output_directory}/{output_filename}"

if os.path.exists(output_path):
   os.remove(output_path)

processed.write(filename=output_path)

Multi-Redshift Data

Data from a single paper that has been collected at multiple redshifts (or a single simulation, with multiple snapshots) should be stored in a multi-redshift file. This will allow the most appropriate redshift from the data to be plotted automatically when using the pipeline.
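
One way to picture "most appropriate" is as an overlap test between each dataset's redshift range and the redshift range being plotted. The sketch below is an illustration only: it assumes a simple interval-overlap rule, all names in it are hypothetical, and it is not claimed to be the pipeline's actual selection logic.

```python
def overlaps(lower, upper, bracket_lower, bracket_upper):
    """True if [lower, upper] and [bracket_lower, bracket_upper] intersect."""
    return lower <= bracket_upper and bracket_lower <= upper


# Hypothetical redshift ranges for datasets taken from one paper.
dataset_ranges = [(0.0, 0.5), (0.5, 1.5), (1.5, 3.0)]

# Keep only the datasets relevant to a plot covering 0.9 <= z <= 1.1.
selected = [r for r in dataset_ranges if overlaps(r[0], r[1], 0.9, 1.1)]
```

Storing the lower and upper redshift alongside each dataset is what makes a test of this kind possible.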

The velociraptor.observations.objects.MultiRedshiftObservationalData class acts as a container for multiple instances of velociraptor.observations.objects.ObservationalData, each covering a single redshift range. The citation, name, comment, and cosmology are stored at the top level. The example above can be extended to handle multiple redshifts as follows:

from velociraptor.observations.objects import (
   ObservationalData,
   MultiRedshiftObservationalData,
)
from astropy.cosmology import WMAP7 as cosmology
import unyt
import numpy as np
import os

input_filenames = ["Crain2009_GSMF_z0.txt", "Crain2009_GSMF_z1.txt"]
input_redshifts = [[0.0, 0.5], [0.5, 1.5]]
delimiter = "\t"

output_filename = "Crain_2009.hdf5"
output_directory = "gsmf"
comment = f"Assuming Chabrier IMF. h-corrected for SWIFT using cosmology: {cosmology.name}."
citation = "Crain et al. 2009 (GIMIC)"
bibcode = "2009MNRAS.399.1773C"
name = "GSMF from GIMIC"

if not os.path.exists(output_directory):
   os.mkdir(output_directory)

multi_z = MultiRedshiftObservationalData()
multi_z.associate_citation(citation, bibcode)
multi_z.associate_name(name)
multi_z.associate_comment(comment)
multi_z.associate_cosmology(cosmology)
multi_z.associate_maximum_number_of_returns(1)

for filename, redshifts in zip(input_filenames, input_redshifts):
   processed = ObservationalData()
   raw = np.loadtxt(filename, delimiter=delimiter)

   plot_as = "line"
   redshift = 0.5 * sum(redshifts)
   redshift_lower, redshift_upper = redshifts
   h = cosmology.h

   log_M = raw.T[0]
   M = 10 ** (log_M) * unyt.Solar_Mass / h
   Phi = (10**raw.T[1] * (h ** 3)) * unyt.Mpc ** (-3)

   processed.associate_x(M, scatter=None, comoving=True, description="Galaxy Stellar Mass")
   processed.associate_y(Phi, scatter=None, comoving=True, description="Phi (GSMF)")
   processed.associate_redshift(redshift, redshift_lower, redshift_upper)
   processed.associate_plot_as(plot_as)

   multi_z.associate_dataset(processed)

output_path = f"{output_directory}/{output_filename}"

if os.path.exists(output_path):
   os.remove(output_path)

multi_z.write(filename=output_path)

In this example, note that the following items are stored at the top level:

  • Citation
  • Name
  • Comment
  • Cosmology

as the object represents a single piece of academic work. Below this, at the individual dataset level, we have:

  • Actual data (e.g. x, y, associated with a single redshift)
  • Redshift (with bracketing)
  • Plotting commands (some redshifts may contain very few objects and are better plotted as points, whereas others may need to be binned and plotted as a line).

Finally, there is the new associate_maximum_number_of_returns method. It sets the maximum number of datasets returned by the load_datasets function. This is useful when you have a large number of individual datasets that each cover a very small range in redshift, and you only wish to plot one of them at a time on a given figure.
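
The effect of such a cap can be sketched as follows. This is a hypothetical illustration, assuming that candidates are ranked by how close their central redshift is to the requested one before the list is truncated; the ranking rule and all names here are assumptions, not a description of load_datasets itself.

```python
def cap_returns(central_redshifts, target_redshift, maximum_number_of_returns):
    """Return at most `maximum_number_of_returns` redshifts, nearest first."""
    ranked = sorted(central_redshifts, key=lambda z: abs(z - target_redshift))
    return ranked[:maximum_number_of_returns]


# Many narrow redshift slices, but only one should appear on the figure.
closest = cap_returns(
    [0.1, 0.2, 0.3, 0.4], target_redshift=0.28, maximum_number_of_returns=1
)
```

With a cap of 1, as in the multi-redshift example above, a figure would show a single dataset even when several narrow redshift slices match.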