VAST Pipeline Exploration
Interacting with results from the VAST Pipeline
This notebook gives an example of how to use vast-tools in a notebook environment to interact with output from the VAST Pipeline (https://github.com/askap-vast/vast-pipeline).
Documentation for the pipeline is currently being written, but there is a short Wiki page on the methods used here: https://github.com/askap-vast/vast-project/wiki/VAST-Pipeline-Methods-Description. If anything is unclear in the results, don't hesitate to contact the Pipeline development team.
vaex
For large pipeline runs vast-tools makes use of vaex to load the measurements and measurement_pairs dataframes in an out-of-core context. vaex behaves differently to pandas but offers many advantages when handling large dataframes; see the documentation here: https://vaex.readthedocs.io/en/latest/index.html. If you use the built-in methods of the Pipeline class then vaex is taken care of behind the scenes, so users shouldn't have to worry about it. However, be mindful when performing manual analysis using vaex dataframes; warnings will appear when vaex has been used. Also note that vaex is a relatively young project, so you may run into a bug or two when using it yourself.
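If you ever need a familiar pandas object for a small, filtered subset, vaex can convert one directly; a minimal sketch, assuming a run has already been loaded as my_run (see below) and that its measurements dataframe is vaex-backed (to_pandas_df is part of the vaex API):
import vaex
# convert only after filtering so the resulting pandas dataframe stays small
if isinstance(my_run.measurements, vaex.dataframe.DataFrame):
    bright = my_run.measurements[my_run.measurements.flux_peak > 10.]
    bright_pandas = bright.to_pandas_df()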
Imports
Below are the imports required for this example. The main import required from vast-tools is the Pipeline class. Astropy objects are also imported, as well as bokeh, as some plots use the bokeh library. Note that we call the bokeh function output_notebook() so that bokeh outputs are shown in the notebook environment. matplotlib.pyplot is also imported in case we want to adjust any plots produced.
from vasttools.pipeline import Pipeline
from bokeh.io import output_notebook
from bokeh.plotting import show
from astropy.coordinates import Angle, SkyCoord
from astropy import units as u
import matplotlib.pyplot as plt
%matplotlib inline
output_notebook()
Loading a Pipeline run
The first step is to initialise the Pipeline. If the environment variable PIPELINE_WORKING_DIR is defined then you do not need to provide the pipeline run directory. Otherwise, use the project_dir argument in the Pipeline initialisation to set the directory. Ask your admin if you are not sure where the pipeline directory is located.
pipe = Pipeline(project_dir='/data/vast-pipeline/vast-pipeline/pipeline-runs')
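Alternatively, if PIPELINE_WORKING_DIR is set in the environment the directory argument can be dropped; a minimal sketch, setting the variable from within Python purely for illustration:
import os
# point vast-tools at the pipeline run directory via the environment
os.environ['PIPELINE_WORKING_DIR'] = '/data/vast-pipeline/vast-pipeline/pipeline-runs'
pipe = Pipeline()  # no project_dir needed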
If you are unsure of the run name you want to load, you can use the list_piperuns() method to list all available runs.
pipe.list_piperuns()
For this example we will load the VAST_2118-06A run. To do this we call the load_run() method and save the run to my_run:
my_run = pipe.load_run('VAST_2118-06A')
It's also possible to load multiple runs as one object using:
pipe.load_runs(['VAST_2118-06A', 'GalCtr_p2'])
If you have loaded two runs separately you can also merge one with the other by doing:
my_run = my_run.combine_with_run(my_other_run)
For the purposes of this example we will continue to use the single VAST_2118-06A run.
The pipeline object has loaded all the outputs and information of the pipeline run. You have access to the following dataframes:
my_run.images contains the information on the images used in the pipeline run.
my_run.images.head()
my_run.sources contains the information on the unique sources found in the pipeline run.
my_run.sources.head()
my_run.measurements contains the information on all of the measurements in the pipeline run, both from selavy and forced extractions (if used in the pipeline run). Measurements are the datapoints in time of a source, hence the source column tells you which source the measurement belongs to.
my_run.measurements.head()
There is also easy access to a SkyCoord object of the sources:
my_run.sources_skycoord
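This makes positional selections straightforward; a minimal sketch that keeps only sources within 1 degree of an illustrative position (the coordinate values here are arbitrary):
target = SkyCoord('21h18m00s', '-06d00m00s', frame='icrs')
# boolean mask of sources within 1 deg of the target position
nearby_mask = my_run.sources_skycoord.separation(target) < 1 * u.deg
nearby_sources = my_run.sources[nearby_mask]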
You also have access to:
- my_run.relations, which contains the relation information of the pipeline run, i.e. which sources are related to one another.
- my_run.skyregions, which contains the sky region information of the pipeline run.

You should rarely need to access these, but they are there in case they are required.
Looking at a specific source
We can load any source into a vast-tools source instance (like those used in other notebook examples) by using the my_run.get_source(id) function. The id is the same id as listed in the sources table. Below I load the source with the id = 4187950.
my_source = my_run.get_source(4187950)
All the normal functions are available:
lc = my_source.plot_lightcurve()
epoch1 = my_source.show_png_cutout(1)
all_cutouts = my_source.show_all_png_cutouts(columns=5, figsize=(20,10))
ned_results = my_source.ned_search()
ned_results
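Assuming plot_lightcurve returns a matplotlib Figure (the matplotlib outputs elsewhere in this notebook behave this way), it can be saved to disk as usual; a minimal sketch with an arbitrary filename:
# save the lightcurve figure produced above
lc.savefig('source_4187950_lightcurve.png', dpi=150, bbox_inches='tight')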
Filtering the Measurements and Re-Generating the Sources
WARNING! This method can require significant memory and time to run when used on large runs. On shared instances, such as JupyterHub, it might require speaking to the administrator to raise the memory limit.
There may be situations where you wish to remove a large number of measurements from the analysis. For example, maybe you have identified a bad image, or you want to remove all measurements that have a sibling. The pipeline has functions to help with this.
For this example let's pretend that the EPOCH02 image (id number 2093) is bad and we want to remove all of the measurements from this image:
my_run.images.head(3)
The first step is to create a filtered measurements dataframe to feed into the function.
The method differs slightly depending on whether the measurements dataframe is loaded using pandas or vaex (a warning will have appeared when you opened the run if vaex is being used).
If pandas, we can remove all the measurements associated with that image by doing:
my_run_filtered_measurements = my_run.measurements.loc[my_run.measurements.image_id != 2093].copy()
If vaex:
# my_run_filtered_measurements = my_run.measurements[my_run.measurements.image_id != 2093]
# my_run_filtered_measurements.extract()
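Either way, a quick sanity check confirms the image has been removed (shown here for the pandas case):
# no measurements from image 2093 should remain
assert 2093 not in my_run_filtered_measurements.image_id.unique()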
We can then use this filtered measurements dataframe to regenerate a sources dataframe with all the metrics updated. In summary it will:
- Recalculate all aggregate metrics for each source.
- Drop all sources that no longer have a selavy measurement.

However, two metrics are not re-calculated:
- new and new_source_high_sigma will remain unchanged, as the process to recalculate these two is too intensive to do here.

To get the new sources we use the recalc_sources_df function like so:
my_run_filtered_sources = my_run.recalc_sources_df(my_run_filtered_measurements)
my_run_filtered_sources.head()
# note the columns will be in a different order
The new my_run_filtered_sources, along with the my_run_filtered_measurements, can then be passed into the get_source and variability functions below to use this filtered data rather than the default pipeline outputs. For example, if we again fetch source number 4187950:
my_source_filtered = my_run.get_source(4187950, user_sources=my_run_filtered_sources, user_measurements=my_run_filtered_measurements)
lc = my_source_filtered.plot_lightcurve()
Note that the datapoint from image number 2093 has now been removed.
Changing the Measurement Flux Values
WARNING! This method can require significant memory and time to run when used on large runs. On shared instances, such as JupyterHub, it might require speaking to the administrator to raise the memory limit.
There may be situations where the flux values of the measurements should be changed, for example, to account for flux corrections. In this case the measurement pairs dataframe, containing the two-epoch metrics, should also be recalculated and supplied to recalc_sources_df so that the new significant vs and m columns can be calculated. If it is not supplied to recalc_sources_df, it is assumed that the fluxes are unchanged.
See the Two-epoch analysis section below for more details on the two-epoch metrics.
The measurement pairs dataframe can be recalculated using the recalc_measurement_pairs_df method, using the same new measurements dataframe as used earlier.
# scale the flux values of the my_run_filtered_measurements by 10x
my_run_filtered_measurements['flux_int'] = my_run_filtered_measurements['flux_int'] * 10.
my_run_filtered_measurements['flux_peak'] = my_run_filtered_measurements['flux_peak'] * 10.
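# note: depending on the correction applied, the flux uncertainty columns
# (e.g. flux_int_err, flux_peak_err; check your dataframe) may also need scaling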
# Recalculate the measurement pairs dataframe
my_run_new_meas_pairs_df = my_run.recalc_measurement_pairs_df(my_run_filtered_measurements)
my_run_new_meas_pairs_df.head()
This new measurement pairs dataframe can then be supplied to recalc_sources_df when recalculating the sources dataframe.
my_run_filtered_sources = my_run.recalc_sources_df(
    my_run_filtered_measurements, measurement_pairs_df=my_run_new_meas_pairs_df)
my_run_filtered_sources.head()
Performing transient and variable analysis
There are a couple of built-in analyses that you can perform on the data.
- Two-epoch analysis (e.g. Mooley et al., 2016; https://ui.adsabs.harvard.edu/abs/2016ApJ...818..105M/abstract).
- eta-v sigma analysis (e.g. Rowlinson et al., 2019; https://ui.adsabs.harvard.edu/abs/2019A%26C....27..111R/abstract; though no machine learning is used here, only sigma cuts).
For both searches we are going to perform some filtering on the sources to minimise false positives. This can be done by defining a string that will be passed to the df.query() method of the sources dataframe:
my_query_string = (
    "n_measurements >= 3 "
    "& n_selavy >= 2 "
    "& n_neighbour_dist > 1./60. "
    "& 0.8 < avg_compactness < 1.4 "
    "& n_relations == 0 "
    "& max_snr >= 7.0"
)
Here we are asking for sources that:
- have 3 or more measurements,
- are detected by selavy at least 2 times,
- are more than 1 arcmin away from their nearest neighbour,
- have an average compactness value (f_int / f_peak) between 0.8 and 1.4,
- have no relations,
- and have a maximum SNR greater than or equal to 7 (this SNR metric relates to selavy detections).

Note: This is only an example of the cuts you may wish to make. Consider your own science goals when making the selections.
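As a quick check of how strict these cuts are, the same string can be applied directly with the pandas query method; a minimal sketch, assuming the sources dataframe is pandas-backed:
n_total = len(my_run.sources)
n_filtered = len(my_run.sources.query(my_query_string))
print(f"{n_filtered} of {n_total} sources pass the cuts")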
With this query string ready, two-epoch analysis will be the first example.
Two-epoch analysis
The pipeline pre-calculates the two-epoch metrics for us, so the first step is to load the two-epoch metrics dataframe into the job. This is a dataframe that contains all possible pairs of source measurements along with the calculated metrics. This is done by:
my_run.load_two_epoch_metrics()
The measurement_pairs_df is now available in the run.
Note: By default the Vs parameters are not stored as absolute values, but they should be converted to absolute values for any manual analysis you do. All the steps below automatically convert to absolute values.
my_run.measurement_pairs_df.head()
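If you work with this dataframe manually, remember to take the absolute values yourself; a minimal sketch, assuming the peak-flux pair metric lives in a column named vs_peak (check the actual column names in your dataframe):
# absolute Vs values for manual analysis
vs_abs_peak = my_run.measurement_pairs_df['vs_peak'].abs()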
It also loads a pairs_df which contains information about all the unique epoch pairs possible in the pipeline run:
my_run.pairs_df.head()
With these dataframes now loaded into the run, it is possible to perform the analysis.
This is done using the run_two_epoch_analysis method, which returns a dataframe containing candidate sources and a dataframe of the pairs that meet the metric thresholds set when running, i.e. if a source has a pair that meets the Vs and |m| thresholds, it is a candidate. Here we are using the values Vs = 4.3 and |m| = 0.26 (refer to the Mooley et al. paper).
We also pass in the query string that was set up previously, which allows the pool of sources to be pre-filtered.
two_epoch_candidates_df, two_epoch_pairs_df = my_run.run_two_epoch_analysis(4.3, 0.26, query=my_query_string)
The returned pairs dataframe shows the measurement pairs (and their time deltas) that meet the thresholds, while the returned sources are variable candidates according to the thresholds set:
two_epoch_candidates_df.head()
We can see if any sources are marked as new and sort by m_abs_significant_max_peak (largest first):
two_epoch_candidates_df[two_epoch_candidates_df.new == True].sort_values(by='m_abs_significant_max_peak', ascending=False)
Let's check the first one:
candidate_source = my_run.get_source(4193783)
lc = candidate_source.plot_lightcurve()
candidate_source.simbad_search()
It's PSR J2129-04!
Plotting two epoch metrics
You are also able to plot the metrics for a specific epoch pair, i.e. two images containing the same sources. To pick an epoch pair you can use the pairs_df, which also contains a total_pairs column that I'll sort by here to pick the pair with the most measurement pairs.
my_run.pairs_df.sort_values(by='total_pairs', ascending=False).head()
We'll use the id of the top pair (in this case 21) to create the plot, and we'll also pass in the query string. By default it returns a bokeh plot, which we display using the show function imported at the beginning.
two_epoch_plot = my_run.plot_two_epoch_pairs(21, query=my_query_string)
show(two_epoch_plot)
There is also an option to plot using matplotlib.
two_epoch_plot_plt = my_run.plot_two_epoch_pairs(21, query=my_query_string, plot_type='matplotlib')
There is also a second style of two-epoch metrics plot available if you wish. This can be accessed by passing plot_style='b':
two_epoch_plot = my_run.plot_two_epoch_pairs(21, query=my_query_string, plot_style='b')
show(two_epoch_plot)
two_epoch_plot_plt = my_run.plot_two_epoch_pairs(21, query=my_query_string, plot_type='matplotlib', plot_style='b')
Remember that you can change aspects of the returned matplotlib figure by fetching the axes:
two_epoch_plot_plt.axes[0].set_ylim([-1, 1])
two_epoch_plot_plt.axes[0].set_xlim([0, 100])
two_epoch_plot_plt
The plotting of two-epoch metrics only supports one epoch pair at a time, but feel free to create your own plot if you require.
The eta-v analysis
Now we'll also run the eta-v analysis. This is called using the my_run.run_eta_v_analysis() method. We will use the same query string as before (copied again here for clarity, but you wouldn't need to re-define it):
my_query_string = (
    "n_measurements >= 3 "
    "& n_selavy >= 2 "
    "& n_neighbour_dist > 1./60. "
    "& 0.8 < avg_compactness < 1.4 "
    "& n_relations == 0 "
    "& max_snr >= 7.0"
)
Now we run the analysis using the run_eta_v_analysis function, where we use 1 sigma as the threshold value to select candidates. It returns the eta and v threshold values (see the Rowlinson et al. paper), a dataframe of candidate sources and a plot for visualisation.
eta_thresh, v_thresh, eta_v_candidates, plot = my_run.run_eta_v_analysis(1.0, 1.0, query=my_query_string)
print(eta_thresh, v_thresh)
View the plot from the analysis. The datapoints are colour coded to represent the number of selavy (i.e. actual) detections.
# use the bokeh `show` function imported at the beginning.
show(plot)
For the purposes of the example let's run the analysis again, this time fetching a matplotlib plot and the diagnostic plot.
eta_thresh, v_thresh, eta_v_candidates, plot, diag_plot = my_run.run_eta_v_analysis(1.0, 1.0, query=my_query_string, plot_type='matplotlib', diagnostic=True)
plot
By looking at the plot above, there is one source that is clearly beyond the 1 sigma threshold in both metrics. If you hover over the point in the bokeh plot you'll see it is source 4193783: PSR J2129-04.
The other candidates can be explored using the candidates dataframe returned by the analysis:
eta_v_candidates
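Individual candidates can then be fetched and inspected in turn; a minimal sketch, assuming the candidates dataframe is indexed by source id (check eta_v_candidates.index if unsure):
# plot lightcurves for the first few eta-v candidates
for source_id in eta_v_candidates.index[:3]:
    candidate = my_run.get_source(source_id)
    lc = candidate.plot_lightcurve()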
Creating a MOC of the Pipeline Job
WARNING! Beware of creating a MOC for a large pipeline run as it can take a very long time to complete. If you are using the VAST Pilot Survey data then load the MOCs directly from VASTMOCS.
There is also the ability to create a MOC file of the pipeline job. This can be useful to see the area and to also use the MOC to get pre-filtered surveys to crossmatch with (see the catalogue crossmatching example notebook).
The MOC is created by calling the following (it may take a few seconds, or minutes for large runs):
my_run_moc = my_run.create_moc()
We can then plot the MOC as demonstrated in the other notebook examples (and in the mocpy documentation).
from mocpy import World2ScreenMPL
from astropy.visualization.wcsaxes.frame import EllipticalFrame
fig = plt.figure(figsize=(12, 6))
with World2ScreenMPL(
    fig,
    fov=320 * u.deg,
    center=SkyCoord(0, 0, unit='deg', frame='icrs'),
    coordsys="icrs",
    rotation=Angle(0, u.degree),
) as wcs:
    ax = fig.add_subplot(111, projection=wcs, frame_class=EllipticalFrame)
    ax.set_title("VAST Pipeline Run VAST_2118-06A Area")
    ax.grid(color="black", linestyle="dotted")
    my_run_moc.fill(ax=ax, wcs=wcs, alpha=0.9, fill=True, linewidth=0, color="#00bb00")
    my_run_moc.border(ax=ax, wcs=wcs, alpha=0.5, color="black")
And don't forget to save the MOC so it doesn't have to be generated again!
my_run_moc.write('my_run_moc.fits', overwrite=True)
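In a later session the MOC can be read straight back in rather than regenerated; a minimal sketch (MOC.from_fits matches the older mocpy API used above; newer mocpy versions provide MOC.load instead):
from mocpy import MOC
# reload the previously saved MOC
my_run_moc = MOC.from_fits('my_run_moc.fits')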