
# Benchmarks

## Initial Profiling

These profiling tests were run with the pipeline codebase corresponding to commit 373c2ce (a few commits after the first release).

Running on 12 GB of data takes 4 minutes, with a peak memory usage of 464 MB.

- Performance: ~80% of the time is spent in `final_operations` and ~10% in `get_src_skyregion_merged_df`. Of the functions `final_operations` calls, the largest share is `make_upload_sources` at 50%, which spends about 20% of its time in `psycopg2`'s cursor `execute` method. The `get_src_skyregion_merged_df` time sink is threading: 15% of the total time goes to `wait` (`final_operations` spends some time on threading as well, hence 15% > 10%).
- Memory: 40% goes to pyarrow's parquet `write_table`; the rest is mostly fragmented, with some more pyarrow and some pandas.

Running on 3 MB of data takes 1.5 s, with a peak memory usage of 176 MB.

- Performance: ~30% goes to the pipeline itself (about 9% of which is pickle), ~11% goes to reading, and the rest, I think, goes to Django.
- Memory: 30% is spent on Django, 20% on `astropy/coordinates/matrix_utilities`, 10% on importing other modules; the rest is fragmented into quite small pieces.

Note that I didn't include the generation of the `images/*/measurements.parquet` or other output files in these profiles.
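
For reference, profiles like the above can be gathered with the standard-library tools. This is a minimal sketch, not the exact methodology used here: `run_pipeline` is a placeholder for however the run is actually invoked (e.g. the pipeline's Django management command), and the output filename is arbitrary.

```python
import cProfile
import pstats
import tracemalloc

def run_pipeline():
    # Placeholder: invoke the actual pipeline run here, e.g. via
    # django.core.management.call_command.
    ...

tracemalloc.start()
cProfile.run("run_pipeline()", "pipeline.prof")
snapshot = tracemalloc.take_snapshot()
tracemalloc.stop()

# Top CPU consumers by cumulative time.
pstats.Stats("pipeline.prof").sort_stats("cumulative").print_stats(20)

# Top memory allocation sites.
for stat in snapshot.statistics("lineno")[:10]:
    print(stat)
```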

## Database Update Operations

### Delete (`Model.objects.all().delete()`) and re-upload (`bulk_upload`), timings in seconds

| columns \ rows | 10³   | 10⁴   | 10⁵   |
| -------------- | ----- | ----- | ----- |
| 4              | 0.15  | 1.24  | 12.95 |
| 8              | 0.26  | 1.64  | 19.11 |
| 12             | 0.31  | 2.18  | 21.49 |

Normalised per row, 10³ rows is slower than 10⁴ and 10⁵ rows, possibly due to per-call overhead. Best to avoid uploading only 10³ rows per `bulk_create` call.
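
For context, a harness along these lines can produce numbers like the above. This is a sketch only: `Source` is a stand-in for whichever pipeline model was benchmarked, and `rows` is assumed to be a list of field-value dicts.

```python
import time

from django.db import transaction
from vast_pipeline.models import Source  # stand-in model for illustration

def time_delete_and_reupload(rows, batch_size=10_000):
    """Time Model.objects.all().delete() followed by a bulk re-insert."""
    objs = [Source(**row) for row in rows]
    start = time.perf_counter()
    Source.objects.all().delete()
    with transaction.atomic():
        Source.objects.bulk_create(objs, batch_size=batch_size)
    return time.perf_counter() - start
```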

### Django `bulk_update`

| columns \ rows | 10³  | 10⁴ | 10⁵ |
| -------------- | ---- | --- | --- |
| 4              | 3.39 | n/a | n/a |
| 8              | 4.38 | n/a | n/a |
| 12             | 5.50 | n/a | n/a |

I don't think there's any point testing 10⁴ or 10⁵ rows: `bulk_update` is clearly the worst-performing option, and I've already had to force-quit the terminal twice because keyboard interrupt didn't work.
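
A comparable timing harness for `bulk_update`, again with `Source` as a hypothetical stand-in model:

```python
import time

from vast_pipeline.models import Source  # stand-in model for illustration

def time_bulk_update(objs, fields, batch_size=1_000):
    """Time bulk_update for saved objects whose `fields` were mutated in memory."""
    start = time.perf_counter()
    Source.objects.bulk_update(objs, fields, batch_size=batch_size)
    return time.perf_counter() - start
```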

### SQL join (as `SQL_update` in `vast_pipeline.pipeline.loading`)

| columns \ rows | 10³   | 10⁴  | 10⁵  |
| -------------- | ----- | ---- | ---- |
| 4              | 0.016 | 0.11 | 3.08 |
| 8              | 0.019 | 0.32 | 4.31 |
| 12             | 0.027 | 0.38 | 5.39 |

Normalised per row, 10⁵ rows is slower than 10⁴ and 10³; I'm not sure why. Recommend updating 10⁴ rows at a time.

These timings vary somewhat from run to run: sometimes the SQL join takes as long as 1 second to complete 10³ rows, and I'm not sure what's causing this.
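
For illustration, the join-update technique works roughly as below: COPY the new values into a temporary table, then issue a single `UPDATE ... FROM` joined on `id`. This is a sketch of the general pattern, not `SQL_update` itself; the table and column names in the usage comment are hypothetical.

```python
import io

import psycopg2

def sql_join_update(conn, df, table, columns):
    """Update `columns` of `table` from a DataFrame that has an `id` column.

    `conn` is an open psycopg2 connection.
    """
    cols = ["id"] + list(columns)
    buf = io.StringIO()
    # NaN handling (na_rep) is omitted for brevity.
    df[cols].to_csv(buf, index=False, header=False)
    buf.seek(0)
    with conn.cursor() as cursor:
        # Temp table that inherits the target columns' types.
        cursor.execute(
            f"CREATE TEMP TABLE tmp_update AS "
            f"SELECT {', '.join(cols)} FROM {table} WITH NO DATA"
        )
        cursor.copy_from(buf, "tmp_update", sep=",", columns=cols)
        set_clause = ", ".join(f"{c} = tmp_update.{c}" for c in columns)
        cursor.execute(
            f"UPDATE {table} SET {set_clause} "
            f"FROM tmp_update WHERE {table}.id = tmp_update.id"
        )
        cursor.execute("DROP TABLE tmp_update")
    conn.commit()

# Usage sketch (hypothetical names):
# conn = psycopg2.connect("dbname=vastdb")
# sql_join_update(conn, updated_df, "vast_pipeline_source", ["flux", "snr"])
```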