Benchmarks¶
Initial Profiling¶
These profiling tests were run with the pipeline codebase corresponding to commit 373c2ce (some commits after the first release).
Running on 12 GB of data with 464 MB peak memory usage takes 4 mins:

- performance: ~80% `final_operations`, ~10% `get_src_skyregion_merged_df`. `final_operations` calls other functions; the largest of these is 50% in `make_upload_sources`, which spends about 20% of its time in `utils` on `<method 'execute' of 'psycopg2.extensions.cursor' objects>`. The `get_src_skyregion_merged_df` time sink is in `threading`: 15% of the time is on `wait` (`final_operations` spends some time on `threading` as well, hence 15% > 10%).
- memory: 40% is `pyarrow` `parquet` `write_table`; the rest is mostly fragmented, with some more `pyarrow` and some `pandas`.
Running on 3 MB of data with 176 MB peak memory usage takes 1.5 s:

- performance: ~30% goes to the pipeline (about 9% of this is `pickle`), ~11% goes to `read`, and the rest, I think, goes to Django.
- memory: 30% is spent on `django`, 20% on `astropy/coordinates/matrix_utilities`, 10% on importing other modules; the rest is fragmented and quite small.
Note that I didn't include the generation of the `images/*/measurements.parquet` or other files in these profiles.
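The per-function percentages above come from profiling runs; the exact tooling isn't recorded here, but similar per-function timings can be collected with the standard library's `cProfile`. A minimal sketch, where `final_operations` is only a placeholder workload and not the pipeline's actual entry point:

```python
import cProfile
import io
import pstats


def final_operations():
    # Placeholder for the pipeline stage being profiled.
    return sum(i * i for i in range(100_000))


# Profile the call, then sort the report by cumulative time,
# which is how hotspots like a dominant sub-function show up.
profiler = cProfile.Profile()
profiler.enable()
final_operations()
profiler.disable()

buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
stats.print_stats(10)  # top 10 entries by cumulative time
report = buffer.getvalue()
print(report)
```

The same approach works on the real pipeline by wrapping the run call; memory figures need a separate tool (e.g. a memory profiler), since `cProfile` only measures time.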
Database Update Operations¶
Delete (`Model.objects.all().delete()`) and re-upload (`bulk_upload`), in seconds:
| columns \ rows | 10³ | 10⁴ | 10⁵ |
| --- | --- | --- | --- |
| 4 | 0.15 | 1.24 | 12.95 |
| 8 | 0.26 | 1.64 | 19.11 |
| 12 | 0.31 | 2.18 | 21.49 |
Per row, uploading 10³ rows at a time is slower than 10⁴ or 10⁵, possibly due to per-call overhead. It is best to avoid uploading only 10³ rows per `bulk_create` call.
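Given the per-call overhead at 10³ rows, uploads are better done in larger batches. A minimal sketch of batching rows before handing them to an upload callable; `upload_in_batches` and `chunked` are helpers invented for illustration, where `upload` would stand in for something like Django's `Model.objects.bulk_create`:

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(rows: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive batches of at most `size` rows."""
    it = iter(rows)
    while batch := list(islice(it, size)):
        yield batch


def upload_in_batches(
    rows: Iterable[T],
    upload: Callable[[List[T]], object],
    batch_size: int = 10_000,
) -> int:
    # batch_size=10_000 reflects the 10^4 sweet spot measured above.
    total = 0
    for batch in chunked(rows, batch_size):
        upload(batch)
        total += len(batch)
    return total
```

Usage would look like `upload_in_batches(measurements, Model.objects.bulk_create)`; Django's `bulk_create` also accepts a `batch_size` argument that achieves a similar effect internally.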
Django `bulk_update` (in seconds):
| columns \ rows | 10³ | 10⁴ | 10⁵ |
| --- | --- | --- | --- |
| 4 | 3.39 | n/a | n/a |
| 8 | 4.38 | n/a | n/a |
| 12 | 5.50 | n/a | n/a |
I don't think there's any point testing 10⁴ or 10⁵ rows: it is obviously the worst-performing function, and I've already had to force-quit the terminal twice because keyboard interrupt didn't work.
SQL join (as `SQL_update` in `vast_pipeline.pipeline.loading`), in seconds:
| columns \ rows | 10³ | 10⁴ | 10⁵ |
| --- | --- | --- | --- |
| 4 | 0.016 | 0.11 | 3.08 |
| 8 | 0.019 | 0.32 | 4.31 |
| 12 | 0.027 | 0.38 | 5.39 |
Per row, 10⁵ rows is slower than 10⁴ and 10³; I'm not sure why. I recommend updating 10⁴ rows at a time.
These timings vary a bit between runs. Sometimes the SQL join takes as long as 1 second to complete 10³ rows; I'm not sure what's causing this.
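The update-via-join idea behind `SQL_update` (stage new values in a second table, then update the target in a single statement rather than one `UPDATE` per row) can be illustrated with a self-contained `sqlite3` example. Table and column names here are invented for illustration; SQLite uses a correlated subquery where PostgreSQL, which the pipeline targets, would use `UPDATE ... FROM`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source (id INTEGER PRIMARY KEY, flux REAL);
    CREATE TABLE staging (id INTEGER PRIMARY KEY, flux REAL);
    INSERT INTO source VALUES (1, 0.0), (2, 0.0);
    INSERT INTO staging VALUES (1, 1.5), (2, 2.5);
""")

# Update all matching rows in one statement by joining against the
# staging table, instead of issuing one UPDATE per row.
conn.execute("""
    UPDATE source
    SET flux = (SELECT flux FROM staging WHERE staging.id = source.id)
    WHERE id IN (SELECT id FROM staging)
""")
conn.commit()

rows = conn.execute("SELECT id, flux FROM source ORDER BY id").fetchall()
print(rows)  # → [(1, 1.5), (2, 2.5)]
```

The single-statement form is what makes this approach so much faster than `bulk_update` in the tables above: the join happens inside the database rather than in Python.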