0003 - Readers / Writers Multithreading
Context
Not all GeoTrellis readers and writers are implemented as MapReduce jobs
(as the Accumulo RDDReader and Hadoop RDDReaders are); some use direct
socket reads instead. The socket approach allows the parallelism level to
be tuned to the system configuration (CPU, RAM, file system). For
RDDReaders, that means the number of threads per RDD partition; for
CollectionReaders, the number of threads used to read the whole
collection.
The numbers below are empirical rather than theoretically justified. The test cluster ran on a local network to exclude possible network issues. Reads were tested with ~900 objects per read request, using Landsat tiles (test project).
Test cluster
- Apache Spark 1.6.2
- Apache Hadoop 2.7.2
- Apache Accumulo 1.7.1
- Cassandra 3.7
Decision
Function call performance was benchmarked against the amount of RAM and the number of CPU cores available.
File Backend
For FileCollectionReader, the optimal (or at least reasonable in most
cases) pool size equals the number of CPU cores. File system restrictions
may also apply, depending on the specific file system settings.
- collection.reader: number of CPU cores available to the virtual machine
- rdd.reader / writer: number of CPU cores available to the virtual machine
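The "pool size equal to the number of cores" recommendation can be sketched as follows. This is an illustrative example, not GeoTrellis's actual reader code; `newCorePool` and the dummy tasks are hypothetical stand-ins:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PoolSizing {
    // Hypothetical helper (not GeoTrellis code): size a fixed pool to the
    // CPU cores visible to the JVM, per the recommendation above.
    static ExecutorService newCorePool() {
        int cores = Runtime.getRuntime().availableProcessors();
        return Executors.newFixedThreadPool(cores);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = newCorePool();
        List<Future<Integer>> futures = new ArrayList<>();
        for (int i = 0; i < 16; i++) {
            final int id = i;
            // Stand-in for a per-tile read task.
            futures.add(pool.submit(() -> id * 2));
        }
        int sum = 0;
        for (Future<Integer> f : futures) {
            sum += f.get();
        }
        pool.shutdown();
        System.out.println(sum); // 2 * (0 + 1 + ... + 15) = 240
    }
}
```

Regardless of how many tasks are queued, at most `cores` of them run concurrently, which is the property the setting above controls.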
Hadoop Backend
With Hadoop, up to 16 threads can be used without a really significant
increase in memory usage, as HadoopCollectionReader caches up to
16 MapFile.Readers by default (by design). However, using more than
16 threads does not improve performance significantly.
- collection.reader: number of CPU cores available to the virtual machine
S3 Backend
For S3, the thread count is limited only by backpressure; the
recommended value is an empirical compromise between maximum performance
and avoiding a large number of useless failed requests.
- collection.reader: number of CPU cores available to the virtual machine, <= 8
- rdd.reader / writer: number of CPU cores available to the virtual machine, <= 8
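The "cores, but at most 8" rule for S3 can be sketched as a simple clamp. Again this is an illustrative assumption, not GeoTrellis's actual sizing logic:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class S3PoolSizing {
    // Illustrative sketch: cap the S3 pool at 8 threads, since beyond
    // that extra threads tend to produce throttled or failed requests
    // rather than faster reads (empirical limit from the text above).
    static int s3PoolSize() {
        return Math.min(Runtime.getRuntime().availableProcessors(), 8);
    }

    public static void main(String[] args) {
        int size = s3PoolSize();
        ExecutorService pool = Executors.newFixedThreadPool(size);
        pool.shutdown();
        System.out.println(size >= 1 && size <= 8);
    }
}
```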
Accumulo Backend
The numbers in the tables below are averages over warmed-up calls. Similar results hold for all supported backends; the configuration property that really matters for performance is the number of available CPU cores.
4 CPU cores result (m3.xlarge):
| Threads | Reads time (ms) | Comment |
|---|---|---|
| 4 | ~15,541 | |
| 8 | ~18,541 | ~500 MB more RAM than previous |
| 32 | ~20,120 | ~500 MB more RAM than previous |
8 CPU cores result (m3.2xlarge):
| Threads | Reads time (ms) | Comment |
|---|---|---|
| 4 | ~12,532 | |
| 8 | ~9,541 | ~500 MB more RAM than previous |
| 32 | ~10,610 | ~500 MB more RAM than previous |
- collection.reader: number of CPU cores available to the virtual machine
Cassandra Backend
4 CPU cores result (m3.xlarge):
| Threads | Reads time (ms) | Comment |
|---|---|---|
| 4 | ~7,622 | |
| 8 | ~9,511 | Higher load on the driver node; ~500 MB more RAM than previous |
| 32 | ~13,261 | Higher load on the driver node; ~500 MB more RAM than previous |
8 CPU cores result (m3.2xlarge):
| Threads | Reads time (ms) | Comment |
|---|---|---|
| 4 | ~8,100 | |
| 8 | ~4,541 | Higher load on the driver node; ~500 MB more RAM than previous |
| 32 | ~7,610 | Higher load on the driver node; ~500 MB more RAM than previous |
- collection.reader: number of CPU cores available to the virtual machine
- rdd.reader / writer: number of CPU cores available to the virtual machine
Conclusion
Performance results for all backends are quite similar to the Accumulo
and Cassandra numbers above; to avoid duplicating data, those numbers
were omitted. Thread pool size depends mostly on the available CPU
cores, and less on RAM. To avoid losing performance, do not use more
threads than the CPU cores available to the Java virtual machine;
exceeding that can lead to significant performance loss.
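In practice these pool sizes would be set through the application's configuration. The HOCON sketch below is illustrative only: the key names are assumptions modeled on the `collection.reader` and `rdd.reader / writer` settings listed above, not verified GeoTrellis configuration keys:

```hocon
# Illustrative only: key names are assumptions, not verified GeoTrellis keys.
geotrellis.backend.threads {
  collection.read = 8  # <= CPU cores available to the JVM
  rdd.read = 8         # <= CPU cores available to the JVM
  rdd.write = 8
}
```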