This repository contains all data mining algorithms for the Medical Informatics Platform of the Human Brain Project that are executed on Exareme.
The supported algorithm workflow types are:
- `local`: execution of an SQL script on the master node only.
- `local-global`: execution of a local SQL script on worker nodes/containers, followed by a global SQL script on the global node.
- `multiple local-global`: execution of a fixed number of `local-global` workflows in sequence. Each `local_global` is executed according to the order in which it appears in the algorithm's directory structure.
- `iterative`: execution of an iterative algorithm, which is expressed in four phases:
  - `initialization` (actually a `multiple_local_global`)
  - `step` (actually a `multiple_local_global`)
  - `termination condition` (actually a `local`)
  - `finalization` (actually a `multiple_local_global`)
First, the `init` phase is executed, followed by pairs of `step` and `termination_condition` phases. After each termination condition, the iterations module of the execution engine reads the value of the `termination_condition`-related table and decides whether to continue the iterative execution. If so, a `step` phase is resubmitted; otherwise the `finalize` phase of the algorithm is submitted.
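The phase ordering described above can be sketched in plain Python. This is only an illustration of the control flow, not Exareme's implementation: `run_iterative`, its callback parameters and the toy `add_two` step are hypothetical names, and the real engine submits SQL scripts rather than calling functions.

```python
from typing import Callable, List


def run_iterative(
    init: Callable[[], None],
    step: Callable[[], None],
    condition_holds: Callable[[], bool],
    finalize: Callable[[], None],
    iterations_max_number: int,
) -> List[str]:
    """Run the phases in the order described above and return a trace of them."""
    trace = ["init"]
    init()
    iterations_number = 0
    while True:
        step()
        iterations_number += 1
        trace.append(f"step-{iterations_number}")
        # A termination condition phase follows every step phase.
        keep_going = condition_holds() and iterations_number < iterations_max_number
        trace.append("termination_condition")
        if not keep_going:
            break
    finalize()
    trace.append("finalize")
    return trace


# Hypothetical algorithm: every step adds 2 to a running sum,
# and iterations continue while the sum is below 5.
state = {"sum": 0}


def add_two() -> None:
    state["sum"] += 2


trace = run_iterative(lambda: None, add_two, lambda: state["sum"] < 5, lambda: None, 10)
print(trace)
```

With these toy callbacks the trace contains three `step`/`termination_condition` pairs between `init` and `finalize`, because the sum reaches 6 after the third step.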
For all algorithms, a properties file is required (namely `properties.json`). This JSON file contains the algorithm's:
- name
- description (appears in web portal)
- type, specifically one of:
  - `local`
  - `local_global`
  - `multiple_local_global`
  - `iterative`
- parameters
These parameters are required for the algorithm to run and are provided by the user. The algorithm's SQL files require these variables as input. Each parameter has the following properties:
- `name` (String)
- `desc` (String) Shown in the properties of the algorithm.
- `type` Defines the type of the parameter. It can take the following values:
  - `column` (Used for querying the columns of the database.)
  - `formula` (Same as the `column` type, but it is parsed as an R formula. Allowed characters are '+ - * : 0.')
  - `filter` (Used to filter the results of the database.)
  - `dataset` (A parameter of this type is used to choose which dataset to run the algorithm on.)
  - `other` (Use this type for anything else.)
- `columnValuesSQLType` (String) Required if `type` is `column` or `formula`. Specifies the SQL types that the column can have. Allowed types are 'text, integer, real'; several can be combined, separated by commas. An empty string means that there is no constraint.
- `columnValuesIsCategorical` (String) Required if `type` is `column` or `formula`. Specifies whether the column must be categorical. Allowed values are 'true' and 'false'. An empty string means that there is no constraint.
- `columnValuesNumOfEnumerations` (String) Required if `type` is `column` or `formula`. Specifies the number of enumerations that the column can have, for example '1' or '2'. An empty string means that there is no constraint.
- `value` (String) Used as an example value.
- `valueNotBlank` (Boolean) Defines whether the value can be blank.
- `valueMultiple` (Boolean) Defines whether the parameter can have multiple values.
- `valueType` Defines the type of the value. It can take the following values:
  - `string`
  - `integer`
  - `real`
  - `json`
Example: See here for the properties file of LINEAR_REGRESSION algorithm.
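As a rough illustration of the rules above, the following Python sketch checks a single parameter entry. The helper `check_parameter` and the sample entry are hypothetical: the values are made up for this example and are not copied from the LINEAR_REGRESSION properties file, and Exareme's own validation may differ.

```python
from typing import Any, Dict, List

PARAMETER_TYPES = {"column", "formula", "filter", "dataset", "other"}
VALUE_TYPES = {"string", "integer", "real", "json"}
COLUMN_FIELDS = (
    "columnValuesSQLType",
    "columnValuesIsCategorical",
    "columnValuesNumOfEnumerations",
)


def check_parameter(param: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one parameter entry (empty if none)."""
    problems = []
    if param.get("type") not in PARAMETER_TYPES:
        problems.append("unknown type: %r" % param.get("type"))
    # The columnValues* fields are required for column and formula parameters.
    if param.get("type") in ("column", "formula"):
        for field in COLUMN_FIELDS:
            if field not in param:
                problems.append("missing " + field)
    if "valueType" in param and param["valueType"] not in VALUE_TYPES:
        problems.append("unknown valueType: %r" % param["valueType"])
    return problems


# Hypothetical parameter entry, invented for illustration only.
example = {
    "name": "x",
    "desc": "Independent variables.",
    "type": "column",
    "columnValuesSQLType": "integer,real",
    "columnValuesIsCategorical": "false",
    "columnValuesNumOfEnumerations": "",
    "value": "age",
    "valueNotBlank": True,
    "valueMultiple": True,
    "valueType": "string",
}
print(check_parameter(example))
```

An entry that satisfies the rules yields an empty list; a `column` parameter without its `columnValues*` fields yields one problem per missing field.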
For each algorithm workflow, refer to the corresponding link for a hands-on example:
- `local` => LIST_VARIABLES algorithm
- `local_global` => VARIABLE_PROFILE algorithm
- `multiple_local_global` => LINEAR_REGRESSION algorithm
- `iterative` => SAMPLE_ITERATIVE algorithm
The input of algorithm workflows can be retrieved in the first `local.template.sql` by using the `input_local_tbl` variable. (It must also be defined in `requirevars`.)
defaultDB
To share context (and thus data) among SQL template files, a database named `defaultDB` is provided.
For example, it can be used to create a table and insert values into it in a `local.template.sql`; the table can then be read from the `global.template.sql`.
To be able to use `defaultDB` in a `template.sql`, the script file is required to begin with:
- `requirevars 'defaultDB'` (more variables can be _required_ using this command, see [here](WP_LINEAR_REGRESSION/1/global.template.sql))
- `attach Database '%{defaultDB}' as defaultDB`
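The idea of sharing a table through `defaultDB` can be tried outside Exareme with Python's standard `sqlite3` module. The sketch below is a stand-in under stated assumptions: `run_template` is a hypothetical helper that replaces `%{defaultDB}` with a file path and skips the `requirevars` line (which plain SQLite does not understand), and the table `partial_tbl` is invented for the example.

```python
import os
import sqlite3
import tempfile

LOCAL_SCRIPT = """
requirevars 'defaultDB';
attach database '%{defaultDB}' as defaultDB;
create table if not exists defaultDB.partial_tbl (val integer);
insert into defaultDB.partial_tbl values (3);
select 'ok';
"""

GLOBAL_SCRIPT = """
requirevars 'defaultDB';
attach database '%{defaultDB}' as defaultDB;
select sum(val) from defaultDB.partial_tbl;
"""


def run_template(script: str, default_db_path: str) -> list:
    """Run a template script and return the rows of its last statement."""
    sql = script.replace("%{defaultDB}", default_db_path)
    conn = sqlite3.connect(":memory:")
    try:
        rows = []
        # Naive split on ';' is enough for this sketch (no ';' inside literals).
        for statement in sql.split(";"):
            statement = statement.strip()
            if not statement or statement.startswith("requirevars"):
                continue  # requirevars is handled by the engine, not by SQLite
            rows = conn.execute(statement).fetchall()
        conn.commit()
        return rows
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as workdir:
    db_path = os.path.join(workdir, "defaultDB.db")
    local_rows = run_template(LOCAL_SCRIPT, db_path)
    global_rows = run_template(GLOBAL_SCRIPT, db_path)

print(local_rows, global_rows)
```

The local script writes a row into `defaultDB.partial_tbl` and the global script, running on a separate connection, reads it back through the same attached database file.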
Output of previous phase
An additional way of sharing context in a `local_global` or `multiple_local_global` algorithm workflow is:
- _[only for `multiple_local_global`]_ for `local.template.sql` files, the output from the previous `global.template.sql` execution can be read by using the `input_local_tbl` variable.
- for `global.template.sql` files, the output from the previous `local.template.sql` file can be read by using the `input_global_tbl` variable.
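The data flow between local and global scripts can be modelled in a few lines of Python. This is a sketch only: `run_multiple_local_global` and the variance example are hypothetical, the second argument of each local function stands in for data a worker would keep locally (for instance in `defaultDB`), and concatenating all workers' outputs into `input_global_tbl` is an assumption about how the engine combines them.

```python
from typing import Callable, List, Sequence, Tuple

Rows = List[tuple]
LocalFn = Callable[[Rows, Rows], Rows]  # (input_local_tbl, worker_data) -> output
GlobalFn = Callable[[Rows], Rows]       # (input_global_tbl) -> output


def run_multiple_local_global(
    worker_data: Sequence[Rows], rounds: Sequence[Tuple[LocalFn, GlobalFn]]
) -> Rows:
    """Run (local, global) pairs in order and return the last global output."""
    global_output: Rows = []
    for round_number, (local_fn, global_fn) in enumerate(rounds, start=1):
        input_global_tbl: Rows = []
        for data in worker_data:
            # Round 1 reads the workflow input; later rounds read the
            # output of the previous global script.
            input_local_tbl = data if round_number == 1 else global_output
            input_global_tbl.extend(local_fn(input_local_tbl, data))
        global_output = global_fn(input_global_tbl)
    return global_output


def local_sum_count(input_local_tbl: Rows, data: Rows) -> Rows:
    return [(sum(v for (v,) in input_local_tbl), len(input_local_tbl))]


def local_squared_deviations(input_local_tbl: Rows, data: Rows) -> Rows:
    mean = input_local_tbl[0][0]  # produced by the previous global script
    return [(sum((v - mean) ** 2 for (v,) in data), len(data))]


def global_ratio(input_global_tbl: Rows) -> Rows:
    total = sum(s for s, _ in input_global_tbl)
    count = sum(n for _, n in input_global_tbl)
    return [(total / count,)]


workers = [[(1.0,), (2.0,), (3.0,)], [(4.0,), (5.0,)]]
variance = run_multiple_local_global(
    workers,
    [(local_sum_count, global_ratio), (local_squared_deviations, global_ratio)],
)
print(variance)
```

The first round computes the overall mean from per-worker sums and counts; the second round feeds that mean back to the workers, which is the role `input_local_tbl` plays in a `multiple_local_global` workflow.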
N.B.: `defaultDB` is shared over the network from local nodes to the global node and vice versa.
The runtime engine requires every `*.template.sql` file to produce some output.
If a script file has nothing meaningful to return, simply write `select "ok";` at the end.
The final results (i.e. the algorithm's output) must be formatted using the `jdict` UDF of madIS, which converts the results to JSON format.
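To get a feel for this formatting step without a madIS installation, the sketch below registers a small Python function under the name `jdict` in plain SQLite. It is a stand-in, not the madIS implementation: it assumes that `jdict` takes alternating key and value arguments and returns a JSON object, and the exact text madIS produces (spacing, key order) may differ. The keys `variable` and `mean` are invented for the example.

```python
import json
import sqlite3


def jdict(*args):
    """Stand-in: pair up alternating key/value arguments into a JSON object."""
    if len(args) % 2:
        raise ValueError("jdict expects an even number of arguments")
    return json.dumps(dict(zip(args[0::2], args[1::2])))


conn = sqlite3.connect(":memory:")
# narg=-1 lets the SQL function accept any number of arguments.
conn.create_function("jdict", -1, jdict)
row = conn.execute("select jdict('variable', 'age', 'mean', 42.5)").fetchone()
conn.close()

result = json.loads(row[0])
print(result)
```

The final `select` of an algorithm would then return one such JSON value as its output.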
For sharing context among iterative execution phases, the `previous_phase_output_tbl` variable can be used. This follows the same convention as the one used for sharing context between `local` and `global` scripts. In other words, the output of the previous iterative execution phase is "forwarded" as input to the next one (e.g. the output of `step-1` is forwarded as input to `step-2`, and the output of `step-N` is forwarded as input to `finalize`).
For all iterative algorithms, the following properties must be defined (in the parameters JSON array of the properties file):
- `iterations_max_number`
  The iterative algorithm will run at most `iterations_max_number` times.
- `iterations_condition_query_provided`
  Defines whether a termination query is provided (under the `termination_condition` directory, in the corresponding file). Otherwise, `iterations_max_number` alone is used as the termination criterion.

**Note 1**: When a termination condition query has been provided, the iterations module in Exareme takes into account its output along with the `iterations_number < iterations_max_number` condition.

**Note 2**: When a termination condition query has **not** been provided, the `termination_condition.template.sql` file must still exist and contain only a `select "ok";` query.
The algorithm developer need not worry about iterations control logic, such as setting up an iterations counter or writing a query to ensure that `iterations_number < iterations_max_number`. This is all handled by the iterations module of Exareme.
The only requirement imposed by the iterations module is the one described below.
If an iterative algorithm requires a termination condition that is not solely based on the `iterations_number < iterations_max_number` criterion, the algorithm developer needs to write a query that abides by the following rules:
- updating the `iterationsDB.iterations_condition_check_result_tbl` table, and specifically
- setting the `iterations_condition_check_result` column's value to the output of the termination condition query.
The template which must be followed is this:
update iterationsDB.iterations_condition_check_result_tbl set iterations_condition_check_result = (select termination_condition_query...);
N.B.: `iterationsDB` does not need to be defined in the `requirevars` section. Again, this is handled by the runtime engine's iterations module.
An example of a termination condition query is presented below:
update iterationsDB.iterations_condition_check_result_tbl set iterations_condition_check_result = (select sum_tbl.sum_val < 5 from defaultDB.sum_tbl);
In this example, the iterative algorithm calculates a sum (saved in the `defaultDB.sum_tbl` table) and the termination condition reads:
if the sum is lower than 5 AND `iterations_number < iterations_max_number`, then continue iterations.
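The example query can be exercised with Python's `sqlite3` module to see how its result combines with the iteration cap. This is a sketch under assumptions: the table layouts (a single row with one column in each table) and the helper `should_continue` are invented here, and in Exareme both `iterationsDB` and the final decision are managed by the iterations module.

```python
import sqlite3

TERMINATION_CONDITION = """
update iterationsDB.iterations_condition_check_result_tbl
set iterations_condition_check_result = (
    select sum_tbl.sum_val < 5 from defaultDB.sum_tbl
)
""".strip()


def should_continue(sum_val, iterations_number, iterations_max_number):
    """Run the termination query and combine it with the iteration cap."""
    conn = sqlite3.connect(":memory:")
    try:
        # In-memory stand-ins for the two databases (assumed layouts).
        conn.execute("attach database ':memory:' as iterationsDB")
        conn.execute("attach database ':memory:' as defaultDB")
        conn.execute(
            "create table iterationsDB.iterations_condition_check_result_tbl "
            "(iterations_condition_check_result integer)"
        )
        conn.execute(
            "insert into iterationsDB.iterations_condition_check_result_tbl values (0)"
        )
        conn.execute("create table defaultDB.sum_tbl (sum_val real)")
        conn.execute("insert into defaultDB.sum_tbl values (?)", (sum_val,))
        conn.execute(TERMINATION_CONDITION)
        (result,) = conn.execute(
            "select iterations_condition_check_result "
            "from iterationsDB.iterations_condition_check_result_tbl"
        ).fetchone()
        return bool(result) and iterations_number < iterations_max_number
    finally:
        conn.close()


print(should_continue(4, 1, 10), should_continue(6, 1, 10), should_continue(4, 10, 10))
```

Iterations continue only in the first call: in the second the sum is no longer below 5, and in the third the iteration cap has been reached.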