2017_03_16_ _Using_Apache_Spark
Given the needs (lots of hand waving and wishful thinking at this stage), what about using Apache Spark as an abstraction layer for computational models?
The following are informal notes of an informal brainstorm meeting.
The Information Model can be specified/expressed with a variety of abstraction levels:
- the conceptual data model,
- the logical data model,
- the physical data model.
When designing a data model one should clearly separate those models and make sure not to contaminate a higher level with implementation details or with a weakly related concern.
Even at the conceptual data model level, there can be many (conceptual) models for a given object, each abstract model carrying its own design bias. That bias might depend on the semantics the model designer wishes to stress (or is aware of), or on domain simplifications the designer might wish to make, e.g. because of the intended usage. In order to avoid such bias, an abstract model design should start from a mathematical description (the formalism considered as the most abstract). For example:
- when having to represent geographic coordinates one has the choice between polar or cartesian representations, and one should pay crucial attention to how these notions are derived in the process of refining the data model (see the sketch after this list).
- assume you abstract a person as having a name and a birth date. If one needs to run searches on such data, one might then choose between a binary search tree indexed by name or one indexed by birth date...
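As a toy illustration of the coordinate example above (a minimal sketch; the class and function names are invented for this note), the same conceptual "point" can be realized by two different physical encodings, and the conceptual model should not commit to either:

```python
import math
from dataclasses import dataclass

# Hypothetical illustration: the *conceptual* notion "a point on the plane"
# can be realized by two different *physical* representations.

@dataclass
class CartesianPoint:
    x: float
    y: float

@dataclass
class PolarPoint:
    r: float
    theta: float  # radians

def to_polar(p: CartesianPoint) -> PolarPoint:
    # Same conceptual point, different physical encoding.
    return PolarPoint(math.hypot(p.x, p.y), math.atan2(p.y, p.x))

def to_cartesian(p: PolarPoint) -> CartesianPoint:
    return CartesianPoint(p.r * math.cos(p.theta), p.r * math.sin(p.theta))

if __name__ == "__main__":
    c = CartesianPoint(1.0, 1.0)
    print(to_polar(c))                 # PolarPoint(r=1.414..., theta=0.785...)
    print(to_cartesian(to_polar(c)))   # back to approximately (1.0, 1.0)
```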
The general warning is thus: beware not to pollute your information model...
When specifying a data model, the data community prefers to use Infoset (reference?) rather than UML because:
- UML remains very syntactic
- UML is very object oriented
- UML does not take into account the usage of the data structure. For example, when writing an AST (Abstract Syntax Tree) parser, a data structure described with UML will tend to scatter the code across many inherited operators and types of nodes (for which the Visitor design pattern is not a satisfying fix; see the sketch after this list). Conversely, DOM methods are well adapted for writing document parsers, but they are of little help to understand the data model of DOM.
- UML is not well suited to describe behavior (e.g. algorithms).
- UML only allows for binary relationships. A ternary relationship (e.g. the triplet (student, teacher, teaching period)) will thus be represented as an object (which is at odds with the relational model design choice, where there are no objects). And one cannot (should not?) add behavior to a UML ternary relationship.
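A minimal, invented sketch of the scattering effect mentioned in the AST item above: each new operation (evaluation, pretty-printing, ...) has to be spread over every node type, whether as methods on the node classes or as visit_* methods of a Visitor:

```python
# Hypothetical mini-AST: every new operation must touch all node types,
# either as methods on each class or as visit_* methods of a Visitor.

class Num:
    def __init__(self, value): self.value = value
    def accept(self, visitor): return visitor.visit_num(self)

class Add:
    def __init__(self, left, right): self.left, self.right = left, right
    def accept(self, visitor): return visitor.visit_add(self)

class Evaluator:
    def visit_num(self, node): return node.value
    def visit_add(self, node): return node.left.accept(self) + node.right.accept(self)

class Printer:
    def visit_num(self, node): return str(node.value)
    def visit_add(self, node): return f"({node.left.accept(self)} + {node.right.accept(self)})"

tree = Add(Num(1), Add(Num(2), Num(3)))
print(tree.accept(Evaluator()))  # 6
print(tree.accept(Printer()))    # (1 + (2 + 3))
```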
The need:
- the processing steps are written in heterogeneous languages (Python, Java, C++...)
- I want to run processing steps in batch
- compose them into workflows (hand-written if need be)
- trigger some of them from the client interface...
Answer:
- Talend does workflows.
- Spark has a workflow engine, but it is not graphical.
One can consider (it is part of the needs to think of it that way) Hadoop/Spark as a means for abstracting data. In particular a user might wish not to worry about the physical and logical levels (where the data is and how it is stored). Although this might be convenient at first, at some point of the production process such "implementation details" will present themselves as solid hindrances/obstacles to scaling up.
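A minimal PySpark sketch of that location abstraction (the paths below are invented placeholders, not from the meeting): the user-facing code is the same whether the file sits on the local disk or on HDFS:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("location-abstraction-demo").getOrCreate()

# Whether the data lives on the local disk of the driver...
local_df = spark.read.csv("file:///tmp/sample.csv", header=True)

# ...or is distributed over an HDFS cluster, the code reading it is identical.
hdfs_df = spark.read.csv("hdfs:///data/sample.csv", header=True)

print(local_df.count(), hdfs_df.count())
spark.stop()
```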
For example, if the algorithm of a processing step is sequential and cannot be parallelized (distributed/gridified), then having distributed data won't help. Big data means not only distributing the data but distributing the algorithm as well (under the constraint of keeping computations local). This is what map-reduce constrains you to do (a reduce happens on one machine, with mostly one CPU).
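A hedged word-count sketch of this constraint (the input path is an invented placeholder): the map and the per-key reduce run where the partitions live, and only small per-key aggregates cross the network:

```python
from pyspark import SparkContext

sc = SparkContext(appName="wordcount-locality-demo")

lines = sc.textFile("hdfs:///data/corpus.txt")    # partitions stay on their nodes
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))  # combined locally, then shuffled per key

# Only the (already reduced) per-word totals are brought back to one machine.
for word, n in counts.take(10):
    print(word, n)

sc.stop()
```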
Another example where abstraction layers won't suffice to mask a usage constraint is when the data lacks regularity (i.e. is too "different" to fit into tabular frames). JSON provides some freedom to represent "pieces of trees" but still imposes some regularity constraints on the tree (a fixed, or at least bounded, number of children for a given node, and a fixed depth). One can for example refer to CityOne.building[i].wall[j]. But one cannot hope to describe graphs in a manner that would be "mapable" to Spark...
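An invented toy document mirroring the CityOne.building[i].wall[j] path above, just to make the regularity constraint concrete (every building carries a list of walls, at a fixed depth):

```python
import json

# Toy document: the tree is "regular" (fixed depth, homogeneous children),
# which is what makes a path expression like building[i].wall[j] meaningful.
city_one = json.loads("""
{
  "building": [
    {"wall": [{"height": 3.0}, {"height": 2.5}]},
    {"wall": [{"height": 4.2}]}
  ]
}
""")

i, j = 0, 1
print(city_one["building"][i]["wall"][j]["height"])  # 2.5
```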
One last example: Hadoop allows re-unifying databases that are all tabular. If, among the set of databases that we would like Hadoop to abstract, there is e.g. some RDF data, then the abstraction will not be possible.
Although premature optimization is the root of all evil, one must still think early enough about the distribution of the data together with the algorithm. At the very least, this efficiency/scale-up issue should be articulated with the abstract model (in order to produce the logical model), because e.g. when collecting the results of the first stage of a distributed computation one should not kill the network...
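A hedged sketch of that "do not kill the network" point (paths are invented placeholders): aggregate on the cluster and only bring small summaries back to the driver, or write large results back to HDFS rather than collecting them:

```python
from pyspark import SparkContext

sc = SparkContext(appName="collect-carefully-demo")
records = sc.textFile("hdfs:///data/measurements.txt").map(float)

# Bad idea on large data: records.collect() ships every value over the network.
# Better: reduce on the executors, then bring back a single number...
total = records.sum()
print("total:", total)

# ...or keep large intermediate results distributed, e.g. persisted to HDFS.
records.map(lambda x: x * 2).saveAsTextFile("hdfs:///data/measurements_doubled")

sc.stop()
```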
- For a desktop used as a Spark development context (while aiming at a cluster deployment): take a "sandbox" (a pre-packaged virtual machine). Requirements:
- 16 GB of RAM (de facto, although a basic sandbox is 8 GB).
- VirtualBox (Oracle).
- There is a Docker version of a HortonWorks sandbox (on OSX, Docker requires a Linux virtual machine; note that the Docker install will pull HyperKit on OSX).
- For the real deployment, consider a "Hadoop distribution", e.g. Cloudera or HortonWorks. Such a distribution packages HDFS (the Hadoop Distributed File System), Hadoop MapReduce, Spark, notebooks, and a galaxy of other components...
- Such a distribution provides:
- managing a cluster (with incident recovery in case of a computation crash or data loss, i.e. with replication)
- you need an ETL to load your data
- distributing the data transparently
- handling network interconnection difficulties
- Spark's job is to run jobs in a distributed way. But Spark also comes with facilitators.
- LIRIS can provide hosting in the OpenStack context (refer to Romain, said ECO) with a limited amount of disk space (it should not impact teaching courses). Also check with the INSA side of LIRIS.
- there is an entry ticket for a tutorial in the sandbox (about 30 hours of courses to get a rough grasp)
- these platforms are not quite mature (if you need version 3.2.5 of a particular package, things can get very complicated).
- there is an entry ticket for doing things that make sense. For example, in the Kafka tutorial a component comes pre-packaged with a somewhat exotic port number, and the next component will not find it in the chosen distribution...
- Talend (its execution engine is based on Java)
- the base version is free (open source)
- the Enterprise version runs on servers (not open source)
- Le Grand Lyon (allexandro) uses Kafka with extractors in its Hadoop ecosystem
- Attendees: ECO, EBO
- Date: 2017 / 03