ERCIM NEWS
www.ercim.eu

Special theme: Big Data
Also in this issue:
Keynote: E-Infrastructures for Big Data, by Kostas Glinos
The views expressed are those of the author and do not necessarily represent the official view of the European Commission on the subject.
Big Software Data Analysis
by Mircea Lungu, Oscar Nierstrasz and Niko Schwarz

Scalable Management of Compressed Semantic Big Data
by Javier D. Fernández, Miguel A. Martínez-Prieto and Mario Arias

SCAPE: Big Data Meets Digital Preservation
by Ross King, Rainer Schmidt, Christoph Becker and Sven Schlarb

Bionic Packaging: A Promising Paradigm for Future Computing
by Patrick Ruch, Thomas Brunschwiler, Werner Escher, Stephan Paredes and Bruno Michel

NanoICT: A New Challenge for ICT
by Mario D'Acunto, Antonio Benassi, Ovidio Salvetti

Information Extraction from Presentation-Oriented Documents
by Massimo Ruffolo and Ermelinda Oro

Announcements
Books
In Brief
Industrial Systems Institute/RC ‘Athena’ Joins ERCIM

by Dimitrios Serpanos
ISI is the main research institute in Greece that focuses on cutting-edge R&D that applies to the industrial and enterprise environment. Its main goal is sustainable innovation in information, communication and knowledge technologies for improving the competitiveness of Greek industry while applying environmentally friendly practices.

ISI is involved in a range of research areas such as industrial information and communication systems; enterprise systems and enterprise integration; embedded systems in several application areas, including transport, healthcare, and nomadic environments; enterprise and industrial process modelling; safety and security; reliability and cyber-security.

Our vision for ISI is to sustain a leading role in R&D of innovative industrial and enterprise technologies. ISI has made significant contributions at the regional, national, and European levels and has established strong relationships and collaboration with R&D and industrial stakeholders in Greece, Europe, and the USA.

Research and development in ICT and applied mathematics is a strategic element for ISI's sustainable growth. We expect that ERCIM will enable ISI to increase its contribution through stronger networking and cooperation with important institutions, members of ERCIM, and dissemination of its results and achievements through joint activities. Three members of its staff will represent ISI in ERCIM: Professor Dimitrios Serpanos (Director of ISI) as the representative in the ERCIM general member assembly, with Dr Nikolaos Zervos (Researcher) as his substitute, and Dr Artemios Voyiatzis (Researcher) as ISI's representative on the ERCIM News editorial board.

Link:
http://www.isi.gr/en/

Please contact:
Dimitrios Serpanos
Industrial Systems Institute/RC ‘Athena’, Greece
E-mail: [email protected], Tel: +30 261 091 0301
Research at ISI
ISI performs high-impact research in the following areas:
• information and communication systems for the industry and enterprise environment
  - high-performance communication systems and architectures
  - real-time communications
  - low-power hardware architectures for processing and communication
• embedded systems
  - design and architecture
  - interoperability
  - design tools
  - real-time cooperation and coordination
  - safety and reliability
• cyber-security
• electronic systems
  - RFID
• enterprise integration
  - next-generation control systems
  - software agents
  - ontologies for enterprise processes, resources, and products
  - service-oriented architecture
  - collaboration platforms, virtual and extended enterprise, enterprise clustering
  - collaborative manufacturing
• modelling and automation of industrial systems and processes
  - control of industrial robots
  - control of mobile robots and autonomous vehicles
  - control of distributed industrial systems
  - control of complex electromechanical systems
  - system modeling and fault prediction, detection, and isolation
  - intelligent and adaptive systems
• sustainable development
  - ICT for energy-efficient systems
  - ICT for increasing sustainable energy awareness
  - sustainable (multimodal) transport

The R&D activities of ISI draw upon many areas of mathematics, computer science and engineering including:
• hardware (memory structures, I/O and data communications, logic design, integrated circuit, performance and reliability)
• computer systems organization (processor architecture, computer communication networks, performance of systems, computer system implementation)
• software (software engineering, operating systems)
• mathematics of computing (discrete mathematics, probability and statistics)
• computing methodologies (artificial intelligence, image processing and computer vision, pattern recognition, simulation and modelling)
• computer applications (physical sciences and engineering, life and medical science, computer-aided engineering)
• computers and society (public policy issues, electronic commerce, security)

ISI has produced successful products and services such as:
• an innovative wireless network system for real-time industrial communications
• a system for sea protection from ship wreckage oil spill
• a system for earthquake monitoring and disaster rescue assistance
• an integrated building management system
• a smart camera system for surveillance.
International Workshop on Computational Intelligence for Multimedia Understanding

by Emanuele Salerno
Figure: Ovidio Salvetti and Emanuele Salerno open the workshop.
The National Research Council of Italy in Pisa hosted the International Workshop on Computational Intelligence for Multimedia Understanding, organized by the ERCIM Working Group on Multimedia Understanding through Semantics, Computation and Learning (Muscle), 11-13 December 2011. The workshop was co-sponsored by CNR and Inria. Proceedings to come in LNCS 7252.

Computational intelligence is becoming increasingly important in our households, and in all commercial, industrial and scientific environments. A host of information sources, different in nature, format, reliability and content are normally available at virtually no expense. Analyzing raw data to provide them with semantics is essential to exploit their full potential and help us manage our everyday tasks. This is also important for data description for storage and mining. Interoperability and exchangeability of heterogeneous and distributed data is an essential requirement for any practical application. Semantics is information at the highest level, and inferring it from raw data (that is, from information at the lowest level) entails exploiting both data and prior information to extract structure and meaning. Computation, machine learning, ontologies, statistical and Bayesian methods are tools to achieve this goal. They were all discussed in this workshop.

30 participants from eleven countries attended the workshop. With the help of more than 20 external reviewers, the program committee ensured a high scientific level, accepting 19 papers for presentation. Both theoretical and application-oriented talks, covering a very wide range of subjects, were given. Papers on applications included both the development of basic concepts towards the real-world practice and the actual realization of working systems. Most contributions focussed on still images, but there were some on video and text. Two presentations exploited results from studies on human visual perception for image and video analysis and synthesis. Two greatly appreciated invited lectures were given by Sanni Siltanen, VTT, former Muscle chair, and Bülent Sankur, Bogaziçi University, Istanbul. These dealt with emotion detection from facial images (Sankur) and advanced applications of augmented reality (Siltanen).

The Muscle group has more than 50 members from research groups in 15 countries. Their expertise ranges from machine learning and artificial intelligence to statistics, signal processing and multimedia database management. The goal of Muscle is to foster international cooperation in multimedia research and carry out training initiatives for young researchers.

Links:
http://muscle.isti.cnr.it/pisaworkshop2011/
http://wiki.ercim.eu/wg/MUSCLE/

Please contact:
Emanuele Salerno, MUSCLE WG chair, ISTI-CNR, Italy
Tel: +39 050 315 3137; E-mail: [email protected]
The European Forum for ICST is Taking Shape

Since 2006 there has been discussion about creating a platform for cooperation among the European societies in Information and Communication Sciences and Technologies (ICST). The objective is to have a stronger, unified voice for ICST professionals in Europe. The forum will help to develop common viewpoints and strategies for ICST in Europe and, whenever appropriate or needed, a common representation of these viewpoints and strategies at the international level.

In 2006 and 2008, Keith Jeffery from STFC, UK, led expert groups on the topic at the request of the European Commission. The Commission initiated a project to survey the views of ICST professionals in Europe; significantly, this survey indicated that ERCIM was the best known and regarded organisation in Europe. Working closely with Informatics Europe and the European Chapter of the ACM, ERCIM has pushed steadily for progress. A meeting at the 2011 Informatics Europe annual conference confirmed a widespread desire to form such an association. An initial constellation of Informatics Europe, ERCIM, the European Chapter of ACM, CEPIS (Council of European Professional Informatics Societies), EATCS (European Association for Theoretical Computer Science), EAPLS (European Association for Programming Languages and Systems), and EASST (European Association of Software Science and Technology) was agreed, with Jan van Leeuwen, Utrecht University, representing Informatics Europe, taking the chair, and ERCIM providing the coordinating website. On 2 February 2012, a meeting was hosted at Inria, Paris, where the objectives and structure of the European Forum for ICST (EFICST) were defined and agreed, with the vice presidents Keith Jeffery, STFC, representing ERCIM, and Paul Spirakis representing EATCS and ACM Europe.

Please contact:
Keith Jeffery, STFC, UK
ERCIM president and EFICST vice-president
E-mail: [email protected]
Big Data
by Costantino Thanos, Stefan Manegold and Martin Kersten
'Big data' refers to data sets whose size is beyond the capabilities of current database technology.
The current data deluge is revolutionizing the way research is carried out and is resulting in the emergence of a new, fourth paradigm of science based on data-intensive computing. This data-dominated science implies a new, data-centric way of conceptualizing, organizing and carrying out research activities. It could introduce new approaches to problems that were previously considered extremely hard or, in some cases, impossible to solve, and could also lead to serendipitous discoveries. The recent availability of huge amounts of data, along with advanced tools for exploratory data analysis, data mining/machine learning and data visualization, offers a whole new way of understanding the world.
In order to exploit these huge volumes of data, new techniques and technologies are needed. A new type of e-infrastructure, the Research Data Infrastructure, must be developed to harness the accumulation of data and knowledge produced by research communities, optimize data movement across scientific disciplines, enable large increases in multi- and inter-disciplinary science while reducing duplication of effort and resources, and integrate research data with published literature.
Science is a global undertaking and research data are both national and
global assets. A seamless infrastructure is needed to facilitate the collaborative arrangements necessary to address the intellectual and practical challenges the world faces.
The challenges include capturing the user's intention and providing users with suggestions and guidelines to refine their queries so as to converge quickly on the desired results. They also call for novel database architectures and algorithms designed to produce fast and cheap indicative answers rather than complete and correct answers.
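As a rough illustration of what an "indicative answer" can mean in practice, the sketch below estimates an aggregate from a small random sample instead of scanning the whole data set. It is a generic example for this editorial, not a technique proposed by the authors; the data and parameters are made up.

```python
import random

def approximate_count(population, predicate, sample_size=1000, seed=42):
    """Estimate how many items satisfy `predicate` by scanning only a sample.

    Returns (estimate, fraction_scanned); the estimate is fast and cheap,
    but only indicative, not exact.
    """
    random.seed(seed)
    n = len(population)
    sample = random.sample(population, min(sample_size, n))
    hits = sum(1 for item in sample if predicate(item))
    estimate = hits / len(sample) * n
    return estimate, len(sample) / n

# Example: estimate how many of one million readings exceed a threshold.
readings = [random.gauss(0.0, 1.0) for _ in range(1_000_000)]
estimate, fraction = approximate_count(readings, lambda x: x > 2.0)
print(f"~{estimate:.0f} readings above threshold (scanned {fraction:.1%} of the data)")
```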
Please contact:
Costantino Thanos
ISTI-CNR, Italy
E-mail: [email protected]
Invited article
As evidenced by a large and growing number of reports from research communities, research
funding agencies, and academia, there is growing acceptance of the assertion that science is
becoming more and more data-centric.
Data is pushed to the center by the scale and diversity of data from computational models, observatories, sensor networks, and the trails of social engagement in our current age of Internet-based connection. It is pulled to the center by technology and methods now called "the big-data movement" or, by some, a "fourth paradigm for discovery" that enables extracting knowledge from these data and then acting upon it. Vivid examples of data analytics and its potential to convert data to knowledge and then to action in many fields are found at http://www.cra.org/ccc/dan.php. Note that I am using the phrase "big data" to include both the stewardship of the data and the system of facilities and methods (including massive data centers) to extract knowledge from data.

The focus of these comments is the fact that our current infrastructure - technologies, organizations, and sustainability strategies - for the stewardship of digital data and codes is far from adequate to support the vision of transformative data-intensive discovery. I include both data and programming codes because, to the extent that both are critical to research, both need to be curated and preserved to sustain the fundamental tradition of the reproducibility of science. See for example an excellent case study about reproducible research in the digital age at http://stanford.edu/~vcs/papers/RRCiSE-STODDEN2009.pdf.

The power of data mining and analytics increases the opportunity costs of not preserving data for reuse, particularly for inherently longitudinal research such as global climate change. Multi-scale and multi-disciplinary research often requires complex data federation, and in some fields careful vetting and credentialing of data is critical. Appraisal and curation is, at present at least, expensive and labor intensive. Government research-funding agencies are declaring data from research to be a public good and requiring that it be made openly available to the public. But where will these data be stored and stewarded?

On the campuses of research universities there is widespread and growing demand by researchers to create university-level, shared, professionally managed data-storage services and associated services for data management. This is being driven by:
• the general increase in the scale of data and new methods for extracting information and knowledge from data, including data produced by other people
• policies by research funders requiring that data should be an open resource that is available at no or low cost to others over long periods of time
• privacy and export regulations on data that are beyond the capability of the researcher to comply with
• the growing need to situate data and computational resources together to make it easier for researchers to develop scientific applications, increasingly as web services, on top of the rich data store. They could potentially use their own data, as well as shared data and tools from others, to accelerate discovery, democratize resources, and yield more bang for the buck from research funding.

Although there are numerous successful repository services for the scholarly literature, most do not accommodate research data and codes. Furthermore, as noted by leaders of an emerging Digital Preservation Network (DPN) being incubated by Internet2, a US-based research and education network consortium, even the scholarship that is being produced today is at serious risk of being lost forever to future generations. There are many digital collections with a smattering of aggregation, but all are susceptible to multiple single points of failure. DPN aspires to do something about this risk, including archival services for research data.

No research funding agency, at least in the US, has provided or is likely to provide the enormous funding on a sustained basis required to create and maintain an adequate cyberinfrastructure for big data. We must approach it as a shared enterprise involving academia, government, for-profit and non-profit organizations, with multiple institutions playing complementary roles within an incentive system that is sustainable both financially and technically.

At the federal government level, major research funding agencies including the National Science Foundation, the National Institutes of Health, and the Department of Energy, together with several mission-based agencies, are developing, with encouragement from the White House, a coordinated inter-agency response to "big data." Although details will not be available for several months, the goals will be strategic and will include four linked components: foundational research, infrastructure, transformative applications to research, and related training and education.

Commercial partners could play multiple roles in the big-data movement, especially by providing cloud-based platforms for storing and processing big data. The commercial sector has provided, and will likely continue to provide, resources at a scale well beyond what can be provided by an individual or even a university consortium. Major cloud service providers in the US have strategies to build a line of business to provide the cloud as a platform for big data, and there is growing interest within both universities and the federal government in exploring sustainable public-private partnerships.

Initial efforts at collaboration between academia, government, and industry are encouraging, but great challenges remain to nurture the infrastructure necessary to achieve the promise of the big-data movement.

Please contact:
Daniel E. Atkins
University of Michigan, USA
E-mail: [email protected]
SciDB is a native array DBMS that combines data management and mathematical operations
in a single system. It is an open source system that can be downloaded from SciDB.org
SciDB is an open-source DBMS oriented toward the data management needs of scientists. As such it mixes statistical and linear algebra operations with data management ones, using a natural nested multi-dimensional array data model. We have been working on the code for three years, most recently with the help of venture capital backing. Currently, there are 14 full-time professionals working on the code base.

SciDB runs on Linux and manages data that can be spread over multiple nodes in a computer cluster, connected by TCP/IP networking. Data is stored in the Linux file system on local disks connected to each node. Hence, it uses a "shared nothing" software architecture.

The data model supported by SciDB is multi-dimensional arrays, where each cell can contain a vector of values. Moreover, dimensions can be either the standard integer ones or they can be user-defined data types with non-integer values, such as latitude and longitude. There is no requirement that arrays be rectangular; hence SciDB supports "ragged" arrays.

Access is provided through an array version of SQL, which we term AQL. AQL provides facilities for filtering arrays, joining arrays and aggregation over the cell values in an array. Moreover, Postgres-style user-defined scalar functions, as well as array functions, are provided.

In addition, SciDB contains pre-built popular mathematical functions, such as matrix multiply, that operate in parallel on multiple cores on a single node as well as across nodes in a cluster.

Other notable features of SciDB include a no-overwrite storage manager that retains old values of updated data and provides Postgres-style "time travel" on the various versions of a cell. Moreover, we have extended SciDB with support for multiple notions of "null". Using this capability, users can distinguish multiple semantic notions, such as "data is missing but it is supposed to be there" and "data is missing and will be present within 24 hours". Standard ACID transactions are supported, as is an interface to the statistical package R, which can be used to run existing R scripts as well as to visualize the result of SciDB queries.

Our storage manager divides arrays, which can be arbitrarily large, into storage "chunks" which are partitioned across the nodes of a cluster and then allocated to disk blocks. Worthy chunks are also cached in main memory for faster access.
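The chunk-based partitioning just described can be pictured with a toy sketch. The code below is not SciDB's implementation; it is only a minimal Python illustration, with made-up chunk sizes and a simple hash-based placement rule, of how a large dense array might be cut into fixed-size chunks and spread over cluster nodes.

```python
from itertools import product

def chunk_ids(shape, chunk_shape):
    """Enumerate chunk origin coordinates for a dense array of the given shape."""
    ranges = [range(0, dim, c) for dim, c in zip(shape, chunk_shape)]
    return list(product(*ranges))

def place_chunks(shape, chunk_shape, num_nodes):
    """Assign each chunk to a node with a simple hash rule (illustrative only)."""
    placement = {}
    for origin in chunk_ids(shape, chunk_shape):
        placement[origin] = hash(origin) % num_nodes
    return placement

# A hypothetical 2-D array of 1,000,000 x 1,000 cells, cut into 1000 x 1000 chunks
# and spread over a 4-node cluster.
placement = place_chunks(shape=(1_000_000, 1_000), chunk_shape=(1_000, 1_000), num_nodes=4)
print(len(placement), "chunks; chunk (0, 0) stored on node", placement[(0, 0)])
```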
We have benchmarked SciDB against Postgres on an astronomy-style workload that typifies the load provided by the Large Synoptic Survey Telescope (LSST) project. On this benchmark, SciDB outperforms Postgres by two orders of magnitude. We have also benchmarked SciDB analytics against those in R. On a single core, we offer comparable performance; however, SciDB scales linearly with additional cores and additional nodes, a characteristic that does not apply to R.

Early users of SciDB include the LSST project mentioned above and multiple high-energy physics (HEP) projects, as well as commercial applications in genomics, insurance and financial services. SciDB has been downloaded by about 1000 users from a variety of scientific and commercial domains.
Invited article
Owing to the growing interest in digital methods within the humanities, an understanding of the
tenets of digitally based scholarship and the nature of specific data management issues in the
humanities is required. To this end the ESFRI roadmap on European infrastructures has been
seminal in identifying the need for a coordinating e-infrastructure in the humanities - DARIAH -
whose data policy is outlined in this paper.
Scholarly data in the humanities is a heterogeneous notion. Data creation, ie the transcription of a primary document, the annotation of existing sources or the compilation of observations across collections of objects, is inherent to scholarly activity and thus makes it strongly dependent upon the actual hypotheses or theoretical backgrounds of the researcher. There is little notion of a data centre in the humanities since data production and enrichment are anchored on the individuals performing research.

DARIAH's goal is to create a sound and solid infrastructure to ensure the long-term stability of digital assets, as well as the development of a wide range of thus-far unanticipated services to carry out research on these assets. This comprises technical aspects (identification, preservation), editorial aspects (curation, standards) and sociological aspects (openness, scholarly recognition).

This vision is underpinned by the notion of digital surrogates, information structures intended to identify, document or represent a primary source used in a scholarly work. Surrogates can be metadata records, a scanned image of a document, digital photographs, transcription of a textual source, or any kind of extract or transformation (eg the spectral analysis of a recorded speech signal) of existing data. Surrogates act as a stable reference for further scholarly work in replacement of – or in complement to – the original physical source they represent or describe. Moreover, a surrogate can act as a primary source for the creation of further surrogates, thus forming a network that reflects the various steps of the scholarly workflow where sources are combined and enriched before being further disseminated to a wider community.

Such a unified data landscape for humanities research necessitates a clear policy on standards and good practices. Scholars should both benefit from strong initiatives such as the Text Encoding Initiative (TEI) and stabilize their experience by participating in the development of standards, in collaboration with other stakeholders (publishers, cultural heritage institutions, libraries).

The vision also impacts on the technical priorities for DARIAH, namely:
• deploying a repository infrastructure where researchers can transparently and trustfully deposit their productions, comprising permanent identification and access, targeted dissemination (private, restricted and public) and rights management, possibly in a semi-centralized way allowing efficiency, reliability and evolution (cf. http://hal.archives-ouvertes.fr/hal-00399881);
• defining standardized interfaces for accessing data through such repositories, but also through third-party data sources, with facilities such as threading, searching, selecting, visualizing and importing data;
• experimenting with the agile development of virtual research spaces based on such services, integrating community-based research workflows (see http://hal.inria.fr/inria-00593677).

Beyond the technical aspects, an adequate licensing policy must be defined to assert the legal conditions under which data assets can be disseminated. This should be a compromise between making all publicly financed scholarly productions available in open access and preventing the adoption of heterogeneous reuse constraints and/or licensing models. We contemplate encouraging the early dissemination of digital assets in the scholarly process and recommend, when applicable, the use of a Creative Commons CC-BY license, which supports systematic attribution (and thus citation) of the source.

From a political point of view, we need to discuss with potential data providers (cultural heritage entities, libraries or even private sector stakeholders such as Google) methods of creating a seamless data landscape where the following issues should be jointly tackled:
• general reuse agreements for scholars, comprising usage in publications, presentation on web sites, integration (or referencing) in digital editions, etc.;
• definition of standardized formats and APIs that could make access to one or the other data provider more transparent;
• identification of scenarios covering the archival version of records as well as scholarly created enrichments. For example, TEI transcriptions made by scholars could be archived in the library where the primary source is situated.

As a whole, DARIAH should contribute to excellence in research by being seminal in the establishment of a large-coverage, coherent and accessible data space for the humanities. Whether acting at the level of standards, education or core IT services, we should keep this vision in mind when setting priorities in areas that will impact the sustainability of the future digital ecology of scholars.

Links:
ESFRI: ec.europa.eu/research/esfri/
DARIAH: www.dariah.eu/
European report on scientific data: cordis.europa.eu/fp7/ict/e-infrastructure/docs/hlg-sdi-report.pdf
Text Encoding Initiative: http://www.tei-c.org

Please contact:
Laurent Romary
Inria, France
E-mail: [email protected]
One driver of the data tsunami is social networking companies such as Facebook™, which generate terabytes of content. Facebook, for instance, uploads three billion photos monthly, for a total of 3,600 terabytes annually. The volume of social media is large, but not overwhelming: the data are generated by a great many humans, but each is limited in their rate of data production. In contrast, large scientific facilities are another driver, where the data are generated automatically.
In the 10 years to 2008, the largest current astronomical catalogue, the Sloan Digital Sky Survey, produced 25 terabytes of data from telescopes. By 2014, it is anticipated that the Large Synoptic Survey Telescope will produce 20 terabytes each night. By the year 2019, the Square Kilometre Array radio telescope is planned to produce 50 terabytes of processed data per day, from a raw data rate of 7000 terabytes per second. The designs for systems to manage the data from these next generation scientific facilities are being based on the data management used for the largest current scientific facility: the CERN Large Hadron Collider.

The Worldwide LHC Computing Grid has provided the first global solution to collecting and analyzing petabytes of scientific data. CERN produces data as the Tier0 site, which are distributed to 11 Tier1 sites around the world - including the GRIDPP Tier-1 at STFC's Rutherford Appleton Laboratory (RAL) in the UK. The CASTOR storage infrastructure used at RAL was designed at CERN to meet the challenge of handling the high LHC data rates and volume using commodity hardware. CASTOR efficiently schedules placement of files across multiple storage devices, and is particularly efficient at managing tape access. The scientific metadata relating science to data-files is catalogued by each experiment centrally at CERN. The Tier1 sites operate databases which identify on which disk or tape each data-file is stored.

Figure 1: A snapshot of the monitor showing data analysis jobs being passed around the Worldwide LHC Computing Grid.

In science the priority is to capture the data, because if it's not stored it may be lost, and the lost dataset may have been the one that would have led to a Nobel Prize. Analysis is given secondary priority, since data can be analysed later, when it's possible. Therefore the architecture that meets the user priorities is based on effective storage, with a batch scheduler responsible for choosing compute locations, moving data and scheduling jobs.

The data are made available to researchers who submit jobs to analyse datasets on Tier2 sites. They submit processing jobs to a batch processing scheduler, stating which data to analyse and what analysis to perform. The system schedules jobs for processing at the location that minimises data transfers. The scheduler will copy the data to the compute location before the analysis, but this transfer consumes considerable communication bandwidth, which reduces the response speed.
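To make the trade-off concrete, here is a small hypothetical sketch - not the actual WLCG or CASTOR scheduler - of the kind of decision a locality-aware batch scheduler makes: run the job where most of the input data already resides, and count any remaining bytes as transfer cost. All names and sizes are invented.

```python
def choose_site(job_datasets, site_replicas, dataset_sizes):
    """Pick the site that minimises the bytes to copy before a job can run.

    job_datasets:  datasets the job reads
    site_replicas: site name -> set of datasets already stored there
    dataset_sizes: dataset name -> size in bytes
    """
    best_site, best_cost = None, None
    for site, local in site_replicas.items():
        # Bytes that would have to be transferred to this site first.
        cost = sum(dataset_sizes[d] for d in job_datasets if d not in local)
        if best_cost is None or cost < best_cost:
            best_site, best_cost = site, cost
    return best_site, best_cost

sizes = {"run42": 8_000_000_000, "calib": 500_000_000}
replicas = {"RAL": {"run42"}, "CERN": {"run42", "calib"}, "FNAL": set()}
site, transfer = choose_site({"run42", "calib"}, replicas, sizes)
print(f"schedule at {site}, copying {transfer} bytes first")
```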
The Tier1 at RAL has 8PB of high bandwidth disk storage for frequently used data, in front of a tape robot with lower bandwidth but a maximum capacity of 100PB for long term data archiving. In 2011, network rates between the Tier-1 and the wide area network averaged 4.7Gb/s and peaked at over 20Gb/s; over 17PB of data was moved between the Tier-1 and other sites worldwide. Internally, over the same period, 5PB of data was moved between disk and tape, and a further 28PB between disk and the batch farm. During intense periods of data reprocessing, internal network rates exceeded 60Gb/s for many hours.

Storing and retrieving data is not that difficult - what's hard is managing the data, so that users can find what they want, and get it when they want. The limiting factor for Tier1 sites is the performance of the databases. The RAL database stores a single 20 gigabyte table, representing the hierarchical file structure, which performs about 500 transactions per second across six clusters. In designing the service it is necessary to reduce to a practical level the number of disk operations to the data tables and the log required for error recovery on each cluster. Multiple clusters are used, but that introduces a communications delay between clusters to ensure the integrity of the database, due to the passing of data locking information between them. Either the disk I/O or the inter-cluster communication becomes the limiting factor.

In contrast, Facebook users require immediate interactive responses, so batch schedulers cannot be used. Facebook uses a waterfall architecture which, in February 2011, ran 4,000 instances of MySQL, but also required 9,000 instances of a database memory caching system to speed up performance.

Whatever the application, for large data volumes the problem remains how to model the data and its usage so that the storage system can be appropriately designed to support the users' performance demands.

Please contact:
Michael Wilson, STFC, UK
Tel: +44 1235 446619
E-mail: [email protected]
The ability to explore huge digital resources assembled in data warehouses, databases and files, at
unprecedented speed, is becoming the driver of progress in science. However, existing database
management systems (DBMS) are far from capable of meeting the scientists’ requirements. The
Database Architectures group at CWI in Amsterdam cooperates with astronomers, seismologists
and other domain experts to tackle this challenge by advancing all aspects of database technology.
The group’s research results are disseminated via its open-source database system, MonetDB.
The heart of a scientific data warehouse is its database system, running on a modern distributed platform, and used for both direct interaction with data gathered from experimental devices and management of the derived knowledge using workflow software. However, most (commercial) DBMS offerings cannot fulfill the demanding needs of scientific data management. They fall short in one or more of the following areas: multi-paradigm data models (including support for arrays), transparent data ingestion from, and seamless integration of, scientific file repositories, complex event processing, and provenance. These topics only scratch the surface of the problem. The state of the art in scientific data exploration can be compared with our daily use of search engines. For a large part, search engines rely on guiding the user from their ill-phrased queries through successive refinement to the information of interest. Limited a priori knowledge is required. The sample answers returned provide guidance to drill down, chasing individual links, or to adjust the query terms.

The situation in scientific databases is more cumbersome than searching for text, because they often contain complex observational data, eg telescope images of the sky, satellite images of the earth, time series or seismograms, and little a priori knowledge exists. The prime challenge is to find models that capture the essence of this data at both a macro- and micro-scale. The answer is in the database, but the 'Nobel-winning query' is still unknown.

Next generation database management engines should provide a much richer repertoire and ease of use experience to cope with the deluge of observational data in a resource-limited setting. Good is good enough as an answer, provided the journey can be continued as long as the user remains interested.

We envision seven directions of long term research in database technology:

• Data Vaults. Scientific data is usually available in self-descriptive file formats as produced by advanced scientific instruments. The need to convert these formats into relational tables and to explicitly load all data into the DBMS forms a major hurdle for database-supported scientific data analysis. Instead, we propose the data vault, a database-attached external file repository. The data vault creates a true symbiosis between a DBMS and existing file-based repositories, and thus provides transparent access to all data kept in the repository through the DBMS's (array-based) query language.

• Array support. Scientific data management calls for DBMSs that integrate the genuine scientific data model, multi-dimensional arrays, as first-class citizens next to relational tables, and a unified declarative language as a symbiosis of relational and linear algebra. Such support needs to go beyond 'alien' extensions that provide user defined functions. Rather, arrays need to become first-class DBMS citizens next to relational tables.

• One-minute database kernels. Such a kernel differs from conventional kernels by identifying and avoiding performance degradation by answering queries only partly within strict time bounds. Run the query during a coffee break, look at the result, and continue or abandon the data exploration path (a minimal sketch of this idea follows this list).

• Multi-scale query processing. Fast exploration of large datasets calls for partitioning the database based on science interest and resource availability. It extends traditional partitioning schemes by taking into account the areas of users' interest and the statistical stability in samples drawn from the archives.

• Post-processing result sets. The often huge results returned should not be thrown at the user directly, but passed through an analytical processing pipeline to condense the information for human consumption. This involves computation intensive data mining techniques and harnessing the power of GPUs in the software stack of a DBMS.

• Query morphing. Given the imprecision of the queries, the system should aid in hinting at proximity results using data distributions looked upon during query evaluation. For example, aside from the traditional row set, it may suggest minor changes to the query predicates to obtain non-empty results. The interesting data may be 'just around the corner'.

• Queries as answers. Standing on the shoulders of your peers involves keeping track of the queries, their results, and resource requirements. It can be used as advice to modify ill-phrased queries that could run for hours producing meaningless results.
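The following toy sketch (not MonetDB code) illustrates the one-minute-kernel idea under an assumed interface: scan as much of the data as a time budget allows, return the partial answer together with the fraction of data inspected, and let the user decide whether to continue. The data and budget are invented.

```python
import time

def partial_scan(rows, predicate, time_budget_s=60.0):
    """Evaluate a filter until the time budget expires; return a partial answer.

    Returns (matching_rows, fraction_scanned). fraction_scanned < 1.0 signals
    that the answer is indicative, not complete.
    """
    deadline = time.monotonic() + time_budget_s
    matches, scanned = [], 0
    for row in rows:
        if time.monotonic() > deadline:
            break
        scanned += 1
        if predicate(row):
            matches.append(row)
    return matches, scanned / len(rows) if rows else 1.0

# Hypothetical usage: a "coffee-break" query over synthetic observations.
observations = [{"id": i, "flux": (i * 37) % 1000} for i in range(2_000_000)]
result, coverage = partial_scan(observations, lambda r: r["flux"] > 990, time_budget_s=0.5)
print(f"{len(result)} candidate rows from {coverage:.0%} of the data")
```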
Please contact:
Stefan Manegold
CWI, The Netherlands
E-mail: [email protected]
Modern science disciplines such as environmental science and astronomy must deal with
overwhelming amounts of experimental data. Such data must be processed (cleaned, transformed,
analyzed) in all kinds of ways in order to draw new conclusions and test scientific theories. Despite
their differences, certain features are common to scientific data of all disciplines: massive scale;
manipulated through large, distributed workflows; complexity with uncertainty in the data values, eg,
to reflect data capture or observation; important metadata about experiments and their provenance;
and mostly append-only (with rare updates). Furthermore, modern scientific research is highly
collaborative, involving scientists from different disciplines (eg biologists, soil scientists, and
geologists working on an environmental project), in some cases from different organizations in
different countries. Since each discipline or organization tends to produce and manage its own data
in specific formats, with its own processes, integrating distributed data and processes gets difficult
as the amounts of heterogeneous data grow.
In 2011, to address these challenges, we started Zenith (http://www-sop.inria.fr/teams/zenith/), a joint team between Inria and University Montpellier 2. Zenith is located at LIRMM in Montpellier, a city that enjoys a very strong position in environmental science with major labs and groups working on related topics such as agronomy, biodiversity, water hazard, land dynamics and biology. We are developing our solutions by working closely with scientific application partners such as CIRAD and INRA in agronomy.

Zenith adopts a hybrid P2P/cloud architecture. P2P naturally supports the collaborative nature of scientific applications, with autonomy and decentralized control. Peers can be the participants or organizations involved in collaboration and may share data and applications while keeping full control over some of their data (a major requirement for our application partners). But for very-large scale data analysis or very large workflow activities, cloud computing is appropriate as it can provide virtually infinite computing, storage and networking resources. Such a hybrid architecture also enables the clean integration of the users' own computational resources with different clouds.

Figure 1 illustrates Zenith's architecture with P2P data services and cloud data services. We model an online scientific community as a set of peers and relationships between them. The peers have their own data sources. The relationships are between any two or more peers and indicate how the peers and their data sources are related, eg "friendship", same semantic domain, similar schema. The P2P data services include basic services (metadata and uncertain data management): recommendation, data analysis and workflow management through the Shared-data Overlay Network (SON) middleware. The cloud P2P services include data mining, content-based information retrieval and workflow execution. These services can be accessed through web services, and each peer can use the services of multiple clouds.

Let us illustrate two recent results obtained in this context. The first is the design and implementation of P2Prec (http://www-sop.inria.fr/teams/zenith/p2prec/), a recommendation service for P2P content sharing that exploits users' social data. In our approach, recommendation is based on explicit personalization, by exploiting the scientists' social networks, using gossip protocols that scale well. Relevance measures may be expressed based on similarities, users' confidence, document popularity, rates, etc., and combined to yield different recommendation criteria. With P2Prec, each user can identify data (documents, annotations, datasets, etc.) provided by others and send queries to them. For instance, one may want to know which scientists are expert in a topic and get documents highly rated by them. Or another may look for the best datasets used by others for some experiment. To efficiently disseminate information among peers, we propose new semantic-based gossip protocols. Furthermore, P2Prec has the ability to get reasonable recall with acceptable query processing load and network traffic.
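As a purely illustrative sketch - not the P2Prec implementation or its actual protocol - the code below shows the basic shape of a gossip exchange: each round, every peer contacts one random neighbour and the two merge their views of who is interested in which topics, so profiles spread without any central index. Peer names, topics and scores are made up.

```python
import random

def gossip_round(profiles, neighbours, rng):
    """One gossip round: every peer merges topic profiles with a random neighbour.

    profiles:   peer -> {topic: interest score}
    neighbours: peer -> list of peers it knows
    """
    for peer in profiles:
        if not neighbours[peer]:
            continue
        other = rng.choice(neighbours[peer])
        merged = dict(profiles[peer])
        for topic, score in profiles[other].items():
            merged[topic] = max(merged.get(topic, 0.0), score)
        profiles[peer] = profiles[other] = merged
    return profiles

rng = random.Random(1)
profiles = {"alice": {"soil": 0.9}, "bob": {"biodiversity": 0.8}, "carol": {"soil": 0.4}}
neighbours = {"alice": ["bob"], "bob": ["alice", "carol"], "carol": ["bob"]}
for _ in range(3):
    gossip_round(profiles, neighbours, rng)
print(profiles["carol"])  # carol gradually learns about topics seen by her neighbours
```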
The second result deals with the efficient processing of scientific workflows that are computationally and data-intensive, thus requiring execution on large-scale parallel computers. We propose an algebraic approach (inspired by relational algebra) and a parallel execution model that enable automatic optimization of scientific workflows. With our algebra, data are uniformly represented by relations and workflow activities are mapped to operators that have data-aware semantics. Our execution model is based on the concept of activity activation, which enables transparent distribution and parallelization of activities. Using both a real oil exploitation application and synthetic data scenarios, our experiments demonstrate major performance improvements compared to an ad-hoc workflow implementation.

Please contact:
Patrick Valduriez
Inria, France
E-mail: [email protected]
Process mining provides new ways to analyze the performance of clinical processes based on large
amounts of event data recorded at run-time.
Hospitals and healthcare organizations around the world are collecting increasingly vast amounts of data about their patients and the clinical processes they go through. At the same time - especially in the case of public hospitals - there is growing pressure from governmental bodies to refactor clinical processes in order to improve efficiency and reduce costs. These two trends converge, prompting the need to use run-time data to support the analysis of existing processes.

In the area of process mining, there are specialized techniques for the analysis of business processes according to a number of perspectives, including control-flow, social network, and performance. These techniques are based on the analysis of event data recorded in system logs. In general, any information system that is able to record the activities that are performed during process execution can provide valuable data for process analysis.

These event data become especially relevant for the analysis of clinical processes, which are highly complex, dynamic, multi-disciplinary, and ad-hoc in nature. Until recently one could only prescribe general guidelines for this kind of process, and expect that medical staff comply. Now, with process mining techniques, it is possible to analyze the actual run-time behaviour of such processes and obtain precise information about their performance in near real-time.

Such an endeavour, however, is made difficult by the fact that reality is inherently complex, so direct application of process mining techniques may produce very large and confusing models, which are quite difficult to interpret and analyze – in the parlance of process mining, these are known as "spaghetti" models.

While the area of process mining is being led by Wil van der Aalst at the Eindhoven University of Technology in The Netherlands, here at the Technical University of Lisbon, in Portugal, we have been developing techniques to address the problem of how to extract information from event logs such that the output models are more amenable to interpretation and analysis. To this end, we have spent the last six years developing a number of clustering, partitioning, and preprocessing techniques. Such techniques have matured to the point that they can be systematically applied to real-world event logs, according to a prescribed methodology, to produce understandable, useful, and often surprising results.

One of the latest developments in the field of process mining, introduced by Zhengxing Huang and others at Zhejiang University in China, concerns performance. Typically, a control-flow model must be extracted from the event log prior to performance analysis. However, to study the performance of healthcare processes, only a subset of the recorded activities is usually considered – these are the key activities that represent milestones in the process, and that are always present regardless of the actual path of the patient. The time span between these activities becomes a Key Performance Indicator (KPI).

The ability to measure this KPI directly from the event log is a major improvement with respect to previous performance analysis techniques, which rely on a control-flow model that often includes too much behaviour. Here, we are interested in a predefined sequence of milestones and in retrieving the time span between any pair of milestones. Incidentally, this approach also provides the time span between the first and last activities, which can be used to determine the length of stay (LOS) of the patient in the hospital, one of the most sought-after KPIs in healthcare processes.

Back in Portugal, we applied this approach in a case study carried out in the emergency department of a mid-sized public hospital. The hospital has an Electronic Patient Record (EPR) system, which records the events that take place in several departments. The event log used in this experiment was collected over a period of 12 days. A total of 4851 patients entered the emergency department in that period, resulting in over 30 000 recorded events, although there are only 18 distinct activities.

Figure 1 depicts a control-flow model for these activities, illustrating the reason why such diagrams are often called "spaghetti" models. In Figure 2,
we present the results for some key activities. The first step – triage – determines the priority of the patient and takes place once the patient enters the hospital. For patients who require medical examination, it takes on average two hours to perform the first exam. About two hours and 30 minutes later, the patient receives the diagnosis, and then is quickly discharged, on average within three minutes. The resulting LOS amounts to an average of four hours and 30 minutes.
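The milestone-based KPI described above is easy to state in code. The sketch below is a generic illustration, with a hypothetical event format, and is not the tool used in the case study: it scans an event log, keeps only milestone activities, and reports the average time between a chosen pair of milestones per patient.

```python
from collections import defaultdict
from datetime import datetime

def milestone_kpi(events, start_activity, end_activity):
    """Average time (in minutes) between two milestone activities, per case.

    events: iterable of (case_id, activity, timestamp) with ISO-format timestamps.
    Only the first occurrence of each milestone per case is considered.
    """
    first_seen = defaultdict(dict)  # case_id -> {activity: timestamp}
    for case_id, activity, ts in events:
        if activity in (start_activity, end_activity):
            first_seen[case_id].setdefault(activity, datetime.fromisoformat(ts))
    spans = [
        (acts[end_activity] - acts[start_activity]).total_seconds() / 60.0
        for acts in first_seen.values()
        if start_activity in acts and end_activity in acts
    ]
    return sum(spans) / len(spans) if spans else None

log = [
    ("p1", "triage", "2012-01-10T08:00:00"),
    ("p1", "diagnosis", "2012-01-10T12:20:00"),
    ("p2", "triage", "2012-01-10T09:15:00"),
    ("p2", "diagnosis", "2012-01-10T13:05:00"),
]
print(milestone_kpi(log, "triage", "diagnosis"), "minutes on average")
```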
Link: http://www.processmining.org/
Please contact:
Diogo R. Ferreira
IST – Technical University of Lisbon,
Portugal
Tel.: +351 21 423 35 52
Figure 2: Performance analysis E-mail: [email protected]
With High Throughput Sequencing (HTS) technologies, biology is experiencing a sequence data
deluge. A single sequencing experiment currently yields 100 million short sequences, or reads, the
analysis of which demands efficient and scalable sequence analysis algorithms. Diverse kinds of
applications repeatedly need to query the sequence collection for the occurrence positions of a
subword. Time can be saved by building an index of all subwords present in the sequences before
performing huge numbers of queries. However, both the scalability and the memory requirement of
the chosen data structure must suit the data volume. Here, we introduce a novel indexing data
structure, called Gk arrays, and related algorithms that improve on classical indexes and state of the
art hash tables.
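The abstract above gives no implementation detail, so the following is only a naive baseline for the kind of query such an index must answer; it is not the Gk arrays structure itself. It builds a plain hash table from every k-length subword to the list of (read, position) pairs where it occurs, which works but uses far more memory than the compact structures the abstract refers to.

```python
from collections import defaultdict

def build_kmer_index(reads, k):
    """Map every k-length subword to the (read index, offset) positions where it occurs."""
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].append((read_id, pos))
    return index

reads = ["ACGTACGT", "TTACGTAA"]
index = build_kmer_index(reads, k=4)
print(index["ACGT"])  # -> [(0, 0), (0, 4), (1, 2)]
```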
Joint genetic and neuroimaging data analysis on large cohorts of subjects is a new approach used to
assess and understand the variability that exists between individuals. This approach, which to date
is poorly understood, has the potential to open pioneering directions in biology and medicine. As
both neuroimaging- and genetic-domain observations include a huge number of variables (of the order of 10^6), performing statistically rigorous analyses on such Big Data represents a
computational challenge that cannot be addressed with conventional computational techniques. In
the A-Brain project, researchers from Inria and Microsoft Research explore cloud computing
techniques to address the above computational challenge.
Several brain diseases have a genetic origin, or their occurrence and severity is related to genetic factors. Genetics plays an important role in understanding and predicting responses to treatment for brain diseases like autism, Huntington's disease and many others. Brain images are now used to understand, model, and quantify various characteristics of the brain. Since they contain useful markers that relate genetics to clinical behaviour and diseases, they are used as an intermediate between the two. Currently, large-scale studies assess the relationships between diseases and genes, typically involving several hundred patients per study.

Imaging genetic studies linking functional MRI data and Single Nucleotide Polymorphisms (SNPs) data may face a dire multiple comparisons issue. In the genome dimension, genotyping DNA chips allow recording of several hundred thousand values per subject, while in the imaging dimension an fMRI volume may contain 100k-1M voxels. Finding the brain and genome regions that may be involved in this link entails a huge number of hypotheses, hence a drastic correction of the statistical significance of pair-wise relationships, which in turn results in a crucial reduction of the sensitivity of the statistical procedures that aim to detect the association. It is therefore desirable to set up techniques that are as sensitive as possible to explore where in the brain and where in the genome a significant link can be detected, while correcting for family-wise multiple comparisons (controlling the false positive rate).

In the A-Brain project, researchers of the Parietal and KerData Inria teams jointly address this computational problem using cloud computing techniques on the Microsoft Azure cloud computing environment. The two teams bring their complementary expertise: KerData (Rennes) in the area of scalable cloud data management and Parietal (Saclay) in the field of neuroimaging and genetics data analysis.

The Map-Reduce programming model has recently arisen as a very effective approach to develop high-performance applications over very large distributed systems such as grids and now clouds. KerData has recently proposed a set of algorithms for data management, combining versioning with decentralized metadata management to support scalable, efficient, fine-grain access to massive, distributed Binary Large OBjects (BLOBs) under heavy concurrency. The project investigates the benefits of integrating BlobSeer with Microsoft Azure storage services and aims to evaluate the impact of using BlobSeer on Azure with large-scale application experiments such as the genetics-neuroimaging data comparisons addressed by Parietal. The project is supervised by the Joint Inria-Microsoft Research Centre.

Sophisticated techniques are required to perform sensitive analysis on the targeted large datasets.
Figure 1: Identifying areas in the human brain (red and orange colors) in which activation is correlated with a given SNP, using A-Brain.
Univariate studies find an SNP and a neuroimaging trait that are significantly correlated (eg the amount of functional activity in a brain region is related to the presence of a minor allele on a gene). In regression studies, some sets of SNPs predict a neuroimaging/behavioural trait (eg a set of SNPs predicts a given brain characteristic), while with multivariate studies, an ensemble of genetic traits predicts a certain combination of neuroimaging traits. Typically, the data sets involved contain 50K voxels and 500K SNPs. Additionally, in order to obtain results with a high degree of confidence, 10K permutations of the initial data are required, resulting in a total computation of 2.5 × 10^14 associations. Several regressions are performed, each giving a set of correlations, and all these intermediate data must be stored in order to compare the values of each simulation and keep that which is most significant. The intermediate data that must be stored can easily reach 1.77 PetaBytes.
Traditional computing has shown its limitations in offering a solution for such a complex problem in the context of Big Data. Performing one experiment to determine whether there is a correlation between one brain location and any of the genes would take about five years on a single core. The computational framework, however, can easily be run in parallel, and with the emergence of the recent cloud platforms we could perform such computations in a reasonable time (days).

Our goal is to use Microsoft's Azure cloud to perform such experiments. For this purpose, two million hours per year and 10 TBytes of storage on the Azure platform are available for the duration of the project (three years). In order to execute the complex A-Brain application one needs a parallel programming framework (like MapReduce), supported by a high performance storage backend. We therefore developed TomusBlobs, an optimized storage service for Azure clouds, leveraging the high throughput under heavy concurrency provided by the BlobSeer library developed at KerData. TomusBlobs is a distributed storage system that exposes the local storage of the computation nodes in the cloud as a uniform shared storage to the application. Using this system as a storage backend, we implemented TomusMapReduce, a MapReduce platform for Azure. With these tools we were able to execute the neuroimaging and genetics application in Azure and to create a demo for it. Preliminary results show that our solution brings substantial benefits to data intensive applications like A-Brain compared to approaches relying on state-of-the-art cloud object storage.

The next step will be to design a performance model for the data management layer, which considers the cloud's variability and provides some optimized deployment configurations. We are also investigating new techniques to make more efficient correlations between genes and brain characteristics.

Links:
http://www.msr-inria.inria.fr/Projects/a-brain
http://www.irisa.fr/kerdata/
http://parietal.saclay.inria.fr/
http://blobseer.gforge.inria.fr/

Please contact:
Gabriel Antoniu, Inria, France
Tel: +33 2 99 84 72 44, E-mail: [email protected]

Alexandru Costan, Inria, France
Tel: +33 2 99 84 25 34, E-mail: [email protected]

Bertrand Thirion, Inria, France
Tel: +33 1 69 08 79 92, E-mail: [email protected]
For decades, compute power and storage have become steadily cheaper, while network speeds,
although increasing, have not kept up. The result is that data is becoming increasingly local and thus
distributed in nature. It has become necessary to move the software and hardware to where the
data resides, and not the reverse. The goal of LAWA is to create a Virtual Web Observatory based on
the rich centralized Web repository of the European Archive. The observatory will enable Web-scale
analysis of data, will facilitate large-scale studies of Internet content and will add a new dimension
to the roadmap of Future Internet Research – it’s about time!
trustworthiness or personally sensitive social content. The VWO will provide graphical browsing to aid the user in discovering information along with its relationships. Hyperlinks constitute a type of relationship that, when visualized, provides insight into the connection of terms or documents and is useful for determining their information source, quality and trustworthiness. Our main goal is to present relevant interconnections of a large graph with textual information, by visualizing a well-connected small subgraph fitting the information needs of the user.

A Web archive of timestamped versions of Web sites over a long-term time horizon opens up great opportunities for analysts. By detecting named entities in Web pages we raise the entire analytics to a semantic rather than keyword level. Difficulties here arise from name ambiguities, thus requiring a disambiguation mapping of mentions (noun phrases in the text that can denote one or more entities) onto entities. For example, "Bill Clinton" might be the former US president William Jefferson Clinton, or any other William Clinton contained in Wikipedia. Ambiguity further increases if the text only contains "Clinton" or a phrase like "the US president". The temporal dimension may further introduce complexity, for example when names of entities have changed over time (eg people getting married or divorced, or organizations that undergo restructuring in their identities).

As part of our research on entity disambiguation we have developed the AIDA system (Accurate Online Disambiguation of Named Entities), which includes an efficient and accurate NED method suited for online usage. Our approach leverages the YAGO2 knowledge base as an entity catalog and a rich source of relationships among entities. We cast the joint mapping into a graph problem: mentions from the input text and candidate entities define the node set, and we consider weighted edges between mentions and entities, capturing context similarities, and weighted edges among entities, capturing coherence.
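To make the graph formulation concrete, the following toy sketch builds and scores such a mention-entity graph; the mentions, candidate entities and all edge weights are invented for illustration and are not AIDA's actual similarity or coherence functions:

from itertools import product

# Toy mention-entity graph in the spirit of the description above: mentions and
# candidate entities are the nodes; mention-entity edges carry context
# similarity, entity-entity edges carry coherence.
mentions = ["Clinton", "Gore"]
candidates = {
    "Clinton": ["Bill_Clinton", "George_Clinton"],
    "Gore": ["Al_Gore", "Gore_Vidal"],
}
similarity = {                       # mention -> entity edges
    ("Clinton", "Bill_Clinton"): 0.5,
    ("Clinton", "George_Clinton"): 0.5,
    ("Gore", "Al_Gore"): 0.5,
    ("Gore", "Gore_Vidal"): 0.5,
}
coherence = {                        # entity <-> entity edges
    frozenset({"Bill_Clinton", "Al_Gore"}): 0.9,
    frozenset({"George_Clinton", "Gore_Vidal"}): 0.1,
}

def joint_score(assignment):
    """Score one joint mapping of every mention to one candidate entity."""
    score = sum(similarity[(m, e)] for m, e in assignment.items())
    chosen = list(assignment.values())
    for i in range(len(chosen)):
        for j in range(i + 1, len(chosen)):
            score += coherence.get(frozenset({chosen[i], chosen[j]}), 0.0)
    return score

best = max(
    (dict(zip(mentions, combo))
     for combo in product(*(candidates[m] for m in mentions))),
    key=joint_score,
)
print(best)  # the jointly most coherent assignment wins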
Figure 2 shows the user interface of AIDA. The left panel allows the user to select the underlying disambiguation methods and to insert an input text. The panel in the middle shows, for each mention (in green), the disambiguated entity, linked with the corresponding Wikipedia articles. In addition, a clickable type cloud allows the user to explore the types of the named entities contained. Finally, the rightmost panel provides statistics about the disambiguation process.

In conclusion, supercomputing software architectures will play a key role in scaling data- and computation-intensive problems in business intelligence, information retrieval and machine learning. In the future we will be identifying new applications such as Web 3.0, and considering the combination of distributed and many-core computing for problems that are both data and computationally intensive.

Our work is supported by the 7th Framework IST programme of the European Union through the focused research project (STREP) on Longitudinal Analytics of Web Archive data (LAWA) under contract no. 258105.

Links:
LAWA project website: http://www.lawa-project.eu/
AIDA web interface: https://d5gate.ag5.mpi-sb.mpg.de/webaida/
YAGO2 web interface: https://d5gate.ag5.mpi-sb.mpg.de/webyagospotlx/WebInterface
Visualization demo: http://dms.sztaki.hu/en/letoltes/wimmut-searching-and-navigating-wikipedia
FIRE website: http://cordis.europa.eu/fp7/ict/fire/

Please contact:
Marc Spaniol
Max-Planck-Institute for Informatics, Saarbrücken, Germany
E-mail: [email protected]

András Benczúr
SZTAKI, Budapest, Hungary
E-mail: [email protected]
Vision Cloud is an ongoing European project on cloud computing. The novel storage and
computational infrastructure is designed to meet the challenge of providing for tomorrow’s
data-intensive services.
The two most important trends in information technology today are the increasing proliferation of data-intensive services and the digital convergence of telecommunications, media and ICT. A growing number of services aggregate, analyze and stream rich data to service consumers over the Internet. We see the inevitability of media, telecommunications and ICT services becoming merged into one operating platform, where content and ICT resources (including storage, network, and computing) are integrated to provide value-added services to users. The growing number, scale, variety and sophistication of data-intensive services impose demanding requirements that cannot be met using today's computing technologies. There is the need to simultaneously service millions of users, accommodate the rapid growth of services and the sheer volume of data, while at the same time providing high availability, low maintenance costs and efficient resource utilization.

The two critical ingredients to deliver converged data-intensive services are the storage infrastructure and the computational infrastructure. Not only must the storage offer unprecedented scalability and good, tunable availability, it must also provide the means to structure, categorize, and search massive data sets. The computational framework […]
The massive amount of digital data currently being produced by industry, commerce and research is
an invaluable source of knowledge for business and science, but its management requires scalable
storage and computing facilities. In this scenario, efficient data analysis tools are vital. Cloud systems
can be effectively exploited for this purpose as they provide scalable storage and processing services,
together with software platforms for developing and running data analysis environments. We present
a framework that enables the execution of large-scale parameter sweeping data mining applications
on top of computing and storage services.
The past two decades have been characterized by an exponential growth of digital data production in many fields of human activity, from science to enterprise. In the biological, medical, astronomic and earth science fields, for example, very large data sets are produced daily from the observation or simulation of complex phenomena. Unfortunately, massive data sets are hard to understand, and models and patterns hidden within them cannot be identified by humans directly, but must be analyzed by computers using knowledge discovery in databases (KDD) processes and data mining techniques.

Data analysis applications often need to run a data mining task several times, using different parameter values, before obtaining significant results. For this reason, parameter sweeping is widely used in data mining applications to explore the effects of different parameter values on the results of data analysis. This is a time-consuming process when a single computer is used to mine massive data sets, since it can require very long execution times.

Cloud systems can be effectively employed to handle this class of application since they provide scalable storage and processing services, as well as software platforms for developing and running data analysis environments on top of such services.

We have worked on this topic by developing Data Mining Cloud App, a software framework that enables the execution of large-scale parameter sweeping data analysis applications on top of Cloud computing and storage services. The framework has been implemented using Windows Azure and has been used to run large-scale parameter sweeping data mining applications on a Microsoft Cloud data centre.

Figure 1 shows the architecture of the Data Mining Cloud App framework, as it is implemented on Windows Azure. The framework includes the following components:
• a set of binary and text data containers (Azure blobs) used to store the data to be mined (input datasets) and the results of data mining tasks (data mining models)
• a task queue that contains the data mining tasks to be executed
• a task status table that keeps information about the status of all tasks
• a pool of k workers, where k is the number of virtual servers available, in charge of executing the data mining tasks submitted by the users
• a website that allows users to submit, monitor the execution of, and access the results of data mining tasks.
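The division of labour between the task queue, the status table and the worker pool can be sketched as follows. This is a self-contained illustration using in-memory Python structures, not the framework's actual code; in the real system these components are hosted on Azure (with datasets and models kept in Azure blobs):

import queue
import threading
import time

task_queue = queue.Queue()   # data mining tasks waiting to be executed
task_status = {}             # task id -> submitted / running / done / failed

def submit(task_id, dataset, algorithm, params):
    task_status[task_id] = "submitted"
    task_queue.put((task_id, dataset, algorithm, params))

def worker():
    while True:
        task_id, dataset, algorithm, params = task_queue.get()
        task_status[task_id] = "running"
        try:
            algorithm(dataset, **params)   # run one data mining task
            task_status[task_id] = "done"
        except Exception:
            task_status[task_id] = "failed"
        finally:
            task_queue.task_done()

k = 4                                      # one worker per virtual server
for _ in range(k):
    threading.Thread(target=worker, daemon=True).start()

def dummy_classifier(dataset, depth):      # stand-in for a real mining algorithm
    time.sleep(0.1)

# Parameter sweeping: submit the same task once per parameter value.
for i, depth in enumerate([2, 4, 8, 16]):
    submit(f"task-{i}", "input-dataset", dummy_classifier, {"depth": depth})

task_queue.join()
print(task_status)                         # e.g. {'task-0': 'done', ...}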
The website includes three main sections: i) task submission, which allows users to submit data mining tasks; ii) task status, which is used to monitor the status of submitted tasks and to access results; iii) data management, which allows users to manage input data and results. The user can monitor the status of each single task through the task status section of the website, as shown in Figure 3. For each task, the current status (submitted, running, done or failed) and the status update time are shown. Moreover, for each task that has completed its execution, the system enables two links: the first (Stat) gives access to a file containing some statistics about the amount of resources consumed by the task; the second (Result) visualizes the task result.

We evaluated the performance of the Data Mining Cloud App through the execution of a set of long-running parameter sweeping data mining applications on a pool of virtual servers hosted by a Microsoft Cloud data centre. The experiments demonstrated the effectiveness of the Data Mining Cloud App framework, as well as the scalability that can be achieved through the parallel execution of parameter sweeping data mining applications on a pool of virtual servers. For example, the classification of a large dataset (290,000 records) on a single virtual server required more than 41 hours, whereas it was completed in less than three hours on 16 virtual servers. This corresponds to an execution speedup equal to 14.

Other than supporting users in designing and running parameter sweeping data mining applications on large data sets, we intend to exploit Cloud computing platforms for running knowledge discovery processes designed as a combination of several data analysis steps to be run in parallel on Cloud computing elements. To achieve this goal, we are currently extending the Data Mining Cloud App framework to also support workflow-based KDD applications, in which complex data analysis applications are specified as graphs that link together data sources, data mining algorithms, and visualization tools.

Links:
http://www.microsoft.com/windowsazure
http://grid.deis.unical.it

Please contact:
Domenico Talia
ICAR-CNR and DEIS, University of Calabria, Italy
Tel: +39 0984 494726
E-mail: [email protected]

Fabrizio Marozzo and Paolo Trunfio
DEIS, University of Calabria, Italy
E-mail: [email protected], [email protected]
In today’s highly networked world, any researcher can study massive amounts of source code
even on inexpensive off-the-shelf hardware. This leads to opportunities for new analyses and
tools. The analysis of big software data can confirm the existence of conjectured phenomena,
expose patterns in the way a technology is used, and drive programming language research.
The amount and variety of available external information associated with evolving software systems is staggering: data sources include bug reports, mailing list archives, issue trackers, dynamic traces, navigation information extracted from the IDE, and meta-annotations from the versioning system. All these sources of information have a time dimension, which is tracked in versioning control systems.

Software systems, however, do not exist in isolation but co-exist in larger contexts known as software ecosystems. A software ecosystem is a group of software systems that is developed and co-evolves together in the same environment. The usual environments in which ecosystems exist are organizations (companies, research centres, universities) or communities (open source communities, programming language communities). The systems within an ecosystem usually co-evolve, depend on each other, have intersecting sets of developers as authors, and use similar technologies and libraries. Analyzing an entire ecosystem entails dealing with orders of magnitude more data than analyzing a single system. As a result, analysis techniques that work for the individual system no longer apply.

Recently, we have seen the emergence of a new type of large repository of information associated with software systems, which can be orders of magnitude larger than an ecosystem: the super-repository. Super-repositories are repositories of project repositories. The existence of super-repositories provides us with an even larger source of information to analyze, exceeding ecosystems again by orders of magnitude.

Figure 2: A subset of the software systems in the SqueakSource ecosystem shows a tight network of compile-time dependencies.

Link:
http://scg.unibe.ch/bigsoftwaredata

Please contact:
Mircea Lungu, Oscar Nierstrasz, Niko Schwarz
University of Bern, Switzerland
E-mail: [email protected], [email protected], [email protected]
The potential of Semantic Big Data is currently severely underexploited due to their huge space
requirements, the powerful resources required to process them and their lengthy consumption time.
We work on novel compression techniques for scalable storage, exchange, indexing and query
answering of such emerging data.
The Web of Data materializes the basic principles of the Semantic Web. It is a collective effort for the integration and combination of data from diverse sources, allowing automatic machine-processing and reuse of information. Data providers make use of a common language to describe and share their semi-structured data, hence one person, or a machine, can move through different datasets with information about the same resource. This common language is the Resource Description Framework (RDF), and Linked Open Data is the project that encourages the open publication of interconnected datasets in this format.

[…] compress these plain formats using universal compression algorithms. The resultant file lacks logical structure and there is no agreed way to efficiently publish such data, ie to make them (publicly) available for diverse purposes and users. In addition, the data are hardly usable at consumption; the consumer has to decompress the file and, then, to use an appropriate external […]

[…]-centric view. HDT modularizes the data and uses the skewed structure of big RDF graphs to achieve large spatial savings. In addition, it includes metadata describing the RDF dataset, which serves as an entrance point to the information on the dataset and leads to clean, easy-to-share publications. Our experiments show that big RDF is now exchanged and processed 10-15 times […]
[…] query evaluation and index construction.

The HDT format has recently been accepted as a W3C Member Submission, highlighting the relevancy of 'efficient interchange of RDF graphs'. We are currently working on our HDT-based store. In this sense, compression is not only useful for exchanging big data, but also for distribution purposes, as it allows bigger amounts of data to be managed using fewer computational resources. Another research area where we plan to apply these concepts is sensor networks, where data throughput plays a major role. In the near future, where RDF exchange together with SPARQL query resolution will be the most common daily task of Web machine agents, our efforts will serve to alleviate current scalability drawbacks.
Links:
DataWeb Research Group: http://dataweb.infor.uva.es
RDF/HDT: http://www.rdfhdt.org
HDT W3C Member Submission: http://www.w3.org/Submission/2011/03/

Please contact:
Javier D. Fernández
University of Valladolid, Spain
E-mail: [email protected]

Miguel A. Martínez-Prieto
University of Valladolid, Spain
E-mail: [email protected]

Mario Arias
DERI, National University of Ireland Galway, Ireland
E-mail: [email protected]
The digital collections of scientific and memory institutions – many of which are already in the petabyte
range – are growing larger every day. The fact that the volume of archived digital content worldwide is
increasing geometrically demands that the associated preservation activities become more scalable. The
economics of long-term storage and access demand that they become more automated. The present state
of the art fails to address the need for scalable automated solutions for tasks like the characterization or
migration of very large volumes of digital content. Standard tools break down when faced with very large
or complex digital objects; standard workflows break down when faced with a very large number of objects
or heterogeneous collections. In short, digital preservation is becoming an application area of big data, and
big data is itself revealing a number of significant preservation challenges.
The EU FP7 ICT Project SCAPE (Scalable Preservation Environments), running since February 2011, was initiated in order to address these challenges through intensive computation combined with scalable monitoring and control. In particular, data analysis and scientific workflow management play an important role. Technical development is carried out in three sub-projects and will be validated in three Testbeds (refer to Figure 1).

Figure 1: Challenges of the SCAPE project

Testbeds
The SCAPE Testbeds will examine very large collections from three different application areas: Digital Repositories from the library community (including nearly two petabytes of broadcast audio and video archives from the State Library of Denmark, who are adding more than 100 terabytes every year), Web Content from the web archiving community (including over a petabyte of web harvest data), and Research Data Sets from the scientific community (including millions of objects from the UK Science and Technology Facilities Council's Diamond Synchrotron source and ISIS suite of neutron and muon instruments). The Testbeds provide a description of preservation issues, with a special focus on data sets that imply a real challenge for scalable solutions. They range from single activities (like data format migration, analysis and identification) to complex quality assurance workflows. These are supported at scale by solutions like the SCAPE Platform or the Planning and Monitoring services detailed below. SCAPE solutions will be evaluated against defined institutional data sets in order to validate their applicability in real-life application scenarios, such as large-scale data ingest, analysis and maintenance.

Data Analysis and Preservation Platform
The SCAPE Platform will provide an extensible infrastructure for the execution of digital preservation workflows on large volumes of data. It is designed as an integrated system for content holders, employing a scalable architecture to execute preservation processes on archived content. The system is based on distributed and data-centric systems like Hadoop and HDFS and programming models like MapReduce. A suitable storage abstraction will provide integration with content repositories at the storage level for fast data exchange between the repository and the execution system. Many SCAPE collections consist of large binary objects that must be pre-processed before they can be expressed using a structured data model. The Platform will therefore implement a storage hierarchy for processing, analysing and archiving content, relying on a combination of distributed database and file […]
Research in the field of information retrieval is often concerned with improving the quality of search systems. The quality of a search system crucially depends on ranking the documents that match a query. To get the best documents ranked in the top results for a query, search engines use numerous statistics on query terms and documents, such as the number of occurrences of a term in the document, the number of occurrences of a term in the collection, the number of hyperlinks pointing at a document, the number of occurrences of a term in the anchor texts of hyperlinks pointing at a document, etc. New ranking ideas are tested off-line on query sets with human-rated documents. If such ideas are radically new, experimentally testing them might require a non-trivial amount of coding to change an existing search engine. If, for instance, a new idea requires information that is not currently in the search engine's inverted index, then the researcher has to re-index the data or even recode parts of the system's indexing facilities, and possibly recode the query processing facilities that access this information. If the new idea requires query processing techniques that are not supported by the search engine (for instance sliding windows, phrases, or structured query expansion) even more work has to be done.

Instead of using the indexing facilities of the search engine, we propose to use MapReduce to test new retrieval approaches by sequentially scanning all documents. Some of the advantages of this method are: 1) Researchers spend less time on coding and debugging new experimental retrieval approaches; 2) It is easy to include new information in the ranking algorithm, even if that information would not normally be included in the search engine's inverted index; 3) Researchers are able to oversee all or most of the code used in the experiment; 4) Large-scale experiments can be done in reasonable time.
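A minimal sketch of such a sequential-scan experiment, written in Hadoop-streaming style (a mapper and a reducer reading text from standard input), is shown below; the query set and the term-count scoring function are deliberately naive placeholders, not methods proposed in the article:

import sys
from collections import Counter

QUERIES = {"q1": ["big", "data"]}   # toy query set (an assumption of this sketch)
TOP_K = 10

def mapper(lines):
    # Input: one document per line, formatted as "doc_id <TAB> document text".
    # Output: (query_id, score, doc_id) for every matching document scanned.
    for line in lines:
        doc_id, _, text = line.rstrip("\n").partition("\t")
        tf = Counter(text.lower().split())
        for qid, terms in QUERIES.items():
            score = sum(tf[t] for t in terms)   # naive term-count ranking
            if score > 0:
                yield qid, score, doc_id

def reducer(records):
    # Keep only the TOP_K highest-scoring documents per query.
    best = {}
    for qid, score, doc_id in records:
        best.setdefault(qid, []).append((score, doc_id))
    for qid, scored in best.items():
        for score, doc_id in sorted(scored, reverse=True)[:TOP_K]:
            yield f"{qid}\t{doc_id}\t{score}"

if __name__ == "__main__":
    # Run locally by chaining the two functions; under Hadoop streaming the
    # mapper and reducer would run as separate programs over the collection.
    for out in reducer(mapper(sys.stdin)):
        print(out)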
MapReduce was developed at Google as a framework for batch processing of large data sets on clusters of commodity machines. Users of the framework implement a mapper function that processes a key/value pair to generate a set of intermediate key/value pairs, and […]
To date, big data applications have focused on the store-and-process paradigm. In this paper we
describe an initiative to deal with big data applications for continuous streams of events.
In many emerging applications, the volume of data being streamed is so large that the traditional 'store-then-process' paradigm is either not suitable or too inefficient. Moreover, soft real-time requirements might severely limit the engineering solutions. Many scenarios fit this description. In network security for cloud data centres, for instance, very high volumes of IP packets and events from sensors at firewalls, network switches, routers and servers need to be analyzed, and attacks should be detected in minimal time in order to limit the effect of the malicious activity on the IT infrastructure. Similarly, in the fraud department of a credit card company, payment requests should be processed online, as quickly as possible, in order to provide meaningful results in real-time. An ideal system would detect fraud during the authorization process, which lasts hundreds of milliseconds, and deny the payment authorization, minimizing the damage to the user and the credit card company.
[…] events and results are output any time the actual data satisfies the query predicate. A continuous query is modelled as a graph where edges identify data flows and nodes represent operators that process input data.
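A toy version of such an operator graph, with Python generators standing in for stream operators, might look as follows; the query, the window size and the event data are all invented for illustration:

# A continuous query as a tiny graph of operators: a source feeds a filter,
# which feeds a tumbling-window counter; a result is emitted whenever a
# full window of matching events has been observed.
def source(events):
    for e in events:
        yield e

def filter_op(stream, predicate):
    for e in stream:
        if predicate(e):
            yield e

def window_count(stream, size):
    window = []
    for e in stream:
        window.append(e)
        if len(window) == size:
            yield len(window), window
            window = []

events = [{"card": "A", "status": "denied"}] * 7 + [{"card": "B", "status": "ok"}]
pipeline = window_count(
    filter_op(source(events), lambda e: e["status"] == "denied"),
    size=5,
)
for count, _ in pipeline:
    print("alert: burst of", count, "denied authorizations")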
Centralized CEPs suffered from single-node bottlenecks and were quickly replaced by distributed CEPs, where the query was distributed across several nodes in order to decrease the per-node tuple processing time and increase the overall throughput. Nevertheless, each […]

[…] amount of computing resources to the actual workload and achieve cost-effectiveness. Indeed, any parallel system with a static number of processing nodes might experience under-provisioning (ie the overall computing power is not enough to handle the input load) or over-provisioning (ie the current load is lower than the system maximum throughput and some nodes are running below their capacity).

With those goals in mind, we are developing StreamCloud, a parallel-distributed […]

Link:
http://www.massif-project.eu

Please contact:
Vicenzo Gulisano, Ricardo Jimenez-Peris, Marta Patiño-Martinez, Claudio Soriente
Universidad Politécnica de Madrid, Spain
Tel: +34 913367452
E-mail: [email protected], [email protected], [email protected], [email protected]

Patrick Valduriez
Inria, LIRMM, France
Tel: +33 467149726
One of the main challenges facing next generation Cloud platform services is the need to
simultaneously achieve ease of programming, consistency, and high scalability. Big Data
applications have so far focused on batch processing. The next step for Big Data is to move to the
online world. This shift will raise the requirements for transactional guarantees. CumuloNimbo is a
new EC-funded project led by Universidad Politécnica de Madrid (UPM) that addresses these issues
via a highly scalable multi-tier transactional platform as a service (PaaS) that bridges the gap
between OLTP and Big Data applications.
CumuloNimbo aims at architecting and developing an ultra-scalable transactional Cloud platform as a service (PaaS). The current state of the art in transactional PaaS is to scale by resorting to sharding, or horizontal partitioning of data across database servers, sacrificing consistency and ease of programming. Sharding destroys transactional semantics since it is applied to only subsets of the overall data set. Additionally, it forces modifications to applications and/or requires rebuilding them from scratch, and in most cases also changing the business rules to adapt to the shortcomings of current technologies. Thus it becomes imperative to address these issues by providing an easily programmable platform with the same consistency levels as current service-oriented platforms.

The CumuloNimbo PaaS addresses these challenges by providing support for familiar programming interfaces such as Java Enterprise Edition (EE), SQL, as well as NoSQL data stores, ensuring seamless portability across a wide range of application domains. Simultaneously, the platform is designed to support Internet-scale Cloud services (hundreds of nodes providing service to millions of clients) in terms of both data processing and storage capacity. These challenges require careful consideration of architectural issues at multiple tiers, from the application and transactional model all the way to scalable communication and storage.

CumuloNimbo improves the scalability of transactional systems, enabling them to process update transaction rates in the range of one million update transactions per second in a fully transparent way. This transparency is both syntactic and semantic. Syntactic transparency means that existing applications will be able to run totally unmodified on top of CumuloNimbo and benefit automatically from the underlying ultra-scalability, elasticity and high availability. Semantic transparency means that applications will continue to work exactly as they did on centralized infrastructure, with exactly the same semantics and preserving the same coherence they had. The full transparency will remove one of the most important obstacles to the migration of applications to the cloud, ie the need to heavily modify, or even fully rebuild, them.

CumuloNimbo adopts a novel approach for providing SQL processing. Its main breakthrough lies in the scalability of transactional management, which is achieved by decomposing the different functions required for transactional processing and scaling each of them separately in a composable manner (refer to […]).
ConPaaS makes it easy to write scalable Cloud applications without worrying about the complexity
of the Cloud.
[…] the Cloud takes about 10 minutes. Increasing the processing capacity of the application requires two mouse clicks.

One of ConPaaS's Big Data use cases is a bioinformatics application that analyses large datasets across distributed computers. It uses large amounts of data from a Chip-Seq analysis, a type of genomic analysis methodology, and an application that can be parallelized in order to make use of multiple instances or processors to analyse data faster. The application stores its data in XtreemFS and makes extensive use of ConPaaS's MapReduce, TaskFarming, and Web hosting services. Users will use the application either directly through an API or through a web interface.

Although ConPaaS is already sufficiently mature to support challenging applications, we have many plans for further developments. In the near future, instead of manually choosing the number of resources each service should use, a user will be able to specify the performance she expects. ConPaaS will dimension each service such that the system meets its performance guarantees, while using the smallest possible number of computing resources. In the wiki example, for instance, one may want to request that user requests are processed on average in no more than 500 milliseconds.

We plan to allow users to upload complex applications in a single operation. Instead of starting and configuring multiple ConPaaS services one by one, a user will be able to upload a single manifest file describing the entire application organization. Thanks to this manifest, ConPaaS will be able to orchestrate the deployment and configuration of entire applications automatically.

Finally, we plan to provide an SDK for external users to implement their own services. For example, one could write a new service for demanding statistical analysis, for video streaming, or for scientific workflows. The platform will allow third-party developers to upload their own service as a plugin to the existing ConPaaS system.

In conclusion, ConPaaS is a runtime environment for Cloud applications. It takes care of the complexity of Cloud environments, letting application developers focus on what they do best: program great applications to satisfy their customers' needs.

Links:
http://www.conpaas.eu/
http://contrail-project.eu/
http://scalaris.googlecode.com/
http://xtreemfs.org/

Please contact:
Guillaume Pierre
Vrije Universiteit in Amsterdam, The Netherlands
E-mail: [email protected]

Thorsten Schuett
Zuse Institute in Berlin, Germany
E-mail: [email protected]
Can we have information in advance on organized crime movements? How can fraud and corruption be
fought? Can cybercrime threats be tackled in a safe way? The Crime and Corruption Observatory of the
European Project FuturICT will work at answering these questions. Starting from Big Data, it will face big
challenges, and will propose new ways to analyse and understand social phenomena.
The cost of crime in the United States is estimated to be more than $1 trillion annually. The cost of corruption ranges from 2 to 5 per cent of global Gross Domestic Product (GDP), ie from $800 billion to $2 trillion US dollars. The cost of the war on terrorism in the US since 9/11 is over $1 trillion.

These are just a few examples of the huge impact of crime on social, legal and economic systems. The situation in Europe is similarly dramatic. To face the problem in a new way, the "Crime and Corruption Observatory" is being set up in order to develop new technology to study and predict the evolution of phenomena that threaten the security of our society. The Observatory aims at building a data infrastructure to support crime prevention and reduce the costs of crime.

The starting point is Big Data: the huge amount of digital information now available enables the development of virtual techno-socio-economic models from existing and new information technology systems. The Observatory will collect huge data sets and run massive data mining and large-scale computer simulations of social dynamics related to criminal activities. It will be built using innovative technological instruments. This approach requires an internationally recognized, scientifically grounded strategy, able to embrace different national policies, since global threats such as crime and corruption require global answers.

The Observatory will be built as a European network, with a central node probably in Italy. A large number of important European universities and research institutions will cooperate to develop the necessary components. Scientists from many different fields - from cognitive and social science to criminology, from artificial intelligence to complexity science, from statistics to economics and psychology - will be involved, establishing a pool of varied expertise. The method will thus be strongly interdisciplinary: this is the only way to promote a real paradigm shift in our approach to policy and decision-making. On the one hand, the way policies are designed can be enhanced through innovative "what-if" analyses developed by complex models and simulations. On the other hand, the goal is to create new tools to support police and security agencies and services with more effective instruments for law enforcement.
Long-established technological platforms are no longer able to address the data and processing
requirements of the emerging data-intensive scientific paradigm. At the same time, modern
distributed computational platforms are not yet capable of addressing the global, elastic, and
networked needs of the scientific communities producing and exploiting huge quantities and
varieties of data. A novel approach, the Hybrid Data Infrastructure, integrates several technologies,
including Grid and Cloud, and promises to offer the necessary management and usage capabilities
required to implement the ‘Big Data’ enabled scientific paradigm.
A recent study, promoted by The Royal Society of London in cooperation with Elsevier, reviewed the changing patterns of science, highlighting that science is increasingly a global, multidisciplinary and networked effort performed by scientists that dynamically collaborate to achieve specific objectives. The same study also indicated that data-intensive science is gaining momentum in many domains. Large-scale datasets come in all forms and shapes, from huge international experiments to cross-laboratory, single laboratory, or even a multitude of individual observations.

The management and processing of such datasets is beyond the capacity of traditional technological approaches based on local, specialized data facilities. They require innovative solutions able to simultaneously address the needs imposed by multidisciplinary collaborations and by the new data-intensive pattern. These needs are characterized by the well known three V's: (i) Volume – data dimension in terms of bytes is huge, (ii) Velocity – data collection, processing and consumption is demanding in terms of speed, and (iii) Variety – data heterogeneity, in terms of data types and data sources requiring integration, is high.

Recent approaches, such as Grid and Cloud Computing, can only partially satisfy these needs. Grid Computing was initially conceived as a technological platform to overcome the limitations in volume and velocity of single laboratories by sharing and re-using computational and storage resources across laboratories. It offers a valid solution in specific scientific domains such as High Energy Physics.
However, Grid Computing does not handle 'variety' well. It supports a very limited set of data types and needs a common software middleware, dedicated hardware resources and a costly infrastructure management regulated by rigid policies and procedures.

Cloud Computing, instead, provides an elastic usage of resources that are maintained by third-party providers. It is based on the assumption that the management of hardware and middleware can be centralized, while the applications remain in the hands of the consumer. This considerably reduces application maintenance and operational costs. However, as it is a technology based on an agreement between the resource provider and the consumer, it is not suitable to manage the integration of resources deployed and maintained by diverse distributed organizations.

The Hybrid Data Infrastructure (HDI) is a new, more effective solution for managing the new types of scientific dataset. It assumes that several technologies, including Grid, private and public Cloud, can be integrated to provide an elastic access and usage of data and data-management capabilities.

The gCube software system, whose technological development has been coordinated by ISTI-CNR, implements the HDI approach. It was initially conceived to manage distributed computing infrastructures. It has evolved to operate large-scale HDIs, enabling a data-management-capability-delivery model in which computing, storage, data and software are made accessible by the infrastructure and are exploited by users using a thin client (namely a web browser), through dedicated on-demand Virtual Research Environments.

gCube operates a large federation of computational and storage resources by relying on a rich and open array of mediator services for interfacing with Grid (eg European Grid Infrastructure), commercial cloud (eg Microsoft Azure, Amazon EC2), and private cloud (eg OpenNebula) infrastructures. Relational databases, geospatial storage systems (eg GeoServer), NoSQL databases (eg Cassandra, MongoDB), and reliable distributed computing platforms (eg Hadoop) can all be exploited as infrastructural resources. This guarantees a technological solution suitable for the volume, velocity, and variety of the new science patterns.

gCube is much more than a software integration platform; it is also equipped with software frameworks for data management (access and storage, integration, curation, discovery, manipulation, mining, and visualization) and workflow definition and execution. These frameworks offer traditional data management facilities in an innovative way by taking advantage of the plethora of integrated and seamlessly accessible technologies. The supported data types cover a wide spectrum, ranging from tabular data – eg observations, statistics, records – to research products – eg diagrams, maps, species distribution models, enhanced publications. Datasets are associated with rich metadata and provenance information which facilitate effective re-use. These software frameworks can be configured to implement different policies, ranging from the enforcement of privacy via encryption and secure access control to the promotion of data sharing while guaranteeing provenance and attribution.

The infrastructure enabled by gCube is now exploited to serve scientists operating in different domains, such as biologists generating model-based large-scale predictions of natural occurrences of species and statisticians managing and integrating statistical data.

gCube is a collaborative effort of several research, academic and industrial centres including ISTI-CNR (IT), University of Athens (GR), University of Basel (CH), Engineering Ingegneria Informatica SpA (IT), University of Strathclyde (UK), and CERN (CH). Its development has been partially supported by the following European projects: DILIGENT (FP6-2003-IST-2), D4Science (FP7-INFRA-2007-1.2.2), D4Science-II (FP7-INFRA-2008-1.2.2), iMarine (FP7-INFRASTRUCTURES-2011-2), and EUBrazilOpenBio (FP7-ICT-2011-EU-Brazil).

Link:
gCube website: http://www.gcube-system.org

Please contact:
Pasquale Pagano, ISTI-CNR, Italy
E-mail: [email protected]
A fundamental and emerging need with big amounts of data is data exploration: when we are
searching for interesting patterns we often do not have a priori knowledge of exactly what we are
looking for. Database cracking enables such data exploration features by bringing, for the first time,
incremental and adaptive indexing abilities to modern database systems.
Figure 1: Vision of an ultra-dense assembly of chip stacks in a 3D computer with fluidic heat removal and power delivery.
the required electrical energy to the chips. The pins dedicated to power supply in a chip package easily outnumber the pins dedicated to signal I/O in high-performance microprocessors, and the number of power pins has been growing faster than the total number of pins for all processor types. This power problem is essentially a wiring problem, which is aggravated by the fact that wiring for signal I/O has a problem of its own. The energy needed to switch a 1 mm interconnect is more than an order of magnitude larger than the energy needed to switch one MOSFET; 25 years ago, the switching energies were roughly equal. This disparate evolution of communication and computation is even more pronounced for another vital performance metric, latency. The reason behind this trend is simply that transistors have become smaller while the chip size has not followed suit, leading to substantially longer total wire lengths. A proposed new solution to this two-fold wiring problem, again patterned after the mammalian brain, is to use the coolant fluid as the means of delivering energy to the chips. Probably it is easiest to think of this in terms of a kind of electrochemical, distributed battery where the cooling electrolyte is constantly "recharged" while the heat is removed. This is how energy is delivered to the mammalian brain, and with great effectiveness.

Using the example of the human brain as the best low-power density computer, we use technological analogs for sugar as fuel and the trans-membrane proton pumps which evolution has brought about as the chemical-to-electrical energy converters. These analogs are inorganic redox couples which have mainly been studied for grid-scale energy storage in the form of redox flow batteries. These artificial electrochemical systems offer superior power densities compared with their biological counterparts, but would still be pressed to meet the weighty challenge of satisfying the energy need of a fully loaded microprocessor. However, future high-performance computers which could be built around this fluidic power delivery scheme would be much less power-intensive due to their reduced communication cost.

Integration density and communication
The number of communication links between logic blocks in a microprocessor depends on the complexity of the interconnect architecture and on the number of logic blocks in a way that can be described by a power law known as Rent's Rule. Today, all microprocessors suffer from a break-down of Rent's Rule for high logic block counts because the number of interconnects does not scale smoothly beyond the chip edge. The limited number of package pins is one of the main reasons behind the performance limitation faced by modern computing systems known as the memory wall, in which it takes several hundred to a thousand CPU clock cycles to fetch data from main memory. A dense, three-dimensional physical arrangement of semiconductor chips would allow much shorter communication paths and a corresponding reduction in internal-delay roadblocks. Such a dense packaging of chips would be physically possible were it not for the problems of heat removal and energy delivery using today's architecture. Using the techniques just described for handling electrical energy delivery and heat removal, this dense packaging can be achieved and communication bottlenecks with associated delay avoided.
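Rent's Rule expresses the number of external terminals T of a block with g logic gates as a power law, T = t·g^p. The snippet below simply evaluates this law for illustrative parameter values (t = 2.5 and p = 0.6 are typical textbook figures, not values taken from the article):

# Terminal count predicted by Rent's Rule, T = t * g**p, for growing block
# sizes; the required pin count quickly outgrows what a package edge offers.
t, p = 2.5, 0.6                     # illustrative Rent coefficient and exponent
for g in (1e3, 1e6, 1e9):
    print(f"{g:>13,.0f} gates -> {t * g ** p:,.0f} terminals")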
The fluidic means of removing heat allows huge increases in packaging density, while the fluidic delivery of power with the same medium saves the space used by conventional hard-wired energy delivery. The savings in space allow much denser architectures, not to mention the sharply reduced energy needs and improved latency as communication paths are chopped up into much shorter fragments.

Conclusion
Using this new paradigm, the hope is that a petaflop supercomputer could eventually be built in the space taken by today's desktop PC. This is only a factor of eight smaller in performance than the above-mentioned fastest computer in the world today! The second reference below describes the paradigm in detail.

Links:
Gerhard Ingmar Meijer, Thomas Brunschwiler, Stephan Paredes, and Bruno Michel, "Using Waste Heat from Data Centres to Minimize Carbon Dioxide Emission", ERCIM News, No. 79, October 2009:
http://ercim-news.ercim.eu/en79/special/using-waste-heat-from-data-centres-to-minimize-carbon-dioxide-emission

P. Ruch, T. Brunschwiler, W. Escher, S. Paredes, and B. Michel, "Toward 5-Dimensional Scaling: How Density Improves Efficiency in Future Computers", IBM J. Res. Develop., Vol. 55 (5), 15:1-15:13, October 2011:
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=6044603

Please contact:
Bruno Michel, IBM Zurich Research Laboratory
E-mail: [email protected]
[…] high voltage is applied to a viscous solution on a sharp conducting tip, causing it to form a Taylor cone. As the electric field is increased, a fluid jet is extracted from the Taylor cone and accelerated toward a grounded collecting substrate. Nanofibers (having at least one dimension of 100 nanometers (nm) or less) exhibit special properties mainly due to their extremely high surface-to-weight ratio. Within this main nanoICT challenge, we are now working on simulating possible improvements of DNA-functionalized nanofibers to be used as smart nanobiosensors, through a combination of selective DNA chemical bonding with the nanofiber surface.

Please contact:
Mario D'Acunto, ISM-CNR, Italy
E-mail: [email protected]

Ovidio Salvetti or Antonio Benassi, ISTI-CNR, Italy
E-mail: [email protected], [email protected]

In order to extract data from such documents, for purposes such as information extraction, it is necessary to consider their internal representation structures as well as the spatial relationships between presented elements. Typical problems that must be addressed, especially in the case of PDF documents, are incurred by the separation between document structure and spatial layout. Layout is important as it often indicates the semantics of data items corresponding to complex structures that are conceptually difficult to query; eg in western languages, the meaning of a cell entry in a table is most easily defined by the leftmost cell of the same row and the topmost cell of the same column. Even when the internal encoding provides fine-grained annotation, the conceptual gap between the low-level representation of PODs and the semantics of the elements is extremely wide. This makes it difficult:
• for humans and applications attempting to manipulate POD content. For example, languages such as XPath 1.0 are currently not applicable to PDF documents;
• for machines attempting to learn extraction rules automatically. In fact, existing wrapper induction approaches infer the regularity of the structure of PODs only by analyzing their internal structure.

The effectiveness of manual and automated wrapper construction is thus limited by the need to analyze the internal encoding of PODs with increasing structural complexity. The intrinsic print/visual oriented nature of PDF encoding poses many issues in defining 'ad hoc' information extraction approaches.
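The row/column convention mentioned above (the meaning of a cell is given by the leftmost cell of its row and the topmost cell of its column) can be made concrete with a small sketch; the table content is invented purely for illustration:

# Turn a simple grid into (row header, column header, value) triples,
# following the leftmost-cell / topmost-cell convention described above.
table = [
    ["Country", "Population", "Capital"],
    ["Italy",   "59M",        "Rome"],
    ["Greece",  "10M",        "Athens"],
]

def cell_semantics(grid):
    col_headers = grid[0][1:]
    for row in grid[1:]:
        row_header, values = row[0], row[1:]
        for col_header, value in zip(col_headers, values):
            yield (row_header, col_header, value)

for triple in cell_semantics(table):
    print(triple)   # e.g. ('Italy', 'Population', '59M')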
A CNR spin-off and start-up company, Altilia srl, will implement the approaches defined at ICAR-CNR. Altilia will provide semantic content capture technologies for the content management area of the IT market.

Link: http://www.altiliagroup.com

Please contact: Massimo Ruffolo, ICAR-CNR, Italy
E-mail: [email protected]
Information and communications technologies promise to have a significant impact on safety at sea. This is particularly true for smaller ships and boats that rarely have active on-board safety systems. We are currently developing a system for computer-aided maritime search and rescue operations within the ICT-E3 Project (ICT excellence programme of Western Sicily funded by the Sicilian Regional Government).

A successful conclusion of a maritime search and rescue (SAR) operation depends on several factors. Some, like the weather and sea conditions, are uncontrollable; others can be optimized and made more effective by the employment of information and communications technologies. A system able to localize a vessel in trouble and to define the most efficient plan for search and rescue activities is of great importance for safety at sea.

The first step in building such a system is an accurate localization of vessels in trouble. Normally, the last known position (LKP) of a vessel is communicated to the rescuers by the people on board. This apparently simple action can become extremely difficult with adverse weather and sea conditions. To add to the difficulty, those on board may only know their positions approximately or, at worst, not at all. The vessel may not be equipped with a global positioning system or even a suitable compass to obtain at least some bearings. Thus the localization provided by the people on board a vessel in trouble may generate imprecise or useless information.

SAROPs are regulated at the international level by a set of standard procedures defined and described in the IAMSAR volume II.

Figure 1: Upper side: a simple scenario with two stations detecting an emergency call. Lower side: datum, search area and probability distribution obtained by the Monte Carlo simulation.
The use of a computer system implementing all the procedures involved in SAROPs can reduce errors and the time needed to define a SAR plan; it can also improve the probability of success of the rescue mission.

IAMSAR procedures were originally developed for manual calculation and they do not include support for a computer implementation. Hence, they avoid the adoption of complex and effective algorithms for defining the search action plan; such algorithms are a viable solution when the support of a computer is available. For the same reason, the IAMSAR manual only suggests two simple search paths for the navigation of SAR units; furthermore, the handling and the allocation of these units over the search area is quite rigid.

Starting from these considerations, we have developed an enhanced implementation of the IAMSAR procedures and have integrated it with the automatic localization system outlined above. An enhanced statistical processing method (the Monte Carlo simulation technique) has been introduced in this implementation. It determines the search area instead of the prefixed probability maps suggested by the IAMSAR manual. Crucial data, like wind force, state of the sea and water current models from SOAP messages, can also be integrated. The system includes a database in which all SAR units at the disposal of a Coast Guard Station are stored with details of their salient features. A friendly graphic user interface has been designed, where several graphic layers of information can be overlaid or hidden in visualization.

Other significant and innovative extensions of the IAMSAR procedures have been carried out during the development of the ICT-E3 project, and are now subject to a patent application. We wish to acknowledge the collaboration with the personnel of the Mazara del Vallo Coast Guard Station; their invaluable help and stimulating suggestions have been fundamental to the success of the activity.

Link: http://ceur-ws.org/Vol-621/paper21.pdf

Please contact:
Massimo Cossentino, Carmelo Lodato, Salvatore Lopes, Umberto Maniscalco
ICAR-CNR, Italy
E-mail: [email protected], [email protected], [email protected], [email protected]

Salvatore Aronica, IAMC-CNR, Italy
E-mail: [email protected]
Wikipedia as Text

by Máté Pataki, Miklós Vajna and Attila Csaba Marosi

When seeking information on the Web, Wikipedia is an essential source: its English version features nearly four million articles. Studies show that it is the most frequently plagiarized information source, so when KOPI, a new translational plagiarism checker, was created, it was necessary to find a way to add this vast source of information to the database. As it is impossible to download the whole database in an easy-to-handle format, like HTML or plain text, and all the available Mediawiki converters have some flaws, a Mediawiki XML dump to plain text converter has been written, which runs every time a new database dump appears on the site, with the text version being published for everybody to use.

[…] of XML dump, or the output was error prone and in many cases – when using, for example, a Mediawiki instance as a converter – the conversion was slow.

Consequently, a new converter needed to be written based on the following features:
• article boundaries have to be kept
• only the textual information is necessary
• infoboxes – as they are duplicated information – are filtered out
• comments, templates and math tags are dismissed
• other pieces of information, like tables, are converted to text.

Wikipedia dumps are published regularly and, as the aim is to be up-to-date, the system needed an algorithm which is able to process the whole English Wikipedia in a fast and reliable way. As the text is also subject to a couple of language processing steps to facilitate plagiarism search, these steps were included and the whole processing moved to the SZTAKI Desktop Grid service (operated by the Laboratory of Parallel and Distributed Systems), where more than 40,000 users have donated their free computational resources to scientific and social issues. Desktop grids are usually suited for parameter study or "bag-of-tasks" types of applications and have other minor requirements for the applications in exchange for the large amount of "free" computing resources made available through them. SZTAKI Desktop Grid, established in 2005, utilizes mostly volatile and non-dedicated resources (usually the donated computation time of desktop computers) to solve compute-intensive tasks from different scientific domains like mathematics and physics. Donors run a lightweight client in the background which downloads and executes tasks. The client also makes sure that only the excess resources are utilized, so there is no slowdown for the computer and the donor is not affected in any other way.

The Mediawiki converter was written in PHP to support easy development and compatibility with the existing codebase of the KOPI Portal. The main functionality could be implemented with less than 400 lines of code. The result was adapted to the requirements of the desktop grid with the help of GenWrapper, a framework specially created for porting existing scientific applications to desktop grids.
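As a rough illustration of the kind of filtering implied by the feature list above, a few of the rules can be expressed as regular-expression passes over the wikitext of a single article. The sketch is in Python for brevity (the actual converter is written in PHP, as noted above) and only approximates the real rules:

import re

def wikitext_to_text(wikitext):
    """Very rough wikitext cleaner following the feature list above."""
    text = re.sub(r"<!--.*?-->", "", wikitext, flags=re.S)        # drop comments
    text = re.sub(r"<math>.*?</math>", "", text, flags=re.S)      # drop math tags
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)                    # drop simple templates/infoboxes
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text) # keep link labels only
    text = re.sub(r"\{\|.*?\|\}",                                 # flatten tables crudely
                  lambda m: m.group(0).replace("|", " "), text, flags=re.S)
    text = re.sub(r"'{2,}", "", text)                             # drop bold/italic markup
    return re.sub(r"[ \t]+", " ", text).strip()

sample = "{{Infobox person|name=Ada}}'''Ada Lovelace''' was a [[mathematician|mathematician]].<!--stub-->"
print(wikitext_to_text(sample))   # -> Ada Lovelace was a mathematician.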
Systems and control science has played an important enabling role in all major technological evolutions, from the steam engine to rockets, high-performance aircraft, space ships, high-speed trains, 'green' cars, digital cameras, smart phones, modern production technology, medical equipment, and many others. It provides a large body of theory that enables the analysis of dynamic systems in order to better understand their behaviour, improve their design, and augment them with advanced information processing, leading to qualitative leaps in performance. Over the last fifty years the field of systems and control has seen huge advances, leveraging technology improvements in sensing and computation with breakthroughs in the underlying principles and mathematics. Motivated by this record of success, control technologists are addressing contemporary challenges as well, examples of which include:

• The automotive industry is focusing on active safety technologies, which may ultimately lead to partially autonomous driving, where humans become passengers of automated vehicles governed by automatic control algorithms for substantial parts of their trips, leading to improved safety, better fuel economy, and better utilization of the available infrastructure.

• Automatic control will help improve surgery. Robots are already used to support surgeons, minimizing the invasiveness of procedures and increasing the accuracy of operations. It is conceivable that semi-autonomous robots, remotely supervised by surgeons, will be capable of carrying out unprecedentedly complex operations.

• Automatic control will play a fundamental role in the energy landscape of the future, both in the efficient use of energy from various sources in industry and in buildings, and in the management of the generation, distribution and consumption of electrical energy with increased use of renewable and decentralized generation. The management of important schedulable loads (e.g. the recharging of electric cars) and of distributed sources (e.g. at customers' homes) calls for completely new large-scale control structures; a toy sketch of such load-shaping control follows this list.
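The position paper itself contains no code; purely as a toy illustration of the load-shaping idea mentioned in the last bullet above, the following Python snippet applies a simple proportional rule that steers aggregate EV charging into the valleys of a fictitious daily base-load profile, keeping total demand near a target level. The profile, the gain and all figures are invented for illustration.

```python
# Toy illustration (our own sketch, not from the position paper): a proportional
# rule that shifts aggregate EV charging into the valleys of a daily base load,
# keeping total demand near a target level. All numbers are invented.
import numpy as np

HOURS = 24
hours = np.arange(HOURS)
base_load = 60 + 25 * np.sin(2 * np.pi * hours / HOURS)   # MW, fictitious daily profile
energy_needed = 120.0                                      # MWh of EV charging to deliver
target = base_load.mean() + energy_needed / HOURS          # flat total demand we aim for
k_p = 0.8                                                  # proportional gain

charge = np.zeros(HOURS)
remaining = energy_needed
for t in range(HOURS):
    headroom = max(target - base_load[t], 0.0)   # spare capacity below the target
    charge[t] = min(k_p * headroom, remaining)   # charge in proportion to the headroom
    remaining -= charge[t]

uniform_peak = base_load.max() + energy_needed / HOURS     # naive "charge every hour" baseline
controlled_peak = (base_load + charge).max()
print(f"peak demand with uniform charging:    {uniform_peak:.1f} MW")
print(f"peak demand with controlled charging: {controlled_peak:.1f} MW, "
      f"{charge.sum():.1f} of {energy_needed:.0f} MWh delivered")
```

Real load-shaping controllers must of course handle forecasts, constraints and thousands of distributed actuators, which is precisely why the position paper calls for new large-scale control structures.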
The Systems and Control position paper outlines ten crucial areas in which control will make a strong impact in the next decade: ground and air smart traffic management; green electricity and the Smart Grid; improved energy efficiency in production systems; security in decentralized automation; mechatronics and control co-design and automation; analysis, control and adaptation of large infrastructures; autonomous systems; neurosciences; health care (from open medication to closed-loop control); and cellular and biomolecular research.

The paper then summarizes the main overarching challenges behind these applications: 1) system-wide coordination control of large-scale systems; 2) distributed networked control systems; 3) autonomy, cognition and control; 4) model-based systems engineering; and 5) human-machine interaction. This is followed by a discussion of new sectors where control will have a major role to play: control and health; control and social and economic phenomena and markets; and control and quantum engineering.

Finally, some operational recommendations are listed in order to provide the means to develop this extremely important scientific and technological discipline, whose critical role in ICT is essential for meeting European policy objectives in the future.

Link:
Systems and Control position paper, 30 pages:
http://www.hycon2.eu/extrafiles/CONTROL_position_paper_FP8_28_10_2011.pdf

Please contact:
Sebastian Engell
TU Dortmund, Germany
E-mail: [email protected]

Françoise Lamnabhi-Lagarrigue
CNRS, France
E-mail: [email protected]
First ‘NetWordS’ Workshop on Understanding the Architecture of the Mental Lexicon: Integration of Existing Approaches

by Claudia Marzi

A workshop, held 24-26 November 2011 at the CNR Research Campus in Pisa, was organised within the framework of “NetWordS”, the European Science Foundation Research Networking Programme on the Structure of Words in the Languages of Europe.

The ambitious goal of the workshop was to lay the foundations for an interdisciplinary European research agenda on the Mental Lexicon for the coming 10 years, with particular emphasis on three main challenges:
• Lexicon and rules in the grammar
• Word knowledge and word use
• Words and meanings

Leading scholars were invited to address three basic questions:
• In the speaker’s area of expertise, what are the most pressing open issues concerning the architecture of the Mental Lexicon?
• What and how can progress in other research areas contribute to addressing these issues?
• How can advances in our understanding of these issues contribute to progress in other areas?

The workshop brought together 37 participants (scholars, post-docs and PhD students) from a number of European countries. Eighteen speakers, from diverse scientific domains, presented cross-disciplinary approaches to the understanding of the architecture of the Mental Lexicon, reflecting the interdisciplinarity and synergy fostered by NetWordS. Contributions were devoted to understanding the ontogenesis of word competence, creative usage of words in daily conversation, and the architecture of the mental lexicon and its brain substrates. In all these research areas NetWordS intends to encourage multidisciplinarily informed integration and synthesis of existing approaches.

More than 40 research institutions from 16 European countries participate in NetWordS. Scientists involved in NetWordS are playing a leading role in the following areas:
• Theoretical issues in morphology and its interfaces
• Typological, variationist and historical aspects of word structure
• Cognitive issues in lexical architecture
• Short-term and long-term memory issues
• Neuro-physiological correlates of lexical organization and processing
• Psycho-linguistic evidence on lexical organization and processing
• Machine-learning approaches to morphology induction
• Psycho-computational models of the mental lexicon
• Distributional semantics

NetWordS promotes the development of interdisciplinary transnational scientific partnerships through short-visit grants, which are assigned yearly on the basis of open calls for short-term project proposals. Scholars taking part in interdisciplinary activities funded through NetWordS grants convene periodically to discuss and disseminate results. Short-visit grants are also geared towards planning focused collaborative work, with a view to catalysing credible large-scale proposals within more application-oriented European projects and other initiatives.

NetWordS organises yearly workshops on interdisciplinary issues in word structure, usually between late November and early December. A major conference is planned to take place in 2015.

NetWordS is pleased to announce the first Summer School on Interdisciplinary Approaches to Exploring the Mental Lexicon, 2nd-6th July 2012, Dubrovnik (Croatia). The school offers a broad and intensive range of interdisciplinary courses on methodological and topical issues related to the architecture of the mental lexicon, its levels of organisation, content and functioning, and a series of keynote lectures on recent advances in this area. The school targets doctoral students and junior researchers from fields as diverse as Cognition, Computer Science, Brain Sciences and Linguistics, with a strong motivation to advance their awareness of theoretical, typological, psycholinguistic, computational and neuro-physiological aspects of word structure and processing.

Link: http://www.networds-esf.eu

Please contact:
Claudia Marzi, ILC-CNR, Italy
E-mail: [email protected], [email protected], [email protected]


IEEE Winter School on Speech and Audio Processing organized and hosted by FORTH-ICS

by Eleni Orphanoudakis

The Signal Processing Laboratory of the Institute of Computer Science (ICS) of the Foundation for Research and Technology – Hellas (FORTH) organized the international Winter School on Speech and Audio Processing for Immersive Environments and Future Interfaces of the IEEE Signal Processing Society, which took place from 16-20 January at FORTH, in Heraklion, Crete, Greece.

The Winter School involved a series of lectures from distinguished researchers from all over the world and was focused on current research trends and applications in the areas of audio and speech signal processing. More specifically, current trends were presented in areas such as automatic speech recognition, speech synthesis, speech and audio modeling and coding, speech capture in noisy environments using one or more microphones, and 3D audio rendering using two or more loudspeakers. The Winter School also involved “hands-on” sessions, where students were able to work on practical aspects of speech and audio signal processing, as well as “demo” sessions where researchers from
Book Announcement: Francesco Flammini (Editor)
Czech Research Consortium for Informatics and Mathematics
FI MU, Botanicka 68a, CZ-602 00 Brno, Czech Republic
http://www.utia.cas.cz/CRCIM/home.html

Fonds National de la Recherche
6, rue Antoine de Saint-Exupéry, B.P. 1777, L-1017 Luxembourg-Kirchberg
http://www.fnr.lu/

FORTH – Foundation for Research and Technology – Hellas
Institute of Computer Science
P.O. Box 1385, GR-71110 Heraklion, Crete, Greece
http://www.ics.forth.gr/

Fraunhofer ICT Group
Friedrichstr. 60, 10117 Berlin, Germany
http://www.iuk.fraunhofer.de/

Polish Research Consortium for Informatics and Mathematics
Wydział Matematyki, Informatyki i Mechaniki, Uniwersytetu Warszawskiego, ul. Banacha 2, 02-097 Warszawa, Poland
http://www.plercim.pl/

Spanish Research Consortium for Informatics and Mathematics
D3301, Facultad de Informática, Universidad Politécnica de Madrid, Campus de Montegancedo s/n, 28660 Boadilla del Monte, Madrid, Spain
http://www.sparcim.es/

Swiss Association for Research in Information Technology
c/o Professor Daniel Thalmann, EPFL-VRlab, CH-1015 Lausanne, Switzerland
http://www.sarit.ch/

Magyar Tudományos Akadémia
Számítástechnikai és Automatizálási Kutató Intézet
P.O. Box 63, H-1518 Budapest, Hungary
http://www.sztaki.hu/
You can also subscribe to ERCIM News and order back copies by filling out the form on the ERCIM News website: http://ercim-news.ercim.eu/